Course Outline (Part 31)

A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern. RegEx can be used to check if a string contains the specified search pattern.

Python has a built-in package called re, which can be used to work with Regular Expressions.


1. re module

To use regular expressions, import the module:

import re

txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)

if x:
  print("YES! We have a match!")

2. Key Methods

The re module offers a set of functions for searching, splitting, and replacing text by pattern.

findall()

Returns a list containing all matches. If no matches are found, an empty list is returned.

import re
txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x) # ['ai', 'ai']
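The empty-list case mentioned above is easy to check directly. The following is a small sketch; the pattern "Portugal" is just an illustrative string that does not occur in the text.

```python
import re

txt = "The rain in Spain"

# "ai" occurs twice; "Portugal" does not occur at all.
hits = re.findall("ai", txt)
misses = re.findall("Portugal", txt)

print(hits)    # ['ai', 'ai']
print(misses)  # []

# An empty list is falsy, so the result can be tested directly.
if not misses:
    print("No match")
```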

search()

Searches the string for a match, and returns a Match object if there is a match. If there is more than one match, only the first occurrence is returned. If there is no match, None is returned.

import re
txt = "The rain in Spain"
x = re.search(r"\s", txt) # Search for the first white-space character
print("The first white-space character is located in position:", x.start())
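Because re.search() returns None when nothing matches, calling x.start() on a failed search raises an AttributeError. A minimal sketch of checking the result first, using illustrative patterns that are not part of the example above:

```python
import re

txt = "The rain in Spain"

found = re.search(r"ai", txt)          # first "ai" is inside "rain"
missing = re.search(r"Portugal", txt)  # not present in txt

print(found.group())  # ai
print(found.span())   # (5, 7)
print(missing)        # None

# Guard against None before using Match methods.
if missing is None:
    print("No match found")
```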

split()

Returns a list where the string has been split at each match.

import re
txt = "The rain in Spain"
x = re.split(r"\s", txt) # Split at each white-space character
print(x) # ['The', 'rain', 'in', 'Spain']
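re.split() also accepts a maxsplit argument that limits how many splits are made. A short sketch, passing it by keyword:

```python
import re

txt = "The rain in Spain"

# Split only at the first white-space character.
parts = re.split(r"\s", txt, maxsplit=1)
print(parts)  # ['The', 'rain in Spain']
```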

sub()

Replaces the matches with the text of your choice.

import re
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt) # Replace each white-space character with "9"
print(x) # The9rain9in9Spain
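The number of replacements can be limited with the count argument. A short sketch, again passing it by keyword:

```python
import re

txt = "The rain in Spain"

# Replace only the first two white-space characters.
limited = re.sub(r"\s", "9", txt, count=2)
print(limited)  # The9rain9in Spain
```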

3. Metacharacters

Metacharacters are characters with a special meaning:

Character  Description                                                                 Example
[]         A set of characters                                                         "[a-m]"
\          Signals a special sequence (can also be used to escape special characters)  "\d"
.          Any character (except newline character)                                    "he..o"
^          Starts with                                                                 "^hello"
$          Ends with                                                                   "planet$"
*          Zero or more occurrences                                                    "he.*o"
+          One or more occurrences                                                     "he.+o"
?          Zero or one occurrences                                                     "he.?o"
{}         Exactly the specified number of occurrences                                 "al{2}o"
|          Either or                                                                   "falls|stays"
()         Capture and group
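To see several of these metacharacters in action, here is a small sketch. The sample strings are chosen for illustration only.

```python
import re

txt = "hello planet"

# [] : every character between a and m
letters = re.findall(r"[a-m]", txt)
print(letters)  # ['h', 'e', 'l', 'l', 'l', 'a', 'e']

# .  : "he", any two characters, then "o"
print(re.findall(r"he..o", txt))  # ['hello']

# ^ and $ : anchor the pattern to the start or end of the string
starts = re.search(r"^hello", txt) is not None
ends = re.search(r"planet$", txt) is not None
print(starts, ends)  # True True

# ?  : at most one character between "he" and "o", so "hello" does not match
optional = re.findall(r"he.?o", txt)
print(optional)  # []

# {} : exactly two "l" characters in a row
print(re.findall(r"l{2}", txt))  # ['ll']

# |  : either alternative
either = re.findall(r"falls|stays", "The rain in Spain falls mainly in the plain!")
print(either)  # ['falls']
```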

4. Special Sequences

A special sequence is a \ followed by one of the characters in the list below:

  • \d: Matches a digit (0-9)
  • \w: Matches a word character (letters a-z and A-Z, digits 0-9, and the underscore _)
  • \s: Matches a white-space character (space, tab, newline, and similar)
  • \D, \W, \S: Match the opposite of the above: any non-digit, non-word character, or non-white-space character, respectively.
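The sequences above can be tried out with re.findall(). This is a minimal sketch on a made-up sample string:

```python
import re

txt = "Room 42, floor_3"

digits = re.findall(r"\d", txt)     # single digits
numbers = re.findall(r"\d+", txt)   # runs of digits
words = re.findall(r"\w+", txt)     # runs of word characters
spaces = re.findall(r"\s", txt)     # white-space characters
non_words = re.findall(r"\W", txt)  # everything that is not a word character

print(digits)     # ['4', '2', '3']
print(numbers)    # ['42', '3']
print(words)      # ['Room', '42', 'floor_3']
print(spaces)     # [' ', ' ']
print(non_words)  # [' ', ',', ' ']
```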

5. Grouping

You can extract parts of the match by grouping them with parentheses ().

import re
txt = "John Doe, age: 30"
match = re.search(r"(\w+) (\w+), age: (\d+)", txt)

if match:
    print("First Name:", match.group(1)) # John
    print("Last Name:", match.group(2))  # Doe
    print("Age:", match.group(3))        # 30

The r prefix before the string literal marks it as a raw string, so Python passes the backslashes through to the re module instead of interpreting them as escape characters.
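Two related details may be useful here: group(0) returns the whole match, and groups() returns all captured groups as a tuple. The sketch below also compares a raw string with its escaped equivalent.

```python
import re

txt = "John Doe, age: 30"
match = re.search(r"(\w+) (\w+), age: (\d+)", txt)

if match:
    print(match.group(0))  # John Doe, age: 30
    print(match.groups())  # ('John', 'Doe', '30')

# A raw string keeps the backslash as a literal character,
# so r"\d" is the same two-character string as "\\d".
same = r"\d" == "\\d"
print(same)                    # True
print(len(r"\n"), len("\n"))   # 2 1
```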
