Regular Expressions

When you need to search a string for a pattern, regular expressions are usually the preferred tool. A regular expression (a.k.a. regex or RE) lets you search textual data for sequences of characters that match a given pattern. Regex syntax generally follows one of two main conventions: POSIX or Perl. Python uses a syntax similar to Perl's, and you need to import the 're' module to use regular expressions.

Regex is typically used for the following functionality:

  • Search and replace a substring or the entire string with another string (replace/substitute functionality)
  • Get all the substrings within a string that match a pattern (search/find/findall functionality)

Let us now look at the basics of applying RE to both of these requirements.

Regular Expression For Replace (substitute)

Let us say you want to replace certain substrings with certain other substrings within a text. You can use the 'replace' method of the string, but you will soon see the power of regex with a few examples below.

Here is a string in which you replace a substring using regex:

import re
my_text = 'catorangecatdogapplekiwi'
print(re.sub('cat', 'CAT', my_text))

Output:
CATorangeCATdogapplekiwi

As you can see, this expression returns a string with every lower-case 'cat' replaced by upper-case 'CAT'. Of course, you could get exactly the same output with the 'replace' method of the string.

However, if you want to replace multiple such words, then regex becomes very convenient. Let us say you want to replace all 'cat' OR 'dog' substrings with a '-', then your regex would be:


import re
my_text = 'catorangecatdogapplekiwi'
print(re.sub('cat|dog', '-', my_text))

Output:
-orange--applekiwi

By using the OR ('|') operator between substrings, you can search for and replace any number of substrings with a single expression (a.k.a. pattern).
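When the words to replace live in a list, the alternation can be assembled programmatically. The following is a minimal sketch, assuming a hypothetical list named words that is not part of the original example; re.escape is used so that any word containing regex special characters is treated literally.

```python
import re

# Hypothetical list of words to replace (an assumption for illustration).
words = ['cat', 'dog', 'kiwi']

# Join the escaped words with the OR operator to form one pattern.
pattern = '|'.join(re.escape(word) for word in words)

my_text = 'catorangecatdogapplekiwi'
result = re.sub(pattern, '-', my_text)

print(pattern)  # cat|dog|kiwi
print(result)   # -orange--apple-
```

Building the pattern this way keeps the word list in one place, which may be easier to maintain than editing a long pattern string by hand.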

This flexibility and ease is just the beginning. Here is a rundown of some of the notations and expressions you can use in a pattern.

Other Regex Patterns

Special character: |
Example: re.sub('cat|dog', '-', 'catorangedog')
Output: -orange-
Comments: OR operator. All 'cat' OR 'dog' substrings are replaced with '-'. Do not put spaces around the '|'; a space inside a pattern is matched literally.

Special character: *
Example: re.sub('ca*', '-', 'cadcaaaatorangedog')
Output: -d-torangedog
Comments: The asterisk (*) matches zero or more occurrences of the preceding character; in this example, the character 'a'.

Special character: ?
Example: re.sub('ca?', '-', 'cdcaaaatorangedog')
Output: -d-aaatorangedog
Comments: The question mark matches zero or one occurrence of the preceding character; in this example, the character 'a'.

Special character: +
Example: re.sub('ca+', '-', 'cdcaaaatorangedog')
Output: cd-torangedog
Comments: The plus symbol matches one or more occurrences of the preceding character; in this example, the character 'a'.

Special character: .
Example: re.sub('ca.', '-', 'catcaporangedogca')
Output: --orangedogca
Comments: The dot (.) matches any single character (except a newline, by default) in the dot position. The final 'ca' is not replaced because no character follows it.

Special character: [ ]
Example: re.sub('ca[pt]', '-', 'catcancaporangedogca')
Output: -can-orangedogca
Comments: A square bracket expression matches a single character that is contained within the brackets; in this example 'cat' and 'cap' are replaced but not 'can'.

Special character: ^ inside brackets
Example: re.sub('ca[^pt]', '-', 'catcancaporangedogca')
Output: cat-caporangedogca
Comments: The caret (^), when used as the first character within brackets, excludes the characters listed in the brackets; in this example 'cat' and 'cap' are excluded from the replacement and 'can' is replaced.

Special character: ^
Example: re.sub('^cat', '-', 'catdog')
Output: -dog
Comments: A leading ^ matches the 'cat' substring only if it is at the start of the string.

Special character: ( )
Example: re.sub('cat(dog|fish)', '-', 'catdogcaticecatfish')
Output: -catice-
Comments: Parentheses are used to add a subpattern within a pattern.

Special character: $
Example: re.sub('cat$', '-', 'catdogcaticecat')
Output: catdogcatice-
Comments: $ looks for the pattern at the end of the string.

Special character: {m}
Example: re.sub('cat{2}', '-', 'catcattdogcap')
Output: cat-dogcap
Comments: Curly braces specify the number of times the preceding item should repeat; in this example the preceding item is the character 't', so the substring 'catt' is replaced.

Special character: {m}
Example: re.sub('(cat){2}', '-', 'catcattdogcap')
Output: -tdogcap
Comments: Parentheses make the repetition apply to the whole subpattern, so two consecutive 'cat' substrings are replaced.

Special character: {m,n}
Example: re.sub('(cat){3,4}', '-', 'catcatdogcatcatcatcat')
Output: catcatdog-
Comments: Replaces from 'm' to 'n' repetitions of the preceding RE. The match is greedy, so as many repetitions as possible (here 4 'cat' substrings) are replaced.

Special character: {m,n}?
Example: re.sub('(cat){3,4}?', '-', 'catcatdogcatcatcatcat')
Output: catcatdog-cat
Comments: Same as before, except that this is the non-greedy form, which matches as few repetitions as possible; here 3 'cat' substrings are replaced and the fourth is left in place.
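The difference between the greedy and non-greedy forms of {m,n} is easier to see by looking at the matched text itself rather than at the result of a replacement. This is a small sketch that reuses the string from the examples above; the variable names are only illustrative.

```python
import re

text = 'catcatdogcatcatcatcat'

# Greedy: {3,4} takes as many repetitions as it can (4 here).
greedy = re.search(r'(cat){3,4}', text)

# Non-greedy: {3,4}? stops at the fewest repetitions that still match (3 here).
lazy = re.search(r'(cat){3,4}?', text)

print(greedy.group(0), greedy.span())  # catcatcatcat (9, 21)
print(lazy.group(0), lazy.span())      # catcatcat (9, 18)
```

Both searches start matching at the same position; only the length of the match differs.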

Range Notations

You can specify a range of characters using '-'. For example, to include all characters from A to Z you could use [A-Z]. Here are some examples of using ranges:

Special character: [A-C]
Example: re.sub(r'[A-C]', "-", "AaCZD1B")
Output: -a-ZD1-
Comments: Replaces A, B and C with a '-'

Special character: [0-4]
Example: re.sub(r'[0-4]', "-", "Aa1CZD8C")
Output: Aa-CZD8C
Comments: Replaces 0, 1, 2, 3, 4 with a '-'

The ranges you will typically see are:

  • [A-Z] - all upper case letters
  • [a-z] - all lower case letters
  • [0-9] - all decimal digits
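Several ranges can be combined inside a single pair of brackets, and a leading ^ negates the combined set. The sketch below assumes a made-up sample string and is only meant to illustrate how the ranges compose.

```python
import re

# Hypothetical sample string, chosen only for illustration.
sample = 'Order-66_B! x9'

# Ranges can sit side by side inside one pair of brackets.
letters_and_digits = re.findall(r'[A-Za-z0-9]', sample)

# A leading ^ inside the brackets negates the whole set.
cleaned = re.sub(r'[^A-Za-z0-9]', '', sample)

# A '-' placed last in the brackets is a literal hyphen, not a range.
phone = re.findall(r'[0-9-]+', 'call 555-1234 now')

print(''.join(letters_and_digits))  # Order66Bx9
print(cleaned)                      # Order66Bx9
print(phone)                        # ['555-1234']
```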

Significance of using backslash '\'

You have already seen that '\' is used in Strings to escape the regular meaning of a character. In RE it has another meaning: followed by certain letters, it stands for a whole class of characters, as shown below.

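Pattern: \d
Example: re.sub(r'\d', "-", "1ABa3")
Output: -ABa-
Comments: Matches any digit. Similar to using [0-9]

Pattern: \D
Example: re.sub(r'\D', "-", "1ABa3")
Output: 1---3
Comments: Matches any non-digit. Similar to using [^0-9]

Pattern: \s
Example: re.sub(r'\s', "-", " 1AB a3\n")
Output: -1AB-a3-
Comments: Matches a single whitespace character (space, tab, newline and so on); for ASCII text it is equivalent to using [ \t\n\r\f\v]

Pattern: \w
Example: re.sub(r'\w', "-", "1 A$a3\n\r")
Output: - -$--\n\r
Comments: Matches any alphanumeric character or underscore; for ASCII text it is equivalent to using [A-Za-z0-9_]. The trailing newline and carriage return are left unchanged.

Pattern: \W
Example: re.sub(r'\W', "-", "1 A$a3\n\r")
Output: 1-A-a3--
Comments: Matches any character that is not alphanumeric or an underscore; for ASCII text it is equivalent to using [^A-Za-z0-9_]

The patterns in these examples are written with an 'r' prefix (raw strings), which the next section explains.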
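These shorthand classes are often combined with the repetition operators shown earlier. The following sketch uses a made-up log line (an assumption, not from the original text) to pull out runs of digits and word characters, and also illustrates that for ordinary Python strings \w follows Unicode rules unless the re.ASCII flag is given.

```python
import re

# Hypothetical log line used only for illustration.
line = 'user42 logged in at 09:15 from host_7'

numbers = re.findall(r'\d+', line)  # runs of one or more digits
words = re.findall(r'\w+', line)    # runs of word characters
fields = re.split(r'\s+', line)     # split on runs of whitespace

print(numbers)  # ['42', '09', '15', '7']
print(words)    # ['user42', 'logged', 'in', 'at', '09', '15', 'from', 'host_7']
print(fields)   # ['user42', 'logged', 'in', 'at', '09:15', 'from', 'host_7']

# For str patterns \w also matches non-ASCII letters; re.ASCII restricts it.
accented = '\u00e9'  # the letter e with an acute accent
print(re.findall(r'\w', accented))                  # one match
print(re.findall(r'\w', accented, flags=re.ASCII))  # []
```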

Using 'r' to represent 'raw' string with pattern

An RE is written as a String. RE gives the backslash '\' a special meaning, but a String also uses the backslash to represent particular characters, and this is where the confusion arises. To stop the String from interpreting a '\', you can escape it with another backslash, like '\\'. This quickly gets unwieldy, so to avoid the confusion you can put the letter 'r' before the opening quote of the RE to make it a raw string, in which the String rules for '\' are not applied and only the RE rules apply.

Let us take an example: \b represents a backspace in a String, whereas in RE \b represents a word boundary (it matches the empty string at the start or end of a word). If you write a regex as '\b', the String rules apply and the RE receives a backspace character. If you instead precede the string with 'r', or escape the '\' with another '\', the RE rules apply.


import re
print('the word is \bpython') # \b is backspace in string. check the output
s1 = re.sub('\bpython', "P", "it is easy to learn python") 
print(s1) # python string not replaced
s2 = re.sub('\\bpython', "P", "it is easy to learn python")
print(s2) # python string is replaced
s3 = re.sub(r'\bpython', 'P', "it is easy to learn python")
print(s3) # python string is replaced

Output:
the word ispython
it is easy to learn python
it is easy to learn P
it is easy to learn P
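To see why the unprefixed '\b' behaves differently, it can help to inspect the strings themselves before they reach the re module. This is a minimal sketch; the sentence passed to re.sub at the end is a made-up example.

```python
import re

plain = '\b'  # one character: backspace (ASCII code 8)
raw = r'\b'   # two characters: a backslash followed by 'b'

print(len(plain), len(raw))  # 1 2
print(plain == '\x08')       # True
print(raw == '\\b')          # True

# With a raw string, \b reaches the regex engine as a word boundary,
# so 'cat' inside 'concat' is left alone.
bounded = re.sub(r'\bcat\b', '-', 'cat concat cat.')
print(bounded)  # - concat -.
```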

Compile to get Pattern object

While you can use the pattern directly in all the relevant functions of the re module, you can also compile the pattern to get a Pattern object. If you use the same pattern in many places in your program, keeping a compiled Pattern object can make the program somewhat more efficient and keeps the pattern text in one place. The re module also caches recently used patterns, so the speed gain is often small.

Here is the same example shown above but using a compile step.


import re
pattern = re.compile('cat|dog')
my_text = 'catorangecatdogapplekiwi'
new_str = re.sub(pattern, '-', my_text)
print(new_str)
words = re.findall(pattern, my_text)
print(words)

Output:
-orange--applekiwi
['cat', 'cat', 'dog']
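A compiled Pattern object also offers sub and findall as methods of its own, so the pattern does not need to be passed to the module-level functions. The sketch below repeats the example in that style and also shows the optional count argument of sub; the variable names are illustrative.

```python
import re

pattern = re.compile(r'cat|dog')
my_text = 'catorangecatdogapplekiwi'

replaced = pattern.sub('-', my_text)                # replace every match
found = pattern.findall(my_text)                    # list every match
replaced_once = pattern.sub('-', my_text, count=1)  # replace only the first match

print(replaced)       # -orange--applekiwi
print(found)          # ['cat', 'cat', 'dog']
print(replaced_once)  # -orangecatdogapplekiwi
```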

Advantage of Groups

You can enclose an RE pattern within a pair of parentheses to form a group, similar to algebraic expressions. Once you create a group, you can refer back to the string it matched later in the same pattern.

Let us take an example in which you need to get the content between the start and end tags of a given element. In the example below, you are required to pull out the text between the start and end tags of elements without children: the tag1 element and the tag3 element. Note that tag2 has tag3 as a child, and hence you need to ignore this element.


import re
my_str = '<tag1>abc</tag1><tag2>xyz<tag3>123</tag3></tag2>'
# Group 1 captures the tag name, group 2 the text; \1 must repeat the same tag name
pattern = re.compile(r'<([a-z0-9]+)>([^<]*)</\1>')
print(re.findall(pattern, my_str))

Output:
[('tag1', 'abc'), ('tag3', '123')]

The given regex has two groups, that is, two sets of patterns between parentheses. Group numbers start at 1, so \1 references the string captured by the first group. Placing '</' before this captured tag name and '>' after it gives the ending tag of the same name.

The second group captures the text between the starting and ending tags. When the pattern contains more than one group, findall returns each match as a tuple of the strings captured by the groups, and thus each result contains the tag name and the text between the tags.
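Captured groups can also be referenced from the replacement string of re.sub, and groups can be given names. The following sketch assumes a made-up sentence containing dates in year-month-day form; the date format and the group names year and month are assumptions for illustration only.

```python
import re

# Hypothetical text; the date format is an assumption for illustration.
date_text = 'released on 2023-07-15 and patched on 2023-08-01'

# Three numbered groups: year, month, day.
date_pattern = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

# In the replacement string, \1, \2 and \3 refer to the captured groups.
reordered = date_pattern.sub(r'\3/\2/\1', date_text)
print(reordered)  # released on 15/07/2023 and patched on 01/08/2023

# Named groups use (?P<name>...) and are read back by name.
first = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})', date_text)
print(first.group('year'), first.group('month'))  # 2023 07
```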

Match object

Certain functions of the re module return a Match object. The functions below return a Match object, or, in the case of re.finditer, an iterator that yields Match objects:

  • re.match
  • re.search
  • re.finditer
  • re.fullmatch

You can call the same methods on a compiled Pattern object instead of 're', and the result is the same.

If the RE matches, a Match object is returned; otherwise None is returned. With a Match object, you can retrieve the groups (the string and substrings that matched), the position of the match and other information. Refer to the documentation: https://docs.python.org/3/library/re.html#match-objects

To search for a character pattern, you can use the search method in the re module. This method scans through the string looking for the first location where the RE pattern produces a match. If one is found, it returns a Match object; if no position in the string matches the pattern, it returns None.

import re
match = re.search('(cat){3,4}?', 'catcatdogcatcatcatcat')
print(match.start())  # index where the first match begins

Output:
9
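Beyond start(), a Match object exposes the matched text and its span, and the different functions listed above differ in where they are willing to match. This is a short sketch using the same string; it checks for None before using the result, since search returns None when nothing matches.

```python
import re

text = 'catcatdogcatcatcatcat'
match = re.search(r'(cat){3,4}?', text)

if match is not None:
    print(match.group(0))  # whole match: catcatcat
    print(match.group(1))  # last repetition captured by group 1: cat
    print(match.span())    # (9, 18)

# re.match only matches at the start of the string; re.search scans ahead.
at_start = re.match(r'dog', text)
anywhere = re.search(r'dog', text)
print(at_start)          # None
print(anywhere.start())  # 6

# finditer yields one Match object per non-overlapping match.
positions = [m.start() for m in re.finditer(r'cat', text)]
print(positions)  # [0, 3, 9, 12, 15, 18]
```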
