Regular expressions
A regular expression, or regex, is a small pattern that describes a set of strings. Instead of saying “the characters at positions 20 through 31,” you say “three digits, a dash, two digits, a dash, four digits,” and the computer finds every place that matches, wherever it sits. Python’s built-in re module speaks this language, and so does pandas.
A first pattern¶
Suppose we want every Social Security number in a sentence. An SSN looks like three digits, two digits, four digits, joined by dashes. We write that as a pattern and hand it to re.findall, which returns every match it finds.
import re
text = "My SSN is 123-45-6789, or maybe it is 321-45-6789."
pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
re.findall(pattern, text)['123-45-6789', '321-45-6789']Read the pattern piece by piece. [0-9] means any one digit. The {3} after it means exactly three of those. The dashes are literal characters that have to appear as written. Put together, the pattern matches anything shaped like an SSN, and findall pulls out both of them.
A few building blocks cover most patterns you will write:
| Pattern | Meaning |
|---|---|
\d | any digit (same as [0-9]) |
\w | any word character: letter, digit, or underscore |
\s | any whitespace |
. | any character at all |
{3} | exactly three of the thing before it |
+ | one or more |
* | zero or more |
[...] | any one character from the set in brackets |
(...) | a capture group (more on these below) |
Try it yourself¶
Patterns are far easier to learn by experiment than by reading. The explorer below is a small version of regex101.com that runs right here. Type a pattern in the top box and the sample text lights up wherever it matches. Capture groups, the parts you wrap in parentheses, each get their own color and show up in the table underneath.
The sample text is a few log lines, the same messy data from the previous section. The starting pattern pulls the date and time out of the square brackets.
from utils_regex import run_regex_demo
run_regex_demo()Some patterns worth trying in the explorer:
\d+to highlight every run of digits.\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}to grab the IP address at the start of each line."(GET|POST) (\S+)"to capture the request method and path. Turn onIGNORECASEand see what changes.
Capture groups¶
Matching is useful, but often we want the pieces inside a match, not just the whole thing. Wrapping part of a pattern in parentheses makes it a capture group, and re.findall then returns the captured pieces.
text = "I will meet you at 08:30:00 pm tomorrow"
pattern = r"(\d\d):(\d\d):(\d\d)"
matches = re.findall(pattern, text)
matches[('08', '30', '00')]Each match comes back as a tuple of its groups, one entry per pair of parentheses. We can unpack them straight into named variables.
hour, minute, second = matches[0]
print("hour:", hour)
print("minute:", minute)
print("second:", second)hour: 08
minute: 30
second: 00