String methods in pandas

Before we reach for regular expressions, it is worth seeing how far ordinary string methods take us. pandas gives every text column a .str accessor that applies Python’s string operations to the whole column at once. Most everyday cleaning, lowercasing, trimming spaces, swapping one substring for another, needs nothing more than this.

import pandas as pd

A join that should work, but doesn’t¶

Here are two small tables. One lists counties and their state; the other lists counties and their population. They share a County column, so joining them should be easy.

states = pd.read_csv("data/county_and_state.csv")
populations = pd.read_csv("data/county_and_population.csv")

display(states)
display(populations)

Loading...

Look closely at the County values. One table writes De Witt County, the other writes DeWitt. One writes Lewis and Clark County, the other Lewis & Clark. To us these are obviously the same places. To pandas they are different strings, so the merge finds nothing to match on and returns an empty table.

states.merge(populations, on="County")

Loading...

Canonicalizing with the `.str` accessor¶

The cure is to put both columns into one standard form, then join on that. We lowercase everything, drop spaces and punctuation, and remove the filler words county and parish. Each step is a .str method, and because they each return a Series we can chain them together.

def canonicalize_county(county):
    return (
        county.str.lower()
        .str.replace(" ", "")
        .str.replace("&", "and")
        .str.replace(".", "", regex=False)
        .str.replace("county", "")
        .str.replace("parish", "")
    )

canonicalize_county(states["County"])

0              dewitt
1         lacquiparle
2       lewisandclark
3    stjohnthebaptist
Name: County, dtype: object

Both tables now agree on a single spelling, so we add the canonical form as a new column and merge on it.

states["canonical"] = canonicalize_county(states["County"])
populations["canonical"] = canonicalize_county(populations["County"])

states.merge(populations, on="canonical")

Loading...

When position-based slicing breaks down¶

Canonicalization handled messy categories. A harder job is pulling a field out of a longer string. Here are a few lines from a web server log.

with open("data/log.txt") as f:
    log_lines = f.readlines()

log_lines

['169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"\n',
 '193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] "GET /stat141/Notes/dim.html HTTP/1.0" 404 302 "http://eeyore.ucdavis.edu/stat141/Notes/session.html"\n',
 '169.237.46.240 - "" [3/Feb/2006:10:18:37 -0800] "GET /stat141/homework/Solutions/hw1Sol.pdf HTTP/1.1"\n']

Say we want the timestamp inside the square brackets. A first instinct is to slice at fixed character positions. We can find the right positions for the first line by trial and error.

log_lines[0][20:31]

'26/Jan/2014'

The same positions on the next line give garbage, because the text before the bracket is a different length.

log_lines[1][20:31]

'/Feb/2005:1'

Fixed offsets only work when fields sit in fixed places, which real text almost never guarantees. A more robust idea is to split on the brackets themselves, wherever they happen to be.

first = log_lines[0]
inside_brackets = first.split("[")[1].split("]")[0]
inside_brackets

'26/Jan/2014:10:47:58 -0800'

That works, and you will see split-based cleaning in plenty of real pipelines. But splitting once per delimiter gets clumsy fast: to break 26/Jan/2014:10:47:58 -0800 into day, month, year, hour, minute, second, and timezone you would split on /, then :, then a space, juggling the pieces at each step. There is a tool built for exactly this kind of pattern, and it is where we go next.

A join that should work, but doesn’t¶

Canonicalizing with the .str accessor¶

When position-based slicing breaks down¶

Canonicalizing with the `.str` accessor¶