Summary

Text wrangling is the work of turning messy strings into something you can analyze. This chapter built up two toolkits for it.

String methods came first. The .str accessor applies Python’s string operations to a whole column, and chaining a few of them (lowercase, strip, replace) is enough to canonicalize most categorical text. That is what let two tables with mismatched county spellings finally join.
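
As a reminder of what that chain looks like, here is a minimal sketch. The two tables, their values, and the `canonicalize` helper are made up for illustration and are not the chapter's actual data; the exact replacements you need depend on how your own spellings differ.

```python
import pandas as pd

# Hypothetical tables whose county names disagree in case, spacing, and wording.
population = pd.DataFrame({
    "County": ["  Lewis & Clark County", "ALAMEDA COUNTY "],
    "Population": [1, 2],  # placeholder values, not real figures
})
states = pd.DataFrame({
    "County": ["lewis and clark", "Alameda"],
    "State": ["MT", "CA"],
})

def canonicalize(names: pd.Series) -> pd.Series:
    """Lowercase, strip whitespace, and normalize a couple of known variants."""
    return (
        names.str.lower()
        .str.strip()
        .str.replace(" county", "", regex=False)  # drop the suffix
        .str.replace("&", "and", regex=False)     # spell out the ampersand
    )

# Build the same canonical key on both sides, then join on it.
population["key"] = canonicalize(population["County"])
states["key"] = canonicalize(states["County"])
merged = population.merge(states, on="key", suffixes=("_pop", "_state"))
print(merged[["key", "Population", "State"]])
```

Keeping the canonical form in a separate key column leaves the original spellings available for checking which rows matched.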

Regular expressions came second, for the jobs that fixed substrings cannot reach. A regex describes a pattern rather than an exact string, so it can find a date wherever it sits in a line, pull apart a timestamp into its pieces, or recode thousands of free-text descriptions by keyword. In pandas, .str.extract, .str.findall, .str.replace, and .str.contains all take a regex and run it across an entire column.
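
The four methods can be seen side by side in a short sketch. The log lines and patterns below are invented for illustration; only the pandas method names come from the chapter.

```python
import pandas as pd

# Hypothetical free-text records; the second has no date at all.
logs = pd.Series([
    "inspected 2023-01-15 09:30:00, vermin found",
    "no issues noted",
    "re-inspected on 2024-03-02 14:05:10; unclean surfaces, vermin",
])

# .str.extract: named capture groups become columns; rows without a match get NaN.
date_pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
parts = logs.str.extract(date_pattern)

# .str.contains: a boolean flag per row, useful for recoding by keyword.
has_vermin = logs.str.contains(r"\bvermin\b", regex=True)

# .str.findall: every match in each row, returned as a list.
all_numbers = logs.str.findall(r"\d+")

# .str.replace: substitute every match; regex=True must be passed explicitly.
masked = logs.str.replace(r"\d{2}:\d{2}:\d{2}", "<time>", regex=True)

print(parts)
print(has_vermin)
print(masked)
```

Note that `.str.extract` keeps only the first match in each row, while `.str.findall` returns all of them.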

Which tool when

Where this leads

Clean text feeds everything downstream. The visualization chapters assume your categories are already canonicalized, and the modeling chapters assume your features are already extracted. The restaurant case study showed the whole arc in miniature: wrangle the text, count the keywords, then let the analysis reveal how violations track with inspection scores.