download notebook
view notebook w/ solutions
Advanced string search
files needed = ('callcenterdatacurrent.csv')
We have seen some basic methods for searching text using str.contain()
or str.find()
and passing strings to match. That works great when we know what we are searching for and the text we are searching has some order to it. In this notebook, we introduce regular expressions. Regular expressions (regex) use a special syntax that allows for powerful searching. For example, a regular expression could find the first four instances of numbers within a string (regardless of their values) or search for two capitalized letters followed by a comma.
Regular expressions are not unique to pandas or python. They were developed as part of the theoretical basis for computing languages and are quite interesting in their own right. They can also be a bit confusing.
Our goal is to learn a few regular expressions that might be helpful in the context of processing text data for analysis. We will only scratch the surface here. If you are interested in learning more, the internet is full of great tutorials. regex101.com is a fantastic sandbox for practicing.
Milwaukee call center data
The City of Milwaukee open data portal includes data sets on many aspects of the city. Today we will work with some of the call center data. These data are incidents handled by the city's call center and cover things like missed garbage pick up, litter complaints, and burned out street lights.
We will extract the calls regarding fire damage and compute some statistics about the cost of fires in the city. The estimates of fire damage are embedded in the text description and will need to be extracted.
I've posted the data that I downloaded from the portal so that we are all working from the same dataset but you could download the freshest data, too. Your results would then be a bit different than mine.
import pandas as pd
import matplotlib.pyplot as plt
# We can specify that multiple columns are dates
calls = pd.read_csv('callcenterdatacurrent.csv', parse_dates=[0, 3])
calls.sample(5)
Those column names could use some work.
calls.columns = ['created', 'address', 'title', 'closed', 'desc']
What time span does the data cover? Since we made the 'created' and 'closed' columns datetime, we can use .min()
and .max()
to find out.
print('first call: {}'.format(calls['created'].min()))
print('last call: {}'.format(calls['created'].max()))
The fire damage calls have the title 'Fire Damage.'
fire = calls[calls['title']=='Fire Damage'].copy()
fire.shape
fire.head(20)
We can now see that there are two damage estimates reports. The first is the damage to the building. The second is the damage to the contents of the building. Some things that might cause difficulties:
- 'bldg' but also 'bldg.'
- There are not always 'cont loss' entries
- There may be spelling mistakes
- There maybe inconsistent use of dollar signs or commas
Ugh. There is a reason why people who can do this stuff are paid well.
Regular expressions.
What follows is a simple overview of regex and some examples to help us understand how they work. You should read through the re
package documentation HOWTO (up to "The Backslash Plague") before proceeding.
We split the set of characters into two parts: metacharacters and everything else. Metacharacters have special meanings. The metacharacters are
. ^ $ * + ? { } [ ] \ | ( )
We will only cover a few of the metacharacters in this notebook.
If we were working in python (and not pandas) we would use the re package. We will use pandas methods such as .findall()
and .extract()
with regular expressions—pandas already knows how to interpret regular expressions. Many of the string methods in pandas accept either a string or a regex.
Let's take a look at the first text element.
# Compare this to the jupyter nb output table above. The nb had problems with formatting.
fire.loc[71, 'desc']
Passing regex to pandas
We use the raw string notation which means we do not need to worry about things like \ being interpreted as a special character. Raw strings take the form r'some text', so the only difference between writing a raw string and a regular string is the 'r' in front.
We put our expression(s) in parentheses.
For example, if I want to find the word 'loss' I write
r'(loss)'
The r
and apostrophes are for the raw string, the ()
encapsulate the regex, and the regex itself is loss
.
Matching characters
Extract occurrences of 'loss'.
.findall()
will do this for the entire column. I am just going to print out a few.
fire['desc'].str.findall(r'(loss)').head(7)
We are given a Series with the extracted text in a list. This is a Series where each element of a column is a list. Returning a list makes is easy to have rows with different numbers of occurrences.
What happens if the extract fails?
fire['desc'].str.findall(r'(Bucky)').head(1)
Character classes
We use []
to specify a set or class of characters to match. Suppose I wanted a a, b,
or c
.
# 'bldg loss $14,540.00\ncont loss $29,990.00\ncause-under investigation\nremarks-small interior fire'
fire['desc'].str.findall(r'([abc])').head(1)
We can also specify ranges of letter or numbers.
fire['desc'].str.findall(r'([a-c])').head(1)
There are some special pre-defined sequences that are used often.
\w
any alphanumeric character: equivalent to[a-zA-Z0-9_]
\d
any numeric digit: equivalent to[0-9_]
the complements are
* \W
anything but alphanumeric character: equivalent to [^a-zA-Z0-9_]
* \D
anything but numeric digits: equivalent to [^0-9_]
The ^
character is the complement operator.
# 'bldg loss $14,540.00\ncont loss $29,990.00\ncause-under investigation\nremarks-small interior fire'
fire['desc'].str.findall(r'(\w)').head(1)
fire['desc'].str.findall(r'(\d)').head(1)
Repeating characters
- The
*
means look for the preceding character(s) zero or more times - The
+
means look for the preceding character(s) one or more times
The *
means that those characters may show up many times or not at all. The +
means that the characters have to show up at least once.
# 'bldg loss $14,540.00\ncont loss $29,990.00\ncause-under investigation\nremarks-small interior fire'
print(fire['desc'].str.findall(r'(\d*)').head(1))
print(fire['desc'].str.findall(r'(\d+)').head(1))
Notice the difference between *
and +
.
-
The
*
expression returns many blank "hits." These are the characters leading up to the first1
. Each one of those characters is not a number, but that is still considered a "hit" because zero decimal characters is considered a match. -
The
+
expression returns only the clumps of numbers, since characters with zero numbers are excluded. Each clump stops when it hits a nonnumeral: the,
, the.
, and the\
.
Fire damage
We have just scratched the surface of regex and you might still feel a bit uncertain about how they work. That's okay—so does everyone else. Try searching for 'regex' on stackexchange and see how many questions pop up. I find sites like https://regex101.com/ very useful. You can enter a regex and a test string and it will help you see all the matches.
Let's use regex to extract the damages from our data.
fire['desc'].str.findall(r'(\d+,\d+.\d+)').head(5)
fire.loc[71,'desc']
fire.loc[539,'desc']
I'm looking for substrings of the form y,xxx.zz
BUT,
- the
y
variable may have 1, 2, or 3 digits or not none at all - the
x
variable may have 1, 2, or 3 digits
My regex is
'\d+,\d+.\d+'
This is looking for at least one number and then all other numbers up to the ,
then more numbers up to the .
and then the rest of the numbers. The +
ensures that I am only getting numbers but is general enough to capture both 14
and 193
.
Note that this approach could be troublesome if there are other numbers in the text, such as a phone number. We need to always check our results to see if we need to further clean the data. Each dataset is unhappy in its own way.
Did we get everything? Nope. Look at index 539. There is only one entry. The contents loss entry does not have a comma in it so it is not being found. Let's fix this.
Optional characters
We use the ?
character to say that the preceding character or class is optional. Let's make the comma optional.
fire['desc'].str.findall(r'(\d+,?\d+.\d+)').head(8)
We now have caught the contents damage for index 539.
fire.loc[2684,'desc']
Practice
A problem remains: In index 2684, for example, we have only one entry. Here is the entry:
'bldg. loss $6,503.00\ncause-under investigation'
There is no 'contents damage' reported. How would we know from looking at our results that the 6,503.00 is the building damage entry? Perhaps there was not a building damage entry but only a contents damage entry?
- Modify the
findall()
regex statement we have been using to return the text immediately before the number. This way, we will know which kind of damage is being reported. The output for 2684 would be:
'bldg. loss $6,503.00'
This way, we know that we are finding the building damage report.
Try something like:
.str.findall(r'(bldg. loss \$?\d+,?\d+.\d+)')
Key thing to note: $
is a regular expression 'meta-character', so to treat it like a part of the string (as we want to do here), we have to escape it first using a backslash.
How many rows return an empty list?
- Some of the row returning no values are errors. The problem is that some do not have
bldg.
, but havebldg
.
Use the ?
to make the .
optional and rerun your search.
I found 17 observations that do not have building damage reported.
- Use a statement similar to that in part 2 to extract strings of the form
'cont loss $29,990.00'
You probably need to deal with inconsistent punctuation and notation. Try manually checking the entries that do not return a value. I found 27 observations that do not have contents damage reported.
- Now repeat steps 2 and 3 but save the output into new columns of the
fire
DataFrame. Name them bld_loss and cnt_loss.
Rather than use .findall()
, use .extract()
. .extract()
will return the first occurrence of the regex. Since we only have at most one occurrence per observation, this is good. The benefit of .extract()
over .findall()
is that .extract()
returns a string rather than a list.
Something like:
python
fire['bld_loss'] = fire['desc'].str.extract(r'(bldg.? loss \$?\d+,?\d+.\d+)')
-
From 'bld_loss' and 'cnt_loss' extract only the numeric parts so that we have data we can work with. I used
.extract()
again with the appropriate regex. -
Convert your extracted values to numeric types. You will need to deal with the commas.
-
Compute some summary statistics. What was the average damage to buildings and contents? What are the max and min?
If you finish early: Create some histograms visualize the data.