download notebook
view notebook w/o solutions

Advanced string search

files needed = ('callcenterdatacurrent.csv')

We have seen some basic methods for searching text using str.contain() or str.find() and passing strings to match. That works great when we know what we are searching for and the text we are searching has some order to it. In this notebook, we introduce regular expressions. Regular expressions (regex) use a special syntax that allows for powerful searching. For example, a regular expression could find the first four instances of numbers within a string (regardless of their values) or search for two capitalized letters followed by a comma.

Regular expressions are not unique to pandas or python. They were developed as part of the theoretical basis for computing languages and are quite interesting in their own right. They can also be a bit confusing.

Our goal is to learn a few regular expressions that might be helpful in the context of processing text data for analysis. We will only scratch the surface here. If you are interested in learning more, the internet is full of great tutorials. regex101.com is a fantastic sandbox for practicing.

Milwaukee call center data

The City of Milwaukee open data portal includes data sets on many aspects of the city. Today we will work with some of the call center data. These data are incidents handled by the city's call center and cover things like missed garbage pick up, litter complaints, and burned out street lights.

We will extract the calls regarding fire damage and compute some statistics about the cost of fires in the city. The estimates of fire damage are embedded in the text description and will need to be extracted.

I've posted the data that I downloaded from the portal so that we are all working from the same dataset but you could download the freshest data, too. Your results would then be a bit different than mine.

import pandas as pd
import matplotlib.pyplot as plt
# Pandas had some trouble with the last column, so I am specifying 
# that is should be treated as text.
calls = pd.read_csv('callcenterdatacurrent.csv', dtype={4:'str'}, parse_dates=[0, 3])
calls.sample(5)
CREATIONDATE OBJECTDESC TITLE CLOSEDDATETIME CASECLOSUREREASONDESCRIPTION
128627 2020-01-16 21:28:13 2962 N MARYLAND AV Street/Alley Lighting Outage Web 2020-01-18 11:07:39 Record number 3393912 is a request for Electri...
53968 2020-06-29 00:00:00 2449 N MARTIN L KING JR DR, MILWAUKEE, WISCONS... Missed Collection: Recycling NaT The church has not been gathering since March,...
8200 2020-09-25 00:00:00 NaN Miscellaneous Information Request 2020-09-25 00:00:00 STATUS REQUEST
30873 2020-08-17 00:00:00 2823 N 20TH ST, MILWAUKEE, WISCONSIN, 53206-1605 Sanitation Inspector Notification NaT Requesting alley inspection stating there is d...
84486 2020-04-28 11:23:30 927 N 27TH ST Debris/Litter/Mess Web 2020-05-08 21:58:46 Closed

Those column names could use some work.

calls.columns = ['created', 'address', 'title', 'closed', 'desc']

What time span does the data cover? Since we made the 'created' and 'closed' columns datetime, we can use .min() and .max() to find out.

print('first call: {}'.format(calls['created'].min()))
print('last call: {}'.format(calls['created'].max()))
first call: 2020-01-01 01:22:53
last call: 2020-10-13 00:00:00

The fire damage calls have the title 'Fire Damage.'

fire = calls[calls['title']=='Fire Damage'].copy()
fire.shape
(116, 5)
fire.head(20)
created address title closed desc
71 2020-10-13 10536 W GLENBROOK CT, MILWAUKEE, WISCONSIN, 53... Fire Damage NaT bldg loss $14,540.00\ncont loss $29,990.00\nca...
251 2020-10-13 4702 W TRIPOLI AV, MILWAUKEE, WISCONSIN, 53220... Fire Damage NaT bldg. loss $193,449.94\ncont loss $107,886.20\...
539 2020-10-12 2423 N 20TH ST, MILWAUKEE, WISCONSIN, 53206-1552 Fire Damage NaT bldg. loss $3,720.87\ncont loss $1383.41\ncaus...
631 2020-10-12 2921 W WALNUT ST, MILWAUKEE, WISCONSIN, 53208-... Fire Damage NaT bldg loss $79,453.71\ncont loss $9,846.89\ncau...
758 2020-10-12 3539 N VEL R PHILLIPS AV, MILWAUKEE, WISCONSIN... Fire Damage NaT bldg loss $2,586.00\ncont loss $4,808.00\ncaus...
2684 2020-10-07 2455 N FRATNEY ST, MILWAUKEE, WISCONSIN, 53212... Fire Damage 2020-10-07 bldg. loss $6,503.00\ncause-under investigation
2685 2020-10-07 2455 N FRATNEY ST, MILWAUKEE, WISCONSIN, 53212... Fire Damage 2020-10-07 fire debris on property, needs cleaning up
2866 2020-10-07 3766 N 86TH ST, MILWAUKEE, WISCONSIN, 53222-2830 Fire Damage 2020-10-08 bldg loss $5,48.26\ncause-under investigation
4437 2020-10-05 7233 N 38TH ST, 1, MILWAUKEE, WISCONSIN, 53209... Fire Damage 2020-10-05 bldg. loss $139,411.00\ncont loss $9,758.00\nc...
4885 2020-10-02 2438 N 36TH ST, MILWAUKEE, WISCONSIN, 53210-3039 Fire Damage 2020-10-08 bldg loss $61,675.00\ncont loss $22,932.00\nca...
4999 2020-10-02 3220 N 91ST ST, MILWAUKEE, WISCONSIN, 53222-3620 Fire Damage 2020-10-02 bldg. loss $12,753.00\ncont loss $948.00\ncaus...
5186 2020-10-02 6344 N 101ST ST, MILWAUKEE, WISCONSIN, 53225-0000 Fire Damage 2020-10-07 Fire happened on July 10th
5502 2020-10-01 2423 N 20TH ST, MILWAUKEE, WISCONSIN, 53206-1552 Fire Damage 2020-10-05 bldg. loss $3,720.00\ncont loss $1,383.00\ncau...
8315 2020-09-25 2214 W NEIL PL, MILWAUKEE, WISCONSIN, 53209-5055 Fire Damage 2020-09-28 bldg loss $115,492.00\ncont loss $28,626.00\nc...
8428 2020-09-25 2948 N 24TH ST, MILWAUKEE, WISCONSIN, 53206-1111 Fire Damage 2020-10-02 bldg. loss $46,659.00\ncont loss $21,973.00\nc...
8455 2020-09-25 3063 S 7TH ST, MILWAUKEE, WISCONSIN, 53215-3931 Fire Damage 2020-09-25 bldg loss $5,000.00\ncont loss $5,000.00\ncaus...
9280 2020-09-24 4035 N 45TH ST, MILWAUKEE, WISCONSIN, 53216-1518 Fire Damage 2020-10-02 bldg. loss $65,855.93\ncont loss $79,576.40\nc...
9656 2020-09-23 1407 N MARTIN L KING JR DR, MILWAUKEE, WISCONS... Fire Damage 2020-09-23 bldg. loss $5,000.00\ncont loss $2,500.00\ncau...
9694 2020-09-23 1924 W GALENA ST, MILWAUKEE, WISCONSIN, 53205-... Fire Damage 2020-09-30 bldg loss $100,000.00\ncont loss $25,000.00\nc...
10155 2020-09-23 5137 N 22ND ST, MILWAUKEE, WISCONSIN, 53209-5604 Fire Damage 2020-09-23 fire alarm report from 09/21/2020, finally got...

We can now see that there are two damage estimates reports. The first is the damage to the building. The second is the damage to the contents of the building. Some things that might cause difficulties:

  1. 'bldg' but also 'bldg.'
  2. There are not always 'cont loss' entries.
  3. There may be spelling mistakes
  4. There maybe inconsistent use of dollar signs or commas

Ugh. There is a reason why people who can do this stuff are paid well.

Regular expressions.

What follows is a simple overview of regex and some examples to help us understand how they work. You should read through the re package documentation HOWTO (up to "The Backslash Plague") before proceeding.

We split the set of characters into two parts: metacharacters and everything else. Metacharacters have special meanings. The metacharacters are

. ^ $ *  + ? { } [ ] \ | ( )

We will only cover a few of the metacharacters in this notebook.

If we were working in python (and not pandas) we would use the re package. We will use pandas methods such as .findall() and .extract() with regular expressions—pandas already knows how to interpret regular expressions. Many of the string methods in pandas accept either a string or a regex.

Let's take a look at the first text element.

# Compare this to the jupyter nb output table above. The nb had problems with formatting. 

fire.loc[71, 'desc']
'bldg loss $14,540.00\ncont loss $29,990.00\ncause-under investigation\nremarks-small interior fire'

Passing regex to pandas

We use the raw string notation which means we do not need to worry about things like \ being interpreted as a special character. Raw strings take the form r'some text', so the only difference between writing a raw string and a regular string is the 'r' in front.

We put our expression(s) in parentheses.

For example, if I want to find the word 'loss' I write

r'(loss)'

The r and apostrophes are for the raw string, the () encapsulate the regex, and the regex itself is loss.

Matching characters

Extract occurrences of 'loss'.

.findall() will do this for the entire column. I am just going to print out a few.

fire['desc'].str.findall(r'(loss)').head(7)
71      [loss, loss]
251     [loss, loss]
539     [loss, loss]
631     [loss, loss]
758     [loss, loss]
2684          [loss]
2685              []
Name: desc, dtype: object

We are given a Series with the extracted text in a list. This is a Series where each element of a column is a list. Returning a list makes is easy to have rows with different numbers of occurrences.

What happens if the extract fails?

fire['desc'].str.findall(r'(Bucky)').head(1)
71    []
Name: desc, dtype: object

Character classes

We use [] to specify a set or class of characters to match. Suppose I wanted a a, b, or c.

# 'bldg loss $14,540.00\ncont loss $29,990.00\ncause-under investigation\nremarks-small interior fire'

fire['desc'].str.findall(r'([abc])').head(1)
71    [b, c, c, a, a, a, a]
Name: desc, dtype: object

We can also specify ranges of letter or numbers.

fire['desc'].str.findall(r'([a-c])').head(1)
71    [b, c, c, a, a, a, a]
Name: desc, dtype: object

There are some special pre-defined sequences that are used often.

  • \w any alphanumeric character: equivalent to [a-zA-Z0-9_]
  • \d any numeric digit: equivalent to [0-9_]

the complements are * \W anything but alphanumeric character: equivalent to [^a-zA-Z0-9_] * \D anything but numeric digits: equivalent to [^0-9_]

The ^ character is the complement operator.

# 'bldg loss $14,540.00\ncont loss $29,990.00\ncause-under investigation\nremarks-small interior fire'

fire['desc'].str.findall(r'(\w)').head(1)
71    [b, l, d, g, l, o, s, s, 1, 4, 5, 4, 0, 0, 0, ...
Name: desc, dtype: object
fire['desc'].str.findall(r'(\d)').head(1)
71    [1, 4, 5, 4, 0, 0, 0, 2, 9, 9, 9, 0, 0, 0]
Name: desc, dtype: object

Repeating characters

  • The * means look for the preceding character(s) zero or more times
  • The + means look for the preceding character(s) one or more times

The * means that those characters may show up many times or not at all. The + means that the characters have to show up at least once.

# 'bldg loss $14,540.00\ncont loss $29,990.00\ncause-under investigation\nremarks-small interior fire'

print(fire['desc'].str.findall(r'(\d*)').head(1))
print(fire['desc'].str.findall(r'(\d+)').head(1))
71    [, , , , , , , , , , , 14, , 540, , 00, , , , ...
Name: desc, dtype: object
71    [14, 540, 00, 29, 990, 00]
Name: desc, dtype: object

Notice the difference between * and +.

  • The * expression returns many blank "hits." These are the characters leading up to the first 1. Each one of those characters is not a number, but that is still considered a "hit" because zero decimal characters is considered a match.

  • The + expression returns only the clumps of numbers, since characters with zero numbers are excluded. Each clump stops when it hits a nonnumeral: the ,, the ., and the \.

Fire damage

We have just scratched the surface of regex and you might still feel a bit uncertain about how they work. That's okay—so does everyone else. Try searching for 'regex' on stackexchange and see how many questions pop up. I find sites like https://regex101.com/ very useful. You can enter a regex and a test string and it will help you see all the matches.

Let's use regex to extract the damages from our data.

fire['desc'].str.findall(r'(\d+,\d+.\d+)').head(5)
71       [14,540.00, 29,990.00]
251    [193,449.94, 107,886.20]
539                  [3,720.87]
631       [79,453.71, 9,846.89]
758        [2,586.00, 4,808.00]
Name: desc, dtype: object
fire.loc[71,'desc']
'bldg loss $14,540.00\ncont loss $29,990.00\ncause-under investigation\nremarks-small interior fire'
fire.loc[539,'desc']
'bldg. loss $3,720.87\ncont loss $1383.41\ncause-under investigation'

I'm looking for substrings of the form y,xxx.zz BUT,

  • the y variable may have 1, 2, or 3 digits or not none at all
  • the x variable may have 1, 2, or 3 digits

My regex is

'\d+,\d+.\d+'

This is looking for at least one number and then all other numbers up to the , then more numbers up to the . and then the rest of the numbers. The + ensures that I am only getting numbers but is general enough to capture both 14 and 193.

Note that this approach could be troublesome if there are other numbers in the text, such as a phone number. We need to always check our results to see if we need to further clean the data. Each dataset is unhappy in its own way.

Did we get everything? Nope. Look at index 539. There is only one entry. The contents loss entry does not have a comma in it so it is not being found. Let's fix this.

Optional characters

We use the ? character to say that the preceding character or class is optional. Let's make the comma optional.

fire['desc'].str.findall(r'(\d+,?\d+.\d+)').head(8)
71        [14,540.00, 29,990.00]
251     [193,449.94, 107,886.20]
539          [3,720.87, 1383.41]
631        [79,453.71, 9,846.89]
758         [2,586.00, 4,808.00]
2684                  [6,503.00]
2685                          []
2866                   [5,48.26]
Name: desc, dtype: object

We now have caught the contents damage for index 539.

fire.loc[2684,'desc']
'bldg. loss $6,503.00\ncause-under investigation'

Practice

A problem remains: In index 2684, for example, we have only one entry. Here is the entry:

'bldg. loss $6,503.00\ncause-under investigation'

There is no 'contents damage' reported. How would we know from looking at our results that the 6,503.00 is the building damage entry? Perhaps there was not a building damage entry but only a contents damage entry?

  1. Modify the findall() regex statement we have been using to return the text immediately before the number. This way, we will know which kind of damage is being reported. The output for 2684 would be:
'bldg. loss $6,503.00'

This way, we know that we are finding the building damage report.

Try something like:

.str.findall(r'(bldg. loss \$?\d+,?\d+.\d+)')

How many rows return an empty list?

bld = fire['desc'].str.findall(r'(bldg. loss \$?\d+,?\d+.\d+)')
bld.head(8)
71                            []
251     [bldg. loss $193,449.94]
539       [bldg. loss $3,720.87]
631                           []
758                           []
2684      [bldg. loss $6,503.00]
2685                          []
2866                          []
Name: desc, dtype: object
# This is how I checked the observations in which my regex found nothing.

fire[bld.map(len)==0]['desc'].shape
(58,)
  1. Some of the row returning no values are errors. The problem is that some do not have bldg., but have bldg.

Use the ? to make the . optional and rerun your search.

I found 17 observations that do not have building damage reported.

bld = fire['desc'].str.findall(r'(bldg.? loss \$?\d+,?\d+.\d+)')
fire[bld.map(len)==0]['desc'].shape
(17,)
  1. Use a statement similar to that in part 2 to extract strings of the form
'cont loss $29,990.00'

You probably need to deal with inconsistent punctuation and notation. Try manually checking the entries that do not return a value. I found 27 observations that do not have contents damage reported.

# My debugging steps
# Start with the same regex I used in 1, but change bldg to cont
# cnt = fire['desc'].str.findall(r'(cont.? loss \$?\d+,?\d+.\d+)')

# loss is misspelled in 46401. So take any letters that follow 'cont. l'. Use \w+
#cnt = fire['desc'].str.findall(r'(cont.? l\w+ \$?\d+,?\d+.\d+)')

# No space before $ in 51321. Use ? to make it optional.
cnt = fire['desc'].str.findall(r'(cont.? l\w+ ?\$?\d+,?\d+.\d+)')

cnt.head()
71      [cont loss $29,990.00]
251    [cont loss $107,886.20]
539       [cont loss $1383.41]
631      [cont loss $9,846.89]
758      [cont loss $4,808.00]
Name: desc, dtype: object
fire[cnt.map(len)==0]['desc']
2684       bldg. loss $6,503.00\ncause-under investigation
2685            fire debris on property, needs cleaning up
2866         bldg loss $5,48.26\ncause-under investigation
5186                            Fire happened on July 10th
11215    bldg. loss $75,000.00\ncause-intentional\nrema...
16264    fire damage just today, fire department was ou...
16996          raw sewage coming into basement of building
22711      bldg loss $53,569.00\ncause-under investigation
27090    bldg loss $5,120.65\ncause-under investigation...
28141    Fire damage to house, put in request for fire ...
29210    Attic Fire has taken place.  There is damage t...
31017    bldg. loss $21,478.00\ncause-incense ignited s...
34274      bldg loss $83,927.00\ncause-under investigation
36772    bldg loss $10,727.14\ncause-radiant heat from ...
37621                Unreported fire damage from 7/8/2020?
51056    Yesterday (07/04/2020), my grandmother’s nei...
53530                                         Fire damage,
53571    bldg. loss $107,538.00\ncause-under investigat...
56554                  Fire damage, condemnation candidate
56827    Garage was set on fire, am requesting fire ala...
58354                                  bldg loss $5,231.00
60749    Under investigation.\nRemarks - Fire and smoke...
64496    The property grass and weeds have not been cut...
68662                       Please inspect for fire damage
68730                       Please inspect for fire damage
69091                       Please inspect for fire damage
73763    To whom it may concern. I wish to stay anonymo...
Name: desc, dtype: object
  1. Now repeat steps 2 and 3 but save the output into new columns of the fire DataFrame. Name them bld_loss and cnt_loss.

Rather than use .findall(), use .extract(). .extract() will return the first occurrence of the regex. Since we only have at most one occurrence per observation, this is good. The benefit of .extract() over .findall() is that .extract() returns a string rather than a list.

Something like:

```python fire['bld_loss'] = fire['desc'].str.extract(r'(bldg.? loss \$?\d+,?\d+.\d+)')

```python
fire['bld_loss'] = fire['desc'].str.extract(r'(bldg.? loss \$?\d+,?\d+.\d+)')
fire['cnt_loss'] = fire['desc'].str.extract(r'(c\w+.? l\w+ ?\$?\d+,?\d+.\d+)')

fire.head()
created address title closed desc bld_loss cnt_loss
71 2020-10-13 10536 W GLENBROOK CT, MILWAUKEE, WISCONSIN, 53... Fire Damage NaT bldg loss $14,540.00\ncont loss $29,990.00\nca... bldg loss $14,540.00 cont loss $29,990.00
251 2020-10-13 4702 W TRIPOLI AV, MILWAUKEE, WISCONSIN, 53220... Fire Damage NaT bldg. loss $193,449.94\ncont loss $107,886.20\... bldg. loss $193,449.94 cont loss $107,886.20
539 2020-10-12 2423 N 20TH ST, MILWAUKEE, WISCONSIN, 53206-1552 Fire Damage NaT bldg. loss $3,720.87\ncont loss $1383.41\ncaus... bldg. loss $3,720.87 cont loss $1383.41
631 2020-10-12 2921 W WALNUT ST, MILWAUKEE, WISCONSIN, 53208-... Fire Damage NaT bldg loss $79,453.71\ncont loss $9,846.89\ncau... bldg loss $79,453.71 cont loss $9,846.89
758 2020-10-12 3539 N VEL R PHILLIPS AV, MILWAUKEE, WISCONSIN... Fire Damage NaT bldg loss $2,586.00\ncont loss $4,808.00\ncaus... bldg loss $2,586.00 cont loss $4,808.00
  1. From 'bld_loss' and 'cnt_loss' extract only the numeric parts so that we have data we can work with. I used .extract() again with the appropriate regex.
fire['bld_loss'] = fire['bld_loss'].str.extract(r'(\d+,?\d+.\d+)')
fire['cnt_loss'] = fire['cnt_loss'].str.extract(r'(\d+,?\d+.\d+)')
  1. Convert your extracted values to numeric types. You will need to deal with the commas.
fire['bld_loss'] = fire['bld_loss'].str.replace(',', '').astype(float)
fire['cnt_loss'] = fire['cnt_loss'].str.replace(',', '').astype(float)
fire.dtypes
created     datetime64[ns]
address             object
title               object
closed      datetime64[ns]
desc                object
bld_loss           float64
cnt_loss           float64
dtype: object
  1. Compute some summary statistics. What was the average damage to buildings and contents? What are the max and min?

If you finish early: Create some histograms visualize the data.

fire[['bld_loss', 'cnt_loss']].describe().applymap("{0:,.0f}".format)
bld_loss cnt_loss
count 99 89
mean 117,945 39,912
std 297,512 67,763
min 548 20
25% 15,004 8,687
50% 55,446 20,000
75% 113,648 38,482
max 2,633,953 531,035
# I am not including two very large fires in the histogram.

fire['bld_loss'].sort_values().dropna().tail(2)
62206    1307226.0
58606    2633953.0
Name: bld_loss, dtype: float64
fig, ax = plt.subplots(figsize=(12,5))

ax.hist(fire.loc[fire['bld_loss']<=600000,'bld_loss'].dropna(), 
        color='silver', rwidth=0.8, 
        bins=15, align='mid')

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax.set_xlabel('fire damage to building in dollars')
ax.set_ylabel('number of occurrences')
ax.set_title('Distribution of fire damage reports in Milwaukee, YTD 2020')

ax.text(350000, 10, 
        'Not reported in this figure:\n  one report with \$1.3 mil. damage\n  one report with \$2.6 mil. damage',
       bbox=dict(boxstyle='square', fc='white'))


plt.show()

png