download notebook

Coding practice #2: Due midnight, October 13

files needed = ('banks_and_branches.csv', 'GDPCA.csv')

Answer the questions below in a jupyter notebook. You can simply add cells to this notebook and enter your answers. When you are finished, upload the completed notebook to canvas.

My office hours are Tuesdays 9:00AM-10:00AM and Tuesdays 3:30PM-4:30PM in 7444 Soc Sciences. Satyen's are Mondays 3:00PM-4:00PM in 6413 Soc Sciences and Mitchell's are Thursdays 3:00PM-4:00PM in 7308 Soc Sciences.

You should feel free to discuss the coding practice with your classmates, but the work you turn in should be your own.

Cite any code that is not yours: badgerdata.org/pages/citing-code

Read the class policy on AI: badgerdata.org/pages/ai

Exercise 0: Last, First

Replace 'Last, First' above with your actual name. Enter it as your last name, first name.

Exercise 1: Subsets

Let's practice pulling subsets out of a data frame. We subset a lot. Our goal is to build some muscle memory, so that every time we need to subset the data, we do not need to go look up how to do it.

To this end, first try the exercises below without consulting your notes or the internet. Sort out where you need to improve and keep practicing!

A. Create a DataFrame from the following dict.

data_dict = {'soda':['coke', 'diet coke', 'sprite', 'pepsi', 'mug', 'mt. dew'],
            'cals':[140, 1, 90, 150, 130, 170],
            'sodium':[45, 40, 65, 30, 65, 60],
            'corp': ['coca cola', 'coca cola', 'coca cola', 'pepsico', 'pepsico', 'pepsico']}

B. Print a DataFrame containing only sodas with more than 10 calories.

C. Print a DataFrame containing only sodas with more than 10 calories and less than 100 calories.

D. Print a DataFrame containing only data for coke, pepsi, and mug. Use the .isin() method.

E. Set the index of the DataFrame to 'soda'. Print your DataFrame.

F. Use .loc[] to print a DataFrame containing only coke, pepsi, and mug.

G. Print out the average sodium for pepsico products that have more than 135 calories.

H. Print out the number of pepsico products with sodium above 60 mg.

I. Print out the calories in a coke.

Exercise 2: Importing data

The file 'banks_and_branches.csv' (on the course website for download) contains data on the number of commercial banking institutions, branches, and offices in the United States at the end of each year between 1934 and 2017.

[The data are from Table CB01, which is maintained by the Federal Deposit Insurance Corporation (FDIC). FDIC data can be downloaded from https://www.fdic.gov/open/datatools.html. You are encouraged to browse around their website to see what's there — this might be a useful source for your final project.]

Your task is to import this data into Python and get it into good shape.

Part (a):

Import the data from 'banks_and_branches.csv' into a pandas data frame called banking_data. If you open the .csv file in Excel or a similar program, you'll notice that there are notes in several of the cells at the beginning of the file. Rather than delete these manually, you should use the header option in pandas.read_csv() to use the correct row for the column names.

Part (b):

  1. Convert each of the column names to lower case using a list comprehension.
  2. We don't need the 'branches' column—you might notice that the variable 'offices' is equal to 'institutions' plus 'branches'. Delete the 'branches' column.
  3. Rename the 'institutions' column 'banks'.
  4. Print the data types of each column.

Part (c):

You'll notice that some of the columns are listed as objects and not integers.

  1. Convert these columns to integers. Note that you'll need to remove the thousands separator (',') from each number first. [Have a look at pandas' str.replace() method and the .astype() method.] Print the data types of each column to confirm that they are all integers.

  2. Create a variable containing the average number of offices per bank in each year. Add a new column called 'offices_per_bank' to banking_data with this information.

  3. Save your DataFrame to a CSV file named 'banking_data.csv'. We will use this file for coding practice #3. (You should not upload this file to canvas. Only upload the jupyter notebook.)

banking_data.head()
banking_data.to_csv('banking_data.csv')

Exercise 3

Let's work with U.S. GDP. The data come from FRED. The goal it to brush up on our data cleaning skills and compute a few summary statistics.

Part (a):

  1. The file GDPCA.csv contains annual real GDP. Read it into a DataFrame and create a series of growth rates.

  2. Something went wrong, right? There is a gremlin in the data. Find a way to use read_csv() to turn the bad data into NaNs. Now compute the growth rates. Print the first 5 rows.

Part (b)

  1. Compute the mean gdp growth rate for 1960–1985
  2. Compute the standard deviation of the gdp growth rate for 1960–1985
  3. Repeat parts 1. and 2. for the period 1985–2007
  4. Print out your results in a few neatly formatted sentences. Comment on the differences between the two periods.

Part (c)

  1. Recompute the mean and standard deviation for 1985-2021.
  2. How have things changed?

Exercise 4: Cake cutting (Challenging)

Source: This exercise is a modified version of one of Google's own coding exams, used to recruit new software developers each year.

The setup: Happy Birthday! We brought some cake to celebrate! You take one look at your cake, and see the writing on it is representable as a string, abcabcabcabc.

How to make everyone at your party happy: cut the cake such that everyone gets (1) the same number of letters, (2) the same sequence of letters as everyone else, and (3) you've cut your cake into the maximum number of pieces possible. So, for this cake, we can obviously split it into four pieces, each containing abc. Note: sequences are not interchangeable, so bca is not the same as abc.

Your task: given an arbitrary cake, find the maximum number of pieces we can cut it into, i.e. the number of friends you can share your cake with.

Here are the steps. You work out the code.

  1. Find the potential ways you can cut the cake. You cannot have fractions of letters. For example. Suppose the cake was 12 letters long cake = 'abccbaabccba'. You could cut it up into 1, 2, 3, 4, 6, or 12 pieces. Notice that you can divide 12 by each of these numbers and not have a remainder. Store your potential cake-piece lengths in a list named piece_lengths.

  2. For each potential piece length in piece_lengths:

  3. Cut the first piece from the cake. (Remember: you can slice a string.) Given our example, the first piece would be 'a', since the first potential piece length is 1. The second potential length is 2, so that first piece would be 'ab'.
  4. Create a string that is made up of the piece you cut repeated until it has the same length as the cake. In our example, we would repeat 'a' 12 times: test_cake = 'aaaaaaaaaaaa'. We would repeat 'ab' six times. test_cake = 'abababababab'.
  5. Compare your test cake to the cake. If they are the same, then this is a successful cutting strategy. Keep this piece length—save it to a list. If the cake and test cake are not the same, move on to the next potential piece length. In our example, 6 is the only piece length that works. In general, there may be many piece lengths that work.

  6. From your list of acceptable piece lengths, choose the shortest. Report the length of the piece and the number of pieces you can cut the cake into.

Test your algorithm with

test_1 = 'abccbaabccba'
test_2 = 'abcabcabcabc'
test_3 = 'abcdabcdabcdabcd'