
Coding practice #4: Due midnight, November 17

files needed = ('airline_products_2017.csv')

Answer the questions below in a Jupyter notebook. You can simply add cells to this notebook and enter your answers. When you are finished, upload the completed notebook to Canvas.

My office hours are Tuesdays 9:00AM-10:00AM and Tuesdays 3:30PM-4:30PM in 7444 Soc Sciences. Satyen's are Mondays 3:00PM-4:00PM in 6413 Soc Sciences and Mitchell's are Thursdays 3:00PM-4:00PM in 7308 Soc Sciences.

You should feel free to discuss the coding practice with your classmates, but the work you turn in should be your own.

Cite any code that is not yours: badgerdata.org/pages/citing-code

Read the class policy on AI: badgerdata.org/pages/ai

Exercise 0: Last, First

Replace 'Last, First' above with your actual name. Enter it as: last name, first name.

Exercise 1: groupby and more bar charts

The file 'airline_products_2017.csv' contains some data used in the first chapter of Dennis's dissertation. The data are taken from the Airline Origin & Destination Survey (DB1B) but have been substantially cleaned. Thanks Dennis! [Dennis was the TA for this class in 2018. He is now working at Bates White Consulting. The course TAs have contributed a lot to the class.]

In particular, the data contain information on a sample of airline itineraries for flights departing from one of seven airports in the San Francisco Bay region and arriving at one of the other large cities in the United States in the second quarter of 2017. Each observation contains information on the origin airport, destination airport, airline, nonstop or connecting itinerary type, average one-way fare in dollars, and distance between the origin and destination (in miles).

In this exercise, we will make a simple scatterplot and then repeatedly use the .groupby() method to create several bar charts. Follow the instructions below.

Part (a):

  1. Load the data as a Pandas data frame and keep only the nonstop flights.
  2. Make a scatterplot with distance on the x axis and fare on the y axis.
  3. Make the scatterpoints blue circles that are not filled in.
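The filtering and marker styling above can be sketched on a toy data frame. This is only an illustration: the column names ('nonstop', 'distance', 'fare') and the values are made up, so check the actual names in the csv file before adapting it.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the real file; column names and values are hypothetical.
toy = pd.DataFrame({
    "nonstop": [1, 1, 0, 1],
    "distance": [340, 1850, 2100, 2580],
    "fare": [110.0, 240.0, 205.0, 310.0],
})

# Keep only the rows flagged as nonstop.
direct = toy[toy["nonstop"] == 1]

fig, ax = plt.subplots(figsize=(8, 5))
# facecolors="none" leaves the circles hollow; edgecolors sets the outline color.
points = ax.scatter(direct["distance"], direct["fare"],
                    marker="o", facecolors="none", edgecolors="blue")
ax.set_xlabel("distance (miles)")
ax.set_ylabel("average one-way fare (dollars)")
```

In a notebook the figure displays when the cell finishes running.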

You'll notice a strange gap in the distance data between 1,000 and 1,500 miles—it's called "flyover country" for a reason!

Part (b):

  1. Use the .groupby() method to obtain the median fare for nonstop flights for each airline. Print the median fares to the screen.

  2. Make a bar chart displaying the median fare for American Airlines, Delta Air Lines, Southwest Airlines, and United Airlines. Give the chart appropriate labels etc. and make it look nice.
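As a reminder of the groupby pattern, here is a minimal sketch on toy data. The column names ('airline', 'fare') and the fares are invented for illustration and may not match the file.

```python
import pandas as pd

# Toy data; column names and values are hypothetical.
toy = pd.DataFrame({
    "airline": ["Delta", "Delta", "Delta", "United", "United", "Alaska"],
    "fare": [100.0, 300.0, 140.0, 200.0, 260.0, 180.0],
})

# One median per airline; the result is a Series indexed by airline name.
medians = toy.groupby("airline")["fare"].median()
print(medians)

# Pick out a subset of groups by label before plotting.
subset = medians.loc[["Delta", "United"]]
```

Because the result is a Series indexed by airline, .loc with a list of names returns just those airlines, in the order listed.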

Contrary to intuition, Delta appears to have much lower fares than Southwest!

Part (c):

The bar chart above might be misleading. Different airlines may fly to different cities, and those cities are different distances from San Francisco (a composition effect). We might be confusing variation in fares across airlines with variation in fares across flights of different distances. Let's try to fix this:

  1. On the original data frame of nonstop flights, use the pandas cut() function [pd.cut( )] to create distance bins 500 miles in width (starting at 0 and ending at 3000).

  2. Use the .groupby() method to obtain the median airfare by (distance-bin, airline) pair.

  3. Then reset the index on the resulting series. Your data should be long at this point.

  4. Keep the observations for the 4 airlines discussed above: American Airlines, Delta Air Lines, Southwest Airlines, and United Airlines.

  5. Convert the distance bin column to a string instead of an Interval object type. [try astype( )]

  6. Now let's create two horizontal bar charts using the subplot method. The first shows the median fare by airline for trips between 0 and 500 miles. The second does so for trips between 2,000 and 2,500 miles.

The horizontal axis should show the median fare, the vertical axis should have the airline names as labels. Make the other aspects of the figure look nice. My figure looks like this.

The resulting chart should show that Delta has the lowest fares for the flights between 0 and 500 miles but the highest fares for flights between 2,000 and 2,500 miles.
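The steps in this part can be sketched end to end on a toy data frame. Everything here is an assumption made for illustration: the column names, the airlines, and the fares are invented, and the bin labels produced by astype(str) depend on the bin edges you pass in.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data; column names and values are hypothetical.
toy = pd.DataFrame({
    "airline": ["Delta", "United", "Delta", "United", "Delta", "United"],
    "distance": [300, 450, 350, 2100, 2300, 2450],
    "fare": [90.0, 120.0, 110.0, 330.0, 380.0, 350.0],
})

# Edges 0, 500, ..., 3000 give six right-closed intervals such as (0, 500].
edges = list(range(0, 3001, 500))
toy["dist_bin"] = pd.cut(toy["distance"], bins=edges)

# Median fare per (bin, airline) pair, then back to a long data frame.
# observed=True drops bins that contain no rows.
long_df = (toy.groupby(["dist_bin", "airline"], observed=True)["fare"]
              .median()
              .reset_index())

# Interval -> string, so bins can be selected with a plain string comparison.
long_df["dist_bin"] = long_df["dist_bin"].astype(str)

# Two horizontal bar charts side by side, one per distance bin.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, label in zip(axes, ["(0, 500]", "(2000, 2500]"]):
    panel = long_df[long_df["dist_bin"] == label]
    ax.barh(panel["airline"], panel["fare"])
    ax.set_title(f"distance in {label} miles")
    ax.set_xlabel("median fare (dollars)")
```

Printing long_df["dist_bin"].unique() is a quick way to see the exact strings to filter on.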

Part (d): Challenging

  1. Create a single grouped (horizontal) bar chart with price in dollars on the x axis and the names of the airlines on the y axis. Plot six bars for each airline (one per distance bin), grouped by airline. My figure looks like this. I am not happy with my colors; I need to work on it some more.

I used seaborn to do this but you are welcome to use whatever method you prefer.
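If you would rather not use seaborn, one possible alternative is to pivot the long data to wide and use the pandas plotting interface. This is a sketch on toy data with hypothetical column names and only two bins, not the method used for my figure.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy long-form data; column names and values are made up for illustration.
long_df = pd.DataFrame({
    "airline": ["Delta", "Delta", "United", "United"],
    "dist_bin": ["(0, 500]", "(500, 1000]", "(0, 500]", "(500, 1000]"],
    "fare": [100.0, 150.0, 120.0, 140.0],
})

# Long -> wide: one row per airline, one column per distance bin.
wide = long_df.pivot(index="airline", columns="dist_bin", values="fare")

# DataFrame.plot.barh draws one bar per column, clustered by row label.
ax = wide.plot.barh(figsize=(8, 5))
ax.set_xlabel("median fare (dollars)")
ax.set_ylabel("")
ax.legend(title="distance (miles)")
```

Note that pivot sorts the string bin labels alphabetically, so with all six bins you may need to reorder the columns of the wide data frame before plotting.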