Reading and writing data with Pandas
files needed = (gdp_components.csv, gdp_parts.csv, debt.xlsx)
We have seen some of the basic things we can do with Pandas. In doing so, we created some simple DataFrames from dicts. That was simple, but it is almost never how we create DataFrames in the wild.
Most data live in files, often as comma-separated values or as MS Excel workbooks, either on our computers or in the cloud. In this notebook, we will review ways to get data into (and out of) Pandas. Our topics are
- The file system
- Reading CSV files
- Reading Excel files
Reading from your computer
Let's start by getting files from our own computers. We start by loading Pandas. We are also loading the os package. os
means 'operating system' and it contains functions that help us navigate the file structure of our computers.
import pandas as pd # load the pandas package and call it pd
import os # The package name is already short enough. No need to rename it.
If you have not already, move the gdp_components.csv
file to your U:\ drive and put it in the same folder that holds this notebook. We expect this file to contain U.S. GDP and its major components. Your directory structure might look something like
U:\
|
+-- Data_Class
| +-- pandas_io_finished.ipynb
| +-- gdp_components.csv
| +-- gdp_parts.csv
| +-- debt.xlsx
I am running the pandas_io_finished.ipynb
notebook inside the Data_Class folder and my data file is in the same folder.
Let's get this data file into a pandas DataFrame. We use pd.read_csv().
# read_csv is a part of Pandas, so we need the pd.
gdp = pd.read_csv('gdp_components.csv')
print(type(gdp))
<class 'pandas.core.frame.DataFrame'>
This looks successful. .read_csv()
takes a string with the file name and creates a DataFrame. Let's take a look at the data.
print(gdp)
DATE GDPA GPDIA GCEA EXPGSA IMPGSA
0 1929 104.556 17.170 9.622 5.939 5.556
1 1930 92.160 11.428 10.273 4.444 4.121
2 1931 77.391 6.549 10.169 2.906 2.905
3 1932 59.522 1.819 8.946 1.975 1.932
4 1933 57.154 2.276 8.875 1.987 1.929
.. ... ... ... ... ... ...
84 2013 16784.851 2826.013 3132.409 2273.428 2764.210
85 2014 17521.747 3038.931 3167.041 2371.027 2879.284
86 2015 18219.297 3211.971 3234.210 2265.047 2786.461
87 2016 18707.189 3169.887 3290.979 2217.576 2738.146
88 2017 19485.394 3367.965 3374.444 2350.175 2928.596
[89 rows x 6 columns]
If we print a large DataFrame, print()
gives us a truncated view—the middle rows are missing. We can use the .head()
and .tail()
methods of DataFrame to peek at just the first or last few rows.
# Show the first 4 rows.
print(gdp.head(4))
DATE GDPA GPDIA GCEA EXPGSA IMPGSA
0 1929 104.556 17.170 9.622 5.939 5.556
1 1930 92.160 11.428 10.273 4.444 4.121
2 1931 77.391 6.549 10.169 2.906 2.905
3 1932 59.522 1.819 8.946 1.975 1.932
If you do not pass .head()
or .tail()
an argument, it defaults to 5 rows.
print(gdp.tail(2))
DATE GDPA GPDIA GCEA EXPGSA IMPGSA
87 2016 18707.189 3169.887 3290.979 2217.576 2738.146
88 2017 19485.394 3367.965 3374.444 2350.175 2928.596
The index isn't very sensible. This is time series data (the unit of observation is a year), so the date seems like a good index. How do we set the index?
# We could use 'inplace = True' if we didn't need a copy.
gdp_new_index = gdp.set_index('DATE')
print(gdp_new_index.head())
GDPA GPDIA GCEA EXPGSA IMPGSA
DATE
1929 104.556 17.170 9.622 5.939 5.556
1930 92.160 11.428 10.273 4.444 4.121
1931 77.391 6.549 10.169 2.906 2.905
1932 59.522 1.819 8.946 1.975 1.932
1933 57.154 2.276 8.875 1.987 1.929
gdp_new_index
GDPA | GPDIA | GCEA | EXPGSA | IMPGSA | |
---|---|---|---|---|---|
DATE | |||||
1929 | 104.556 | 17.170 | 9.622 | 5.939 | 5.556 |
1930 | 92.160 | 11.428 | 10.273 | 4.444 | 4.121 |
1931 | 77.391 | 6.549 | 10.169 | 2.906 | 2.905 |
1932 | 59.522 | 1.819 | 8.946 | 1.975 | 1.932 |
1933 | 57.154 | 2.276 | 8.875 | 1.987 | 1.929 |
... | ... | ... | ... | ... | ... |
2013 | 16784.851 | 2826.013 | 3132.409 | 2273.428 | 2764.210 |
2014 | 17521.747 | 3038.931 | 3167.041 | 2371.027 | 2879.284 |
2015 | 18219.297 | 3211.971 | 3234.210 | 2265.047 | 2786.461 |
2016 | 18707.189 | 3169.887 | 3290.979 | 2217.576 | 2738.146 |
2017 | 19485.394 | 3367.965 | 3374.444 | 2350.175 | 2928.596 |
89 rows × 5 columns
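If we later regret the choice of index, the counterpart method .reset_index() moves the index back into a regular column. A minimal sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({'DATE': [1929, 1930], 'GDPA': [104.556, 92.160]})

indexed = df.set_index('DATE')     # DATE becomes the index...
restored = indexed.reset_index()   # ...and reset_index() turns it back into a column
print(restored.columns.tolist())   # → ['DATE', 'GDPA']
```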
We can also set the index as we read in the file. Let's take a look at the read_csv() function.
pd.read_csv?
I'm seeing a lot of good stuff here: index_col, usecols, header, sep, ... and some options I don't know about, too. When reading in messy files, these extra arguments may come in handy.
Let's give index_col
a try.
# Treat the CSV like a DataFrame. Count columns starting with 0.
gdp_2 = pd.read_csv('gdp_components.csv', index_col = 0)
gdp_2.head()
GDPA | GPDIA | GCEA | EXPGSA | IMPGSA | |
---|---|---|---|---|---|
DATE | |||||
1929 | 104.556 | 17.170 | 9.622 | 5.939 | 5.556 |
1930 | 92.160 | 11.428 | 10.273 | 4.444 | 4.121 |
1931 | 77.391 | 6.549 | 10.169 | 2.906 | 2.905 |
1932 | 59.522 | 1.819 | 8.946 | 1.975 | 1.932 |
1933 | 57.154 | 2.276 | 8.875 | 1.987 | 1.929 |
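The usecols argument works the same way. The sketch below uses a tiny in-memory CSV (via io.StringIO) instead of the real file, so the numbers are just placeholders:

```python
import io
import pandas as pd

# Tiny in-memory stand-in for gdp_components.csv (placeholder numbers).
raw = "DATE,GDPA,GPDIA\n1929,104.556,17.170\n1930,92.160,11.428\n"

# usecols keeps only the listed columns; index_col works alongside it.
small = pd.read_csv(io.StringIO(raw), usecols=['DATE', 'GDPA'], index_col='DATE')
print(small.columns.tolist())   # → ['GDPA']
```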
Navigating your file structure
We dumped our file into our current working directory so we could just ask for the file name gdp_components.csv
in read_csv()
. What is our current working directory (cwd)?
path_to_cwd = os.getcwd() # getcwd() is part of the os package we imported earlier
print(path_to_cwd)
U:\Data_Class
When we gave read_csv() the string 'gdp_components.csv', it looked in our cwd for the file. Let's try something more complicated. Go into your Data_Class folder and create a new folder called Data_Files. Move the file 'gdp_parts.csv' into the Data_Files folder.
My folder structure looks like this. Data_Class
is the folder I keep all my files in for this class.
U:\
|
+-- ado
+-- Anaconda
+-- Data_Class
| +-- pandas_2_io.ipynb
| +-- gdp_components.csv
| +-- Data_Files
| | +-- gdp_parts.csv
|
+-- Desktop
+-- Documents
+-- python
+-- R
# This looks for gdp_parts.csv in the current working directory.
gdp_moved = pd.read_csv('gdp_parts.csv')
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
Input In [12], in <cell line: 2>()
1 # This looks for gdp_parts.csv in the current working directory.
----> 2 gdp_moved = pd.read_csv('gdp_parts.csv')
File C:\ProgramData\Anaconda3\lib\site-packages\pandas\util\_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
305 if len(args) > num_allow_args:
306 warnings.warn(
307 msg.format(arguments=arguments),
308 FutureWarning,
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
File C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py:680, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
665 kwds_defaults = _refine_defaults_read(
666 dialect,
667 delimiter,
(...)
676 defaults={"delimiter": ","},
677 )
678 kwds.update(kwds_defaults)
--> 680 return _read(filepath_or_buffer, kwds)
File C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py:575, in _read(filepath_or_buffer, kwds)
572 _validate_names(kwds.get("names", None))
574 # Create the parser.
--> 575 parser = TextFileReader(filepath_or_buffer, **kwds)
577 if chunksize or iterator:
578 return parser
File C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py:933, in TextFileReader.__init__(self, f, engine, **kwds)
930 self.options["has_index_names"] = kwds["has_index_names"]
932 self.handles: IOHandles | None = None
--> 933 self._engine = self._make_engine(f, self.engine)
File C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py:1217, in TextFileReader._make_engine(self, f, engine)
1213 mode = "rb"
1214 # error: No overload variant of "get_handle" matches argument types
1215 # "Union[str, PathLike[str], ReadCsvBuffer[bytes], ReadCsvBuffer[str]]"
1216 # , "str", "bool", "Any", "Any", "Any", "Any", "Any"
-> 1217 self.handles = get_handle( # type: ignore[call-overload]
1218 f,
1219 mode,
1220 encoding=self.options.get("encoding", None),
1221 compression=self.options.get("compression", None),
1222 memory_map=self.options.get("memory_map", False),
1223 is_text=is_text,
1224 errors=self.options.get("encoding_errors", "strict"),
1225 storage_options=self.options.get("storage_options", None),
1226 )
1227 assert self.handles is not None
1228 f = self.handles.handle
File C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\common.py:789, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
784 elif isinstance(handle, str):
785 # Check whether the filename is to be opened in binary mode.
786 # Binary mode does not support 'encoding' and 'newline'.
787 if ioargs.encoding and "b" not in ioargs.mode:
788 # Encoding
--> 789 handle = open(
790 handle,
791 ioargs.mode,
792 encoding=ioargs.encoding,
793 errors=errors,
794 newline="",
795 )
796 else:
797 # Binary mode
798 handle = open(handle, ioargs.mode)
FileNotFoundError: [Errno 2] No such file or directory: 'gdp_parts.csv'
Of course this doesn't work. The file is not in our cwd. It's good to see what that kind of error message looks like. We need to pass read_csv() the path to the file. The path is the hierarchy of folders that contains the file. In my case, the path is
U:\Data_Class\Data_Files
Note that there is a \
each time we list a new folder.
On Windows: When we specify a file path, we escape each \
by putting a second backslash in front of it.
'U:\\Data_Class\\Data_Files\\gdp_parts.csv'
On a Mac, use the forward slash /
as the separator; no escaping is needed.
'/Users/username/Data_Class/Data_Files/gdp_parts.csv'
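Beyond doubled backslashes, Python offers two other ways to write Windows paths: raw strings (an r prefix disables escaping) and plain forward slashes, which Windows also accepts. A small sketch:

```python
# Three equivalent ways to write the same Windows path:
p1 = 'U:\\Data_Class\\Data_Files\\gdp_parts.csv'   # doubled (escaped) backslashes
p2 = r'U:\Data_Class\Data_Files\gdp_parts.csv'     # raw string: backslashes taken literally
p3 = 'U:/Data_Class/Data_Files/gdp_parts.csv'      # forward slashes: Windows accepts these too
print(p1 == p2)                                    # → True
```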
gdp_moved = pd.read_csv('U:\\Data_Class\\Data_Files\\gdp_parts.csv', index_col=0)
gdp_moved.head()
GDPA | GPDIA | GCEA | EXPGSA | IMPGSA | |
---|---|---|---|---|---|
DATE | |||||
1929 | 104.556 | 17.170 | 9.622 | 5.939 | 5.556 |
1930 | 92.160 | 11.428 | 10.273 | 4.444 | 4.121 |
1931 | 77.391 | 6.549 | 10.169 | 2.906 | 2.905 |
1932 | 59.522 | 1.819 | 8.946 | 1.975 | 1.932 |
1933 | 57.154 | 2.276 | 8.875 | 1.987 | 1.929 |
We can manipulate strings to get to this, too. This approach might be useful if you needed to read in many files from the same place. (Maybe using a for loop and a list of file names?)
path_to_cwd = os.getcwd()
file_name = 'gdp_parts.csv'
path_to_data_file = path_to_cwd + '\\Data_Files\\' + file_name # Note the double \ characters
print(path_to_data_file)
U:\Data_Class\Data_Files\gdp_parts.csv
gdp_moved = pd.read_csv(path_to_data_file, index_col=0)
gdp_moved.head()
GDPA | GPDIA | GCEA | EXPGSA | IMPGSA | |
---|---|---|---|---|---|
DATE | |||||
1929 | 104.556 | 17.170 | 9.622 | 5.939 | 5.556 |
1930 | 92.160 | 11.428 | 10.273 | 4.444 | 4.121 |
1931 | 77.391 | 6.549 | 10.169 | 2.906 | 2.905 |
1932 | 59.522 | 1.819 | 8.946 | 1.975 | 1.932 |
1933 | 57.154 | 2.276 | 8.875 | 1.987 | 1.929 |
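To avoid hand-writing separators at all, we can let os.path.join build the path for us. Here is a sketch of the loop idea floated above, with a hypothetical list of file names (the read_csv call is commented out so it only runs once the files exist):

```python
import os

data_dir = os.path.join(os.getcwd(), 'Data_Files')
file_names = ['gdp_parts.csv', 'gdp_parts_3.csv']   # hypothetical list of files to load

# os.path.join inserts the right separator for the current operating system.
paths = [os.path.join(data_dir, name) for name in file_names]
# dfs = [pd.read_csv(p, index_col=0) for p in paths]   # uncomment once the files exist
print(paths)
```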
Practice: Reading CSVs
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. The TA and I are here, too.
- Try out the .to_csv() method of DataFrame. Save gdp_moved as 'gdp_parts_2.csv' in your cwd. [You can use ? if you need help.]
gdp_moved.to_csv('gdp_parts_2.csv')
- Use to_csv() again to save gdp_moved to the Data_Files folder. Name it 'gdp_parts_3.csv'.
gdp_moved.to_csv('U:\\Data_Class\\Data_Files\\gdp_parts_3.csv')
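One detail worth knowing about .to_csv(): by default it writes the index as the first column, which is why reading the file back with index_col=0 round-trips cleanly. A sketch with a tiny made-up DataFrame:

```python
import io
import pandas as pd

# A tiny stand-in DataFrame with DATE as the index (made-up numbers).
df = pd.DataFrame({'DATE': [1929, 1930], 'GDPA': [104.556, 92.16]}).set_index('DATE')

buf = io.StringIO()
df.to_csv(buf)                 # by default, the index is written as the first column
print(buf.getvalue().splitlines()[0])   # → DATE,GDPA

buf2 = io.StringIO()
df.to_csv(buf2, index=False)   # index=False drops it (here, we'd lose DATE entirely)
print(buf2.getvalue().splitlines()[0])  # → GDPA
```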
Are your files in the correct places?
Isn't this supposed to be practice reading in CSV files? Right. Let's do some of that.
- Use gdp_parts_3.csv to create a DataFrame named gdp_growth. Set the index to the dates. Print out the first 10 years of data.
gdp_growth = pd.read_csv('U:\\Data_Class\\Data_Files\\gdp_parts_3.csv', index_col=0)
print(gdp_growth.head(10))
GDPA GPDIA GCEA EXPGSA IMPGSA
DATE
1929 104.556 17.170 9.622 5.939 5.556
1930 92.160 11.428 10.273 4.444 4.121
1931 77.391 6.549 10.169 2.906 2.905
1932 59.522 1.819 8.946 1.975 1.932
1933 57.154 2.276 8.875 1.987 1.929
1934 66.800 4.296 10.721 2.561 2.239
1935 74.241 7.370 11.151 2.769 2.982
1936 84.830 9.391 13.398 3.007 3.154
1937 93.003 12.967 13.119 4.039 3.961
1938 87.352 7.944 14.170 3.811 2.845
- Rename 'GDPA' to 'gdp' and rename 'GCEA' to 'gov'
gdp_growth.rename(columns={'GDPA':'gdp', 'GCEA':'gov'}, inplace=True)
print(gdp_growth.head())
gdp GPDIA gov EXPGSA IMPGSA
DATE
1929 104.556 17.170 9.622 5.939 5.556
1930 92.160 11.428 10.273 4.444 4.121
1931 77.391 6.549 10.169 2.906 2.905
1932 59.522 1.819 8.946 1.975 1.932
1933 57.154 2.276 8.875 1.987 1.929
Reading Excel spreadsheets
Reading spreadsheets isn't much different from reading CSV files. But since workbooks are more complicated than CSV files, we have a few more options to consider. Many of the options we explore here for workbooks are also available when reading CSV files.
If you haven't already, copy over 'debt.xlsx' to your cwd. Let's open it in Excel and have a look at it...
There's a lot going on here: missing data, some #N/A stuff, and several header rows. Let's get to work.
debt = pd.read_excel('debt.xlsx')
debt.head(15)
FRED Graph Observations | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | |
---|---|---|---|---|
0 | Federal Reserve Economic Data | NaN | NaN | NaN |
1 | Link: https://fred.stlouisfed.org | NaN | NaN | NaN |
2 | Help: https://fred.stlouisfed.org/help-faq | NaN | NaN | NaN |
3 | Economic Research Division | NaN | NaN | NaN |
4 | Federal Reserve Bank of St. Louis | NaN | NaN | NaN |
5 | NaN | NaN | NaN | NaN |
6 | GDPA | Gross Domestic Product, Billions of Dollars, A... | NaN | NaN |
7 | GFDEBTN | Federal Debt: Total Public Debt, Millions of D... | NaN | NaN |
8 | DGS10 | 10-Year Treasury Constant Maturity Rate, Perce... | NaN | NaN |
9 | NaN | NaN | NaN | NaN |
10 | Frequency: Annual | NaN | NaN | NaN |
11 | observation_date | GDPA | GFDEBTN | DGS10 |
12 | 1929-01-01 00:00:00 | 104.556 | NaN | NaN |
13 | 1930-01-01 00:00:00 | 92.16 | NaN | NaN |
14 | 1931-01-01 00:00:00 | 77.391 | NaN | NaN |
# Use the 'header' option to specify the row to use as the column names (zero based, as usual).
debt = pd.read_excel('debt.xlsx', header = 12)
print(debt)
observation_date GDPA GFDEBTN DGS10
0 1929-01-01 104.556 NaN NaN
1 1930-01-01 92.160 NaN NaN
2 1931-01-01 77.391 NaN NaN
3 1932-01-01 59.522 NaN NaN
4 1933-01-01 57.154 NaN NaN
.. ... ... ... ...
85 2014-01-01 17521.747 17799837.00 2.539560
86 2015-01-01 18219.297 18344212.75 2.138287
87 2016-01-01 18707.189 19549200.50 1.837440
88 2017-01-01 19485.394 20107155.25 2.329480
89 2018-01-01 NaN NaN NaN
[90 rows x 4 columns]
That's looking good. Notice that Pandas inserted NaN for the missing data and for those #N/A entries. We will have to deal with those at some point. The header parameter is part of read_csv(), too.
We didn't specify which sheet in the workbook to load, so Pandas took the first one. We can ask for sheets by name.
debt_q = pd.read_excel('debt.xlsx', header=12, sheet_name='quarterly')
print(debt_q)
observation_date GFDEBTN DGS10 GDP
0 1947-01-01 NaN NaN 243.164
1 1947-04-01 NaN NaN 245.968
2 1947-07-01 NaN NaN 249.585
3 1947-10-01 NaN NaN 259.745
4 1948-01-01 NaN NaN 265.742
.. ... ... ... ...
281 2017-04-01 19844554.0 2.260952 19359.123
282 2017-07-01 20244900.0 2.241429 19588.074
283 2017-10-01 20492747.0 2.371452 19831.829
284 2018-01-01 21089643.0 2.758525 20041.047
285 2018-04-01 21195070.0 2.920625 20411.924
[286 rows x 4 columns]
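If we wanted every sheet at once, passing sheet_name=None returns a dict of DataFrames keyed by sheet name. The sketch below builds a throwaway two-sheet workbook first so it is self-contained (writing .xlsx files assumes the openpyxl engine is installed):

```python
import os
import tempfile
import pandas as pd

# Build a throwaway two-sheet workbook so the example is self-contained.
tmp = os.path.join(tempfile.mkdtemp(), 'demo.xlsx')
with pd.ExcelWriter(tmp) as writer:
    pd.DataFrame({'x': [1, 2]}).to_excel(writer, sheet_name='annual', index=False)
    pd.DataFrame({'x': [3, 4]}).to_excel(writer, sheet_name='quarterly', index=False)

# sheet_name=None loads every sheet, returning a dict keyed by sheet name.
sheets = pd.read_excel(tmp, sheet_name=None)
print(list(sheets.keys()))   # → ['annual', 'quarterly']
```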
We can ask for just a subset of the columns when reading in a file (CSV or xlsx). Use the usecols argument. For Excel files, it takes a list of integers or column labels, or a string of Excel column letters.
# Take the first and third columns of sheet 'quarterly'
interest_rates = pd.read_excel('debt.xlsx', header=12, sheet_name='quarterly', usecols=[0,2])
interest_rates.head()
observation_date | DGS10 | |
---|---|---|
0 | 1947-01-01 | NaN |
1 | 1947-04-01 | NaN |
2 | 1947-07-01 | NaN |
3 | 1947-10-01 | NaN |
4 | 1948-01-01 | NaN |
Practice: Reading Excel
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. I am here, too.
- Read in the quarterly data from 'debt.xlsx' and keep only the columns with the date, GDP, and GFDEBTN. Try to do it all using arguments to .read_excel(). Name your new DataFrame fed_debt. Print out the first 3 rows of fed_debt.
fed_debt = pd.read_excel('debt.xlsx', header=12, sheet_name='quarterly', usecols=[0,1,3])
fed_debt.head(3)
observation_date | GFDEBTN | GDP | |
---|---|---|---|
0 | 1947-01-01 | NaN | 243.164 |
1 | 1947-04-01 | NaN | 245.968 |
2 | 1947-07-01 | NaN | 249.585 |
# Setting the index in .read_excel()
fed_debt = pd.read_excel('debt.xlsx', header=12, sheet_name='quarterly', usecols=[0,1,3], index_col=0)
fed_debt.head(3)
GFDEBTN | GDP | |
---|---|---|
observation_date | ||
1947-01-01 | NaN | 243.164 |
1947-04-01 | NaN | 245.968 |
1947-07-01 | NaN | 249.585 |
- Oops, I wanted observation_date as the index. Go back and add that to your solution to part 1.
- What is 'GFDEBTN'? It is the federal debt, in millions. Rename this variable to 'DEBT'
fed_debt.rename(columns={'GFDEBTN':'DEBT'}, inplace=True)
fed_debt.head()
DEBT | GDP | |
---|---|---|
observation_date | ||
1947-01-01 | NaN | 243.164 |
1947-04-01 | NaN | 245.968 |
1947-07-01 | NaN | 249.585 |
1947-10-01 | NaN | 259.745 |
1948-01-01 | NaN | 265.742 |
- Create a variable named debt_ratio that is the debt-to-GDP ratio. Debt is in millions and GDP is in billions. Adjust accordingly.
fed_debt['debt_ratio'] = (fed_debt['DEBT']/1000)/fed_debt['GDP']
print(fed_debt)
DEBT GDP debt_ratio
observation_date
1947-01-01 NaN 243.164 NaN
1947-04-01 NaN 245.968 NaN
1947-07-01 NaN 249.585 NaN
1947-10-01 NaN 259.745 NaN
1948-01-01 NaN 265.742 NaN
... ... ... ...
2017-04-01 19844554.0 19359.123 1.025075
2017-07-01 20244900.0 19588.074 1.033532
2017-10-01 20492747.0 19831.829 1.033326
2018-01-01 21089643.0 20041.047 1.052322
2018-04-01 21195070.0 20411.924 1.038367
[286 rows x 3 columns]
There are a lot of missing debt values. Did Pandas throw an error? No. Pandas knows (in some cases) how to work around missing data.
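A quick way to see how much is missing is .isna().sum(). The sketch below uses a tiny stand-in for fed_debt with made-up early-year gaps:

```python
import numpy as np
import pandas as pd

# A tiny stand-in for fed_debt: debt is missing in the early years (made-up rows).
demo = pd.DataFrame({'DEBT': [np.nan, np.nan, 19844554.0],
                     'GDP': [243.164, 245.968, 19359.123]})

demo['debt_ratio'] = (demo['DEBT'] / 1000) / demo['GDP']
print(demo['debt_ratio'].isna().sum())   # → 2 missing ratios
```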
- Summarize the debt_ratio variable. What is its max level? Its min?
print(fed_debt['debt_ratio'].describe())
count 210.000000
mean 0.564994
std 0.227520
min 0.306033
25% 0.355102
50% 0.555767
75% 0.641648
max 1.052562
Name: debt_ratio, dtype: float64
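.describe() tells us the max is about 1.05, but not when it happened. The .max() and .idxmax() pair answers both questions; a sketch with a hypothetical mini version of the series:

```python
import pandas as pd

# A hypothetical mini version of debt_ratio, indexed by date.
ratio = pd.Series([0.31, 0.56, 1.05],
                  index=pd.to_datetime(['1974-01-01', '1995-01-01', '2017-01-01']))

print(ratio.max())     # → 1.05  (the highest ratio)
print(ratio.idxmax())  # the date on which it occurred
```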