Reading and writing data with Pandas
files needed = (gdp_components.csv, gdp_parts.csv, debt.xlsx)
We have seen some of the basic things we can do with Pandas. In doing so, we created some simple DataFrames from dicts. That was simple, but it is almost never how we create DataFrames in the wild.
Most data live in files, often as comma-separated values or as MS Excel workbooks, either on our computers or in the cloud. In this notebook, we will review ways to get data into (and out of) Pandas. Our topics are
- The file system
- Reading CSV files
- Reading Excel files
Reading from your computer
Let's start by getting files from our own computers. We start by loading Pandas. We are also loading the os package. os
means 'operating system' and it contains functions that help us navigate the file structure of our computers.
import pandas as pd # load the pandas package and call it pd
import os # The package name is already short enough. No need to rename it.
If you have not already, move the gdp_components.csv
file to your U:\ drive and put it in the same folder that holds this notebook. We expect this file to contain U.S. GDP and its major components. Your directory structure might look something like
U:\
|
+-- Data_Class
| +-- pandas_io_finished.ipynb
| +-- gdp_components.csv
| +-- gdp_parts.csv
| +-- debt.xlsx
I am running the pandas_io_finished.ipynb
notebook inside the Data_Class folder and my data file is in the same folder.
Let's get this data file into a pandas DataFrame. We use pd.read_csv().
# read_csv is a part of Pandas, so we need the pd.
gdp = pd.read_csv('gdp_components.csv')
print(type(gdp))
<class 'pandas.core.frame.DataFrame'>
This looks successful. .read_csv()
takes a string with the file name and creates a DataFrame. Let's take a look at the data.
print(gdp)
DATE GDPA GPDIA GCEA EXPGSA IMPGSA
0 1929 104.556 17.170 9.622 5.939 5.556
1 1930 92.160 11.428 10.273 4.444 4.121
2 1931 77.391 6.549 10.169 2.906 2.905
3 1932 59.522 1.819 8.946 1.975 1.932
4 1933 57.154 2.276 8.875 1.987 1.929
.. ... ... ... ... ... ...
84 2013 16784.851 2826.013 3132.409 2273.428 2764.210
85 2014 17521.747 3038.931 3167.041 2371.027 2879.284
86 2015 18219.297 3211.971 3234.210 2265.047 2786.461
87 2016 18707.189 3169.887 3290.979 2217.576 2738.146
88 2017 19485.394 3367.965 3374.444 2350.175 2928.596
[89 rows x 6 columns]
If we print a large DataFrame, print()
gives us a truncated view—the middle rows are missing. We can use the .head()
and .tail()
methods of DataFrame to peek at just the first or last few rows.
# Show the first 4 rows.
print(gdp.head(4))
DATE GDPA GPDIA GCEA EXPGSA IMPGSA
0 1929 104.556 17.170 9.622 5.939 5.556
1 1930 92.160 11.428 10.273 4.444 4.121
2 1931 77.391 6.549 10.169 2.906 2.905
3 1932 59.522 1.819 8.946 1.975 1.932
If you do not pass .head()
or .tail()
an argument, it defaults to 5 rows.
print(gdp.tail(2))
DATE GDPA GPDIA GCEA EXPGSA IMPGSA
87 2016 18707.189 3169.887 3290.979 2217.576 2738.146
88 2017 19485.394 3367.965 3374.444 2350.175 2928.596
The index isn't very sensible. This is time series data (the unit of observation is a year), so the date seems like a good index. How do we set the index?
# We could use 'inplace = True' if we didn't need a copy.
gdp_new_index = gdp.set_index('DATE')
print(gdp_new_index.head())
GDPA GPDIA GCEA EXPGSA IMPGSA
DATE
1929 104.556 17.170 9.622 5.939 5.556
1930 92.160 11.428 10.273 4.444 4.121
1931 77.391 6.549 10.169 2.906 2.905
1932 59.522 1.819 8.946 1.975 1.932
1933 57.154 2.276 8.875 1.987 1.929
gdp_new_index
GDPA | GPDIA | GCEA | EXPGSA | IMPGSA | |
---|---|---|---|---|---|
DATE | |||||
1929 | 104.556 | 17.170 | 9.622 | 5.939 | 5.556 |
1930 | 92.160 | 11.428 | 10.273 | 4.444 | 4.121 |
1931 | 77.391 | 6.549 | 10.169 | 2.906 | 2.905 |
1932 | 59.522 | 1.819 | 8.946 | 1.975 | 1.932 |
1933 | 57.154 | 2.276 | 8.875 | 1.987 | 1.929 |
... | ... | ... | ... | ... | ... |
2013 | 16784.851 | 2826.013 | 3132.409 | 2273.428 | 2764.210 |
2014 | 17521.747 | 3038.931 | 3167.041 | 2371.027 | 2879.284 |
2015 | 18219.297 | 3211.971 | 3234.210 | 2265.047 | 2786.461 |
2016 | 18707.189 | 3169.887 | 3290.979 | 2217.576 | 2738.146 |
2017 | 19485.394 | 3367.965 | 3374.444 | 2350.175 | 2928.596 |
89 rows × 5 columns
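If we later regret the choice of index, the counterpart method .reset_index() moves the index back into a regular column. A minimal sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({'DATE': [1929, 1930], 'GDPA': [104.556, 92.160]})

indexed = df.set_index('DATE')     # DATE becomes the index...
restored = indexed.reset_index()   # ...and reset_index() turns it back into a column
print(restored.columns.tolist())   # → ['DATE', 'GDPA']
```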
We can also set the index as we read in the file. Let's take a look at the read_csv() function.
pd.read_csv?
I'm seeing a lot of good stuff here: index_col, usecols, header, sep, ... and some options I don't know about, too. When reading in messy files, these extra arguments may come in handy.
Let's give index_col
a try.
# Treat the CSV like a DataFrame. Count columns starting with 0.
gdp_2 = pd.read_csv('gdp_components.csv', index_col = 0)
gdp_2.head()
GDPA | GPDIA | GCEA | EXPGSA | IMPGSA | |
---|---|---|---|---|---|
DATE | |||||
1929 | 104.556 | 17.170 | 9.622 | 5.939 | 5.556 |
1930 | 92.160 | 11.428 | 10.273 | 4.444 | 4.121 |
1931 | 77.391 | 6.549 | 10.169 | 2.906 | 2.905 |
1932 | 59.522 | 1.819 | 8.946 | 1.975 | 1.932 |
1933 | 57.154 | 2.276 | 8.875 | 1.987 | 1.929 |
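The usecols argument works the same way. The sketch below uses a tiny in-memory CSV (via io.StringIO) instead of the real file, so the numbers are just placeholders:

```python
import io
import pandas as pd

# Tiny in-memory stand-in for gdp_components.csv (placeholder numbers).
raw = "DATE,GDPA,GPDIA\n1929,104.556,17.170\n1930,92.160,11.428\n"

# usecols keeps only the listed columns; index_col works alongside it.
small = pd.read_csv(io.StringIO(raw), usecols=['DATE', 'GDPA'], index_col='DATE')
print(small.columns.tolist())   # → ['GDPA']
```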
Navigating your file structure
We dumped our file into our current working directory so we could just ask for the file name gdp_components.csv
in read_csv()
. What is our current working directory (cwd)?
path_to_cwd = os.getcwd() # getcwd() is part of the os package we imported earlier
print(path_to_cwd)
U:\Data_Class
When we gave read_csv() the string 'gdp_components.csv', it looked in our cwd for the file. Let's try something more complicated. Go into your Data_Class folder and create a new folder called Data_Files. Move the file 'gdp_parts.csv' into the Data_Files folder.
My folder structure looks like this. Data_Class
is the folder I keep all my files in for this class.
U:\
|
+-- ado
+-- Anaconda
+-- Data_Class
| +-- pandas_2_io.ipynb
| +-- gdp_components.csv
| +-- Data_Files
| | +-- gdp_parts.csv
|
+-- Desktop
+-- Documents
+-- python
+-- R
# This looks for gdp_parts.csv in the current working directory.
gdp_moved = pd.read_csv('gdp_parts.csv')
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
Input In [12], in <cell line: 2>()
1 # This looks for gdp_parts.csv in the current working directory.
----> 2 gdp_moved = pd.read_csv('gdp_parts.csv')
File C:\ProgramData\Anaconda3\lib\site-packages\pandas\util\_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
305 if len(args) > num_allow_args:
306 warnings.warn(
307 msg.format(arguments=arguments),
308 FutureWarning,
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
File C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py:680, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
665 kwds_defaults = _refine_defaults_read(
666 dialect,
667 delimiter,
(...)
676 defaults={"delimiter": ","},
677 )
678 kwds.update(kwds_defaults)
--> 680 return _read(filepath_or_buffer, kwds)
File C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py:575, in _read(filepath_or_buffer, kwds)
572 _validate_names(kwds.get("names", None))
574 # Create the parser.
--> 575 parser = TextFileReader(filepath_or_buffer, **kwds)
577 if chunksize or iterator:
578 return parser
File C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py:933, in TextFileReader.__init__(self, f, engine, **kwds)
930 self.options["has_index_names"] = kwds["has_index_names"]
932 self.handles: IOHandles | None = None
--> 933 self._engine = self._make_engine(f, self.engine)
File C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py:1217, in TextFileReader._make_engine(self, f, engine)
1213 mode = "rb"
1214 # error: No overload variant of "get_handle" matches argument types
1215 # "Union[str, PathLike[str], ReadCsvBuffer[bytes], ReadCsvBuffer[str]]"
1216 # , "str", "bool", "Any", "Any", "Any", "Any", "Any"
-> 1217 self.handles = get_handle( # type: ignore[call-overload]
1218 f,
1219 mode,
1220 encoding=self.options.get("encoding", None),
1221 compression=self.options.get("compression", None),
1222 memory_map=self.options.get("memory_map", False),
1223 is_text=is_text,
1224 errors=self.options.get("encoding_errors", "strict"),
1225 storage_options=self.options.get("storage_options", None),
1226 )
1227 assert self.handles is not None
1228 f = self.handles.handle
File C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\common.py:789, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
784 elif isinstance(handle, str):
785 # Check whether the filename is to be opened in binary mode.
786 # Binary mode does not support 'encoding' and 'newline'.
787 if ioargs.encoding and "b" not in ioargs.mode:
788 # Encoding
--> 789 handle = open(
790 handle,
791 ioargs.mode,
792 encoding=ioargs.encoding,
793 errors=errors,
794 newline="",
795 )
796 else:
797 # Binary mode
798 handle = open(handle, ioargs.mode)
FileNotFoundError: [Errno 2] No such file or directory: 'gdp_parts.csv'
Of course this doesn't work. The file is not in our cwd. It's good to see what that kind of error message looks like. We need to pass read_csv() the path to the file. The path is the hierarchy of folders that contains the file. In my case, the path is
U:\Data_Class\Data_Files
Note that there is a \
each time we list a new folder.
On Windows: When we specify a file path, we escape each \
by putting a second backslash in front of it.
'U:\\Data_Class\\Data_Files\\gdp_parts.csv'
On a Mac, use the forward slash /
as the separator; no escaping is needed.
'/Users/username/Data_Class/Data_Files/gdp_parts.csv'
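Beyond doubled backslashes, Python offers two other ways to write Windows paths: raw strings (an r prefix disables escaping) and plain forward slashes, which Windows also accepts. A small sketch:

```python
# Three equivalent ways to write the same Windows path:
p1 = 'U:\\Data_Class\\Data_Files\\gdp_parts.csv'   # doubled (escaped) backslashes
p2 = r'U:\Data_Class\Data_Files\gdp_parts.csv'     # raw string: backslashes taken literally
p3 = 'U:/Data_Class/Data_Files/gdp_parts.csv'      # forward slashes: Windows accepts these too
print(p1 == p2)                                    # → True
```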
gdp_moved = pd.read_csv('U:\\Data_Class\\Data_Files\\gdp_parts.csv', index_col=0)
gdp_moved.head()
GDPA | GPDIA | GCEA | EXPGSA | IMPGSA | |
---|---|---|---|---|---|
DATE | |||||
1929 | 104.556 | 17.170 | 9.622 | 5.939 | 5.556 |
1930 | 92.160 | 11.428 | 10.273 | 4.444 | 4.121 |
1931 | 77.391 | 6.549 | 10.169 | 2.906 | 2.905 |
1932 | 59.522 | 1.819 | 8.946 | 1.975 | 1.932 |
1933 | 57.154 | 2.276 | 8.875 | 1.987 | 1.929 |
We can manipulate strings to get to this, too. This approach might be useful if you needed to read in many files from the same place. (Maybe using a for loop and a list of file names?)
path_to_cwd = os.getcwd()
file_name = 'gdp_parts.csv'
path_to_data_file = path_to_cwd + '\\Data_Files\\' + file_name # Note the double \ characters
print(path_to_data_file)
U:\Data_Class\Data_Files\gdp_parts.csv
gdp_moved = pd.read_csv(path_to_data_file, index_col=0)
gdp_moved.head()
GDPA | GPDIA | GCEA | EXPGSA | IMPGSA | |
---|---|---|---|---|---|
DATE | |||||
1929 | 104.556 | 17.170 | 9.622 | 5.939 | 5.556 |
1930 | 92.160 | 11.428 | 10.273 | 4.444 | 4.121 |
1931 | 77.391 | 6.549 | 10.169 | 2.906 | 2.905 |
1932 | 59.522 | 1.819 | 8.946 | 1.975 | 1.932 |
1933 | 57.154 | 2.276 | 8.875 | 1.987 | 1.929 |
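To avoid hand-writing separators at all, we can let os.path.join build the path for us. Here is a sketch of the loop idea floated above, with a hypothetical list of file names (the read_csv call is commented out so it only runs once the files exist):

```python
import os

data_dir = os.path.join(os.getcwd(), 'Data_Files')
file_names = ['gdp_parts.csv', 'gdp_parts_3.csv']   # hypothetical list of files to load

# os.path.join inserts the right separator for the current operating system.
paths = [os.path.join(data_dir, name) for name in file_names]
# dfs = [pd.read_csv(p, index_col=0) for p in paths]   # uncomment once the files exist
print(paths)
```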
Practice: Reading CSVs
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. The TA and I are here, too.
- Try out the .to_csv() method of DataFrame. Save gdp_moved as 'gdp_parts_2.csv' in your cwd. [You can use ? if you need help.]
gdp_moved.to_csv('gdp_parts_2.csv')
- Use to_csv() again to save gdp_moved to the Data_Files folder. Name it 'gdp_parts_3.csv'.
gdp_moved.to_csv('U:\\Data_Class\\Data_Files\\gdp_parts_3.csv')
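One detail worth knowing about .to_csv(): by default it writes the index as the first column, which is why reading the file back with index_col=0 round-trips cleanly. A sketch with a tiny made-up DataFrame:

```python
import io
import pandas as pd

# A tiny stand-in DataFrame with DATE as the index (made-up numbers).
df = pd.DataFrame({'DATE': [1929, 1930], 'GDPA': [104.556, 92.16]}).set_index('DATE')

buf = io.StringIO()
df.to_csv(buf)                 # by default, the index is written as the first column
print(buf.getvalue().splitlines()[0])   # → DATE,GDPA

buf2 = io.StringIO()
df.to_csv(buf2, index=False)   # index=False drops it (here, we'd lose DATE entirely)
print(buf2.getvalue().splitlines()[0])  # → GDPA
```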
Are your files in the correct places?
Isn't this supposed to be practice reading in CSV files? Right. Let's do some of that.
- Use gdp_parts_3.csv to create a DataFrame named gdp_growth. Set the index to the dates. Print out the first 10 years of data.
gdp_growth = pd.read_csv('U:\\Data_Class\\Data_Files\\gdp_parts_3.csv', index_col=0)
print(gdp_growth.head(10))
GDPA GPDIA GCEA EXPGSA IMPGSA
DATE
1929 104.556 17.170 9.622 5.939 5.556
1930 92.160 11.428 10.273 4.444 4.121
1931 77.391 6.549 10.169 2.906 2.905
1932 59.522 1.819 8.946 1.975 1.932
1933 57.154 2.276 8.875 1.987 1.929
1934 66.800 4.296 10.721 2.561 2.239
1935 74.241 7.370 11.151 2.769 2.982
1936 84.830 9.391 13.398 3.007 3.154
1937 93.003 12.967 13.119 4.039 3.961
1938 87.352 7.944 14.170 3.811 2.845
- Rename 'GDPA' to 'gdp' and rename 'GCEA' to 'gov'
gdp_growth.rename(columns={'GDPA':'gdp', 'GCEA':'gov'}, inplace=True)
print(gdp_growth.head())
gdp GPDIA gov EXPGSA IMPGSA
DATE
1929 104.556 17.170 9.622 5.939 5.556
1930 92.160 11.428 10.273 4.444 4.121
1931 77.391 6.549 10.169 2.906 2.905
1932 59.522 1.819 8.946 1.975 1.932
1933 57.154 2.276 8.875 1.987 1.929
Reading Excel spreadsheets
Reading spreadsheets isn't much different from reading CSV files. But since workbooks are more complicated than CSV files, we have a few more options to consider. Many of the options we explore here for workbooks are also available when reading CSV files.
If you haven't already, copy over 'debt.xlsx' to your cwd. Let's open it in Excel and have a look at it...
There's a lot going on here: missing data, some #N/A stuff, and several header rows. Let's get to work.
debt = pd.read_excel('debt.xlsx')
debt.head(15)
FRED Graph Observations | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | |
---|---|---|---|---|
0 | Federal Reserve Economic Data | NaN | NaN | NaN |
1 | Link: https://fred.stlouisfed.org | NaN | NaN | NaN |
2 | Help: https://fred.stlouisfed.org/help-faq | NaN | NaN | NaN |
3 | Economic Research Division | NaN | NaN | NaN |
4 | Federal Reserve Bank of St. Louis | NaN | NaN | NaN |
5 | NaN | NaN | NaN | NaN |
6 | GDPA | Gross Domestic Product, Billions of Dollars, A... | NaN | NaN |
7 | GFDEBTN | Federal Debt: Total Public Debt, Millions of D... | NaN | NaN |
8 | DGS10 | 10-Year Treasury Constant Maturity Rate, Perce... | NaN | NaN |
9 | NaN | NaN | NaN | NaN |
10 | Frequency: Annual | NaN | NaN | NaN |
11 | observation_date | GDPA | GFDEBTN | DGS10 |
12 | 1929-01-01 00:00:00 | 104.556 | NaN | NaN |
13 | 1930-01-01 00:00:00 | 92.16 | NaN | NaN |
14 | 1931-01-01 00:00:00 | 77.391 | NaN | NaN |
# Use the 'header' option to specify the row to use as the column names (zero based, as usual).
debt = pd.read_excel('debt.xlsx', header = 12)
print(debt)
observation_date GDPA GFDEBTN DGS10
0 1929-01-01 104.556 NaN NaN
1 1930-01-01 92.160 NaN NaN
2 1931-01-01 77.391 NaN NaN
3 1932-01-01 59.522 NaN NaN
4 1933-01-01 57.154 NaN NaN
.. ... ... ... ...
85 2014-01-01 17521.747 17799837.00 2.539560
86 2015-01-01 18219.297 18344212.75 2.138287
87 2016-01-01 18707.189 19549200.50 1.837440
88 2017-01-01 19485.394 20107155.25 2.329480
89 2018-01-01 NaN NaN NaN
[90 rows x 4 columns]
That's looking good. Notice that Pandas inserted NaN for the missing data and for those #N/A entries. We will have to deal with those at some point. The header parameter is part of read_csv(), too.
We didn't specify which sheet in the workbook to load, so Pandas took the first one. We can ask for sheets by name.
debt_q = pd.read_excel('debt.xlsx', header=12, sheet_name='quarterly')
print(debt_q)
observation_date GFDEBTN DGS10 GDP
0 1947-01-01 NaN NaN 243.164
1 1947-04-01 NaN NaN 245.968
2 1947-07-01 NaN NaN 249.585
3 1947-10-01 NaN NaN 259.745
4 1948-01-01 NaN NaN 265.742
.. ... ... ... ...
281 2017-04-01 19844554.0 2.260952 19359.123
282 2017-07-01 20244900.0 2.241429 19588.074
283 2017-10-01 20492747.0 2.371452 19831.829
284 2018-01-01 21089643.0 2.758525 20041.047
285 2018-04-01 21195070.0 2.920625 20411.924
[286 rows x 4 columns]
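If we wanted every sheet at once, passing sheet_name=None returns a dict of DataFrames keyed by sheet name. The sketch below builds a throwaway two-sheet workbook first so it is self-contained (writing .xlsx files assumes the openpyxl engine is installed):

```python
import os
import tempfile
import pandas as pd

# Build a throwaway two-sheet workbook so the example is self-contained.
tmp = os.path.join(tempfile.mkdtemp(), 'demo.xlsx')
with pd.ExcelWriter(tmp) as writer:
    pd.DataFrame({'x': [1, 2]}).to_excel(writer, sheet_name='annual', index=False)
    pd.DataFrame({'x': [3, 4]}).to_excel(writer, sheet_name='quarterly', index=False)

# sheet_name=None loads every sheet, returning a dict keyed by sheet name.
sheets = pd.read_excel(tmp, sheet_name=None)
print(list(sheets.keys()))   # → ['annual', 'quarterly']
```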
We can ask for just a subset of the columns when reading in a file (CSV or xlsx). Use the usecols argument. For Excel files, it takes a list of integers or column labels, or a string of Excel column letters.
# Take the first and third columns of sheet 'quarterly'
interest_rates = pd.read_excel('debt.xlsx', header=12, sheet_name='quarterly', usecols=[0,2])
interest_rates.head()
observation_date | DGS10 | |
---|---|---|
0 | 1947-01-01 | NaN |
1 | 1947-04-01 | NaN |
2 | 1947-07-01 | NaN |
3 | 1947-10-01 | NaN |
4 | 1948-01-01 | NaN |
Practice: Reading Excel
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. I am here, too.
- Read in the quarterly data from 'debt.xlsx' and keep only the columns with the date, GDP, and GFDEBTN. Try to do it all using arguments to .read_excel(). Name your new DataFrame fed_debt. Print out the first 3 rows of fed_debt.
fed_debt = pd.read_excel('debt.xlsx', header=12, sheet_name='quarterly', usecols=[0,1,3])
fed_debt.head(3)
observation_date | GFDEBTN | GDP | |
---|---|---|---|
0 | 1947-01-01 | NaN | 243.164 |
1 | 1947-04-01 | NaN | 245.968 |
2 | 1947-07-01 | NaN | 249.585 |
# Setting the index in .read_excel()
fed_debt = pd.read_excel('debt.xlsx', header=12, sheet_name='quarterly', usecols=[0,1,3], index_col=0)
fed_debt.head(3)
GFDEBTN | GDP | |
---|---|---|
observation_date | ||
1947-01-01 | NaN | 243.164 |
1947-04-01 | NaN | 245.968 |
1947-07-01 | NaN | 249.585 |
- Oops, I wanted observation_date as the index. Go back and add that to your solution to part 1.
- What is 'GFDEBTN'? It is the federal debt, in millions. Rename this variable to 'DEBT'
fed_debt.rename(columns={'GFDEBTN':'DEBT'}, inplace=True)
fed_debt.head()
DEBT | GDP | |
---|---|---|
observation_date | ||
1947-01-01 | NaN | 243.164 |
1947-04-01 | NaN | 245.968 |
1947-07-01 | NaN | 249.585 |
1947-10-01 | NaN | 259.745 |
1948-01-01 | NaN | 265.742 |
- Create a variable named debt_ratio that is the debt-to-GDP ratio. Debt is in millions and GDP is in billions. Adjust accordingly.
fed_debt['debt_ratio'] = (fed_debt['DEBT']/1000)/fed_debt['GDP']
print(fed_debt)
DEBT GDP debt_ratio
observation_date
1947-01-01 NaN 243.164 NaN
1947-04-01 NaN 245.968 NaN
1947-07-01 NaN 249.585 NaN
1947-10-01 NaN 259.745 NaN
1948-01-01 NaN 265.742 NaN
... ... ... ...
2017-04-01 19844554.0 19359.123 1.025075
2017-07-01 20244900.0 19588.074 1.033532
2017-10-01 20492747.0 19831.829 1.033326
2018-01-01 21089643.0 20041.047 1.052322
2018-04-01 21195070.0 20411.924 1.038367
[286 rows x 3 columns]
There are a lot of missing debt values. Did Pandas throw an error? No. Pandas knows (in some cases) how to work around missing data.
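A quick way to see how much is missing is .isna().sum(). The sketch below uses a tiny stand-in for fed_debt with made-up early-year gaps:

```python
import numpy as np
import pandas as pd

# A tiny stand-in for fed_debt: debt is missing in the early years (made-up rows).
demo = pd.DataFrame({'DEBT': [np.nan, np.nan, 19844554.0],
                     'GDP': [243.164, 245.968, 19359.123]})

demo['debt_ratio'] = (demo['DEBT'] / 1000) / demo['GDP']
print(demo['debt_ratio'].isna().sum())   # → 2 missing ratios
```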
- Summarize the debt_ratio variable. What is its max level? Its min?
print(fed_debt['debt_ratio'].describe())
count 210.000000
mean 0.564994
std 0.227520
min 0.306033
25% 0.355102
50% 0.555767
75% 0.641648
max 1.052562
Name: debt_ratio, dtype: float64
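.describe() tells us the max is about 1.05, but not when it happened. The .max() and .idxmax() pair answers both questions; a sketch with a hypothetical mini version of the series:

```python
import pandas as pd

# A hypothetical mini version of debt_ratio, indexed by date.
ratio = pd.Series([0.31, 0.56, 1.05],
                  index=pd.to_datetime(['1974-01-01', '1995-01-01', '2017-01-01']))

print(ratio.max())     # → 1.05  (the highest ratio)
print(ratio.idxmax())  # the date on which it occurred
```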