Economics + Data

Fifteen-minute Friday #5

files needed = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv')

Fifteen-minute Fridays are brief, nongraded workbooks that provide some extra practice and introduce new topics for self-guided study. This week we are working on:

More on file paths
Using a loop to read many files

The solutions to the practice problems are at the end of the notebook.

Relative file paths

A common way to organize your projects is to keep your data files in a folder separate from your code and your output. This means that you will need to specify where your files are located.

1. Get your current working directory. We did this in class using the os package.
2. Inside your current working directory, create a folder named 'data_files'.
3. Move the files 'data1.csv', 'data2.csv', 'data3.csv', and 'data4.csv' into the 'data_files' folder.

Absolute paths

In class, we joined the current working directory (cwd) and the file name to access files that are not in our current working directory. For example, on Windows, where '\\' is the escaped path separator, I could use

fname = os.getcwd()+'\\'+'data_files'+'\\'+'data1.csv'

and the variable fname would hold a string with the path to the file 'data1.csv'. Writing the path this way gives us the absolute path to the file.
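The '\\' separator above only works on Windows. As a minimal sketch, assuming you want the same path on any operating system, os.path.join() from the os package can insert the correct separator for you. The folder and file names are the ones used in this workbook.

```python
import os

# Build the path piece by piece; os.path.join() adds the separator
# that the current operating system expects ('\\' on Windows, '/' elsewhere).
fname = os.path.join(os.getcwd(), 'data_files', 'data1.csv')

print(fname)

# The path starts from the root of the file system, so it is absolute.
print(os.path.isabs(fname))
```

The resulting string can be passed to pd.read_csv() in the same way as the hand-built path.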

4. Use the absolute path approach to read data1.csv into a DataFrame.

Relative paths

Suppose you moved these files to another machine. Depending on how that machine is set up, the absolute path may not work any longer. For example, the drive letter (C:\, D:\, etc.) you are working on may have changed. Now your code is broken!

We can construct a relative path to avoid these problems. A relative path specifies a path relative to the current working directory. We are already using a relative path whenever we pass only a file name, with no folders in front of it. For example, in

df = pd.read_csv('data1.csv')

read_csv() is looking for the file 'data1.csv' in your current working directory. It will not find the file there, because the file is in the 'data_files' folder! Using a relative path, we start from our current working directory and go from there. So the file path in

df = pd.read_csv('data_files\\data1.csv')

tells read_csv() to look for the file inside a folder named 'data_files' in our current working directory.
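To see how a relative path is interpreted, it can help to ask Python for the absolute path it corresponds to. The sketch below uses os.path.abspath(), which resolves a relative path against the current working directory. It uses os.path.join() instead of '\\' so that it runs on any operating system; that substitution is my assumption, not something we did in class.

```python
import os

# A relative path: it starts from the current working directory.
rel_path = os.path.join('data_files', 'data1.csv')

# os.path.abspath() shows the full path that the relative path points to.
full_path = os.path.abspath(rel_path)

print('relative:', rel_path)
print('absolute:', full_path)

# Check whether the file is really there before trying to read it.
print('file found:', os.path.exists(rel_path))
```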

5. Use the relative path approach to read 'data1.csv' into a DataFrame.
6. Create a folder named 'inner_folder' inside your data_files folder.
7. Move 'data2.csv' into the 'inner_folder'.
8. Use a relative path to read 'data2.csv' into a DataFrame.

What if we need to reference a file in a folder that is not in our current working directory? We use .. to tell the computer to look one folder-level "up".

Let's try it out. Close this notebook and move it inside 'inner_folder'. Then reopen the notebook and come back to this line. Go ahead, I'll wait here.

9. Now, use 'Run All Above' to execute your notebook. The absolute path you created in step 4 is now broken.
10. What is your current working directory now?

Suppose we want to open 'data1.csv', which is not in our current working directory but in the directory "above." We would use

df = pd.read_csv('..\\data1.csv')

The .. moves up one directory, so read_csv() is looking inside the 'data_files' folder for the file.
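A quick way to check where .. points is to resolve it to an absolute path. This is a small sketch using os.path.abspath() and os.path.dirname() from the os package; it does not read any files.

```python
import os

cwd = os.getcwd()

# '..' resolves to the folder that contains the current working directory.
parent = os.path.abspath('..')

print('current working directory:', cwd)
print('one level up:             ', parent)

# A relative path to a file one level up, built without hard-coding '\\'.
rel_path = os.path.join('..', 'data1.csv')
print('resolves to:              ', os.path.abspath(rel_path))
```

If the notebook is in 'inner_folder', the second line printed should end in 'data_files'.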

11. Open 'data1.csv' using a relative path.

Using a loop to read files

One last problem to finish off your week...

1. Move 'data2.csv' back into 'data_files'. All four data files should be in the 'data_files' folder and this notebook should be in the 'inner_folder' folder.
2. Use a loop and some string manipulation to create four DataFrames from the four data files. Assign the DataFrames to a dict with the keys df1, df2, df3, and df4.

Relative file paths: Solutions

# 1. Get the current working directory.
import os
os.getcwd()

# 4. Read in the file with an absolute path.
import pandas as pd
df = pd.read_csv(os.getcwd()+'\\'+'data_files'+'\\'+'data1.csv')
df.head(2)

# 5. Read in the file with a relative path.
df = pd.read_csv('data_files\\data1.csv')

# 8. Read in the file with a relative path through two levels of directories.  
df = pd.read_csv('data_files\\inner_folder\\data2.csv')

# 11. Move up one directory to read in a file with a relative path.
df = pd.read_csv('..\\data1.csv')

Loops to read in files: Solutions

# 2. Loop over the files and create a dict of DataFrames.

data = {}
for i in range(1, 5):
    # Build the key ('df1', ..., 'df4') and the matching file path from i.
    data['df' + str(i)] = pd.read_csv('..\\data' + str(i) + '.csv')
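The string manipulation in the loop can also be written with f-strings, which some people find easier to read than chaining + and str(). The sketch below only builds the keys and the file paths; the dict name paths is hypothetical, and each path would then be passed to pd.read_csv() as in the solution above. It uses os.path.join() so that it is not tied to the Windows '\\' separator.

```python
import os

# Map each key ('df1', ..., 'df4') to the relative path of its file,
# which sits one folder level above the notebook.
paths = {}
for i in range(1, 5):
    paths[f'df{i}'] = os.path.join('..', f'data{i}.csv')

for key, path in paths.items():
    print(key, '->', path)
```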