Today's agenda:
Be sure to include your group member names on the first page of your report.
We have been lucky to have the SSCC and Winstat to use in class. Soon you will be moving on from this class, and from the University, so it makes sense to set up a local python installation.
On Winstat, we have been working with the Anaconda distribution. Recall from earlier in the semester that python is a standard set of functions and data types that can be extended by installing packages. Anaconda is a distribution: a bundle of python and a set of packages useful for data analysis and numerical computing. A list of the packages is here. It includes many of the packages we have been using: numpy, pandas, matplotlib, seaborn, scikit-learn...
To install Anaconda on your laptops:
If step 5 worked, then you are ready to code. I recommend that you move your files from Winstat to your local machine at some point so you have a local copy.
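One quick way to check a fresh installation (this snippet is just a suggestion, not one of the official install steps) is to import the packages we have used all semester:

```python
# Sanity check for a new local python installation: if these imports succeed,
# the core Anaconda packages are in place. Add seaborn, sklearn, etc. to the
# list if you want to check those too.
import numpy
import pandas
import matplotlib

print("numpy", numpy.__version__)
print("pandas", pandas.__version__)
print("matplotlib", matplotlib.__version__)
```

If any import fails, the installation (or the environment you launched python from) is missing that package.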
GitHub is a web-based service that hosts (stores) files. It implements a powerful version control system called Git and has other features like bug tracking, wikis, etc.
There are a lot of useful things you can do with GitHub — most of which are outside the scope of our class. If you continue on and venture deeper into developing code (particularly if you are doing it with others) you will want to learn more about these features.
For our purposes, we want to take advantage of GitHub's ability to host jupyter notebooks. A jupyter notebook is just a text file. (Try opening one in Notepad.) The jupyter notebook software interprets the text file and renders what we see on the screen. When we upload our notebooks to GitHub, they are rendered for anyone who visits your GitHub repository. Visitors cannot run the notebook, but they can view it.
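To see for yourself that a notebook really is just text (JSON, specifically), here is a sketch that writes a minimal notebook file by hand using the standard json module and reads it back. The filename and cell contents are invented for illustration:

```python
# A notebook file is plain JSON: a dict with metadata and a list of cells.
import json

minimal_nb = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Hello"]},
    ],
}

# Write it out, then read it back in, with ordinary file and json operations.
with open("tiny.ipynb", "w") as f:
    json.dump(minimal_nb, f)

with open("tiny.ipynb") as f:
    nb = json.load(f)

print(nb["cells"][0]["cell_type"])  # prints "markdown"
```

This is also why GitHub can render a notebook without running it: the text file already contains everything needed to display the cells.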
Setting up GitHub:
You can create as many repositories as you like.
You now have a place you can upload your project and other work. Linking to the notebook on GitHub is an easy way to share your accomplishments with others: graduate schools, potential employers, your mom and dad...
Let's take a minute to reflect on what you knew when we first met on September 6. On that day, we looked at some visualizations. How does your approach to these figures today differ from your approach three months ago?
Go to https://www.cnn.com/election/2020/results/state/wisconsin/president and focus on the presidential results.
It's always a good idea to take a moment at the end of a class or project to take an inventory of what we have done or what skills we have learned. You will often be asked to make summaries like this on resumes and reports. Writing the summary when it is all fresh in your head is easier.
Skills acquired
You are probably more comfortable with some of these skills than others. You can always go back and look through the jupyter notebooks to brush up.
Having experience with commonly-used datasets is valuable. Some datasets we have worked with this semester include:
Links to these datasets can be found at http://badgerdata.org/pages/data-sets.
Many of you worked with and learned about other datasets for your projects. Add those datasets to the list above!
This class has (hopefully) provided you with a set of skills and a window into many different things you can do with a computer, python, and some data. Suppose you thought this course was fun, or useful, or both. Where could you go from here? The sky is the limit. You could spend the rest of your life learning about this kind of stuff, but here are some thoughts.
We developed a pretty good understanding of how to use python. We did not spend much time learning how to 'code' in a deeper sense: understanding object-oriented programming, structuring complex code, coding as part of a team, or optimizing code.
To be honest, most data analysis does not require you to be an expert coder. Being a better programmer, though, is helpful. If you would like to learn more about writing code, a programming course could be a good idea. You could do this here at UW, at Madison College, or as a self-guided program online or from a book.
Some benefits: Being a better coder means that you will likely write more efficient code, which becomes more important as the datasets get larger. You will also become better at writing reusable code, so that you save time and effort. It is also worth noting that once you have a good sense of how to program, learning a new language (R, Java, Ruby...) is much easier.
We spent a lot of time learning how to wrangle data. Real-world data are messy and must be beaten into shape before we can do anything useful with them. Our analytic tools were mostly lifted from the econometrics courses you have already taken. We learned how to implement them in python.
If analytic tools interest you, think about picking up some more of them. The economics department offers several quantitative courses. You can learn more about them here. The Internet is full of data-analytic tutorials. Panel data? Simulation-based econometrics? Financial statistics? Think about what kinds of data you want to work with and what kinds of questions you want to ask. Then go find the tools.
Some benefits: Having a broad set of skills will allow you to tackle a wide variety of questions. If you want to become an expert in a particular field, drill down into the techniques that are most applicable.
Our focus was on python, pandas, and matplotlib/seaborn/geopandas. Data are often locked away in databases that require structured query language (SQL) interfaces. Learning some SQL will let you get at that data, even if only to download it and load it into pandas.
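As a small taste of what that looks like, here is a minimal sketch using python's built-in sqlite3 module and pandas. The table name and numbers are made up:

```python
# A tiny SQL-to-pandas round trip using an in-memory SQLite database.
import sqlite3

import pandas as pd

con = sqlite3.connect(":memory:")  # throwaway database, nothing touches disk
con.execute("CREATE TABLE gdp (year INTEGER, value REAL)")
con.executemany("INSERT INTO gdp VALUES (?, ?)", [(2019, 1.0), (2020, 2.0)])
con.commit()

# The SQL query does the filtering; pandas turns the result into a DataFrame.
df = pd.read_sql("SELECT year, value FROM gdp WHERE year >= 2020", con)
print(df)
con.close()
```

Real databases (Postgres, MySQL, ...) need a different connection, but the pattern of writing a query and handing the result to pandas is the same.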
Learning how to read and write javascript object notation (JSON) encoded data will make it easier to snag data from the web. You can do this in pandas. Grab a cup of coffee and google 'json pandas tutorial.'
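To preview what that tutorial will show you, here is a sketch of pandas parsing a JSON string directly. The records below are invented for illustration:

```python
# Reading JSON-encoded records into a DataFrame. The data are made up.
from io import StringIO

import pandas as pd

raw = '[{"city": "Madison", "temp": 31}, {"city": "Milwaukee", "temp": 33}]'
df = pd.read_json(StringIO(raw))
print(df)
```

Data pulled from a web API usually arrives in exactly this shape, so the same call gets you from the web to a DataFrame in one step.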
'Big data' tools like Hadoop are aimed at working with really large datasets distributed across a cluster of computers. This is a pretty advanced skill, but there is no reason to think you couldn't learn how to use it if you put in the effort.
Some benefits: The more types of data you can access, the more you can do.