Visualizing Correlation Table - Data Analysis with Python 3 and Pandas
Hello and welcome to part 4 of the data analysis with Python and Pandas series. We're going to be continuing our work with the minimum wage dataset and our correlation table. Where we left off:
 | Alaska | Arkansas | California | Colorado | Connecticut | Delaware | District of Columbia | Federal (FLSA) | Guam | Hawaii | ... | Pennsylvania | Puerto Rico | Rhode Island | South Dakota | Utah | Vermont | Washington | West Virginia | Wisconsin | Wyoming |
Alaska | 1.000000 | -0.377934 | 0.717653 | -0.129332 | 0.680886 | 0.258997 | 0.321785 | 0.637679 | 0.787915 | 0.672620 | ... | 0.610814 | -0.038118 | 0.652353 | -0.326316 | -0.020462 | 0.663558 | 0.411593 | 0.044814 | 0.702570 | 0.727932 |
Arkansas | -0.377934 | 1.000000 | -0.234367 | 0.135749 | 0.047580 | 0.016125 | 0.266889 | 0.117245 | 0.039593 | 0.204801 | ... | 0.159923 | 0.232186 | 0.003498 | 0.800116 | 0.194680 | 0.087429 | -0.072343 | 0.420819 | 0.000470 | -0.250592 |
California | 0.717653 | -0.234367 | 1.000000 | 0.483313 | 0.876215 | 0.479197 | 0.596865 | 0.371966 | 0.492052 | 0.519241 | ... | 0.429061 | 0.512712 | 0.780916 | -0.036787 | 0.392898 | 0.877922 | 0.754085 | 0.371765 | 0.584067 | 0.722617 |
Colorado | -0.129332 | 0.135749 | 0.483313 | 1.000000 | 0.402020 | 0.566304 | 0.673371 | -0.232035 | -0.192616 | 0.069800 | ... | -0.136195 | 0.657364 | 0.429852 | 0.399137 | 0.622330 | 0.448485 | 0.612637 | 0.533623 | 0.011501 | 0.130053 |
Connecticut | 0.680886 | 0.047580 | 0.876215 | 0.402020 | 1.000000 | 0.552613 | 0.652488 | 0.487750 | 0.632073 | 0.621503 | ... | 0.531769 | 0.626712 | 0.802485 | 0.105707 | 0.302538 | 0.898469 | 0.715691 | 0.400099 | 0.585790 | 0.814971 |
5 rows × 39 columns
Now, we can graph this with matplotlib. If you do not have it, you need to do a pip install matplotlib. Matplotlib has a nifty graphing function called matshow that we can use:
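Here is a minimal sketch of that call. Since the minimum wage dataset isn't loaded in this snippet, it builds a small random stand-in; the name min_wage_corr and the three sample columns are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in data (assumption): random values for three states, used only
# so the snippet runs without the tutorial's dataset.
rng = np.random.default_rng(0)
sample = pd.DataFrame(
    rng.random((20, 3)), columns=["Alaska", "Arkansas", "California"]
)

# Pairwise correlation between the columns, like the table above.
min_wage_corr = sample.corr()

# matshow draws the matrix as a grid of colored cells in a new figure.
image = plt.matshow(min_wage_corr)
```

Calling plt.show() afterwards displays the figure.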
It wouldn't be Matplotlib, however, if we didn't need to do some customization.
Again, I will just do the customization. If you would like to learn more about Matplotlib, check out the data visualization series.
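One way that customization might look is sketched below. It assumes a stand-in correlation table called min_wage_corr (random data, hypothetical name) and uses the first two letters of each column name as a crude label; the figure size and colormap are just choices that read well for this kind of matrix.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in correlation table (assumption): random data for three states.
rng = np.random.default_rng(0)
sample = pd.DataFrame(
    rng.random((20, 3)), columns=["Alaska", "Arkansas", "California"]
)
min_wage_corr = sample.corr()

# Crude labels: the first two letters of each column name.
labels = [c[:2] for c in min_wage_corr.columns]

fig = plt.figure(figsize=(12, 12))             # a figure we can add an axis to
ax = fig.add_subplot(111)                      # the axis we will customize
ax.matshow(min_wage_corr, cmap=plt.cm.RdYlGn)  # red = negative, green = positive
ax.set_xticks(np.arange(len(labels)))          # one tick per column
ax.set_yticks(np.arange(len(labels)))          # one tick per row
ax.set_xticklabels(labels)                     # text labels instead of numbers
ax.set_yticklabels(labels)
```

As before, plt.show() displays the result.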
Our simple abbreviations aren't cutting it. We need something better. A quick Google search found me https://www.infoplease.com/state-abbreviations-and-state-postal-codes, which contains a table.
Guess what can read tables from the internet? Pandas can! You can use pd.read_html(URL) and pandas will search for any tables to populate a list of dfs with. Just remember, pd.read_html will return a list of dfs, not just one df.
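To show the list-of-dfs behavior without hitting the network, this sketch parses a tiny inline HTML table instead of the real page. The HTML string is made up for illustration, and pd.read_html needs an HTML parser such as lxml installed.

```python
import io
import pandas as pd

# Tiny inline page (assumption) standing in for the real URL.
html = """
<table>
  <tr><th>State/District</th><th>Abbreviation</th><th>Postal Code</th></tr>
  <tr><td>Alabama</td><td>Ala.</td><td>AL</td></tr>
  <tr><td>Alaska</td><td>Alaska</td><td>AK</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per table it finds.
dfs = pd.read_html(io.StringIO(html))
print(len(dfs))

# Pick the table we want out of the list.
state_abbv = dfs[0]
print(state_abbv.head())
```

With the real page you would pass the URL to pd.read_html instead, then look through the returned list for the table you are after.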
 | State/District | Abbreviation | Postal Code |
0 | Alabama | Ala. | AL |
1 | Alaska | Alaska | AK |
2 | Arizona | Ariz. | AZ |
3 | Arkansas | Ark. | AR |
4 | California | Calif. | CA |
Often sources decide to disable access, or disappear, so I may want to save this dataframe both for myself and to share with you all in case they stop allowing robot access! Saving a dataframe in pandas is easy:
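A sketch of the save, using a three-row stand-in for the table and a temporary directory. Both the sample rows and the file location are assumptions; any writable path works.

```python
import os
import tempfile
import pandas as pd

# Three-row stand-in (assumption) for the table read from the web.
state_abbv = pd.DataFrame({
    "State/District": ["Alabama", "Alaska", "Arizona"],
    "Abbreviation": ["Ala.", "Alaska", "Ariz."],
    "Postal Code": ["AL", "AK", "AZ"],
})

# Hypothetical location for the file.
csv_path = os.path.join(tempfile.mkdtemp(), "state_abbv.csv")

# With default arguments, to_csv also writes the index as the first column.
state_abbv.to_csv(csv_path)

# Peek at the header line that ended up in the file.
with open(csv_path) as f:
    header_line = f.readline().strip()
print(header_line)
```

Notice the header line starts with an empty cell: that is the unnamed index being written out alongside the data.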
Bring back:
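A sketch of the load, reading from an in-memory string rather than a file so it runs on its own. The CSV text (an assumption, three sample rows) mirrors what a default to_csv call writes, including the empty first header cell.

```python
import io
import pandas as pd

# CSV text as to_csv writes it by default (three sample rows); the empty
# first header cell is where the index was saved.
csv_text = """,State/District,Abbreviation,Postal Code
0,Alabama,Ala.,AL
1,Alaska,Alaska,AK
2,Arizona,Ariz.,AZ
"""

# A plain read_csv treats every column as data and builds a fresh index.
state_abbv = pd.read_csv(io.StringIO(csv_text))
print(state_abbv.columns.tolist())
```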
 | Unnamed: 0 | State/District | Abbreviation | Postal Code |
0 | 0 | Alabama | Ala. | AL |
1 | 1 | Alaska | Alaska | AK |
2 | 2 | Arizona | Ariz. | AZ |
3 | 3 | Arkansas | Ark. | AR |
4 | 4 | California | Calif. | CA |
So what happened? Well, we saved and loaded with the "index," which has created duplication. A CSV file has no idea about indexes, so pandas will by default just load in all of the data as columns, and then assign a new index. We can do things like saving with no index, we can opt to save specific columns only, and we can load in and specify an index on load. For example, this time, let's save just the specific columns we're after:
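Saving only the columns we want, with no index, might look like the sketch below. The three sample rows are stand-ins, and the in-memory buffer is an assumption used in place of a real file path.

```python
import io
import pandas as pd

# Three-row stand-in (assumption) for the full table.
state_abbv = pd.DataFrame({
    "State/District": ["Alabama", "Alaska", "Arizona"],
    "Abbreviation": ["Ala.", "Alaska", "Ariz."],
    "Postal Code": ["AL", "AK", "AZ"],
})

# In-memory stand-in for a file path.
buffer = io.StringIO()

# Select just two columns, and skip the index since it carries no information.
state_abbv[["State/District", "Postal Code"]].to_csv(buffer, index=False)
print(buffer.getvalue())
```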
Then, we can do:
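A sketch of loading with an index specified on load. The CSV text is an in-memory stand-in (an assumption) for the two-column file saved without an index.

```python
import io
import pandas as pd

# Stand-in for the two-column file saved without an index.
csv_text = """State/District,Postal Code
Alabama,AL
Alaska,AK
Arizona,AZ
"""

# index_col=0 tells pandas to use the first column as the index.
state_abbv = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(state_abbv.head())
```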
 | Postal Code |
State/District | |
Alabama | AL |
Alaska | AK |
Arizona | AZ |
Arkansas | AR |
California | CA |
Any time you're unsure about what you can do, you should always check out the Pandas Docs. They are great to just scroll through, both to learn more about what you can do and to discover parameters and methods you might otherwise not realize exist.
For example, while we're mainly working with CSVs here, we can work with many other formats: SQL, JSON, HDF5, BigQuery, and much much more!
Back to the task at hand, we are trying to use the Postal Codes for our abbreviations:
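One way to get at the codes is to turn the dataframe into a dictionary. This sketch assumes a three-row stand-in indexed by state name; the variable names are illustrative.

```python
import pandas as pd

# Stand-in (assumption): postal codes indexed by state name.
state_abbv = pd.DataFrame(
    {"Postal Code": ["AL", "AK", "AZ"]},
    index=pd.Index(["Alabama", "Alaska", "Arizona"], name="State/District"),
)

# to_dict maps each column name to a dict of {index value: cell value}.
abbv_dict = state_abbv.to_dict()
print(abbv_dict)
# {'Postal Code': {'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ'}}
```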
We can see here that it's a dict that maps to the dict we actually want, so we can reference the codes with:
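A sketch of pulling out the inner dictionary, again on a three-row stand-in (an assumption); the outer key is simply the column name.

```python
import pandas as pd

# Stand-in (assumption): postal codes indexed by state name.
state_abbv = pd.DataFrame(
    {"Postal Code": ["AL", "AK", "AZ"]},
    index=pd.Index(["Alabama", "Alaska", "Arizona"], name="State/District"),
)

# The outer dict is keyed by column name; the inner one is state -> code.
nested = state_abbv.to_dict()
abbv_dict = nested["Postal Code"]
print(abbv_dict["Alaska"])  # AK
```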
Now we can re-do our labels with:
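A sketch of the lookup, with a hand-written dictionary and column list as stand-ins. It assumes the abbreviation table only covers states and districts, so a column like Federal (FLSA) from our correlation table has no entry and the lookup raises a KeyError; the try/except is only here to show which name is missing.

```python
# Stand-ins (assumptions): a few postal codes and a few of our column names.
abbv_dict = {"Alaska": "AK", "Arkansas": "AR", "California": "CA"}
columns = ["Alaska", "Arkansas", "California", "Federal (FLSA)"]

missing = None
try:
    # One label per column, looked up by full name.
    labels = [abbv_dict[c] for c in columns]
except KeyError as err:
    # The key that was not found is carried on the exception.
    missing = err.args[0]
    print("No postal code for:", missing)
```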
Okay. Fine, we have to hack this one in ourselves!
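A sketch of the hack, assuming the entry in question is the Federal (FLSA) column from our correlation table, which a table of state abbreviations would not cover. The label text "FLSA" is my own choice, and the dictionary and column list are stand-ins.

```python
# Stand-ins (assumptions): a few postal codes and a few of our column names.
abbv_dict = {"Alaska": "AK", "Arkansas": "AR", "California": "CA"}
columns = ["Alaska", "Arkansas", "California", "Federal (FLSA)"]

# Add the missing entry by hand; "FLSA" is a label chosen for illustration.
abbv_dict["Federal (FLSA)"] = "FLSA"

labels = [abbv_dict[c] for c in columns]
print(labels)  # ['AK', 'AR', 'CA', 'FLSA']
```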
Hmm, we might have to revisit the territories, but:
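A sketch of filling in the territories by hand, assuming Guam and Puerto Rico (both columns in our correlation table) are the ones missing from the lookup. GU and PR are their postal codes; the dictionary and column list are stand-ins.

```python
# Stand-ins (assumptions): the lookup so far and a few of our column names.
abbv_dict = {"Alaska": "AK", "Federal (FLSA)": "FLSA"}
columns = ["Alaska", "Federal (FLSA)", "Guam", "Puerto Rico"]

# Territories added by hand with their postal codes.
abbv_dict["Guam"] = "GU"
abbv_dict["Puerto Rico"] = "PR"

labels = [abbv_dict[c] for c in columns]
print(labels)  # ['AK', 'FLSA', 'GU', 'PR']
```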
Okay good enough! Back to our graph now!
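Putting it together, the graph with postal-code labels might be sketched like this. The random stand-in data, the name min_wage_corr, and the small dictionary are all assumptions so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in correlation table (assumption): random data for four columns.
rng = np.random.default_rng(0)
sample = pd.DataFrame(
    rng.random((20, 4)),
    columns=["Alaska", "Arkansas", "California", "Federal (FLSA)"],
)
min_wage_corr = sample.corr()

# Stand-in lookup (assumption), including the hand-added FLSA entry.
abbv_dict = {
    "Alaska": "AK",
    "Arkansas": "AR",
    "California": "CA",
    "Federal (FLSA)": "FLSA",
}
labels = [abbv_dict[c] for c in min_wage_corr.columns]

fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(111)
ax.matshow(min_wage_corr, cmap=plt.cm.RdYlGn)  # draw the matrix
ax.set_xticks(np.arange(len(labels)))          # one tick per column
ax.set_yticks(np.arange(len(labels)))          # one tick per row
ax.set_xticklabels(labels)                     # postal codes as labels
ax.set_yticklabels(labels)
```

Call plt.show() to display it.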
Cool! We've covered quite a bit again, but hopefully that was interesting, and we got to begin to combine datasets, if only to inform our column names.
... but more cool things happen when we can combine datasets more so for their data! While correlation is not causation, we can still glean interesting things from it! Plus... we get to make cool graphs, so why not? In the next tutorial, we're going to explore the relationships of minimum wage to unemployment, and maybe even toss in political persuasions of those states while we're at it!