Combining multiple datasets - Data Analysis with Python 3 and Pandas
Hello and welcome to part 5 of the data analysis with Python and Pandas series. In this tutorial, we're going to build off the lasdt and experiment with combining datasets to see if we can find more relationships.
We've been working with the Kaggle Minimum Wage by State dataset, and now I am curious to bring in the Unemployment by County dataset. Sometimes, it's argued that increasing minimum wage makes it so that employing people is more costly, and, as a result, employment may drop.
Year | Month | State | County | Rate | |
0 | 2015 | February | Mississippi | Newton County | 6.1 |
1 | 2015 | February | Mississippi | Panola County | 9.4 |
2 | 2015 | February | Mississippi | Monroe County | 7.9 |
3 | 2015 | February | Mississippi | Hinds County | 6.1 |
4 | 2015 | February | Mississippi | Kemper County | 10.6 |
Now, we'd want to map the minimum wage by state to this. In many states, there are varying minimum wages by city. We'll just have to accept that this isn't going to be perfect. Let's now load in our minimum wage data:
Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware | District of Columbia | Federal (FLSA) | ... | Tennessee | Texas | U.S. Virgin Islands | Utah | Vermont | Virginia | Washington | West Virginia | Wisconsin | Wyoming | |
Year | |||||||||||||||||||||
1968 | 0.0 | 15.12 | 3.37 | 1.12 | 11.88 | 7.20 | 10.08 | 9.00 | 9.00 | 8.28 | ... | 0.0 | 0.00 | NaN | 7.20 | 10.08 | 0.0 | 11.52 | 7.20 | 9.00 | 8.64 |
1969 | 0.0 | 14.33 | 3.19 | 1.07 | 11.26 | 6.83 | 9.56 | 8.53 | 8.53 | 7.85 | ... | 0.0 | 0.00 | NaN | 6.83 | 9.56 | 0.0 | 10.92 | 6.83 | 8.53 | 8.19 |
1970 | 0.0 | 13.54 | 3.02 | 7.09 | 10.64 | 6.45 | 10.32 | 8.06 | 10.32 | 8.38 | ... | 0.0 | 0.00 | NaN | 6.45 | 10.32 | 0.0 | 10.32 | 6.45 | 8.38 | 8.38 |
1971 | 0.0 | 12.99 | 2.89 | 6.80 | 10.20 | 6.18 | 9.89 | 7.73 | 9.89 | 8.04 | ... | 0.0 | 0.00 | NaN | 6.18 | 9.89 | 0.0 | 9.89 | 6.18 | 8.04 | 8.04 |
1972 | 0.0 | 12.57 | 2.80 | 7.19 | 9.88 | 5.99 | 11.08 | 9.58 | 9.58 | 9.58 | ... | 0.0 | 8.38 | NaN | 7.19 | 9.58 | 0.0 | 9.58 | 7.19 | 8.68 | 8.98 |
5 rows A-- 55 columns
Oh right, missing data. Let's throw this out from our analysis:
Alaska | Arkansas | California | Colorado | Connecticut | Delaware | District of Columbia | Federal (FLSA) | Guam | Hawaii | ... | Pennsylvania | Puerto Rico | Rhode Island | South Dakota | Utah | Vermont | Washington | West Virginia | Wisconsin | Wyoming | |
Year | |||||||||||||||||||||
1968 | 15.12 | 1.12 | 11.88 | 7.20 | 10.08 | 9.00 | 9.00 | 8.28 | 9.00 | 9.00 | ... | 8.28 | 3.10 | 10.08 | 3.06 | 7.20 | 10.08 | 11.52 | 7.20 | 9.00 | 8.64 |
1969 | 14.33 | 1.07 | 11.26 | 6.83 | 9.56 | 8.53 | 8.53 | 7.85 | 8.53 | 8.53 | ... | 7.85 | 2.94 | 9.56 | 2.90 | 6.83 | 9.56 | 10.92 | 6.83 | 8.53 | 8.19 |
1970 | 13.54 | 7.09 | 10.64 | 6.45 | 10.32 | 8.06 | 10.32 | 8.38 | 10.32 | 10.32 | ... | 8.38 | 2.77 | 10.32 | 6.45 | 6.45 | 10.32 | 10.32 | 6.45 | 8.38 | 8.38 |
1971 | 12.99 | 6.80 | 10.20 | 6.18 | 9.89 | 7.73 | 9.89 | 8.04 | 9.89 | 9.89 | ... | 8.04 | 2.66 | 9.89 | 6.18 | 6.18 | 9.89 | 9.89 | 6.18 | 8.04 | 8.04 |
1972 | 12.57 | 7.19 | 9.88 | 5.99 | 11.08 | 9.58 | 9.58 | 9.58 | 11.38 | 9.58 | ... | 9.58 | 3.89 | 9.58 | 5.99 | 7.19 | 9.58 | 9.58 | 7.19 | 8.68 | 8.98 |
5 rows A-- 39 columns
So now...what do we do? This seems like it might be a bit more complex than the other things we've done.
Well, our very end goal is to see if there's any relationship between unemployment and the minimum wage. To do this, we'd like to be able to just call .corr
or .cov
on some dataframe. Seems to me like it'd be most convenient then to do this on our unemp_county
dataframe. We'd really just like to add a new column, called minimum wage, then populate that column based on that ROW's state!
Logically, we're going to need to go, row by row, check the state, and set the minimum wage column to that state's value in the other dataframe. It's often going to be the case that you have some custom task like this. In this case, we're mapping some values from one dataframe to another, but maybe another time it wont even be a dataframe to another, it could be some sensor value, some sort of custom user input, or even something that will require even further calculations on. Who knows, but Pandas is extremely flexible and we can map functions to columns, based on row values. Let's see how!
First, let's just create a function that would handle this:
Then for example we could do:
Now, we map!
We can use this method to map just about any function with as many parameters as we want to a column. This method will basically always work, but wont necessarily be the most efficient. Often, we can use .map
or .apply
insted to a column, or some other built-in methods, but the above is always an option.
Year | Month | State | County | Rate | min_wage | |
0 | 2015 | February | Mississippi | Newton County | 6.1 | NaN |
1 | 2015 | February | Mississippi | Panola County | 9.4 | NaN |
2 | 2015 | February | Mississippi | Monroe County | 7.9 | NaN |
3 | 2015 | February | Mississippi | Hinds County | 6.1 | NaN |
4 | 2015 | February | Mississippi | Kemper County | 10.6 | NaN |
Year | Month | State | County | Rate | min_wage | |
885543 | 2009 | November | Maine | Somerset County | 10.5 | 8.46 |
885544 | 2009 | November | Maine | Oxford County | 10.5 | 8.46 |
885545 | 2009 | November | Maine | Knox County | 7.5 | 8.46 |
885546 | 2009 | November | Maine | Piscataquis County | 11.3 | 8.46 |
885547 | 2009 | November | Maine | Aroostook County | 9.0 | 8.46 |
Rate | min_wage | |
Rate | 1.000000 | 0.153047 |
min_wage | 0.153047 | 1.000000 |
Rate | min_wage | |
Rate | 9.687873 | 0.651586 |
min_wage | 0.651586 | 1.874228 |
Interesting. It looks like there's a slightly positive relationship (correlation) between the unemployment rate and minimum wage, but also a pretty strong covariance, signaling to us that these two things do tend to vary together. It just looks like, while they definitely vary together, the actual impact of one on the other isn't very substantial. Plus, we'd have to ask next which comes first. The increased unemployment, or the minimum wage increases.
Finally, I'd like to look at election data by county and see if there's a relationship between voting, minimum wage, and unemployment. To do this, I will pull from 2016 US presidential vote by county.
county | fips | cand | st | pct_report | votes | total_votes | pct | lead | |
0 | NaN | US | Donald Trump | US | 0.9951 | 60350241.0 | 127592176.0 | 0.472993 | Donald Trump |
1 | NaN | US | Hillary Clinton | US | 0.9951 | 60981118.0 | 127592176.0 | 0.477938 | Donald Trump |
2 | NaN | US | Gary Johnson | US | 0.9951 | 4164589.0 | 127592176.0 | 0.032640 | Donald Trump |
3 | NaN | US | Jill Stein | US | 0.9951 | 1255968.0 | 127592176.0 | 0.009844 | Donald Trump |
4 | NaN | US | Evan McMullin | US | 0.9951 | 451636.0 | 127592176.0 | 0.003540 | Donald Trump |
5 | NaN | US | Darrell Castle | US | 0.9951 | 180877.0 | 127592176.0 | 0.001418 | Donald Trump |
6 | NaN | US | Gloria La Riva | US | 0.9951 | 48308.0 | 127592176.0 | 0.000379 | Donald Trump |
7 | NaN | US | Rocky De La Fuente | US | 0.9951 | 32120.0 | 127592176.0 | 0.000252 | Donald Trump |
8 | NaN | US | None of these candidates | US | 0.9951 | 28824.0 | 127592176.0 | 0.000226 | Donald Trump |
9 | NaN | US | Richard Duncan | US | 0.9951 | 23501.0 | 127592176.0 | 0.000184 | Donald Trump |
10 | NaN | US | Dan Vacek | US | 0.9951 | 13546.0 | 127592176.0 | 0.000106 | Donald Trump |
11 | NaN | US | Alyson Kennedy | US | 0.9951 | 11456.0 | 127592176.0 | 0.000090 | Donald Trump |
12 | NaN | US | Mike Smith | US | 0.9951 | 8911.0 | 127592176.0 | 0.000070 | Donald Trump |
13 | NaN | US | Chris Keniston | US | 0.9951 | 6617.0 | 127592176.0 | 0.000052 | Donald Trump |
14 | NaN | US | Lynn Kahn | US | 0.9951 | 5565.0 | 127592176.0 | 0.000044 | Donald Trump |
This data starts with the entire US aggregate data, but then breaks down by state and county, as well as candidate. Let's include the top 10 candidates. To grab their names:
Year | Month | State | County | Rate | min_wage | |
0 | 2015 | February | Mississippi | Newton County | 6.1 | NaN |
1 | 2015 | February | Mississippi | Panola County | 9.4 | NaN |
2 | 2015 | February | Mississippi | Monroe County | 7.9 | NaN |
3 | 2015 | February | Mississippi | Hinds County | 6.1 | NaN |
4 | 2015 | February | Mississippi | Kemper County | 10.6 | NaN |
Now, for county_2015
, we'd like to convert the State
to the all-caps abbreviation that our pres16
is using. We can do that using our abbreviations that we used before:
Postal Code | |
State/District | |
Alabama | AL |
Alaska | AK |
Arizona | AZ |
Arkansas | AR |
California | CA |
Year | Month | State | County | Rate | min_wage | |
2797 | 2015 | February | ME | Somerset County | 8.4 | 7.92 |
2798 | 2015 | February | ME | Oxford County | 6.8 | 7.92 |
2799 | 2015 | February | ME | Knox County | 6.1 | 7.92 |
2800 | 2015 | February | ME | Piscataquis County | 7.0 | 7.92 |
2801 | 2015 | February | ME | Aroostook County | 7.2 | 7.92 |
Well that was magical, wasn't it?!
In the case of singe-parmeter functions, we can just use a .map
. Or...as you just saw here, if you want to map a key to a value using a dict, you can do the same thing, and just say you want to map the dictionary. Cool, huh?
Now let's map the county's candidate percentages to this. To do this, we have quite a few columns, and really, everything just needs to match up by county and state.
Since pres16
is longer, we'll map that to county_15
, where there are matches. Instead of a map, however, we'll combine with a join
. To do this, let's index both of these. They are indexed by state
AND county
. So, we'll name these both the same, and then index as such.
County | fips | cand | State | pct_report | votes | total_votes | pct | lead | |
0 | NaN | US | Donald Trump | US | 0.9951 | 60350241.0 | 127592176.0 | 0.472993 | Donald Trump |
1 | NaN | US | Hillary Clinton | US | 0.9951 | 60981118.0 | 127592176.0 | 0.477938 | Donald Trump |
2 | NaN | US | Gary Johnson | US | 0.9951 | 4164589.0 | 127592176.0 | 0.032640 | Donald Trump |
3 | NaN | US | Jill Stein | US | 0.9951 | 1255968.0 | 127592176.0 | 0.009844 | Donald Trump |
4 | NaN | US | Evan McMullin | US | 0.9951 | 451636.0 | 127592176.0 | 0.003540 | Donald Trump |
pct | ||
County | State | |
NaN | US | 0.472993 |
CA | 0.330641 |
Year | Month | Rate | min_wage | ||
County | State | ||||
Newton County | MS | 2015 | February | 6.1 | NaN |
Panola County | MS | 2015 | February | 9.4 | NaN |
Month | Rate | min_wage | pct | ||
County | State | ||||
Major County | OK | February | 2.6 | 2.11 | 0.864960 |
Pottawatomie County | OK | February | 4.5 | 2.11 | 0.701342 |
Johnston County | OK | February | 6.5 | 2.11 | 0.770057 |
Jefferson County | OK | February | 5.0 | 2.11 | 0.812367 |
Beaver County | OK | February | 2.8 | 2.11 | 0.888243 |
Rate | min_wage | pct | |
Rate | 1.000000 | 0.186689 | -0.085985 |
min_wage | 0.186689 | 1.000000 | -0.325036 |
pct | -0.085985 | -0.325036 | 1.000000 |
Rate | min_wage | pct | |
Rate | 5.743199 | 0.683870 | -0.031771 |
min_wage | 0.683870 | 2.336451 | -0.076602 |
pct | -0.031771 | -0.076602 | 0.023772 |
Curiously, min_wage appears to have a negative correlation with the pct vote for Trump, so as minimum wage rises, people are less likely to vote for Trump, for example. That makes sense since higher minimum wages are typically a Democratic agenda, though it is fairly curious that, while correlated, they do not attempt to vary together. Interesting.
Alright, I think that, at this point, we've covered quite a bit about Pandas. There will always be more and more to show, but, for that, I would suggest you just read through the docs and begin to just try doing things yourself.
The next thing that I would like to show is using Pandas as your pre-processing software for data for machine learning, which is what we're going to be doing in the next tutorial!
Last updated