Combining multiple datasets - Data Analysis with Python 3 and Pandas

Hello and welcome to part 5 of the data analysis with Python and Pandas series. In this tutorial, we're going to build on the last one and experiment with combining datasets to see if we can find more relationships.

We've been working with the Kaggle Minimum Wage by State dataset, and now I am curious to bring in the Unemployment by County dataset. Sometimes, it's argued that increasing minimum wage makes it so that employing people is more costly, and, as a result, employment may drop.

import pandas as pd

unemp_county = pd.read_csv("datasets/unemployment-by-county-us/output.csv")
unemp_county.head()

   Year     Month        State         County  Rate
0  2015  February  Mississippi  Newton County   6.1
1  2015  February  Mississippi  Panola County   9.4
2  2015  February  Mississippi  Monroe County   7.9
3  2015  February  Mississippi   Hinds County   6.1
4  2015  February  Mississippi  Kemper County  10.6

Now, we'd want to map the minimum wage by state to this. In many states, there are varying minimum wages by city. We'll just have to accept that this isn't going to be perfect. Let's now load in our minimum wage data:

df = pd.read_csv("datasets/minwage.csv")

act_min_wage = pd.DataFrame()

for name, group in df.groupby("State"):
    if act_min_wage.empty:
        act_min_wage = group.set_index("Year")[["Low.2018"]].rename(columns={"Low.2018":name})
    else:
        act_min_wage = act_min_wage.join(group.set_index("Year")[["Low.2018"]].rename(columns={"Low.2018":name}))

act_min_wage.head()

      Alabama  Alaska  Arizona  Arkansas  California  Colorado  Connecticut  Delaware  District of Columbia  Federal (FLSA)  ...  Tennessee  Texas  U.S. Virgin Islands  Utah  Vermont  Virginia  Washington  West Virginia  Wisconsin  Wyoming
Year
1968      0.0   15.12     3.37      1.12       11.88      7.20        10.08      9.00                  9.00            8.28  ...        0.0   0.00                  NaN  7.20    10.08       0.0       11.52           7.20       9.00     8.64
1969      0.0   14.33     3.19      1.07       11.26      6.83         9.56      8.53                  8.53            7.85  ...        0.0   0.00                  NaN  6.83     9.56       0.0       10.92           6.83       8.53     8.19
1970      0.0   13.54     3.02      7.09       10.64      6.45        10.32      8.06                 10.32            8.38  ...        0.0   0.00                  NaN  6.45    10.32       0.0       10.32           6.45       8.38     8.38
1971      0.0   12.99     2.89      6.80       10.20      6.18         9.89      7.73                  9.89            8.04  ...        0.0   0.00                  NaN  6.18     9.89       0.0        9.89           6.18       8.04     8.04
1972      0.0   12.57     2.80      7.19        9.88      5.99        11.08      9.58                  9.58            9.58  ...        0.0   8.38                  NaN  7.19     9.58       0.0        9.58           7.19       8.68     8.98

5 rows × 55 columns
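As an aside, the groupby loop above is a fairly manual way to build this wide table. Assuming the raw data has exactly one row per state per year (which this dataset does; pivot will raise an error on duplicates), pandas can do the same reshape in one line. A minimal sketch:

# One-line equivalent of the groupby/join loop above
act_min_wage = df.pivot(index="Year", columns="State", values="Low.2018")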

Oh right, missing data. Let's throw this out from our analysis:

import numpy as np

act_min_wage = act_min_wage.replace(0, np.NaN).dropna(axis=1)
act_min_wage.head()

      Alaska  Arkansas  California  Colorado  Connecticut  Delaware  District of Columbia  Federal (FLSA)   Guam  Hawaii  ...  Pennsylvania  Puerto Rico  Rhode Island  South Dakota  Utah  Vermont  Washington  West Virginia  Wisconsin  Wyoming
Year
1968   15.12      1.12       11.88      7.20        10.08      9.00                  9.00            8.28   9.00    9.00  ...          8.28         3.10         10.08          3.06  7.20    10.08       11.52           7.20       9.00     8.64
1969   14.33      1.07       11.26      6.83         9.56      8.53                  8.53            7.85   8.53    8.53  ...          7.85         2.94          9.56          2.90  6.83     9.56       10.92           6.83       8.53     8.19
1970   13.54      7.09       10.64      6.45        10.32      8.06                 10.32            8.38  10.32   10.32  ...          8.38         2.77         10.32          6.45  6.45    10.32       10.32           6.45       8.38     8.38
1971   12.99      6.80       10.20      6.18         9.89      7.73                  9.89            8.04   9.89    9.89  ...          8.04         2.66          9.89          6.18  6.18     9.89        9.89           6.18       8.04     8.04
1972   12.57      7.19        9.88      5.99        11.08      9.58                  9.58            9.58  11.38    9.58  ...          9.58         3.89          9.58          5.99  7.19     9.58        9.58           7.19       8.68     8.98

5 rows × 39 columns
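If you're curious exactly which states and territories we just dropped, you can compare the column sets. A quick check (a sketch; df here is still the raw minimum wage dataframe from above):

# Which states/territories did dropna(axis=1) remove?
dropped = set(df["State"].unique()) - set(act_min_wage.columns)
print(dropped)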

So now...what do we do? This seems like it might be a bit more complex than the other things we've done.

Well, our very end goal is to see if there's any relationship between unemployment and the minimum wage. To do this, we'd like to be able to just call .corr() or .cov() on some dataframe. It seems most convenient to do that on our unemp_county dataframe. We'd really just like to add a new column, min_wage, and then populate it based on each row's state and year!

Logically, we're going to need to go row by row, check the state and year, and set the minimum wage column to that state's value in the other dataframe. It's often going to be the case that you have some custom task like this. Here, we're mapping values from one dataframe to another, but another time the source might not be a dataframe at all: it could be a sensor value, some sort of custom user input, or something that requires further calculation. Who knows, but Pandas is extremely flexible, and we can map functions to columns based on row values. Let's see how!

First, let's just create a function that would handle this:

def get_min_wage(year, state):
    try:
        return act_min_wage.loc[year][state]
    except KeyError:
        # no minimum wage data for this year/state combination
        return np.NaN

Then for example we could do:

get_min_wage(2012, "Colorado")
8.33

Now, we map!

%%time
# %%time reports the total time taken to run this cell's operations.

unemp_county['min_wage'] = list(map(get_min_wage, unemp_county['Year'], unemp_county['State']))
CPU times: user 1min 27s, sys: 126 ms, total: 1min 27s
Wall time: 1min 27s

We can use this method to map just about any function, with as many parameters as we want, to a column. It will basically always work, but it won't necessarily be the most efficient. Often we can instead use .map or .apply on a column, or some other built-in method, but the above is always an option.
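For instance, if this row-by-row mapping is too slow, one faster option is to reshape act_min_wage back to long form and merge it in, letting pandas do the work in vectorized code rather than calling a Python function once per row. A sketch of that approach (untimed here, and it rebuilds the min_wage column we just created):

# Melt the wide Year-by-State table back to long form...
long_wage = act_min_wage.reset_index().melt(
    id_vars="Year", var_name="State", value_name="min_wage")

# ...then merge on Year and State instead of mapping row by row.
unemp_county = unemp_county.drop(columns="min_wage").merge(
    long_wage, on=["Year", "State"], how="left")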

unemp_county.head()

   Year     Month        State         County  Rate  min_wage
0  2015  February  Mississippi  Newton County   6.1       NaN
1  2015  February  Mississippi  Panola County   9.4       NaN
2  2015  February  Mississippi  Monroe County   7.9       NaN
3  2015  February  Mississippi   Hinds County   6.1       NaN
4  2015  February  Mississippi  Kemper County  10.6       NaN

unemp_county.tail()

        Year     Month  State              County  Rate  min_wage
885543  2009  November  Maine     Somerset County  10.5      8.46
885544  2009  November  Maine       Oxford County  10.5      8.46
885545  2009  November  Maine         Knox County   7.5      8.46
885546  2009  November  Maine  Piscataquis County  11.3      8.46
885547  2009  November  Maine    Aroostook County   9.0      8.46

unemp_county[['Rate','min_wage']].corr()

              Rate  min_wage
Rate      1.000000  0.153047
min_wage  0.153047  1.000000

unemp_county[['Rate','min_wage']].cov()

              Rate  min_wage
Rate      9.687873  0.651586
min_wage  0.651586  1.874228
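As a quick sanity check on how these two tables relate: correlation is just covariance scaled by both standard deviations, so we can recover the correlation by hand from the covariance table (the variances sit on its diagonal):

import math

# corr = cov(Rate, min_wage) / (std(Rate) * std(min_wage))
print(0.651586 / math.sqrt(9.687873 * 1.874228))  # ~0.153, matching .corr()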

Interesting. It looks like there's a slightly positive correlation between the unemployment rate and minimum wage, and a positive covariance, signaling to us that these two things do tend to vary together. But a correlation of about 0.15 is weak, so while they vary in the same direction, the actual impact of one on the other doesn't look very substantial. Plus, we'd have to ask next which comes first: the increased unemployment, or the minimum wage increases.

Finally, I'd like to look at election data by county and see if there's a relationship between voting, minimum wage, and unemployment. To do this, I will pull in the 2016 US presidential vote by county.

pres16 = pd.read_csv("datasets/pres16results.csv")
pres16.head(15)

   county fips                      cand  st  pct_report       votes  total_votes       pct          lead
0     NaN   US              Donald Trump  US      0.9951  60350241.0  127592176.0  0.472993  Donald Trump
1     NaN   US           Hillary Clinton  US      0.9951  60981118.0  127592176.0  0.477938  Donald Trump
2     NaN   US              Gary Johnson  US      0.9951   4164589.0  127592176.0  0.032640  Donald Trump
3     NaN   US                Jill Stein  US      0.9951   1255968.0  127592176.0  0.009844  Donald Trump
4     NaN   US             Evan McMullin  US      0.9951    451636.0  127592176.0  0.003540  Donald Trump
5     NaN   US            Darrell Castle  US      0.9951    180877.0  127592176.0  0.001418  Donald Trump
6     NaN   US            Gloria La Riva  US      0.9951     48308.0  127592176.0  0.000379  Donald Trump
7     NaN   US        Rocky De La Fuente  US      0.9951     32120.0  127592176.0  0.000252  Donald Trump
8     NaN   US  None of these candidates  US      0.9951     28824.0  127592176.0  0.000226  Donald Trump
9     NaN   US            Richard Duncan  US      0.9951     23501.0  127592176.0  0.000184  Donald Trump
10    NaN   US                 Dan Vacek  US      0.9951     13546.0  127592176.0  0.000106  Donald Trump
11    NaN   US            Alyson Kennedy  US      0.9951     11456.0  127592176.0  0.000090  Donald Trump
12    NaN   US                Mike Smith  US      0.9951      8911.0  127592176.0  0.000070  Donald Trump
13    NaN   US            Chris Keniston  US      0.9951      6617.0  127592176.0  0.000052  Donald Trump
14    NaN   US                 Lynn Kahn  US      0.9951      5565.0  127592176.0  0.000044  Donald Trump

This dataset starts with the aggregate US results, then breaks down by state and county, as well as by candidate. Let's include the top 10 candidates. To grab their names:

top_candidates = pres16.head(10)['cand'].values
print(top_candidates)
['Donald Trump' 'Hillary Clinton' 'Gary Johnson' 'Jill Stein'
 'Evan McMullin' 'Darrell Castle' 'Gloria La Riva' 'Rocky De La Fuente'
 ' None of these candidates' 'Richard Duncan']
# Note: Python's `and` raises an error here ("truth value of a Series is
# ambiguous"), because it can't operate elementwise on two Series:
# county_2015 = unemp_county[ (unemp_county['Year']==2015 and unemp_county["Month"]=="February") ]

# Use the elementwise `&` operator, with parentheses around each condition:
county_2015 = unemp_county[(unemp_county['Year']==2015) & (unemp_county["Month"]=="February")]
county_2015.head()

   Year     Month        State         County  Rate  min_wage
0  2015  February  Mississippi  Newton County   6.1       NaN
1  2015  February  Mississippi  Panola County   9.4       NaN
2  2015  February  Mississippi  Monroe County   7.9       NaN
3  2015  February  Mississippi   Hinds County   6.1       NaN
4  2015  February  Mississippi  Kemper County  10.6       NaN

Now, for county_2015, we'd like to convert the State to the all-caps abbreviation that pres16 is using. We can do that with the state abbreviations dataset that we used before:

state_abbv = pd.read_csv("datasets/state_abbv.csv", index_col=0)
state_abbv.head()

               Postal Code
State/District
Alabama                 AL
Alaska                  AK
Arizona                 AZ
Arkansas                AR
California              CA

state_abbv_dict = state_abbv.to_dict()['Postal Code']
county_2015['State'] = county_2015['State'].map(state_abbv_dict)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
county_2015.tail()

      Year     Month State              County  Rate  min_wage
2797  2015  February    ME     Somerset County   8.4      7.92
2798  2015  February    ME       Oxford County   6.8      7.92
2799  2015  February    ME         Knox County   6.1      7.92
2800  2015  February    ME  Piscataquis County   7.0      7.92
2801  2015  February    ME    Aroostook County   7.2      7.92

Well that was magical, wasn't it?!

In the case of single-parameter functions, we can just use .map. Or, as you just saw here, if you want to map keys to values using a dict, you can do the same thing and simply pass .map the dictionary. Cool, huh?
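To make the single-parameter case concrete, .map accepts any one-argument function, not just dicts. A toy example (not something we actually need here):

# len is a one-argument function, so it can be mapped directly
name_lengths = county_2015['County'].map(len)
name_lengths.head()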

Now let's map the county's candidate percentages to this. To do this, we have quite a few columns, and really, everything just needs to match up by county and state.

print(len(county_2015))
print(len(pres16))
2802
18475

Since pres16 is longer, we'll map it onto county_2015 where there are matches. Instead of a map, however, we'll combine them with a join. To do this, let's index both dataframes by state and county. We'll name the columns the same in both, and then set the index accordingly.

pres16.rename(columns={"county": "County", "st": "State"}, inplace=True)
pres16.head()

  County fips             cand State  pct_report       votes  total_votes       pct          lead
0    NaN   US     Donald Trump    US      0.9951  60350241.0  127592176.0  0.472993  Donald Trump
1    NaN   US  Hillary Clinton    US      0.9951  60981118.0  127592176.0  0.477938  Donald Trump
2    NaN   US     Gary Johnson    US      0.9951   4164589.0  127592176.0  0.032640  Donald Trump
3    NaN   US       Jill Stein    US      0.9951   1255968.0  127592176.0  0.009844  Donald Trump
4    NaN   US    Evan McMullin    US      0.9951    451636.0  127592176.0  0.003540  Donald Trump

for df in [county_2015, pres16]:
    df.set_index(["County", "State"], inplace=True)

# keep only Donald Trump's share of the vote, as a single pct column
pres16 = pres16[pres16['cand']=="Donald Trump"]
pres16 = pres16[['pct']]
pres16.dropna(inplace=True)
pres16.head(2)

                   pct
County State
NaN    US     0.472993
       CA     0.330641

county_2015.head(2)

                     Year     Month  Rate  min_wage
County        State
Newton County MS     2015  February   6.1       NaN
Panola County MS     2015  February   9.4       NaN

all_together = county_2015.merge(pres16, on=["County", "State"])
all_together.dropna(inplace=True)
all_together.drop("Year", axis=1, inplace=True)
all_together.head()

                              Month  Rate  min_wage       pct
County              State
Major County        OK     February   2.6      2.11  0.864960
Pottawatomie County OK     February   4.5      2.11  0.701342
Johnston County     OK     February   6.5      2.11  0.770057
Jefferson County    OK     February   5.0      2.11  0.812367
Beaver County       OK     February   2.8      2.11  0.888243
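As an aside, since both dataframes share the same (County, State) index at this point, a join would have worked just as well as the merge. A sketch of the equivalent call:

# Equivalent to the merge above: an inner join on the shared index
all_together_alt = county_2015.join(pres16, how="inner")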

all_together.corr()

              Rate  min_wage       pct
Rate      1.000000  0.186689 -0.085985
min_wage  0.186689  1.000000 -0.325036
pct      -0.085985 -0.325036  1.000000

all_together.cov()

              Rate  min_wage       pct
Rate      5.743199  0.683870 -0.031771
min_wage  0.683870  2.336451 -0.076602
pct      -0.031771 -0.076602  0.023772

Curiously, min_wage appears to have a negative correlation with the pct vote for Trump, so as minimum wage rises, people are less likely to vote for Trump, for example. That makes sense, since higher minimum wages are typically a Democratic agenda item. The covariance is negative too, but small in magnitude, which mostly reflects how little the pct column varies in the first place (its variance is only about 0.024). Interesting.

Alright, I think that, at this point, we've covered quite a bit about Pandas. There will always be more to show, but, for that, I would suggest you read through the docs and begin trying things yourself.

The next thing that I would like to show is using Pandas as your pre-processing software for data for machine learning, which is what we're going to be doing in the next tutorial!
