Visualizing Correlation Table - Data Analysis with Python 3 and Pandas

Hello and welcome to part 4 of the data analysis with Python and Pandas series. We're going to be continuing our work with the minimum wage dataset and our correlation table. Where we left off:

import pandas as pd
import numpy as np

df = pd.read_csv("datasets/minwage.csv")

act_min_wage = pd.DataFrame()

for name, group in df.groupby("State"):
    if act_min_wage.empty:
        act_min_wage = group.set_index("Year")[["Low.2018"]].rename(columns={"Low.2018":name})
    else:
        act_min_wage = act_min_wage.join(group.set_index("Year")[["Low.2018"]].rename(columns={"Low.2018":name}))

act_min_wage.head()

min_wage_corr = act_min_wage.replace(0, np.nan).dropna(axis=1).corr()

min_wage_corr.head()

               Alaska  Arkansas  California  Colorado  Connecticut  ...  Wisconsin   Wyoming
Alaska       1.000000 -0.377934    0.717653 -0.129332     0.680886  ...   0.702570  0.727932
Arkansas    -0.377934  1.000000   -0.234367  0.135749     0.047580  ...   0.000470 -0.250592
California   0.717653 -0.234367    1.000000  0.483313     0.876215  ...   0.584067  0.722617
Colorado    -0.129332  0.135749    0.483313  1.000000     0.402020  ...   0.011501  0.130053
Connecticut  0.680886  0.047580    0.876215  0.402020     1.000000  ...   0.585790  0.814971

[5 rows x 39 columns]
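As a quick refresher on that replace(...).dropna(axis=1) chain from the recap, here's what it does to a toy frame (hypothetical data, just for illustration):

```python
import pandas as pd
import numpy as np

# Column "B" contains a 0; replacing 0 with NaN and then dropping any
# column that contains a NaN (axis=1) removes "B" entirely.
toy = pd.DataFrame({"A": [5.0, 6.0], "B": [0.0, 7.0]})
cleaned = toy.replace(0, np.nan).dropna(axis=1)
```

So states that reported a 0 minimum wage anywhere get dropped as whole columns before we correlate.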

Now, we can graph this with matplotlib. If you do not have it, you need to do a pip install matplotlib. Matplotlib has a nifty graphing function called matshow that we can use:

import matplotlib.pyplot as plt

plt.matshow(min_wage_corr)
plt.show()

It wouldn't be Matplotlib, however, if we didn't need to do some customization.

Again, I will just do the customization. If you would like to learn more about Matplotlib, check out the data visualization series.

import matplotlib.pyplot as plt

labels = [c[:2] for c in min_wage_corr.columns]  # get abbv state names.

fig = plt.figure(figsize=(12,12))  # figure so we can add axis
ax = fig.add_subplot(111)  # define axis, so we can modify
ax.matshow(min_wage_corr, cmap=plt.cm.RdYlGn)  # display the matrix
ax.set_xticks(np.arange(len(labels)))  # show them all!
ax.set_yticks(np.arange(len(labels)))  # show them all!
ax.set_xticklabels(labels)  # set to be the abbv (vs useless #)
ax.set_yticklabels(labels)  # set to be the abbv (vs useless #)

plt.show()

Our simple abbreviations aren't cutting it. We need something better. A quick Google search turned up https://www.infoplease.com/state-abbreviations-and-state-postal-codes, which contains a table.

Guess what can read tables from the internet? Pandas can! You can use pd.read_html(URL), and pandas will search the page for any tables and use them to populate a list of dataframes. Just remember, pd.read_html returns a list of dfs, not just one df.

# pip install lxml html5lib bs4
# on mac, if you hit SSL errors, run: /Applications/Python\ 3.7/Install\ Certificates.command


# if passing the URL straight to pd.read_html fails for you, you can fetch
# the page yourself with requests and hand pandas the HTML text instead:
import requests

web = requests.get("https://www.infoplease.com/state-abbreviations-and-state-postal-codes")
dfs = pd.read_html(web.text)


import pandas as pd

# https://www.infoplease.com/state-abbreviations-and-state-postal-codes

dfs = pd.read_html("https://www.infoplease.com/state-abbreviations-and-state-postal-codes")
for df in dfs:
    print(df.head())  # one is states, the other territory
  State/District Abbreviation Postal Code
0        Alabama         Ala.          AL
1         Alaska       Alaska          AK
2        Arizona        Ariz.          AZ
3       Arkansas         Ark.          AR
4     California       Calif.          CA
  Territory/Associate Abbreviation Postal Code
0      American Samoa          NaN          AS
1                Guam         Guam          GU
2    Marshall Islands          NaN          MH
3          Micronesia          NaN          FM
4   Northern Marianas          NaN          MP
state_abbv = dfs[0]

state_abbv.head()

  State/District Abbreviation Postal Code
0        Alabama         Ala.          AL
1         Alaska       Alaska          AK
2        Arizona        Ariz.          AZ
3       Arkansas         Ark.          AR
4     California       Calif.          CA

Sources often disable scraping access, or disappear entirely, so I want to save this dataframe both for myself and to share with you all, in case they stop allowing robot access! Saving a dataframe in pandas is easy:

state_abbv.to_csv("datasets/state_abbv.csv")

Bring back:

state_abbv = pd.read_csv("datasets/state_abbv.csv")
state_abbv.head()

   Unnamed: 0 State/District Abbreviation Postal Code
0           0        Alabama         Ala.          AL
1           1         Alaska       Alaska          AK
2           2        Arizona        Ariz.          AZ
3           3       Arkansas         Ark.          AR
4           4     California       Calif.          CA

So what happened? Well, we saved and loaded with the "index," which has created duplication. A CSV file has no concept of an index, so pandas by default just loads in all of the data as columns and assigns a new index. We have options: we can save with no index, we can save only specific columns, and we can specify an index column on load. For example, this time, let's save just the specific columns we're after:

state_abbv[["State/District", "Postal Code"]].to_csv("datasets/state_abbv.csv", index=False)  # index in this case is worthless

Then, we can do:

state_abbv = pd.read_csv("datasets/state_abbv.csv", index_col=0)
state_abbv.head()

               Postal Code
State/District
Alabama                 AL
Alaska                  AK
Arizona                 AZ
Arkansas                AR
California              CA

Any time you're unsure about what you can do, you should check out the Pandas Docs. They are great to scroll through, both to learn more about what you can do and to discover parameters and methods you might not otherwise realize exist.

For example, while we're mainly working with CSVs here, we can work with many other formats: SQL, JSON, HDF5, BigQuery, and much, much more!
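As a quick sketch of one of those alternatives, here's a JSON round trip with a made-up two-row frame (hypothetical data, just to illustrate the idea):

```python
import io
import pandas as pd

df = pd.DataFrame({"state": ["Alabama", "Alaska"], "code": ["AL", "AK"]})

js = df.to_json()                     # serialize the frame to a JSON string
back = pd.read_json(io.StringIO(js))  # ...and read it right back into a frame
```

The same save/load symmetry (to_csv/read_csv, to_json/read_json, to_sql/read_sql, ...) shows up across most of the formats pandas supports.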

Back to the task at hand, we are trying to use the Postal Codes for our abbreviations:

abbv_dict = state_abbv.to_dict()
abbv_dict
{'Postal Code': {'Alabama': 'AL',
  'Alaska': 'AK',
  'Arizona': 'AZ',
  'Arkansas': 'AR',
  'California': 'CA',
  'Colorado': 'CO',
  'Connecticut': 'CT',
  'Delaware': 'DE',
  'District of Columbia': 'DC',
  'Florida': 'FL',
  'Georgia': 'GA',
  'Hawaii': 'HI',
  'Idaho': 'ID',
  'Illinois': 'IL',
  'Indiana': 'IN',
  'Iowa': 'IA',
  'Kansas': 'KS',
  'Kentucky': 'KY',
  'Louisiana': 'LA',
  'Maine': 'ME',
  'Maryland': 'MD',
  'Massachusetts': 'MA',
  'Michigan': 'MI',
  'Minnesota': 'MN',
  'Mississippi': 'MS',
  'Missouri': 'MO',
  'Montana': 'MT',
  'Nebraska': 'NE',
  'Nevada': 'NV',
  'New Hampshire': 'NH',
  'New Jersey': 'NJ',
  'New Mexico': 'NM',
  'New York': 'NY',
  'North Carolina': 'NC',
  'North Dakota': 'ND',
  'Ohio': 'OH',
  'Oklahoma': 'OK',
  'Oregon': 'OR',
  'Pennsylvania': 'PA',
  'Rhode Island': 'RI',
  'South Carolina': 'SC',
  'South Dakota': 'SD',
  'Tennessee': 'TN',
  'Texas': 'TX',
  'Utah': 'UT',
  'Vermont': 'VT',
  'Virginia': 'VA',
  'Washington': 'WA',
  'West Virginia': 'WV',
  'Wisconsin': 'WI',
  'Wyoming': 'WY'}}

We can see here that it's a dict that maps to the dict we actually want, so we can reference the codes with:

abbv_dict = abbv_dict['Postal Code']
abbv_dict
{'Alabama': 'AL',
 'Alaska': 'AK',
 'Arizona': 'AZ',
 'Arkansas': 'AR',
 'California': 'CA',
 'Colorado': 'CO',
 'Connecticut': 'CT',
 'Delaware': 'DE',
 'District of Columbia': 'DC',
 'Florida': 'FL',
 'Georgia': 'GA',
 'Hawaii': 'HI',
 'Idaho': 'ID',
 'Illinois': 'IL',
 'Indiana': 'IN',
 'Iowa': 'IA',
 'Kansas': 'KS',
 'Kentucky': 'KY',
 'Louisiana': 'LA',
 'Maine': 'ME',
 'Maryland': 'MD',
 'Massachusetts': 'MA',
 'Michigan': 'MI',
 'Minnesota': 'MN',
 'Mississippi': 'MS',
 'Missouri': 'MO',
 'Montana': 'MT',
 'Nebraska': 'NE',
 'Nevada': 'NV',
 'New Hampshire': 'NH',
 'New Jersey': 'NJ',
 'New Mexico': 'NM',
 'New York': 'NY',
 'North Carolina': 'NC',
 'North Dakota': 'ND',
 'Ohio': 'OH',
 'Oklahoma': 'OK',
 'Oregon': 'OR',
 'Pennsylvania': 'PA',
 'Rhode Island': 'RI',
 'South Carolina': 'SC',
 'South Dakota': 'SD',
 'Tennessee': 'TN',
 'Texas': 'TX',
 'Utah': 'UT',
 'Vermont': 'VT',
 'Virginia': 'VA',
 'Washington': 'WA',
 'West Virginia': 'WV',
 'Wisconsin': 'WI',
 'Wyoming': 'WY'}
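As an aside, you can skip that nested layer entirely by calling to_dict on the column (a Series) rather than on the whole DataFrame. A small sketch with a made-up two-row version of the table:

```python
import pandas as pd

df = pd.DataFrame({"Postal Code": ["AL", "AK"]}, index=["Alabama", "Alaska"])

nested = df.to_dict()               # DataFrame -> dict of dicts, keyed by column
flat = df["Postal Code"].to_dict()  # Series -> flat {index: value} dict
```

Either way works; grabbing the inner dict after the fact, as we did above, gets you to the same place.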

Now we can re-do our labels with:

labels = [abbv_dict[c] for c in min_wage_corr.columns]  # get abbv state names.
-----------------------------------------------------------------
KeyError                        Traceback (most recent call last)
<ipython-input-18-d53be929efcb> in <module>()
----> 1 labels = [abbv_dict[c] for c in min_wage_corr.columns]  # get abbv state names.

<ipython-input-18-d53be929efcb> in <listcomp>(.0)
----> 1 labels = [abbv_dict[c] for c in min_wage_corr.columns]  # get abbv state names.

KeyError: 'Federal (FLSA)'

Okay. Fine, we have to hack this one in ourselves!

abbv_dict['Federal (FLSA)'] = "FLSA"
labels = [abbv_dict[c] for c in min_wage_corr.columns]  # get abbv state names.
-----------------------------------------------------------------
KeyError                        Traceback (most recent call last)
<ipython-input-20-d53be929efcb> in <module>()
----> 1 labels = [abbv_dict[c] for c in min_wage_corr.columns]  # get abbv state names.

<ipython-input-20-d53be929efcb> in <listcomp>(.0)
----> 1 labels = [abbv_dict[c] for c in min_wage_corr.columns]  # get abbv state names.

KeyError: 'Guam'

Hmm, we might have to revisit the territories, but:

abbv_dict['Guam'] = "GU"
abbv_dict['Puerto Rico'] = "PR"
labels = [abbv_dict[c] for c in min_wage_corr.columns]  # get abbv state names.
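If you'd rather not play whack-a-mole with KeyErrors, an alternative sketch (not what we did above) is dict.get with a fallback, here just truncating any unmapped name (the dict and columns below are made up for illustration):

```python
abbv = {"Alaska": "AK", "Guam": "GU"}
cols = ["Alaska", "Guam", "Federal (FLSA)"]

# .get() returns the fallback instead of raising KeyError on a missing key
labels = [abbv.get(c, c[:4]) for c in cols]
```

The trade-off is that failures become silent, so the explicit patching above is arguably better when you want to know exactly which names didn't map.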

Okay good enough! Back to our graph now!

fig = plt.figure(figsize=(12,12))  # figure so we can add axis
ax = fig.add_subplot(111)  # define axis, so we can modify
ax.matshow(min_wage_corr, cmap=plt.cm.RdYlGn)  # display the matrix
ax.set_xticks(np.arange(len(labels)))  # show them all!
ax.set_yticks(np.arange(len(labels)))  # show them all!
ax.set_xticklabels(labels)  # set to be the abbv (vs useless #)
ax.set_yticklabels(labels)  # set to be the abbv (vs useless #)

plt.show()

Cool! We've covered quite a bit again, but hopefully that was interesting, and we began to combine datasets, if only to inform our column names.

... but more cool things happen when we combine datasets for their actual data! While correlation is not causation, we can still glean interesting things from it! Plus... we get to make cool graphs, so why not? In the next tutorial, we're going to explore the relationship of minimum wage to unemployment, and maybe even toss in the political persuasions of those states while we're at it!
