# Visualizing Correlation Table - Data Analysis with Python 3 and Pandas

Hello and welcome to part 4 of the Data Analysis with Python and Pandas series. We're going to continue working with the minimum wage dataset and our correlation table. Here's where we left off:

```python
import pandas as pd
import numpy as np

df = pd.read_csv("datasets/minwage.csv")

act_min_wage = pd.DataFrame()

for name, group in df.groupby("State"):
    if act_min_wage.empty:
        act_min_wage = group.set_index("Year")[["Low.2018"]].rename(columns={"Low.2018":name})
    else:
        act_min_wage = act_min_wage.join(group.set_index("Year")[["Low.2018"]].rename(columns={"Low.2018":name}))

act_min_wage.head()

min_wage_corr = act_min_wage.replace(0, np.nan).dropna(axis=1).corr()  # drop states with missing data, then correlate

min_wage_corr.head()
```
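As an aside, the entire groupby/join loop above can be written as a single `pivot`; a minimal sketch, with a few made-up rows standing in for the real dataset:

```python
import pandas as pd

# Toy stand-in for the minimum wage data (same State/Year/Low.2018 shape):
df = pd.DataFrame({
    "State": ["Alaska", "Alaska", "Colorado", "Colorado"],
    "Year": [1968, 1969, 1968, 1969],
    "Low.2018": [15.12, 14.63, 7.24, 7.01],
})

# One pivot replaces the whole loop: Year becomes the index,
# each State becomes a column, and Low.2018 fills the cells.
act_min_wage = df.pivot(index="Year", columns="State", values="Low.2018")
print(act_min_wage)
```

Either approach gives the same shape; the loop just makes the mechanics more explicit.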

|             | Alaska    | Arkansas  | California | Colorado  | Connecticut | Delaware | District of Columbia | Federal (FLSA) | Guam      | Hawaii   | ... | Pennsylvania | Puerto Rico | Rhode Island | South Dakota | Utah      | Vermont  | Washington | West Virginia | Wisconsin | Wyoming   |
| ----------- | --------- | --------- | ---------- | --------- | ----------- | -------- | -------------------- | -------------- | --------- | -------- | --- | ------------ | ----------- | ------------ | ------------ | --------- | -------- | ---------- | ------------- | --------- | --------- |
| Alaska      | 1.000000  | -0.377934 | 0.717653   | -0.129332 | 0.680886    | 0.258997 | 0.321785             | 0.637679       | 0.787915  | 0.672620 | ... | 0.610814     | -0.038118   | 0.652353     | -0.326316    | -0.020462 | 0.663558 | 0.411593   | 0.044814      | 0.702570  | 0.727932  |
| Arkansas    | -0.377934 | 1.000000  | -0.234367  | 0.135749  | 0.047580    | 0.016125 | 0.266889             | 0.117245       | 0.039593  | 0.204801 | ... | 0.159923     | 0.232186    | 0.003498     | 0.800116     | 0.194680  | 0.087429 | -0.072343  | 0.420819      | 0.000470  | -0.250592 |
| California  | 0.717653  | -0.234367 | 1.000000   | 0.483313  | 0.876215    | 0.479197 | 0.596865             | 0.371966       | 0.492052  | 0.519241 | ... | 0.429061     | 0.512712    | 0.780916     | -0.036787    | 0.392898  | 0.877922 | 0.754085   | 0.371765      | 0.584067  | 0.722617  |
| Colorado    | -0.129332 | 0.135749  | 0.483313   | 1.000000  | 0.402020    | 0.566304 | 0.673371             | -0.232035      | -0.192616 | 0.069800 | ... | -0.136195    | 0.657364    | 0.429852     | 0.399137     | 0.622330  | 0.448485 | 0.612637   | 0.533623      | 0.011501  | 0.130053  |
| Connecticut | 0.680886  | 0.047580  | 0.876215   | 0.402020  | 1.000000    | 0.552613 | 0.652488             | 0.487750       | 0.632073  | 0.621503 | ... | 0.531769     | 0.626712    | 0.802485     | 0.105707     | 0.302538  | 0.898469 | 0.715691   | 0.400099      | 0.585790  | 0.814971  |

5 rows × 39 columns

Now, we can graph this with `matplotlib`. If you do not have it, you need to do a `pip install matplotlib`. Matplotlib has a nifty graphing function called `matshow` that we can use:

```python
import matplotlib.pyplot as plt

plt.matshow(min_wage_corr)
plt.show()
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MAZGv-mgKcTvvQTTtE0%2Fuploads%2FA3ukfELjtlQ74NQrH8mm%2Ffile.png?alt=media)

It wouldn't be Matplotlib, however, if we didn't need to do some customization.

Again, I will just do the customization. If you would like to learn more about Matplotlib, check out the data visualization series.

```python
import matplotlib.pyplot as plt

labels = [c[:2] for c in min_wage_corr.columns]  # get abbv state names.

fig = plt.figure(figsize=(12,12))  # figure so we can add axis
ax = fig.add_subplot(111)  # define axis, so we can modify
ax.matshow(min_wage_corr, cmap=plt.cm.RdYlGn)  # display the matrix
ax.set_xticks(np.arange(len(labels)))  # show them all!
ax.set_yticks(np.arange(len(labels)))  # show them all!
ax.set_xticklabels(labels)  # set to be the abbv (vs useless #)
ax.set_yticklabels(labels)  # set to be the abbv (vs useless #)

plt.show()
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MAZGv-mgKcTvvQTTtE0%2Fuploads%2FWeEo0kjPkV1xBD9ci09Q%2Ffile.png?alt=media)

Our simple abbreviations aren't cutting it. We need something better. A quick Google search turned up `https://www.infoplease.com/state-abbreviations-and-state-postal-codes`, which contains a table.

Guess what can read tables from the internet? Pandas can! `pd.read_html(URL)` searches the page for any tables and uses them to populate a list of DataFrames. Just remember: `pd.read_html` always returns a list of DataFrames, never a single one, even if the page only has one table.

```python
# pip install lxml html5lib bs4
# on mac, run: /Applications/Python\ 3.7/Install\ Certificates.command

import pandas as pd
import requests

# Either fetch the page ourselves and hand the HTML to pandas...
web = requests.get("https://www.infoplease.com/state-abbreviations-and-state-postal-codes")
dfs = pd.read_html(web.text)

# ...or let pandas fetch the URL directly:
dfs = pd.read_html("https://www.infoplease.com/state-abbreviations-and-state-postal-codes")
```

```python
```python
for df in dfs:
    print(df.head())  # one table is states, the other territories
```

```python
  State/District Abbreviation Postal Code
0        Alabama         Ala.          AL
1         Alaska       Alaska          AK
2        Arizona        Ariz.          AZ
3       Arkansas         Ark.          AR
4     California       Calif.          CA
  Territory/Associate Abbreviation Postal Code
0      American Samoa          NaN          AS
1                Guam         Guam          GU
2    Marshall Islands          NaN          MH
3          Micronesia          NaN          FM
4   Northern Marianas          NaN          MP
```
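Since the site may not always be reachable, it's handy to know that `read_html` also accepts any file-like object, so you can see the list-of-DataFrames behavior offline. A small sketch with a made-up two-row table (recent pandas versions want literal markup wrapped in `StringIO`):

```python
from io import StringIO

import pandas as pd

# A tiny HTML table standing in for the fetched page:
html = """<table>
  <tr><th>State/District</th><th>Postal Code</th></tr>
  <tr><td>Alabama</td><td>AL</td></tr>
  <tr><td>Alaska</td><td>AK</td></tr>
</table>"""

dfs = pd.read_html(StringIO(html))
print(len(dfs))   # read_html always returns a list of DataFrames
print(dfs[0])
```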

```python
state_abbv = dfs[0]

state_abbv.head()
```

|   | State/District | Abbreviation | Postal Code |
| - | -------------- | ------------ | ----------- |
| 0 | Alabama        | Ala.         | AL          |
| 1 | Alaska         | Alaska       | AK          |
| 2 | Arizona        | Ariz.        | AZ          |
| 3 | Arkansas       | Ark.         | AR          |
| 4 | California     | Calif.       | CA          |

Sources often disable access or disappear entirely, so I want to save this DataFrame, both for myself and to share with you all in case the site stops allowing robot access. Saving a DataFrame in pandas is easy:

```python
state_abbv.to_csv("datasets/state_abbv.csv")
```

To bring it back:

```python
state_abbv = pd.read_csv("datasets/state_abbv.csv")
state_abbv.head()
```

|   | Unnamed: 0 | State/District | Abbreviation | Postal Code |
| - | ---------- | -------------- | ------------ | ----------- |
| 0 | 0          | Alabama        | Ala.         | AL          |
| 1 | 1          | Alaska         | Alaska       | AK          |
| 2 | 2          | Arizona        | Ariz.        | AZ          |
| 3 | 3          | Arkansas       | Ark.         | AR          |
| 4 | 4          | California     | Calif.       | CA          |

So what happened? We saved and then re-loaded with the index included, which created duplication. A CSV file has no concept of an index, so pandas just loads all of the data as columns and assigns a fresh index. We can save without the index, save only specific columns, or specify an index column on load. For example, this time let's save just the columns we're after:

```python
state_abbv[["State/District", "Postal Code"]].to_csv("datasets/state_abbv.csv", index=False)  # index in this case is worthless
```

Then, we can do:

```python
state_abbv = pd.read_csv("datasets/state_abbv.csv", index_col=0)
state_abbv.head()
```

|                | Postal Code |
| -------------- | ----------- |
| State/District |             |
| Alabama        | AL          |
| Alaska         | AK          |
| Arizona        | AZ          |
| Arkansas       | AR          |
| California     | CA          |
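The whole round trip can be reproduced in memory with `StringIO`, which makes the index behavior easy to see; a self-contained sketch with a two-row stand-in for our table:

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"State/District": ["Alabama", "Alaska"],
                   "Postal Code": ["AL", "AK"]})

# Default save keeps the index, which comes back as an extra column:
buf = StringIO()
df.to_csv(buf)
buf.seek(0)
dup = pd.read_csv(buf)
print(dup.columns.tolist())  # ['Unnamed: 0', 'State/District', 'Postal Code']

# index=False on save plus index_col=0 on load gives the clean shape:
buf = StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
clean = pd.read_csv(buf, index_col=0)
print(clean.index.tolist())  # ['Alabama', 'Alaska']
```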

Any time you're unsure about what you can do, you should always check out the [Pandas Docs](https://pandas.pydata.org/pandas-docs/stable/reference/index.html). They are great to just scroll through, just to learn more about what you can do, but also to learn about various parameters and methods that exist that you might otherwise not realize.

For example, while we're mainly working with CSVs here, we can work with many other formats: SQL, json, HDF5, BigQuery, and much much more!
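As one quick illustration, the JSON reader/writer pair works just like the CSV one; a sketch (again wrapping the literal string in `StringIO`, which recent pandas versions expect):

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"State/District": ["Alabama", "Alaska"],
                   "Postal Code": ["AL", "AK"]})

# Same save/load idea as to_csv/read_csv, just a different format:
json_str = df.to_json()
back = pd.read_json(StringIO(json_str))
print(back["Postal Code"].tolist())  # ['AL', 'AK']
```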

Back to the task at hand, we are trying to use the Postal Codes for our abbreviations:

```python
abbv_dict = state_abbv.to_dict()
```

```
abbv_dict
```

```
{'Postal Code': {'Alabama': 'AL',
  'Alaska': 'AK',
  'Arizona': 'AZ',
  'Arkansas': 'AR',
  'California': 'CA',
  'Colorado': 'CO',
  'Connecticut': 'CT',
  'Delaware': 'DE',
  'District of Columbia': 'DC',
  'Florida': 'FL',
  'Georgia': 'GA',
  'Hawaii': 'HI',
  'Idaho': 'ID',
  'Illinois': 'IL',
  'Indiana': 'IN',
  'Iowa': 'IA',
  'Kansas': 'KS',
  'Kentucky': 'KY',
  'Louisiana': 'LA',
  'Maine': 'ME',
  'Maryland': 'MD',
  'Massachusetts': 'MA',
  'Michigan': 'MI',
  'Minnesota': 'MN',
  'Mississippi': 'MS',
  'Missouri': 'MO',
  'Montana': 'MT',
  'Nebraska': 'NE',
  'Nevada': 'NV',
  'New Hampshire': 'NH',
  'New Jersey': 'NJ',
  'New Mexico': 'NM',
  'New York': 'NY',
  'North Carolina': 'NC',
  'North Dakota': 'ND',
  'Ohio': 'OH',
  'Oklahoma': 'OK',
  'Oregon': 'OR',
  'Pennsylvania': 'PA',
  'Rhode Island': 'RI',
  'South Carolina': 'SC',
  'South Dakota': 'SD',
  'Tennessee': 'TN',
  'Texas': 'TX',
  'Utah': 'UT',
  'Vermont': 'VT',
  'Virginia': 'VA',
  'Washington': 'WA',
  'West Virginia': 'WV',
  'Wisconsin': 'WI',
  'Wyoming': 'WY'}}
```

We can see that it's a dict wrapping the dict we actually want, keyed by the column name, so we can grab the inner mapping with:

```
abbv_dict = abbv_dict['Postal Code']
abbv_dict
```

```
{'Alabama': 'AL',
 'Alaska': 'AK',
 'Arizona': 'AZ',
 'Arkansas': 'AR',
 'California': 'CA',
 'Colorado': 'CO',
 'Connecticut': 'CT',
 'Delaware': 'DE',
 'District of Columbia': 'DC',
 'Florida': 'FL',
 'Georgia': 'GA',
 'Hawaii': 'HI',
 'Idaho': 'ID',
 'Illinois': 'IL',
 'Indiana': 'IN',
 'Iowa': 'IA',
 'Kansas': 'KS',
 'Kentucky': 'KY',
 'Louisiana': 'LA',
 'Maine': 'ME',
 'Maryland': 'MD',
 'Massachusetts': 'MA',
 'Michigan': 'MI',
 'Minnesota': 'MN',
 'Mississippi': 'MS',
 'Missouri': 'MO',
 'Montana': 'MT',
 'Nebraska': 'NE',
 'Nevada': 'NV',
 'New Hampshire': 'NH',
 'New Jersey': 'NJ',
 'New Mexico': 'NM',
 'New York': 'NY',
 'North Carolina': 'NC',
 'North Dakota': 'ND',
 'Ohio': 'OH',
 'Oklahoma': 'OK',
 'Oregon': 'OR',
 'Pennsylvania': 'PA',
 'Rhode Island': 'RI',
 'South Carolina': 'SC',
 'South Dakota': 'SD',
 'Tennessee': 'TN',
 'Texas': 'TX',
 'Utah': 'UT',
 'Vermont': 'VT',
 'Virginia': 'VA',
 'Washington': 'WA',
 'West Virginia': 'WV',
 'Wisconsin': 'WI',
 'Wyoming': 'WY'}
```
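Equivalently, grabbing the column as a Series before calling `to_dict()` skips that extra level entirely; a sketch with a two-row stand-in for `state_abbv`:

```python
import pandas as pd

# Two-row stand-in for state_abbv, indexed the same way:
state_abbv = pd.DataFrame(
    {"Postal Code": ["AL", "AK"]},
    index=pd.Index(["Alabama", "Alaska"], name="State/District"),
)

# DataFrame.to_dict() nests the mapping under the column name...
nested = state_abbv.to_dict()
print(nested["Postal Code"]["Alaska"])  # AK

# ...but a Series' to_dict() is already the flat mapping we want:
flat = state_abbv["Postal Code"].to_dict()
print(flat["Alaska"])  # AK
```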

Now we can re-do our labels with:

```
labels = [abbv_dict[c] for c in min_wage_corr.columns]  # get abbv state names.
```

```
-----------------------------------------------------------------
KeyError                        Traceback (most recent call last)
<ipython-input-18-d53be929efcb> in <module>()
----> 1 labels = [abbv_dict[c] for c in min_wage_corr.columns]  # get abbv state names.

<ipython-input-18-d53be929efcb> in <listcomp>(.0)
----> 1 labels = [abbv_dict[c] for c in min_wage_corr.columns]  # get abbv state names.

KeyError: 'Federal (FLSA)'
```

Okay. Fine, we have to hack this one in ourselves!

```
abbv_dict['Federal (FLSA)'] = "FLSA"
```

```
labels = [abbv_dict[c] for c in min_wage_corr.columns]  # get abbv state names.
```

```
-----------------------------------------------------------------
KeyError                        Traceback (most recent call last)
<ipython-input-20-d53be929efcb> in <module>()
----> 1 labels = [abbv_dict[c] for c in min_wage_corr.columns]  # get abbv state names.

<ipython-input-20-d53be929efcb> in <listcomp>(.0)
----> 1 labels = [abbv_dict[c] for c in min_wage_corr.columns]  # get abbv state names.

KeyError: 'Guam'
```

Hmm, we might have to revisit the territories, but:

```
abbv_dict['Guam'] = "GU"
abbv_dict['Puerto Rico'] = "PR"
labels = [abbv_dict[c] for c in min_wage_corr.columns]  # get abbv state names.
```
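If you'd rather not patch entries in one at a time, `dict.get` with a default handles unknown names in a single pass, at the cost of falling back to the uglier two-letter labels; a sketch with a toy mapping:

```python
# Toy mapping with the same gaps we just hit:
abbv = {"Alaska": "AK", "Guam": "GU"}
columns = ["Alaska", "Guam", "Federal (FLSA)"]

# dict.get falls back to the first two letters for anything unmapped,
# so unknown names don't raise a KeyError:
labels = [abbv.get(c, c[:2]) for c in columns]
print(labels)  # ['AK', 'GU', 'Fe']
```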

Okay good enough! Back to our graph now!

```
fig = plt.figure(figsize=(12,12))  # figure so we can add axis
ax = fig.add_subplot(111)  # define axis, so we can modify
ax.matshow(min_wage_corr, cmap=plt.cm.RdYlGn)  # display the matrix
ax.set_xticks(np.arange(len(labels)))  # show them all!
ax.set_yticks(np.arange(len(labels)))  # show them all!
ax.set_xticklabels(labels)  # set to be the abbv (vs useless #)
ax.set_yticklabels(labels)  # set to be the abbv (vs useless #)

plt.show()
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MAZGv-mgKcTvvQTTtE0%2Fuploads%2FyWH94eR4JHHEBlIt8557%2Ffile.png?alt=media)
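One optional finishing touch: `matshow` returns the image object, and passing it to `fig.colorbar` adds a scale so readers can map colors back to correlation values (pinning `vmin`/`vmax` to -1 and 1 keeps the scale honest). A sketch with random data standing in for the correlation matrix, since the dataset isn't loaded here:

```python
import numpy as np
import matplotlib.pyplot as plt

# Random data standing in for min_wage_corr:
rng = np.random.default_rng(0)
fake_corr = np.corrcoef(rng.normal(size=(5, 50)))

fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111)
im = ax.matshow(fake_corr, cmap=plt.cm.RdYlGn, vmin=-1, vmax=1)
fig.colorbar(im)  # scale bar mapping colors back to correlation values

plt.show()
```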

Cool! We've covered quite a bit again, but hopefully it was interesting, and we got to begin combining datasets, if only to inform our column names.

... but more cool things happen when we combine datasets for their actual data! While correlation is not causation, we can still glean interesting things from it. Plus... we get to make cool graphs, so why not? In the next tutorial, we're going to explore the relationship of minimum wage to unemployment, and maybe even toss in the political leanings of those states while we're at it!
