# Groupby - Data Analysis with Python 3 and Pandas

Hello and welcome to another data analysis with Python and Pandas tutorial. In this tutorial, we're going to change up the dataset and play with minimum wage data now.

You can find this dataset here: [Kaggle Minimum Wage by State](https://www.kaggle.com/lislejoem/us-minimum-wage-by-state-from-1968-to-2017). This dataset goes from 1968 to 2017, giving the minimum wage (lowest amount of money that employers can pay workers by the hour), by state.

Description of the data:

Year: Year of data

State: State/Territory of data

Table\_Data: The scraped, unclean data from the US Department of Labor.

Footnote: The footnote associated with Table\_Data, provided by the US Department of Labor.

High.Value: As there were some values in Table\_Data that had multiple values (usually associated with footnotes), this is the higher of the two values in the table. It could be useful for viewing the proposed minimum wage, because in most cases, the higher value meant that all persons protected under minimum wage laws eventually had minimum wage set at that value.

Low\.Value: This is the same as High.Value, but has the lower of the two values. This could be useful for viewing the effective minimum wage at the year of setting the minimum wage, as peoples protected under such minimum wage laws made that value during that year (although, in most cases, they had a higher minimum wage after that year).

CPI.Average: This is the average Consumer Price Index associated with that year. It was used to calculate 2018-equivalent values.

High.2018: This is the 2018-equivalent dollars for High.Value.

Low\.2018: This is the 2018-equivalent dollars for Low\.Value.

Once you have downloaded the data, let's begin working with it.

```python
import pandas as pd

# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte
df = pd.read_csv("datasets/Minimum Wage Data.csv", encoding="latin")
```

Right away, we've got some encoding issues. Looks like the user saved the formatting funky-like. Because the data was grabbed from the internet, it would have made more sense to leave it in UTF-8, but, for whatever reason, that wasn't the case, and I initially hit an encoding error on loading it in. I tried latin encoding next, and boom, there we go. Now, let's go ahead and just save our own version, with utf-8 encoding!

```python
df.to_csv("datasets/minwage.csv", encoding="utf-8")
```

```
df = pd.read_csv("datasets/minwage.csv")
```

```
df.head()
```

|   | Unnamed: 0 | Year | State      | Table\_Data         | Footnote | High.Value | Low\.Value | CPI.Average | High.2018 | Low\.2018 |
| - | ---------- | ---- | ---------- | ------------------- | -------- | ---------- | ---------- | ----------- | --------- | --------- |
| 0 | 0          | 1968 | Alabama    | ...                 | NaN      | 0.00000    | 0.00000    | 34.783333   | 0.00      | 0.00      |
| 1 | 1          | 1968 | Alaska     | 2.10                | NaN      | 2.10000    | 2.10000    | 34.783333   | 15.12     | 15.12     |
| 2 | 2          | 1968 | Arizona    | 18.72 - 26.40/wk(b) | (b)      | 0.66000    | 0.46800    | 34.783333   | 4.75      | 3.37      |
| 3 | 3          | 1968 | Arkansas   | 1.25/day(b)         | (b)      | 0.15625    | 0.15625    | 34.783333   | 1.12      | 1.12      |
| 4 | 4          | 1968 | California | 1.65(b)             | (b)      | 1.65000    | 1.65000    | 34.783333   | 11.88     | 11.88     |

Let's check out a new functionality with pandas, called group by. We can automatically create groups by unique column values. Sounds familiar? It's exactly what we did before, just with pandas instead of our own Python logic. That's one thing I really enjoy with Pandas. It's very easy to work with Pandas using your own logic, or with some built-in Pandas logic.

```python
gb = df.groupby("State")
gb.get_group("Alabama").set_index("Year").head()
```

|      | Unnamed: 0 | State   | Table\_Data | Footnote | High.Value | Low\.Value | CPI.Average | High.2018 | Low\.2018 |
| ---- | ---------- | ------- | ----------- | -------- | ---------- | ---------- | ----------- | --------- | --------- |
| Year |            |         |             |          |            |            |             |           |           |
| 1968 | 0          | Alabama | ...         | NaN      | 0.0        | 0.0        | 34.783333   | 0.0       | 0.0       |
| 1969 | 55         | Alabama | ...         | NaN      | 0.0        | 0.0        | 36.683333   | 0.0       | 0.0       |
| 1970 | 110        | Alabama | ...         | NaN      | 0.0        | 0.0        | 38.825000   | 0.0       | 0.0       |
| 1971 | 165        | Alabama | ...         | NaN      | 0.0        | 0.0        | 40.491667   | 0.0       | 0.0       |
| 1972 | 220        | Alabama | ...         | NaN      | 0.0        | 0.0        | 41.816667   | 0.0       | 0.0       |

Aside from getting groups, we can also just iterate over the groups:

```python
act_min_wage = pd.DataFrame()

for name, group in df.groupby("State"):
    if act_min_wage.empty:
        act_min_wage = group.set_index("Year")[["Low.2018"]].rename(columns={"Low.2018":name})
    else:
        act_min_wage = act_min_wage.join(group.set_index("Year")[["Low.2018"]].rename(columns={"Low.2018":name}))

act_min_wage.head()
```

|      | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware | District of Columbia | Federal (FLSA) | ... | Tennessee | Texas | U.S. Virgin Islands | Utah | Vermont | Virginia | Washington | West Virginia | Wisconsin | Wyoming |
| ---- | ------- | ------ | ------- | -------- | ---------- | -------- | ----------- | -------- | -------------------- | -------------- | --- | --------- | ----- | ------------------- | ---- | ------- | -------- | ---------- | ------------- | --------- | ------- |
| Year |         |        |         |          |            |          |             |          |                      |                |     |           |       |                     |      |         |          |            |               |           |         |
| 1968 | 0.0     | 15.12  | 3.37    | 1.12     | 11.88      | 7.20     | 10.08       | 9.00     | 9.00                 | 8.28           | ... | 0.0       | 0.00  | NaN                 | 7.20 | 10.08   | 0.0      | 11.52      | 7.20          | 9.00      | 8.64    |
| 1969 | 0.0     | 14.33  | 3.19    | 1.07     | 11.26      | 6.83     | 9.56        | 8.53     | 8.53                 | 7.85           | ... | 0.0       | 0.00  | NaN                 | 6.83 | 9.56    | 0.0      | 10.92      | 6.83          | 8.53      | 8.19    |
| 1970 | 0.0     | 13.54  | 3.02    | 7.09     | 10.64      | 6.45     | 10.32       | 8.06     | 10.32                | 8.38           | ... | 0.0       | 0.00  | NaN                 | 6.45 | 10.32   | 0.0      | 10.32      | 6.45          | 8.38      | 8.38    |
| 1971 | 0.0     | 12.99  | 2.89    | 6.80     | 10.20      | 6.18     | 9.89        | 7.73     | 9.89                 | 8.04           | ... | 0.0       | 0.00  | NaN                 | 6.18 | 9.89    | 0.0      | 9.89       | 6.18          | 8.04      | 8.04    |
| 1972 | 0.0     | 12.57  | 2.80    | 7.19     | 9.88       | 5.99     | 11.08       | 9.58     | 9.58                 | 9.58           | ... | 0.0       | 8.38  | NaN                 | 7.19 | 9.58    | 0.0      | 9.58       | 7.19          | 8.68      | 8.98    |

5 rows A-- 55 columns

Sometimes, it is interesting to just see some various stats on your data. One thing you can do very quick is run a `describe` on your data to get various features right away:

```python
act_min_wage.describe()
```

|       | Alabama | Alaska    | Arizona   | Arkansas  | California | Colorado  | Connecticut | Delaware  | District of Columbia | Federal (FLSA) | ... | Tennessee | Texas     | U.S. Virgin Islands | Utah      | Vermont   | Virginia  | Washington | West Virginia | Wisconsin | Wyoming   |
| ----- | ------- | --------- | --------- | --------- | ---------- | --------- | ----------- | --------- | -------------------- | -------------- | --- | --------- | --------- | ------------------- | --------- | --------- | --------- | ---------- | ------------- | --------- | --------- |
| count | 50.0    | 50.000000 | 50.000000 | 50.000000 | 50.000000  | 50.000000 | 50.000000   | 50.000000 | 50.000000            | 50.000000      | ... | 50.0      | 50.000000 | 39.000000           | 50.000000 | 50.000000 | 50.000000 | 50.000000  | 50.000000     | 50.000000 | 50.000000 |
| mean  | 0.0     | 9.583800  | 2.316200  | 6.781600  | 8.632800   | 6.389800  | 8.823600    | 7.498800  | 8.343200             | 7.773200       | ... | 0.0       | 5.541200  | 6.643590            | 6.836600  | 8.657400  | 5.897400  | 8.301800   | 7.166400      | 7.694200  | 5.149200  |
| std   | 0.0     | 1.845176  | 3.474504  | 1.341223  | 1.187508   | 1.521052  | 1.023613    | 1.265436  | 1.476133             | 0.887354       | ... | 0.0       | 2.202229  | 1.590176            | 0.819367  | 0.984647  | 2.726151  | 1.868613   | 0.793673      | 0.762675  | 1.998928  |
| min   | 0.0     | 7.380000  | 0.000000  | 1.070000  | 6.420000   | 3.840000  | 6.820000    | 4.410000  | 5.510000             | 6.220000       | ... | 0.0       | 0.000000  | 4.390000            | 4.790000  | 6.800000  | 0.000000  | 4.410000   | 5.580000      | 6.420000  | 2.260000  |
| 25%   | 0.0     | 8.252500  | 0.000000  | 6.522500  | 7.695000   | 5.072500  | 7.865000    | 7.015000  | 7.097500             | 7.172500       | ... | 0.0       | 4.762500  | 5.105000            | 6.317500  | 7.800000  | 5.892500  | 7.227500   | 6.630000      | 7.107500  | 3.270000  |
| 50%   | 0.0     | 8.890000  | 0.000000  | 6.920000  | 8.670000   | 6.375000  | 9.075000    | 7.845000  | 8.510000             | 7.655000       | ... | 0.0       | 6.070000  | 6.770000            | 6.930000  | 8.885000  | 6.855000  | 9.055000   | 7.075000      | 7.580000  | 5.445000  |
| 75%   | 0.0     | 10.492500 | 3.147500  | 7.290000  | 9.307500   | 7.677500  | 9.455000    | 8.327500  | 9.140000             | 8.242500       | ... | 0.0       | 7.105000  | 7.830000            | 7.470000  | 9.372500  | 7.620000  | 9.672500   | 7.612500      | 8.145000  | 6.607500  |
| max   | 0.0     | 15.120000 | 10.220000 | 8.680000  | 11.880000  | 9.500000  | 11.080000   | 9.580000  | 12.000000            | 10.010000      | ... | 0.0       | 8.380000  | 10.010000           | 8.330000  | 10.320000 | 8.800000  | 11.520000  | 9.130000      | 9.660000  | 8.980000  |

8 rows A-- 55 columns

Another one that we can do is `.corr()` or `.cov()` to get correlation or covariance respectively.

```python
act_min_wage.corr().head()
```

|            | Alabama | Alaska    | Arizona  | Arkansas  | California | Colorado  | Connecticut | Delaware | District of Columbia | Federal (FLSA) | ... | Tennessee | Texas     | U.S. Virgin Islands | Utah      | Vermont  | Virginia  | Washington | West Virginia | Wisconsin | Wyoming   |
| ---------- | ------- | --------- | -------- | --------- | ---------- | --------- | ----------- | -------- | -------------------- | -------------- | --- | --------- | --------- | ------------------- | --------- | -------- | --------- | ---------- | ------------- | --------- | --------- |
| Alabama    | NaN     | NaN       | NaN      | NaN       | NaN        | NaN       | NaN         | NaN      | NaN                  | NaN            | ... | NaN       | NaN       | NaN                 | NaN       | NaN      | NaN       | NaN        | NaN           | NaN       | NaN       |
| Alaska     | NaN     | 1.000000  | 0.018638 | -0.377934 | 0.717653   | -0.129332 | 0.680886    | 0.258997 | 0.321785             | 0.637679       | ... | NaN       | -0.522472 | 0.277016            | -0.020462 | 0.663558 | -0.601640 | 0.411593   | 0.044814      | 0.702570  | 0.727932  |
| Arizona    | NaN     | 0.018638  | 1.000000 | 0.089395  | 0.498994   | 0.742527  | 0.458163    | 0.429836 | 0.590848             | -0.049914      | ... | NaN       | 0.369900  | -0.758607           | 0.439344  | 0.517960 | 0.047405  | 0.587975   | 0.584596      | 0.131588  | 0.346528  |
| Arkansas   | NaN     | -0.377934 | 0.089395 | 1.000000  | -0.234367  | 0.135749  | 0.047580    | 0.016125 | 0.266889             | 0.117245       | ... | NaN       | 0.503242  | -0.204485           | 0.194680  | 0.087429 | 0.582192  | -0.072343  | 0.420819      | 0.000470  | -0.250592 |
| California | NaN     | 0.717653  | 0.498994 | -0.234367 | 1.000000   | 0.483313  | 0.876215    | 0.479197 | 0.596865             | 0.371966       | ... | NaN       | -0.239533 | -0.417782           | 0.392898  | 0.877922 | -0.326364 | 0.754085   | 0.371765      | 0.584067  | 0.722617  |

5 rows A-- 55 columns

For some reason, we can see that Alabama and Tennessee at least are returning NaNs. Upon looking above at the `.describe()`, or if we just printed the head, we'd see that Alabama, for example, reports all 0s. What's up there?

We can just move on, or we could inspect what's going on here. Let's just briefly inspect, shall we? To begin, we'll start with our "base" dataset, which is currently under the var name of `df`.

```python
df.head()
```

|   | Unnamed: 0 | Year | State      | Table\_Data         | Footnote | High.Value | Low\.Value | CPI.Average | High.2018 | Low\.2018 |
| - | ---------- | ---- | ---------- | ------------------- | -------- | ---------- | ---------- | ----------- | --------- | --------- |
| 0 | 0          | 1968 | Alabama    | ...                 | NaN      | 0.00000    | 0.00000    | 34.783333   | 0.00      | 0.00      |
| 1 | 1          | 1968 | Alaska     | 2.10                | NaN      | 2.10000    | 2.10000    | 34.783333   | 15.12     | 15.12     |
| 2 | 2          | 1968 | Arizona    | 18.72 - 26.40/wk(b) | (b)      | 0.66000    | 0.46800    | 34.783333   | 4.75      | 3.37      |
| 3 | 3          | 1968 | Arkansas   | 1.25/day(b)         | (b)      | 0.15625    | 0.15625    | 34.783333   | 1.12      | 1.12      |
| 4 | 4          | 1968 | California | 1.65(b)             | (b)      | 1.65000    | 1.65000    | 34.783333   | 11.88     | 11.88     |

```python
issue_df = df[df['Low.2018']==0]

issue_df.head()
```

|    | Unnamed: 0 | Year | State    | Table\_Data | Footnote | High.Value | Low\.Value | CPI.Average | High.2018 | Low\.2018 |
| -- | ---------- | ---- | -------- | ----------- | -------- | ---------- | ---------- | ----------- | --------- | --------- |
| 0  | 0          | 1968 | Alabama  | ...         | NaN      | 0.0        | 0.0        | 34.783333   | 0.0       | 0.0       |
| 10 | 10         | 1968 | Florida  | ...         | NaN      | 0.0        | 0.0        | 34.783333   | 0.0       | 0.0       |
| 11 | 11         | 1968 | Georgia  | ...         | NaN      | 0.0        | 0.0        | 34.783333   | 0.0       | 0.0       |
| 15 | 15         | 1968 | Illinois | ...         | NaN      | 0.0        | 0.0        | 34.783333   | 0.0       | 0.0       |
| 17 | 17         | 1968 | Iowa     | ...         | NaN      | 0.0        | 0.0        | 34.783333   | 0.0       | 0.0       |

Okay, how do we get them all? Well, we could just grab the uniques from the state column like:

```python
issue_df['State'].unique()
```

```
array(['Alabama', 'Florida', 'Georgia', 'Illinois', 'Iowa', 'Kansas',
       'Louisiana', 'Mississippi', 'Missouri', 'Montana',
       'South Carolina', 'Tennessee', 'Texas', 'Virginia', 'Arizona'],
      dtype=object)
```

Let's confirm that these are all actually problematic for us. First, let's remove the ones that we know are problematic from our correlation table:

```python
import numpy as np

# axis 1 == columns. 0,default, is for rows
act_min_wage.replace(0, np.NaN).dropna(axis=1).corr().head()
```

|             | Alaska    | Arkansas  | California | Colorado  | Connecticut | Delaware | District of Columbia | Federal (FLSA) | Guam      | Hawaii   | ... | Pennsylvania | Puerto Rico | Rhode Island | South Dakota | Utah      | Vermont  | Washington | West Virginia | Wisconsin | Wyoming   |
| ----------- | --------- | --------- | ---------- | --------- | ----------- | -------- | -------------------- | -------------- | --------- | -------- | --- | ------------ | ----------- | ------------ | ------------ | --------- | -------- | ---------- | ------------- | --------- | --------- |
| Alaska      | 1.000000  | -0.377934 | 0.717653   | -0.129332 | 0.680886    | 0.258997 | 0.321785             | 0.637679       | 0.787915  | 0.672620 | ... | 0.610814     | -0.038118   | 0.652353     | -0.326316    | -0.020462 | 0.663558 | 0.411593   | 0.044814      | 0.702570  | 0.727932  |
| Arkansas    | -0.377934 | 1.000000  | -0.234367  | 0.135749  | 0.047580    | 0.016125 | 0.266889             | 0.117245       | 0.039593  | 0.204801 | ... | 0.159923     | 0.232186    | 0.003498     | 0.800116     | 0.194680  | 0.087429 | -0.072343  | 0.420819      | 0.000470  | -0.250592 |
| California  | 0.717653  | -0.234367 | 1.000000   | 0.483313  | 0.876215    | 0.479197 | 0.596865             | 0.371966       | 0.492052  | 0.519241 | ... | 0.429061     | 0.512712    | 0.780916     | -0.036787    | 0.392898  | 0.877922 | 0.754085   | 0.371765      | 0.584067  | 0.722617  |
| Colorado    | -0.129332 | 0.135749  | 0.483313   | 1.000000  | 0.402020    | 0.566304 | 0.673371             | -0.232035      | -0.192616 | 0.069800 | ... | -0.136195    | 0.657364    | 0.429852     | 0.399137     | 0.622330  | 0.448485 | 0.612637   | 0.533623      | 0.011501  | 0.130053  |
| Connecticut | 0.680886  | 0.047580  | 0.876215   | 0.402020  | 1.000000    | 0.552613 | 0.652488             | 0.487750       | 0.632073  | 0.621503 | ... | 0.531769     | 0.626712    | 0.802485     | 0.105707     | 0.302538  | 0.898469 | 0.715691   | 0.400099      | 0.585790  | 0.814971  |

5 rows A-- 39 columns

Looks good, let's save as a var:

```python
min_wage_corr = act_min_wage.replace(0, np.NaN).dropna(axis=1).corr()
```

Now let's see if any of the identified problems exist after we've dropped:

```python
for problem in issue_df['State'].unique():
    if problem in min_wage_corr.columns:
        print("Missing something here....")
```

Alright, there's our answer then. These states all are problematic. Can we recover from this? Let's see!

```python
grouped_issues = issue_df.groupby("State")

grouped_issues.get_group("Alabama").head(3)
```

|     | Unnamed: 0 | Year | State   | Table\_Data | Footnote | High.Value | Low\.Value | CPI.Average | High.2018 | Low\.2018 |
| --- | ---------- | ---- | ------- | ----------- | -------- | ---------- | ---------- | ----------- | --------- | --------- |
| 0   | 0          | 1968 | Alabama | ...         | NaN      | 0.0        | 0.0        | 34.783333   | 0.0       | 0.0       |
| 55  | 55         | 1969 | Alabama | ...         | NaN      | 0.0        | 0.0        | 36.683333   | 0.0       | 0.0       |
| 110 | 110        | 1970 | Alabama | ...         | NaN      | 0.0        | 0.0        | 38.825000   | 0.0       | 0.0       |

Right away, we can see we're missing any `Footnote`, `High.Value`, `Low.Value`, and the `High.2018`, `Low.2018`. Recall that the `Table_Data` was the "raw" data that was scraped. Here, we're getting elipses for whatever reason. Probably the scraper that grabbed this data needed to interact better with the web page. Unfortunately, this is the data we have. A final check I might do is to see if literally all of the columns are zero. There are a billion ways we could do this, but let's just...check the sum for `Low.2018`:

```python
grouped_issues.get_group("Alabama")['Low.2018'].sum()
```

```
0.0
```

Looks like we just never get any value for Alabama. Let's see if this is true for all of the issues in our group.

```python
for state, data in grouped_issues:
    if data['Low.2018'].sum() != 0.0:
        print("Some data found for", state)
```

Looks like we wont be recovering from this, without bringing in another dataset, or maybe scraping better. Hey, I think it could be basic enough to fill in this missing data if we scraped, and it might be useful for the tutorial. Let's see. This dataset was scraped from: [the Department of Labor](https://www.dol.gov/whd/state/stateMinWageHis.htm)...but, upon checking, nope. Those `...` are just plain there. I don't see how we're going to overcome that! The show will have to go on without those states! At least we were able to find out why, by using Pandas.

In the next tutorial, we'll get into some visualization and more into Pandas


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://shahyaseen71.gitbook.io/technocolabs-data-internship/groupby-data-analysis-with-python-3-and-pandas.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
