Introduction - Data Analysis with Python 3 and Pandas
Welcome to a data analysis tutorial with Python and the Pandas data analysis library.
The field of data analytics is quite large and what you might be aiming to do with it is likely to never match up exactly to any tutorial. With that in mind, I think the best way for us to approach learning data analysis with Python is simply by example. My plan here is to find some datasets and do some of the common data analysis tasks, using the Pandas package, to hopefully get you familiar enough with the package to work with it on your own.
To begin, let's make sure we're all on the same page.
I will be using Python 3.7 and Pandas 0.24.1.
You can likely follow along with different versions of things, just know there may be minor differences that you will need to work out. With Pandas, I have personally found I can usually google my errors with a high degree of success.
So, once you've got Python and have run pip install pandas, you're ready!
There will be quite a few packages and libraries that we install through the course of this series. If you'd rather focus on the code than on installing packages, you can check out ActivePython, a pre-compiled and optimized distribution of Python from ActiveState, which will have everything you need to follow along with this series.
Let's jump in!
Oh, wait, we probably should have a dataset too.
The internet is stuffed full of datasets, so there are many to choose from. I am personally going to be using datasets from Kaggle.
If you are not familiar, Kaggle is a data analysis competitions website. I think that, if you're looking to practice real-world data analysis challenges, Kaggle is the single best place to do it, even if you're not looking to compete.
Many, if not most, of the competitions on Kaggle are actual company problems. They are much like the things I often get asked to do in my contract work, or that you might be asked to do if you find employment as a data analyst. These are typically "unsolved" types of problems, rather than the simpler, solved issues that you will typically encounter in tutorials.
I don't think we're quite ready to jump into anything serious, so let's find a simpler dataset to start with. To find datasets, check out the Kaggle Datasets. Tons of goodies here.
To begin, let's check out Avocado Prices. I absolutely adore avocados! Did you know avocados are a fruit? Most closely classified as ... a berry! Imagine getting some "mixed berries" flavored thing, and there's avocado in there. Hah!
Anyway, download that dataset. You will need to log in or create an account to use Kaggle, but you should. If for whatever reason you don't want to, or the dataset is missing, I will also host it here: Avocado Prices.
Unzip the file using whatever you use to zip/unzip things, and you're left with a CSV file.
CSV files are a very common file type in data analysis. A CSV is organized by columns and rows: the values within a row are separated by commas (hey, is that where the name CSV comes from!?!), and the rows are separated by new lines in the document. So, let's read this CSV in with Pandas.
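To make that structure concrete, here is a minimal sketch using Python's built-in csv module rather than Pandas. The three-column snippet is a cut-down stand-in for the real file (an assumption for illustration; the real file has more columns), with values borrowed from the avocado data used in this tutorial.

```python
import csv
import io

# A tiny CSV snippet: the first line holds the column names,
# and each later line is one row of comma-separated values.
csv_text = """Date,AveragePrice,region
2015-12-27,1.33,Albany
2015-12-20,1.35,Albany
"""

rows = list(csv.reader(io.StringIO(csv_text)))
header, data = rows[0], rows[1:]

print(header)   # the column names
print(data[0])  # the first data row
```

Every value comes back as a string here. Part of what Pandas does for us when it reads a CSV is work out a sensible type for each column.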
For now, let's make sure our file is in the same working directory as our Python script or in a directory like "datasets." I will be doing the latter, but you can feel free to do as you wish. So, to begin, we have a file called avocado.csv and we want to load that into pandas. It's a CSV file, so it's already in a sort of columns and rows format; we just want to load that into a pandas dataframe.
To do this, we will use a function called read_csv. Let's see how that works. I am going to be doing this in a Jupyter Notebook. You can use whatever editor you like, but Jupyter notebooks are pretty useful for data analysis and just general poking around with data. To use them, you can just do:
pip install jupyterlab
Then in a terminal/command prompt, you can do:
jupyter lab
Then you can go file > new > notebook, pick Python 3, and you're good to go! Let's start by loading in a file.
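A minimal sketch of that first load is below. So that it runs without the download, it reads a few rows of CSV text from memory; with the real file you would pass the path instead, something like pd.read_csv("datasets/avocado.csv"), assuming you put the file in a datasets folder as described above. The trimmed set of columns is also just for illustration.

```python
import io

import pandas as pd

# Stand-in for the real file: three rows laid out like avocado.csv,
# with only some of the columns kept (an assumption for brevity).
csv_text = """,Date,AveragePrice,Total Volume,type,year,region
0,2015-12-27,1.33,64236.62,conventional,2015,Albany
1,2015-12-20,1.35,54876.98,conventional,2015,Albany
2,2015-12-13,0.93,118220.22,conventional,2015,Albany
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column names taken from the header line
```

A header line that starts with a comma, as in this sample, leaves the first column without a name, and Pandas labels it Unnamed: 0. That is presumably where the column of that name in the output below comes from.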
A dataframe is a type of pandas object that is basically a "table"-like object with columns and rows, on which we can also perform various calculations, statistical operations, and so on. We can print it out:
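In a notebook, putting df on the last line of a cell displays it; in a plain script you would use print(df). Here is a small sketch with a hand-built dataframe (three rows copied from the avocado data and trimmed to three columns, purely for illustration) that prints the table and runs one simple calculation on it:

```python
import pandas as pd

# Hand-built stand-in for the loaded avocado data
df = pd.DataFrame({
    "Date": ["2015-12-27", "2015-12-20", "2015-12-13"],
    "AveragePrice": [1.33, 1.35, 0.93],
    "region": ["Albany", "Albany", "Albany"],
})

print(df)         # the whole table
print(df.shape)   # (rows, columns)

# A simple statistical operation on one column
mean_price = df["AveragePrice"].mean()
print(mean_price)
```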
|  | Unnamed: 0 | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 2015-12-27 | 1.33 | 64236.62 | 1036.74 | 54454.85 | 48.16 | 8696.87 | 8603.62 | 93.25 | 0.00 | conventional | 2015 | Albany |
| 1 | 1 | 2015-12-20 | 1.35 | 54876.98 | 674.28 | 44638.81 | 58.33 | 9505.56 | 9408.07 | 97.49 | 0.00 | conventional | 2015 | Albany |
| 2 | 2 | 2015-12-13 | 0.93 | 118220.22 | 794.70 | 109149.67 | 130.50 | 8145.35 | 8042.21 | 103.14 | 0.00 | conventional | 2015 | Albany |
| 3 | 3 | 2015-12-06 | 1.08 | 78992.15 | 1132.00 | 71976.41 | 72.58 | 5811.16 | 5677.40 | 133.76 | 0.00 | conventional | 2015 | Albany |
| 4 | 4 | 2015-11-29 | 1.28 | 51039.60 | 941.48 | 43838.39 | 75.78 | 6183.95 | 5986.26 | 197.69 | 0.00 | conventional | 2015 | Albany |
| 5 | 5 | 2015-11-22 | 1.26 | 55979.78 | 1184.27 | 48067.99 | 43.61 | 6683.91 | 6556.47 | 127.44 | 0.00 | conventional | 2015 | Albany |
| 6 | 6 | 2015-11-15 | 0.99 | 83453.76 | 1368.92 | 73672.72 | 93.26 | 8318.86 | 8196.81 | 122.05 | 0.00 | conventional | 2015 | Albany |
| 7 | 7 | 2015-11-08 | 0.98 | 109428.33 | 703.75 | 101815.36 | 80.00 | 6829.22 | 6266.85 | 562.37 | 0.00 | conventional | 2015 | Albany |
| 8 | 8 | 2015-11-01 | 1.02 | 99811.42 | 1022.15 | 87315.57 | 85.34 | 11388.36 | 11104.53 | 283.83 | 0.00 | conventional | 2015 | Albany |
| 9 | 9 | 2015-10-25 | 1.07 | 74338.76 | 842.40 | 64757.44 | 113.00 | 8625.92 | 8061.47 | 564.45 | 0.00 | conventional | 2015 | Albany |
| 10 | 10 | 2015-10-18 | 1.12 | 84843.44 | 924.86 | 75595.85 | 117.07 | 8205.66 | 7877.86 | 327.80 | 0.00 | conventional | 2015 | Albany |
| 11 | 11 | 2015-10-11 | 1.28 | 64489.17 | 1582.03 | 52677.92 | 105.32 | 10123.90 | 9866.27 | 257.63 | 0.00 | conventional | 2015 | Albany |
| 12 | 12 | 2015-10-04 | 1.31 | 61007.10 | 2268.32 | 49880.67 | 101.36 | 8756.75 | 8379.98 | 376.77 | 0.00 | conventional | 2015 | Albany |
| 13 | 13 | 2015-09-27 | 0.99 | 106803.39 | 1204.88 | 99409.21 | 154.84 | 6034.46 | 5888.87 | 145.59 | 0.00 | conventional | 2015 | Albany |
| 14 | 14 | 2015-09-20 | 1.33 | 69759.01 | 1028.03 | 59313.12 | 150.50 | 9267.36 | 8489.10 | 778.26 | 0.00 | conventional | 2015 | Albany |
| 15 | 15 | 2015-09-13 | 1.28 | 76111.27 | 985.73 | 65696.86 | 142.00 | 9286.68 | 8665.19 | 621.49 | 0.00 | conventional | 2015 | Albany |
| 16 | 16 | 2015-09-06 | 1.11 | 99172.96 | 879.45 | 90062.62 | 240.79 | 7990.10 | 7762.87 | 227.23 | 0.00 | conventional | 2015 | Albany |
| 17 | 17 | 2015-08-30 | 1.07 | 105693.84 | 689.01 | 94362.67 | 335.43 | 10306.73 | 10218.93 | 87.80 | 0.00 | conventional | 2015 | Albany |
| 18 | 18 | 2015-08-23 | 1.34 | 79992.09 | 733.16 | 67933.79 | 444.78 | 10880.36 | 10745.79 | 134.57 | 0.00 | conventional | 2015 | Albany |
| 19 | 19 | 2015-08-16 | 1.33 | 80043.78 | 539.65 | 68666.01 | 394.90 | 10443.22 | 10297.68 | 145.54 | 0.00 | conventional | 2015 | Albany |
| 20 | 20 | 2015-08-09 | 1.12 | 111140.93 | 584.63 | 100961.46 | 368.95 | 9225.89 | 9116.34 | 109.55 | 0.00 | conventional | 2015 | Albany |
| 21 | 21 | 2015-08-02 | 1.45 | 75133.10 | 509.94 | 62035.06 | 741.08 | 11847.02 | 11768.52 | 78.50 | 0.00 | conventional | 2015 | Albany |
| 22 | 22 | 2015-07-26 | 1.11 | 106757.10 | 648.75 | 91949.05 | 966.61 | 13192.69 | 13061.53 | 131.16 | 0.00 | conventional | 2015 | Albany |
| 23 | 23 | 2015-07-19 | 1.26 | 96617.00 | 1042.10 | 82049.40 | 2238.02 | 11287.48 | 11103.49 | 183.99 | 0.00 | conventional | 2015 | Albany |
| 24 | 24 | 2015-07-12 | 1.05 | 124055.31 | 672.25 | 94693.52 | 4257.64 | 24431.90 | 24290.08 | 108.49 | 33.33 | conventional | 2015 | Albany |
| 25 | 25 | 2015-07-05 | 1.35 | 109252.12 | 869.45 | 72600.55 | 5883.16 | 29898.96 | 29663.19 | 235.77 | 0.00 | conventional | 2015 | Albany |
| 26 | 26 | 2015-06-28 | 1.37 | 89534.81 | 664.23 | 57545.79 | 4662.71 | 26662.08 | 26311.76 | 350.32 | 0.00 | conventional | 2015 | Albany |
| 27 | 27 | 2015-06-21 | 1.27 | 104849.39 | 804.01 | 76688.55 | 5481.18 | 21875.65 | 21662.00 | 213.65 | 0.00 | conventional | 2015 | Albany |
| 28 | 28 | 2015-06-14 | 1.32 | 89631.30 | 850.58 | 55400.94 | 4377.19 | 29002.59 | 28343.14 | 659.45 | 0.00 | conventional | 2015 | Albany |
| 29 | 29 | 2015-06-07 | 1.07 | 122743.06 | 656.71 | 99220.82 | 90.32 | 22775.21 | 22314.99 | 460.22 | 0.00 | conventional | 2015 | Albany |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 18219 | 6 | 2018-02-11 | 1.56 | 1317000.47 | 98465.26 | 270798.27 | 1839.80 | 945638.02 | 768242.42 | 177144.00 | 251.60 | organic | 2018 | TotalUS |
| 18220 | 7 | 2018-02-04 | 1.53 | 1384683.41 | 117922.52 | 287724.61 | 1703.52 | 977084.84 | 774695.74 | 201878.69 | 510.41 | organic | 2018 | TotalUS |
| 18221 | 8 | 2018-01-28 | 1.61 | 1336979.09 | 118616.17 | 280080.34 | 1270.61 | 936859.49 | 796104.27 | 140652.84 | 102.38 | organic | 2018 | TotalUS |
| 18222 | 9 | 2018-01-21 | 1.63 | 1283987.65 | 108705.28 | 259172.13 | 1490.02 | 914409.26 | 710654.40 | 203526.59 | 228.27 | organic | 2018 | TotalUS |
| 18223 | 10 | 2018-01-14 | 1.59 | 1476651.08 | 145680.62 | 323669.83 | 1580.01 | 1005593.78 | 858772.69 | 146808.97 | 12.12 | organic | 2018 | TotalUS |
| 18224 | 11 | 2018-01-07 | 1.51 | 1517332.70 | 129541.43 | 296490.29 | 1289.07 | 1089861.24 | 915452.78 | 174381.57 | 26.89 | organic | 2018 | TotalUS |
| 18225 | 0 | 2018-03-25 | 1.60 | 271723.08 | 26996.28 | 77861.39 | 117.56 | 166747.85 | 87108.00 | 79495.39 | 144.46 | organic | 2018 | West |
| 18226 | 1 | 2018-03-18 | 1.73 | 210067.47 | 33437.98 | 47165.54 | 110.40 | 129353.55 | 73163.12 | 56020.24 | 170.19 | organic | 2018 | West |
| 18227 | 2 | 2018-03-11 | 1.63 | 264691.87 | 27566.25 | 60383.57 | 276.42 | 176465.63 | 107174.93 | 69290.70 | 0.00 | organic | 2018 | West |
| 18228 | 3 | 2018-03-04 | 1.46 | 347373.17 | 25990.60 | 71213.19 | 79.01 | 250090.37 | 85835.17 | 164087.33 | 167.87 | organic | 2018 | West |
| 18229 | 4 | 2018-02-25 | 1.49 | 301985.61 | 34200.18 | 49139.34 | 85.58 | 218560.51 | 99989.62 | 118314.77 | 256.12 | organic | 2018 | West |
| 18230 | 5 | 2018-02-18 | 1.64 | 224798.60 | 30149.00 | 38800.64 | 123.13 | 155725.83 | 120428.13 | 35257.73 | 39.97 | organic | 2018 | West |
| 18231 | 6 | 2018-02-11 | 1.47 | 275248.53 | 24732.55 | 61713.53 | 243.00 | 188559.45 | 88497.05 | 99810.80 | 251.60 | organic | 2018 | West |
| 18232 | 7 | 2018-02-04 | 1.41 | 283378.47 | 22474.66 | 55360.49 | 133.41 | 205409.91 | 70232.59 | 134666.91 | 510.41 | organic | 2018 | West |
| 18233 | 8 | 2018-01-28 | 1.80 | 185974.53 | 22918.40 | 33051.14 | 93.52 | 129911.47 | 77822.23 | 51986.86 | 102.38 | organic | 2018 | West |
| 18234 | 9 | 2018-01-21 | 1.83 | 189317.99 | 27049.44 | 33561.32 | 439.47 | 128267.76 | 76091.99 | 51947.50 | 228.27 | organic | 2018 | West |
| 18235 | 10 | 2018-01-14 | 1.82 | 207999.67 | 33869.12 | 47435.14 | 433.52 | 126261.89 | 89115.78 | 37133.99 | 12.12 | organic | 2018 | West |
| 18236 | 11 | 2018-01-07 | 1.48 | 297190.60 | 34734.97 | 62967.74 | 157.77 | 199330.12 | 103761.55 | 95544.39 | 24.18 | organic | 2018 | West |
| 18237 | 0 | 2018-03-25 | 1.62 | 15303.40 | 2325.30 | 2171.66 | 0.00 | 10806.44 | 10569.80 | 236.64 | 0.00 | organic | 2018 | WestTexNewMexico |
| 18238 | 1 | 2018-03-18 | 1.56 | 15896.38 | 2055.35 | 1499.55 | 0.00 | 12341.48 | 12114.81 | 226.67 | 0.00 | organic | 2018 | WestTexNewMexico |
| 18239 | 2 | 2018-03-11 | 1.56 | 22128.42 | 2162.67 | 3194.25 | 8.93 | 16762.57 | 16510.32 | 252.25 | 0.00 | organic | 2018 | WestTexNewMexico |
| 18240 | 3 | 2018-03-04 | 1.54 | 17393.30 | 1832.24 | 1905.57 | 0.00 | 13655.49 | 13401.93 | 253.56 | 0.00 | organic | 2018 | WestTexNewMexico |
| 18241 | 4 | 2018-02-25 | 1.57 | 18421.24 | 1974.26 | 2482.65 | 0.00 | 13964.33 | 13698.27 | 266.06 | 0.00 | organic | 2018 | WestTexNewMexico |
| 18242 | 5 | 2018-02-18 | 1.56 | 17597.12 | 1892.05 | 1928.36 | 0.00 | 13776.71 | 13553.53 | 223.18 | 0.00 | organic | 2018 | WestTexNewMexico |
| 18243 | 6 | 2018-02-11 | 1.57 | 15986.17 | 1924.28 | 1368.32 | 0.00 | 12693.57 | 12437.35 | 256.22 | 0.00 | organic | 2018 | WestTexNewMexico |
| 18244 | 7 | 2018-02-04 | 1.63 | 17074.83 | 2046.96 | 1529.20 | 0.00 | 13498.67 | 13066.82 | 431.85 | 0.00 | organic | 2018 | WestTexNewMexico |
| 18245 | 8 | 2018-01-28 | 1.71 | 13888.04 | 1191.70 | 3431.50 | 0.00 | 9264.84 | 8940.04 | 324.80 | 0.00 | organic | 2018 | WestTexNewMexico |
| 18246 | 9 | 2018-01-21 | 1.87 | 13766.76 | 1191.92 | 2452.79 | 727.94 | 9394.11 | 9351.80 | 42.31 | 0.00 | organic | 2018 | WestTexNewMexico |
| 18247 | 10 | 2018-01-14 | 1.93 | 16205.22 | 1527.63 | 2981.04 | 727.01 | 10969.54 | 10919.54 | 50.00 | 0.00 | organic | 2018 | WestTexNewMexico |
| 18248 | 11 | 2018-01-07 | 1.62 | 17489.58 | 2894.77 | 2356.13 | 224.53 | 12014.15 | 11988.14 | 26.01 | 0.00 | organic | 2018 | WestTexNewMexico |

18249 rows × 14 columns
Okay, that's a bit messy to print out every time. Often, we just want to see a small snippet of our dataframe to make sure everything is what we expect. Most people will use the .head() method for this:
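As a rough sketch of what that call looks like, here it is on a small hand-built dataframe (seven Albany rows copied from the output above, trimmed to two columns for illustration):

```python
import pandas as pd

# Seven rows, so that head() has something to cut off
df = pd.DataFrame({
    "Date": ["2015-12-27", "2015-12-20", "2015-12-13", "2015-12-06",
             "2015-11-29", "2015-11-22", "2015-11-15"],
    "AveragePrice": [1.33, 1.35, 0.93, 1.08, 1.28, 1.26, 0.99],
})

# With no argument, head() returns the first five rows
first_rows = df.head()
print(first_rows)
```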
|  | Unnamed: 0 | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 2015-12-27 | 1.33 | 64236.62 | 1036.74 | 54454.85 | 48.16 | 8696.87 | 8603.62 | 93.25 | 0.0 | conventional | 2015 | Albany |
| 1 | 1 | 2015-12-20 | 1.35 | 54876.98 | 674.28 | 44638.81 | 58.33 | 9505.56 | 9408.07 | 97.49 | 0.0 | conventional | 2015 | Albany |
| 2 | 2 | 2015-12-13 | 0.93 | 118220.22 | 794.70 | 109149.67 | 130.50 | 8145.35 | 8042.21 | 103.14 | 0.0 | conventional | 2015 | Albany |
| 3 | 3 | 2015-12-06 | 1.08 | 78992.15 | 1132.00 | 71976.41 | 72.58 | 5811.16 | 5677.40 | 133.76 | 0.0 | conventional | 2015 | Albany |
| 4 | 4 | 2015-11-29 | 1.28 | 51039.60 | 941.48 | 43838.39 | 75.78 | 6183.95 | 5986.26 | 197.69 | 0.0 | conventional | 2015 | Albany |
You can pass a parameter to .head() for how many rows you want, for example df.head(3):
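|  | Unnamed: 0 | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 2015-12-27 | 1.33 | 64236.62 | 1036.74 | 54454.85 | 48.16 | 8696.87 | 8603.62 | 93.25 | 0.0 | conventional | 2015 | Albany |
| 1 | 1 | 2015-12-20 | 1.35 | 54876.98 | 674.28 | 44638.81 | 58.33 | 9505.56 | 9408.07 | 97.49 | 0.0 | conventional | 2015 | Albany |
| 2 | 2 | 2015-12-13 | 0.93 | 118220.22 | 794.70 | 109149.67 | 130.50 | 8145.35 | 8042.21 | 103.14 | 0.0 | conventional | 2015 | Albany |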
Often, you may apply rolling-window types of operations, where the head will wind up containing NaN data, and instead you want to see the end. You can do that too, with .tail():
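Here is a minimal sketch of why the end of a dataframe can be more telling than the start. It computes a 3-row rolling mean on a handful of Albany prices; the column name price3ma is made up for this example, and the window size is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({"AveragePrice": [1.33, 1.35, 0.93, 1.08, 1.28, 1.26]})

# 3-row rolling mean: the first two rows have no complete window,
# so their value is NaN ("price3ma" is a hypothetical column name)
df["price3ma"] = df["AveragePrice"].rolling(3).mean()

print(df.head(3))  # starts with NaN values in price3ma
print(df.tail(3))  # the end of the frame has real numbers
```

Like .head(), .tail() takes the number of rows you want and defaults to five.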
|  | Unnamed: 0 | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 18243 | 6 | 2018-02-11 | 1.57 | 15986.17 | 1924.28 | 1368.32 | 0.00 | 12693.57 | 12437.35 | 256.22 | 0.0 | organic | 2018 | WestTexNewMexico |
| 18244 | 7 | 2018-02-04 | 1.63 | 17074.83 | 2046.96 | 1529.20 | 0.00 | 13498.67 | 13066.82 | 431.85 | 0.0 | organic | 2018 | WestTexNewMexico |
| 18245 | 8 | 2018-01-28 | 1.71 | 13888.04 | 1191.70 | 3431.50 | 0.00 | 9264.84 | 8940.04 | 324.80 | 0.0 | organic | 2018 | WestTexNewMexico |
| 18246 | 9 | 2018-01-21 | 1.87 | 13766.76 | 1191.92 | 2452.79 | 727.94 | 9394.11 | 9351.80 | 42.31 | 0.0 | organic | 2018 | WestTexNewMexico |
| 18247 | 10 | 2018-01-14 | 1.93 | 16205.22 | 1527.63 | 2981.04 | 727.01 | 10969.54 | 10919.54 | 50.00 | 0.0 | organic | 2018 | WestTexNewMexico |
| 18248 | 11 | 2018-01-07 | 1.62 | 17489.58 | 2894.77 | 2356.13 | 224.53 | 12014.15 | 11988.14 | 26.01 | 0.0 | organic | 2018 | WestTexNewMexico |
We can also reference specific columns with dict-like bracket notation, like df['AveragePrice'].
Also, you can use attribute-like dot notation, like df.AveragePrice.
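A short sketch of the two ways to reference a column, on a small hand-built dataframe (values copied from the avocado data; the two-column layout is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "AveragePrice": [1.33, 1.35, 0.93],
    "Total Volume": [64236.62, 54876.98, 118220.22],
})

by_key = df["AveragePrice"]   # dict-like access
by_attr = df.AveragePrice     # attribute-like access, same column

print(by_key.equals(by_attr))

# A column name containing a space can only be reached with brackets
print(df["Total Volume"].head())
```

One practical reason to stick with brackets shows up right in this dataset: a name like Total Volume is not a valid Python attribute name, so dot notation cannot reach it.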
But most people use the dict-like methodology. I am not sure if I have ever seen the attribute-like method used, so probably don't do it; just know that other people might!
A common goal with data analysis is to visualize data. We all love pretty graphs, plus they usually help us generalize data pretty well. So, how might we graph this data? Looking at the data, it's clear that it's organized by date, but also by region, so we could plot line graphs of individual regions over time.
To do this, we'll need matplotlib, which is a popular data visualization library. To get it, let's do:
pip install matplotlib
Next, how might we get an individual region? We'd need to filter for that region column! Let's see how we might do that:
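Here is a minimal sketch of that filter on a small hand-built dataframe. The four rows are copied from the avocado data (two from Albany, two from West) and trimmed to three columns for illustration; the intermediate name mask is just a convenience for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2015-12-27", "2015-12-20", "2018-03-25", "2018-03-18"],
    "AveragePrice": [1.33, 1.35, 1.60, 1.73],
    "region": ["Albany", "Albany", "West", "West"],
})

# Comparing a column to a value gives one True/False per row
mask = df["region"] == "Albany"
print(mask.tolist())

# Indexing the dataframe with that boolean Series keeps only the True rows
albany_df = df[mask]
print(albany_df)
```

Writing it in one line, as df[ df['region'] == "Albany" ], does exactly the same thing without the intermediate variable.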
Ok, so that might look a bit dense, but let's parse that out.
albany_df = df[ df['region'] == "Albany" ]
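We're just saying that albany_df is the df, where the df['region'] column is equal to Albany. The comparison df['region'] == "Albany" gives a True or False value for every row, and indexing df with it keeps only the rows that are True. The result is a new dataframe where this is the case: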
|  | Unnamed: 0 | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 2015-12-27 | 1.33 | 64236.62 | 1036.74 | 54454.85 | 48.16 | 8696.87 | 8603.62 | 93.25 | 0.0 | conventional | 2015 | Albany |
| 1 | 1 | 2015-12-20 | 1.35 | 54876.98 | 674.28 | 44638.81 | 58.33 | 9505.56 | 9408.07 | 97.49 | 0.0 | conventional | 2015 | Albany |
| 2 | 2 | 2015-12-13 | 0.93 | 118220.22 | 794.70 | 109149.67 | 130.50 | 8145.35 | 8042.21 | 103.14 | 0.0 | conventional | 2015 | Albany |
| 3 | 3 | 2015-12-06 | 1.08 | 78992.15 | 1132.00 | 71976.41 | 72.58 | 5811.16 | 5677.40 | 133.76 | 0.0 | conventional | 2015 | Albany |
| 4 | 4 | 2015-11-29 | 1.28 | 51039.60 | 941.48 | 43838.39 | 75.78 | 6183.95 | 5986.26 | 197.69 | 0.0 | conventional | 2015 | Albany |
Okay, so one more thing you will often see is dataframes are "indexed" by something. Let's see what this dataframe is indexed by:
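Checking the index is a matter of looking at the .index attribute. A small sketch, again on a hand-built stand-in for the avocado data (three rows, three columns, for illustration only):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2015-12-27", "2015-12-20", "2018-03-25"],
    "AveragePrice": [1.33, 1.35, 1.60],
    "region": ["Albany", "Albany", "West"],
})
albany_df = df[df["region"] == "Albany"]

# The index holds the row labels; here they are just the row numbers
# the rows had in df
print(albany_df.index)
print(albany_df.index.tolist())
```

How the index object itself is printed differs between Pandas versions, so the labels from .tolist() are the part to look at.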
In this case, the index is worthless to us. It's just incrementing row counts, which we have no use for here. Instead, we should ask ourselves: how is this Albany avocado data organized? How does each row relate to the others? Well, by date. That's the main way this data is organized. So really, we want Date to be our index! We can do this with set_index, as in albany_df.set_index("Date").
| Date | Unnamed: 0 | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2015-12-27 | 0 | 1.33 | 64236.62 | 1036.74 | 54454.85 | 48.16 | 8696.87 | 8603.62 | 93.25 | 0.00 | conventional | 2015 | Albany |
| 2015-12-20 | 1 | 1.35 | 54876.98 | 674.28 | 44638.81 | 58.33 | 9505.56 | 9408.07 | 97.49 | 0.00 | conventional | 2015 | Albany |
| 2015-12-13 | 2 | 0.93 | 118220.22 | 794.70 | 109149.67 | 130.50 | 8145.35 | 8042.21 | 103.14 | 0.00 | conventional | 2015 | Albany |
| 2015-12-06 | 3 | 1.08 | 78992.15 | 1132.00 | 71976.41 | 72.58 | 5811.16 | 5677.40 | 133.76 | 0.00 | conventional | 2015 | Albany |
| 2015-11-29 | 4 | 1.28 | 51039.60 | 941.48 | 43838.39 | 75.78 | 6183.95 | 5986.26 | 197.69 | 0.00 | conventional | 2015 | Albany |
| 2015-11-22 | 5 | 1.26 | 55979.78 | 1184.27 | 48067.99 | 43.61 | 6683.91 | 6556.47 | 127.44 | 0.00 | conventional | 2015 | Albany |
| 2015-11-15 | 6 | 0.99 | 83453.76 | 1368.92 | 73672.72 | 93.26 | 8318.86 | 8196.81 | 122.05 | 0.00 | conventional | 2015 | Albany |
| 2015-11-08 | 7 | 0.98 | 109428.33 | 703.75 | 101815.36 | 80.00 | 6829.22 | 6266.85 | 562.37 | 0.00 | conventional | 2015 | Albany |
| 2015-11-01 | 8 | 1.02 | 99811.42 | 1022.15 | 87315.57 | 85.34 | 11388.36 | 11104.53 | 283.83 | 0.00 | conventional | 2015 | Albany |
| 2015-10-25 | 9 | 1.07 | 74338.76 | 842.40 | 64757.44 | 113.00 | 8625.92 | 8061.47 | 564.45 | 0.00 | conventional | 2015 | Albany |
| 2015-10-18 | 10 | 1.12 | 84843.44 | 924.86 | 75595.85 | 117.07 | 8205.66 | 7877.86 | 327.80 | 0.00 | conventional | 2015 | Albany |
| 2015-10-11 | 11 | 1.28 | 64489.17 | 1582.03 | 52677.92 | 105.32 | 10123.90 | 9866.27 | 257.63 | 0.00 | conventional | 2015 | Albany |
| 2015-10-04 | 12 | 1.31 | 61007.10 | 2268.32 | 49880.67 | 101.36 | 8756.75 | 8379.98 | 376.77 | 0.00 | conventional | 2015 | Albany |
| 2015-09-27 | 13 | 0.99 | 106803.39 | 1204.88 | 99409.21 | 154.84 | 6034.46 | 5888.87 | 145.59 | 0.00 | conventional | 2015 | Albany |
| 2015-09-20 | 14 | 1.33 | 69759.01 | 1028.03 | 59313.12 | 150.50 | 9267.36 | 8489.10 | 778.26 | 0.00 | conventional | 2015 | Albany |
| 2015-09-13 | 15 | 1.28 | 76111.27 | 985.73 | 65696.86 | 142.00 | 9286.68 | 8665.19 | 621.49 | 0.00 | conventional | 2015 | Albany |
| 2015-09-06 | 16 | 1.11 | 99172.96 | 879.45 | 90062.62 | 240.79 | 7990.10 | 7762.87 | 227.23 | 0.00 | conventional | 2015 | Albany |
| 2015-08-30 | 17 | 1.07 | 105693.84 | 689.01 | 94362.67 | 335.43 | 10306.73 | 10218.93 | 87.80 | 0.00 | conventional | 2015 | Albany |
| 2015-08-23 | 18 | 1.34 | 79992.09 | 733.16 | 67933.79 | 444.78 | 10880.36 | 10745.79 | 134.57 | 0.00 | conventional | 2015 | Albany |
| 2015-08-16 | 19 | 1.33 | 80043.78 | 539.65 | 68666.01 | 394.90 | 10443.22 | 10297.68 | 145.54 | 0.00 | conventional | 2015 | Albany |
| 2015-08-09 | 20 | 1.12 | 111140.93 | 584.63 | 100961.46 | 368.95 | 9225.89 | 9116.34 | 109.55 | 0.00 | conventional | 2015 | Albany |
| 2015-08-02 | 21 | 1.45 | 75133.10 | 509.94 | 62035.06 | 741.08 | 11847.02 | 11768.52 | 78.50 | 0.00 | conventional | 2015 | Albany |
| 2015-07-26 | 22 | 1.11 | 106757.10 | 648.75 | 91949.05 | 966.61 | 13192.69 | 13061.53 | 131.16 | 0.00 | conventional | 2015 | Albany |
| 2015-07-19 | 23 | 1.26 | 96617.00 | 1042.10 | 82049.40 | 2238.02 | 11287.48 | 11103.49 | 183.99 | 0.00 | conventional | 2015 | Albany |
| 2015-07-12 | 24 | 1.05 | 124055.31 | 672.25 | 94693.52 | 4257.64 | 24431.90 | 24290.08 | 108.49 | 33.33 | conventional | 2015 | Albany |
| 2015-07-05 | 25 | 1.35 | 109252.12 | 869.45 | 72600.55 | 5883.16 | 29898.96 | 29663.19 | 235.77 | 0.00 | conventional | 2015 | Albany |
| 2015-06-28 | 26 | 1.37 | 89534.81 | 664.23 | 57545.79 | 4662.71 | 26662.08 | 26311.76 | 350.32 | 0.00 | conventional | 2015 | Albany |
| 2015-06-21 | 27 | 1.27 | 104849.39 | 804.01 | 76688.55 | 5481.18 | 21875.65 | 21662.00 | 213.65 | 0.00 | conventional | 2015 | Albany |
| 2015-06-14 | 28 | 1.32 | 89631.30 | 850.58 | 55400.94 | 4377.19 | 29002.59 | 28343.14 | 659.45 | 0.00 | conventional | 2015 | Albany |
| 2015-06-07 | 29 | 1.07 | 122743.06 | 656.71 | 99220.82 | 90.32 | 22775.21 | 22314.99 | 460.22 | 0.00 | conventional | 2015 | Albany |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2017-04-30 | 35 | 1.74 | 3046.63 | 388.81 | 280.28 | 0.00 | 2377.54 | 2377.54 | 0.00 | 0.00 | organic | 2017 | Albany |
| 2017-04-23 | 36 | 1.92 | 2087.60 | 110.25 | 182.56 | 0.00 | 1794.79 | 1794.79 | 0.00 | 0.00 | organic | 2017 | Albany |
| 2017-04-16 | 37 | 1.85 | 2886.48 | 265.82 | 203.84 | 0.00 | 2416.82 | 2416.82 | 0.00 | 0.00 | organic | 2017 | Albany |
| 2017-04-09 | 38 | 1.92 | 2209.82 | 159.65 | 189.67 | 0.00 | 1860.50 | 1860.50 | 0.00 | 0.00 | organic | 2017 | Albany |
| 2017-04-02 | 39 | 1.86 | 3492.87 | 885.46 | 362.37 | 0.00 | 2245.04 | 2245.04 | 0.00 | 0.00 | organic | 2017 | Albany |
| 2017-03-26 | 40 | 2.02 | 2250.22 | 166.49 | 263.32 | 0.00 | 1820.41 | 1820.41 | 0.00 | 0.00 | organic | 2017 | Albany |
| 2017-03-19 | 41 | 1.87 | 2763.38 | 503.14 | 175.98 | 0.00 | 2084.26 | 2084.26 | 0.00 | 0.00 | organic | 2017 | Albany |
| 2017-03-12 | 42 | 1.97 | 2001.95 | 123.51 | 206.64 | 0.00 | 1671.80 | 1671.80 | 0.00 | 0.00 | organic | 2017 | Albany |
| 2017-03-05 | 43 | 1.84 | 2228.14 | 241.00 | 208.79 | 0.00 | 1778.35 | 1778.35 | 0.00 | 0.00 | organic | 2017 | Albany |
| 2017-02-26 | 44 | 1.71 | 2185.96 | 508.31 | 240.10 | 0.00 | 1437.55 | 1437.55 | 0.00 | 0.00 | organic | 2017 | Albany |
| 2017-02-19 | 45 | 1.67 | 2523.56 | 1049.50 | 141.41 | 0.00 | 1332.65 | 1332.65 | 0.00 | 0.00 | organic | 2017 | Albany |
| 2017-02-12 | 46 | 1.78 | 1806.40 | 119.52 | 170.57 | 0.00 | 1516.31 | 1516.31 | 0.00 | 0.00 | organic | 2017 | Albany |
| 2017-02-05 | 47 | 1.72 | 1753.35 | 26.75 | 223.33 | 0.00 | 1503.27 | 1503.27 | 0.00 | 0.00 | organic | 2017 | Albany |
| 2017-01-29 | 48 | 1.86 | 1795.81 | 32.53 | 123.14 | 0.00 | 1640.14 | 1640.14 | 0.00 | 0.00 | organic | 2017 | Albany |
| 2017-01-22 | 49 | 1.82 | 1897.07 | 78.83 | 128.24 | 0.00 | 1690.00 | 1690.00 | 0.00 | 0.00 | organic | 2017 | Albany |
| 2017-01-15 | 50 | 1.84 | 1982.65 | 82.30 | 328.02 | 0.00 | 1572.33 | 1572.33 | 0.00 | 0.00 | organic | 2017 | Albany |
| 2017-01-08 | 51 | 1.94 | 2229.52 | 63.46 | 478.31 | 0.00 | 1687.75 | 1687.75 | 0.00 | 0.00 | organic | 2017 | Albany |
| 2017-01-01 | 52 | 1.87 | 1376.70 | 71.65 | 192.63 | 0.00 | 1112.42 | 1112.42 | 0.00 | 0.00 | organic | 2017 | Albany |
| 2018-03-25 | 0 | 1.71 | 2321.82 | 42.95 | 272.41 | 0.00 | 2006.46 | 1996.46 | 10.00 | 0.00 | organic | 2018 | Albany |
| 2018-03-18 | 1 | 1.66 | 3154.45 | 275.89 | 297.96 | 0.00 | 2580.60 | 2577.27 | 3.33 | 0.00 | organic | 2018 | Albany |
| 2018-03-11 | 2 | 1.68 | 2570.52 | 131.67 | 229.56 | 0.00 | 2209.29 | 2209.29 | 0.00 | 0.00 | organic | 2018 | Albany |
| 2018-03-04 | 3 | 1.48 | 3851.30 | 311.55 | 296.77 | 0.00 | 3242.98 | 3239.65 | 3.33 | 0.00 | organic | 2018 | Albany |
| 2018-02-25 | 4 | 1.56 | 5356.63 | 816.56 | 532.59 | 0.00 | 4007.48 | 4007.48 | 0.00 | 0.00 | organic | 2018 | Albany |
| 2018-02-18 | 5 | 1.43 | 7566.17 | 4314.30 | 251.85 | 0.00 | 3000.02 | 3000.02 | 0.00 | 0.00 | organic | 2018 | Albany |
| 2018-02-11 | 6 | 1.43 | 3817.93 | 59.18 | 289.85 | 0.00 | 3468.90 | 3468.90 | 0.00 | 0.00 | organic | 2018 | Albany |
| 2018-02-04 | 7 | 1.52 | 4124.96 | 118.38 | 420.36 | 0.00 | 3586.22 | 3586.22 | 0.00 | 0.00 | organic | 2018 | Albany |
| 2018-01-28 | 8 | 1.32 | 6987.56 | 433.66 | 374.96 | 0.00 | 6178.94 | 6178.94 | 0.00 | 0.00 | organic | 2018 | Albany |
| 2018-01-21 | 9 | 1.54 | 3346.54 | 14.67 | 253.01 | 0.00 | 3078.86 | 3078.86 | 0.00 | 0.00 | organic | 2018 | Albany |
| 2018-01-14 | 10 | 1.47 | 4140.95 | 7.30 | 301.87 | 0.00 | 3831.78 | 3831.78 | 0.00 | 0.00 | organic | 2018 | Albany |
| 2018-01-07 | 11 | 1.54 | 4816.90 | 43.51 | 412.17 | 0.00 | 4361.22 | 4357.89 | 3.33 | 0.00 | organic | 2018 | Albany |

338 rows × 13 columns
Wait, what? Why did it print out like that? Part of the benefit of the notebook is that this happened to us, but I would explain this either way. Some of the methods in pandas will modify your dataframe in place, but MOST are going to simply do the thing and return a new dataframe. So if we just check real quick:
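A minimal sketch of that check, using a small hand-built stand-in for albany_df (three rows and two columns copied from the data above; the name returned is just for this example):

```python
import pandas as pd

albany_df = pd.DataFrame({
    "Date": ["2015-12-27", "2015-12-20", "2015-12-13"],
    "AveragePrice": [1.33, 1.35, 0.93],
})

# set_index hands back a new dataframe...
returned = albany_df.set_index("Date")
print(returned.head())

# ...while the dataframe it was called on still has Date as an
# ordinary column and the old numeric index
print(albany_df.head())
print("Date" in albany_df.columns)
```

In a notebook, a call on the last line of a cell has its return value displayed, which is why the re-indexed table showed up even though nothing was assigned.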
|  | Unnamed: 0 | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 2015-12-27 | 1.33 | 64236.62 | 1036.74 | 54454.85 | 48.16 | 8696.87 | 8603.62 | 93.25 | 0.0 | conventional | 2015 | Albany |
| 1 | 1 | 2015-12-20 | 1.35 | 54876.98 | 674.28 | 44638.81 | 58.33 | 9505.56 | 9408.07 | 97.49 | 0.0 | conventional | 2015 | Albany |
| 2 | 2 | 2015-12-13 | 0.93 | 118220.22 | 794.70 | 109149.67 | 130.50 | 8145.35 | 8042.21 | 103.14 | 0.0 | conventional | 2015 | Albany |
| 3 | 3 | 2015-12-06 | 1.08 | 78992.15 | 1132.00 | 71976.41 | 72.58 | 5811.16 | 5677.40 | 133.76 | 0.0 | conventional | 2015 | Albany |
| 4 | 4 | 2015-11-29 | 1.28 | 51039.60 | 941.48 | 43838.39 | 75.78 | 6183.95 | 5986.26 | 197.69 | 0.0 | conventional | 2015 | Albany |
We can see that albany_df is not impacted. There are two ways we can handle this. One is to re-define:
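Re-defining just means assigning the returned dataframe back to the same name. A short sketch on the same kind of hand-built stand-in (three rows, two columns, for illustration):

```python
import pandas as pd

albany_df = pd.DataFrame({
    "Date": ["2015-12-27", "2015-12-20", "2015-12-13"],
    "AveragePrice": [1.33, 1.35, 0.93],
})

# Overwrite the old name with the re-indexed dataframe
albany_df = albany_df.set_index("Date")

print(albany_df.head())
print(albany_df.index.name)
```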
| Date | Unnamed: 0 | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2015-12-27 | 0 | 1.33 | 64236.62 | 1036.74 | 54454.85 | 48.16 | 8696.87 | 8603.62 | 93.25 | 0.0 | conventional | 2015 | Albany |
| 2015-12-20 | 1 | 1.35 | 54876.98 | 674.28 | 44638.81 | 58.33 | 9505.56 | 9408.07 | 97.49 | 0.0 | conventional | 2015 | Albany |
| 2015-12-13 | 2 | 0.93 | 118220.22 | 794.70 | 109149.67 | 130.50 | 8145.35 | 8042.21 | 103.14 | 0.0 | conventional | 2015 | Albany |
| 2015-12-06 | 3 | 1.08 | 78992.15 | 1132.00 | 71976.41 | 72.58 | 5811.16 | 5677.40 | 133.76 | 0.0 | conventional | 2015 | Albany |
| 2015-11-29 | 4 | 1.28 | 51039.60 | 941.48 | 43838.39 | 75.78 | 6183.95 | 5986.26 | 197.69 | 0.0 | conventional | 2015 | Albany |
The other option we can use is the inplace parameter. Something like:
albany_df.set_index("Date", inplace=True)
would also work. Okay, now that we've done that, let's plot!
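A minimal sketch of the plot, assuming matplotlib is installed and using five Albany rows copied from the data above as a stand-in for the full albany_df. The Agg backend line is only there so the sketch also runs outside a notebook, without a display; in a notebook you can leave it out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook

import pandas as pd

albany_df = pd.DataFrame({
    "Date": ["2015-12-27", "2015-12-20", "2015-12-13", "2015-12-06", "2015-11-29"],
    "AveragePrice": [1.33, 1.35, 0.93, 1.08, 1.28],
    "year": [2015, 2015, 2015, 2015, 2015],
}).set_index("Date")

# Pick one column, then plot it: the index (Date) ends up on the x axis
ax = albany_df["AveragePrice"].plot()
```

In a notebook the figure is shown under the cell; in a script you would finish with matplotlib.pyplot.show() or save the figure to a file.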
When we call .plot() on a dataframe, it is just assumed that the x axis will be your index, and then y will be all of your columns, which is why we specified one column in particular.
This graph is a bit messy, however, especially with the dates, which also look out of order and such. Let's see if we can't carry on with this in the next tutorial!
The next tutorial: Graphing/Visualization - Data Analysis With Python 3 And Pandas