Introduction - Data Analysis with Python 3 and Pandas

Welcome to a data analysis tutorial with Python and the Pandas data analysis library.

The field of data analytics is quite large and what you might be aiming to do with it is likely to never match up exactly to any tutorial. With that in mind, I think the best way for us to approach learning data analysis with Python is simply by example. My plan here is to find some datasets and do some of the common data analysis tasks, using the Pandas package, to hopefully get you familiar enough with the package to work with it on your own.

To begin, let's make sure we're all on the same page.

I will be using Python 3.7 and Pandas 0.24.1.

You can likely follow along with different versions of things, just know there may be minor differences that you will need to work out. With Pandas, I have personally found I can usually google my errors with a high degree of success.
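
If you want to compare your setup against the versions above, a quick check like the following should do. This is just a sketch; the variable names are my own and nothing later depends on them.

```python
import sys

import pandas as pd

# Report the interpreter and pandas versions in use, so differences
# from the versions used in this tutorial are easy to spot.
python_version = "{}.{}".format(sys.version_info.major, sys.version_info.minor)
pandas_version = pd.__version__

print("Python:", python_version)
print("Pandas:", pandas_version)
```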

So, after you've got Python and done a pip install pandas, you're ready!

There will be quite a few packages and libraries that we install through the course of this series. If you'd rather focus on the code than on installing packages, you can check out a pre-compiled and optimized distribution of Python from ActiveState, which will have everything you need to follow along with this series. Get ActivePython.

Let's jump in!

Oh, wait, we probably should have a dataset too.

The internet is stuffed full of datasets, so there are many to choose from. I am personally going to be using datasets from Kaggle.

If you are not familiar, Kaggle is a data analysis competitions website. I think that, if you're looking to practice real-world data analysis challenges, Kaggle is the single best place to do it, even if you're not looking to compete.

Many, if not most, of the competitions on Kaggle are actual company problems: the kind of thing I often get asked to do in my contract work, or that you might be asked to do if you find employment as a data analyst. These are typically "unsolved" types of problems, rather than the simpler, solved issues that you will typically encounter in tutorials.

I don't think we're quite ready to jump into anything serious, so let's find a simpler dataset to start with. To find datasets, check out the Kaggle Datasets. Tons of goodies here.

To begin, let's check out Avocado Prices. I absolutely adore avocados! Did you know avocados are a fruit? Most closely classified as ... a berry! Imagine getting some "mixed berries" flavored thing, and there's avocado in there. Hah!

Anyway, download that dataset. You will need to login/create an account to use Kaggle, but you should. If for whatever reason you don't want to, or the dataset is missing, I will also host it here: Avocado Prices.

Unzip the file using whatever you use to zip/unzip things, and you're left with a CSV file.

CSV files are a highly common file type in data analysis. A CSV is meant to be organized by columns and rows: the file itself has values separated by commas (hey, is that where the name CSV comes from!?!), and the rows are separated by new lines in the document. So, let's read this CSV in with Pandas.
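
Before loading the real file, the rows-and-commas layout described above can be sketched with a tiny, hand-typed CSV. The text and the name sample_df below are made up for illustration (the values are borrowed from the first few avocado rows), and StringIO simply stands in for a file on disk.

```python
import io

import pandas as pd

# A tiny, hand-typed CSV: the first line holds the column names,
# commas separate the values, and each new line is a new row.
csv_text = """Date,AveragePrice,region
2015-12-27,1.33,Albany
2015-12-20,1.35,Albany
2015-12-13,0.93,Albany
"""

# read_csv accepts a file path or a file-like object, so StringIO
# lets us parse the text above without touching the disk.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)          # (number of rows, number of columns)
print(list(sample_df.columns))  # column names taken from the first line
print(sample_df)
```

Each comma-separated value lands in its own column and each line becomes one row, which is the same thing that happens when we point read_csv at a file.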

For now, let's make sure our file is in the same working directory as our Python script or in a directory like "datasets." I will be doing the latter, but you can feel free to do as you wish. So, to begin, we have a file called avocado.csv and we want to load that into pandas. It's a CSV file, so it's already in a sort of columns-and-rows format; we just want to load that into a pandas dataframe.

To do this, we will use a function called read_csv. Let's see how that works. I am going to be doing this in a Jupyter Notebook. You can use whatever editor you like, but Jupyter notebooks are pretty useful for data analysis and just general poking around with data. To use them, you can just do:

pip install jupyterlab

Then in a terminal/command prompt, you can do:

jupyter lab

Then you can go file > new > notebook, pick Python 3, and you're good to go! Let's start by loading in a file.

import pandas as pd  # convention to import and use pandas like this

df = pd.read_csv("datasets/avocado.csv")  # df stands for dataframe. Also a common convention to call this df

A dataframe is a type of pandas object that is basically a "table"-like object with columns and rows, on which we can also perform various calculations, statistical operations, etc. We can print it out:

df

       Unnamed: 0        Date  AveragePrice  Total Volume       4046       4225     4770  Total Bags  Small Bags  Large Bags  XLarge Bags  type          year  region
    0           0  2015-12-27          1.33      64236.62    1036.74   54454.85    48.16     8696.87     8603.62       93.25         0.00  conventional  2015  Albany
    1           1  2015-12-20          1.35      54876.98     674.28   44638.81    58.33     9505.56     9408.07       97.49         0.00  conventional  2015  Albany
    2           2  2015-12-13          0.93     118220.22     794.70  109149.67   130.50     8145.35     8042.21      103.14         0.00  conventional  2015  Albany
    3           3  2015-12-06          1.08      78992.15    1132.00   71976.41    72.58     5811.16     5677.40      133.76         0.00  conventional  2015  Albany
    4           4  2015-11-29          1.28      51039.60     941.48   43838.39    75.78     6183.95     5986.26      197.69         0.00  conventional  2015  Albany
    5           5  2015-11-22          1.26      55979.78    1184.27   48067.99    43.61     6683.91     6556.47      127.44         0.00  conventional  2015  Albany
    6           6  2015-11-15          0.99      83453.76    1368.92   73672.72    93.26     8318.86     8196.81      122.05         0.00  conventional  2015  Albany
    7           7  2015-11-08          0.98     109428.33     703.75  101815.36    80.00     6829.22     6266.85      562.37         0.00  conventional  2015  Albany
    8           8  2015-11-01          1.02      99811.42    1022.15   87315.57    85.34    11388.36    11104.53      283.83         0.00  conventional  2015  Albany
    9           9  2015-10-25          1.07      74338.76     842.40   64757.44   113.00     8625.92     8061.47      564.45         0.00  conventional  2015  Albany
   10          10  2015-10-18          1.12      84843.44     924.86   75595.85   117.07     8205.66     7877.86      327.80         0.00  conventional  2015  Albany
   11          11  2015-10-11          1.28      64489.17    1582.03   52677.92   105.32    10123.90     9866.27      257.63         0.00  conventional  2015  Albany
   12          12  2015-10-04          1.31      61007.10    2268.32   49880.67   101.36     8756.75     8379.98      376.77         0.00  conventional  2015  Albany
   13          13  2015-09-27          0.99     106803.39    1204.88   99409.21   154.84     6034.46     5888.87      145.59         0.00  conventional  2015  Albany
   14          14  2015-09-20          1.33      69759.01    1028.03   59313.12   150.50     9267.36     8489.10      778.26         0.00  conventional  2015  Albany
   15          15  2015-09-13          1.28      76111.27     985.73   65696.86   142.00     9286.68     8665.19      621.49         0.00  conventional  2015  Albany
   16          16  2015-09-06          1.11      99172.96     879.45   90062.62   240.79     7990.10     7762.87      227.23         0.00  conventional  2015  Albany
   17          17  2015-08-30          1.07     105693.84     689.01   94362.67   335.43    10306.73    10218.93       87.80         0.00  conventional  2015  Albany
   18          18  2015-08-23          1.34      79992.09     733.16   67933.79   444.78    10880.36    10745.79      134.57         0.00  conventional  2015  Albany
   19          19  2015-08-16          1.33      80043.78     539.65   68666.01   394.90    10443.22    10297.68      145.54         0.00  conventional  2015  Albany
   20          20  2015-08-09          1.12     111140.93     584.63  100961.46   368.95     9225.89     9116.34      109.55         0.00  conventional  2015  Albany
   21          21  2015-08-02          1.45      75133.10     509.94   62035.06   741.08    11847.02    11768.52       78.50         0.00  conventional  2015  Albany
   22          22  2015-07-26          1.11     106757.10     648.75   91949.05   966.61    13192.69    13061.53      131.16         0.00  conventional  2015  Albany
   23          23  2015-07-19          1.26      96617.00    1042.10   82049.40  2238.02    11287.48    11103.49      183.99         0.00  conventional  2015  Albany
   24          24  2015-07-12          1.05     124055.31     672.25   94693.52  4257.64    24431.90    24290.08      108.49        33.33  conventional  2015  Albany
   25          25  2015-07-05          1.35     109252.12     869.45   72600.55  5883.16    29898.96    29663.19      235.77         0.00  conventional  2015  Albany
   26          26  2015-06-28          1.37      89534.81     664.23   57545.79  4662.71    26662.08    26311.76      350.32         0.00  conventional  2015  Albany
   27          27  2015-06-21          1.27     104849.39     804.01   76688.55  5481.18    21875.65    21662.00      213.65         0.00  conventional  2015  Albany
   28          28  2015-06-14          1.32      89631.30     850.58   55400.94  4377.19    29002.59    28343.14      659.45         0.00  conventional  2015  Albany
   29          29  2015-06-07          1.07     122743.06     656.71   99220.82    90.32    22775.21    22314.99      460.22         0.00  conventional  2015  Albany
  ...         ...         ...           ...           ...        ...        ...      ...         ...         ...         ...          ...  ...           ...   ...
18219           6  2018-02-11          1.56    1317000.47   98465.26  270798.27  1839.80   945638.02   768242.42   177144.00       251.60  organic       2018  TotalUS
18220           7  2018-02-04          1.53    1384683.41  117922.52  287724.61  1703.52   977084.84   774695.74   201878.69       510.41  organic       2018  TotalUS
18221           8  2018-01-28          1.61    1336979.09  118616.17  280080.34  1270.61   936859.49   796104.27   140652.84       102.38  organic       2018  TotalUS
18222           9  2018-01-21          1.63    1283987.65  108705.28  259172.13  1490.02   914409.26   710654.40   203526.59       228.27  organic       2018  TotalUS
18223          10  2018-01-14          1.59    1476651.08  145680.62  323669.83  1580.01  1005593.78   858772.69   146808.97        12.12  organic       2018  TotalUS
18224          11  2018-01-07          1.51    1517332.70  129541.43  296490.29  1289.07  1089861.24   915452.78   174381.57        26.89  organic       2018  TotalUS
18225           0  2018-03-25          1.60     271723.08   26996.28   77861.39   117.56   166747.85    87108.00    79495.39       144.46  organic       2018  West
18226           1  2018-03-18          1.73     210067.47   33437.98   47165.54   110.40   129353.55    73163.12    56020.24       170.19  organic       2018  West
18227           2  2018-03-11          1.63     264691.87   27566.25   60383.57   276.42   176465.63   107174.93    69290.70         0.00  organic       2018  West
18228           3  2018-03-04          1.46     347373.17   25990.60   71213.19    79.01   250090.37    85835.17   164087.33       167.87  organic       2018  West
18229           4  2018-02-25          1.49     301985.61   34200.18   49139.34    85.58   218560.51    99989.62   118314.77       256.12  organic       2018  West
18230           5  2018-02-18          1.64     224798.60   30149.00   38800.64   123.13   155725.83   120428.13    35257.73        39.97  organic       2018  West
18231           6  2018-02-11          1.47     275248.53   24732.55   61713.53   243.00   188559.45    88497.05    99810.80       251.60  organic       2018  West
18232           7  2018-02-04          1.41     283378.47   22474.66   55360.49   133.41   205409.91    70232.59   134666.91       510.41  organic       2018  West
18233           8  2018-01-28          1.80     185974.53   22918.40   33051.14    93.52   129911.47    77822.23    51986.86       102.38  organic       2018  West
18234           9  2018-01-21          1.83     189317.99   27049.44   33561.32   439.47   128267.76    76091.99    51947.50       228.27  organic       2018  West
18235          10  2018-01-14          1.82     207999.67   33869.12   47435.14   433.52   126261.89    89115.78    37133.99        12.12  organic       2018  West
18236          11  2018-01-07          1.48     297190.60   34734.97   62967.74   157.77   199330.12   103761.55    95544.39        24.18  organic       2018  West
18237           0  2018-03-25          1.62      15303.40    2325.30    2171.66     0.00    10806.44    10569.80      236.64         0.00  organic       2018  WestTexNewMexico
18238           1  2018-03-18          1.56      15896.38    2055.35    1499.55     0.00    12341.48    12114.81      226.67         0.00  organic       2018  WestTexNewMexico
18239           2  2018-03-11          1.56      22128.42    2162.67    3194.25     8.93    16762.57    16510.32      252.25         0.00  organic       2018  WestTexNewMexico
18240           3  2018-03-04          1.54      17393.30    1832.24    1905.57     0.00    13655.49    13401.93      253.56         0.00  organic       2018  WestTexNewMexico
18241           4  2018-02-25          1.57      18421.24    1974.26    2482.65     0.00    13964.33    13698.27      266.06         0.00  organic       2018  WestTexNewMexico
18242           5  2018-02-18          1.56      17597.12    1892.05    1928.36     0.00    13776.71    13553.53      223.18         0.00  organic       2018  WestTexNewMexico
18243           6  2018-02-11          1.57      15986.17    1924.28    1368.32     0.00    12693.57    12437.35      256.22         0.00  organic       2018  WestTexNewMexico
18244           7  2018-02-04          1.63      17074.83    2046.96    1529.20     0.00    13498.67    13066.82      431.85         0.00  organic       2018  WestTexNewMexico
18245           8  2018-01-28          1.71      13888.04    1191.70    3431.50     0.00     9264.84     8940.04      324.80         0.00  organic       2018  WestTexNewMexico
18246           9  2018-01-21          1.87      13766.76    1191.92    2452.79   727.94     9394.11     9351.80       42.31         0.00  organic       2018  WestTexNewMexico
18247          10  2018-01-14          1.93      16205.22    1527.63    2981.04   727.01    10969.54    10919.54       50.00         0.00  organic       2018  WestTexNewMexico
18248          11  2018-01-07          1.62      17489.58    2894.77    2356.13   224.53    12014.15    11988.14       26.01         0.00  organic       2018  WestTexNewMexico

18249 rows × 14 columns

Okay, that's a bit messy to print out every time. Often, we just want to see a small snippet of our dataframe to make sure everything is what we expect. Most people will use the .head() method for this:

df.head()

       Unnamed: 0        Date  AveragePrice  Total Volume       4046       4225     4770  Total Bags  Small Bags  Large Bags  XLarge Bags  type          year  region
    0           0  2015-12-27          1.33      64236.62    1036.74   54454.85    48.16     8696.87     8603.62       93.25          0.0  conventional  2015  Albany
    1           1  2015-12-20          1.35      54876.98     674.28   44638.81    58.33     9505.56     9408.07       97.49          0.0  conventional  2015  Albany
    2           2  2015-12-13          0.93     118220.22     794.70  109149.67   130.50     8145.35     8042.21      103.14          0.0  conventional  2015  Albany
    3           3  2015-12-06          1.08      78992.15    1132.00   71976.41    72.58     5811.16     5677.40      133.76          0.0  conventional  2015  Albany
    4           4  2015-11-29          1.28      51039.60     941.48   43838.39    75.78     6183.95     5986.26      197.69          0.0  conventional  2015  Albany

You can pass a parameter to head, which is how many rows you want (the default is 5). Like:

df.head(3)

       Unnamed: 0        Date  AveragePrice  Total Volume       4046       4225     4770  Total Bags  Small Bags  Large Bags  XLarge Bags  type          year  region
    0           0  2015-12-27          1.33      64236.62    1036.74   54454.85    48.16     8696.87     8603.62       93.25          0.0  conventional  2015  Albany
    1           1  2015-12-20          1.35      54876.98     674.28   44638.81    58.33     9505.56     9408.07       97.49          0.0  conventional  2015  Albany
    2           2  2015-12-13          0.93     118220.22     794.70  109149.67   130.50     8145.35     8042.21      103.14          0.0  conventional  2015  Albany

Often, you may apply rolling-window types of operations, where the head will wind up containing NaN data, and instead you want to see the end. You can do that too, with .tail():

df.tail(6)

       Unnamed: 0        Date  AveragePrice  Total Volume       4046       4225     4770  Total Bags  Small Bags  Large Bags  XLarge Bags  type          year  region
18243           6  2018-02-11          1.57      15986.17    1924.28    1368.32     0.00    12693.57    12437.35      256.22          0.0  organic       2018  WestTexNewMexico
18244           7  2018-02-04          1.63      17074.83    2046.96    1529.20     0.00    13498.67    13066.82      431.85          0.0  organic       2018  WestTexNewMexico
18245           8  2018-01-28          1.71      13888.04    1191.70    3431.50     0.00     9264.84     8940.04      324.80          0.0  organic       2018  WestTexNewMexico
18246           9  2018-01-21          1.87      13766.76    1191.92    2452.79   727.94     9394.11     9351.80       42.31          0.0  organic       2018  WestTexNewMexico
18247          10  2018-01-14          1.93      16205.22    1527.63    2981.04   727.01    10969.54    10919.54       50.00          0.0  organic       2018  WestTexNewMexico
18248          11  2018-01-07          1.62      17489.58    2894.77    2356.13   224.53    12014.15    11988.14       26.01          0.0  organic       2018  WestTexNewMexico
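
The rolling-window situation mentioned above can be sketched on a tiny, made-up frame. The prices and the column names here are invented for illustration; the point is only that the first rows of a rolling result are NaN, which is when looking at the tail is more useful than the head.

```python
import pandas as pd

# Made-up prices, purely for illustration.
prices = pd.DataFrame({"price": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

# A 3-row rolling mean needs three values before it can produce a
# result, so the first two rows come out as NaN.
prices["rolling_mean"] = prices["price"].rolling(3).mean()

print(prices.head(3))  # the first rows, including the NaN values
print(prices.tail(3))  # the last rows, where the window is full
```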

We can also reference specific columns, like:

df['AveragePrice'].head()
0    1.33
1    1.35
2    0.93
3    1.08
4    1.28
Name: AveragePrice, dtype: float64

Also, you can use attribute-like dot notation like:

df.AveragePrice.head()
0    1.33
1    1.35
2    0.93
3    1.08
4    1.28
Name: AveragePrice, dtype: float64
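
As a small sketch of how the two access styles compare, here is a made-up frame that borrows two column names from the avocado data. The frame and the variable names are my own; the point is that the attribute style only works when the column name is a valid Python identifier.

```python
import pandas as pd

# Made-up frame that borrows two column names from the avocado data.
sample_df = pd.DataFrame({
    "AveragePrice": [1.33, 1.35, 0.93],
    "Total Volume": [64236.62, 54876.98, 118220.22],
})

# Both styles return the same Series when the column name is a
# valid Python identifier.
by_key = sample_df["AveragePrice"]
by_attribute = sample_df.AveragePrice
print(by_key.equals(by_attribute))

# A name containing a space cannot be typed as an attribute
# (sample_df.Total Volume is a syntax error), so the dict-like
# form is the one to use here.
volume = sample_df["Total Volume"]
print(volume.head())
```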

Most people use the dict-like methodology, though. I am not sure if I have ever seen the attribute-like method, so probably don't do it; just know that other people might!

A common goal with data analysis is to visualize data. We all love pretty graphs, plus they usually help us generalize data pretty well. So, how might we graph this data? Looking at the data, it's clear that it's organized by date, but also by region, so we could plot line graphs of individual regions over time.

To do this, we'll need matplotlib, which is a popular data visualization library. To get it, let's do:

pip install matplotlib

Next, how might we get an individual region? We'd need to filter for that region column! Let's see how we might do that:

albany_df = df[df['region']=="Albany"]

Ok, so that might look a bit dense, but let's parse that out.

albany_df = df[ df['region'] == "Albany" ]

We're just saying that the albany_df is the df, where the df['region'] column is equal to Albany. The result is a new dataframe where this is the case:

albany_df.head()

       Unnamed: 0        Date  AveragePrice  Total Volume       4046       4225     4770  Total Bags  Small Bags  Large Bags  XLarge Bags  type          year  region
    0           0  2015-12-27          1.33      64236.62    1036.74   54454.85    48.16     8696.87     8603.62       93.25          0.0  conventional  2015  Albany
    1           1  2015-12-20          1.35      54876.98     674.28   44638.81    58.33     9505.56     9408.07       97.49          0.0  conventional  2015  Albany
    2           2  2015-12-13          0.93     118220.22     794.70  109149.67   130.50     8145.35     8042.21      103.14          0.0  conventional  2015  Albany
    3           3  2015-12-06          1.08      78992.15    1132.00   71976.41    72.58     5811.16     5677.40      133.76          0.0  conventional  2015  Albany
    4           4  2015-11-29          1.28      51039.60     941.48   43838.39    75.78     6183.95     5986.26      197.69          0.0  conventional  2015  Albany
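
To unpack the filtering step a little further, here is a sketch that splits it into two parts on a made-up frame. The rows and the names is_albany and albany_only are invented for illustration; the region names are reused from the avocado data.

```python
import pandas as pd

# Made-up rows, reusing region names that appear in the avocado data.
sample_df = pd.DataFrame({
    "region": ["Albany", "West", "Albany", "TotalUS"],
    "AveragePrice": [1.33, 1.60, 1.35, 1.56],
})

# Step 1: the comparison yields one True/False value per row.
is_albany = sample_df["region"] == "Albany"
print(is_albany.tolist())

# Step 2: indexing with that boolean Series keeps only the True rows.
albany_only = sample_df[is_albany]
print(albany_only)
```

Notice that the rows that survive keep their original row labels, so the filtered frame's index is no longer a neat 0, 1, 2, ... sequence.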

Okay, so one more thing you will often see is dataframes are "indexed" by something. Let's see what this dataframe is indexed by:

albany_df.index
Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9,
            ...
            17603, 17604, 17605, 17606, 17607, 17608, 17609, 17610, 17611,
            17612],
           dtype='int64', length=338)

In this case, the index is worthless to us. It's just incrementing row counts, which we have no use for here. Instead, we should ask ourselves, how is this Albany avocado data organized? How does each row relate to the other? Well, by date. That's the main way this data is organized. So really, we want Date to be our index! We can do this with set_index.

albany_df.set_index("Date")

            Unnamed: 0  AveragePrice  Total Volume       4046       4225     4770  Total Bags  Small Bags  Large Bags  XLarge Bags  type          year  region
Date
2015-12-27           0          1.33      64236.62    1036.74   54454.85    48.16     8696.87     8603.62       93.25         0.00  conventional  2015  Albany
2015-12-20           1          1.35      54876.98     674.28   44638.81    58.33     9505.56     9408.07       97.49         0.00  conventional  2015  Albany
2015-12-13           2          0.93     118220.22     794.70  109149.67   130.50     8145.35     8042.21      103.14         0.00  conventional  2015  Albany
2015-12-06           3          1.08      78992.15    1132.00   71976.41    72.58     5811.16     5677.40      133.76         0.00  conventional  2015  Albany
2015-11-29           4          1.28      51039.60     941.48   43838.39    75.78     6183.95     5986.26      197.69         0.00  conventional  2015  Albany
2015-11-22           5          1.26      55979.78    1184.27   48067.99    43.61     6683.91     6556.47      127.44         0.00  conventional  2015  Albany
2015-11-15           6          0.99      83453.76    1368.92   73672.72    93.26     8318.86     8196.81      122.05         0.00  conventional  2015  Albany
2015-11-08           7          0.98     109428.33     703.75  101815.36    80.00     6829.22     6266.85      562.37         0.00  conventional  2015  Albany
2015-11-01           8          1.02      99811.42    1022.15   87315.57    85.34    11388.36    11104.53      283.83         0.00  conventional  2015  Albany
2015-10-25           9          1.07      74338.76     842.40   64757.44   113.00     8625.92     8061.47      564.45         0.00  conventional  2015  Albany
2015-10-18          10          1.12      84843.44     924.86   75595.85   117.07     8205.66     7877.86      327.80         0.00  conventional  2015  Albany
2015-10-11          11          1.28      64489.17    1582.03   52677.92   105.32    10123.90     9866.27      257.63         0.00  conventional  2015  Albany
2015-10-04          12          1.31      61007.10    2268.32   49880.67   101.36     8756.75     8379.98      376.77         0.00  conventional  2015  Albany
2015-09-27          13          0.99     106803.39    1204.88   99409.21   154.84     6034.46     5888.87      145.59         0.00  conventional  2015  Albany
2015-09-20          14          1.33      69759.01    1028.03   59313.12   150.50     9267.36     8489.10      778.26         0.00  conventional  2015  Albany
2015-09-13          15          1.28      76111.27     985.73   65696.86   142.00     9286.68     8665.19      621.49         0.00  conventional  2015  Albany
2015-09-06          16          1.11      99172.96     879.45   90062.62   240.79     7990.10     7762.87      227.23         0.00  conventional  2015  Albany
2015-08-30          17          1.07     105693.84     689.01   94362.67   335.43    10306.73    10218.93       87.80         0.00  conventional  2015  Albany
2015-08-23          18          1.34      79992.09     733.16   67933.79   444.78    10880.36    10745.79      134.57         0.00  conventional  2015  Albany
2015-08-16          19          1.33      80043.78     539.65   68666.01   394.90    10443.22    10297.68      145.54         0.00  conventional  2015  Albany
2015-08-09          20          1.12     111140.93     584.63  100961.46   368.95     9225.89     9116.34      109.55         0.00  conventional  2015  Albany
2015-08-02          21          1.45      75133.10     509.94   62035.06   741.08    11847.02    11768.52       78.50         0.00  conventional  2015  Albany
2015-07-26          22          1.11     106757.10     648.75   91949.05   966.61    13192.69    13061.53      131.16         0.00  conventional  2015  Albany
2015-07-19          23          1.26      96617.00    1042.10   82049.40  2238.02    11287.48    11103.49      183.99         0.00  conventional  2015  Albany
2015-07-12          24          1.05     124055.31     672.25   94693.52  4257.64    24431.90    24290.08      108.49        33.33  conventional  2015  Albany
2015-07-05          25          1.35     109252.12     869.45   72600.55  5883.16    29898.96    29663.19      235.77         0.00  conventional  2015  Albany
2015-06-28          26          1.37      89534.81     664.23   57545.79  4662.71    26662.08    26311.76      350.32         0.00  conventional  2015  Albany
2015-06-21          27          1.27     104849.39     804.01   76688.55  5481.18    21875.65    21662.00      213.65         0.00  conventional  2015  Albany
2015-06-14          28          1.32      89631.30     850.58   55400.94  4377.19    29002.59    28343.14      659.45         0.00  conventional  2015  Albany
2015-06-07          29          1.07     122743.06     656.71   99220.82    90.32    22775.21    22314.99      460.22         0.00  conventional  2015  Albany
       ...         ...           ...           ...        ...        ...      ...         ...         ...         ...          ...  ...           ...   ...
2017-04-30          35          1.74       3046.63     388.81     280.28     0.00     2377.54     2377.54        0.00         0.00  organic       2017  Albany
2017-04-23          36          1.92       2087.60     110.25     182.56     0.00     1794.79     1794.79        0.00         0.00  organic       2017  Albany
2017-04-16          37          1.85       2886.48     265.82     203.84     0.00     2416.82     2416.82        0.00         0.00  organic       2017  Albany
2017-04-09          38          1.92       2209.82     159.65     189.67     0.00     1860.50     1860.50        0.00         0.00  organic       2017  Albany
2017-04-02          39          1.86       3492.87     885.46     362.37     0.00     2245.04     2245.04        0.00         0.00  organic       2017  Albany
2017-03-26          40          2.02       2250.22     166.49     263.32     0.00     1820.41     1820.41        0.00         0.00  organic       2017  Albany
2017-03-19          41          1.87       2763.38     503.14     175.98     0.00     2084.26     2084.26        0.00         0.00  organic       2017  Albany
2017-03-12          42          1.97       2001.95     123.51     206.64     0.00     1671.80     1671.80        0.00         0.00  organic       2017  Albany
2017-03-05          43          1.84       2228.14     241.00     208.79     0.00     1778.35     1778.35        0.00         0.00  organic       2017  Albany
2017-02-26          44          1.71       2185.96     508.31     240.10     0.00     1437.55     1437.55        0.00         0.00  organic       2017  Albany
2017-02-19          45          1.67       2523.56    1049.50     141.41     0.00     1332.65     1332.65        0.00         0.00  organic       2017  Albany
2017-02-12          46          1.78       1806.40     119.52     170.57     0.00     1516.31     1516.31        0.00         0.00  organic       2017  Albany
2017-02-05          47          1.72       1753.35      26.75     223.33     0.00     1503.27     1503.27        0.00         0.00  organic       2017  Albany
2017-01-29          48          1.86       1795.81      32.53     123.14     0.00     1640.14     1640.14        0.00         0.00  organic       2017  Albany
2017-01-22          49          1.82       1897.07      78.83     128.24     0.00     1690.00     1690.00        0.00         0.00  organic       2017  Albany
2017-01-15          50          1.84       1982.65      82.30     328.02     0.00     1572.33     1572.33        0.00         0.00  organic       2017  Albany
2017-01-08          51          1.94       2229.52      63.46     478.31     0.00     1687.75     1687.75        0.00         0.00  organic       2017  Albany
2017-01-01          52          1.87       1376.70      71.65     192.63     0.00     1112.42     1112.42        0.00         0.00  organic       2017  Albany
2018-03-25           0          1.71       2321.82      42.95     272.41     0.00     2006.46     1996.46       10.00         0.00  organic       2018  Albany
2018-03-18           1          1.66       3154.45     275.89     297.96     0.00     2580.60     2577.27        3.33         0.00  organic       2018  Albany
2018-03-11           2          1.68       2570.52     131.67     229.56     0.00     2209.29     2209.29        0.00         0.00  organic       2018  Albany
2018-03-04           3          1.48       3851.30     311.55     296.77     0.00     3242.98     3239.65        3.33         0.00  organic       2018  Albany
2018-02-25           4          1.56       5356.63     816.56     532.59     0.00     4007.48     4007.48        0.00         0.00  organic       2018  Albany
2018-02-18           5          1.43       7566.17    4314.30     251.85     0.00     3000.02     3000.02        0.00         0.00  organic       2018  Albany
2018-02-11           6          1.43       3817.93      59.18     289.85     0.00     3468.90     3468.90        0.00         0.00  organic       2018  Albany
2018-02-04           7          1.52       4124.96     118.38     420.36     0.00     3586.22     3586.22        0.00         0.00  organic       2018  Albany
2018-01-28           8          1.32       6987.56     433.66     374.96     0.00     6178.94     6178.94        0.00         0.00  organic       2018  Albany
2018-01-21           9          1.54       3346.54      14.67     253.01     0.00     3078.86     3078.86        0.00         0.00  organic       2018  Albany
2018-01-14          10          1.47       4140.95       7.30     301.87     0.00     3831.78     3831.78        0.00         0.00  organic       2018  Albany
2018-01-07          11          1.54       4816.90      43.51     412.17     0.00     4361.22     4357.89        3.33         0.00  organic       2018  Albany

338 rows × 13 columns

Wait, what? Why did it print out like that? Part of the benefit of the notebook is that we got to see this happen, but I would explain it either way. Some of the methods in pandas will modify your dataframe in place, but MOST are going to simply do the thing and return a new dataframe. So if we just check real quick:

albany_df.head()

       Unnamed: 0        Date  AveragePrice  Total Volume       4046       4225     4770  Total Bags  Small Bags  Large Bags  XLarge Bags  type          year  region
    0           0  2015-12-27          1.33      64236.62    1036.74   54454.85    48.16     8696.87     8603.62       93.25          0.0  conventional  2015  Albany
    1           1  2015-12-20          1.35      54876.98     674.28   44638.81    58.33     9505.56     9408.07       97.49          0.0  conventional  2015  Albany
    2           2  2015-12-13          0.93     118220.22     794.70  109149.67   130.50     8145.35     8042.21      103.14          0.0  conventional  2015  Albany
    3           3  2015-12-06          1.08      78992.15    1132.00   71976.41    72.58     5811.16     5677.40      133.76          0.0  conventional  2015  Albany
    4           4  2015-11-29          1.28      51039.60     941.48   43838.39    75.78     6183.95     5986.26      197.69          0.0  conventional  2015  Albany

We can see that albany_df is not impacted. There are two ways we can handle this. One is to re-define:

albany_df = albany_df.set_index("Date")
albany_df.head()

            Unnamed: 0  AveragePrice  Total Volume       4046       4225     4770  Total Bags  Small Bags  Large Bags  XLarge Bags  type          year  region
Date
2015-12-27           0          1.33      64236.62    1036.74   54454.85    48.16     8696.87     8603.62       93.25          0.0  conventional  2015  Albany
2015-12-20           1          1.35      54876.98     674.28   44638.81    58.33     9505.56     9408.07       97.49          0.0  conventional  2015  Albany
2015-12-13           2          0.93     118220.22     794.70  109149.67   130.50     8145.35     8042.21      103.14          0.0  conventional  2015  Albany
2015-12-06           3          1.08      78992.15    1132.00   71976.41    72.58     5811.16     5677.40      133.76          0.0  conventional  2015  Albany
2015-11-29           4          1.28      51039.60     941.48   43838.39    75.78     6183.95     5986.26      197.69          0.0  conventional  2015  Albany
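
The difference between getting a new dataframe back and changing one in place can be sketched on a tiny, made-up frame. The names frame, indexed, other and result are my own and are only here for illustration; the inplace form is included as a sketch of the alternative to re-defining.

```python
import pandas as pd

# Made-up frame for illustration.
frame = pd.DataFrame({
    "Date": ["2015-12-27", "2015-12-20"],
    "AveragePrice": [1.33, 1.35],
})

# set_index returns a new dataframe; `frame` itself is unchanged
# unless the result is assigned back to a name.
indexed = frame.set_index("Date")
print("Date" in frame.columns)  # still a regular column in `frame`
print(indexed.index.name)       # the new frame is indexed by Date

# With inplace=True the frame is modified directly and the call
# returns None, so there is nothing useful to assign.
other = frame.copy()
result = other.set_index("Date", inplace=True)
print(result is None, other.index.name)
```

Either way you end up with a Date-indexed frame; the main thing to avoid is assigning the result of an inplace call back to your variable, since that result is None.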

The other option is the inplace parameter, used instead of the re-assignment above. Something like:

albany_df.set_index("Date", inplace=True)

would also work. Okay, now that we've done that, let's plot!

albany_df['AveragePrice'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x11dd80940>

When we call .plot() on a dataframe, it is assumed that the x axis will be your index and the y axis will be all of your columns, which is why we specified one column in particular.
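
To make that index-and-columns behaviour concrete, here is a minimal sketch on a made-up frame. It assumes matplotlib is installed, uses the off-screen "Agg" backend so it runs without a display, and counts the lines drawn rather than showing a figure; all of the names are my own.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

import pandas as pd

# Made-up numbers, indexed by date strings like the Albany data.
frame = pd.DataFrame(
    {
        "AveragePrice": [1.33, 1.35, 0.93],
        "Total Volume": [64236.62, 54876.98, 118220.22],
    },
    index=["2015-12-27", "2015-12-20", "2015-12-13"],
)

# Plotting a single column draws one line against the index.
fig_one, ax_one = plt.subplots()
frame["AveragePrice"].plot(ax=ax_one)
print(len(ax_one.get_lines()))

# Plotting the whole frame draws one line per numeric column.
fig_all, ax_all = plt.subplots()
frame.plot(ax=ax_all)
print(len(ax_all.get_lines()))
```

With columns on very different scales, like price and volume here, plotting everything at once tends to flatten the smaller series, which is another reason to pick a single column.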

The Albany graph is a bit messy, however, especially with the dates, which also look out of order. Let's see if we can carry on with this in the next tutorial!

The next tutorial: Graphing/Visualization - Data Analysis With Python 3 And Pandas
