Graphing/visualization - Data Analysis with Python 3 and Pandas

Welcome to part 2 of the data analysis with Python and Pandas tutorials, where we're currently learning all about the prices of avocados. Soon, we'll find a new dataset, but let's learn a few more things with this one first. Where we left off, we were graphing the price from Albany over time, but it was quite messy. Here's a recap:

import pandas as pd

df = pd.read_csv("datasets/avocado.csv")

albany_df = df[df['region']=="Albany"]
albany_df.set_index("Date", inplace=True)

albany_df["AveragePrice"].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x11fd925f8>

So dates are funky types of data: they are strings, but they also have an order, at least to us. When it comes to dates, we have to help computers out a bit. Luckily for us, Pandas comes with built-in ways to handle dates. First, we need to convert the date column to datetime objects:
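A minimal sketch of that conversion, using the same avocado.csv as before, with pd.to_datetime doing the parsing:

import pandas as pd

df = pd.read_csv("datasets/avocado.csv")
df['Date'] = pd.to_datetime(df['Date'])  # parse the Date strings into datetime objects

albany_df = df[df['region']=="Albany"]
albany_df.set_index("Date", inplace=True)

albany_df["AveragePrice"].plot()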

Alright, the formatting looks better in terms of the axis, but that graph is pretty wild! Could we settle it down a bit? We could smooth the data with a rolling average.

To do this, let's apply some smoothing and plot the result:
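For example, a rolling mean over a 25-row window (25 is just the window we'll use here; it matches the price25ma column we build below):

albany_df["AveragePrice"].rolling(25).mean().plot()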

Hmm, so what happened? Pandas understands that a date is a date and how to sort the x axis, but now I am wondering if the dataframe itself is sorted. If it's not, that would seriously screw up our moving average calculations. This data may be indexed by date, but is it sorted? Let's see.
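One way to check and fix it in one go is to sort the index before computing the rolling mean:

albany_df.sort_index(inplace=True)  # sort rows by the Date index
# pandas will typically emit something like:
# SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
albany_df["AveragePrice"].rolling(25).mean().plot()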

What's this warning above? Should we be worried? Basically, all it's telling us is that we're doing operations on a copy of a slice of a dataframe, and to watch out because we might not be modifying what we were hoping to modify (like the main df). In this case, we're not trying to work with the main dataframe, so I think this warning is just plain annoying, but whatever. It's just a warning, not an error.

And there we have it! A more useful summary of avocado prices for Albany over the years.

Visualizations are cool, but what if we want to save our new, smoother data like the above? We can give it a new column in our dataframe:
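Something along these lines, where price25ma is just our name for "price, 25 moving average":

albany_df["price25ma"] = albany_df["AveragePrice"].rolling(25).mean()
albany_df.head()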

            Unnamed: 0  AveragePrice  Total Volume     4046      4225    4770  Total Bags  Small Bags  Large Bags  XLarge Bags          type  year  region  price25ma
Date
2015-01-04          51          1.22      40873.28  2819.50  28287.42   49.90     9716.46     9186.93      529.53          0.0  conventional  2015  Albany        NaN
2015-01-04          51          1.79       1373.95    57.42    153.88    0.00     1162.65     1162.65        0.00          0.0       organic  2015  Albany        NaN
2015-01-11          50          1.24      41195.08  1002.85  31640.34  127.12     8424.77     8036.04      388.73          0.0  conventional  2015  Albany        NaN
2015-01-11          50          1.77       1182.56    39.00    305.12    0.00      838.44      838.44        0.00          0.0       organic  2015  Albany        NaN
2015-01-18          49          1.17      44511.28   914.14  31540.32  135.77    11921.05    11651.09      269.96          0.0  conventional  2015  Albany        NaN

Perfect example of why tail is useful sometimes...
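The first 24 rows of a 25-window rolling mean are NaN, so head() can't show us whether the calculation worked. tail() can:

albany_df.tail()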

            Unnamed: 0  AveragePrice  Total Volume      4046       4225   4770  Total Bags  Small Bags  Large Bags  XLarge Bags          type  year  region  price25ma
Date
2018-03-11           2          1.68       2570.52    131.67     229.56   0.00     2209.29     2209.29        0.00          0.0       organic  2018  Albany     1.4224
2018-03-18           1          1.66       3154.45    275.89     297.96   0.00     2580.60     2577.27        3.33          0.0       organic  2018  Albany     1.4316
2018-03-18           1          1.35     105304.65  13234.86   61037.58  55.00    30977.21    26755.90     3721.31        500.0  conventional  2018  Albany     1.4276
2018-03-25           0          1.57     149396.50  16361.69  109045.03  65.45    23924.33    19273.80     4270.53        380.0  conventional  2018  Albany     1.4272
2018-03-25           0          1.71       2321.82     42.95     272.41   0.00     2006.46     1996.46       10.00          0.0       organic  2018  Albany     1.4368

That warning sure is annoying though, isn't it? What could we do? A common idea is to just silence all of the warnings. What if we instead made an explicit copy?
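For example, copying the full dataframe before slicing, so pandas knows we mean to work on our own copy (a sketch of one way to do it):

albany_df = df.copy()[df['region']=="Albany"]  # copy first, then slice
albany_df.set_index("Date", inplace=True)
albany_df.sort_index(inplace=True)
albany_df["price25ma"] = albany_df["AveragePrice"].rolling(25).mean()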

This way, we are explicit. Pandas is happy, we're happy. You can ignore pandas warnings, but I would strongly advise against silencing them. Now you know you can do it, but I am not going to show you how :)

Another subtle thing you might have glossed over is the requirement to sort things how we intend before we start performing operations and calculations. Many times, you won't be visualizing columns before you make them. You may actually never visualize them. Imagine if we had written the above code before we sorted by date, basically just assuming things were ordered by date. We'd have produced bad data.

Then imagine we're doing some machine learning or other statistical analysis on that data. Chances are, our moving average column is not only fairly useless, it's also frequently being informed by future data!

It's very easy to make mistakes like this. Check your code early and check it often, by printing things out and visualizing them where possible!

Alright, we want more cool stuff, what else can we do?

Let's graph prices in the different regions. We hard-coded the Albany region, but hmm, we don't know all of the regions. What do we do?! If we ever just wanted to get a "list" from one of our columns, we could reference just that column, like:
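df['region']  # just the Series for that column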

Then convert to array with:
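df['region'].values  # the underlying NumPy array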

Could go to list like:
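df['region'].values.tolist()  # a plain Python list (with lots of duplicates)
set(df['region'].values.tolist())  # plain-Python de-duplication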

So, very quickly, you can take your pandas data, get it out into array/list form, and use your own knowledge of Python. Or, you could just use some Pandas method. Just know: if you're trying to do something, it probably has a method!
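In this case, the method we want is unique():

df['region'].unique()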

That was quick and painless!
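Here's the gist of the loop that builds one big dataframe of per-region 25 moving averages (a sketch reconstructing that code; note the double square brackets when seeding graph_df, which keep it a DataFrame rather than a Series):

graph_df = pd.DataFrame()

for region in df['region'].unique()[:16]:
    print(region)
    region_df = df.copy()[df['region']==region]
    region_df.set_index('Date', inplace=True)
    region_df.sort_index(inplace=True)
    region_df[f"{region}_price25ma"] = region_df["AveragePrice"].rolling(25).mean()

    if graph_df.empty:
        graph_df = region_df[[f"{region}_price25ma"]]  # double brackets: DataFrame, not Series
    else:
        graph_df = graph_df.join(region_df[f"{region}_price25ma"])

graph_df.tail()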

I set the limit to 16 just to show we're getting bogged down. This one really tripped me up. I couldn't quite figure out what was going on. Things were taking exponentially longer and longer, then memory was getting exhausted. That's not what I know and love with Pandas, so what gives? Upon some digging, we find that hmm, dates are still getting duplicated. For example:

            Albany_price25ma  Atlanta_price25ma  BaltimoreWashington_price25ma  Boise_price25ma
Date
2018-03-25            1.4368             1.2884                         1.3844           1.5016
2018-03-25            1.4368             1.2884                         1.3844           1.5016
2018-03-25            1.4368             1.2884                         1.3844           1.5016
2018-03-25            1.4368             1.2884                         1.3844           1.5016
2018-03-25            1.4368             1.2884                         1.3844           1.5016

            Boston_price25ma  BuffaloRochester_price25ma  California_price25ma  Charlotte_price25ma
Date
2018-03-25             1.588                      1.2232                1.4232               1.4916
2018-03-25             1.588                      1.2232                1.4232               1.4916
2018-03-25             1.588                      1.2232                1.4232               1.4916
2018-03-25             1.588                      1.2232                1.4232               1.4916
2018-03-25             1.588                      1.2232                1.4232               1.4916

            Chicago_price25ma  CincinnatiDayton_price25ma  Columbus_price25ma  DallasFtWorth_price25ma
Date
2018-03-25             1.5708                      1.2792              1.1704                    1.118
2018-03-25             1.5708                      1.2792              1.1704                    1.118
2018-03-25             1.5708                      1.2792              1.1704                    1.118
2018-03-25             1.5708                      1.2792              1.1704                    1.118
2018-03-25             1.5708                      1.2792              1.1704                    1.118

            Denver_price25ma  Detroit_price25ma  GrandRapids_price25ma  GreatLakes_price25ma
Date
2018-03-25            1.2888             1.1492                 1.3264                1.2788
2018-03-25            1.2888             1.1728                 1.3164                1.2568
2018-03-25            1.2888             1.1728                 1.3164                1.2788
2018-03-25            1.2888             1.1728                 1.3264                1.2568
2018-03-25            1.2888             1.1728                 1.3264                1.2788

Each row should be a separate date, but it's not. Through some debugging, we can discover what's happening, which also explains why our previous data looked so ugly.

Our avocados have multiple prices per date: organic and conventional! So, let's pick one. I'll go with organic, and we'll pretty much just start over:
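A sketch of the restart, filtering to organic and sorting by date up front this time:

df = pd.read_csv("datasets/avocado.csv")
df = df.copy()[df['type']=='organic']  # keep only the organic rows
df["Date"] = pd.to_datetime(df["Date"])
df.sort_values(by="Date", ascending=True, inplace=True)
df.head()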

       Unnamed: 0       Date  AveragePrice  Total Volume       4046      4225  4770  Total Bags  Small Bags  Large Bags  XLarge Bags     type  year            region
9489           51 2015-01-04          1.24     142349.77  107490.73  25711.96  2.93     9144.15     9144.15        0.00          0.0  organic  2015        California
10269          51 2015-01-04          1.50       6329.83    3730.80   2141.91  0.00      457.12      426.67       30.45          0.0  organic  2015          LasVegas
10893          51 2015-01-04          1.12      17296.85   14569.66   1868.59  0.00      858.60      830.00       28.60          0.0  organic  2015     PhoenixTucson
9437           51 2015-01-04          1.73        379.82       0.00     59.82  0.00      320.00      320.00        0.00          0.0  organic  2015  BuffaloRochester
11621          51 2015-01-04          1.30       5782.70     723.29   4221.15  0.00      838.26      223.33      614.93          0.0  organic  2015           Spokane

Now, let's just copy and paste the code from above, minus the print:
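(Also dropping the [:16] limit, since the duplicate dates, not the joins, were the real problem.)

graph_df = pd.DataFrame()

for region in df['region'].unique():
    region_df = df.copy()[df['region']==region]
    region_df.set_index('Date', inplace=True)
    region_df.sort_index(inplace=True)
    region_df[f"{region}_price25ma"] = region_df["AveragePrice"].rolling(25).mean()

    if graph_df.empty:
        graph_df = region_df[[f"{region}_price25ma"]]
    else:
        graph_df = graph_df.join(region_df[f"{region}_price25ma"])

graph_df.tail()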

            California_price25ma  LasVegas_price25ma  PhoenixTucson_price25ma  BuffaloRochester_price25ma  Spokane_price25ma
Date
2018-02-25                1.9128              1.9120                   1.7468                      1.2912             2.1544
2018-03-04                1.8876              1.8748                   1.7404                      1.2744             2.1040
2018-03-11                1.8636              1.8440                   1.7324                      1.2652             2.0552
2018-03-18                1.8516              1.8204                   1.7216                      1.2560             2.0012
2018-03-25                1.8364              1.7968                   1.7104                      1.2416             1.9496

            LosAngeles_price25ma  Philadelphia_price25ma  Boston_price25ma  StLouis_price25ma  Louisville_price25ma  ...
Date
2018-02-25                1.9024                  1.6324            1.7636             1.9240                1.7044  ...
2018-03-04                1.8656                  1.6260            1.7708             1.8868                1.6816  ...
2018-03-11                1.8284                  1.6300            1.7824             1.8468                1.6508  ...
2018-03-18                1.8160                  1.6304            1.7932             1.8192                1.6176  ...
2018-03-25                1.8016                  1.6256            1.7984             1.7976                1.5844  ...

            Houston_price25ma  Chicago_price25ma  Plains_price25ma  Indianapolis_price25ma  SouthCentral_price25ma
Date
2018-02-25             1.6132             1.8160            1.8116                  1.4928                  1.5728
2018-03-04             1.5960             1.8024            1.7900                  1.4744                  1.5592
2018-03-11             1.5696             1.7836            1.7672                  1.4540                  1.5400
2018-03-18             1.5360             1.7732            1.7452                  1.4320                  1.5204
2018-03-25             1.5128             1.7672            1.7232                  1.4160                  1.5072

            Columbus_price25ma  Albany_price25ma  Detroit_price25ma  NewOrleansMobile_price25ma  NewYork_price25ma
Date
2018-02-25              1.6064            1.5112             1.4980                      1.5384             1.9308
2018-03-04              1.5804            1.4992             1.4692                      1.5288             1.9156
2018-03-11              1.5496            1.5044             1.4444                      1.5076             1.9092
2018-03-18              1.5088            1.5140             1.4092                      1.4860             1.8948
2018-03-25              1.4848            1.5188             1.3964                      1.4616             1.8876

5 rows × 54 columns

Now it's quick! Awesome!

Let's graph!
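Something like this works (the figsize and hiding the 54-entry legend are just presentation choices):

graph_df.plot(figsize=(8, 5), legend=False)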

Lots more we could poke around with here, but, in the next tutorial, we'll be visiting a new dataset with new challenges.
