Machine Learning with Scikit-learn - Data Analysis with Python 3 and Pandas
Hello and welcome to part 6 of the Data Analysis with Python and Pandas series, where we're going to be looking into using Pandas as the data pre-processing step for machine learning.
Let's start with a simple regression task, where we're attempting to price out the value of diamonds, using the classic diamonds dataset (the first five rows are shown at the end of this post).
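A minimal loading sketch; the filename diamonds.csv and the index being in the first CSV column are assumptions:

```python
import pandas as pd

# Assumed filename; the first CSV column holds the row index.
df = pd.read_csv("diamonds.csv", index_col=0)
print(df.head())
```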
Now, the question is whether we can come up with some sort of formula that takes inputs like carat, cut, color, clarity, depth, table, x, y, and z, and predicts the price.
The basis of machine learning is math, so columns with string values like cut and clarity have to be converted to numbers.
I would like to start us using linear regression, so it's also fairly ideal that our string classifications are linear, meaning they have a meaningful order. Let's see what all of our cuts are, for example:
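One quick way to peek at them (a sketch):

```python
# List the unique cut categories present in the data.
print(df['cut'].unique())
# The diamonds dataset's cut grades: Fair, Good, Very Good, Premium, Ideal
```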
Okay, we can take this and hard-code the order:
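A sketch of that mapping; the worst-to-best 1-through-5 encoding matches the mapped rows shown at the end of this post:

```python
# Encode worst (1) to best (5) so the ordering is meaningful to a linear model.
cut_class_dict = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}
```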
Next, let's check out clarity:
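Again, a quick peek at the unique values (a sketch):

```python
# Show which clarity grades actually occur in the data.
print(df['clarity'].unique())
```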
FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3

Taken from the dataset page, this is ordered best to worst, so now we need this in a dict too.
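Reversing that best-to-worst list into a worst-to-best encoding, a sketch consistent with the mapped rows at the end of this post:

```python
# Worst (I3 = 1) up to best (FL = 11), following the order above in reverse.
clarity_dict = {"I3": 1, "I2": 2, "I1": 3, "SI2": 4, "SI1": 5, "VS2": 6,
                "VS1": 7, "VVS2": 8, "VVS1": 9, "IF": 10, "FL": 11}
```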
We also have color. D is the best, J is the worst.
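Same idea for color; the 0-to-6 scale here is inferred from the mapped rows at the end of this post:

```python
# Worst (J = 0) up to best (D = 6).
color_dict = {"J": 0, "I": 1, "H": 2, "G": 3, "F": 4, "E": 5, "D": 6}
```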
Now we map this:
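A sketch using pandas' map with the dicts above:

```python
# Replace each string grade with its numeric rank.
df['cut'] = df['cut'].map(cut_class_dict)
df['clarity'] = df['clarity'].map(clarity_dict)
df['color'] = df['color'].map(color_dict)
print(df.head())
```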
Alright, let's see if we can train a regression model to figure this out. This will be what is called a "supervised" learning task. With supervised learning, your job will pretty much always be the same. You take the data you want to use to make a prediction, and separate it out into an array. Then you take the data you want to predict, and separate that out into another array.
Then, you feed the data you want to use to make the prediction (features) and then the correct values that you want to build a model to learn to map to (your labels) into some type of model.
Scikit-learn is a popular package used for doing regular machine learning (not deep learning usually, though you can do deep learning with sklearn). To get it:
pip install scikit-learn
While you pip install scikit-learn, you actually import things from sklearn.
Next, we pick a model. A super easy way to figure out which model you want is scikit-learn's "choosing the right estimator" flowchart.
This would suggest we use an SGD Regressor. All you need to know is that this model will take our input features, turn them into coefficients in an equation, and try to get as close as possible to outputting whatever trained values we pass.
Then, later, we can either save some samples for true out-of-sample testing, or just make some up to see what the model says the price of that diamond would be.
If you've ever seen a home value estimate or something, this is how they are done. They take in a bunch of features and run them through a regression algorithm to come up with a value.
Okay, so our first job is to convert the data to features and labels. Always be careful in this step, making sure you don't accidentally leak information about your label into your features, telling the model more about the label than you intend.
In machine learning, the standard is that feature sets are stored as a capital X and labels as a lowercase y.
Recall that many methods will return a dataframe. For X, we want all of the columns EXCEPT for the price one, so we can just drop it, then use .values to convert to a numpy array. For our labels, y, we just take the price column's values. We also probably want to save some of these values for testing the model after it's been trained. So we'll do something like:
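A sketch of that split; holding out the last 200 rows is an arbitrary choice:

```python
from sklearn.utils import shuffle

df = shuffle(df)  # shuffle first so the held-out rows aren't biased by ordering

X = df.drop("price", axis=1).values  # features: every column except price
y = df["price"].values               # labels: the price column

test_size = 200  # number of rows held out for testing

X_train, y_train = X[:-test_size], y[:-test_size]
X_test, y_test = X[-test_size:], y[-test_size:]
```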
Now we can train and test our regressor!
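A minimal training-and-scoring sketch (max_iter=1000 is just SGDRegressor's default, written out for clarity):

```python
from sklearn.linear_model import SGDRegressor

clf = SGDRegressor(max_iter=1000)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # r-squared on the held-out rows
```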
Well, that's not very good. The score for these regression models is r-squared, the coefficient of determination, which normally falls between 0 and 1 (100%), where 1.0 is a perfect fit. A hugely negative score like our -70999348.67836547 just means the model fits far worse than simply predicting the mean price. Let's try support vector regression instead:
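A sketch, also printing a few predictions next to the real prices so we can eyeball them:

```python
from sklearn import svm

clf = svm.SVR()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

# Eyeball a handful of predictions against the true prices.
for X_row, y_true in zip(X_test[:10], y_test[:10]):
    print(f"model predicts {clf.predict([X_row])[0]:.2f}, real value: {y_true}")
```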
Well, the good news is some of these are at least close. We're in the same zipcode, at least! That took a while to run, though. One difference between svm.SVR() and the SGDRegressor, according to the docs, is that svm.SVR() by default has an unlimited number of iterations. Let's try that with the SGDRegressor, to be fair, by setting its max_iter to something quite large. Apparently -1 isn't allowed! 10,000 it is!
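The retry, as a sketch:

```python
clf = SGDRegressor(max_iter=10000)  # -1 is rejected, so use a large finite cap
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```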
Ok no, it just isn't gonna work unless we tweak more. Let's go back to the svm.SVR() model and see if we can improve it.
The most common way to improve models is to scale data. Let's try that.
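A sketch using sklearn's preprocessing.scale, which standardizes each column to zero mean and unit variance; keeping an unscaled copy around for later is my addition:

```python
from sklearn import preprocessing

X_raw = X                   # keep an unscaled copy for handling new data later
X = preprocessing.scale(X)  # zero mean, unit variance per column

# Re-split after scaling, then retrain the SVR on the scaled features.
X_train, y_train = X[:-test_size], y[:-test_size]
X_test, y_test = X[-test_size:], y[-test_size:]

clf = svm.SVR()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```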
This improved our score a bit, so that's nice. We could keep tweaking things and probably improve this model further, but that's not quite the intention of this series, so this will do for now.
Any new diamond data you get would need to be combined with your main dataset, scaled the same way, and then predicted from.
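A rough sketch of that idea; the new_diamond values below are made up, and in a real pipeline you would fit a StandardScaler on the training data once and reuse it rather than re-scaling everything:

```python
import numpy as np

# Made-up diamond: carat, cut, color, clarity, depth, table, x, y, z
new_diamond = np.array([[0.30, 5, 5, 4, 61.5, 55.0, 4.00, 4.05, 2.45]])

# Append to the unscaled features, re-scale everything together,
# then predict from the last (scaled) row.
combined = preprocessing.scale(np.vstack([X_raw, new_diamond]))
print(clf.predict(combined[-1:]))
```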
Okay, that's all for now! I hope you have enjoyed! For reference, here are the first five rows of the dataset before the mapping:
   carat      cut color clarity  depth  table  price     x     y     z
1   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
2   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
3   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
4   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
5   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
And the same rows after mapping cut, color, and clarity to numbers:

   carat  cut  color  clarity  depth  table  price     x     y     z
1   0.23    5      5        4   61.5   55.0    326  3.95  3.98  2.43
2   0.21    4      5        5   59.8   61.0    326  3.89  3.84  2.31
3   0.23    2      5        7   56.9   65.0    327  4.05  4.07  2.31
4   0.29    4      1        6   62.4   58.0    334  4.20  4.23  2.63
5   0.31    2      0        4   63.3   58.0    335  4.34  4.35  2.75