This Five Step Guide article shows you how to build a machine learning model that predicts car mileage in city driving conditions from the auto-mpg dataset, which records a handful of parameters (features) for a few hundred cars. In this project, I use the auto-mpg dataset of almost 400 cars, which has the following parameters:
1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)
Steps to create a Machine Learning model to predict Car mileage
This is a multivariate regression problem. In this article, we will train a machine learning model to predict car mileage by learning the relationship (the weights of the regression equation) between the dependent variable (y) and the independent variables or features (x1, x2, x3, etc.).
Of course, a vehicle's mileage doesn't depend on these parameters alone. Several other factors are in play, such as the direction and strength of the wind, city roads, city traffic, weather, and the driver's experience and ability.
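To make "learning the weights of the regression equation" concrete, here is a minimal sketch on synthetic data. The data, the true weights, and the variable names are assumptions made up for illustration, and ordinary least squares stands in for whichever regression algorithm is eventually chosen.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 2 + 3*x1 - 1.5*x2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.01, size=200)

# Add a column of ones so the intercept is learned as an extra weight
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: find the weights that minimise the squared error
weights, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(weights)  # close to the true values 2.0, 3.0 and -1.5
```

The recovered weights play the same role as the coefficients a regression model learns for cylinders, weight, horsepower and the other features.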
Course Provider Name: Jatin Grover
The steps I have followed are the ones commonly used for most regression problems in machine learning. Later in the article, I also explain the two key aspects listed below of approaching and solving a machine learning regression problem.
- Using python cross_val_score & cross_val_predict to choose the best ML algo
- Feature Selection using backward elimination in machine learning
Importing the libraries and creating Pandas DataFrame
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn import metrics
Thereafter, I read auto-mpg dataset and create a Pandas DataFrame.
df = pd.read_csv('auto-mpg.data', sep=r'\s+', header=None, names=['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'modelyear', 'origin', 'carname'])
display(df.head(3))
Two quick Python pandas tips for reading the data:
- Regular expression: r'\s+' is a regex matching one or more whitespace characters, i.e. any of [ \t\n\r\f\v]
- How to read a *.dat file? Use the pandas.read_csv() function and specify the delimiter as space/tab
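The following sketch shows the whitespace delimiter at work without needing the file on disk. The two sample rows are illustrative text laid out in the style of auto-mpg.data (runs of spaces between numbers, a tab before the quoted car name); `sample` and `sample_df` are names chosen for this example only.

```python
import io
import pandas as pd

# Two illustrative rows laid out like auto-mpg.data: columns separated by
# runs of spaces, with a tab before the quoted car name
sample = (
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"chevrolet chevelle malibu"\n'
    '15.0   8   350.0      165.0      3693.      11.5   70  1\t"buick skylark 320"\n'
)
cols = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
        'acceleration', 'modelyear', 'origin', 'carname']

# r'\s+' treats any run of spaces or tabs as a single delimiter;
# the double quotes keep the multi-word car name in one column
sample_df = pd.read_csv(io.StringIO(sample), sep=r'\s+', header=None, names=cols)
print(sample_df.shape)
print(sample_df['carname'].tolist())
```

Passing a raw string (the r prefix) keeps Python from interpreting the backslash before the regex engine sees it.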
The output is like this:
Data pre-processing and visualization
Check data completeness
Now we check for any null, missing, incomplete, or inappropriate values using the following code.
df.isnull().sum()
df.info()
- The first command reports the number of null values in each column. It won't catch blank or placeholder strings in a text column, since those are not null.
- The second command, df.info(), shows whether the data type of every feature is what we expect, i.e. we expect displacement, horsepower, mpg, etc. to be numerical (float/int).
Expected output is something like this:
We see that the horsepower column is read as the object data type by Pandas, whereas we expect a floating-point value. It means there is a string somewhere in that column. Our goal now is to find the string value(s) and decide what to do with the corrupt data.
In general, the steps are to check for null, missing, incomplete, or inappropriate values, and then clean the data by converting columns to appropriate data types, filling in missing values, normalizing, and so on.
Clean corrupt data
We follow the steps below to clean the corrupt data in the horsepower column.
1. Display unique values in the horsepower column.
str(set(df['horsepower']))
Output:
"{'?', '160.0', '65.00', '129.0', '167.0', '66.00', '208.0', '103.0', '116.0', '60.00', '52.00', '92.00', '115.0', '139.0', '90.00', '180...... <truncated output>
We can see that the only string present in the entire column is '?', in a few rows.
2. Find the percentage of non-numeric data in the horsepower column
def removenotnum(list1):
    # Collect every value that cannot be converted to float
    notnum = []
    for x in list1:
        try:
            float(x)
        except (ValueError, TypeError):
            notnum.append(x)
    return notnum

notnumtable = removenotnum(df['horsepower'])
print('all rubbish values -->', set(notnumtable))
print('Percent of identified rubbish data in Table -->', len(notnumtable) / len(df['horsepower']) * 100)
OUTPUT -->
all rubbish values --> {'?'}
Percent of identified rubbish data in Table --> 1.507537688442211
It turns out that only 1.5% of data is corrupt. Identify the row index of those rows containing rubbish value for horsepower column and remove those rows.
indexnames = df[(df['horsepower'] == '?')].index
df.drop(axis=0, index=indexnames, inplace=True)
Now convert the remaining clean data in the horsepower column to float and check the data types again.
df['horsepower'] = df['horsepower'].astype(float)
df.info()
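As an alternative to the manual loop, pandas can do the detection and the conversion together. The sketch below is not the method used in this article; it swaps in pd.to_numeric with errors='coerce' and runs on a made-up four-row table (`demo` is a hypothetical name).

```python
import pandas as pd

# Hypothetical mini-table mimicking horsepower with a '?' placeholder
demo = pd.DataFrame({'horsepower': ['130.0', '165.0', '?', '150.0'],
                     'mpg': [18.0, 15.0, 25.0, 16.0]})

# errors='coerce' turns anything that cannot be parsed as a number into NaN
parsed = pd.to_numeric(demo['horsepower'], errors='coerce')
bad_values = demo.loc[parsed.isna(), 'horsepower'].unique()
percent_bad = parsed.isna().mean() * 100

# Keep only the rows that parsed cleanly and store the numeric column
clean = demo.loc[parsed.notna()].assign(horsepower=parsed[parsed.notna()])
print(list(bad_values), percent_bad)
print(clean.dtypes)
```

Whether to drop or impute the affected rows is a separate decision; with only a small share of rows affected, dropping them as the article does is a reasonable choice.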
Data visualization using plots and charts
Now we plot the data as charts and histograms to decide whether any further preprocessing or feature engineering is required. In short, we plot to check for anomalies, outliers, distributions, ranges of values, etc.
Pairplot
Show a pairplot of the dependent variable (y) against every independent variable or feature (x1, x2, x3, etc.) except car name.
sns.pairplot(df, x_vars=df.drop(['carname', 'mpg'], axis=1, inplace=False).columns, y_vars=['mpg'])
Histogram plot
Now we plot a histogram of the dependent variable (y) and every independent variable or feature (x1, x2, x3, etc.) except car name. We define a histplot() function and later call it to plot all histograms.
def histplot(df, listvar):
    # One histogram per column, laid out side by side
    fig, axes = plt.subplots(nrows=1, ncols=len(listvar), figsize=(20, 3))
    for ax, var in zip(np.atleast_1d(axes), listvar):
        df.hist(column=var, bins=20, ax=ax)
        ax.set_ylabel('Count')
        ax.set_xlabel(var)
    plt.show()

histplot(df, df.drop(['carname'], axis=1, inplace=False).columns)
Check and remove outliers using Boxplot
To see if the data has any outliers, we will now plot a boxplot of every independent variable or feature (x1, x2, x3, etc.) except car name.
- Define a dfboxplot() function
- Define a list of continuous variables and call dfboxplot() for only those to detect outliers
def dfboxplot(df, listvars):
    # One boxplot per column, laid out side by side
    fig, axes = plt.subplots(nrows=1, ncols=len(listvars), figsize=(20, 3))
    for ax, var in zip(np.atleast_1d(axes), listvars):
        df.boxplot(column=var, ax=ax)
        ax.set_ylabel('Value')
        ax.set_xlabel(var)
    plt.show()

# Create a list of the numeric columns
linear_vars = df.select_dtypes(include=[np.number]).columns
# Call dfboxplot() for only linear_vars to detect outliers
dfboxplot(df, linear_vars)
Lastly, remove outliers using the z-score. An absolute z-score of 3 is a common, practical threshold for detecting and removing outliers.
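# Return a copy of the dataframe without rows that are outliers in any listed column
def removeoutliers(df, listvars, z):
    from scipy import stats
    keep = pd.Series(True, index=df.index)
    for var in listvars:
        # A row is kept only if its |z-score| is below the threshold for every variable
        keep &= np.abs(stats.zscore(df[var].to_numpy())) < z
    return df[keep]

# Remove outliers where |z-score| >= 3
df = removeoutliers(df, linear_vars, 3)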
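To see what the z-score filter does on a single column, here is a small sketch on made-up numbers. The `toy` table is hypothetical, and the z-score is computed by hand with the population standard deviation (ddof=0), which is also the default used by scipy.stats.zscore.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 19 ordinary weights and one extreme value
toy = pd.DataFrame({'weight': [100.0] * 19 + [1000.0]})

# z-score = (value - mean) / standard deviation (population std, ddof=0)
z = (toy['weight'] - toy['weight'].mean()) / toy['weight'].std(ddof=0)

# Keep rows whose absolute z-score is below the threshold of 3
filtered = toy[np.abs(z) < 3]
print(len(toy), '->', len(filtered))
```

One caveat worth knowing: with the population standard deviation, no value in a sample of n rows can have an absolute z-score above sqrt(n - 1), so a threshold of 3 cannot flag anything in a sample of fewer than ten rows.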
Set up machine learning model to predict car mileage
In this section, I will show you initial steps needed before we begin applying different machine learning techniques or models.
1. Set up python pandas dataframes for independent variable X and dependent variable y
The dependent variable is (y) and the independent features are (x1, x2, x3 etc)
X = df.drop(['carname', 'mpg'], axis=1, inplace=False)
y = df[['mpg']]
Tip: Two square brackets [[… ]] are needed to create a dataframe. Single square brackets [] will create a series / array
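A quick check of the tip above, on a made-up two-row table (`demo`, `as_frame` and `as_series` are names chosen for this example):

```python
import pandas as pd

demo = pd.DataFrame({'mpg': [18.0, 15.0], 'weight': [3504.0, 3693.0]})

as_frame = demo[['mpg']]   # list of column names -> DataFrame
as_series = demo['mpg']    # single column name   -> Series

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
```

The DataFrame keeps a second dimension (rows by one column), while the Series is one-dimensional.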
2. Convert features to log
Since we saw in the histogram plots that the features are not normally distributed, we will convert them to log.
Many machine learning methods, linear regression among them, tend to perform better when the underlying data is close to normally distributed.
def convertfeatures2log(df, listvars):
    for var in listvars:
        df[var] = np.log(df[var])

convertfeatures2log(X, X.columns)
convertfeatures2log(y, y.columns)
histplot(X, X.columns)
y.hist(bins=20)
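The effect of the log transform can be checked numerically as well as visually. The sketch below uses a made-up right-skewed series (`raw` is hypothetical) and compares skewness before and after; it also shows that np.exp reverses the transform, which matters later when predictions have to be reported in mpg again.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values (a long tail of large numbers)
raw = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 8, 20, 50], dtype=float)
logged = np.log(raw)

# Skewness closer to 0 means a more symmetric distribution
print(round(raw.skew(), 2), round(logged.skew(), 2))

# np.exp undoes the transform, which is needed to report results in mpg
restored = np.exp(logged)
```

Note that np.log requires strictly positive values, which holds for every numeric column in this dataset.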
3. Test Train Split
Data scientists generally split the data for machine learning into either two or three subsets: two subsets for training and testing, or three for training, validation, and testing. I have elaborated on this in an earlier post (search that page for the string 'Train Test Split').
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
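A minimal sketch of what the split produces, on ten made-up samples (`X_demo` and `y_demo` are hypothetical stand-ins for the feature and target tables):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

print(X_tr.shape, X_te.shape)
```

With test_size=0.3, three of the ten samples go to the test set and seven to the training set. Fixing random_state makes the shuffle reproducible, so reruns give the same split.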
4. Screen test for ML applicability for this use case — using Random Forest
Calling a machine learning algorithm nowadays is a piece of cake, since several libraries let you call and run one with just a few lines of code.
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=300)
rf.fit(X_train, y_train.values.ravel())
y_pred = rf.predict(X_test)
Explanation of the code above: after the import, the first statement creates an instance of the RandomForestRegressor class. The second fits the training data to that regressor. The last predicts y values based on the X_test data and puts all values in y_pred.
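The same three steps can be tried on synthetic data where the right answer is known in advance. In this sketch the data, the names (`rf_demo`, `X_demo`, `pred`) and the choice of 50 trees are assumptions for illustration; random_state is added so that reruns build the same forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumed): the target depends on the first feature only
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

X_tr, X_te = X_demo[:150], X_demo[150:]
y_tr, y_te = y_demo[:150], y_demo[150:]

# random_state fixes the bootstrap samples so reruns give the same forest
rf_demo = RandomForestRegressor(n_estimators=50, random_state=0)
rf_demo.fit(X_tr, y_tr)
pred = rf_demo.predict(X_te)

# feature_importances_ sums to 1; the informative feature should dominate
print(rf_demo.feature_importances_)
```

A forest predicts by averaging training targets, so its predictions never leave the range of target values it was trained on. That is worth remembering if the model is ever asked about cars unlike those in the dataset.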
Now, we plot two scatter plots of predictions vs. actuals:
- ML algo log results vs. log of mpg in auto-mpg dataset
- exponent of predicted values vs. real mpg in auto-mpg dataset
import matplotlib.gridspec as gridspec
fig = plt.figure(figsize=(12,5))
grid = gridspec.GridSpec(ncols=2, nrows=1, figure=fig)
ax1 = fig.add_subplot(grid[0, 0])
ax2 = fig.add_subplot(grid[0, 1])
sns.scatterplot(x = y_test['mpg'], y = y_pred, ax=ax1)
sns.regplot(x = y_test['mpg'], y=y_pred, ax=ax1)
ax1.set_title("Log of Predictions vs. actuals")
ax1.set_xlabel('Actual MPG')
ax1.set_ylabel('predicted MPG')
sns.scatterplot(x = np.exp(y_test['mpg']), y = np.exp(y_pred), ax=ax2,)
sns.regplot(x = np.exp(y_test['mpg']), y=np.exp(y_pred), ax=ax2)
ax2.set_title("Real values of Predictions vs. actuals")
ax2.set_xlabel('Actual MPG')
ax2.set_ylabel('predicted MPG')
Now check the metrics to test whether it's worthwhile
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Output →
MAE: 0.07270868117493247
MSE: 0.009611197970657645
RMSE: 0.0980367174616615
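MAE (mean absolute error) = the average of |Actual Value – Predicted Value| over all test samples
Since the target was converted to log, the MAE of approx 0.0727 is in log units, so we need its anti-log: exp(0.0727) ≈ 1.075. In other words, the predictions of our machine learning model for car mileage on the auto-mpg dataset are typically off by a factor of about 1.075, i.e. roughly 7.5% of the actual mpg.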
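The relationship between an error measured in log units and an error in mpg can be worked through on a few made-up numbers. The three actual and predicted values below are hypothetical; only the 0.0727 figure comes from the output above.

```python
import numpy as np

# Hypothetical actual and predicted mpg values
actual_mpg = np.array([20.0, 30.0, 25.0])
predicted_mpg = np.array([22.0, 27.0, 25.0])

# MAE in log space, as reported when the target is log-transformed
log_mae = np.mean(np.abs(np.log(actual_mpg) - np.log(predicted_mpg)))

# Anti-log: the typical multiplicative factor between prediction and actual
typical_factor = np.exp(log_mae)

# MAE in original units, for comparison
mae_mpg = np.mean(np.abs(actual_mpg - predicted_mpg))
print(log_mae, typical_factor, mae_mpg)

# The same anti-log applied to the MAE reported above
print(np.exp(0.0727))
```

A log-space MAE is best read as a relative error: a value near 0 means predictions and actuals differ by a factor near 1, regardless of whether the car does 15 or 40 mpg.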
After inspecting the scatter plot, we realize it's quite well predicted. So now, I will try to improve the ML algo for more accurate predictions.
In the next section, I will talk more about the 2 other key aspects of approaching and solving a regression machine learning problem that I mentioned earlier:
- Using python cross_val_score & cross_val_predict to choose the best ML algo
- Feature Selection in machine learning
Eyeing the best Machine Learning lens
In this section, I will share with you how I selected the best machine learning algorithm out of the set of algorithms I identified for this particular problem.
There are many ways to select the best machine learning algorithm out of the set of algorithms for the identified problem. One of the most well known methods is using python cross_val_score & cross_val_predict.
- First, we run the KFold() function. K-Fold cross validation takes a given data set and splits it into K folds, where each fold is used once as the testing set while the other K-1 folds are used as the training set. For example, for 10-fold cross validation (K=10), the auto-mpg dataset is split into 10 folds. In the first iteration, the first fold is used for validation while the 9 remaining folds form the training set. In the second iteration, the 2nd fold is used as the testing set while the rest serve as the training set. This process is repeated until each of the 10 folds has been used as the testing set.
- Subsequently, the cross_val_score function takes the model, X and y, and the KFold object as inputs and outputs a list of regression metric scores, one per fold. The cross_val_score function splits the data, using KFold as described above, into K pieces, trains on each combination of K-1 folds, and gives back the metric of the model on the held-out fold.
KFold splits the data into n_splits folds; the data is then split into a train and a test set n_splits times, once per fold (randomly only if shuffle=True). cross_val_score returns an array of the estimator's scores, one for each run of the cross validation.
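The fold mechanics are easiest to see by printing the indices. This sketch uses ten made-up samples and five folds (`X_demo`, `kfold_demo` and `test_counts` are names chosen for the example) and counts how often each sample lands in a test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(10, 1)   # 10 hypothetical samples

kfold_demo = KFold(n_splits=5)          # no shuffling: folds are consecutive blocks
test_counts = np.zeros(10, dtype=int)
for fold, (train_idx, test_idx) in enumerate(kfold_demo.split(X_demo)):
    print(fold, 'train:', train_idx, 'test:', test_idx)
    test_counts[test_idx] += 1
```

Every sample is used for testing exactly once and for training in the other four iterations. Without shuffle=True the folds are simply consecutive slices of the data, so the order of the rows matters.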
See the following page for better explanation of python cross_val_score & cross_val_predict and KFold — their explanation and real use:
Cross validation and model selection
The following code uses KFold to get 10 splits of training data and subsequently cross_val_score to score multiple regression algorithms on all 10 folds and get the score matrix. The mean and standard deviation of the scores of all 10 folds of every regression algorithm used is then displayed.
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
models=[]
models.append(('LR', LinearRegression()))
models.append(('RF', RandomForestRegressor(n_estimators=100)))
models.append(('KNN', KNeighborsRegressor()))
models.append(('CART', DecisionTreeRegressor()))
models.append(('SVR',SVR(gamma='auto')))
# now evaluate each model
results = []
names = []
print("model: mean of score across 10 folds (std dev of score)")
for name, model in models:
    # Split the training data into 10 folds; train on 9 and test on 1; repeat for every fold.
    # random_state is omitted because it only applies when shuffle=True.
    kfold = KFold(n_splits=10)
    cv_results = cross_val_score(model, X_train, y_train.values.ravel(), cv=kfold)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
The output turns out to be:
model: mean of score across 10 folds (std dev of score)
LR: 0.878770 (0.042622)
RF: 0.880276 (0.053347)
KNN: 0.796948 (0.092200)
CART: 0.820365 (0.054995)
SVR: 0.840068 (0.066258)
It seems the best algorithm is Linear Regression or Random forest for our use case.
Python Cross_val_score & Cross_val_predict
Lots of people wonder what the difference is between cross_val_score and cross_val_predict, and which one to use when. The following two links should clear up any doubts. I could have written two full paragraphs on this, but there's little point when the same content already exists elsewhere and has been read, discussed, and appreciated by hundreds of people in the community.
Stackoverflow.com: cross-val predict accuracy score
Stackoverflow.com: Difference between cross-val score and cross-val predict
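In short, the two functions differ in what they return. The sketch below runs both on synthetic data with a plain linear regression; the data and the names `scores` and `oof_pred` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

# Synthetic data (assumed): a noisy linear relationship
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = X_demo @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=100)

cv = KFold(n_splits=5)
model = LinearRegression()

# One score per fold (R^2 by default for regressors)
scores = cross_val_score(model, X_demo, y_demo, cv=cv)

# One out-of-fold prediction per sample
oof_pred = cross_val_predict(model, X_demo, y_demo, cv=cv)

print(scores.shape, oof_pred.shape)
```

cross_val_score is the natural choice for comparing models, as done above. cross_val_predict is handy for plotting predictions against actuals, since each prediction comes from a model that never saw that sample during training.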
Which features to select? — Feature selection
I will briefly discuss a very important machine learning topic: feature selection. Feature selection is the process of reducing the number of input variables used to develop a predictive machine learning model. It is desirable in order to reduce the computational cost of modeling as well as to improve the performance of the model.
Sometimes, to create the perfect machine learning model, we use a lot of features and later realize the curse of dimensionality falling upon our machine learning model.
We then try to come up with ways to reduce the number of features, through feature selection or dimensionality reduction.
How can feature selection help?
- Training time: fewer features reduce algorithm complexity, thereby enabling faster training.
- Overfitting: removing redundant data reduces noise, so the model is less likely to make decisions based on it.
- Accuracy: fewer features mean less misleading data, which can increase ML model accuracy.
Generally speaking, if you have tens or hundreds of features and get an accuracy of 60% (not considered good for a predictive model), you should look at feature selection and feature engineering. This way, without making huge design or code changes in your Machine learning model, your accuracy may jump to 80-90%.
You can read more about feature selection by visiting Towardsdatascience.com >> feature selection.
In another article, I have described backward elimination, a powerful feature selection technique. There I describe the theory and build further on the code from this very post on "Machine Learning model to predict car mileage using auto-mpg dataset".
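For a first taste of backward selection without leaving scikit-learn, here is a minimal sketch. It is not the code from the other article: it uses SequentialFeatureSelector (available in scikit-learn 0.24 and later), which drops features based on cross-validated score rather than p-values, and it runs on synthetic data where only two of four features carry signal.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed): only the first two of four features matter
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Start from all features and repeatedly drop the one whose removal
# hurts the cross-validated score the least, until two remain
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction='backward', cv=5)
selector.fit(X_demo, y_demo)

print(selector.get_support())
```

get_support() returns a boolean mask over the columns, and selector.transform(X_demo) would return only the kept features. How many features to keep is a choice the analyst still has to make.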
Hope this gave you a first step-by-step code to implement a regression machine learning algorithm to predict a number along with some flavor of cross_val_score & cross_val_predict in Python, and backward elimination in machine learning.
References
For this particular post, I have referred to several websites, including Machine Learning Mastery and the other posts and websites mentioned in this article itself.
I have also referred to the backward elimination code from the Udemy course by Kirill Eremenko
About Me
I work in IT projects side of an Investment bank. I have 15 years of experience in building production ready applications in Front-office Trading and for the last 3 years in Machine Learning, NLP, NER, Anomaly detection etc.
Feel free to connect with me on LinkedIn or follow me on Medium
The same post was first written on Medium.com. Please click here to check out my initial post on medium.
Frequently Asked Questions (FAQ)
Difference between training set and test set in ML models for prediction?
1. The training set consists of labeled data used to train the model. It comprises the examples provided to the model for learning. Typically 70-80% of the total data is used as the training set.
2. At the end of training, the model produces a hypothesis as its output. The test set is then used to test that model's accuracy. Generally 20-30% of the data is taken as test data. The test set is also labeled, but its labels are hidden from the model during training and used only to score its predictions.
Overfitting vs. underfitting?
Overfitting occurs when a model matches the training data almost perfectly, but not the test data. It means the model predicts the training data actually too well. And too much of anything is not good for health.
Underfitting occurs when a model performs poorly on both the training and the test data.
It's a vast and debatable topic in itself, so I won't discuss it further in this FAQ. You can refer to the following 2 websites:
https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/
https://www.kaggle.com/dansbecker/underfitting-and-overfitting
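Both effects can be reproduced in a few lines. The sketch below is an illustration on synthetic data (a noisy sine curve), not part of the auto-mpg analysis: an unrestricted decision tree stands in for an overfitting model and a depth-1 tree for an underfitting one.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumed): a sine curve plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 6, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=200)
X_tr, X_te, y_tr, y_te = X_demo[:150], X_demo[150:], y_demo[:150], y_demo[150:]

# Unlimited depth: the tree can memorise every training point
deep = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
# Depth 1: a single split, too simple for a sine curve
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_tr, y_tr)

# score() returns R^2: first on the training data, then on the test data
for name, tree in [('deep', deep), ('stump', stump)]:
    print(name, tree.score(X_tr, y_tr), tree.score(X_te, y_te))
```

The deep tree scores almost perfectly on the data it was trained on but noticeably worse on unseen data, which is the signature of overfitting. The stump scores poorly on both, which is the signature of underfitting.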
What are the features used to predict car mileage?
To solve this problem of car fuel efficiency prediction, you may use the auto-mpg dataset. It's a simple but very useful task of predicting car mileage. Almost every car buyer is looking to buy a car with the best mileage. The parameters or features provided in the auto-mpg dataset are:
1. mpg
2. cylinders
3. displacement
4. horsepower
5. weight
6. acceleration
7. model year
8. origin
9. car name