This quick 5-step guide describes Backward Elimination code in Python for a machine learning regression problem. Backward elimination is a feature selection technique that helps you keep only the features that actually contribute to the model. Using every available feature can slow down your machine learning model or cause other performance issues.

## Introduction to Backward Elimination in Machine Learning

There are many useful techniques for feature selection (or dimension reduction). If your model has several features, it is possible that not all of them are equally important. Some features can actually be derived from other features, so you can drop a few of them to improve performance or accuracy. Before I start with the Backward Elimination code in Python, let's understand feature selection with a small example.

Often, however, you have to make a judgement call on whether to keep derived features or drop them. For instance, the total land area of your house is derived from the length and breadth of the land. So can we safely remove the total land area feature from a machine learning algorithm that predicts house prices?

Think about it this way. Would you prefer a house with a wide front over one with a narrow front that runs deeper into the alley? The shape matters as well as the size, so in this case we have to retain at least one of the two redundant features (length or breadth) as well as the total land area feature.

To be sure that you have the optimal number of features, you have to follow a feature selection or dimensionality reduction technique. There are many such techniques, like lasso regression (which shrinks regression coefficients in order to reduce overfitting), Principal Component Analysis (PCA) and so on.

However, here I will explain **Backward Elimination in Machine Learning**. I will also take you through the python code for Backward Elimination.

This exercise is mainly useful for reducing the number of features when there are hundreds or thousands of them. It’s not really necessary if you have fewer than 30 features.

To start using the backward elimination code in Python, you first need to prepare your data. The first step is to add a column of ones (every element is “1”) to the feature matrix. The OLS function in statsmodels does not add an intercept on its own, so this column of 1’s represents the constant term of the regression. It is the first dimension of the independent variable X, generally called x_{0}.

```
import numpy as np

np.ones(5)
# output is as follows
# array([1., 1., 1., 1., 1.])
```
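To see why the column of ones matters, here is a minimal sketch on made-up data (the numbers are illustrative and do not come from the car mileage dataset). It fits y = 3 + 2x with plain NumPy least squares, once with the ones column and once without it.

```python
import numpy as np

# Made-up data that follows y = 3 + 2*x exactly (illustrative only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 + 2.0 * x

# Prepend a column of ones so the model can learn a constant term (x0 = 1)
X_with_const = np.column_stack([np.ones(len(x)), x])

# Ordinary least squares: first coefficient is the intercept, second is the slope
coef, *_ = np.linalg.lstsq(X_with_const, y, rcond=None)
print(coef)

# Without the ones column the fitted line is forced through the origin,
# so the slope has to absorb the missing intercept
coef_no_const, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)
print(coef_no_const)
```

With the ones column the fit recovers the intercept and slope used to build the data; without it the slope is distorted. statsmodels also provides `sm.add_constant(X)`, which prepends the same column for you.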

## 5-steps to Backward Elimination in Machine Learning with Python code

Before I start the step-by-step guide, I would ask you to visit the article I wrote on using machine learning to predict car mileage. It contains detailed Python code and steps for several machine learning prediction algorithms that predict car mileage using the UCI dataset. The backward elimination code in Python that I show in this article can be used as the code steps **following** the code written in the Car mileage prediction article. In other words, the Python code for backward elimination is PART 2 of the Car mileage prediction article.

A quick rundown of steps for the **Backward Elimination python code** is as follows:

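### Step 1: Select a significance level for the P-value^{1}

A 5% significance level works well in most circumstances, so set the significance level to 0.05.

^{1} The p-value of a predictor is the probability of seeing a coefficient estimate at least as extreme as the one obtained if the predictor truly had no effect.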

### Step 2: Fit the model with all predictors (features)

This step is also very simple. Just fit your machine learning model with all the features. For instance, if you have 60 features, fit the model on your training dataset with all of them. The Python code for Backward Elimination step 2 is as follows.

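```
import numpy as np
import statsmodels.api as sm

# Prepend a column of ones (the constant x0); X_train has 274 rows here
X_train_opt = np.append(arr = np.ones((X_train.shape[0], 1)).astype(int), values = X_train, axis = 1)
# Start with the constant (column 0) plus all seven features
X_train_opt = X_train_opt[:,[0, 1, 2, 3, 4, 5, 6, 7]]
regressor_OLS = sm.OLS(endog = y_train, exog = X_train_opt).fit()
regressor_OLS.summary()
```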

The output is a large table of statistical results. Note that we are interested only in the P-value of each predictor, which statsmodels reports in the `P>|t|` column.

### Step 3: Identify the predictor with the highest P-value

Now we note the P-values of all predictors and search for the predictor with the highest P-value. If its P-value is greater than the significance level, go to step 4. Otherwise, consider this the final list of features.

In our case, the P-values of x_{1} and x_{2} are greater than the significance level, and x_{2} has the higher of the two.
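Picking the predictor by eye works for a handful of features, but the same check can be sketched in code. The p-values below are hypothetical placeholders, not the real output of the model above; with a fitted statsmodels result you would read them from `regressor_OLS.pvalues` instead.

```python
import numpy as np

SIGNIFICANCE_LEVEL = 0.05

# Hypothetical p-values for [const, x1, ..., x7]; placeholders for illustration
p_values = np.array([0.000, 0.210, 0.640, 0.001, 0.000, 0.030, 0.000, 0.002])

# Skip index 0 (the constant) so the intercept is never a removal candidate
worst = 1 + int(np.argmax(p_values[1:]))

if p_values[worst] > SIGNIFICANCE_LEVEL:
    print(f"Remove x{worst} (p-value {p_values[worst]:.3f})")
else:
    print("All predictors are significant; stop here.")
```

Keeping the constant out of the comparison is a design choice of this sketch: the intercept is usually retained whatever its p-value.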

### Step 4: Remove the predictor with highest P-value

Modify the set of features to contain all features apart from the one identified in the last step. In our case, it's x_{2}. Note that the second feature (x_{2}) is no longer used:

`X_train_opt = X_train_opt[:, [0, 1, 3, 4, 5, 6, 7]]`
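Typing out every remaining index gets tedious as the number of features grows. As an alternative sketch, `np.delete` drops a column by position; the small matrix here is an arbitrary stand-in for `X_train_opt`, not the real training data.

```python
import numpy as np

# Stand-in matrix with 8 columns (constant + 7 features); values are arbitrary
X_demo = np.arange(24).reshape(3, 8)

# Drop column 2 (x2) without listing every column that stays
X_reduced = np.delete(X_demo, 2, axis=1)

print(X_reduced.shape)  # (3, 7)
```

The result is identical to indexing with `[0, 1, 3, 4, 5, 6, 7]`.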

### Step 5: Fit the model again (Step 2) and stop when the p-value of every feature is at or below the significance level

Use the OLS function from the statsmodels.api library again, this time fitting the model without x_{2}.

```
import statsmodels.api as sm
regressor_OLS = sm.OLS(endog = y_train, exog = X_train_opt).fit()
regressor_OLS.summary()
```

In the output this time, the P-value of x_{1} is still greater than the significance level.

As explained earlier, repeat the Backward Elimination code in Python until we have removed all features with a p-value higher than the significance level, i.e. 0.05.

#### Now remove the x_{1} feature and fit the model again (repeat Step 2 without x_{1})

```
# Note that now the model is run without the 1st and 2nd features
# Rebuild the matrix from X_train so the column positions match the original numbering
X_train_opt = np.append(arr = np.ones((X_train.shape[0], 1)).astype(int), values = X_train, axis = 1)
X_train_opt = X_train_opt[:,[0, 3, 4, 5, 6, 7]]
regressor_OLS = sm.OLS(endog = y_train, exog = X_train_opt).fit()
regressor_OLS.summary()
```

What I have described here is the key step in Backward Elimination in Machine Learning. We have to repeat steps 2 to 4 until the highest P-value among the remaining features is no greater than 0.05. This way, if you have more than 30 features, you should be able to filter out a few of them and may see an improvement in machine learning model performance.

After this last fit, all features have a P-value less than the significance level. This ends our exercise with **Backward elimination Python code in Machine Learning**.

## Test ML model performance with reduced feature set

Now we know that the optimal feature set required for our algorithm is just features 3 to 7. So we create another X_train and X_test with features 3 to 7 only. In X_train, which has no column of ones, these sit at column positions 2 to 6:

`X_train2 = X_train.iloc[:, [2, 3, 4, 5, 6]]`

`X_test2 = X_test.iloc[:, [2, 3, 4, 5, 6]]`
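The positions used here differ from those in `X_train_opt` because that matrix has the column of ones at position 0. A small sketch of the conversion, using hypothetical variable names:

```python
# Column positions kept in X_train_opt, where position 0 is the column of ones
kept_in_opt = [0, 3, 4, 5, 6, 7]

# Drop the constant and shift by one to get positions in the original X_train
kept_in_X_train = [i - 1 for i in kept_in_opt if i != 0]

print(kept_in_X_train)  # [2, 3, 4, 5, 6]
```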

### Random forest algo on reduced number of features

We check the performance of the reduced feature set using a Random Forest model.

```
import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor

# Fit a random forest on the reduced feature set and predict on the test set
rf = RandomForestRegressor(n_estimators = 10)
rf.fit(X_train2, y_train)
y_pred = rf.predict(X_test2)

fig = plt.figure(figsize=(12,5))
grid = gridspec.GridSpec(ncols=2, nrows=1, figure=fig)
ax1 = fig.add_subplot(grid[0, 0])
ax2 = fig.add_subplot(grid[0, 1])

# Left panel: predictions vs. actuals on the log scale
sns.scatterplot(x = y_test['mpg'], y = y_pred, ax=ax1)
sns.regplot(x = y_test['mpg'], y=y_pred, ax=ax1)
ax1.set_title("Log of Predictions vs. actuals")
ax1.set_xlabel('Actual log(MPG)')
ax1.set_ylabel('Predicted log(MPG)')

# Right panel: the same comparison after undoing the log transform
sns.scatterplot(x = np.exp(y_test['mpg']), y = np.exp(y_pred), ax=ax2)
sns.regplot(x = np.exp(y_test['mpg']), y=np.exp(y_pred), ax=ax2)
ax2.set_title("Real values of Predictions vs. actuals")
ax2.set_xlabel('Actual MPG')
ax2.set_ylabel('Predicted MPG')
plt.show()
```

```
import numpy as np
from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
# Output →
# MAE: 0.07700061119950408
# MSE: 0.010134333336198278
# RMSE: 0.1006694260249768
```
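The `np.exp` calls in the plotting code suggest that mpg was log-transformed, in which case the errors above are in log units rather than miles per gallon. The sketch below, on hypothetical values, computes the three metrics directly from their definitions and shows both scales. The helper name `regression_errors` is my own.

```python
import numpy as np

# Hypothetical log(mpg) values, for illustration only
y_true_log = np.log(np.array([18.0, 25.0, 30.0, 22.0]))
y_pred_log = np.log(np.array([20.0, 24.0, 27.0, 22.0]))


def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) computed directly from their definitions."""
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))
    mse = np.mean(residuals ** 2)
    return mae, mse, np.sqrt(mse)


# Errors in log units, the same scale as the figures reported above
print(regression_errors(y_true_log, y_pred_log))

# Errors in miles per gallon, after undoing the log transform
mae_mpg, mse_mpg, rmse_mpg = regression_errors(np.exp(y_true_log), np.exp(y_pred_log))
print(mae_mpg, mse_mpg, rmse_mpg)
```

Reporting the error on the original scale makes it easier to judge whether a prediction that is off by a given amount actually matters.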

We see that there is not much difference in the performance or accuracy of the predictions because, as explained, this is a small dataset with very few features.

## Frequently Asked Questions (FAQ)

### What is the difference between correlation and causation?

Causation means one action causes another action, called the outcome.

Correlation means one action is related (or correlated) to another action. It doesn’t necessarily mean that the first action *“causes”* the second action.
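A small simulation can make the distinction concrete. In this made-up example (all names and numbers are assumptions for illustration), temperature drives both ice cream sales and sunburn cases, so the two are strongly correlated even though neither causes the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Temperature is the common cause of both quantities (simulated data)
temperature = rng.normal(loc=25.0, scale=5.0, size=500)
ice_cream_sales = 10.0 * temperature + rng.normal(scale=10.0, size=500)
sunburn_cases = 2.0 * temperature + rng.normal(scale=3.0, size=500)

# Pearson correlation between the two outcomes
correlation = np.corrcoef(ice_cream_sales, sunburn_cases)[0, 1]
print(round(correlation, 2))
```

The correlation is high, yet banning ice cream would do nothing to reduce sunburn.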

### What are the different methods of feature selection?

There are several dimensionality reduction or feature selection techniques:

– Lasso regression: shrinks regression coefficients in order to reduce overfitting

– Principal Component Analysis (PCA)

– Discard correlated variables to create a reduced-feature dataset

– Discard derived features. This is a judgement call.

– Eliminate features identified by charting the independent variables after fitting a Random Forest (for example, a feature importance plot)

– Use Linear Regression to select the features based on their p-values

– Forward selection

– Backward selection

– Stepwise selection

### What is the “Curse of Dimensionality”?

It means that the underlying dataset has more features than it needs.

Additionally, if you have more features than observations, you run the risk of overfitting. Observations may also become more difficult to cluster, because with too many dimensions every observation can appear to be about the same distance from every other one.

PCA is one of the most popular dimensionality reduction techniques.
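One way to see the clustering problem is to measure how much farther the farthest point is than the nearest one as dimensions are added. This is a rough sketch on uniformly random points; the helper name `distance_contrast` is my own.

```python
import numpy as np

rng = np.random.default_rng(0)


def distance_contrast(n_dims, n_points=200):
    """Relative gap between the farthest and nearest point from a reference point."""
    points = rng.uniform(size=(n_points, n_dims))
    reference = rng.uniform(size=n_dims)
    distances = np.linalg.norm(points - reference, axis=1)
    return (distances.max() - distances.min()) / distances.min()


for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(distance_contrast(n_dims), 3))
```

The contrast shrinks as the number of dimensions grows, which means "near" and "far" neighbours become hard to tell apart.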

### What is PCA in Machine Learning?

Principal Component Analysis (PCA) is a dimensionality reduction technique that reduces the number of features (variables) by combining features that are correlated with each other. The variables of the reduced dataset are called principal components.
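As a minimal sketch of the idea, the code below runs PCA with plain NumPy (via the singular value decomposition) on two made-up, strongly correlated features. The feature names echo the land example earlier but the data is simulated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated made-up features
length = rng.normal(size=300)
breadth = 0.9 * length + rng.normal(scale=0.2, size=300)
X = np.column_stack([length, breadth])

# PCA via SVD of the mean-centred data
X_centred = X - X.mean(axis=0)
_, singular_values, components = np.linalg.svd(X_centred, full_matrices=False)

# Share of the total variance captured by each principal component
explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
print(explained_ratio)

# Keep only the first principal component: 2 features reduced to 1
X_reduced = X_centred @ components[:1].T
print(X_reduced.shape)  # (300, 1)
```

Because the two features carry almost the same information, the first principal component captures nearly all of the variance, so one variable can stand in for both.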