Marketing Analytics | Significance of Feature Engineering, Model Selection and Hyper-parameter Tuning

Ben Roshan · Published in Analytics Vidhya · 10 min read · Jul 14, 2020

Introduction

In this article, we will use a marketing analytics project to perform data cleaning, feature engineering, model selection and hyper-parameter tuning, and explain why each step is significant. Let's look at the marketing problem statement to understand our dataset.

Project Summary:

The increasing popularity of online shopping has led to the emergence of new economic activities. To succeed in the highly competitive e-commerce environment, it is vital to understand consumer intention. Understanding what motivates consumer intention is critical because such intention is key to survival in this fast-paced and hyper-competitive environment. Where prior research has attempted at most a limited adaptation of the information system success model, we propose a comprehensive, empirical model that separates the ‘use’ construct into ‘intention to use’ and ‘actual use’. This makes it possible to test the importance of user intentions in determining their online shopping behaviour. Our results suggest that the consumer’s intention to use is quite important, and accurately predicts the usage behaviour of consumers. In contrast, consumer satisfaction has a significant impact on intention to use but no direct causal relation with actual use.

Objectives:

  1. Meet and Clean Data
  2. Feature Engineering
  3. Model the Data using Machine Learning
  4. Strategies to tackle the problem statement

The dataset is given to us on a golden plate as a CSV file on Kaggle: Online Shopper's Intention

Greeting the data

Gathering the data

Dataset:

The dataset consists of feature vectors belonging to 12,330 sessions. The dataset was formed so that each session would belong to a different user in a 1-year period to avoid any tendency to a specific campaign, special day, user profile, or period.

Attributes:

  • Revenue => class label indicating whether the session ended in revenue (a transaction) or not
  • Administrative, Administrative Duration, Informational, Informational Duration, Product Related and Product Related Duration => represent the number of different types of pages visited by the visitor in that session and total time spent in each of these page categories.
  • Bounce Rate => percentage of visitors who enter the site from that page and then leave (“bounce”) without triggering any other requests to the analytics server during that session
  • Exit Rate => for a specific page, the percentage of all pageviews to that page that were the last in the session
  • Page Value => feature represents the average value for a web page that a user visited before completing an e-commerce transaction
  • Special Day => indicates the closeness of the site visiting time to a specific special day (e.g. Mother's Day, Valentine's Day) on which sessions are more likely to be finalized with a transaction. For example, for Valentine's Day, this value takes a nonzero value between February 2 and February 12, zero before and after these dates unless it is close to another special day, and its maximum value of 1 on February 8
  • Operating system, browser, region, traffic type => the different operating systems, browsers, regions and traffic types used to visit the website
  • Visitor type => Whether the customer is a returning or new visitor
  • Weekend => a Boolean value indicating whether the date of the visit is a weekend
  • Month => Month of the year

Data Cleaning

Let’s check for missing values
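The original embeds this check as a code snippet; here is a minimal sketch, assuming the Kaggle CSV is loaded into a DataFrame named `df` (the file name below is an assumption):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Online Shoppers Intention dataset (file name assumed)
df = pd.read_csv('online_shoppers_intention.csv')

# Visual check: a heatmap of missing values across the columns
sns.heatmap(df.isnull(), cbar=False)
plt.title('Missing values per column')
plt.show()

# Numeric check: count of nulls per feature
print(df.isnull().sum())
```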

If you noticed in the plot, no null values were shown. But when checking each record, we actually have 14 null records across 8 features. Let's decide how to deal with these missing records.

Handling Missing Values

Part 1

The Administrative, Informational and Product Related features are counts of page types; technically they are nominal data, so I guess we impute them with the median. Also, before doing that, the minimum value of 0 (a page-type count of 0 means it should be a null value) should also be treated as NaN, so we convert it to NaN and then impute it.

We have now figured out the real number of null values hiding in the dataset. Let's impute them with the median.
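A minimal sketch of that imputation, continuing with the same `df`; the column names follow the Kaggle CSV:

```python
import numpy as np

page_count_cols = ['Administrative', 'Informational', 'ProductRelated']

# A page-type count of 0 is treated as a hidden missing value, per the reasoning above
df[page_count_cols] = df[page_count_cols].replace(0, np.nan)

# Impute each page-count feature with its median
for col in page_count_cols:
    df[col] = df[col].fillna(df[col].median())
```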

Part 2

We are left with the page durations, bounce rates and exit rates. The durations have -1 as the minimum value (time cannot be negative), which should be treated as a null value, and a duration of 0 (time can't be zero) occurs when the page-type count was 0, which we imputed earlier; so we convert these to NaN and impute them as well.
For the rates, we can impute the NaN values directly. (Here we don't need to worry about bounce rates having values of 0, as there are many cases where a bounce rate can legitimately be 0 because the user liked the website and moved on to other pages towards a transaction.)

We have now figured out the remaining null values hiding in the dataset. Let's impute them with the mean.
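A sketch of the duration and rate imputation, under the same assumptions about column names:

```python
import numpy as np

duration_cols = ['Administrative_Duration', 'Informational_Duration', 'ProductRelated_Duration']
rate_cols = ['BounceRates', 'ExitRates']

# Negative (-1) and zero durations are treated as missing, as argued above
df[duration_cols] = df[duration_cols].replace([-1, 0], np.nan)

# Impute durations and rates with the column mean
for col in duration_cols + rate_cols:
    df[col] = df[col].fillna(df[col].mean())
```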

We have handled all the null values. Let’s get into Feature Engineering

Feature Engineering

Now let's work on the data a little bit and jump into feature engineering.

Handling Outliers

Outliers increase the variability in your data, which decreases statistical power. Consequently, excluding outliers can cause your results to become statistically significant. Let's check our numerical features for outliers through box plots.
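One possible way to draw those box plots (the subplot layout below is just one choice, not necessarily the original's):

```python
import matplotlib.pyplot as plt

numeric_cols = df.select_dtypes(include='number').columns

# One box plot per numerical feature to eyeball the outliers
df[numeric_cols].plot(kind='box', subplots=True, layout=(4, 4),
                      figsize=(14, 10), sharex=False, sharey=False)
plt.tight_layout()
plt.show()
```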

OOF, we have a lot of outliers. If you notice Informational_Duration and PageValues, they don't have much of a distribution, and if you remove their outliers there will be only one value left. So, except for those two features, we remove the outliers via the IQR method.

IQR Method- Removing Outliers
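A sketch of one common IQR recipe; the exact column list is an assumption (the categorical identifiers and SpecialDay are left out, along with the two features excluded above):

```python
# Behavioural columns to filter; Informational_Duration and PageValues are excluded as discussed
iqr_cols = ['Administrative', 'Administrative_Duration', 'Informational',
            'ProductRelated', 'ProductRelated_Duration', 'BounceRates', 'ExitRates']

# Keep only rows within 1.5 * IQR of Q1/Q3 for every selected column
Q1 = df[iqr_cols].quantile(0.25)
Q3 = df[iqr_cols].quantile(0.75)
IQR = Q3 - Q1
mask = ~((df[iqr_cols] < (Q1 - 1.5 * IQR)) | (df[iqr_cols] > (Q3 + 1.5 * IQR))).any(axis=1)
df = df[mask]

print(df.shape)  # how many sessions survive the filter
```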

We have removed a good number of outliers !

‘Special Day’- Feature Clubbing !

Here I'm planning to club the Special Day feature, which takes the values 0.2, 0.4, 0.6, 0.8 and 1. So let's club and replace the values based on a condition. These are probability values, so:

  • If it is greater than 0.4 it is ‘1’ which indicates it is a ‘Special day’.
  • If it is less than or equal to 0.4 it is 0 which indicates ‘Not a Special day’

Now, let's change the values into a boolean, as it makes more sense (see the sketch after these mappings):

  • 1-True-Special Day
  • 0-False-Not a Special Day
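A minimal sketch of the clubbing and boolean conversion:

```python
import numpy as np

# Club the probabilities: > 0.4 becomes 1 (special day), otherwise 0
df['SpecialDay'] = np.where(df['SpecialDay'] > 0.4, 1, 0)

# Convert to a boolean flag, which reads more naturally
df['SpecialDay'] = df['SpecialDay'].astype(bool)
```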

Converting d-types

Since we have categorical variables that are stored as numerical types, I guess it is better to convert them to the category data type.
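A sketch of the conversion; which columns to treat as categorical is an assumption based on the attribute list above:

```python
# Integer-coded identifiers and text columns are really categorical
categorical_cols = ['OperatingSystems', 'Browser', 'Region', 'TrafficType',
                    'VisitorType', 'Weekend', 'Month']

df[categorical_cols] = df[categorical_cols].astype('category')
print(df.dtypes)
```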

Feature Scaling

Real-world datasets contain features that vary widely in magnitude, units and range. Normalization should be performed when the scale of a feature is irrelevant or misleading, and should not be performed when the scale is meaningful.

The algorithms which use Euclidean Distance measure are sensitive to Magnitudes. Here feature scaling helps to weigh all the features equally.

Formally, if a feature in the dataset is big in scale compared to the others, then in algorithms where Euclidean distance is measured this big-scaled feature becomes dominant and needs to be normalized.

Examples of Algorithms where Feature Scaling matters
1. K-Means uses the Euclidean distance measure, so feature scaling matters here.
2. K-Nearest Neighbours also requires feature scaling.
3. Principal Component Analysis (PCA): tries to find the features with maximum variance, so feature scaling is required here too.
4. Gradient Descent: calculation speed increases, as the theta calculation becomes faster after feature scaling. Check this article to understand more: Analytics Vidhya

Note: Naive Bayes, Linear Discriminant Analysis and tree-based models are not affected by feature scaling. In short, any algorithm that is not distance-based is not affected by feature scaling. It is optional for our problem statement, since we will be using a Random Forest model (you can skip this step).
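A sketch of the scaling step with scikit-learn's StandardScaler; the selection of numerical columns is an assumption (in a stricter workflow the scaler would be fit on the training split only):

```python
from sklearn.preprocessing import StandardScaler

# Numerical behavioural features to bring onto a comparable scale
num_cols = ['Administrative', 'Administrative_Duration', 'Informational',
            'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration',
            'BounceRates', 'ExitRates', 'PageValues']

scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
```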

We have scaled our numerical features using StandardScaler.

Label Encoding

Label Encoding in Python can be achieved using the sklearn library. Sklearn provides a very efficient tool for encoding the levels of categorical features into numeric values. LabelEncoder encodes labels with a value between 0 and n_classes-1, where n_classes is the number of distinct labels. If a label repeats, it is assigned the same value as before. Check this blog to understand more: Analytics Vidhya

Let’s encode our month feature using Label encoder
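A minimal sketch; encoding VisitorType the same way is an extra assumption here, so the tree models later on can consume the frame:

```python
from sklearn.preprocessing import LabelEncoder

# Encode the text-valued columns into integers 0..n_classes-1
for col in ['Month', 'VisitorType']:
    df[col] = LabelEncoder().fit_transform(df[col])
```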

We have encoded the required features. Let's prune our features through feature selection.

Feature Selection

Extremely Randomized Trees Classifier (Extra Trees Classifier) is a type of ensemble learning technique which aggregates the results of multiple de-correlated decision trees collected in a "forest" to output its classification result. In concept, it is very similar to a Random Forest Classifier and differs from it only in the way the decision trees in the forest are constructed. Check this article for more details: Analytics Vidhya

Let’s check the feature importance and prune our features to make our model perform well.
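A sketch of the importance ranking with ExtraTreesClassifier; the number of trees and the plot style are assumptions:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

X = df.drop('Revenue', axis=1)
y = df['Revenue']

# Fit Extra Trees and pull out the impurity-based feature importances
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)
importances.sort_values().plot(kind='barh', figsize=(8, 6))
plt.title('Feature importances (Extra Trees)')
plt.show()
```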

From the bar plot we can see the importance of each feature based on its impact on the output. Let's take up the top 14 features.

Train and Test Split (80:20)

Let’s drop the required features and split the data into train and test
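A sketch of the split, keeping the 14 highest-ranked features from the Extra Trees step:

```python
from sklearn.model_selection import train_test_split

# Keep the top 14 features by importance and drop the rest
top_features = importances.sort_values(ascending=False).head(14).index
X = X[top_features]

# 80:20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)
```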

Modelling our Data

Let's enter the crucial phase of building THE machine learning model. Before asking "what could be the best algorithm for prediction?", we have to decide on the "why". It is highly important.

Why?

Our main aim is to predict whether a revenue transaction is made in a session, based on the values of its features. The output is either going to be 0 or 1, so we can use classification models for our problem.

What ?

To decide on the best possible classification model, let's not waste time running models one by one. Instead, we write quality code by creating a pipeline and checking the accuracy of all the models at once. After that, we will select one model based on its accuracy.

Model Selection

Using Cross validation

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. Read this article to understand more: Analytics Vidhya
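A sketch of the comparison with 10-fold cross-validation; the shortlist of candidate models is an assumption, not necessarily the exact set used in the original notebook:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Candidate classifiers to compare on cross-validated accuracy
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=10, scoring='accuracy')
    print(f'{name}: {scores.mean():.3f} (+/- {scores.std():.3f})')
```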

From the test results, we can choose Random Forest as our model. It gave us more accurate results since it is an ensemble model. Next, we will also test our choice via pipelines.

Using Pipelines

A machine learning pipeline is used to help automate machine learning workflows. It operates by chaining together a sequence of data transformations and a model that can be trained and evaluated to achieve an outcome, whether positive or negative. Read more in this article: Analytics Vidhya
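A sketch of the pipeline comparison; the step names and the choice of models here are assumptions:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Each pipeline chains scaling and a classifier, so every model is
# trained and evaluated under the same sequence of steps
pipelines = {
    'Logistic Regression': Pipeline([('scaler', StandardScaler()),
                                     ('clf', LogisticRegression(max_iter=1000))]),
    'KNN': Pipeline([('scaler', StandardScaler()),
                     ('clf', KNeighborsClassifier())]),
    'Random Forest': Pipeline([('scaler', StandardScaler()),
                               ('clf', RandomForestClassifier(random_state=0))]),
}

for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)
    print(f'{name}: test accuracy = {pipe.score(X_test, y_test):.3f}')
```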

Great, we have got pretty accurate results from our models. We can see that Random Forest has the highest accuracy, being an ensemble model; it usually does have higher accuracy. Let's select it as the best model.

Random Forest with Hyper-parameter Tuning

In machine learning, hyper-parameter optimization or tuning is the problem of choosing a set of optimal hyper-parameters for a learning algorithm. These settings are called hyper-parameters, and they have to be tuned so that the model can optimally solve the machine learning problem. You can read more details in this blog: Analytics Vidhya

Let’s build a random forest classifier model with hyper-parameter tuning
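A sketch using GridSearchCV; the parameter grid below is hypothetical, and the grid in the original notebook may differ:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Hypothetical search grid for the random forest
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'max_features': ['sqrt', 'log2'],
}

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

print('Best parameters:', grid.best_params_)
print('Best mean CV accuracy:', grid.best_score_)
```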

We have got the best parameters for the model and the mean accuracy is 90.3%

Fitting all the parameters to the model

Let’s fit all the parameters we derived by hyper-parameter tuning into the actual model
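A sketch of refitting with the tuned parameters and scoring on the held-out test set:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Rebuild the classifier with the best parameters found by the grid search
best_rf = RandomForestClassifier(**grid.best_params_, random_state=0)
best_rf.fit(X_train, y_train)

y_pred = best_rf.predict(X_test)
print('Test accuracy:', accuracy_score(y_test, y_pred))
```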

We have got 90.3% accuracy. Great

What good is analytics without any insights derived?

This is a short appendix to our article, to showcase the EDA of this dataset and suggest a few strategies to improve the marketing efforts of the firm.

The high importance of PageValues suggests that customers look at considerably different products and their recommendations. Hence, a significant improvement in recommendation engines and bundle packages would bring in more conversions. Including more products, exploiting the long-tail effect in e-commerce, will also bring in more revenue drivers.

Here are some pointers that can help improve the conversion rate:

  1. Following a minimalist approach in UI
  2. Being transparent to the visitors about the prices and information of product
  3. Improving the stay duration by providing them targeted ads like discounts and offers
  4. Reducing bounce rates through faster page loads and an attractive landing page with highly targeted products exclusive to the visitors
  5. Personalized emails for each visitor, and engaging loyal visitors (returning visitors) through coupons and exclusive access to products
