Instacart Market Basket Analysis: Part 2 (Feature Engineering & Modelling)
This is a 3-part series covering an end-to-end case study based on a Kaggle problem.
In the last post we discussed the ML approach to this problem and drew some conclusions from Exploratory Data Analysis; refer to Part 1.
Table of Contents
- Modelling Strategy
- Feature Engineering
- Generate Training and Test Data
- Training Models
- Generate Submission Files
- Improve the model
- References
Modelling Strategy
Strategy 1
Generate Training Data (using prior_orders_data)
- For every user in prior_orders_data, take their first n-1 orders (out of n) for feature engineering.
- The nth order of every user will be used to label the dependent variable, i.e. reordered.
Example:
Let user A have 90 orders in prior_orders_data.
- Build features using the first 89 orders.
- Based on these features, label each product he bought in those 89 orders with reordered (0/1), depending on whether it appeared in his 90th order.
Generate Validation Data (using train_orders_data)
- Now that our training data is generated from prior_orders_data, we can leverage train_orders_data (which contains 1 order per user) to test our trained model.
- We will predict the product reorder probabilities with the trained model.
- Then we will pick the top products, i.e. those whose reorder probability is high and which maximize the F1 score.
- We will use Faron's F1 optimization code to do this.
- Compare the actual F1 score with the calculated F1 score. This gives us an idea of how effective the model is.
Generate Test Data (from orders.csv with eval_set == 'test')
- Add the features built on the training data, based on orders and users.
- For every order and product, predict whether it is reordered (0/1).
- Then we will pick the top products, i.e. those whose reorder probability is high and which maximize the F1 score.
- We will use Faron's F1 optimization code to pick the products maximizing the F1 score.
Strategy 2
Generate Training Data (using prior_orders_data and train_orders_data)
- Build features on prior_orders_data.
- The order from train_orders_data for every user will be used to label the dependent variable, i.e. reordered.
- We will predict the product reorder probability.
- Then we will pick the top products, i.e. those whose reorder probability is high.
Generate Test Data (from orders.csv with eval_set == 'test')
- Add the features built on the training data, based on orders and users.
- For every order and product, predict whether it is reordered (0/1).
- Then we will pick the top products, i.e. those whose reorder probability is high.
After training models with both approaches, Strategy 2 produced slightly better results, so we will proceed with Strategy 2.
Feature Engineering
We want to predict:
User A → will buy Product B → in his next order C → reordered (1/0)?
This future order ID is obtained from the train and test orders in orders.csv.
This structure is inspired by Symeon Kokovidis's kernel.
Generate product-only features
- feat_1 : product_reorder_rate : How frequently was the product reordered, regardless of user preference?
- feat_2 : average_pos_incart : What is the average position of the product in the cart?
The next 3 features are derived from binary indicators of the product being:
- isorganic
- isYogurt — aisle
- produce — department
- isFrozen — department
- isdairy — department
- isbreakfast — department
- issnack — department
- isbeverage — department
These indicators were picked because they correspond to the most reordered product types/aisles/departments. The indicator columns are then reduced to 3 columns using Non-Negative Matrix Factorization (NMF) to reduce sparsity (a sketch follows the feature list below):
- feat_3 : p_reduced_feat_1 : column 1 from the NMF output
- feat_4 : p_reduced_feat_2 : column 2 from the NMF output
- feat_5 : p_reduced_feat_3 : column 3 from the NMF output
- feat_6 : aisle_reorder_rate : How frequently is a product reordered from the aisle to which this product belongs?
- feat_7 : department_reorder_rate : How frequently is a product reordered from the department to which this product belongs?
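As a rough sketch of the NMF step (the frame and column names here are assumptions for illustration, not the post's actual code):

```python
import pandas as pd
from sklearn.decomposition import NMF

# products_df is assumed: one row per product_id with the 0/1
# indicator columns listed above.
indicator_cols = ["isorganic", "isYogurt", "produce", "isFrozen",
                  "isdairy", "isbreakfast", "issnack", "isbeverage"]

# Reduce the 8 sparse indicators to 3 dense non-negative components.
nmf = NMF(n_components=3, random_state=42)
reduced = nmf.fit_transform(products_df[indicator_cols])

products_df["p_reduced_feat_1"] = reduced[:, 0]
products_df["p_reduced_feat_2"] = reduced[:, 1]
products_df["p_reduced_feat_3"] = reduced[:, 2]
```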
Generate user-only features
- feat_1 : user_reorder_rate : What is the average reorder rate across the orders placed by a user?
- feat_2 : user_unique_products : How many distinct products has the user ordered?
- feat_3 : user_total_products : How many products has the user ordered in total?
- feat_4 : user_avg_cart_size : How many products per order does the user buy on average (i.e. average cart size)?
- feat_5 : user_avg_days_between_orders : What is the average number of days between 2 orders by the user?
- feat_6 : user_reordered_products_ratio : number of unique products reordered / number of unique products ordered
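A minimal sketch of how such aggregates could be computed with pandas, assuming a prior_orders_data frame with one row per (user, order, product):

```python
import pandas as pd

# One row per (user, order, product) in the prior data.
user_features = prior_orders_data.groupby("user_id").agg(
    user_reorder_rate=("reordered", "mean"),
    user_unique_products=("product_id", "nunique"),
    user_total_products=("product_id", "count"),
)

# The days gap repeats on every product row of an order, so
# de-duplicate to order level before averaging.
order_gaps = prior_orders_data.drop_duplicates("order_id")
user_features["user_avg_days_between_orders"] = (
    order_gaps.groupby("user_id")["days_since_prior_order"].mean()
)

# Average cart size = total products / number of orders.
n_orders = prior_orders_data.groupby("user_id")["order_id"].nunique()
user_features["user_avg_cart_size"] = (
    user_features["user_total_products"] / n_orders
)
```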
Generate user-product features
Now that we have created product-only and user-only features, we will create features based on how a user interacts with a product.
- feat_1 : u_p_order_rate : How frequently has the user ordered the product?
- feat_2 : u_p_reorder_rate : How frequently has the user reordered the product?
- feat_3 : u_p_avg_position : What is the average position of the product in the cart across the orders placed by the user?
- feat_4 : u_p_orders_since_last : How many orders has the user placed since the product was last ordered?
- feat_5 : max_streak : The longest run of consecutive orders in which the user bought the product without a miss.
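A sketch of the first three, assuming the raw Instacart column names (add_to_cart_order etc.):

```python
import pandas as pd

# One row per (user, product) pair.
u_p = (
    prior_orders_data
    .groupby(["user_id", "product_id"])
    .agg(u_p_times_ordered=("order_id", "nunique"),
         u_p_times_reordered=("reordered", "sum"),
         u_p_avg_position=("add_to_cart_order", "mean"))
    .reset_index()
)

# Normalise counts by each user's total number of orders to get rates.
user_n_orders = (prior_orders_data.groupby("user_id")["order_id"]
                 .nunique().rename("user_n_orders"))
u_p = u_p.merge(user_n_orders, on="user_id")
u_p["u_p_order_rate"] = u_p["u_p_times_ordered"] / u_p["user_n_orders"]
u_p["u_p_reorder_rate"] = u_p["u_p_times_reordered"] / u_p["user_n_orders"]
```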
Merge the above features:
Now merge these independent features (user-only, product-only, and user-product features); call the result merged_df (a sketch follows).
This dataframe will contain features for all user-product pairs seen in the prior data, some of which will be used for training the models (using train orders) and some for testing (using test orders).
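Using the frame names from the sketches above, the merge could look like this:

```python
import pandas as pd

# Attach product-only and user-only features to every observed
# (user, product) pair.
merged_df = (
    u_p
    .merge(products_df, on="product_id", how="left")
    .merge(user_features.reset_index(), on="user_id", how="left")
)
```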
Misc Features
a. Product features based on hour of day
feature : reorder frequency of a product at any given hour of the day (hour_reorder_rate).
b. Product features based on day of week
feature : reorder frequency of a product on any given day of the week.
c. Product features based on the gap between 2 orders
feature : how frequently a product was reordered given the difference (in days) between the 2 orders containing the product.
d. User features based on the gap between 2 orders
feature : how frequently a user reorders any product given the difference (in days) between 2 orders.
e. User-product reorder rate based on the gap between 2 orders
feature : how frequently the user reordered the product given the difference (in days) between 2 orders. (Feature (a) is sketched below; the others follow the same pattern.)
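For illustration only (order_hour_of_day comes from the raw orders.csv schema):

```python
# Reorder frequency of each product at each hour of the day.
hour_reorder_rate = (
    prior_orders_data
    .groupby(["product_id", "order_hour_of_day"])["reordered"]
    .mean()
    .rename("hour_reorder_rate")
    .reset_index()
)
```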
Merge all data (Best thing about pandas)
Generate Training and Test Data
This step merges merged_df (from above) with the train orders data to generate the training data; similarly, we merge the test orders with merged_df to generate the test data.
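A minimal sketch of the training-data side, assuming the raw orders and order_products__train frames carry their standard Instacart column names (order_products_train below is an assumed variable name):

```python
# Candidate rows: every (user, product) pair, tied to the user's
# future (train) order.
train_orders = orders[orders["eval_set"] == "train"]
train_df = merged_df.merge(
    train_orders[["user_id", "order_id"]], on="user_id", how="inner")

# Label: 1 if the product actually appears in that train order,
# otherwise 0.
labels = order_products_train[["order_id", "product_id", "reordered"]]
train_df = train_df.merge(labels, on=["order_id", "product_id"], how="left")
train_df["reordered"] = train_df["reordered"].fillna(0).astype("uint8")
```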
Let's have a sneak peek at our training and test data.
As we can see, we do not have the reordered column for the test data (that is what we will predict).
Additional step
Before we start training models, we reduce the size of our dataframe (currently ~3 GB) to ~0.6 GB by changing the default dtypes of its columns. Casting columns from their default dtypes (int64/float64/object) to lower-range alternatives cut the size to roughly a fifth.
Ex — int64 → uint8 (for department ID, aisle ID)
We save this frame in HDF5 format, because CSV does not store dtypes, so reloading from CSV would reset them back to the defaults.
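A sketch of both steps (the exact per-column mapping is an assumption for illustration):

```python
import pandas as pd

# Downcast columns to the smallest dtype that still holds their range.
train_df = train_df.astype({
    "user_id": "uint32",
    "product_id": "uint32",
    "aisle_id": "uint8",
    "department_id": "uint8",
    "reordered": "uint8",
})

# 32-bit floats are precise enough for the engineered rate features.
float_cols = train_df.select_dtypes("float64").columns
train_df[float_cols] = train_df[float_cols].astype("float32")

# HDF5 preserves dtypes on reload, unlike CSV.
train_df.to_hdf("train_df.h5", key="train", mode="w")
```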
Training Models
This will be a comparison study of different approaches, from which we select the best performing one. For each model, we judge performance based on the Kaggle score and the logloss.
We will use 2 approaches to get the results:
a. Global Threshold (0.18, 0.19, 0.2):
These global thresholds were selected based on:
- An ad-hoc approach: we uploaded many submission files using different thresholds and saw that beyond 0.2 the F1 score started to decrease.
- Strategy 1, discussed above: we tested on train_orders to arrive at global thresholds.
As seen above, the mean F1 drops after a probability threshold of 0.2, and the highest scores were at 0.18, 0.19 and 0.2.
For every model, we will generate 3 results (submission files), one per threshold.
We will pick only those products whose predicted reorder probability is ≥ the given threshold, else None (a sketch follows).
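A minimal sketch of this rule (not the post's exact function; the column names order_id, product_id and pred_proba are assumptions):

```python
def submission_for_threshold(test_df, threshold):
    """Per order, keep products whose predicted reorder probability is
    >= threshold; orders with nothing kept get the string 'None'."""
    picked = test_df[test_df["pred_proba"] >= threshold]
    products = (picked.groupby("order_id")["product_id"]
                .apply(lambda ids: " ".join(map(str, ids))))
    sub = test_df[["order_id"]].drop_duplicates()
    sub["products"] = sub["order_id"].map(products).fillna("None")
    return sub
```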
b. Local Threshold (F1 Maximization)
As described in the last post, we will use Faron's implementation of F1 maximization, so that every order gets its own local threshold and we pick the products that maximize the expected F1 score.
Here are some examples where different orders end up with different thresholds (these examples were generated to debug the models after training them).
As seen from the above examples, a local threshold for every order can boost the F1 score.
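Faron's actual script solves this with an exact dynamic program over the sorted probabilities; as a simplified illustration of the idea only (not his code), one can approximate the expected F1 of keeping the top k products under an independence assumption and pick the best k:

```python
import numpy as np

def pick_top_k(probs):
    """Sort products by predicted probability, approximate the
    expected F1 of keeping the top k, and return the best k."""
    p = np.sort(np.asarray(probs))[::-1]
    cum = np.cumsum(p)
    k = np.arange(1, len(p) + 1)
    precision = cum / k            # expected precision of the top k
    recall = cum / cum[-1]         # expected recall of the top k
    f1 = 2 * precision * recall / (precision + recall)
    return int(k[np.argmax(f1)])   # order-specific local cut-off
```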
Let's train some models.
Model 1 : Logistic Regression
Logloss on validation data : 0.2550918280106341
Model 2: Decision Tree
Logloss on validation data : 0.2509911734828939
Model 3: Random Forest Classifier
Logloss on validation data : 0.25187675305313206
Model 4 : Multi-Layer Perceptron
Training Accuracy : 90.75 %
Validation Accuracy : 90.74 %
Logloss on validation data : 0.2513314122715033
Model 5 : XGBoost Classifier
Logloss on validation data : 0.24345293402046597
We can see that hour_reorder_rate (one of the misc features) has the highest importance.
Model 6 : CatBoost Classifier
Logloss on validation data : 0.24300858358388394
We can see that u_p_orders_since_last has the highest importance here, whereas for XGB it was second to last.
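Each of the models above follows the same train-and-evaluate pattern; as a sketch, here is what it could look like with XGBoost (the hyperparameters and column names are illustrative assumptions, not the tuned values used here):

```python
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Everything except identifiers and the label is a feature.
feature_cols = [c for c in train_df.columns
                if c not in ("user_id", "product_id", "order_id", "reordered")]
X_train, X_val, y_train, y_val = train_test_split(
    train_df[feature_cols], train_df["reordered"],
    test_size=0.2, random_state=42)

model = XGBClassifier(n_estimators=500, max_depth=6,
                      learning_rate=0.1, eval_metric="logloss")
model.fit(X_train, y_train)

val_proba = model.predict_proba(X_val)[:, 1]
print("Logloss on validation data:", log_loss(y_val, val_proba))
```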
Note:
One thing I want to point out here is that days_since_prior_order is an important feature for both the XGB and CatBoost models. But here is the catch: for any future order after deploying the model, we can't have a days_since_prior_order value, as we don't know the user's last order date. We will handle this in the deployment section.
Generate Submission Files
Let's generate the submission files (for both global and local thresholds).
- The function globl_threshold_products → generates submission files based on the global thresholds (0.18, 0.19, 0.20).
- The function getscores_on_testdata → generates a submission file based on local thresholds that maximize the F1 score.
We will do this step for every model and submit the files on Kaggle to check their performance.
Model comparison
We see that the CatBoost classifier with a local threshold scored highest.
Improve the model
We can improve the CatBoost model slightly.
During my trials with different models, I found that, instead of splitting the training data randomly into training and validation sets, splitting it by user improves the models.
Although the improvement is small, it was significant enough to mention here (a sketch of the split follows).
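A sketch of the user-based split using scikit-learn's GroupShuffleSplit, so that no user's rows appear in both splits:

```python
from sklearn.model_selection import GroupShuffleSplit

# Every user's rows land entirely in either the training split or the
# validation split, never both.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(train_df,
                                         groups=train_df["user_id"]))
train_part = train_df.iloc[train_idx]
val_part = train_df.iloc[val_idx]
```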
Let's see how much the model improved after we split our training data by user.
That marks the end of our modelling stage.
Next up
- Part 3: Deployment