Instacart Market Basket Analysis: Part 2 (Feature Engineering & Modelling)
This is a 3-part series covering an end-to-end case study based on a Kaggle problem.
In the last post we discussed the ML approach to this problem and drew some conclusions from Exploratory Data Analysis; refer to Part 1.
Table of Contents
- Modelling Strategy
- Feature Engineering
- Generate Training and Test Data
- Training Models
- Generate Submission Files
- Improve the model
- References
Modelling Strategy
Strategy 1
Generate Training Data (using prior_orders_data)
- For every user in prior_orders_data, take their first n-1 orders (out of n) for feature engineering.
- The nth order of every user will be used to label the dependent variable, i.e. reordered.
Example:
Let user A have 90 orders in prior_orders_data.
- Build features using the first 89 orders.
- Based on these features, label each product he bought in those 89 orders with reordered (0/1), depending on whether it appeared in his 90th order.
Generate Validation Data (using train_orders_data)
- Now that our training data is generated from prior_orders_data, we can leverage train_orders_data (which contains 1 order per user) to test our trained model.
- We will predict the product reorder probabilities with the trained model.
- Then we will pick the top products, i.e. those whose reorder probability is high and which maximize the F1 score.
- We will use Faron's F1 optimization code to do this.
- Compare the actual F1 score with the calculated F1 score. This gives us an idea of how effective the model is.
Generate Test Data (from orders.csv with eval_set == 'test')
- Add the features built on the training data, based on orders and users.
- For every order and product, predict whether it is reordered (0/1).
- Then we will pick the top products, i.e. those whose reorder probability is high and which maximize the F1 score.
- We will use Faron's F1 optimization code to pick the products maximizing the F1 score.
Strategy 2
Generate Training Data (using prior_orders_data and train_orders_data)
- Build features on prior_orders_data.
- The order from train_orders_data for every user will be used to label the dependent variable, i.e. reordered.
- We will predict the product reorder probability.
- Then we will pick the top products, i.e. those whose reorder probability is high.
Generate Test Data (from orders.csv with eval_set == 'test')
- Add the features built on the training data, based on orders and users.
- For every order and product, predict whether it is reordered (0/1).
- Then we will pick the top products, i.e. those whose reorder probability is high.
After training models with both approaches, Strategy 2 produced slightly better results, so we will proceed with Strategy 2.
Feature Engineering
We want to predict:
User A → will buy Product B → in his next order C → reordered (1/0)?
This future order ID is obtained from the train and test orders in orders.csv.
This structure is inspired by Symeon Kokovidis's kernel.
Generate product-only features
- feat_1 : product_reorder_rate : How frequently was the product reordered, regardless of user preference?
- feat_2 : average_pos_incart : What is the average position of the product in the cart?
The next 3 features are derived from binary indicators of the product being:
- isorganic
- isYogurt — aisle
- produce — department
- isFrozen — department
- isdairy — department
- isbreakfast — department
- issnack — department
- isbeverage — department
These indicators were picked because they correspond to the most reordered product types/aisles/departments. The indicator columns are then reduced to 3 columns using Non-Negative Matrix Factorization (NMF) to reduce sparsity (a sketch follows the feature list below):
- feat_3 : p_reduced_feat_1 : column 1 from the NMF output
- feat_4 : p_reduced_feat_2 : column 2 from the NMF output
- feat_5 : p_reduced_feat_3 : column 3 from the NMF output
- feat_6 : aisle_reorder_rate : How frequently is a product reordered from the aisle to which this product belongs?
- feat_7 : department_reorder_rate : How frequently is a product reordered from the department to which this product belongs?
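As a rough sketch of the NMF step (the frame and column names here are assumptions for illustration, not the post's actual code):

```python
import pandas as pd
from sklearn.decomposition import NMF

# products_df is assumed: one row per product_id with the 0/1
# indicator columns listed above.
indicator_cols = ["isorganic", "isYogurt", "produce", "isFrozen",
                  "isdairy", "isbreakfast", "issnack", "isbeverage"]

# Reduce the 8 sparse indicators to 3 dense non-negative components.
nmf = NMF(n_components=3, random_state=42)
reduced = nmf.fit_transform(products_df[indicator_cols])

products_df["p_reduced_feat_1"] = reduced[:, 0]
products_df["p_reduced_feat_2"] = reduced[:, 1]
products_df["p_reduced_feat_3"] = reduced[:, 2]
```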
Generate user-only features
- feat_1 : user_reorder_rate : What is the average reorder rate across the orders placed by a user?
- feat_2 : user_unique_products : How many distinct products has the user ordered?
- feat_3 : user_total_products : How many products has the user ordered in total?
- feat_4 : user_avg_cart_size : How many products per order does the user buy on average (i.e. average cart size)?
- feat_5 : user_avg_days_between_orders : What is the average number of days between 2 orders by the user?
- feat_6 : user_reordered_products_ratio : number of unique products reordered / number of unique products ordered
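A minimal sketch of how such aggregates could be computed with pandas, assuming a prior_orders_data frame with one row per (user, order, product):

```python
import pandas as pd

# One row per (user, order, product) in the prior data.
user_features = prior_orders_data.groupby("user_id").agg(
    user_reorder_rate=("reordered", "mean"),
    user_unique_products=("product_id", "nunique"),
    user_total_products=("product_id", "count"),
)

# The days gap repeats on every product row of an order, so
# de-duplicate to order level before averaging.
order_gaps = prior_orders_data.drop_duplicates("order_id")
user_features["user_avg_days_between_orders"] = (
    order_gaps.groupby("user_id")["days_since_prior_order"].mean()
)

# Average cart size = total products / number of orders.
n_orders = prior_orders_data.groupby("user_id")["order_id"].nunique()
user_features["user_avg_cart_size"] = (
    user_features["user_total_products"] / n_orders
)
```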
Generate user-product features
Now that we have created product-only and user-only features, we will create features based on how a user interacts with a product.
- feat_1 : u_p_order_rate : How frequently has the user ordered the product?
- feat_2 : u_p_reorder_rate : How frequently has the user reordered the product?
- feat_3 : u_p_avg_position : What is the average position of the product in the cart across the orders placed by the user?
- feat_4 : u_p_orders_since_last : How many orders has the user placed since the product was last ordered?
- feat_5 : max_streak : The longest run of consecutive orders in which the user bought the product without a miss.
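A sketch of the first three, assuming the raw Instacart column names (add_to_cart_order etc.):

```python
import pandas as pd

# One row per (user, product) pair.
u_p = (
    prior_orders_data
    .groupby(["user_id", "product_id"])
    .agg(u_p_times_ordered=("order_id", "nunique"),
         u_p_times_reordered=("reordered", "sum"),
         u_p_avg_position=("add_to_cart_order", "mean"))
    .reset_index()
)

# Normalise counts by each user's total number of orders to get rates.
user_n_orders = (prior_orders_data.groupby("user_id")["order_id"]
                 .nunique().rename("user_n_orders"))
u_p = u_p.merge(user_n_orders, on="user_id")
u_p["u_p_order_rate"] = u_p["u_p_times_ordered"] / u_p["user_n_orders"]
u_p["u_p_reorder_rate"] = u_p["u_p_times_reordered"] / u_p["user_n_orders"]
```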
Merge the above features:
Now merge these independent features (user-only, product-only, and user-product features); call the result merged_df (a sketch follows).
This dataframe will contain features for all user-product pairs seen in the prior data, some of which will be used for training the models (using train orders) and some for testing (using test orders).
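Using the frame names from the sketches above, the merge could look like this:

```python
import pandas as pd

# Attach product-only and user-only features to every observed
# (user, product) pair.
merged_df = (
    u_p
    .merge(products_df, on="product_id", how="left")
    .merge(user_features.reset_index(), on="user_id", how="left")
)
```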
Misc Features
a. Product features based on hour of day
feature : reorder frequency of a product at any given hour of the day (hour_reorder_rate).
b. Product features based on day of week
feature : reorder frequency of a product on any given day of the week.
c. Product features based on the gap between 2 orders
feature : how frequently a product was reordered given the difference (in days) between the 2 orders containing the product.
d. User features based on the gap between 2 orders
feature : how frequently a user reorders any product given the difference (in days) between 2 orders.
e. User-product reorder rate based on the gap between 2 orders
feature : how frequently the user reordered the product given the difference (in days) between 2 orders. (Feature (a) is sketched below; the others follow the same pattern.)
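For illustration only (order_hour_of_day comes from the raw orders.csv schema):

```python
# Reorder frequency of each product at each hour of the day.
hour_reorder_rate = (
    prior_orders_data
    .groupby(["product_id", "order_hour_of_day"])["reordered"]
    .mean()
    .rename("hour_reorder_rate")
    .reset_index()
)
```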
Merge all data (Best thing about pandas)
Generate Training and Test Data
This step merges merged_df (from above) with the train orders data to generate the training data; similarly, we merge the test orders with merged_df to generate the test data.
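A minimal sketch of the training-data side, assuming the raw orders and order_products__train frames carry their standard Instacart column names (order_products_train below is an assumed variable name):

```python
# Candidate rows: every (user, product) pair, tied to the user's
# future (train) order.
train_orders = orders[orders["eval_set"] == "train"]
train_df = merged_df.merge(
    train_orders[["user_id", "order_id"]], on="user_id", how="inner")

# Label: 1 if the product actually appears in that train order,
# otherwise 0.
labels = order_products_train[["order_id", "product_id", "reordered"]]
train_df = train_df.merge(labels, on=["order_id", "product_id"], how="left")
train_df["reordered"] = train_df["reordered"].fillna(0).astype("uint8")
```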
Let's have a sneak peek at our training and test data.
As we can see, we do not have the reordered column for the test data (that is what we will predict).
Additional step
Before we start training models, we reduce the size of our dataframe (currently ~3 GB) to ~0.6 GB by changing the default dtypes of its columns. Casting columns from their default dtypes (int64/float64/object) to lower-range alternatives cut the size to roughly a fifth.
Ex — int64 → uint8 (for department ID, aisle ID)
We save this frame in HDF5 format, because CSV does not store dtypes, so reloading from CSV would reset them back to the defaults.
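A sketch of both steps (the exact per-column mapping is an assumption for illustration):

```python
import pandas as pd

# Downcast columns to the smallest dtype that still holds their range.
train_df = train_df.astype({
    "user_id": "uint32",
    "product_id": "uint32",
    "aisle_id": "uint8",
    "department_id": "uint8",
    "reordered": "uint8",
})

# 32-bit floats are precise enough for the engineered rate features.
float_cols = train_df.select_dtypes("float64").columns
train_df[float_cols] = train_df[float_cols].astype("float32")

# HDF5 preserves dtypes on reload, unlike CSV.
train_df.to_hdf("train_df.h5", key="train", mode="w")
```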
Training Models
This will be a comparison study of different approaches, from which we select the best performing one. For each model, we judge performance based on the Kaggle score and the logloss.
We will use 2 approaches to get the results:
a. Global Threshold (0.18, 0.19, 0.2):
These global thresholds were selected based on:
- An ad-hoc approach: we uploaded many submission files using different thresholds and saw that beyond 0.2 the F1 score started to decrease.
- Strategy 1, discussed above: we tested on train_orders to arrive at global thresholds.
As seen above, the mean F1 drops after a probability threshold of 0.2, and the highest scores were at 0.18, 0.19 and 0.2.
For every model, we will generate 3 results (submission files), one per threshold.
We will pick only those products whose predicted reorder probability is ≥ the given threshold, else None (a sketch follows).
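A minimal sketch of this rule (not the post's exact function; the column names order_id, product_id and pred_proba are assumptions):

```python
def submission_for_threshold(test_df, threshold):
    """Per order, keep products whose predicted reorder probability is
    >= threshold; orders with nothing kept get the string 'None'."""
    picked = test_df[test_df["pred_proba"] >= threshold]
    products = (picked.groupby("order_id")["product_id"]
                .apply(lambda ids: " ".join(map(str, ids))))
    sub = test_df[["order_id"]].drop_duplicates()
    sub["products"] = sub["order_id"].map(products).fillna("None")
    return sub
```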
b. Local Threshold (F1 Maximization)
As described in the last post, we will use Faron's implementation of F1 maximization, so that every order gets its own local threshold and we pick the products that maximize the expected F1 score.
Here are some examples where different orders end up with different thresholds (these examples were generated to debug the models after training them).
As seen from the above examples, a local threshold for every order can boost the F1 score.
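Faron's actual script solves this with an exact dynamic program over the sorted probabilities; as a simplified illustration of the idea only (not his code), one can approximate the expected F1 of keeping the top k products under an independence assumption and pick the best k:

```python
import numpy as np

def pick_top_k(probs):
    """Sort products by predicted probability, approximate the
    expected F1 of keeping the top k, and return the best k."""
    p = np.sort(np.asarray(probs))[::-1]
    cum = np.cumsum(p)
    k = np.arange(1, len(p) + 1)
    precision = cum / k            # expected precision of the top k
    recall = cum / cum[-1]         # expected recall of the top k
    f1 = 2 * precision * recall / (precision + recall)
    return int(k[np.argmax(f1)])   # order-specific local cut-off
```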
Let's train some models.
Model 1 : Logistic Regression
Logloss on validation data : 0.2550918280106341
Model 2: Decision Tree
Logloss on validation data : 0.2509911734828939
Model 3: Random Forest Classifier
Logloss on validation data : 0.25187675305313206
Model 4 : Multi-Layer Perceptron
Training Accuracy : 90.75 %
Validation Accuracy : 90.74 %
Logloss on validation data : 0.2513314122715033
Model 5 : XGBoost Classifier
Logloss on validation data : 0.24345293402046597
We can see that hour_reorder_rate (one of the misc features) has the highest importance.
Model 6 : CatBoost Classifier
Logloss on validation data : 0.24300858358388394
We can see that u_p_orders_since_last has the highest importance here, whereas for XGB it was second to last.
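Each of the models above follows the same train-and-evaluate pattern; as a sketch, here is what it could look like with XGBoost (the hyperparameters and column names are illustrative assumptions, not the tuned values used here):

```python
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Everything except identifiers and the label is a feature.
feature_cols = [c for c in train_df.columns
                if c not in ("user_id", "product_id", "order_id", "reordered")]
X_train, X_val, y_train, y_val = train_test_split(
    train_df[feature_cols], train_df["reordered"],
    test_size=0.2, random_state=42)

model = XGBClassifier(n_estimators=500, max_depth=6,
                      learning_rate=0.1, eval_metric="logloss")
model.fit(X_train, y_train)

val_proba = model.predict_proba(X_val)[:, 1]
print("Logloss on validation data:", log_loss(y_val, val_proba))
```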
Note:
One thing I want to point out here is that days_since_prior_order is an important feature for both the XGB and CatBoost models. But here is the catch: for any future order after deploying the model, we can't have a days_since_prior_order value, as we don't know the user's last order date. We will handle this in the deployment section.
Generate Submission Files
Let's generate the submission files (for both global and local thresholds).
- The function globl_threshold_products → generates submission files based on the global thresholds (0.18, 0.19, 0.20).
- The function getscores_on_testdata → generates a submission file based on local thresholds that maximize the F1 score.
We will do this step for every model and submit the files on Kaggle to check their performance.
Model comparison
We see that the CatBoost classifier with a local threshold scored highest.
Improve the model
We can improve the CatBoost model slightly.
During my trials with different models, I found that, instead of splitting the training data randomly into training and validation sets, splitting it by user improves the models.
Although the improvement is small, it was significant enough to mention here (a sketch of the split follows).
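A sketch of the user-based split using scikit-learn's GroupShuffleSplit, so that no user's rows appear in both splits:

```python
from sklearn.model_selection import GroupShuffleSplit

# Every user's rows land entirely in either the training split or the
# validation split, never both.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(train_df,
                                         groups=train_df["user_id"]))
train_part = train_df.iloc[train_idx]
val_part = train_df.iloc[val_idx]
```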
Let's see how much the model improved after we split our training data by user.
That marks the end of our modelling stage.
Next up
- Part 3: Deployment