Instacart Market Basket Analysis : Part 1 (Introduction & EDA)
This is a 3-part series on end to end case study based on Kaggle problem.
Table of Contents
- Introduction
- Why this is a ML problem ?
- How to approach this problem ?
- Existing Solutions
- F1- Score Maximization
- Exploratory Data Analysis
- References
Introduction
Business Problem
Ordering food supplies online is a new way of restocking groceries and other essential items. Be it early morning or midnight, ordering groceries online is stress-free activity without much hassle. But what happens when you forget few items while adding items to the cart or want to get better suggestions on your items? Will you wait for a couple of hours and then order? To deal with such situations, users are provided with suggestions based on their past orders or user preferences.
Instacart, a grocery order and the delivery app with over 500 Million products and 40000 stores serves across U.S. & Canada. Instacart provides a user experience where you will get product recommendation based on your previous orders.
Instacart provided us with transactional data of customer orders over time to predict which previously purchased products will be in a user’s next order. This data is open-sourced and given as a Kaggle challenge.
Real-World / Business Objectives and Constraints
The objective is to predict which products will be in a user’s next order. The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, Instacart provided between 4 and 100 of their orders, along with the sequence in which products were placed in the cart.
Data
Data can be broadly divided into 3 parts.
- Prior data : Order history of every user . This data contains nearly 3–100 past orders per user
- Train data : Current order data of every user . This data contains only 1 order per user
- Test data : Future order data of every user . This data will not contain any product information ( We need to predict that )
orders.csv — consists of order details placed by any user — shape: (3421083, 7)
- Order_id : Unique for every order
- User_id : Unique for every user
- Eval_set : ( prior / train / test)
- Order_number : ith order placed by user
- Order_dow : Day of week
- Order_hour_of_day : Time of day in hr
- Days_since_prior_order : difference in days between 2 orders
order_products__prior.csv — consists of all product details for any prior order, shape: (32434489, 4)
- order_id : Unique order id for every order
- product_id : product ID of item
- add_to_cart_order : denotes the sequence in which products were added to cart.
- reordered : product is reordered ? (1/0)
order_products__train.csv — consists of all product details for a train order, shape: (1384617, 4)
- order_id : Unique order id for every order
- product_id : product ID of item
- add_to_cart_order : denotes the sequence in which products were added to cart.
- reordered : product is reordered ? (1/0)
products.csv — details of a product, shape: (49688, 4)
- product_id : product ID of item
- product_name : name of product
- aisle_id : aisle id of the product
- department_id : department id of the product
Aisles.csv — details of aisles, shape: (134,2)
- aisle_id : aisle ID of item
- aisle_name : name of aisle
department.csv — details of department, shape: (21,2)
- department_id : product ID of item
- department_name : name of department
Why this is a Machine Learning Problem?
For the user to get product recommendations based on his past N orders, we need to observe patterns and generate rules which will give recommendations with high probability. Since we have over 3 Million data points, we need to automate this learning process and using Machine Learning we can achieve this to give probabilistic prediction. Machine Learning works better on large sets of data and generates rules from patterns learned from features.
Other Alternative would be a rule based system, which works best when we know the rules. But it’s very difficult to generate rules by going over all data samples manually and make sense of the patterns. This can’t guarantee in high predictive power
How to approach this problem ?
Based on these order history and user preferences of the product, we need to predict the products which could be reordered.
Type of Machine Learning Problem
At first, it seems like Multi-Label Classification, but there are 49688 products, and total product recommendations could be anywhere from None to N. Therefore, this problem is restructured into a binary classification problem, where we will predict the probability of an item being reordered by a user.
For each order, we will group these probabilities to pick top K probable products which will be reordered, and recommend those to the user.
Cost Function
Since we have converted this to a binary classification problem, we will use logloss as cost function, as we want the model to penalize incorrect classification and want high probability of correct class.
Performance Metric
Mean F1-Score — Mean of all F1 Scores for every order.
Since we need to know how many of the actual recommended products match with predicted ones, we will use F1 score on each order. For all orders combined, we will take mean of those F1-scores.
Existing Solutions
Kaggle Solution by Paulantoine (fixed threshold)
Here Paulantoine build 4 types of features:
- User Features
- Product Features
- Time Features
- User-Product Features
After converting the data into binary classification problem, he used fixed threshold ( using GridSearchCV ) to predict the possible set of recommendations for each order. Probability Threshold is used here to decide which product will be reordered by a user and same is recommended to the user.
Generally,
If P( X) > 0.5 → class 1 ( reorder in this case) else class 0. So, we can select those products which belong to class 1 ( i.e. P(X) > 0.5) and recommend them to user. But this threshold 0.5 can be changed in order to improve the performance of the model.
Paulantoine picked one global threshold by running GridSearchCV, i.e. 0.22
Problem with this approach:
Same threshold value can’t be used for every order by different users.
For example, likelihood for suggesting 5 products to user A is high with threshold = 0.2 since he has ordered many items in past, but same is not true for user B who has ordered very few items in past. Hence, there should be different thresholds for Users based on their order history.
Possible solution: Variable threshold for each order by each user
reference discussion forum: https://www.kaggle.com/c/instacart-market-basket-analysis/discussion/35048
Kaggle 2nd Place Solution by Kazuki Onodera ( Variable Threshold )
Score : 0.4082 on private LB
Github Link : Repo
Onodera build 4 category of features:
- User Features
- Product Features
- Time Features
- User-Product Features
Image ref: https://medium.com/kaggle-blog/instacart-market-basket-analysis-feda2700cded
23 models were built with over 70+ features to solve this problem using XGBoost. 17 models were used to predict None ( probability of a reordered product to be in user’s next order ) and 6 models were used to predict the reorder probability (probability of a product being reordered by user in next order). Weighted average of all these probabilities were used to generate recommendation such that they have high F1 Score.
To convert these probabilities to binary label and computing threshold, a special F1 Score Maximization algorithm is used which outputs the combination of products which resulted in maximum f1-score . The probability threshold is decided based on F1 score for each order.
F1 Score Maximization
To decide what will the probability threshold for each user (instead of, if P(X) > 0.5 → class 1)
Assuming we can choose from product A or product B for order 1
Let, reorder probability of product , P(A) = 0.9 and P(B)= 0.3,
Then, P(picking only A from A and B)= 0.9(1–0.3),
P(picking only B from A and B) = (1–0.9)0.3,
P(picking both A and B) = 0.9*0.3.
Now ,Expectation F1(A), E(F1(A)) =
expectation (F1 score of A | ground truth is A) * P (ground truth is A) + expectation (F1 score of A | ground truth is B) * P (ground truth is B) + expectation (F1 score of A | ground truth is A B) * P (ground truth is A B)
E(F1(A)) = 1.0 * 0.9(1–0.3) + 0.0 * 0.3 * (1–0.9) + (2/3) * 0.9 * 0.3 = 0.81
Similarly, E(F1(A B)) = (2/3) * 0.9(1–0.3) + (2/3)* 0.3 * (1–0.9) + 1.0* 0.9 * 0.3 = 0.71
We can see that F1 score of selecting product A only is greater than F1 score of selecting product A and B together.
so if probability threshold >= 0.9, only product A will get selected, as F1(A) > F1(A B)
But it is a tedious job to get E(F1) for every combination on every order ( Test set has 75000 orders). We will use below research paper which calculates F1 score for every possible combinations in O(n2) time using Dynamic programming
Research paper: Optimizing F-Measures: A Tale of Two Approaches
To get an item set I Є {I1, I2, I3…, In} which maximizes the F1 score for any order, we can employ the MLE algorithm which has been reduced down to dynamic programming equation described in this paper.
Author solves this optimization problem using dynamic programming in O(n2) time, O(n) space.
In this case study we will use the implementation provided by Faron, refer kernel.
Exploratory Data Analysis
EDA is important as it helps to answer many questions about the data which further leads to better feature engineering.
Missing Values
The dataset has no missing values except for in orders.csv
As we can see only 6% of values are missing from days_since_prior_order column.
Also there are 6% of unique users (206209) compared to total data in orders.csv
It can be seen that for every user’s 1st order ( order_number = 1) the days_since_prior_order is Nan, which makes sense. We can impute 0 here.
Merge Data
We will merge all order_products__prior.csv, order_products__train.csv, products.csv, orders.csv ( only train and prior set), aisle.csv and departments.csv for perform EDA on.
Univariate Analysis
Q : What is distribution of Target variable ?
- Distribution is similar in both Prior and Train Set
- Around 60% of time product has been reordered
Q : How many users are there ?
- Every User in Test set has prior orders in order_products_prior.csv, similarly for every user in Train has order history in order_products_prior.csv.
- So , we can conclude the train-test data is split on users
Q : How many orders were placed by every user ?
- For every user we have around 4–100 order details ( including train and test)
- Histogram on left shows that there are very few users who have placed orders more than 60 , and maximum order for any user is 100.
Q: How many Orders with no reordered products ?
- 12 % of orders have no reordered items, while rest ~88 % of orders contains reordered items
Q: Which are the most frequently ordered / reordered products ?
- There are total 49685 products which were ordered.
- It can be seen that most of the products which are ordered are organic foods / fresh fruits (especially Bananas)
- Bananas have highest order rate of 0.14.
- Top 5 frequently ordered products are organic in nature (This could be an important feature)
Q : How frequently products are ordered and reordered from each aisles ?
- There are total 134 different aisles
- As we can see, most products are ordered from Fresh Fruits and Fresh Vegetables aisles.
- Other frequently ordered items are from Yogurt , Packaged Vegetables and packaged cheese aisles.
- Least frequently ordered items are from Air fresheners, Baby accessories, Baby bath body care etc. aisles
- Milk, sparkling water, fruits, eggs, yogurt are most common aisles the product is reordered from, as they are items which are daily consumed, and one rarely switches from their usual meal plan. Also these are the products that lasts only few days , thus high reorder rate.
- On the other hand hair care, skin care, kitchen supplies are the one which lasts longer than other, hence low reorder rate.
Q : How frequently products are ordered and reordered from each departments ?
- There are total 21 departments
- As seen from departments analysis , most ordered products are from produce department which have fresh vegetables, fruits, herbs etc. But most reordered product department is dairy eggs having yogurt, milk, eggs, cheese etc.
- We see high reorder rate in organic foods and daily consumed items.
- Low reorder rate in personal care departments
Q : What is the cart size on different orders ?
- We have a right skewed distribution of maximum cart size for every order
- There are 237225 orders with cart size = 5, also, mode = 5.
- There are very few order with cart size > 40 and all the way up to 145.
Q : How many products were ordered and reordered on a particular day of week ?
- Assuming that the week starts from Sunday, most shopping is done on Sundays and Mondays. Also least orders were placed on Thursday. People tend to restock there supplies on Sundays.
- Reorders w.r.t to days of week is proportionally same as all orders.
Q : How many products were ordered and reordered on a particular hour of day ?
- Most orders are placed from Early morning to midnight, and very few orders placed from midnight to early morning.
Q : After how many days user ordered / reordered a product ?
- Most people restock after a week or a month. It seems, some people prefer buy a week / month supplies at once.
- People who are buying at Day 0, are probably new customers, but we can see a small rate of reorder implying that users tend to place multiple orders on Day 0 too.
- Probably here 30 days represents the upper limit, and not necessarily any particular month.
- There is a continuous spike in orders from day 1 to day 6, shows that some people are frequent buyers with short window of restocking.
Q : Which products are frequently brought by weekly buyers and monthly buyers ?
- Weekly/monthly buyers tend to buy similar products
- These products, in general, have highest reorder rate irrespective of day of purchase
Bivariate Analysis
Q : How day of week and hour of day impact product order / reorder ?
- first plot describes reorder rate of every day w.r.t to orders placed at that hour
- second plot describes reorder rate of every day w.r.t to orders placed on that day
- from first plot we can see that of all orders that were placed on any hour, most reorders were placed on day 1 ( probably Monday) from 5 AM — 9 AM.
- Same pattern can be seen on any day between 5 AM — 9 AM.
- from second plot we can see that of all orders that were placed on any day, most reorders were placed from 8 AM — 4 PM, on any given day
Other Insights
Q : Which products were never ordered / reordered?
These 3 products (Protein Granola Apple Crisp, Unpeeled Apricot Halves in Heavy Syrup , Single Barrel Kentucky Straight Bourbon Whiskey) which were never ordered. May be their alternatives were ordered.
- Of all 49685 products 4082 products were never reordered.
- Products which were never reordered, includes very specific items such as “.5” Waterproof Tape”, “007 Vodka With Martini Glasses”, ‘Swingtop’ Premium Lager”, we can assume that either consumer didn’t like them or he switched his preferences. Looking at household items like “Zipper Quart Size Freezer Bags”, “flings! Laundry Detergent Pac’s, Original”, we cant definitely say that preferences were switched, since they are type of products which lasts long and consumer might consider reordering them in future.
Q : Maximum cart size after N days since prior order ?
- As it was expected, users with 30 days gap between consecutive orders have maximum cart size. i.e. they tends to order more products
- Same observation can be seen with 29 days, 11 days, 5 days gap also.
- Users placing order on Day 0, have maximum cart size of around 100, these include both first time orders and multiple orders
- Also, average cart size for orders on any day is around 8.
Q: How much position of product in the cart impact reorder rate?
- We see a normal decrease in reorder probability when the position of product is increased till 71.
- But probability fluctuates rapidly after position is increased from 71.
- The lowest reorder probability is somewhere around 0.2 when position is 109
- Position from 128 to 137 shows continuous 50 % reorder probability.
- We can assume that for a product with position greater than 100 have very low probability of being reordered (below 0.3)
- Since the reorder probability of a product will depend on product position in the cart and product itself, the vague behavior can be assumed to be the result of reshuffled position of a product with high reorder probability.
Q: Are there any users whose order contains only reordered products ?
- User_id 99753 have 99 orders which contains only reordered items
- Followed by User 26489 and 100935
Now that we have completed our EDA , we can move on to next stage
Next Up,
- Part 2: Feature Engineering and Modelling
- Part 3: Deployment
References
- Faron’s implementation of F1- Maximization, kernel.
- Research Paper : Optimizing F-Measures: A Tale of Two Approaches
- Solution by Paulantoine for this kaggle challenge.
- 2nd Place Solution by Kazuki Onodera.
- Kaggle discussion thread by saggie anthony on how to improve the model.
- AppliedAI Course