import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
pd.set_option('display.max_columns', 500)
from IPython.display import HTML
from IPython.display import display_html
from zipfile import ZipFile
HTML('''
<style>
.output_png {
display: table-cell;
text-align: center;
vertical-align: middle;
}
</style>
<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>
''')
One of the common challenges companies face is optimizing their inventory. Although a number of inventory management techniques and methods already exist, they do not incorporate the relationships between products. This study implements association pattern mining to discover relationships between pairs of products purchased together, which can further enhance inventory management. A publicly available dataset from the Ta-Feng grocery store, containing 817,741 transactions from November 2000 to February 2001, was consolidated, cleaned, pre-processed, examined, and analyzed using frequent itemset mining and association pattern mining. A total of 560 frequent itemsets were mined from the shopping transactions using the ECLAT algorithm, and 22 association rules were generated. The results uncovered related products that customers frequently buy together. Based on these results, frequent itemsets must never be out of stock; otherwise, the company loses the opportunity to maximize profit.
Inventory management is one of the most common problems every company, small or large, deals with. If a company does not maintain enough inventory to fill customer orders, it loses the opportunity to maximize profits. On the other hand, if it stocks too many products, storage and maintenance costs increase, eroding margins. This problem becomes even more challenging for grocery stores that carry thousands, if not millions, of SKUs. Although many techniques are used in inventory control, they do not incorporate the relationships between products.
Ta-Feng is a Chinese membership retailer warehouse that sells a wide variety of goods [1]. The TaFeng dataset is a publicly available grocery dataset, shared via Stack Overflow [4], that contains 817,741 shopping records from November 2000 to February 2001. Each row in the dataset represents an item purchased on a given day by a particular customer, so all items purchased by the same customer on the same day can be treated as one set/basket. At that time, the supermarket carried 23,812 SKUs, and 32,266 customers transacted at the store.
For this study, we explore association rules to understand customer behavior and uncover relationships between pairs of products. Specifically, this study answers the question: Which goods are frequently sold together within the same transaction? By identifying frequent itemsets bought by customers, we can use this information to stock up effectively and adjust those products' inventory requirements. The most frequent products must never be out of stock; otherwise, the company loses the opportunity to maximize sales, attract new customers, and earn the maximum profit.
This study primarily focuses on market basket analysis of TaFeng supermarket transactions to uncover patterns and relationships between products. A total of 817,741 records covering 23,812 products over a span of four months were analyzed. The general workflow for mining the dataset involves the steps described in the following sections.
The data was downloaded from Stack Overflow [4] and stored in Jojie for easier retrieval. The first line of each file contains the column definitions in Chinese characters; the rest of the file contains the data separated by ";". The zipped folder contains these files:
- D11: Transaction data collected in November 2000
- D12: Transaction data collected in December 2000
- D01: Transaction data collected in January 2001
- D02: Transaction data collected in February 2001
zip_file = ZipFile('/mnt/data/public/tafeng/D11-02.ZIP')
# read each monthly file and concatenate into a single dataframe
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
frames = [pd.read_csv(zip_file.open(text_file.filename),
                      encoding="ISO-8859-1", sep=';')
          for text_file in zip_file.infolist()]
df1 = pd.concat(frames, ignore_index=True)
# df1.head()
The datasets were combined into one dataframe, and columns were renamed for readability. Additional columns used in the analysis were derived from the input fields. A total of 817,741 rows were collected, each representing an item purchased by a particular customer on a given day. For example, if a customer buys three items on one shopping trip, there will be three rows with the same customer number and date, one per product ID. The information in each row is detailed in Table 1, while sample cleaned data is displayed in Table 2.
| Data Field | Description |
|---|---|
| Transaction Date | Date when transaction was made |
| Customer ID | unique ID of the customer |
| Age Group | Age group the customer belongs to. A <25,B 25-29,C 30-34,D 35-39,E 40-44,F 45-49,G 50-54,H 55-59,I 60-64,J >65 |
| Residence Area | Customer's residence area. A-F correspond to zip code areas 105, 106, 110, 114, 115, 221; G: others; H: unknown. Distance to the store, from closest to farthest: 115, 221, 114, 105, 106, 110 |
| Product Subclass | Product subclass of the bought item (e.g., office supplies, personal items) |
| Product ID | Product ID of the sold item |
| Quantity | Number of units sold |
| Cost | Total cost of the goods sold |
| Sales Price | Total revenue from the goods sold |
| Profit* | Total sales minus the total cost of the product/s |
| Month* | Month of the transaction date |
| Day* | Day of the week of the transaction date |
| Week* | Week number of the transaction date |
*fields derived from other fields
def clean_data(df):
    """Return clean dataframe with new derived columns"""
    # update column names of the data
    cols = ["Transaction Date", "CustomerID", "Age Group", "Residence Area",
            "Product Subclass", "Product ID", "Quantity", "Cost", "Sales Price"]
    df.columns = cols
    # parse dates once, then derive the date-based columns from the result
    df['Transaction Date'] = pd.to_datetime(df['Transaction Date'])
    # new column for profit
    df['Profit'] = df['Sales Price'] - df['Cost']
    # new column for month
    df['Month'] = df['Transaction Date'].dt.month
    # new column for week of the year
    df['Week'] = df['Transaction Date'].dt.isocalendar().week
    # new column for day of the week
    dmap = {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri', 5: 'Sat', 6: 'Sun'}
    df['Day'] = df['Transaction Date'].dt.dayofweek.map(dmap)
    return df
df = clean_data(df1)
df.head()
Before analyzing the relationships between SKUs available at the TaFeng grocery store, a preliminary analysis of the data was done. The store carries 23,812 SKUs across 2,012 product subclasses. In the span of four months, a total of 119,578 shopping trips were completed by 32,266 customers, adding up to a total sales revenue of 107,840,076 yuan. Daily sales and profit are shown in Figure 1. There is evidence of weekly periodicity in the plot, as most shopping trips happen over the weekend, especially on Sundays. It can also be observed that the highest single-day sales occurred on January 21, 2001, around the 2001 Chinese New Year.
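The headline figures above come from simple `groupby` aggregations on the cleaned frame. A minimal sketch on a tiny synthetic frame with the same column names (the values are illustrative, not TaFeng's):

```python
import pandas as pd

# Tiny synthetic frame mirroring the cleaned TaFeng columns (illustrative only)
df = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 3],
    'Transaction Date': pd.to_datetime(
        ['2000-11-01', '2000-11-01', '2000-11-01', '2000-11-05', '2000-11-05']),
    'Product ID': [10, 11, 10, 12, 11],
    'Product Subclass': [100, 110, 100, 120, 110],
    'Sales Price': [30, 20, 30, 50, 20],
})

n_skus = df['Product ID'].nunique()              # distinct products
n_subclasses = df['Product Subclass'].nunique()  # distinct subclasses
n_customers = df['CustomerID'].nunique()         # distinct customers
# a "shopping trip" = one (customer, date) pair, matching the basket definition
n_trips = df.groupby(['CustomerID', 'Transaction Date']).ngroups
total_sales = df['Sales Price'].sum()
print(n_skus, n_subclasses, n_customers, n_trips, total_sales)  # 3 3 3 4 150
```

Running the same aggregations on the full dataset yields the SKU, subclass, trip, customer, and revenue counts reported above.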
fig, (ax1,ax2) = plt.subplots(1,2, gridspec_kw={'width_ratios': [3.5, 1.5]})
# daily profit and sales
(df.groupby('Transaction Date')['Sales Price'].sum()
.plot(figsize=(15,5), color='g', ax=ax1))
(df.groupby('Transaction Date')['Profit'].sum()
.plot(figsize=(15,5), color='r', ax=ax1))
ax1.legend()
ax1.set_ylabel("Sales/Profit in yuan")
ax1.set_title('Ta Feng Daily Total Sales vs Profit Margin')
# daily number of items sold heatmap
x = df.groupby(['Week','Day']).size().unstack().fillna(0)
x = x.reindex([44, 45, 46, 47, 48, 49, 50, 51, 52, 1, 2, 3, 4, 5, 6, 7, 8, 9])
cols = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
x = x[cols]
sns.heatmap(x, ax=ax2)
ax2.set_title('Number of items sold per day');
Around 10% of customers visited only once in the span of four months, as demonstrated in Figure 2, in contrast to the average of about 3.7 shopping trips per customer. Moreover, in the majority of shopping trips, only a few items are bought per visit; merely a small fraction of transactions involves purchasing a large number of items. The histograms also suggest that there are outliers in the data. Removing the outliers gives a clearer view of the distributions of basket size and unique items per basket: the average basket contained 9.45 items and 6.83 unique products. Mean values are represented by the red triangles in Figure 3.
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(15,5))
# customer visits
df_customer = df.groupby(['CustomerID'])
customer_visits = df_customer['Transaction Date'].nunique()
sns.histplot(x=customer_visits.values, binwidth = 1, ax=ax1);
ax1.set_xlabel("Number of customer visits")
ax1.set_ylabel("Frequency")
ax1.set_title("Frequency of Customer Visits");
# basket size per shopping trip
df_trans = df.groupby(['CustomerID', 'Transaction Date'])
items_per_visit = df_trans['Quantity'].sum()
sns.histplot(x=items_per_visit.values, binwidth = 10, ax=ax2)
ax2.set_xlabel("Basket size")
ax2.set_ylabel("Frequency")
ax2.set_title("Quantity of items bought per visit");
x1 = df.groupby(['CustomerID','Transaction Date'])['Quantity'].sum().values
x2 = df.groupby(['CustomerID','Transaction Date'])['Quantity'].size().values
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,4))
(sns.boxplot(data=x1,orient='h', color='#98FB98', showfliers=False,
meanprops={"markerfacecolor":"red"},
showmeans=True, ax=ax1))
(sns.boxplot(data=x2,orient='h', showfliers=False, showmeans=True,
meanprops={"markerfacecolor":"red"},
color='#77dd77', ax=ax2))
ax1.set_title("Distribution of basket size")
ax2.set_title("Distribution of unique products per basket");
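The basket statistics reported above (mean basket size and mean unique products per trip) follow directly from the same (customer, date) grouping; a minimal sketch on synthetic data (values illustrative, not TaFeng's):

```python
import pandas as pd

# Synthetic transactions: two trips, one with 3 rows and one with 2
df = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2, 2],
    'Transaction Date': pd.to_datetime(['2000-11-01'] * 3 + ['2000-11-02'] * 2),
    'Product ID': [10, 11, 10, 12, 13],
    'Quantity': [2, 1, 3, 1, 1],
})

baskets = df.groupby(['CustomerID', 'Transaction Date'])
basket_size = baskets['Quantity'].sum()          # total units per trip
unique_items = baskets['Product ID'].nunique()   # distinct products per trip
print(basket_size.mean(), unique_items.mean())   # 4.0 2.0
```

On the real data these two means evaluate to the 9.45 items and 6.83 unique products cited above.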
Digging deeper and plotting the top products based on sales (Figure 4), profit (Figure 5), and popularity (Figure 6) across the four months covered, we noticed that the top products change every month. It can also be noted that product ID 4710114128038 is consistently present in all three kinds of charts.
fig, axes = plt.subplots(2,2, figsize=(22,12))
g = df.groupby('Month')
color=['#f39385','#90b7b6','#ebbe20','#c2756a']
m={1: 'January', 2: 'February', 11:'November', 12:'December'}
for i, ax, c in zip([11, 12, 1, 2], axes.flatten(), color):
    (g.get_group(i).groupby('Product ID', sort=True)['Sales Price']
     .sum().sort_values(ascending=False)[:10].sort_values()
     .plot(kind='barh', ax=ax, color=c))
    ax.set_title(m[i])
    ax.set_xlabel("Total Sales")
plt.suptitle("Monthly Top 10 Products with Highest Sales",fontsize=20);
fig, axes = plt.subplots(2,2, figsize=(22,12))
g= df.groupby('Month')
color=['#f39385','#90b7b6','#ebbe20','#c2756a']
m={1: 'January', 2: 'February', 11:'November', 12:'December'}
for i, ax, c in zip([11, 12, 1, 2], axes.flatten(), color):
    (g.get_group(i).groupby('Product ID', sort=True)['Profit']
     .sum().sort_values(ascending=False)[:10].sort_values()
     .plot(kind='barh', ax=ax, color=c))
    ax.set_title(m[i])
    ax.set_xlabel("Total Profit")
plt.suptitle("Monthly Top 10 Most Profitable Products",fontsize=20);
color=['#e1f7d5','#ffbdbd','#c9c9ff','#f1cbff']
fig, axes = plt.subplots(2,2, figsize=(22,12))
for i, ax, c in zip([11, 12, 1, 2], axes.flatten(), color):
    (g.get_group(i).groupby('Product ID', sort=True)['Quantity']
     .sum().sort_values(ascending=False)[:10].sort_values()
     .plot(kind='barh', ax=ax, color=c))
    ax.set_title(m[i])
    ax.set_xlabel("Number of Units Sold")
plt.suptitle("Monthly Top 10 Most Popular Products",fontsize=20);
Market basket analysis (MBA) is a technique that uncovers relationships between pairs of products purchased together. In this study, MBA was done using association pattern mining: we first find the frequent itemsets, then generate association rules between them. Equivalence Class Clustering and bottom-up Lattice Traversal (ECLAT) is one of the algorithms that can be used to identify frequent itemsets. Apriori is not a good choice here since the maximum basket size is large (112 items). ECLAT uses a depth-first search to find frequent itemsets given a minimum occurrence threshold. Once the itemsets were generated, association rules were mined.
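ECLAT's key idea is a vertical data layout: each item maps to the set of transaction ids (a "tidset") containing it, and the support of any itemset is the size of the intersection of its items' tidsets. A minimal pure-Python sketch on toy baskets (not the pyfim implementation used below, just the idea):

```python
# Toy baskets; the position in the list serves as the transaction id (tid)
baskets = [{'a', 'b'}, {'a', 'b', 'c'}, {'a', 'c'}, {'b', 'c'}]
min_count = 2  # absolute minimum-support threshold

# Vertical layout: item -> set of tids of the transactions containing it
tidsets = {}
for tid, basket in enumerate(baskets):
    for item in basket:
        tidsets.setdefault(item, set()).add(tid)

def eclat(prefix, items, out):
    """Depth-first extension of `prefix`; tidset intersections give support."""
    for i, (item, tids) in enumerate(items):
        if len(tids) >= min_count:
            out[tuple(sorted(prefix + [item]))] = len(tids)
            # extend with the remaining items; intersect to get new tidsets
            suffix = [(it, tids & t) for it, t in items[i + 1:]]
            eclat(prefix + [item], suffix, out)

frequent = {}
eclat([], sorted(tidsets.items()), frequent)
print(frequent)  # e.g. ('a', 'b') has support count 2
```

Counting support by intersecting tidsets avoids the repeated full-database scans that make Apriori expensive on long transactions.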
Since the TaFeng supermarket offers a wide array of products (around 23,812), the minimum support threshold is set to 0.05% for this study. This means an item is considered frequent if it is present in at least 0.05% of the total baskets, i.e., in at least 59 of the 119,578 baskets. Since we are only interested in pairs of products frequently bought together, so we can properly adjust inventory levels, we set 2 as the minimum number of items per rule. Table 3 summarizes the metrics and default values used.
| Metric | Value | Description |
|---|---|---|
| Support | 0.05 | Percentage of baskets that contain the itemset relevant to an association |
| Confidence | 0.8 | How likely it is that purchasing product A results in a purchase of product B |
| Lift | - | How much the chance of product B being purchased increases given that product A is purchased |
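The three metrics in Table 3 can be computed by hand; a worked example on hypothetical baskets (not the TaFeng data) for a rule A → B:

```python
# Toy baskets for illustrating the three rule metrics
baskets = [{'A', 'B'}, {'A', 'B', 'C'}, {'A'}, {'B'}, {'A', 'B'}]
n = len(baskets)

def support(itemset):
    """Fraction of baskets containing every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / n

supp_ab = support({'A', 'B'})        # support of rule A -> B: 3/5 = 0.6
conf = supp_ab / support({'A'})      # P(B | A) = 0.6 / 0.8 = 0.75
lift = conf / support({'B'})         # confidence vs. baseline P(B): 0.9375
print(supp_ab, conf, lift)

# For reference, a 0.05 (percent) support threshold over 119,578 baskets
# corresponds to about 119578 * 0.05 / 100 ~= 59.8 baskets.
```

A lift above 1 means A and B co-occur more often than expected under independence; here lift is below 1, so A and B would actually be slightly negatively associated.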
Implementing ECLAT on our dataset resulted in 560 frequent itemsets (481 2-itemsets, 61 3-itemsets, 15 4-itemsets, and 3 5-itemsets). The algorithm produced output with very low support values; the maximum support is only 0.585%. The top 15 most frequent itemsets are displayed in Figure 7 with their corresponding support values. Based on this, we can say that in 0.585% of the total transactions, a purchase of product 4710011401135 is accompanied by a purchase of product 4710011401128 at least 80% of the time. It can also be noted that product IDs 4710011401135 and 4710011401128 occur frequently and are paired with other products as well.
import fim
db = (df.groupby(['CustomerID','Transaction Date'])['Product ID']
.apply(lambda x: set(x)).reset_index(drop=True))
freq_items = fim.eclat(db, zmin=2, supp=0.05, conf=0.8, report='S')
#plot frequent itemsets
cols=['itemsets','support']
frequent_itemsets = pd.DataFrame(data=freq_items, columns=cols)
f = frequent_itemsets.sort_values(by='support', ascending=False)[:15]
plt.subplots(figsize=(8,5))
g = sns.barplot(data=f, x='support', y='itemsets')
g.set_title("Top 15 Frequent Itemsets");
On the other hand, only 22 association rules are generated using the same thresholds, with support values up to 0.293%; these are tabulated in Table 4.
The top rule has very low support but very high confidence and lift. It implies that products 4719090790000 and 4719090790017 are purchased together with 80% confidence in 0.293% of transactions. Figure 8 depicts the support and confidence distribution of every association rule.
res = fim.arules(db, supp = 0.05, report = 'SCL', zmin=2)
cols=['Antecedent', 'Consequent', 'Support', 'Confidence', 'Lift']
ar=pd.DataFrame(res, columns=cols).sort_values('Support', ascending=False)
ar
#plot
plt.figure(figsize=(10, 6))
plt.scatter(ar['Support'], ar['Confidence'], c=ar['Lift']);
plt.title('Association rules confidence vs support color mapped by lift');
plt.xlabel('Support')
plt.ylabel('Confidence')
plt.colorbar();
A strong rule is defined by high support and confidence values, but this dataset yielded very low support values, which means the rules generated cannot be classified as strong. The very low support values inflated the lift, which reached up to 70,000%, but we can still use the generated association rules to make recommendations for balancing and adjusting the inventory levels of different products. For example, we identified the co-occurrence of products 4710011401135 and 4710011401128, and from our EDA we noted that product 4710011401128 is among the most popular products, while product 4710011401135 is not. Therefore, we suggest that TaFeng ensure the identified frequent itemsets are never out of stock; otherwise, the company will lose the opportunity to maximize profit.
Aside from the inventory requirement solution, here are other recommendations for TaFeng based on the results of FIM and APM:

- Products 4719090790000 and 4719090790017 can be placed close together so customers can find them more easily.
- Products 4710011401128, 4710011409056, and 4710011401135 were bought together; TaFeng can also package these items together.

For future studies, TaFeng can consider the following:
[1] Kruiff, K. (2019, July 5). Recommendations on grocery shopping: customer or product similarities?. Retrieved October 30, 2020, from https://thesis.eur.nl/pub/50337/Kruiff-de_444677.pdf
[2] Heaton, J. (2017, January). Comparing Dataset Characteristics that Favor the Apriori, Eclat or FP-Growth Frequent Itemset Mining Algorithms. Retrieved October 30, 2020, from https://arxiv.org/pdf/1701.09042.pdf
[3] Kumar, D., et al. (2012, September). Market Basket Analysis for a Supermarket based on Frequent Itemset Mining. Retrieved October 30, 2020, from https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.401.8884&rep=rep1&type=pdf
[4] Download link for Ta Feng Grocery dataset. Stack Overflow. Retrieved October 30, 2020, from https://stackoverflow.com/questions/25014904/download-link-for-ta-feng-grocery-dataset