import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
pd.set_option('display.max_columns', 500)
from IPython.display import HTML
from IPython.display import display_html
from zipfile import ZipFile
HTML('''
<style>
.output_png {
display: table-cell;
text-align: center;
vertical-align: middle;
}
</style>
<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>
''')
One of the common challenges companies face is optimizing their inventory. Although a number of inventory management techniques and methods already exist, they do not incorporate the relationships between products. This study implements association pattern mining to discover relationships between pairs of products purchased together, which can further enhance inventory management. A publicly available dataset from the Ta-Feng grocery store, containing 817,741 transactions from November 2000 to February 2001, was consolidated, cleaned, pre-processed, examined, and analyzed using frequent itemset mining and association pattern mining. A total of 560 frequent itemsets were mined from the shopping transactions using the ECLAT algorithm, and 22 association rules were generated. The results uncovered related products that customers frequently buy together. Based on these results, frequent itemsets must never be out of stock; otherwise, the company loses the opportunity to maximize profit.
Inventory management is one of the most common problems every company, small or large, deals with. If a company does not maintain enough inventory to fill customer orders, it loses the opportunity to maximize profits. On the other hand, if it stocks too many products, storage and maintenance costs increase, eroding margins. This problem becomes even more challenging for grocery stores that carry thousands, if not millions, of SKUs. Although many techniques are used in inventory control, they do not incorporate the relationships between products.
Ta-Feng is a Chinese membership retailer warehouse that sells a wide variety of goods [1]. The TaFeng dataset is a publicly available grocery dataset, shared via Stack Overflow [4], that contains 817,741 shopping records from November 2000 to February 2001. Each row in the dataset represents an item purchased on a given day by a particular customer, so all items purchased by the same customer on the same day can be treated as one set/basket. At that time, the supermarket carried 23,812 SKUs, and 32,266 customers transacted at the store.
For this study, we explore association rules to understand customer behavior and uncover relationships between pairs of products. Specifically, this study answers the question: Which goods are frequently sold together within the same transaction? By identifying frequent itemsets bought by customers, we can use this information to stock up effectively and adjust those products' inventory requirements. The most frequent products must never be out of stock; otherwise, the company loses the opportunity to maximize sales, attract new customers, and earn the maximum profit.
This study primarily focuses on market basket analysis of TaFeng supermarket transactions to uncover patterns and relationships between products. A total of 817,741 records covering 23,812 products over a span of four months were analyzed. The general workflow for mining the dataset involves the steps described in the following sections.
The data was downloaded from Stack Overflow [4] and stored in Jojie for easier retrieval. The first line of each file contains the column definitions in Chinese characters; the rest of the file contains the data separated by ";". The zipped folder contains these files:
- D11: Transaction data collected in November 2000
- D12: Transaction data collected in December 2000
- D01: Transaction data collected in January 2001
- D02: Transaction data collected in February 2001
zip_file = ZipFile('/mnt/data/public/tafeng/D11-02.ZIP')
# read each monthly file and concatenate into a single dataframe
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
frames = [pd.read_csv(zip_file.open(text_file.filename),
                      encoding="ISO-8859-1", sep=';')
          for text_file in zip_file.infolist()]
df1 = pd.concat(frames, ignore_index=True)
# df1.head()
The datasets were combined into one dataframe, and columns were renamed for readability. Additional columns used in the analysis were derived from the input fields. A total of 817,741 rows were collected, each representing an item purchased by a particular customer on a given day. For example, if a customer buys three items on one shopping trip, there will be three rows with the same customer number and date, one per product ID. The information in each row is detailed in Table 1, while sample cleaned data is displayed in Table 2.
| Data Field | Description |
|---|---|
| Transaction Date | Date when transaction was made |
| Customer ID | unique ID of the customer |
| Age Group | Age group the customer belongs to. A <25,B 25-29,C 30-34,D 35-39,E 40-44,F 45-49,G 50-54,H 55-59,I 60-64,J >65 |
| Residence Area | Customer's residence area. A-F correspond to zip code areas 105, 106, 110, 114, 115, 221; G: others; H: unknown. Distance to the store, from closest to farthest: 115, 221, 114, 105, 106, 110 |
| Product Subclass | Product subclass of the bought item (e.g., office supplies, personal items) |
| Product ID | Product ID of the sold item |
| Quantity | Number of units sold |
| Cost | Total cost of the goods sold |
| Sales Price | Total revenue from the goods sold |
| Profit* | Total sales minus the total cost of the product/s |
| Month* | Month of the transaction date |
| Day* | Day of the week of the transaction date |
| Week* | Week number of the transaction date |
*fields derived from other fields
def clean_data(df):
    """Return clean dataframe with new derived columns"""
    # update column names of the data
    cols = ["Transaction Date", "CustomerID", "Age Group", "Residence Area",
            "Product Subclass", "Product ID", "Quantity", "Cost", "Sales Price"]
    df.columns = cols
    # parse dates once, then derive the date-based columns from the result
    df['Transaction Date'] = pd.to_datetime(df['Transaction Date'])
    # new column for profit
    df['Profit'] = df['Sales Price'] - df['Cost']
    # new column for month
    df['Month'] = df['Transaction Date'].dt.month
    # new column for week of the year
    df['Week'] = df['Transaction Date'].dt.isocalendar().week
    # new column for day of the week
    dmap = {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri', 5: 'Sat', 6: 'Sun'}
    df['Day'] = df['Transaction Date'].dt.dayofweek.map(dmap)
    return df
df = clean_data(df1)
df.head()
Before analyzing the relationships between SKUs available at the TaFeng grocery store, a preliminary analysis of the data was done. The store carries 23,812 SKUs across 2,012 product subclasses. In the span of four months, a total of 119,578 shopping trips were completed by 32,266 customers, adding up to a total sales revenue of 107,840,076 yuan. Daily sales and profit are shown in Figure 1. There is evidence of weekly periodicity in the plot, as most shopping trips happen over the weekend, especially on Sundays. It can also be observed that the highest single-day sales occurred on January 21, 2001, around the 2001 Chinese New Year.
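The headline figures above come from simple `groupby` aggregations on the cleaned frame. A minimal sketch on a tiny synthetic frame with the same column names (the values are illustrative, not TaFeng's):

```python
import pandas as pd

# Tiny synthetic frame mirroring the cleaned TaFeng columns (illustrative only)
df = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 3],
    'Transaction Date': pd.to_datetime(
        ['2000-11-01', '2000-11-01', '2000-11-01', '2000-11-05', '2000-11-05']),
    'Product ID': [10, 11, 10, 12, 11],
    'Product Subclass': [100, 110, 100, 120, 110],
    'Sales Price': [30, 20, 30, 50, 20],
})

n_skus = df['Product ID'].nunique()              # distinct products
n_subclasses = df['Product Subclass'].nunique()  # distinct subclasses
n_customers = df['CustomerID'].nunique()         # distinct customers
# a "shopping trip" = one (customer, date) pair, matching the basket definition
n_trips = df.groupby(['CustomerID', 'Transaction Date']).ngroups
total_sales = df['Sales Price'].sum()
print(n_skus, n_subclasses, n_customers, n_trips, total_sales)  # 3 3 3 4 150
```

Running the same aggregations on the full dataset yields the SKU, subclass, trip, customer, and revenue counts reported above.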
fig, (ax1,ax2) = plt.subplots(1,2, gridspec_kw={'width_ratios': [3.5, 1.5]})
# daily profit and sales
(df.groupby('Transaction Date')['Sales Price'].sum()
.plot(figsize=(15,5), color='g', ax=ax1))
(df.groupby('Transaction Date')['Profit'].sum()
.plot(figsize=(15,5), color='r', ax=ax1))
ax1.legend()
ax1.set_ylabel("Sales/Profit in yuan")
ax1.set_title('Ta Feng Daily Total Sales vs Profit Margin')
# daily number of items sold heatmap
x = df.groupby(['Week','Day']).size().unstack().fillna(0)
x = x.reindex([44, 45, 46, 47, 48, 49, 50, 51, 52, 1, 2, 3, 4, 5, 6, 7, 8, 9])
cols = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
x = x[cols]
sns.heatmap(x, ax=ax2)
ax2.set_title('Number of items sold per day');
Around 10% of customers visited only once in the span of four months, as demonstrated in Figure 2, in contrast to the average of about 3.7 shopping trips per customer. Moreover, in the majority of shopping trips, only a few items are bought per visit; merely a small fraction of transactions involves purchasing a large number of items. The histograms also suggest that there are outliers in the data. Removing the outliers gives a clearer view of the distributions of basket size and unique items per basket: the average basket contained 9.45 items and 6.83 unique products. Mean values are represented by the red triangles in Figure 3.
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(15,5))
# customer visits
df_customer = df.groupby(['CustomerID'])
customer_visits = df_customer['Transaction Date'].nunique()
sns.histplot(x=customer_visits.values, binwidth = 1, ax=ax1);
ax1.set_xlabel("Number of customer visits")
ax1.set_ylabel("Frequency")
ax1.set_title("Frequency of Customer Visits");
# basket size per shopping trip
df_trans = df.groupby(['CustomerID', 'Transaction Date'])
items_per_visit = df_trans['Quantity'].sum()
sns.histplot(x=items_per_visit.values, binwidth = 10, ax=ax2)
ax2.set_xlabel("Basket size")
ax2.set_ylabel("Frequency")
ax2.set_title("Quantity of items bought per visit");
x1 = df.groupby(['CustomerID','Transaction Date'])['Quantity'].sum().values
x2 = df.groupby(['CustomerID','Transaction Date'])['Quantity'].size().values
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,4))
(sns.boxplot(data=x1,orient='h', color='#98FB98', showfliers=False,
meanprops={"markerfacecolor":"red"},
showmeans=True, ax=ax1))
(sns.boxplot(data=x2,orient='h', showfliers=False, showmeans=True,
meanprops={"markerfacecolor":"red"},
color='#77dd77', ax=ax2))
ax1.set_title("Distribution of basket size")
ax2.set_title("Distribution of unique products per basket");
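The basket statistics reported above (mean basket size and mean unique products per trip) follow directly from the same (customer, date) grouping; a minimal sketch on synthetic data (values illustrative, not TaFeng's):

```python
import pandas as pd

# Synthetic transactions: two trips, one with 3 rows and one with 2
df = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2, 2],
    'Transaction Date': pd.to_datetime(['2000-11-01'] * 3 + ['2000-11-02'] * 2),
    'Product ID': [10, 11, 10, 12, 13],
    'Quantity': [2, 1, 3, 1, 1],
})

baskets = df.groupby(['CustomerID', 'Transaction Date'])
basket_size = baskets['Quantity'].sum()          # total units per trip
unique_items = baskets['Product ID'].nunique()   # distinct products per trip
print(basket_size.mean(), unique_items.mean())   # 4.0 2.0
```

On the real data these two means evaluate to the 9.45 items and 6.83 unique products cited above.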
Digging deeper and plotting the top products based on sales (Figure 4), profit (Figure 5), and popularity (Figure 6) across the four months covered, we noticed that the top products change every month. It can also be noted that product ID 4710114128038 is consistently present in all three kinds of charts.
fig, axes = plt.subplots(2,2, figsize=(22,12))
g = df.groupby('Month')
color=['#f39385','#90b7b6','#ebbe20','#c2756a']
m={1: 'January', 2: 'February', 11:'November', 12:'December'}
for i, ax, c in zip([11, 12, 1, 2], axes.flatten(), color):
    (g.get_group(i).groupby('Product ID', sort=True)['Sales Price']
     .sum().sort_values(ascending=False)[:10].sort_values()
     .plot(kind='barh', ax=ax, color=c))
    ax.set_title(m[i])
    ax.set_xlabel("Total Sales")
plt.suptitle("Monthly Top 10 Products with Highest Sales",fontsize=20);
fig, axes = plt.subplots(2,2, figsize=(22,12))
g= df.groupby('Month')
color=['#f39385','#90b7b6','#ebbe20','#c2756a']
m={1: 'January', 2: 'February', 11:'November', 12:'December'}
for i, ax, c in zip([11, 12, 1, 2], axes.flatten(), color):
    (g.get_group(i).groupby('Product ID', sort=True)['Profit']
     .sum().sort_values(ascending=False)[:10].sort_values()
     .plot(kind='barh', ax=ax, color=c))
    ax.set_title(m[i])
    ax.set_xlabel("Total Profit")
plt.suptitle("Monthly Top 10 Most Profitable Products",fontsize=20);
color=['#e1f7d5','#ffbdbd','#c9c9ff','#f1cbff']
fig, axes = plt.subplots(2,2, figsize=(22,12))
for i, ax, c in zip([11, 12, 1, 2], axes.flatten(), color):
    (g.get_group(i).groupby('Product ID', sort=True)['Quantity']
     .sum().sort_values(ascending=False)[:10].sort_values()
     .plot(kind='barh', ax=ax, color=c))
    ax.set_title(m[i])
    ax.set_xlabel("Number of Units Sold")
plt.suptitle("Monthly Top 10 Most Popular Products",fontsize=20);
Market basket analysis (MBA) is a technique that uncovers relationships between pairs of products purchased together. In this study, MBA was done using association pattern mining: we first find the frequent itemsets, then generate association rules between them. Equivalence Class Clustering and bottom-up Lattice Traversal (ECLAT) is one of the algorithms that can be used to identify frequent itemsets. Apriori is not a good choice here since the maximum basket size is large (112 items). ECLAT uses a depth-first search to find frequent itemsets given a minimum occurrence threshold. Once the itemsets were generated, association rules were mined.
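ECLAT's key idea is a vertical data layout: each item maps to the set of transaction ids (a "tidset") containing it, and the support of any itemset is the size of the intersection of its items' tidsets. A minimal pure-Python sketch on toy baskets (not the pyfim implementation used below, just the idea):

```python
# Toy baskets; the position in the list serves as the transaction id (tid)
baskets = [{'a', 'b'}, {'a', 'b', 'c'}, {'a', 'c'}, {'b', 'c'}]
min_count = 2  # absolute minimum-support threshold

# Vertical layout: item -> set of tids of the transactions containing it
tidsets = {}
for tid, basket in enumerate(baskets):
    for item in basket:
        tidsets.setdefault(item, set()).add(tid)

def eclat(prefix, items, out):
    """Depth-first extension of `prefix`; tidset intersections give support."""
    for i, (item, tids) in enumerate(items):
        if len(tids) >= min_count:
            out[tuple(sorted(prefix + [item]))] = len(tids)
            # extend with the remaining items; intersect to get new tidsets
            suffix = [(it, tids & t) for it, t in items[i + 1:]]
            eclat(prefix + [item], suffix, out)

frequent = {}
eclat([], sorted(tidsets.items()), frequent)
print(frequent)  # e.g. ('a', 'b') has support count 2
```

Counting support by intersecting tidsets avoids the repeated full-database scans that make Apriori expensive on long transactions.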
Since the TaFeng supermarket offers a wide array of products (around 23,812), the minimum support threshold is set to 0.05% for this study. This means an item is considered frequent if it is present in at least 0.05% of the total baskets, i.e., in at least 59 of the 119,578 baskets. Since we are only interested in pairs of products frequently bought together, so we can properly adjust inventory levels, we set 2 as the minimum number of items per rule. Table 3 summarizes the metrics and default values used.
| Metric | Value | Description |
|---|---|---|
| Support | 0.05 | Percentage of baskets that contain the itemset relevant to an association |
| Confidence | 0.8 | How likely it is that purchasing product A results in a purchase of product B |
| Lift | - | How much the chance of product B being purchased increases given that product A is purchased |
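The three metrics in Table 3 can be computed by hand; a worked example on hypothetical baskets (not the TaFeng data) for a rule A → B:

```python
# Toy baskets for illustrating the three rule metrics
baskets = [{'A', 'B'}, {'A', 'B', 'C'}, {'A'}, {'B'}, {'A', 'B'}]
n = len(baskets)

def support(itemset):
    """Fraction of baskets containing every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / n

supp_ab = support({'A', 'B'})        # support of rule A -> B: 3/5 = 0.6
conf = supp_ab / support({'A'})      # P(B | A) = 0.6 / 0.8 = 0.75
lift = conf / support({'B'})         # confidence vs. baseline P(B): 0.9375
print(supp_ab, conf, lift)

# For reference, a 0.05 (percent) support threshold over 119,578 baskets
# corresponds to about 119578 * 0.05 / 100 ~= 59.8 baskets.
```

A lift above 1 means A and B co-occur more often than expected under independence; here lift is below 1, so A and B would actually be slightly negatively associated.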
Implementing ECLAT on our dataset resulted in 560 frequent itemsets (481 2-itemsets, 61 3-itemsets, 15 4-itemsets, and 3 5-itemsets). The algorithm produced output with very low support values; the maximum support is only 0.585%. The top 15 most frequent itemsets are displayed in Figure 7 with their corresponding support values. Based on this, we can say that in 0.585% of the total transactions, a purchase of product 4710011401135 is accompanied by a purchase of product 4710011401128 at least 80% of the time. It can also be noted that product IDs 4710011401135 and 4710011401128 occur frequently and are paired with other products as well.
import fim
db = (df.groupby(['CustomerID','Transaction Date'])['Product ID']
.apply(lambda x: set(x)).reset_index(drop=True))
freq_items = fim.eclat(db, zmin=2, supp=0.05, conf=0.8, report='S')
#plot frequent itemsets
cols=['itemsets','support']
frequent_itemsets = pd.DataFrame(data=freq_items, columns=cols)
f = frequent_itemsets.sort_values(by='support', ascending=False)[:15]
plt.subplots(figsize=(8,5))
g = sns.barplot(data=f, x='support', y='itemsets')
g.set_title("Top 15 Frequent Itemsets");
On the other hand, only 22 association rules are generated using the same thresholds, with support values up to 0.293%; these are tabulated in Table 4.
The top rule has very low support but very high confidence and lift. It implies that products 4719090790000 and 4719090790017 are purchased together with 80% confidence in 0.293% of transactions. Figure 8 depicts the support and confidence distribution of every association rule.
res = fim.arules(db, supp = 0.05, report = 'SCL', zmin=2)
cols=['Antecedent', 'Consequent', 'Support', 'Confidence', 'Lift']
ar=pd.DataFrame(res, columns=cols).sort_values('Support', ascending=False)
ar
#plot
plt.figure(figsize=(10, 6))
plt.scatter(ar['Support'], ar['Confidence'], c=ar['Lift']);
plt.title('Association rules confidence vs support color mapped by lift');
plt.xlabel('Support')
plt.ylabel('Confidence')
plt.colorbar();
A strong rule is defined by high support and confidence values, but this dataset yielded very low support values, which means the rules generated cannot be classified as strong. The very low support values inflated the lift, which reached up to 70,000%, but we can still use the generated association rules to make recommendations for balancing and adjusting the inventory levels of different products. For example, we identified the co-occurrence of products 4710011401135 and 4710011401128, and from our EDA we noted that product 4710011401128 is among the most popular products, while product 4710011401135 is not. Therefore, we suggest that TaFeng ensure the identified frequent itemsets are never out of stock; otherwise, the company will lose the opportunity to maximize profit.
Aside from the inventory requirement solution, here are other recommendations for TaFeng based on the results of FIM and APM:

- Products 4719090790000 and 4719090790017 can be placed close together so customers can find them more easily.
- Products 4710011401128, 4710011409056, and 4710011401135 were bought together; TaFeng can also package these items together.

For future studies, TaFeng can consider the following:
[1] Kruiff, K. (2019, July 5). Recommendations on grocery shopping: customer or product similarities?. Retrieved October 30, 2020, from https://thesis.eur.nl/pub/50337/Kruiff-de_444677.pdf
[2] Heaton, J. (2017, January). Comparing Dataset Characteristics that Favor the Apriori, Eclat or FP-Growth Frequent Itemset Mining Algorithms. Retrieved October 30, 2020, from https://arxiv.org/pdf/1701.09042.pdf
[3] Kumar, D., et al. (2012, September). Market Basket Analysis for a Supermarket based on Frequent Itemset Mining. Retrieved October 30, 2020, from https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.401.8884&rep=rep1&type=pdf
[4] Download link for Ta Feng Grocery dataset. Stack Overflow. Retrieved October 30, 2020, from https://stackoverflow.com/questions/25014904/download-link-for-ta-feng-grocery-dataset