tutorial: Customers and purchases: predictions to guide business. Part I

By Julien Hernandez Lallement, 2020-01-28, in category Tutorial

marketing, python, sales

In [8]:
import os
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import pandas as pd
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)

display_settings = {
    'max_columns': 10,
    'expand_frame_repr': True,  # Wrap to multiple pages
    'max_rows': 50,
    'precision': 2,
    'show_dimensions': True
}

for op, value in display_settings.items():
    pd.set_option("display.{}".format(op), value)

from IPython.display import clear_output

# Set figure size for easier reading
plt.rcParams['figure.figsize'] = (16,8)

Predictive Purchase Computation in Python

Table of Contents

Background

Typical Data

Heuristic approaches

BetaGeo Model

References

Background:

If you work with transactional data, you have probably faced the challenge of computing meaningful KPIs that quantify how promising a given customer is, or will be, based on past data. One computation that comes in handy when evaluating customers is the predicted purchase probability, i.e., a meaningful value that allows you to assess whether a customer is worth investing in from a marketing perspective.

One might frame this question as follows: given the purchase history of different customers up to the present moment, what will their purchase pattern be in a given future time window?
The drawing below attempts to illustrate the question's framework:

In [1]:
from IPython.display import Image
PATH = "/home/julien/website/content/images/2020.03_sales_prediction/"
Image(filename = PATH + "purchase_history.png", width=800, height=500)
Out[1]:

This post is the first of a series of posts on the topic of predictive purchase.

Here, I use a method discussed in a paper published by Fader and colleagues in 2005 (link at the end of this notebook).
Feel free to browse through the following posts if you are interested in reading about other models (hierarchical Bayesian models) and exploratory analyses of model fits.

Important note
While I used this model in a business project, and it produced useful results that went into production, I am not an expert in financial computations, so take this for what it is: an honest attempt to compute a useful metric! Stay critical and don't take this approach as textbook!

Typical Data:

We are going to use a Python library called lifetimes, written by Cam Davidson-Pilon.

Let's import an example dataset of what transactional data could look like:

In [5]:
from lifetimes.datasets import load_dataset

df = load_dataset(
    'CDNOW_sample.txt',
    header=None,
    names=['customer_id', 'customer_index','transaction_date', 'quantity', 'amount'], 
    delim_whitespace = True,
    converters={'transaction_date': lambda d: pd.to_datetime(d, format='%Y%m%d')}
).drop(columns={'customer_index'})

df.head(2)
Out[5]:
customer_id transaction_date quantity amount
0 4 1997-01-01 2 29.33
1 4 1997-01-18 2 29.73

2 rows × 4 columns

This is what a typical transactional (already clean ;) ) dataset would look like. We have access to the three data features we need:
(i) date of transaction,
(ii) quantity purchased,
(iii) customer ID.
(The amount column holds the monetary value, which I will only discuss in a later post.)

That is all you need to get started.

In order to train the model, you need to boil the data down to a dataframe at the customer level, with three parameters:
1) frequency: number of repeat purchases made by that customer (i.e., purchases after the first one).
2) recency: time between the first and the last purchase, in days, weeks, or any scale meaningful for your business.
3) T: age of the customer (birth = first purchase), on the same scale.

This would be a so-called RFM table (Recency-Frequency-Monetary; note that I will discuss the monetary component in a separate post). So, to be more accurate, an RF table ;)
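
For intuition, here is a minimal, hand-rolled sketch of that aggregation using plain pandas on the CDNOW dataframe df loaded above (the lifetimes utility used in the next cell does this for you, with more options and proper edge-case handling; observation_end and rf_manual are names I introduce here):

In [ ]:
# Hand-rolled RF(M) aggregation, for illustration only
# (uses the CDNOW dataframe `df` loaded above, one row per transaction)
observation_end = df['transaction_date'].max()

per_customer = df.groupby('customer_id')['transaction_date'].agg(['min', 'max', 'nunique'])
rf_manual = pd.DataFrame({
    # frequency: repeat purchases = distinct purchase dates minus the first one
    'frequency': per_customer['nunique'] - 1,
    # recency: days between first and last purchase
    'recency': (per_customer['max'] - per_customer['min']).dt.days,
    # T: customer age = days between first purchase and end of observation
    'T': (observation_end - per_customer['min']).dt.days,
})
rf_manual.head(3)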

In [6]:
from lifetimes.datasets import load_transaction_data
from lifetimes.utils import summary_data_from_transaction_data

# Load transaction dataset
transaction_data = load_transaction_data()
# Transform to RFM table
rf = summary_data_from_transaction_data(transaction_data, 
                                        customer_id_col = 'id', 
                                        datetime_col='date'
                                       )
rf.head(3)
Out[6]:
frequency recency T
id
0 0.0 0.0 298.0
1 0.0 0.0 224.0
2 6.0 142.0 292.0

3 rows × 3 columns

In [7]:
rf.frequency.plot(kind='hist', bins=100)
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f234c217290>

We can see that customer #2 made 6 repeat purchases, that their first purchase happened 292 days ago, and that 142 days elapsed between their first and last purchase.
The drawing below depicts the meaning of each parameter in the RFM table.

In [6]:
PATH = "/home/julien/website/content/images/2020.03_sales_prediction/"
Image(filename = PATH + "recency_frequency.png", width=800, height=500)
Out[6]:

We can have a quick look at these features to see whether the distributions make sense.

In [6]:
pd.plotting.scatter_matrix(rf, hist_kwds={'bins':50});

A few observations:
(i) As expected from the data, the majority of customers purchased very few times (frequency).
(ii) As a result, the recency feature has a major peak at 0, which contains all customers that purchased only once from that business.
(iii) The business had some stability in acquiring new customers, as suggested by the uniformity of the age (T) distribution.
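
To put a number on observation (ii), here is a quick sanity check (a one-liner I add for illustration):

In [ ]:
# Share of customers with no repeat purchase (frequency == 0)
print("{:.0%} of customers purchased only once".format((rf['frequency'] == 0).mean()))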

Heuristic Approach: RFM score

The dataset above is an example of what transactional data might look like.

Now, a very simple approach that is easy to implement would be to calculate a so-called RFM score at the customer level. Each customer's value on each feature is binned according to that feature's quartiles, which yields a score per customer (in our case only an RF score, since we do not have access to monetary value data).

In [7]:
quantiles = rf.quantile(q=[0.25,0.5,0.75])
quantiles = quantiles.to_dict()

segmented_rf = rf.copy()

Let's write some crude functions to compute the Recency and Frequency scores according to a simple heuristic RF model.

In [8]:
def RScore(x, p, d):
    # Recency quartile score: 1 = lowest quartile, 4 = highest
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]:
        return 3
    else:
        return 4

def FMScore(x, p, d):
    # Frequency / Monetary quartile score: 4 = lowest quartile, 1 = highest
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]:
        return 2
    else:
        return 1
In [9]:
# Categorize data points according to quartile
segmented_rf = segmented_rf.assign(r_quartile = segmented_rf['recency']
                                     .apply(RScore, args=('recency',quantiles)))
segmented_rf = segmented_rf.assign(f_quartile = segmented_rf['frequency']
                                     .apply(FMScore, args=('frequency',quantiles)))
# Create RF score per customer (concatenate recency and frequency quartiles)
segmented_rf['RFScore'] = (segmented_rf.r_quartile.map(str) 
                           + segmented_rf.f_quartile.map(str) 
                            )
segmented_rf.head()
Out[9]:
frequency recency T r_quartile f_quartile RFScore
id
0 0.0 0.0 298.0 1 4 14
1 0.0 0.0 224.0 1 4 14
2 6.0 142.0 292.0 4 1 41
3 0.0 0.0 147.0 1 4 14
4 2.0 9.0 183.0 3 2 32

5 rows × 6 columns

We have produced, for each customer, an RF score that already gives the business an idea of which customers might be worth focusing on.
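
For instance, here is a small sketch (the aggregated column names are mine) that summarizes the segments, so the business can see which ones dominate and how they behave:

In [ ]:
# Count customers per RF segment and look at their average behaviour
(segmented_rf
 .groupby('RFScore')
 .agg(n_customers=('frequency', 'size'),
      mean_frequency=('frequency', 'mean'),
      mean_recency=('recency', 'mean'))
 .sort_values('n_customers', ascending=False)
 .head())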

However, while this method has some virtues (for instance interpretability, simplicity, and speed of implementation), one might wonder whether it is wise to assume that past behavior maps onto future behavior that straightforwardly.

Instead, we could use a probabilistic model to predict future purchases based on each customer's past purchase patterns.

BetaGeo Model

Model specification

Pareto/NBD vs BetaGeo/NBD

From a historical perspective, the Pareto/NBD model has been an attractive choice for computing predicted purchases across different business models. However, its mathematical complexity made it hard for businesses to implement for regular updates of predictive purchase indexes. As an alternative, Fader and colleagues proposed the BetaGeo/NBD model, which produces an arguably comparable fit to the Pareto/NBD but is simpler to implement.

We will first focus on the BG/NBD model and test its implementation on a public dataset. In a follow-up post (An Approach to Computing Predicted Purchase - Part 3/3), I will use the Pareto/NBD model on the same dataset, and explore it from another perspective using hierarchical Bayesian modeling to predict purchases.

In [10]:
from lifetimes import BetaGeoFitter
bgf = BetaGeoFitter(penalizer_coef=0.0)

Below, I write down the assumptions about the parameter distributions of the BetaGeo model, following the original research paper.

Assumptions of the BG/NBD model

(i) While active, the number of transactions made by a customer follows a Poisson process with transaction rate $\lambda$.
This is equivalent to assuming that the time between transactions is exponentially distributed with transaction rate $\lambda$

\begin{equation*} f(t_j \mid t_{j-1}; \lambda) = \lambda e^{-\lambda (t_j - t_{j-1})}, \qquad t_j > t_{j-1} \geq 0 \end{equation*}

(ii) Heterogeneity in $\lambda$ follows a gamma distribution

\begin{equation*} f(\lambda \mid r, \alpha) = \frac{\alpha^r \lambda^{r-1} e^{-\lambda \alpha}}{\Gamma(r)}, \qquad \lambda > 0 \end{equation*}

(iii) After any transaction, a customer becomes inactive with probability $p$. Therefore the point at which the customer “drops out” is distributed across transactions according to a (shifted) geometric distribution

\begin{equation*} P(\text{inactive immediately after the } j\text{th transaction}) = p(1-p)^{j-1}, \qquad j = 1, 2, 3, \ldots \end{equation*}

(iv) Heterogeneity in $p$ follows a beta distribution

\begin{equation*} f(p \mid a, b) = \frac{p^{a-1} (1-p)^{b-1}}{B(a, b)}, \qquad 0 \leq p \leq 1 \end{equation*}

(v) The transaction rate $\lambda$ and the dropout probability $p$ vary independently across customers.
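
To build intuition for these assumptions, here is a minimal simulation sketch of the generative process they describe (the parameter values and the simulate_customer function are made up for illustration; this is not part of the lifetimes API):

In [ ]:
# Simulate customers under the BG/NBD assumptions (illustration only)
rng = np.random.default_rng(42)

r, alpha = 0.2, 2.0   # gamma heterogeneity in lambda (assumption ii)
a, b = 2.0, 3.5       # beta heterogeneity in p (assumption iv)
T_obs = 300           # observation window (e.g. in days)

def simulate_customer():
    lam = rng.gamma(shape=r, scale=1.0 / alpha)  # individual transaction rate (assumption ii)
    p = rng.beta(a, b)                           # individual dropout probability (assumption iv)
    t, n_repeat, last = 0.0, 0, 0.0
    while True:
        t += rng.exponential(1.0 / lam)          # exponential inter-purchase times (assumption i)
        if t > T_obs:
            break
        n_repeat, last = n_repeat + 1, t
        if rng.random() < p:                     # dropout after a transaction (assumption iii)
            break
    return n_repeat, last                        # (frequency, recency) for this simulated customer

print([simulate_customer() for _ in range(5)])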

Let's check whether the model's predictive performance is acceptable. To do so, we will train the model on a training set and ask it to predict data from a test set (AKA validation / holdout period).

You could do the split yourself using scikit-learn or a custom function, but Lifetimes provides a utility function that splits the original transactional data (which is often the format available at companies).

Model validation

In [11]:
from lifetimes.utils import calibration_and_holdout_data
transaction_data = load_transaction_data()

calibration_period_end='2014-09-01'
observation_period_end='2014-12-30'

summary_train_test = calibration_and_holdout_data(transaction_data, 'id', 'date',
                                                   calibration_period_end=calibration_period_end,
                                                   observation_period_end=observation_period_end)
summary_train_test.head(2)
Out[11]:
frequency_cal recency_cal T_cal frequency_holdout duration_holdout
id
0 0.0 0.0 177.0 0.0 120
1 0.0 0.0 103.0 0.0 120

2 rows × 5 columns

From the output above, we see that the function 1) splits the data into two periods and 2) produces an RF table for each of them.
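
For intuition, here is roughly what that utility does, as a simplified hand-rolled sketch (it ignores details the library handles, e.g. it counts raw holdout transactions rather than purchase periods; rf_cal and holdout_freq are names I introduce here):

In [ ]:
# Simplified, hand-rolled version of the calibration / holdout split (illustration only)
dates = pd.to_datetime(transaction_data['date'])

cal_txn = transaction_data[dates <= calibration_period_end]
holdout_txn = transaction_data[(dates > calibration_period_end)
                               & (dates <= observation_period_end)]

# RF table restricted to the calibration period
rf_cal = summary_data_from_transaction_data(cal_txn, 'id', 'date',
                                            observation_period_end=calibration_period_end)

# Holdout purchases per customer, aligned on the calibration customers
holdout_freq = holdout_txn.groupby('id').size().reindex(rf_cal.index, fill_value=0)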

In [18]:
# Fit the model on the calibration (training) period
In [19]:
bgf.fit(summary_train_test['frequency_cal'], 
        summary_train_test['recency_cal'], 
        summary_train_test['T_cal']
       )
Out[19]:
<lifetimes.BetaGeoFitter: fitted with 5000 subjects, a: 1.96, alpha: 2.26, b: 3.48, r: 0.19>

We now run some standard diagnostics: we plot the average number of transactions in the holdout (test) period as a function of the number of purchases in the calibration (training) period, for observed data ("frequency_holdout") and for the model predictions ("model predictions").

In [21]:
from lifetimes.plotting import plot_calibration_purchases_vs_holdout_purchases
plot_calibration_purchases_vs_holdout_purchases(bgf, summary_train_test)

We can see that the model is quite good at picking up the purchase dynamics.

Most customers make a low number of purchases, which is probably why we get a better fit at the lower bins than at the higher ones.

For comparison, we could plot a model-free estimate of future purchases (or rather, model-based, the model being a simple heuristic that scales historical purchases by the ratio of the holdout and calibration durations).

See the plot below: with that naive estimate we would be overestimating purchases, which would impact our business in the future.

In [23]:
plot_calibration_purchases_vs_holdout_purchases(bgf, summary_train_test)

# Average customer age (T_cal) in the training (calibration) period
age_train = np.mean(summary_train_test.T_cal)
# Length of the test (holdout) period: the gap between the two cut-off dates
age_test = (pd.to_datetime(observation_period_end)
            - pd.to_datetime(calibration_period_end)).days

# Naive "no model" baseline: scale calibration purchases by the duration ratio
xlim = [0, 7]
x_axis_no_model = np.linspace(xlim[0], xlim[1], 100)
no_model_line = (age_test / age_train) * x_axis_no_model

plt.plot(x_axis_no_model, no_model_line, linestyle='--', color='black', label='no model')
plt.legend()
Out[23]:
[<matplotlib.lines.Line2D at 0x7fecec822b90>]

Now that the model has been validated on a holdout set, let's train it on the whole dataset.

Model prediction

In [186]:
from lifetimes import BetaGeoFitter
bgf = BetaGeoFitter(penalizer_coef=0.0)
# Refit on the full RF table computed earlier (not just the calibration period)
bgf.fit(rf['frequency'], rf['recency'], rf['T'])
print(bgf)
<lifetimes.BetaGeoFitter: fitted with 5000 subjects, a: 1.85, alpha: 1.86, b: 3.18, r: 0.16>
In [187]:
# Model summary
bgf.summary
Out[187]:
coef se(coef) lower 95% bound upper 95% bound
r 0.16 3.99e-03 0.16 0.17
alpha 1.86 9.54e-02 1.68 2.05
a 1.85 1.30e-01 1.60 2.10
b 3.18 2.94e-01 2.61 3.76

4 rows × 4 columns

Lifetimes provides some nice matrix plots that allow you to visualize the probability of customers being alive given the fitted parameters, as well as other features of your dataset. I won't use them here, but I recommend exploring these plots; they can add value when explaining the process to stakeholders.
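
If you want to try them, something along these lines should work (not run here):

In [ ]:
from lifetimes.plotting import plot_frequency_recency_matrix, plot_probability_alive_matrix

# Expected number of future purchases as a function of a customer's recency and frequency
plot_frequency_recency_matrix(bgf)
plt.show()

# Probability that a customer with a given (recency, frequency) is still "alive"
plot_probability_alive_matrix(bgf)
plt.show()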

Now that our model has been validated and fitted on the whole dataset, we can go on and compute the predicted number of purchases for each customer in the dataset, for a given time window t:

In [190]:
# Predicted purchases are calculated for a given time window t
# (t is expressed in the same units as the RF table, i.e. days here, since we kept the default daily frequency)
t = 12
rf['pred_purch_coming_week=' + str(t)] = (bgf.conditional_expected_number_of_purchases_up_to_time
                                               (
                                                t, 
                                                rf['frequency'], 
                                                rf['recency'], 
                                                rf['T']
                                               )
                                              )
In [195]:
# Plot the distribution of predicted purchases
rf['pred_purch_coming_week=' + str(t)].plot(kind='hist', bins=50)
plt.xlabel('predicted number of purchases over the coming t=' + str(t) + ' periods')
plt.title('customer count per predicted number of purchases');

We obtain a distribution of predicted purchases for each customer over the t = 12 periods following the end of the data collection (t being expressed in the units of the RF table, here days).
We can see that, since most customers made few purchases, there is a strong peak at 0, with a right tail regrouping the customers that we can expect to bring value to the business.
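
To make this actionable, one option is to simply rank customers by their predicted purchases and hand the top of the list to marketing (a small sketch using the column created above):

In [ ]:
# Top 10 customers by predicted number of purchases over the next t periods
rf.sort_values('pred_purch_coming_week=' + str(t), ascending=False).head(10)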



In the following post, I will explore the robustness of this model by playing with the data shape.

References

Original research paper from Fader et al. 2005: http://brucehardie.com/papers/018/fader_et_al_mksc_05.pdf
Lifetimes library documentation: https://lifetimes.readthedocs.io/en/latest/
GitHub repo with some of the code presented in this post: https://github.com/juls-dotcom/purchase_prediction