Tutorial: An Approach to Computing Predicted Purchases - Part 2

By Julien Hernandez Lallement, 2020-04-12, in category Tutorial

marketing, python, sales

In [4]:
import pymc3 as pm
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import lifetimes

Predictive Purchase Computation in Python

Table of Contents

Background

Dataset

Model

Model Check

Background:

This post follows an initial post here on how to use a BetaGeo model to predict customer purchases.

In the previous post, we implemented a BetaGeo model to predict purchases of individual customers using the Python package "lifetimes". Lifetimes takes the traditional analytical approach to model fitting in order to compute purchase predictions and other related quantities useful for businesses. While most of the math behind the models is pre-programmed, the analytical approach can be quite complex.

Probabilistic programming packages such as PyMC (or Stan, a rough equivalent often used from R) enable the estimation of these complex parameters using Markov chain Monte Carlo and give us more flexibility in prior specification. The Bayesian concept of a prior lets you encode in the model what you already know about the way a process (in this case, purchases) works.
In addition, the Markov chains yield the full posterior distributions of the model parameters and hyperparameters. One of the major advantages of this Bayesian approach is that we can express our predictions as probability distributions rather than single estimates, which gives us insight into the level of uncertainty of our predictions.

Here, I will be using a Bayesian version of the Pareto/NBD model, discussed in the same paper by Fader et al. 2005 (not the original Pareto/NBD paper; that model was defined in the 1980s). This post is inspired by a talk given by Ben Van Dyke and Jean-Rene Gauthier at PyData Seattle 2017, and uses some of their material.

Dataset:

Let's first load the dataset provided by the lifetimes package, which contains the (sometimes repeated) purchases of a sample of customers, with features such as the date of the transaction, the quantity purchased and the amount spent in a currency, probably dollars. See here for a source that uses this dataset with the Pareto model.

In [54]:
from lifetimes.datasets import load_dataset

cdnow_transactions = load_dataset(
    'CDNOW_sample.txt', 
    header=None, 
    delim_whitespace=True, 
    names=['customer_id', 'customer_index', 'date', 'quantity', 'amount'],
    converters={'date': lambda x: pd.to_datetime(x, format="%Y%m%d")}
)

cdnow_transactions.head()
Out[54]:
customer_id customer_index date quantity amount
0 4 1 1997-01-01 2 29.33
1 4 1 1997-01-18 2 29.73
2 4 1 1997-08-02 1 14.96
3 4 1 1997-12-12 2 26.48
4 21 2 1997-01-01 3 63.34

Like in the previous post, the Pareto model requires a simple recency/frequency matrix, which can easily be produced thanks to some useful util functions of the lifetimes package:

In [56]:
# lifetimes provides a transaction log -> rfm util function
rf = lifetimes.utils.summary_data_from_transaction_data(
    cdnow_transactions,
    'customer_id',
    'date',
    observation_period_end=pd.to_datetime('1997-09-30'),
    freq='W'
)

rf.head()
Out[56]:
frequency recency T
customer_id
4 2.0 30.0 39.0
18 0.0 0.0 39.0
21 1.0 2.0 39.0
50 0.0 0.0 39.0
60 0.0 0.0 35.0

Model:

From the paper by Fader et al. 2005, I understood that the Pareto/NBD and BetaGeo models differ mainly in that the Pareto/NBD model assumes that customer dropout can occur at any moment in time, while the BetaGeo model assumes that dropout can only occur right after a purchase (assumption number 3 in post 1).

Let's look at the assumptions of the Pareto/NBD model (taken from Ben Van Dyke and Jean-Rene Gauthier).

Purchasing Rate

The number of purchases made in a period $t$ follows a Poisson distribution:

$$ p(x~|~\lambda, \Delta t ) = \frac{(\lambda \Delta t)^x}{x!} e^{-\lambda \Delta t} $$

The purchasing rate parameter $\lambda$ follows a Gamma distribution with shape $r$ and rate $\alpha$:

$$\lambda \sim \Gamma(r, \alpha)$$

A customer with a transaction rate $\lambda$ will make on average $\lambda \times \Delta t$ transactions in a period of time $\Delta t$.
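
To build some intuition for these purchasing-rate assumptions, here is a minimal simulation sketch (the values of r and alpha below are arbitrary and purely for illustration, not fitted values):

import numpy as np

rng = np.random.RandomState(42)

r_sim, alpha_sim = 0.5, 4.0   # arbitrary hyperparameters, for illustration only
delta_t = 39                  # observation window, e.g. 39 weeks

# lambda ~ Gamma(r, alpha); numpy parameterizes the Gamma by shape and scale = 1/rate
lambdas = rng.gamma(shape=r_sim, scale=1.0 / alpha_sim, size=5)
# x ~ Poisson(lambda * delta_t)
purchases = rng.poisson(lambdas * delta_t)

for lam, x_sim in zip(lambdas, purchases):
    print("lambda = %.3f -> expected %.1f purchases, simulated %d" % (lam, lam * delta_t, x_sim))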

Lifetime

At the customer level, the lifetime $\tau$ is distributed according to an exponential distribution

$$ p(\tau~|~\mu) = \mu e^{-\mu \tau } $$

where $\tau > 0$. In other words, each customer has their own lifetime distribution. Note that the expected lifetime is $E[\tau~|~\mu] = \frac{1}{\mu}$.

The value of $\mu$ varies across customers according to another Gamma distribution with shape $s$ and rate $\beta$:

$$\mu \sim \Gamma(s, \beta)$$
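
Similarly, a small sketch of the lifetime process, combined with a finite observation window (again, the values of s and beta are arbitrary, not fitted values):

import numpy as np

rng = np.random.RandomState(0)

s_sim, beta_sim = 0.8, 10.0   # arbitrary hyperparameters, for illustration only
T_sim = 39                    # length of the observation window in weeks

# mu ~ Gamma(s, beta), then tau ~ Exponential(mu) with E[tau] = 1/mu
mus = rng.gamma(shape=s_sim, scale=1.0 / beta_sim, size=5)
taus = rng.exponential(scale=1.0 / mus)

for mu_sim, tau_sim in zip(mus, taus):
    active = min(tau_sim, T_sim)  # a customer can only generate purchases while still alive
    print("mu = %.3f -> expected lifetime %.1f weeks, active for %.1f of %d weeks"
          % (mu_sim, 1.0 / mu_sim, active, T_sim))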

Likelihood

The likelihood of an individual's purchasing rate and lifetime, conditional on purchase frequency, recency, and time since initial purchase, is:

$$ L(\lambda, \mu~|~x,t_x,T) = \frac{\lambda^x \mu}{\lambda+\mu}e^{-(\lambda+\mu)t_x}+\frac{\lambda^{x+1}}{\lambda+\mu}e^{-(\lambda+\mu)T} $$

where $x$ is the repeat purchase frequency, $t_x$ is the recency and $T$ is the length of the calibration/training period.

Info
One thing worth emphasizing is the Poisson assumption about the purchase rate. If your customers follow some kind of pattern in their purchasing (stemming, for instance, from seasonality or regular purchase cycles), you might want to use a model where this Poisson assumption is lifted, so that such patterns can be captured; a rough check of this assumption on the current dataset is sketched after this note.

The Pareto/GGG model (http://www.reutterer.com/papers/platzer&reutterer_pareto-ggg_2016.pdf) might be a good fit if your business is in that kind of space.
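
Before reaching for a more complex model, a rough sanity check of the Poisson assumption on the current dataset could look like the sketch below (my own addition, not part of the original analysis): for a Poisson process, a customer's inter-purchase times are exponential, so their coefficient of variation should be roughly 1.

# Heuristic check: per-customer coefficient of variation (CV) of inter-purchase times
tx = cdnow_transactions.sort_values(['customer_id', 'date'])
gaps = tx.groupby('customer_id')['date'].diff().dt.days        # inter-purchase times in days
gap_stats = gaps.groupby(tx['customer_id']).agg(['mean', 'std', 'count'])
gap_stats = gap_stats[gap_stats['count'] >= 5]                 # keep customers with enough repeat purchases
cv = gap_stats['std'] / gap_stats['mean']
print(cv.describe())                                           # values far from 1 hint at non-Poisson patterns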

PyMC Implementation

To implement the model in PyMC3, we need to create a custom distribution that expresses the likelihood function. Below, we create a ParetoNBD class based on the pymc3.Continuous class. We simply add the parameterization to the constructor and implement a log-likelihood function.

In [24]:
from pymc3.math import exp, log

class ParetoNBD(pm.Continuous):
    """
    Custom distribution class for Pareto/NBD likelihood.
    """
    
    def __init__(self, lambda_, mu, *args, **kwargs):
        super(ParetoNBD, self).__init__(*args, **kwargs)
        self.lambda_ = lambda_
        self.mu = mu
        
    def logp(self, x, t_x, T):
        """
        Log-likelihood function for an individual customer's purchasing rate \lambda
        and lifetime \mu given their frequency, recency, and time since first purchase.
        """
        
        log_lambda = log(self.lambda_)
        log_mu = log(self.mu)
        mu_plus_lambda = self.lambda_ + self.mu
        log_mu_plus_lambda = log(mu_plus_lambda)
        
        p_1 = x * log_lambda + log_mu - log_mu_plus_lambda - t_x * mu_plus_lambda
        p_2 = (x + 1) * log_lambda - log_mu_plus_lambda - T * mu_plus_lambda
        
        return log(exp(p_1) + exp(p_2))

We can now define the model.
We first extract the data from the recency/frequency table created above with Lifetimes. Then, we place positively constrained, non-informative priors on the $r$, $\alpha$, $s$, and $\beta$ hyperparameters.

I will use half-Cauchy priors, as argued for by Polson and Scott here. I am not an expert on this matter and am following relatively well-accepted practice; we should play with other priors and explore how they affect the model fit, but for now let's stick with this. A half-Cauchy is one of the symmetric halves of the Cauchy distribution. It has a heavy tail and has been argued to be a fairly weakly informative prior (see here). Another classic choice is the inverse gamma, but the half-Cauchy is sometimes preferred because it behaves better for small parameter values, and it is only regarded as weakly informative when a large scale parameter is used. More information in the paper by Polson and Scott.
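
To make the comparison a bit more concrete, here is a quick sketch plotting the two candidate prior shapes (the scale parameters are arbitrary, chosen only for visual comparison):

from scipy import stats

x_grid = np.linspace(0.01, 10, 500)

plt.figure(figsize=(10, 4))
plt.plot(x_grid, stats.halfcauchy.pdf(x_grid, scale=2), label='Half-Cauchy(beta=2)')
plt.plot(x_grid, stats.invgamma.pdf(x_grid, 1, scale=1), label='Inverse-Gamma(1, 1)')
plt.title('Two candidate priors for the positive hyperparameters')
plt.xlabel('parameter value')
plt.ylabel('density')
plt.legend()
plt.show()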

Note as well the Gamma priors on lambda and mu. These come from the original Pareto/NBD model specification, but could be challenged in future use cases.

In [25]:
# Extract data for model following notation from Fader/Hardie
N = rf.shape[0] # number of customers
x = rf['frequency'].values
t_x = rf['recency'].values
T = rf['T'].values # length of training period

n_draws = 2000

pnbd_model = pm.Model()

with pnbd_model:
    
    # Uninformative priors on model hyperparameters
    r = pm.HalfCauchy('r', beta=2)
    alpha = pm.HalfCauchy('alpha', beta=2)
    s = pm.HalfCauchy('s', beta=2)
    beta = pm.HalfCauchy('beta', beta=2)
    
    ## Attach priors to individual parameters
    # Gamma prior on purchasing rate parameter lambda
    # Adding the shape = number of customers will give one lambda and mu parameter for each customer
    # The gamma prior are quite useful for conjugacy, but one might wonder what other prior could be
    lambda_ = pm.Gamma('lambda', 
                       alpha=r, 
                       beta=alpha, 
                       shape=N, 
                       testval=np.random.rand(N)
                      )
    # Gamma prior on lifetime parameter mu
    mu = pm.Gamma('mu', 
                  alpha=s, 
                  beta=beta, 
                  shape=N, 
                  testval=np.random.rand(N)
                 )

    # Custom distribution for Pareto-NBD likelihood function
    loglikelihood = ParetoNBD("loglikelihood", 
                              mu=mu, 
                              lambda_=lambda_, 
                              observed={'x': x, 't_x': t_x, 'T': T}
                             )
    
    # Sample the model
    # Train the model and extract the parameters
    trace = pm.sample(n_draws, init=None)
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, lambda, beta, s, alpha, r]
Sampling 4 chains, 0 divergences: 100%|██████████| 10000/10000 [06:08<00:00, 27.16draws/s]
The acceptance probability does not match the target. It is 0.8920781163121272, but should be close to 0.8. Try to increase the number of tuning steps.
The rhat statistic is larger than 1.2 for some parameters.
The estimated number of effective samples is smaller than 200 for some parameters.
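
Given the warnings above (an r-hat above 1.2 and fewer than 200 effective samples for some parameters), it can be worth inspecting the numerical diagnostics directly, for instance with PyMC3's summary. This is just a quick sketch; for a production model I would increase the number of tuning steps and draws, but I will keep the short run for this post (if your PyMC3 version complains about varnames, var_names works the same way):

# Numerical convergence diagnostics (effective sample size, r-hat) for the hyperparameters
print(pm.summary(trace, varnames=['r', 'alpha', 's', 'beta']))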

We can now visualize the traces and posteriors generated by the model. The PyMC3 traceplot displays, for each parameter, the posterior distribution and the sampled values stored in the trace.

In [26]:
# Traceplots to check for convergence
_ = pm.traceplot(trace, varnames=['r','alpha','s','beta'])
/home/julien/anaconda3/envs/data_blog/lib/python3.7/site-packages/pymc3/plots/__init__.py:21: UserWarning: Keyword argument `varnames` renamed to `var_names`, and will be removed in pymc3 3.8
  warnings.warn('Keyword argument `{old}` renamed to `{new}`, and will be removed in pymc3 3.8'.format(old=old, new=new))

Next, we plot the posterior densities in the style of John K. Kruschke's book.
HPD = Highest Posterior Density: for a level α, the HPD region is a (1−α) credible region in which the posterior density of every point inside the region is higher than the posterior density of any point outside of it (a manual HPD computation is sketched after the plot below).

In [27]:
_ = pm.plot_posterior(trace, varnames=['r','alpha','s', 'beta'])
/home/julien/anaconda3/envs/data_blog/lib/python3.7/site-packages/pymc3/plots/__init__.py:21: UserWarning: Keyword argument `varnames` renamed to `var_names`, and will be removed in pymc3 3.8
  warnings.warn('Keyword argument `{old}` renamed to `{new}`, and will be removed in pymc3 3.8'.format(old=old, new=new))
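
To make the HPD concrete, here is a small sketch that computes a 95% HPD interval directly from the posterior samples of r; for a unimodal posterior, this is simply the shortest interval containing 95% of the draws:

def hpd_interval(samples, alpha=0.05):
    """Shortest interval containing (1 - alpha) of the samples (assumes a unimodal posterior)."""
    sorted_samples = np.sort(samples)
    n = len(sorted_samples)
    n_included = int(np.floor((1 - alpha) * n))
    # width of every candidate interval containing n_included consecutive sorted draws
    widths = sorted_samples[n_included:] - sorted_samples[:n - n_included]
    best = np.argmin(widths)
    return sorted_samples[best], sorted_samples[best + n_included]

print("95%% HPD for r: [%.3f, %.3f]" % hpd_interval(trace['r']))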

We can calculate other KPIs that are relevant for the business. For instance, let's look at the probability that a customer is still alive at a given period. We can write a function that calculates that probability. These functions were taken from the talk by Ben Van Dyke and Jean-Rene Gauthier.

In [28]:
def prob_alive_at_T(lambda_, mu, t_x, T):
    """
    Probability a customer's time of death \tau occurs after T.
    
    Pr(\tau > T \> | \> \lambda, \mu, x, t_x,T) = \dfrac{1}{1+\dfrac{\mu}{\mu+\lambda}
    \big\{e^{(\lambda+\mu)(T-t_x)}-1\big\}}
    
    See expression (7) in technical appendix of Abe.
    
    :param lambda_: lambda parameter at the customer level
    :type lambda_: scalar
    :param mu: mu parameter at the customer level
    :type mu: scalar
    :param t_x: recency of transactions
    :type t_x: float
    :param T: duration of the calibration/training period
    :type T: float
    
    :return: probability of being alive at time T.
    """

    den = 1 + (mu / (lambda_ + mu)) * (np.exp((lambda_ + mu) * (T - t_x)) - 1)
    return 1 / den

We can also calculate the cumulative expected purchases of a customer in a future period:

In [29]:
def likelihood(lambda_, mu, x, t, T):
    """Pareto/NBD likelihood function.

    :param lambda_: lambda parameter at the customer-level.
    :param mu: mu parameter at the customer level
    :param x: number of repeat transactions
    :param t: recency
    :param T: length of the calibration/training period.

    :return: likelihood value.
    """

    p1 = x * np.log(lambda_) + np.log(mu) - np.log(mu + lambda_) - t * (mu + lambda_)
    p2 = (x + 1) * np.log(lambda_) - np.log(mu + lambda_) - T * (mu + lambda_)
    return np.exp(p1) + np.exp(p2)

def predict(t, lambda_, mu, x, tx, T):
    """Conditional expected purchases at end of time (T, T+t).

    Used to assess holdout period performance and to make predictions
    for future time period t. Conditional on purchase history (x, t_x, T).

    E[X(T,T+t) \> | \> \lambda, \mu, x, t_x, T] = \dfrac{1}{L(\lambda, \mu)|(x, t_x, T)}
    \times \dfrac{\lambda^{x+1}}{\mu}e^{-(\lambda+\mu)T}(1-e^{-\mu t})

    http://brucehardie.com/notes/034/corr_Pareto-NBD.pdf

    :param t: time period
    :type t: scalar

    :param lambda_: lambda parameter at the customer level
    :type lambda_: scalar

    :param mu: mu parameter at the customer level
    :type mu: scalar

    :param x: number of repeat transactions
    :type x: integer

    :param tx: recency of transactions
    :type tx: float

    :param T: duration of the calibration/training period
    :type T: float

    :return expected number of purchases (scalar)
    """
    like = likelihood(lambda_, mu, x, tx, T)
    p2 = lambda_ ** (x + 1) / mu * np.exp(-(lambda_ + mu) * T) * (1 - np.exp(-mu * t))

    return 1 / like * p2

Model Check

Out-of-sample Evaluation

As seen in Post I, Lifetimes provides utility functions that split the dataset to obtain an out-of-sample set, sometimes referred to as the holdout period. Think of the train/test split in scikit-learn.

In [30]:
from lifetimes.utils import calibration_and_holdout_data

rfm = calibration_and_holdout_data(
    cdnow_transactions, 
    customer_id_col='customer_id',
    datetime_col='date',
    calibration_period_end='1997-09-30',
    observation_period_end='1998-06-30',
    freq='W'
)

rfm.describe()
Out[30]:
frequency_cal recency_cal T_cal frequency_holdout duration_holdout
count 2357.000000 2357.000000 2357.000000 2357.000000 2357.0
mean 0.940178 6.842597 32.985999 0.758591 39.0
std 1.835391 10.729357 3.346091 1.862830 0.0
min 0.000000 0.000000 27.000000 0.000000 39.0
25% 0.000000 0.000000 30.000000 0.000000 39.0
50% 0.000000 0.000000 33.000000 0.000000 39.0
75% 1.000000 12.000000 36.000000 1.000000 39.0
max 23.000000 39.000000 39.000000 26.000000 39.0

To generate predictions for a given customer, we can take the posterior distributions of the individual lambda and mu parameters from the model trace and feed them into the predict function defined previously.
t is set to the time window corresponding to the holdout period, here 39 weeks. The function returns one prediction per draw across the 4 Markov chains (2000 draws per chain, so 8000 in total in this example), which together represent the posterior predictive distribution of future purchases for that customer.

Generating Predictions for a Single Customer

Let's select the customer at index 150 in the dataset and step through the model's prediction. This customer made two repeat purchases in the calibration period and three purchases in the holdout period.

In [31]:
customer_index = 150
# show purchasing behavior
rfm.iloc[customer_index]
Out[31]:
frequency_cal         2.0
recency_cal          17.0
T_cal                38.0
frequency_holdout     3.0
duration_holdout     39.0
Name: 1623, dtype: float64

The customer at index 150 made 2 repeat purchases in the calibration period (with a recency of 17 weeks), and made another 3 purchases in the holdout period.

In [32]:
# Extract the posterior for lambda and mu from the trace
lambda_post = trace['lambda']
mu_post = trace['mu']

# Select distributions of lambda and mu for a customer
# Get the entire 2000 data points for that customer
lambda_individual = lambda_post[:,customer_index]
mu_individual = mu_post[:,customer_index]

# predict purchases for the user at t = 39
t = 39 # Hold out window
predict(t, 
        lambda_individual, 
        mu_individual, 
        x[customer_index], 
        t_x[customer_index], 
        T[customer_index]).mean()
Out[32]:
0.9396028735385188

We underforecast the number of purchases for this customer in the holdout period (about 0.9 predicted versus 3 observed), but we correctly identified that the customer was still alive.
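
To back that last point up, we can feed the same posterior draws into the prob_alive_at_T function defined earlier, which gives a posterior distribution over the probability that this customer is still alive at the end of the calibration period (a sketch; the output is not reproduced here):

# Posterior distribution of P(alive at end of calibration period) for this customer
p_alive_individual = prob_alive_at_T(lambda_individual,
                                     mu_individual,
                                     t_x[customer_index],
                                     T[customer_index])
print("Mean P(alive at T): %.2f" % p_alive_individual.mean())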

Holdout Period Predictions for Entire Customer Cohort

To get a picture of model performance for the entire customer base, we loop over all customers and compute the posterior predictive distribution of holdout-period purchases for each of them.

In [51]:
# holdout period is 39 weeks
t = 39
# predictions are size of customer base x number of draws
# Loop through the entire array of posterior parameters
holdout_predictions = np.empty([N, n_draws*4])

for i in np.arange(N):
    holdout_predictions[i] = predict(
        t, 
        lambda_post[:,i], 
        mu_post[:,i], 
        x[i], 
        t_x[i], 
        T[i]
    )

# the prediction for each customer is the posterior mean of their draws,
# stored back in the rfm dataframe
rfm['frequency_predicted'] = holdout_predictions.mean(axis=1)

In the holdout period we predicted roughly 1,648 purchases and observed 1,788.

In [52]:
rfm[['frequency_holdout','frequency_predicted']].sum()
Out[52]:
frequency_holdout      1788.000000
frequency_predicted    1647.947865
dtype: float64

The plot below shows the average purchases in the holdout period grouped by the number of repeat purchases made in the calibration period.

In [53]:
# Diagnostic plot
# Binned by number of transactions in the calibration period,
# plot the average  number of transactions in the holdout period
# for observed data and what the model predicted

# the dashed black line is a simple heuristic: project the calibration-period purchase rate into the holdout period
xlim=(1, 20)
mean_frequencies = rfm.groupby('frequency_cal')[['frequency_holdout',
                                     'frequency_predicted']].mean()
mean_frequencies.rename(columns={'frequency_holdout': 'Observed',
                                 'frequency_predicted': 'Model'},
                        inplace=True)
mean_frequencies.plot(kind='line',
                      title='Conditional Expected Purchases in Holdout Period', 
                      figsize=(16, 8))

# Generate a dummy model with holdout freq = t_holdout/t_calib
t_calib = np.mean(rfm['T_cal'])
t_holdout = t
x_heuristics = np.linspace(xlim[0],xlim[1],100)
heuristics = t_holdout/t_calib * x_heuristics

plt.plot(x_heuristics, 
         heuristics, 
         linestyle='--', 
         color='black', 
         label='Simple Heuristics'
        )
plt.legend(loc=0)
plt.xlim(xlim)
plt.ylabel('Mean Number of Transactions in Holdout Period')
plt.xlabel('Number of Transactions in Calibration Period')
plt.grid(False)
plt.show()

Most customers don't make many repeat transactions, so the model fits better at lower transaction counts. This probably explains why the model diverges slightly from the holdout data for higher purchase numbers.
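
A quick look at how many customers sit behind each point of that plot supports this (a small sketch):

# Number of customers in each calibration-period frequency bin;
# the higher bins contain very few customers, so their holdout averages are noisy
bin_counts = rfm['frequency_cal'].value_counts().sort_index()
print(bin_counts)
print("Share of customers with at most 2 repeat purchases: %.1f%%"
      % (100 * (rfm['frequency_cal'] <= 2).mean()))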

In [364]:
from sklearn.metrics import mean_squared_error
print("RMSE: %s" % np.sqrt(mean_squared_error(rfm['frequency_holdout'], 
                                              rfm['frequency_predicted']
                                             )
                          )
     )
RMSE: 1.41695547193