By Julien Hernandez Lallement, 2020-04-12, in category Tutorial
import pymc3 as pm
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import lifetimes
This post follows an initial post here on how to use a BetaGeo model to predict customer purchases.
In the previous post, we implemented a BetaGeo model to predict purchases of single customers using the python package lifetimes. Lifetimes provides the traditional analytical approach to model fitting, computing purchase predictions and other related quantities useful for businesses. While most of the math behind the models is pre-programmed, the analytical approach can be quite complex.
Probabilistic programming packages such as PyMC (or Stan, which is often used from R) enable the estimation of these complex parameters using Markov chain Monte Carlo, and give us more flexibility in prior specification. The Bayesian concept of a prior allows you to encode in the model what you already know about the way a process (in this case, purchasing) works.
In addition, the Markov chain yields the full posterior distributions of the model parameters and hyperparameters. One of the major advantages of this Bayesian approach is that we can express our predictions as probability distributions rather than single point estimates, which gives us some insight into the level of uncertainty in our predictions.
Here, I will be using a Bayesian version of the Pareto/NBD model, discussed in the same paper by Fader et al. (2005) (not the original Pareto/NBD paper; that model was introduced in the late 1980s). This post is inspired by a talk by Ben Van Dyke and Jean-Rene Gauthier at PyData Seattle 2017, and uses some of their material.
Let's first load the dataset provided by the lifetimes package, which contains data about the (sometimes repeated) purchases of some customers, with features such as date of transaction, quantity purchased and amount spent in some currency, probably dollars. See here for a source of this dataset using the Pareto model.
from lifetimes.datasets import load_dataset
cdnow_transactions = load_dataset(
    'CDNOW_sample.txt',
    header=None,
    delim_whitespace=True,
    names=['customer_id', 'customer_index', 'date', 'quantity', 'amount'],
    converters={'date': lambda x: pd.to_datetime(x, format="%Y%m%d")}
)
cdnow_transactions.head()
Like in the previous post, the Pareto model requires a simple recency/frequency matrix, which can easily be produced thanks to some useful util functions of the lifetimes package:
# lifetimes provides a transaction log -> rfm util function
rf = lifetimes.utils.summary_data_from_transaction_data(
    cdnow_transactions,
    'customer_id',
    'date',
    observation_period_end=pd.to_datetime('1997-09-30'),
    freq='W'
)
rf.head()
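As a quick optional check (not part of the original workflow), we can look at how the frequency, recency and T features are distributed across customers with pandas:
# Optional: summary statistics of the recency/frequency matrix
rf.describe()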
From the paper by Fader et al. (2005), I understood that the Pareto/NBD and BetaGeo models differ mainly in that the Pareto/NBD model assumes that customer dropout can occur at any moment in time, while the BetaGeo model assumes that dropout occurs right after a purchase (assumption number 3 in post 1).
Let's look at the assumptions of the Pareto/NBD model (taken from Ben Van Dyke and Jean-Rene Gauthier).
The number of purchases made in a period $\Delta t$ follows a Poisson distribution:
$$ p(x~|~\lambda, \Delta t ) = \frac{(\lambda \Delta t)^x}{x!} e^{-\lambda \Delta t} $$
The purchasing rate parameter $\lambda$ is Gamma distributed with parameters $r$ and $\alpha$:
$$\lambda \sim \Gamma(r, \alpha)$$
A customer with a transaction rate $\lambda$ will make on average $\lambda \times \Delta t$ transactions in a period of time $\Delta t$.
At the customer level, the lifetime $\tau$ follows an exponential distribution:
$$ p(\tau~|~\mu) = \mu e^{-\mu \tau } $$
where $\tau > 0$. In other words, each customer has their own lifetime distribution. Note that the expected value of the lifetime $\tau$ is $E[\tau~|~\mu] = \frac{1}{\mu}$.
The value of $\mu$ varies across customers according to another Gamma distribution with shape $s$ and rate $\beta$:
$$\mu \sim \Gamma(s, \beta)$$
The likelihood for an individual's purchasing rate and lifetime, conditional on purchasing frequency, recency, and time since initial purchase, is:
$$ L(\lambda, \mu~|~x,t_x,T) = \frac{\lambda^x \mu}{\lambda+\mu}e^{-(\lambda+\mu)t_x}+\frac{\lambda^{x+1}}{\lambda+\mu}e^{-(\lambda+\mu)T} $$
where $x$ is the repeat purchase frequency, $t_x$ is the recency and $T$ is the length of the calibration/training period.
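To build some intuition for these assumptions, here is a minimal simulation sketch of the generative process using numpy (my own illustration; the hyperparameter values r_sim, alpha_sim, s_sim, beta_sim below are made up for the example, not fitted):
# Illustrative simulation of the Pareto/NBD generative process (made-up hyperparameters)
rng = np.random.RandomState(42)
r_sim, alpha_sim, s_sim, beta_sim = 0.5, 10.0, 0.5, 10.0  # hypothetical values
T_sim = 39  # observation window in weeks

lambda_sim = rng.gamma(shape=r_sim, scale=1.0 / alpha_sim, size=5)  # purchasing rates
mu_sim = rng.gamma(shape=s_sim, scale=1.0 / beta_sim, size=5)       # dropout rates
tau_sim = rng.exponential(scale=1.0 / mu_sim)                       # customer lifetimes

# While alive, the number of purchases in the window follows a Poisson distribution
active_time = np.minimum(tau_sim, T_sim)
purchases_sim = rng.poisson(lambda_sim * active_time)
print(np.c_[lambda_sim, mu_sim, tau_sim, purchases_sim])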
To implement the model in PyMC, we need to create a custom distribution expressing the likelihood function. Below, we create a ParetoNBD class based on the pymc3.Continuous class. We simply add the parameterization to the constructor and implement a log-likelihood function.
from pymc3.math import exp, log
class ParetoNBD(pm.Continuous):
    """
    Custom distribution class for the Pareto/NBD likelihood.
    """
    def __init__(self, lambda_, mu, *args, **kwargs):
        super(ParetoNBD, self).__init__(*args, **kwargs)
        self.lambda_ = lambda_
        self.mu = mu

    def logp(self, x, t_x, T):
        """
        Log-likelihood function for an individual customer's purchasing rate \lambda
        and lifetime \mu given their frequency, recency and time since first purchase.
        """
        log_lambda = log(self.lambda_)
        log_mu = log(self.mu)
        mu_plus_lambda = self.lambda_ + self.mu
        log_mu_plus_lambda = log(mu_plus_lambda)

        p_1 = x * log_lambda + log_mu - log_mu_plus_lambda - t_x * mu_plus_lambda
        p_2 = (x + 1) * log_lambda - log_mu_plus_lambda - T * mu_plus_lambda

        return log(exp(p_1) + exp(p_2))
We can now define the model.
We first extract the data from the recency/frequency table created before with lifetimes. Then, we place positively constrained, non-informative priors on the $r$, $\alpha$, $s$, and $\beta$ hyperparameters.
I will use half-Cauchy priors, as argued for by Polson and Scott here. I am not an expert on this matter, and am following relatively accepted practice. We should play with other priors and explore how that might affect the model fit, but for now let's stick with this. A half-Cauchy is one of the symmetric halves of the Cauchy distribution. It has a heavy tail and has been argued to be a fairly weakly informative prior (see here). Another classic choice is the inverse-gamma prior, but the half-Cauchy is sometimes preferred because it behaves better for small parameter values; it is only regarded as weakly informative when a large scale parameter is used. More information in the first paper by Polson and Scott.
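To get a feel for how heavy-tailed the half-Cauchy is, here is a small optional sketch comparing its density to a half-normal of the same scale using scipy (purely illustrative; the scale of 2 matches the beta=2 used in the priors below):
from scipy import stats

# Compare the half-Cauchy prior density with a half-normal of the same scale
grid = np.linspace(0, 20, 500)
plt.plot(grid, stats.halfcauchy.pdf(grid, scale=2), label='Half-Cauchy(beta=2)')
plt.plot(grid, stats.halfnorm.pdf(grid, scale=2), label='Half-Normal(scale=2)')
plt.legend()
plt.title('Heavy tail of the half-Cauchy prior')
plt.show()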
Note as well the gamma priors on lambda & mu. These come from the initial Pareto/NBD model, but should be challenged in future use cases.
# Extract data for model following notation from Fader/Hardie
N = rf.shape[0]  # number of customers
x = rf['frequency'].values
t_x = rf['recency'].values
T = rf['T'].values  # length of training period
n_draws = 2000
pnbd_model = pm.Model()
with pnbd_model:
    # Uninformative priors on the model hyperparameters
    r = pm.HalfCauchy('r', beta=2)
    alpha = pm.HalfCauchy('alpha', beta=2)
    s = pm.HalfCauchy('s', beta=2)
    beta = pm.HalfCauchy('beta', beta=2)

    ## Attach priors to the individual-level parameters
    # Gamma prior on the purchasing rate parameter lambda
    # Setting shape = number of customers gives one lambda and one mu parameter per customer
    # The Gamma priors are quite useful for conjugacy, but one might wonder what other priors could be used
    lambda_ = pm.Gamma('lambda',
                       alpha=r,
                       beta=alpha,
                       shape=N,
                       testval=np.random.rand(N)
                       )

    # Gamma prior on the lifetime parameter mu
    mu = pm.Gamma('mu',
                  alpha=s,
                  beta=beta,
                  shape=N,
                  testval=np.random.rand(N)
                  )

    # Custom distribution for the Pareto/NBD likelihood function
    loglikelihood = ParetoNBD("loglikelihood",
                              mu=mu,
                              lambda_=lambda_,
                              observed={'x': x, 't_x': t_x, 'T': T}
                              )

    # Sample the model
    # Train the model and extract the parameters
    trace = pm.sample(n_draws, init=None)
We can now visualize the traces and posteriors generated by the model. The PyMC3 traceplot displays the parameter distributions and the sampled values stored in trace.
# Traceplots to check for convergence
_ = pm.traceplot(trace, varnames=['r','alpha','s','beta'])
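Besides eyeballing the traceplots, a quick numerical check (my addition, assuming the same older PyMC3 API used throughout this post) is the summary table, which reports the Gelman-Rubin statistic Rhat for each hyperparameter; values close to 1 suggest the chains have converged:
# Numerical convergence check: Rhat close to 1 indicates well-mixed chains
pm.summary(trace, varnames=['r', 'alpha', 's', 'beta'])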
Plot the posterior densities in the style of John K. Kruschke's book.
HPD = Highest Posterior Density: a region of level α is a (1−α) credible region $I_\alpha$ such that the posterior density of every point inside the region is higher than the posterior density of any point outside of it.
_ = pm.plot_posterior(trace, varnames=['r','alpha','s', 'beta'])
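To make the HPD idea concrete, here is a minimal sketch (my own helper, not a lifetimes or PyMC3 function) that computes the 95% HPD interval for r directly from the posterior samples, as the shortest interval containing 95% of the draws:
def hpd_interval(samples, cred_mass=0.95):
    """Shortest interval containing cred_mass of the posterior samples."""
    sorted_samples = np.sort(samples)
    n = len(sorted_samples)
    interval_size = int(np.floor(cred_mass * n))
    # Width of every candidate interval containing interval_size samples
    widths = sorted_samples[interval_size:] - sorted_samples[:n - interval_size]
    best = np.argmin(widths)
    return sorted_samples[best], sorted_samples[best + interval_size]

hpd_interval(trace['r'])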
We can calculate other KPIs that are relevant for business. For instance, let's look at the probability that a customer is still alive at a future period. We can write a function that calculates that probability. These functions were taken from a talk by Ben Van Dyke and Jean-Rene Gauthier.
def prob_alive_at_T(lambda_, mu, t_x, T):
    """
    Probability that a customer's time of death \tau occurs after T.

    Pr(\tau > T \> | \> \lambda, \mu, x, t_x, T) = \dfrac{1}{1+\dfrac{\mu}{\mu+\lambda}
    \big\{e^{(\lambda+\mu)(T-t_x)}-1\big\}}

    See expression (7) in the technical appendix of Abe.

    :param lambda_: lambda parameter at the customer level
    :type lambda_: scalar
    :param mu: mu parameter at the customer level
    :type mu: scalar
    :param t_x: recency of transactions
    :type t_x: float
    :param T: duration of the calibration/training period
    :type T: float
    :return: probability of being alive at time T.
    """
    den = 1 + (mu / (lambda_ + mu)) * (np.exp((lambda_ + mu) * (T - t_x)) - 1)
    return 1 / den
We can also calculate the cumulative expected purchases of a customer in a future period:
def likelihood(lambda_, mu, x, t, T):
    """Pareto/NBD likelihood function.

    :param lambda_: lambda parameter at the customer level.
    :param mu: mu parameter at the customer level
    :param x: number of repeat transactions
    :param t: recency
    :param T: length of the calibration/training period.
    :return: likelihood value.
    """
    p1 = x * np.log(lambda_) + np.log(mu) - np.log(mu + lambda_) - t * (mu + lambda_)
    p2 = (x + 1) * np.log(lambda_) - np.log(mu + lambda_) - T * (mu + lambda_)
    return np.exp(p1) + np.exp(p2)
def predict(t, lambda_, mu, x, tx, T):
    """Conditional expected purchases in the time period (T, T+t).

    Used to assess holdout period performance and to make predictions
    for a future time period t. Conditional on purchase history (x, t_x, T).

    E[X(T,T+t) \> | \> \lambda, \mu, x, t_x, T] = \dfrac{1}{L(\lambda, \mu)|(x, t_x, T)}
    \times \dfrac{\lambda^{x+1}}{\mu}e^{-(\lambda+\mu)T}(1-e^{-\mu t})

    http://brucehardie.com/notes/034/corr_Pareto-NBD.pdf

    :param t: time period
    :type t: scalar
    :param lambda_: lambda parameter at the customer level
    :type lambda_: scalar
    :param mu: mu parameter at the customer level
    :type mu: scalar
    :param x: number of repeat transactions
    :type x: integer
    :param tx: recency of transactions
    :type tx: float
    :param T: duration of the calibration/training period
    :type T: float
    :return: expected number of purchases (scalar)
    """
    like = likelihood(lambda_, mu, x, tx, T)
    p2 = lambda_ ** (x + 1) / mu * np.exp(-(lambda_ + mu) * T) * (1 - np.exp(-mu * t))
    return 1 / like * p2
from lifetimes.utils import calibration_and_holdout_data
rfm = calibration_and_holdout_data(
    cdnow_transactions,
    customer_id_col='customer_id',
    datetime_col='date',
    calibration_period_end='1997-09-30',
    observation_period_end='1998-06-30',
    freq='W'
)
rfm.describe()
To generate predictions for a given customer, we can take the posterior distributions of the individual lambda and mu parameters from the model trace and feed them into the predict function previously defined. Here t is equal to the time window corresponding to the holdout period, i.e. 39 weeks. The function returns one prediction for each draw across the 4 Markov chains (2000 draws per chain in this example, so 8000 in total), which together represent the posterior predictive distribution of future purchases.
Let's select the 150th user in the dataset and step through the results of model prediction. This customer made two purchases in the calibration period and three purchases in the holdout period.
customer_index = 150
# show purchasing behavior
rfm.iloc[customer_index]
Customer #150 made 2 purchases in the calibration period (17 weeks apart), and another 3 purchases in the holdout period.
# Extract the posterior for lambda and mu from the trace
lambda_post = trace['lambda']
mu_post = trace['mu']
# Select distributions of lambda and mu for a customer
# Get all posterior draws for that customer (n_draws x number of chains)
lambda_individual = lambda_post[:,customer_index]
mu_individual = mu_post[:,customer_index]
# predict purchases for the user at t = 39
t = 39 # Hold out window
predict(t,
        lambda_individual,
        mu_individual,
        x[customer_index],
        t_x[customer_index],
        T[customer_index]).mean()
We underforecast the number of purchases for this customer in the holdout period, but we correctly identified that the customer was still alive.
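To back up that last point, we can feed the same posterior draws into the prob_alive_at_T function defined earlier and average over draws, which gives the posterior probability that this customer is still alive at the end of the calibration period (a small addition to the original material):
# Posterior probability that customer #150 is still alive at time T
prob_alive_at_T(lambda_individual,
                mu_individual,
                t_x[customer_index],
                T[customer_index]).mean()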
To get a picture of model performance for the entire customer base, let's loop over all customers and compute their predicted number of purchases in the holdout period.
# holdout period is 39 weeks
t = 39

# predictions are of size (number of customers) x (number of draws)
# Loop through the entire array of posterior parameters
holdout_predictions = np.empty([N, n_draws * 4])
for i in np.arange(N):
    holdout_predictions[i] = predict(
        t,
        lambda_post[:, i],
        mu_post[:, i],
        x[i],
        t_x[i],
        T[i]
    )

# predictions are the posterior mean for each customer,
# put back into the calibration/holdout dataframe
rfm['frequency_predicted'] = holdout_predictions.mean(axis=1)
In the holdout period we predicted 1,724 purchases and observed 1,788.
rfm[['frequency_holdout','frequency_predicted']].sum()
The plot below shows the average purchases in the holdout period grouped by the number of repeat purchases made in the calibration period.
# Diagnostic plot
# Binned by number of transactions in the calibration period,
# plot the average number of transactions in the holdout period
# for observed data and what the model predicted
# the black line is a simple heuristic, what would be observed based on the previous rate
xlim = (1, 20)
mean_frequencies = rfm.groupby('frequency_cal')[['frequency_holdout',
                                                 'frequency_predicted']].mean()
mean_frequencies.rename(columns={'frequency_holdout': 'Observed',
                                 'frequency_predicted': 'Model'},
                        inplace=True)
mean_frequencies.plot(kind='line',
                      title='Conditional Expected Purchases in Holdout Period',
                      figsize=(16, 8))

# Generate a dummy model with holdout freq = t_holdout/t_calib
t_calib = np.mean(rfm['T_cal'])
t_holdout = t
x_heuristics = np.linspace(xlim[0], xlim[1], 100)
heuristics = t_holdout / t_calib * x_heuristics

plt.plot(x_heuristics,
         heuristics,
         linestyle='--',
         color='black',
         label='Simple Heuristics'
         )
plt.legend(loc=0)
plt.xlim(xlim)
plt.ylabel('Mean Number of Transactions in Holdout Period')
plt.xlabel('Number of Transactions in Calibration Period')
plt.grid(False)
plt.show()
Most customers do not make many repeat transactions, so the model fits best at lower transaction counts. This probably explains why the model diverges slightly from the holdout data for higher purchase numbers.
from sklearn.metrics import mean_squared_error
print("RMSE: %s" % np.sqrt(mean_squared_error(rfm['frequency_holdout'],
rfm['frequency_predicted']
)
)
)
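As a complementary check (not in the original analysis), one could also compute the mean absolute error, which reads as the average per-customer miss in number of purchases:
from sklearn.metrics import mean_absolute_error

print("MAE: %s" % mean_absolute_error(rfm['frequency_holdout'],
                                      rfm['frequency_predicted']))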