By Julien Hernandez Lallement, 2020-10-15, in category Course
Generally speaking, regression methods are supervised learning techniques. They use continuous variables (independent variables) to predict the behavior of a dependent variable. They can use different equations to fit straight lines (linear regression), polynomial functions (capturing interaction effects) or other functions to predict the dependent variable.
In this post, I will be focusing on logistic regression, which is typically used as a classification algorithm. Logistic regression is however not a classification algorithm per se, but can be used as such when a probability threshold is set to differentiate between classes.
Logistic regression can be used in different scenarios.
Let's first explain why the most common use of logistic regression is classification. Again, to be clear, logistic regression is not a classification algorithm per se. It can be used for classification if a threshold is set, above and below which the vector of features will categorize the data point as belonging to one class or the other.
Look at the graph below:
import numpy as np
import matplotlib.pyplot as plt

# Illustrative class-labelled points (class 1 vs class 0, against one feature)
positive = [np.ones(5), np.array([1, 2, 3, 4, 6])]
negative = [np.zeros(5), np.array([-6, -4, -3, -2, -1])]
plt.scatter(positive[1], positive[0]);
plt.scatter(negative[1], negative[0]);
sigmoid = lambda x: 1 / (1 + np.exp(-x))
x = np.linspace(-10, 10, 100)
plt.plot(x, sigmoid(x), 'b', label='sigmoid')  # logistic curve (blue)
plt.plot(np.linspace(-7.5, 6, 100), np.linspace(0, 1, 100), 'g')  # illustrative "linear fit" (green)
plt.grid()
plt.ylabel('Class')
Here, the y axis shows the class of the data points (spam or not spam), and the x axis shows a feature of these data points (say, the number of spelling mistakes in the mail's subject line).
As you can imagine, a linear fit (green line; not a real fitted line) would be quite poor in this case. If new data comes in with y=1 but a high value of x, your linear fit will show a tremendous increase in residuals, and fit the data poorly.
Moreover, if the y axis does not reflect classes but probabilities of belonging to a class, you will have issues because your linear fit might predict values >1 or <0, which would be nonsense.
One way to circumvent this problem is to use a sigmoidal function
, such as the logistic function:
$ h_\theta(x) = \frac{1}{1 + e^{-\theta_0 - \theta_1 x}} $
This function is shown in the figure above as the blue line. Using it, a newly arriving data point would be classified as 1 or 0 based on the probability threshold set for $h_\theta$.
Let's have a closer look at it:
import math
import numpy as np
import matplotlib.pyplot as plt

x_values = np.linspace(-5, 5, 100)
y_values = [1 / (1 + math.e**(-x)) for x in x_values]  # logistic function
plt.plot(x_values, y_values)
plt.axhline(.5)  # probability of 0.5...
plt.axvline(0)   # ...reached at x = 0
You can see why this is a great function for a probability measure. The y value represents the probability and only ranges between 0 and 1. Also, an x value of zero gives a probability of 0.5, increasingly positive x values give probabilities closer to 1, and increasingly negative x values give probabilities closer to 0.
The aforementioned threshold should be decided based on business knowledge and common sense. A threshold of 0.5 would be quite permissive, allowing some "smart" spam (if that exists) to land in the mailbox. In turn, we might move the threshold so that fewer spam mails reach the mailbox, while taking the risk of producing false positives (legitimate mails that land in the spam folder). Trial and error is a good technique to monitor the threshold and its efficiency.
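To make this concrete, here is a minimal sketch of thresholding, assuming we already have predicted probabilities for a few mails (the values below are made up):

import numpy as np

# Hypothetical predicted probabilities of being spam for five mails
probs = np.array([0.20, 0.45, 0.55, 0.70, 0.95])

for threshold in (0.5, 0.8):
    flagged = (probs >= threshold).astype(int)  # 1 = classified as spam
    print(threshold, flagged)

Moving the threshold changes which mails get flagged, which is exactly the trade-off described above.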
Once we have our features, we need to use them in our logistic function.
For example:
$$x = \beta_{0} + \beta_{1}Feat1 + \beta_{2}Feat2$$

Where Feat1 is our value for Feature 1 and Feat2 is our value for Feature 2. For those of you familiar with Linear Regression, this should look very familiar. Basically, we are assuming that x is a linear combination of our data plus an intercept.
For example, take the iris dataset. Say we have a plant with a sepal width of 3.5 and a sepal length of 5. Say we already know the parameters $\beta$ so that $\beta_{0} = 1$, $\beta_{1} = 2$, and $\beta_{2} = 4$. This would imply:
$$x = 1 + (2 * 3.5) + (4 * 5) = 28$$

Plugging this into our logistic function gives:

$$\frac{1}{1 + e^{-28}} \approx .99$$

So we would give a probability above 99% to a plant with those dimensions being Setosa.
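As a quick sanity check, here is the same computation in Python (the $\beta$ values are the hypothetical ones from the example above):

import math

# Hypothetical, pre-set parameters from the example above
beta_0, beta_1, beta_2 = 1, 2, 4
sepal_width, sepal_length = 3.5, 5

x = beta_0 + beta_1 * sepal_width + beta_2 * sepal_length  # linear combination = 28
prob = 1 / (1 + math.exp(-x))                              # logistic function
print(x, prob)                                             # 28.0, essentially 1.0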
While we entered static values of $\beta$ here, these parameters are the ones that the machine should learn, and that is what Logistic Regression will do for you.
While for Linear Regression and its variants the cost function is typically a Residual Sum of Squares, things become a bit more tricky with Logistic Regression. Indeed, while the cost function of Linear Regression has a single minimum, that is no longer guaranteed once the logistic function is involved: plugging it into a squared-error cost yields a non-convex function with potentially many local minima.
One typically used cost function is the following, which does have a single minimum:
$ J(\theta) = -\frac{1}{N} \sum\limits_{n=1}^{N} \left[ y^n \ln\left(h_\theta(x^n)\right) + (1 - y^n) \ln\left(1 - h_\theta(x^n)\right) \right] $
This function might seem quite complex at first, but when scrutinizing it, some might already see an analogy with Maximum Likelihood Estimation using probabilities (see my post on Introduction to Machine Learning). And indeed, there is a clear link, since minimizing this cost function amounts to maximizing the likelihood of the data belonging to its class.
In other words, one can see that when y = 1, the cost contribution decreases monotonically towards 0 as $h_\theta$ approaches 1. Similarly, when y = 0, the cost is 0 when $h_\theta$ is equal to 0.
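To build some intuition, here is a minimal sketch of this cost function in Python (the probabilities are clipped to avoid taking the log of 0):

import numpy as np

def logistic_cost(y_true, y_prob):
    # Cross-entropy cost J(theta), with h_theta(x) given as predicted probabilities
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), 1e-15, 1 - 1e-15)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(logistic_cost([1, 0], [0.99, 0.01]))  # confident and correct -> cost near 0
print(logistic_cost([1, 0], [0.01, 0.99]))  # confident and wrong -> large cost (~4.6)

Correct, confident predictions drive the cost towards 0, while confident mistakes are heavily penalized, which is exactly the behavior described above.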
We will use Logistic regression to predict gamers that will churn. Check this repo for the full data source, or get a small version of the data from my own repo, in the datasets folder. Check as well this nice video for a full explanation of the data.
To be honest about the code, the data pipeline below was developed during a course I attended at the Netherlands Institute for Neuroscience in 2019. I modified a few bits here and there and added some data features, but the general logic was defined by the group during the course.
As you will see, most of the functions are preparatory functions.
One important one is the add_churn_label
function, which is where the status of churner is defined. This is important, since it is the pattern that the algorithm will try to predict in unseen data.
The definition is a classic churn definition, with users having played in a certain time window, and having not logged into the game in a future time window.
Note that I focus on the implementation of the algorithm, so as usual in this series of posts, I won't be going into data exploration and complex feature engineering. I leave the fun of exploring that dataset, and increasing the accuracy of the model, to you.
import os
import pandas as pd
import numpy as np
import datetime as dt
from datetime import timedelta
import plotnine as p9

os.chdir('/home/julien/website/datasets')
df = pd.read_csv("wowah_data_small.csv")
def start_pipeline(dataf):
    return dataf.copy()

def fix_columns(dataf):
    dataf.columns = [c.lower().replace(" ", "") for c in dataf.columns]
    return dataf

def fix_types(dataf):
    return dataf.assign(timestamp=lambda d:
                        pd.to_datetime(d['timestamp'],
                                       format="%m/%d/%y %H:%M:%S"))

def add_ts_features(dataf):
    return (dataf
            .assign(hour = lambda d: d['timestamp'].dt.hour)
            .assign(day_of_week = lambda d: d['timestamp'].dt.dayofweek))

def add_month_played(dataf):
    return (dataf
            .assign(month_played = lambda d: d['timestamp'].dt.month_name()))

def following_timestamp(dataf):
    return (dataf
            .sort_values(by=['char', 'timestamp'])
            .assign(following_timestamp = lambda d: (d['timestamp'].shift()))
            .assign(following_char = lambda d: (d['char'].shift()))
            )

def played_time(dataf):
    return (dataf
            .assign(played_time = lambda d: (d['timestamp'] - d['following_timestamp'])))
def add_churn_label(dataf, before_period=("2008-01-01", "2008-03-01"),
                    after_period=("2008-04-01", "2008-06-01"), min_rows=10):
    # Activity in the "before" window ...
    before_df = (dataf
                 .loc[lambda d: d['timestamp'] >= pd.to_datetime(before_period[0])]
                 .loc[lambda d: d['timestamp'] < pd.to_datetime(before_period[1])])
    # ... and in the "after" window
    after_df = (dataf
                .loc[lambda d: d['timestamp'] >= pd.to_datetime(after_period[0])]
                .loc[lambda d: d['timestamp'] < pd.to_datetime(after_period[1])])
    # Players with enough activity (more than min_rows entries) in the before window
    before_chars = (before_df
                    .groupby("char")
                    .count()
                    .loc[lambda d: d['level'] > min_rows]
                    .reset_index()['char'])
    # Players seen at all in the after window
    after_chars = (after_df
                   .groupby("char")
                   .count()
                   .reset_index()['char'])
    # Churners: active in the before window but absent from the after window
    return (before_df
            .loc[lambda d: d['char'].isin(before_chars)]
            .assign(churned = lambda d: d['char'].isin(after_chars) == False))
df_clean = (df
            .pipe(start_pipeline)
            .pipe(fix_columns)
            .pipe(fix_types)
            .pipe(add_ts_features)
            .pipe(add_month_played)
            .pipe(following_timestamp)
            .pipe(played_time)
            .pipe(add_churn_label)
            )
The feature char
identifies the player in the dataframe, with one row per activity entry. Let's group by that column and compute a few relevant data features to feed the model:
ml_df = (df_clean
         .groupby(['char'])
         .apply(lambda d: pd.Series({
             "time_played": d.shape[0],  # number of logged rows, used as a proxy for time played
             "churned": d['churned'].any(),
             "mean_level": d['level'].mean(),
             "var_level": d['level'].var(),
             "guild_bool": d['guild'].any(),
             "max_level": d['level'].max().astype(float),
             "min_level": d['level'].min().astype(float)
         }))
         .assign(level_speed = lambda d: (d['max_level'] - d['min_level']) / d['time_played'])
         )
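Before fitting anything, it can be useful to check the size of the feature table and the class balance. A minimal sanity check, assuming ml_df was built as above:

# One row per player, plus the share of churners (class balance)
print(ml_df.shape)
print(ml_df['churned'].mean())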
OK! We are ready to fit a model to predict churn. This post is about Logistic Regression, but I will use another classification algorithm, the KNeighbors Classifier, as a comparison.
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklego.preprocessing import ColumnSelector
# Pre-processing pipeline
panda_grabber = Pipeline([
    ("union", FeatureUnion([
        ("continous", Pipeline([
            ("select", ColumnSelector(["max_level", "min_level"])),
            ("scale", StandardScaler())
        ])),
        ("discrete", Pipeline([
            ("select", ColumnSelector(["guild_bool"])),
            ("encode", OneHotEncoder(sparse=False))
        ]))
    ]))
])
# Main pipeline
pipeline = Pipeline([
    ("grab", panda_grabber),
    ("poly", PolynomialFeatures(interaction_only=True)),
    # ("scale", StandardScaler()),
    ("model", KNeighborsClassifier(10, weights='distance'))
])
Below, I define the different models to be tested.
Note that I will be using LBFGS as the optimization method. See here for more information. For gradient descent optimization, see the post on the Linear Regression method.
# Define polynomial feature options
param_poly = [PolynomialFeatures(interaction_only=True), None]

# Define models to test
param_model = [KNeighborsClassifier(1),
               KNeighborsClassifier(10),
               LogisticRegression(solver='lbfgs')]

# Define grid search
mod = GridSearchCV(pipeline,
                   # iid=True,
                   return_train_score=True,
                   param_grid={"model": param_model,  # this overwrites the model declared in the pipeline
                               "poly": param_poly},
                   cv=10)
Let's fit the different models:
mod.fit(ml_df, ml_df['churned']);

# Look at the cross-validation results, including which model (and number of neighbors) was used
pd.DataFrame(mod.cv_results_).T
pd.DataFrame(mod.cv_results_)[['param_model', 'mean_train_score', 'mean_test_score', 'std_test_score']]
As we can see, the Logistic Regression shows a nice score on the test set. It is not much better than the KNN classifier, but in my opinion it is quite intuitive to explain to non-technical stakeholders and might therefore offer better explainability.
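For instance, here is a minimal sketch of how one could inspect the fitted coefficients, assuming the grid search above selected the Logistic Regression as its best estimator:

# Best pipeline found by the grid search (assumes it is the LogisticRegression one)
best = mod.best_estimator_
model = best.named_steps["model"]
if isinstance(model, LogisticRegression):
    print(model.coef_.ravel())          # change in log-odds per (scaled) feature
    print(np.exp(model.coef_.ravel()))  # odds ratios, easier to communicate

Each coefficient corresponds to one of the features produced by the preprocessing steps, which makes the direction and size of their effects relatively straightforward to discuss.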
That was it for Logistic Regression. Hope you enjoyed it, and drop me a mail if you have any comments, questions or (constructive) criticism.