By Julien Hernandez Lallement, 2020-09-15, in category Course
This post is part of a series of post where I discuss different typical ML algorithms. Here, we will be discussing the relatively simple K Nearest Neighbours (KNN) algorithm, focusing on classification use.
KNN is a supervised learning technique, which means that it uses labelled data. The data is represented by vector of features, such as shape, length, and color. The dimension of the space in which these data points evolve is the number of features you are dealing with. In other words, each data point is represented by a vector of $X$ features.
Now say you get a new data point, and you want to know which shape, length or color it has based on its location in that space.
What KNN does is very simple from a mathematical perspective: it measures the distances between the new data point and the nearest K data points in the labelled dataset. Based on that, it will conclude to which class our new data point belongs by majority. See here for reference.
As mentioned before, KNN could be used for both regression and classification problems.
With KNN, you could do the following:
As introduced before, a new data point is classified based on a majority vote of its neighbors, with the data point being categorized as the class most common amongst its $K$ nearest neighbors measured by a distance function. If $K$ = 1, the data point is simply assigned to the class of its nearest neighbor.
Different distance functions can be used, and they may provide slightly different results:
It should be noted that the above distance functions are only valid for continuous variables. If you are working with categorical variables, then you should use the Hamming distance instead, but I will not be covering it in this post.
Moreover, there are other distances, such as the Chebyshev distance (absolute value of the differences between all the entries in two vectors AND the maximum of these values), which could be useful for multi-dimenstional space. I have not yet tried such computations in a KNN, but maybe you find the time to try it out ;)
Finally, note that all three distance functions can be easily implemented using the KNN from SciKit Learn (see below).
For very large datasets, it can become extremely computionally demanding to run a KNN. Other search algorithms sacrifice an exhaustive search for a faster run time. Structures such as KDTrees (see https://en.wikipedia.org/wiki/K-d_tree) or Ball trees (see https://en.wikipedia.org/wiki/Ball_tree) are used for faster run times. Have a look at them to see how they can optimize your process.
This is the same idea as a $K$ nearest neighbors classifier, but instead of finding the $K$ nearest neighbors, it will look at all the neighbors within a given radius. Decision on radius value should be taken after closely looking at the data distribution.
To use the same logic on a regression problem (as opposed to classification as it is now), we simply find the weighted average of the $K$ nearest neighbors. Instead of taking the majority class, we calculate a weighted average of these nearest values, using the same weighting method as above.
Let's have a look at a few use cases.
I will first be using the typical height & weight dataset, given for each gender. You can find the dataset on Kaggle, or in my repository in the dataset folder, "datasets_26073_33239_weight-height.csv".
In a follow up example, I will be using a Heroes dataset to show a more refined approach to run the KNN and compare its accuracy with another classification model.
import pandas as pd
import plotnine as p9
df = pd.read_csv('datasets_26073_33239_weight-height.csv'
)
df.sample(3)
df.groupby('Gender').count()
We have the same number of observations for each category. That is good, but not happening in real world situations. If you data is skewed (and it most probably will be), that is, if you have imbalanced data, you might have an issue.
Since the classification as one category is based on the majority among the K neighboring data points, if say "Male" are overrepresented, then KNN would classify the new data point as "Male" with higher likelihood. However, if the classes would be balanced, you might actually see that you data point would be classified as "Female".
One of the most common approach would be to weigh the number of K nearest data points by the inverse of their distance. As noted in the SciKit Learn documentation, "in this case, closer neighbors of a query point will have a greater influence than neighbors which are further away".
This is thus an option you can choose when implementing KNN using Sklearn.
Let's first look at the distributions
from hide import toggle_code as show_solution
show_solution()
def plot_data(data, data_to_pred, pred):
if pred:
# Plot the data
fig = (p9.ggplot()
+ p9.geom_point(data=data,
mapping=p9.aes(x='Height', y='Weight', color='Gender'),
alpha=0.1)
+ p9.geom_point(data = data_to_pred,
mapping=p9.aes(x='Height', y='Weight'),
shape='o',
color='k'
)
)
else:
# Plot the data
fig = (p9.ggplot()
+ p9.geom_point(data=data,
mapping=p9.aes(x='Height', y='Weight', color='Gender'),
alpha=0.1)
)
return fig
plot_data(data=df,
data_to_pred = pd.DataFrame([[65,150], [75,225]], columns=['Height', 'Weight']),
pred=False)
Good. We have relatively clear clusters. That is seldom the case in real world situations :D
Say you now get a new meaure from height and weight, but the label is missing. How can you determine which class it belongs to (example with the black dots below).
plot_data(data=df,
data_to_pred = pd.DataFrame([[65,150], [75,225]], columns=['Height', 'Weight']),
pred=True)
I will scale the data since the two variables are relatively different in scale. Because of the obvious positive correlation between the two variables, I could also have done a PCA which would have spread the data. But since I am focusing on the KNN, I will simply scale using z-scores.
from sklearn.preprocessing import StandardScaler
# Load scaler
scaler = StandardScaler()
# Fit Scaler
scaler.fit(df[['Height', 'Weight']])
# Apply to values
scaled_values = scaler.transform(df[['Height', 'Weight']])
# Prepare scaled data
df_scaled = pd.concat([df.Gender, pd.DataFrame(scaled_values)], axis=1).rename(columns={0:'Height', 1:'Weight'})
# Prepare data to predict
test_dat = pd.DataFrame([[-1,-1], [2,2]], columns=['Height', 'Weight'])
plot_data(data=df_scaled,
data_to_pred = test_dat,
pred=True)
Good, our data is now scaled around 0 for each feature.
As you can see from above, we have two black data points to classify. From their location, the first one is most likely a female, and the second one almost for sure a male.
Let's verify that
from sklearn.neighbors import KNeighborsClassifier,NeighborhoodComponentsAnalysis
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
import numpy as np
There are many parameters you can adapt in KNN. Check this for more detailed description.
I am only defining the value K (number of neighbors to take into account), the weight (uniform cause we do not have an unbalanced dataset) and the distance function (here the Euclidian equation).
# Load the classifier
mod = KNeighborsClassifier(n_neighbors=5, # K value
weights='uniform', # Weights to be used, in case of imbalanced datasets
p = 2 # Distance function, here Euclidian distance
)
I then fit the model
# Fit
mod.fit(df_scaled[['Height', 'Weight']], df_scaled.Gender)
I can finally predict the category for each of these two data points
print(mod.predict(test_dat))
We can see that as expected from our eye-balled prediction, these two data points fall in different categories. That was relatively easy, but for more complex data sets, that contain more data features or unclearer clusters, it might be useful to use that a general rule to cluster unseen data.
We have seen how we can simply cluster data based on previously existing labelled data. Good.
However, can we get an estimation of where the separation would lie?
To do so, we can use a Neighborhood Component Analysis (NCA), which is a supervised machine learning algorithm based on linear transformation.
As noted in the SciKit Learn documentation, "NCA is attractive for classification because it can naturally handle multi-class problems without any increase in the model size, and does not introduce additional parameters that require fine-tuning by the user."
Since the KNN can create irregular decision boundaries (if the data is highly irregular itself), NCA can be useful because it does not make any assumptions about the way each category is distributed.
What is nice about the NCA is that it can help to detect classes that are unlikely to encompass unlabelled data, although their relative euclidian distance might suggest they do.
FYI, mathematically, NCA is based on a (squared) Mahalanobis distance metric, which is the distance between two points in multivariate space. Think of measuring distance between more than 2 variables, which becomes not impossible with a simple ruler.
\begin{equation} \left (d(Mahalanobis \right) = \left (((x_i - x_j)T * C - 1 * (x_i - x_j))^{0.5} \right) \end{equation}Where $x_i$ and $x_j$ is a pair of data points, and $C$ is the sample covariance matrix.
Let's use an NCA combined with a KNN, and visualize the added value of running that combination:
# Get the two features
features = df_scaled[['Height', 'Weight']].values
# Isolate the target
target = df_scaled.Gender == 'Female'
from hide import toggle_code as show_solution
show_solution()
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (16, 9)
from matplotlib.colors import ListedColormap
from numpy import sqrt
from sklearn.metrics import mean_squared_error
I am using an elbow method to determine the optimal number of K. It will correspond to the lowest error rate
# Split data
X_train, X_test, y_train, y_test = train_test_split(features, target, stratify=target, test_size=0.7, random_state=42)
error_rates = []
for K in range(50):
K=K+1
# Build model
model = KNeighborsClassifier(n_neighbors = K)
# Fit model
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
# Store error rate
error_rates.append(np.mean(predictions != y_test))
print("Optimal number of K = " + str(np.argmin(error_rates)))
plt.plot(error_rates)
Looks like K=11 is a good number of neighbors to classify new data points in this dataset.
Let's move on an run different versions of the KNN using this K value
from hide import toggle_code as show_solution
show_solution()
n_neighbors = 11
# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
# Name for final plots
names = ['KNN', 'NCA, KNN Euclidian', 'NCA, KNN Manhattan']
# Prepare two pipelines, one using the NCA, and one without it.
classifiers = [Pipeline([('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=n_neighbors,p = 2))
]),
Pipeline([('scaler', StandardScaler()),
('nca', NeighborhoodComponentsAnalysis()),
('knn', KNeighborsClassifier(n_neighbors=n_neighbors,p = 2))
]),
Pipeline([('scaler', StandardScaler()),
('nca', NeighborhoodComponentsAnalysis()),
('knn', KNeighborsClassifier(n_neighbors=n_neighbors,p = 1))
])
]
# Fetch the min & max from the data features, and expand by one for plot limits
x_min, x_max = features[:, 0].min() - 1, features[:, 0].max() + 1
y_min, y_max = features[:, 1].min() - 1, features[:, 1].max() + 1
# Step size in the mesh
h = .01
# Prepare meshgrid
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
from hide import toggle_code as show_solution
show_solution()
for name, clf in zip(names, classifiers):
# Fit on the train data
clf.fit(X_train, y_train)
# Get score using the test data for final display on plot
score = clf.score(X_test, y_test)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light, alpha=.8)
# Plot also the training and testing points
plt.scatter(features[:, 0], features[:, 1], c=target, cmap=cmap_bold, edgecolor='k', s=20, alpha=0.2)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("{} (k = {})".format(name, n_neighbors))
plt.text(0.9, 0.1, '{:.2f}'.format(score), size=15,
ha='center', va='center', transform=plt.gca().transAxes)
plt.show()
As one can see from the plots above, using the NCA creates decision boundaries that are not bound to the irregularities in the data. We additionally observe some kind of grey area where data the background color oscillates between the two categories
The accucary score as been very slightly improved. In this case, we might not want to overcomplexify things by adding a NCA, but in other situations, it might be a useful approach.
I was curious as to whether using a different distance function would yield very different results but that does not seem to be the case
Some time ago, I attented a class where we worked with an interesting dataset containing information about different character classes in a famous game ;)
Let's try to see whether we can classify data points using KNN. In order to compare its efficiency, I will enter a Logistic Regression in the pipeline. I won't be discussing what a Logistic Regression is here, that will be for a separate post.
from sklego.datasets import load_heroes
df = load_heroes(as_frame=True)
df.head(2)
Let's combine some features, such as attack and attack speed which could be summarized under a "damage" feature.
def make_ml_data(dataf):
return (dataf
.dropna() # Drop missing values
.assign(dmg=lambda d: d['attack'] * d['attack_spd']) # Create damage feature
[['health', 'dmg', 'attack_type', 'role']] # Select features
.assign(health = lambda d: d['health'].astype(float))) # Transform to float for model fit
ml_df = df.pipe(make_ml_data)
ml_df.head(2)
Let's visualize the data before we move to modelling
(p9.ggplot() +
p9.geom_point(data=ml_df, mapping=p9.aes("health", "dmg", shape="attack_type", color="attack_type")))
Looks like there is a relatively clear pattern in the data, with Melee types having higher health and damage scores.
Now, let's try to fit a KNN and Logistic Regression (for comparison) to the data
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklego.preprocessing import ColumnSelector
I will be using the Pipeline method from Sklearn to create a pipeline that will treat the features in the dataset at once.
The continous features (health and dmg) will be scaled for comparison.
The categorical feature (role) will be one hot encoded.
Below, the data preparation pipeline
panda_grabber = Pipeline([
("union", FeatureUnion([
("continous", Pipeline([
("select", ColumnSelector(["health", "dmg"])),
("scale", StandardScaler())
])),
("categorical", Pipeline([
("select", ColumnSelector(["role"])),
("encode", OneHotEncoder(sparse=False))
]))
]))
])
One thing that one can control for is non linear interacitons between features in the data, which can be unexpected and difficult to predict.
One useful approach to reveal such interactions is to create new features that enable you to use them and check whether they improve model performance. This is quite a useful approach since it allows you to use simple model while still capturing complex interaction patterns, since these interactions are placed direclty in the data as features. One potential down side is the addition of a large number of features to the model, so a trade off should be found there
Such features are typically referred to as polynomial
features.
Below, I am adding this step in the ML pipeline, and selecting only cases for which the features have significant interactions. For the ones not versed in statistical terms, interactions happen when the effect of one variable is dependent on another variable in the dataset.
pipeline = Pipeline([
("grab", panda_grabber),
("poly", PolynomialFeatures(interaction_only=True)),
("model", KNeighborsClassifier(10))
])
Good. Now, we can fit the model.
In order to tune the hyperparameters, I will be using GridSearch, which is a technique that tries to compute the optimum values of hyperparameters. Hyperparameters are parameters of the model that are not affected by the training. They are pre defined by the user, and should therefore be carefully chosen.
# Set a Polynomial Features and no parameters
param_poly = [PolynomialFeatures(interaction_only=True),
None]
# test different K for KNN and one Logistic Regression for comparison
param_model = [KNeighborsClassifier(1),
KNeighborsClassifier(10),
KNeighborsClassifier(20),
LogisticRegression(solver='lbfgs')]
Above, I have defined the different parameters to be tested by GridSearch
Now, let's run the GridSearch on these parameters... Note that I use 4 cross validations splitting strategy, which will automatically run the train/test split for me a number of times.
mod = GridSearchCV(pipeline,
#iid=True,
return_train_score=True,
param_grid={"model": param_model,
"poly": param_poly},
cv=4)
...and fit the model to predict attack type.
mod.fit(ml_df, ml_df['attack_type']);
pd.DataFrame(mod.cv_results_)
As you can see above, we get one line per model and hyper parameter tested by the GridSearch (that is 4 models, times 2 parameters [PolynomialFeatures & None], so 8 lines in total)
We can see from the "rank_test_score" column that a KNN with 10 neighbors and interaction effect taken into account is the best model. However, the following line which is the same model without interaction effects is extremely close, and more simple, so I would tend to select this model over the first one, despite lower accuracy (expainability/interpretability has very high value as well ;) )
ml_df
Let's generate random data, drawn from a uniform distribution for both features "health" and "dmg", and one single role feature. The goal here is to simulate new data points, for which we will predict whether they would be Melee or Ranged attack types.
n = 10000
simulated_df = pd.DataFrame({
"health": np.random.uniform(ml_df.health.min()-20, ml_df.health.max()+20, n),
"dmg": np.random.uniform(ml_df.dmg.min()-5, ml_df.dmg.max()+5, n),
"role": "Bruiser"
})
Let's now compute the probability of being Melee
pltr = simulated_df.assign(pred = lambda d: mod.predict_proba(d)[:, 0])
Note that I am selecting the first column on the predict_proba
output, since the sum of both columns would equal to 1, as shown below:
mod.predict_proba(simulated_df)
pltr.sample(4)
We have 10.000 simulated entries in pltr
, for which we have predicted the probability of beeing classified as Melee.
Let's visualize that:
(p9.ggplot() +
p9.geom_point(data=pltr, mapping=p9.aes('health', 'dmg', color='pred')) +
p9.geom_point(data=ml_df, mapping=p9.aes('health', 'dmg', shape='attack_type'), size=3))
The black dots are the real data points, and the coloured dots are the simulated data, color coded by probability of being Melee.
We can see an interesting and consistent overlap with the real data, which confidence decreasing as we approach the zone where both attack types start to overlap, and reversing for Ranged attack type data points.
That's it for now regarding K Nearest Neighbors!
Keep posted for further posts on other ML algorithms!