course: (Stats Series) Introduction to Statistics

By Julien Hernandez Lallement, 2021-04-03, in category Course

python, statistics

First real touch with Scary Stats...

Welcome to my first attempt to teach and summarize a topic that I dreaded back in my early years of university: Statistics!

Now, since I am a statistics user, but not a real teacher, I will do my best to describe the concepts and equations behind statistical testing as well as I can. By the same token, that will provide a nice refresher for me ;)

I started using statistics for real when I started my PhD. Before that, I did not know much more than t-tests, mean and median, and maybe ANOVA, although I was not sure about the math behind it... My recommendation is thus to get a use case and try things out. You will run tests that are wrong, you will not respect assumptions, but at least you will get increasingly familiar with the outputs and the parameters that one can play with. Relatively fast, you will get familiar with the kind of tests you could or could not run on a given dataset, and feel more confident :)

First statistics use case

Say you obtain some IQ data of a camel population. Per individual, you have a score ranging between 0 and 100, and you observe that 75% of these camels are above 80! Not too bad for a camel, eh? That is quite striking because you know for a fact that dromedaries have an IQ around 30... (sorry to my dromedary readers, this is a joke example, of course).

Theory

The next thing you might try to do is to explain this data by coming up with a theory, i.e., a set of principles that explains a range of phenomena (and in the best case, that is backed up by empirical data ;) )

In our particular case, you might argue that camels have evolved in different environmental conditions than dromedaries, which have pressured them to develop certain cognitive abilities that allow them to outperform dromedaries in an IQ test.

Hypothesis

Given this theory, you can now generate hypotheses that will allow you to put your theory to the test. A hypothesis is an informed, theory- (and when possible data-) driven explanation for a given observation.

As an example, we could hypothesize that, if the theory were true, camels should outperform dromedaries in other tests that measure processes similar to IQ. Similarly, we could hypothesize that dromedaries' environment has a poorer food supply than the camels', which makes them spend more time scavenging for food and, in turn, did not allow for IQ development.

As you can see, hypotheses can be quite proximal or quite distal, meaning that some of them can feel quite far-fetched. While scientists typically prefer to build the puzzle bit by bit, with more conservative hypotheses, some breakthroughs have come from a far-away piece of the puzzle that confirmed or was in line with the theory.

Predictions

Now that you have a hypothesis, you can make a prediction about it. In our case, the prediction is that camels outperform dromedaries in IQ-related tests. By taking this subtle but instrumental step from hypothesis to prediction, we move from the conceptual world into an observable domain, where data comes in handy ;)

Data

Now, given our hypothesis, we get a second sample of camels & dromedaries (it is always good to resample from the overall population, but more about that later) and test them on IQ-related tests. If the camels outperform the dromedaries in these tests, then our hypothesis is confirmed; otherwise, it is not, of course :D

In turn, this provides more information to complete our theory. Say the camels do not outperform the dromedaries in these new tests. That is very interesting stuff! Why would they be particularly good at one test, and not others? One might want to look into the specifics of the tests, to try and pinpoint the actual processes camels are good at. In turn, you refine your theory and generate new hypotheses.

Collecting data

Defining variables

So you are about to start your data collection on camels & dromedaries, to test your hypothesis. In order to do so, you collect so-called variables. As the name suggests, variables are measures that can change between individuals and situations, such as people's height, weight, IQ, location, etc.

Typically, one would vary a particular variable and observe how the measurement changes according to these variations. In other words, we believe that some variables cause variations in a measurement or outcome variable.

The variables that we think cause a variation are called independent variables.
The variables that we think are affected by that variation are called dependent variables.
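To make this concrete, here is a minimal sketch using a toy, made-up table (the species and iq_score columns are purely illustrative): species plays the role of the independent variable, and iq_score the dependent variable.

In [ ]:
import pandas as pd

# Toy, made-up data for illustration only
measurements = pd.DataFrame({
    "species": ["camel", "camel", "dromedary", "dromedary"],  # independent variable
    "iq_score": [82, 85, 31, 28],                             # dependent variable
})

# We ask how the dependent variable varies as a function of the independent one
print(measurements.groupby("species")["iq_score"].mean())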

Type of data

Depending on the type of measurement you are performing, your data can be classified into different types. Let's go through them:

  • Categorical Variable: this data is made up of categories. You can have gender (men/women), hair color, species, etc. When only two options are possible, you would call this binary data. Otherwise, we typically talk about nominal data, if the categories are equivalent. In the last case, if your categories are somehow ordered (say you have different camel species, going from most furry to least furry), you would talk about ordinal data. This last one I have rarely encountered in my work, but depending on your field, you might stumble upon it often.
  • Continuous Variable: this data contains data points that can take any value on the scale over which it ranges. Here as well, you have different categories of continuous variables (interval, which contains continuous data separated by distances equivalent to the ones they represent, ratio, etc.), but I find that this level of detail is not very important to begin with.
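As a quick illustration of these types in pandas, here is a minimal sketch with made-up camel attributes (all column names and values are hypothetical):

In [ ]:
import pandas as pd

camels = pd.DataFrame({
    "species": ["camel", "dromedary", "camel"],    # nominal (categorical)
    "is_furry": [True, False, True],               # binary
    "furriness": ["low", "high", "medium"],        # ordinal
    "weight_kg": [480.5, 420.0, 510.2],            # continuous
})

# Declare the ordinal variable explicitly so pandas knows the ordering
camels["furriness"] = pd.Categorical(
    camels["furriness"], categories=["low", "medium", "high"], ordered=True
)
print(camels.dtypes)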
Designing a data collection plan

If you are collecting data yourself, there are a couple of things you should pay attention to:

  • Sought link: you should know ahead of time whether you are looking for a correlational or a causal link between the variables. For example, in fMRI research, we know that more neural activation (blood flow, actually) goes hand in hand with observing a certain image. However, we cannot disentangle what caused what exactly. In order to do so, we would have to design an experiment allowing for causal inference.
  • Design type: are you working with a between-subject or a within-subject design? That will define the types of statistics you are allowed to run on the data. As suggested by the names, between-subject designs feature separate groups of individuals that undergo different treatments, while within-subject designs feature individuals that all undergo all conditions. Each design has advantages and drawbacks, and should be chosen carefully.
  • Randomization: one nightmare of researchers is noise. Noise refers to unsystematic variation in the data that is not due to the treatment or effect that one is applying or seeking. Keeping this variation to a minimum will ensure more sensitive measures during data collection. For example, randomization should be applied to avoid order effects (you have two conditions, one following the other; participants might be more tired in the second condition, so you should randomize the starting condition across participants).
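As an example of that last point, here is a minimal sketch of randomizing the starting condition across (hypothetical) participants with numpy:

In [ ]:
import numpy as np

rng = np.random.default_rng(seed=42)  # seed only to make the example reproducible

participants = [f"participant_{i}" for i in range(8)]  # hypothetical participants
conditions = ["A", "B"]

# Randomly pick which condition each participant starts with, so that
# order effects (fatigue, learning) do not pile up on a single condition
starting_condition = rng.choice(conditions, size=len(participants))
for participant, condition in zip(participants, starting_condition):
    print(participant, "starts with condition", condition)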

Data Quality

This aspect is often disregarded in day-to-day work, unless you work in research ;) How can you be sure that your data actually reflects some truth about the process you are interested in?

If you are a researcher who spends many hours in dark rooms collecting data, then you will do your best to ensure your data is correct and reliable, because you will publish it and build your career upon it, and you want it to be accurate and useful. Moreover, you will have collected the data yourself, so you will know its pitfalls and advantages.

However, if you work in other sectors, data is handed over to you in different forms, and there might be (most likely there are) errors in the measurements made when collecting that data. Say someone gives you the results of a camel IQ test. Maybe you do not know which test was used, which in turn might explain why another person with a separate dataset on the same topic gets different results. Or maybe someone used camels from a different area, and this was again not reported.

Measurement errors can and will happen. Scientists try to minimize them by reporting their Method sections as thoroughly as possible, to make sure all parameters and conditions are clear to everyone who reads the results. They also try to minimize them by making their results as valid as possible (by replicating the results a few times, but see here) and as reliable as possible (by showing that the effect is or is not present under different conditions).

Data Analysis

Let's use a famous example (the Boston Housing dataset) to visualize some data, and see how preliminary visualizations and so-called descriptive statistics can already tell you quite a bit about data trends.
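The notebook simply displays a DataFrame called df; the loading step is not shown here, but a minimal sketch could look like the following (assuming an older scikit-learn, < 1.2, where load_boston was still available):

In [ ]:
import pandas as pd
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2+

boston = load_boston()
df = pd.DataFrame(boston.data, columns=[name.lower() for name in boston.feature_names])
df["medv"] = boston.target  # median home value, used as the last column below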

In [3]:
df
Out[3]:
crim zn indus chas nox rm age dis rad tax ptratio b lstat medv
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 0.06263 0.0 11.93 0 0.573 6.593 69.1 2.4786 1 273 21.0 391.99 9.67 22.4
502 0.04527 0.0 11.93 0 0.573 6.120 76.7 2.2875 1 273 21.0 396.90 9.08 20.6
503 0.06076 0.0 11.93 0 0.573 6.976 91.0 2.1675 1 273 21.0 396.90 5.64 23.9
504 0.10959 0.0 11.93 0 0.573 6.794 89.3 2.3889 1 273 21.0 393.45 6.48 22.0
505 0.04741 0.0 11.93 0 0.573 6.030 80.8 2.5050 1 273 21.0 396.90 7.88 11.9

506 rows × 14 columns

Frequency distribution

Let's start by plotting a frequency distribution. That is almost always the first thing I do when getting new data, and in my opinion, nothing is more informative than a histogram. It plots the values of observations on the x-axis, and the count of observations in each bin (whose number you can adapt) is displayed as a bar.

In [35]:
ax = df.rm.plot(kind='hist', bins=20)
ax.set_ylabel('count rm')
Out[35]:
Text(0, 0.5, 'count rm')

This visualization already tells you a lot about how the rm variable is distributed. You can see that most values are between 5 and 7, and that some data points have more extreme values, either above or below these central values.

What you see here is called a normal distribution, also known as a Gaussian distribution. It is characterized by a so-called bell-shaped curve, which implies, as written above, that the majority of data points lie around the center of the distribution. There is a series of statistical tests that you can use to confirm that a distribution follows normality. I will talk about these in a follow-up post, since I do not want to put any math in this introduction.

In some cases, the distribution takes on peculiar aspects, which can be due to a lack of symmetry (called skewness) or a "pointiness" (called kurtosis).

In [24]:
ax = df.dis.plot(kind='hist', bins=10)
ax.set_ylabel('count dis')
Out[24]:
Text(17.200000000000003, 0.5, 'count dis')

Above, a positively skewed distribution (a negatively skewed one would be its mirror image).

In [38]:
ax = df.rm.plot(kind='hist', bins=10)
ax.set_ylabel('count rm')
Out[38]:
Text(0, 0.5, 'count rm')

Above, a distribution with a positive kurtosis. Distributions with negative kurtosis are characterized by a flatter shape.
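If you want numbers rather than a visual impression, pandas can estimate both quantities directly; a quick sketch (the exact values depend on the estimators pandas uses):

In [ ]:
# Skewness: ~0 for a symmetric distribution, > 0 for a positive (right) skew
print("skewness of dis:", df.dis.skew())
print("skewness of rm:", df.rm.skew())

# Excess kurtosis: ~0 for a normal distribution, > 0 for a pointier one
print("kurtosis of rm:", df.rm.kurt())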

Here again, a quick note on the histogram as maybe the most powerful tool of any data person. Below, you can see how the data looks relatively normal.

In [42]:
ax = df.medv.plot(kind='hist', bins=5)
ax.set_ylabel('count medv')
Out[42]:
Text(0, 0.5, 'count medv')

Now let's increase the number of bins to observe more fine-grained dynamics.

In [43]:
ax = df.medv.plot(kind='hist', bins=100)
ax.set_ylabel('count medv')
Out[43]:
Text(0, 0.5, 'count medv')

Same distribution, but we can now see that many data points lie at the highest possible value, 50. Why is that? This would have been missed with the first histogram, which emphasizes the need to take your time when exploring data.

Calculating the center of distributions

One might want to come up with a single value to summarize a distribution, which can be useful for comparisons or reporting. While the spread of a distribution is also an important feature (more on that below), let's focus here on how to summarize its center:

  • Mode: The mode is the value that occurs the most frequently in the distribution, so basically, the tallest bar.
  • Median: The median is the middle value in the distribution when values are ranked in order of magnitude. I like to use this statistic when the distribution has a large spread, or more importantly when many outliers might be present. The median, contrary to the mean, is resistant to outliers.
  • Mean: The mean is the sum of all values $x_i$ divided by the number of observations $n$

$\overline{x} = \frac{\sum x_i}{n}$
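As a quick sketch, these three summaries are one-liners in pandas, here applied to the rm column of the DataFrame above:

In [ ]:
print("mode:", df.rm.mode().iloc[0])  # most frequent value (mode() can return several)
print("median:", df.rm.median())      # middle value of the sorted data
print("mean:", df.rm.mean())          # sum of the values divided by their count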

Calculating the spread of distributions

While the mode, median, and mean are useful to summarize distributions, it is also useful to quantify their dispersion. This is very important, since two distributions might have the same mean but very different spreads. In turn, this affects the conclusions you can draw statistically.

Again, there are several methods that one could use:

  • Range: subtract the smallest value from the largest one. You can already see that, since it takes only two values of the distribution into account, it is extremely affected by outliers in the data and might not be the right tool in most cases. You would probably have to complement the analysis with some measure of the distribution's shape (such as the interquartile range).
  • Sum of Squares: the sum of squares is probably the most widely (if indirectly) used spread calculation. Before we get to its rationale, let's first talk about the deviance, which is the sum of the differences between each value in the distribution and the mean of the distribution. In mathematical terms: $dev = \sum(x_i - \overline{x})$. You can already see how this can be problematic: since the mean automatically places some values below it and some above it, you will be adding values of opposite signs and cancel out important information in your data. As a result, people tend to square the deviations (you could also simply drop the negative signs, mathematically speaking, but squaring is the standard method). This is known as the Sum of Squares: $SS = \sum(x_i - \overline{x})^2$
  • Variance: the problem with the SS is that it measures the total dispersion of the data, which we cannot use to compare across samples that differ in size. As a result, people tend to compute the variance, which follows the same logic as the mean, by dividing the SS by the number of observations (minus 1; see here for an explanation related to degrees of freedom, which I won't be discussing here). The variance can thus be computed as follows: $Var = \frac{\sum(x_i - \overline{x})^2}{N-1}$
  • Standard Deviation: finally, the measure that can be used to make meaningful comparisons between samples. As you can see from the SS method, we circumvented deviations of opposite signs, but that came at a cost: our unit of measurement is squared (because we squared every deviation). To obtain a measure that is conceptually easier to grasp and compare, we can take the square root of the variance to obtain the standard deviation: $SD = \sqrt{\frac{\sum(x_i - \overline{x})^2}{N-1}}$

The standard deviation (SD) is probably the measure of spread that you will be most familiar with. A low SD represents a distribution where the data points are close to one another; a high SD indicates a large spread in the data. An SD of 0 represents a "distribution" where all values are the same.
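Here is a minimal sketch computing these spread measures by hand on the rm column, and checking the last two against pandas' built-ins (which also use the N-1 denominator by default):

In [ ]:
x = df.rm
n = len(x)

value_range = x.max() - x.min()        # range: largest minus smallest value
ss = ((x - x.mean()) ** 2).sum()       # sum of squares
variance = ss / (n - 1)                # variance, with the N-1 denominator
sd = variance ** 0.5                   # standard deviation

print(value_range, ss, variance, sd)
print(x.var(), x.std())                # pandas equivalents, for comparison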

Inferring probabilities from frequency distributions

We have talked about histograms before. I would like to quickly mention a useful concept: the Probability Density Function (PDF). I want to mention it here because many statistical tests that use t (t-tests), F (ANOVA), or other values have underlying PDFs, which provide much of the machinery behind those particular tests. It is therefore important to understand at least the basics of PDFs.

A probability density function can be thought of as an idealized version of a given distribution, where the little irregularities of the histogram have been smoothed out. As with a histogram, the area under the curve tells you something about the probability of occurrence of a given event.

In [45]:
ax = df.rm.plot(kind='kde')
ax.set_ylabel('density rm')
Out[45]:
Text(0, 0.5, 'density rm')

The plot above uses a Kernel Density Estimation (KDE) to display the distribution of the variable rm. That looks very much like the PDF of a normal distribution, by the way ;)

There are other famous distributions that we will go through in following posts, such as the t-distribution, the F-distribution, or the Chi-squared distribution.

The important thing with PDFs is that for each distribution, an underlying equation allows us to derive the probability of occurrence of a given event.
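For the normal distribution, for example, that equation is available in scipy.stats, so you can query probabilities directly instead of reading them off a plot. A minimal sketch, fitting a normal curve to rm through its sample mean and SD:

In [ ]:
from scipy.stats import norm

mu, sigma = df.rm.mean(), df.rm.std()

# Density at a single point (the height of the curve, not a probability)
print(norm.pdf(6.5, loc=mu, scale=sigma))

# Probability of observing a value below 6.5 (area under the curve up to 6.5)
print(norm.cdf(6.5, loc=mu, scale=sigma))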

Z-scores

Distributions do not all have the same parameters, that is, the same mean and standard deviation. To simplify things, one can transform the data points underlying the distribution using a z-score transformation, which aims at obtaining a distribution with M = 0 and SD = 1.

That is done quite simply by subtracting the mean from each data point, and dividing by the standard deviation:

$z =\frac{X - M}{SD}$

In turn, that helps you easily determine the probability of occurrence of certain events.
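Applied to a whole column, the transformation is a one-liner; a quick sketch on rm (the resulting column should have a mean of about 0 and an SD of about 1):

In [ ]:
z_rm = (df.rm - df.rm.mean()) / df.rm.std()
print(z_rm.mean(), z_rm.std())  # approximately 0 and 1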

Say you work with a dataset representing the grades of students, with M = 12.3 & SD = 2.3. You want to know the probability that a student obtained a grade of 13 or lower.

In [8]:
z = (13 - 12.3) / 2.3
print(z)
0.30434782608695626

We can now look at a z-score table, which provides cumulative probabilities.
We obtained z ≈ 0.30, so we look at the row 0.3 and the column 0.00, and we obtain a probability of about 0.62. That is, there is a 62% probability that a given student obtained a score of 13 or lower.
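If you prefer not to read the table by hand, the same cumulative probability can be obtained from scipy; a quick sketch, which should agree with the ~0.62 read off the table:

In [ ]:
from scipy.stats import norm

z = (13 - 12.3) / 2.3
print(norm.cdf(z))  # probability of a grade of 13 or lower, ~0.62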

In [1]:
from IPython.display import Image
PATH = "/home/julien/website/content/images/"
Image(filename = PATH + "zscore_table.png", width=500, height=500)
Out[1]:

Similarly, one could answer related questions such as: what is the range between which 95% of the grades fall?

Since the area under the curve of the PDF is equal to 1, you can answer this question by asking which value of z cuts off 5% of the scores.

As some of you may already have noticed, we need to cut off both ends of the distribution; otherwise, we would be giving a direction to the answer. Note that doing so is allowed (a so-called one-tailed test, if you have a hypothesis about what should happen), but we will get back to that in later posts.

In [11]:
PATH = "/home/julien/website/content/images/"
Image(filename = PATH + "standard-normal-distribution-6.png", width=500, height=500)
Out[11]:

In the example above, we would have to find the z value that bounds the central 95% of the PDF, which turns out to be z = 1.96. You can find this value by looking at the z-score table and searching for the z value that corresponds to a cumulative probability of 97.5% (since we take out 2.5% on each side of the distribution to obtain a 5% cutoff).
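The same cutoff can be retrieved programmatically with the inverse of the CDF, and turned back into grades for our example; a quick sketch, assuming the grades are roughly normally distributed:

In [ ]:
from scipy.stats import norm

z_cut = norm.ppf(0.975)            # ~1.96, cuts off 2.5% in the upper tail
M, SD = 12.3, 2.3
lower, upper = M - z_cut * SD, M + z_cut * SD
print(z_cut, lower, upper)         # about 95% of grades fall between ~7.8 and ~16.8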

That was it for this quick intro to statistical terminology and 101 concepts. I will follow up with more posts where I will talk about models, parameters, intervals, and statistical testing.
Cheers!