course: Frequentist vs Bayesian: Round 1!

By Julien Hernandez Lallement, 2020-06-13, in category Course

bayesian, python, statistics

In [2]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import pandas as pd

import plotnine as p9

For many years, academics have been using so-called frequentist statistics to evaluate whether experimental manipulations have significant effects. Frequentist statistics is based on the concept of hypothesis testing, a mathematically grounded procedure that estimates how likely your results are to have been obtained by chance. If the results could easily have happened by chance, they are considered not highly replicable. Most of these "negative" results are disregarded in research, although there is tremendous added value in also knowing which manipulations do not have an effect.

In this post, I am exploring an alternative to frequentist hypothesis testing: Bayesian statistics. I will not go into a direct comparison between the two approaches; there is quite some reading out there if you are interested. Rather, I will explore how a Bayesian approach can benefit a frequentist one, and in which cases that could happen.

In the frequentist world, statistical tests typically output a p-value. We already discussed the limitations of p-values in another post, which you can read to get familiar with some concepts behind their computation. Briefly, if the p-value is significant (i.e., below an arbitrarily chosen threshold, called the alpha level, typically set at 0.05), it suggests that your results were most likely not obtained by chance.
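
As a minimal sketch of that decision rule (alpha = 0.05 below is simply the conventional choice, nothing specific to this post):

In [ ]:
# Hypothetical helper illustrating the frequentist decision rule
def is_significant(p_value, alpha=0.05):
    """Return True if the p-value falls below the chosen alpha level."""
    return p_value < alpha

print(is_significant(0.03))  # True: reject the null hypothesis
print(is_significant(0.30))  # False: fail to reject... but what does that mean?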

However, what if (and that happens a lot) your p-value is > 0.05? In the frequentist world, such p-values do not allow you to distinguish between an absence of evidence and evidence of absence of an effect.

Let that sink in for a little bit, because it is the crucial point here. In other words, frequentist statistics are pretty effective at quantifying the presence of an effect, but are quite poor at quantifying evidence for the absence of an effect. See here for literature.

The demonstration below is taken from some work performed at the Netherlands Institute for Neuroscience, back when I was working in neuroscience research. A very nice paper was recently published on this topic, which I encourage you to read. The code below is inspired by the paper's repository, written in R.

Say we generate a random sample of 1000 observations from a normal distribution with mean = 0.5 and standard deviation = 1.

In [3]:
np.random.seed(42)
mean = 0.5; sd=1; sample_size=1000
exp_distribution = np.random.normal(loc=mean, scale=sd, size=sample_size)
plt.hist(exp_distribution)
Out[3]:
[Histogram of the simulated sample, centered around 0.5]

That would be our experimental distribution, and we want to know whether it is significantly different from 0. We could run a one-sample t-test on that hypothesis (a t-test is okay here since the distribution looks very Gaussian, but you should in theory verify that the assumptions of parametric testing are fulfilled; let's assume they are).
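
If you wanted to check the normality assumption rather than assume it, one possibility (purely illustrative, not part of the original analysis) would be a Shapiro-Wilk test on the sample:

In [ ]:
# Illustrative normality check: the Shapiro-Wilk null hypothesis
# is that the data are normally distributed.
W, p_normality = stats.shapiro(exp_distribution)
print('Shapiro-Wilk W = ' + str(W))
print('p-value = ' + str(p_normality))
# A large p-value only means we found no evidence against normality,
# which, as discussed in this post, is not the same as proving normality.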

In [4]:
t, p = stats.ttest_1samp(a=exp_distribution, popmean=0)
print('t-value = ' + str(t))
print('p-value = ' + str(p))
t-value = 16.771297262472846
p-value = 8.210083235563045e-56

Quite a nice p-value that would make every PhD student's spine shiver with happiness ;)

Now let's try a distribution centered at 0, which should not be significantly different from 0.

In [5]:
np.random.seed(42)
mean = 0; sd=1; sample_size=1000
exp_distribution = np.random.normal(loc=mean, scale=sd, size=sample_size)
plt.hist(exp_distribution)

t, p = stats.ttest_1samp(a=exp_distribution, popmean=0)
print('t-value = ' + str(t))
print('p-value = ' + str(p))
t-value = 0.6243089585093194
p-value = 0.5325672436623025

Here, as expected, we have a distribution that does not significantly differ from 0. And here is where things get a bit tricky: frequentist statistics cannot really tell whether a p-value > 0.05 reflects an absence of evidence or evidence of absence, although that distinction is crucial if you want to completely rule out that an experimental manipulation has an effect.
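
To make this concrete, here is a small sketch using the same tools as above (the sample sizes below are hypothetical choices of mine, not taken from the original analysis): a tiny sample drawn from a population with a real effect and a large sample drawn from a population with no effect can both return a non-significant p-value, and the p-value alone does not tell you which situation you are in.

In [ ]:
# Case 1: a real effect (mean = 0.5), but only a handful of observations
small_sample_real_effect = np.random.normal(loc=0.5, scale=1, size=5)
# Case 2: no effect at all (mean = 0), with plenty of observations
large_sample_no_effect = np.random.normal(loc=0, scale=1, size=1000)

for label, sample in [('small sample, real effect', small_sample_real_effect),
                      ('large sample, no effect  ', large_sample_no_effect)]:
    t, p = stats.ttest_1samp(a=sample, popmean=0)
    print(label + ': p-value = ' + str(round(p, 3)))

# Run this a few times: the first case often (though not always) comes out
# non-significant because the sample is too small to detect the real effect,
# while the second case is non-significant because there truly is no effect.
# The p-value alone cannot tell you which of the two you are facing.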

Let's take a hypothetical situation:
You want to know whether a manipulation has an effect. It might be a novel marketing approach in your communication, an interference with biological activity, or a picture vs no picture in an email you are sending. You of course have a control group to compare your experimental group to.
When collecting your data, you could see different patterns:

  • (i) the two groups differ.
  • (ii) the two groups behave similarly.
  • (iii) you do not have enough observations to conclude (sample size too small)

While option (i) is evidence against the null hypothesis H0 (i.e., you have evidence that your manipulation had an effect), situations (ii) (= evidence for H0, i.e., evidence of absence) and (iii) (= no evidence, i.e., absence of evidence) cannot be disentangled using frequentist statistics. But maybe the Bayesian approach can add something to this story...
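
As a small teaser of what the Bayesian approach can add (my own rough sketch, not the method used in the follow-up post, which relies on JASP), a t-statistic can be turned into an approximate Bayes factor in favor of H0 using the BIC approximation (Wagenmakers, 2007). Unlike the p-value, this quantity separates "evidence for H0" (BF01 well above 1) from "not enough data" (BF01 close to 1):

In [ ]:
def approximate_bf01(x, popmean=0):
    """
    Rough Bayes factor in favor of H0 (mean == popmean) for a one-sample test,
    based on the BIC approximation. BF01 > 1 favors the null, BF01 < 1 favors
    the alternative. Only a sketch: JASP computes a proper (JZS) Bayes factor.
    """
    n = len(x)
    t, _ = stats.ttest_1samp(a=x, popmean=popmean)
    return np.sqrt(n) * (1 + t ** 2 / (n - 1)) ** (-n / 2)

# Large sample truly centered on 0: BF01 typically well above 1 (evidence of absence)
print(approximate_bf01(np.random.normal(loc=0, scale=1, size=1000)))
# Tiny sample: BF01 stays fairly close to 1 (inconclusive, absence of evidence)
print(approximate_bf01(np.random.normal(loc=0, scale=1, size=5)))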

Before we move on to the Bayesian framework, I would like to illustrate in which situations the frequentist approach might be limited.
Below, I plot how p-values behave depending on the effect size (i.e., the difference between your control distribution, here with mean = 0, and your experimental distribution) and the sample size (which determines whether you are in situation (ii) or (iii) when no difference is found).

In [6]:
def run_t_test(m, n, iterations):
    """
    Draws `iterations` samples of size `n` from a normal distribution with mean `m`
    (sd = 1), runs a one-sample t-test against 0 on each, and stores the p-values.
    """
    my_p = np.zeros(shape=[1, iterations])
    for i in range(0, iterations):
        x = np.random.normal(loc=m, scale=1, size=n)
        # Two-sided one-sample t-test against a population mean of 0
        t, p = stats.ttest_1samp(a=x, popmean=0)
        my_p[0, i] = p
    return my_p
In [7]:
# Defines parameters to be tested
sample_sizes = [5,8,10,15,20,40,80,100,200]
effect_sizes = [0, 0.5, 1, 2]
nSimulations = 1000
In [8]:
# Run the function to store all p-values in the array "my_pvalues"
my_pvalues = np.zeros((len(effect_sizes), len(sample_sizes),nSimulations))

for mi in range(0,len(effect_sizes)):
    for i in range(0, len(sample_sizes)):
        my_pvalues[mi,i,] = run_t_test(m=effect_sizes[mi], 
                                n=sample_sizes[i], 
                                iterations=nSimulations
                               )
        

I will quickly inspect part of the data as DataFrames (two effect sizes, at the first four sample sizes) to make sure that the p-values look correct.

In [9]:
# Visualize data as a sanity check
sample_labels = ['5', '8', '10', '15']  # first four sample sizes
df_es_0 = pd.DataFrame(columns=sample_labels, index=range(nSimulations))
df_es_2 = pd.DataFrame(columns=sample_labels, index=range(nSimulations))

# Extract data for effect size = 0 (first four sample sizes)
num = my_pvalues[0, :, :]
for col, val in zip(sample_labels, num):
    df_es_0[col] = val
print('p-values for effect size = 0')
print('Sample sizes:')
print(df_es_0.head(2))
print(' ');print(' ')
# Extract data for effect size = 2 (first four sample sizes)
num = my_pvalues[3, :, :]
for col, val in zip(sample_labels, num):
    df_es_2[col] = val
print('p-values for effect size = 2')
print('Sample sizes:')
print(df_es_2.head(2))
p-values for effect size = 0
Sample sizes:
          5         8        10        15
0  0.243322  0.062245  0.343170  0.344045
1  0.155613  0.482785  0.875222  0.152519
 
 
p-values for effect size = 2
Sample sizes:
          5         8        10            15
0  0.004052  0.010241  0.000067  1.003960e-08
1  0.001690  0.000086  0.000064  2.712946e-07

As you can see, the p-values drop as the effect size increases (since the experimental distribution gets further away from 0), and, when a real effect is present, they also drop as the sample size increases.
However, as the plots below will show, p-values often fail to reach significance at small sample sizes, even when the effect size is substantial.
Let's visualize that.

In the first plot, the test compares two distributions with equal means, that is, an effect size of 0.

We will count the number of p-values falling into different bins, defined below:

In [1]:
def get_data(loc):
    """Reformat the p-values for a given effect-size index into a long DataFrame with bins."""
    # Extract data for the chosen effect size
    curr_dat = pd.DataFrame(my_pvalues[loc, :, :])
    # Define index as sample sizes
    curr_dat.index = sample_sizes
    # Stack for reformatting
    curr_dat = curr_dat.stack()
    # Format data
    curr_dat = (pd.DataFrame(curr_dat)
                .reset_index()
                .drop(columns=['level_1'])
                .rename(columns={'level_0': 'sample_size', 0: 'p_values'})
                )

    bin_breaks = pd.IntervalIndex.from_tuples([(0,   0.01), (0.01, 0.05), (0.05, 0.1),
                                               (0.1, 0.2),  (0.2,  0.3),  (0.3,  0.4),
                                               (0.4, 0.5),  (0.5,  0.6),  (0.6,  0.7),
                                               (0.7, 0.8),  (0.8,  0.9),  (0.9,  1)])

    # Assign p-values to bins
    curr_dat = curr_dat.assign(bins=pd.cut(curr_dat['p_values'], bins=bin_breaks))
    # Flag significance, used to color the bar plot
    curr_dat = curr_dat.assign(significance=lambda d: d.p_values < 0.05)
    return curr_dat
In [13]:
# Define theme
my_custom_theme = p9.theme(axis_text_x = p9.element_text(color="grey", size=10,
                                                         angle=90, hjust=.5),
                           axis_text_y = p9.element_text(color="grey", size=10))
# Plot data
(p9.ggplot()
+ p9.geom_bar(data=get_data(0), 
              mapping=p9.aes(x='bins', fill='significance'),
              #fill='bins'
             )
+ p9.facet_wrap('sample_size')
+ my_custom_theme
+ p9.labels.ggtitle('Effect size = 0 (same mean as control)')
)
Out[13]:
[Bar plot: number of p-values per bin, one panel per sample size. Effect size = 0 (same mean as control)]

As we can see from the plot above, most of the p-values computed by the t-test are not significant when the experimental distribution has a mean of 0.
Let's see what happens if we use a distribution whose mean is 0.5 above the control.

In [14]:
# Plot data
(p9.ggplot()
+ p9.geom_bar(data=get_data(1), 
              mapping=p9.aes(x='bins', fill='significance'),
              #fill='bins'
             )
+ p9.facet_wrap('sample_size')
+ my_custom_theme
+ p9.labels.ggtitle('Effect size = 0.5')
)
Out[14]:
[Bar plot: number of p-values per bin, one panel per sample size. Effect size = 0.5]

Now, we clearly see that increasing the sample size dramatically increases the ability to detect the effect, while many p-values remain non-significant at low sample sizes.
Below, as expected, you can see that for highly different distributions, the number of significant p-values increases:

In [15]:
# Plot data
(p9.ggplot()
+ p9.geom_bar(data=get_data(3), 
              mapping=p9.aes(x='bins', fill='significance'),
              #fill='bins'
             )
+ p9.facet_wrap('sample_size')
+ my_custom_theme
+ p9.labels.ggtitle('Effect size = 2')
)
Out[15]:
[Bar plot: number of p-values per bin, one panel per sample size. Effect size = 2]

That was it for an illustrative example of how p-values are affected by sample and effect sizes.

In the following post, I will explore and show how Bayesian statistics (using JASP) can provide additional information when p-values are above the significance threshold, i.e., non-significant.

You can find this notebook in the following repo: https://github.com/juls-dotcom/bayes
