tutorial: Drop the camels...go for the rats. Permutation and real data

By Julien Hernandez Lallement, 2019-12-16, in category Tutorial

analysis, python, statistics

In a past post, I presented the background to run a permutation test using simulated data and a dummy experimental scenario.

Here, I will run the same analysis on actual data that you can find on this repository. Just as a proof of concept that the analysis can be quite helpful to understand dynamics of individuals participants / customers.

In [7]:
import pandas as pd
df = pd.read_excel('data.xlsx',sheet_name='Choice',header=1)
In [8]:
df
Out[8]:
Unnamed: 0 Unnamed: 1 Unnamed: 2 Trial 1 Trial 2 Trial 3 Trial 4 Trial 5 Trial 6 Trial 7 ... Trial 16.3 Trial 17.3 Trial 18.3 Trial 19.3 Trial 20.3 bsl shk1 shk2 shk3 total
0 NoHarm 1 NoHarm 2 2 2 2 2 2 2 ... 1 2 2 2 1 0 0 0 0 0.0
1 NoHarm 2 NoHarm 2 2 2 2 1 2 2 ... 2 1 1 2 2 0 0 0 0 0.0
2 NoHarm 3 NoHarm 2 2 2 2 2 2 2 ... 2 2 2 2 2 0 0 0 0 0.0
3 NoHarm 4 NoHarm 2 2 2 2 2 2 2 ... 2 1 2 1 2 0 0 0 0 0.0
4 NoHarm 5 NoHarm 2 2 1 1 2 2 2 ... 1 2 1 2 2 0 0 0 0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
73 1vs3Pellets 74 1vs3Pellets 2 2 2 2 2 2 1 ... 2 2 1 2 2 0 0 0 0 NaN
74 1vs3Pellets 75 1vs3Pellets 2 2 2 2 2 1 2 ... 2 2 2 2 2 0 0 0 0 NaN
75 1vs3Pellets 76 1vs3Pellets 2 2 2 2 2 2 2 ... 2 2 2 2 2 0 0 0 0 NaN
76 1vs3Pellets 77 1vs3Pellets 2 2 2 2 2 2 2 ... 2 2 2 2 2 0 0 0 0 NaN
77 1vs3Pellets 78 1vs3Pellets 1 2 2 2 2 2 2 ... 0 0 0 0 0 0 0 0 5 NaN

78 rows × 88 columns

This is the data that I published together with the paper, as it is now always the case.

The data regroups different groups of animals, together with their decisions in a binary choice scenario. If you are interested about the actual scenario, I invite you to read the paper.

For the proof of concept, I am going to focus on the NoHarm and ContingentHarm groups. As in the first post, individuals made choides in a baseline phase 20 times, and then were experimentally manipulated before making a series of 60 consecutive choices.

Data Prep

In [78]:
def format_cols(dataf):
    dataf.columns = [c.lower().replace(' ', '_') for c in dataf.columns]
    return dataf
In [79]:
df_clean = (df
.drop(columns={'Unnamed: 0', 'Unnamed: 1', 'bsl', 'shk1', 'shk2', 'shk3', 'total'})
.rename(columns={'Unnamed: 2': 'groups'})
.pipe(format_cols)
.loc[df['Unnamed: 2'].str.contains('NoHarm|ContingentHarm')]
#.set_index('groups')
.replace(2,0)
)
In [110]:
# Split the groups to run permutation separately. Not necessary, only for demonstration purposes
df_contingent = df_clean.loc[df_clean.groups=='ContingentHarm'].drop(columns={'groups'}).reset_index(drop=True).T.reset_index(drop=True).T
df_no_harm    = df_clean.loc[df_clean.groups=='NoHarm'].drop(columns={'groups'}).reset_index(drop=True).T.reset_index(drop=True).T

Run permutation

In [111]:
import numpy as np

I am going to use the same function as in the previous post.

In [144]:
def indiv_score(data, end_base, end_exp):
    """
    Calculate a normalized score for each participant
    Baseline phase is taken for the first 20 decisions
    Trials 21 to 60 are used as actual experimental choices
    """
    # Baseline is the first 20 choices, experimental is from choice 21 onwards
    score = ((data.loc[end_base:end_exp].mean() - data.loc[0:end_base-1].mean())
        / (data.loc[end_base:end_exp].mean() + data.loc[0:end_base-1].mean())
        )
    return score

def compute_indiv_score(data):
    """
    Compute score for all individuals in the dataset
    """
    # Pre Allocate
    score = pd.DataFrame(columns = ['score'])
    # Loop over individuals to calculate score for each one
    for i in range(0,len(data)):
        # Calculate score
        curr_score = indiv_score(data.loc[i,:])
        # Store score
        score.loc[i,'score'] = curr_score
    return score
In [168]:
def run_permutation(data, direct='two-sided', nReps=1000, print_output=False):
    """
    Run a permutation test.
    For each permutation, a score is calculated and store in an array.
    Once all permutations are performed for that given participants, the function computes the real score
    It then compares the real score with the confidence interval.
    
    The ouput is a datafram containing all important statistical information.
    """
    # PreAllocate significance
    output=pd.DataFrame(columns=['Participant', 'Real_Score', 'Lower_CI', 'Upper_CI', 'Significance'])

    for iParticipant in range(0,data.shape[0]):
        # Pre Allocate
        scores = pd.Series('float')
        # Start repetition Loop
        if print_output == True:
            print('Participant #' +str(iParticipant))
        output.loc[iParticipant, 'Participant'] = iParticipant
        for iRep in range(0,nReps):
            # Store initial choice distribution to compute real true score
            initial_dat = data.loc[iParticipant,:]
            # Create a copy
            curr_dat = initial_dat.copy()
            # Shuffle data
            np.random.shuffle(curr_dat)
            # Calculate score with shuffled data
            scores[iRep] = indiv_score(curr_dat,end_base=20, end_exp=80)
            
        # Sort scores to compute confidence interval
        scores = scores.sort_values().reset_index(drop=True)
        # Calculate confidence interval bounds, based on directed hypothesis
        if direct == 'two-sided':
            upper = scores.iloc[np.ceil(scores.shape[0]*0.95).astype(int)]
            lower = scores.iloc[np.ceil(scores.shape[0]*0.05).astype(int)]
        elif direct == 'one-sided':
            upper = scores.iloc[np.ceil(scores.shape[0]*0.975).astype(int)]
            lower = scores.iloc[np.ceil(scores.shape[0]*0.025).astype(int)]    

        output.loc[iParticipant, 'Lower_CI'] = lower
        output.loc[iParticipant, 'Upper_CI'] = upper
        if print_output == True:
            print ('CI = [' +str(np.round(lower,decimals=2)) + ' ; ' + str(np.round(upper,decimals=2)) + ']')
        # Calculate real score
        real_score = indiv_score(initial_dat,end_base=20, end_exp=80)
        output.loc[iParticipant, 'Real_Score'] = real_score
        if print_output == True:
            print('Real score = ' + str(np.round(real_score,decimals=2)))
        # Check whether score is outside CI bound
        if (real_score < upper) & (real_score > lower):
            output.loc[iParticipant, 'Significance'] =0
            if print_output == True:
                print('Not Significant')
        elif real_score >= upper:
            output.loc[iParticipant, 'Significance'] =1
            if print_output == True:
                print('Significantly above')
        else: output.loc[iParticipant, 'Significance']  = -1
        if print_output == True:
                print('Significantly below')
        if print_output == True:
            print('')
    return output
In [169]:
output_contingent = run_permutation(df_contingent, direct='two-sided', nReps=1000, print_output=False)
In [170]:
output_nocontingent = run_permutation(df_no_harm, direct='two-sided', nReps=100, print_output=False)
In [171]:
output_nocontingent.Significance.sum()
Out[171]:
5
In [172]:
output_contingent.Significance.sum()
Out[172]:
10

We can see that there were twice as much individuals that changed their behavior significantly in the contingent than in the no harm condition. As argued in the first post, that could be then used to look at other unrelated variables separately between the two groups. Check the paper if you want to know more about this story!

I hope that was helpful! You can find the data in this repo: https://github.com/juls-dotcom/permutation