By Julien Hernandez Lallement, 2019-12-16, in category Tutorial
In a past post, I presented the background to run a permutation test using simulated data and a dummy experimental scenario.
Here, I will run the same analysis on actual data that you can find in this repository, as a proof of concept that this approach can be quite helpful for understanding the dynamics of individual participants or customers.
import pandas as pd
df = pd.read_excel('data.xlsx',sheet_name='Choice',header=1)
df
This is the data that I published together with the paper, as is increasingly common practice.
The data contains different groups of animals, together with their decisions in a binary choice scenario. If you are interested in the actual scenario, I invite you to read the paper.
For this proof of concept, I am going to focus on the NoHarm and ContingentHarm groups. As in the first post, individuals made choices in a baseline phase of 20 trials, and were then experimentally manipulated before making a series of 60 consecutive choices.
def format_cols(dataf):
    """Lower-case the column names and replace spaces with underscores."""
    dataf.columns = [c.lower().replace(' ', '_') for c in dataf.columns]
    return dataf
df_clean = (df
            .drop(columns=['Unnamed: 0', 'Unnamed: 1', 'bsl', 'shk1', 'shk2', 'shk3', 'total'])
            .rename(columns={'Unnamed: 2': 'groups'})
            .pipe(format_cols)
            .loc[df['Unnamed: 2'].str.contains('NoHarm|ContingentHarm')]
            #.set_index('groups')
            # Recode choices labelled 2 as 0 so the data is binary (0/1)
            .replace(2, 0)
            )
# Split the groups to run the permutation separately. Not necessary, only for demonstration purposes
df_contingent = (df_clean.loc[df_clean.groups == 'ContingentHarm']
                 .drop(columns=['groups'])
                 .reset_index(drop=True)       # reset row labels
                 .T.reset_index(drop=True).T)  # reset column labels
df_no_harm = (df_clean.loc[df_clean.groups == 'NoHarm']
              .drop(columns=['groups'])
              .reset_index(drop=True)
              .T.reset_index(drop=True).T)
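The double transpose at the end of each chain is a compact way to reset the column labels as well as the row labels: `reset_index` only touches the index, so transposing, resetting, and transposing back applies it to the columns. A minimal sketch on a hypothetical toy frame (labels and values are made up for illustration):

```python
import pandas as pd

# Toy frame with non-default row and column labels, mimicking the
# state of a filtered slice of df_clean (values are assumptions)
toy = pd.DataFrame([[1, 0], [0, 1]],
                   index=[5, 9],
                   columns=['choice_3', 'choice_4'])

# Reset row labels, then (via the transposes) the column labels
flat = toy.reset_index(drop=True).T.reset_index(drop=True).T

print(flat.index.tolist())    # [0, 1]
print(flat.columns.tolist())  # [0, 1]
```

Both axes now carry a plain 0-based integer index, which is what the positional logic in the functions below expects.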
import numpy as np
I am going to use the same function as in the previous post.
def indiv_score(data, end_base, end_exp):
    """
    Calculate a normalized score for one participant.
    The baseline phase covers the first `end_base` decisions (here 20),
    and trials `end_base` + 1 to `end_exp` (here 21 to 80) are the actual
    experimental choices.
    """
    # The score contrasts the experimental mean with the baseline mean,
    # normalized by their sum, so it is bounded between -1 and 1
    score = ((data.loc[end_base:end_exp].mean() - data.loc[0:end_base - 1].mean())
             / (data.loc[end_base:end_exp].mean() + data.loc[0:end_base - 1].mean()))
    return score
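As a quick sanity check of the formula, here is the same computation written out on a hypothetical toy series (the choice values are made up): a baseline mean of 0.2 against an experimental mean of 0.8 should give (0.8 - 0.2) / (0.8 + 0.2) = 0.6.

```python
import pandas as pd

# Toy choice series: 20 baseline trials (mean 0.2) followed by
# 60 experimental trials (mean 0.8); values chosen for illustration
choices = pd.Series([1]*4 + [0]*16 + [1]*48 + [0]*12)

base_mean = choices.loc[0:19].mean()   # .loc is end-inclusive: labels 0..19
exp_mean = choices.loc[20:80].mean()   # labels 20..79 (80 is past the end)
score = (exp_mean - base_mean) / (exp_mean + base_mean)
print(round(score, 2))  # 0.6
```

Note that `.loc` slicing is inclusive of both endpoints, which is why the baseline slice stops at `end_base - 1` in the function above.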
def compute_indiv_score(data, end_base=20, end_exp=80):
    """
    Compute the score for all individuals in the dataset
    """
    # Pre-allocate
    score = pd.DataFrame(columns=['score'])
    # Loop over individuals to calculate the score for each one
    for i in range(len(data)):
        # Calculate and store the score (phase bounds were missing in the
        # original call and are now passed through)
        score.loc[i, 'score'] = indiv_score(data.loc[i, :], end_base, end_exp)
    return score
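For reference, the explicit loop above could also be written with pandas' row-wise `apply`. This is a sketch on toy random data, not the paper's dataset; the frame shape, seed, and the inline `row_score` helper are all assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Toy data: 3 individuals x 80 binary choices (values are random)
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.integers(0, 2, size=(3, 80)))

def row_score(row, end_base=20, end_exp=80):
    """Same normalized score as indiv_score, written for one row."""
    base = row.loc[0:end_base - 1].mean()
    exp = row.loc[end_base:end_exp].mean()
    return (exp - base) / (exp + base)

# apply over rows (axis=1) replaces the explicit for-loop
scores = data.apply(row_score, axis=1)
print(scores)
```

The result is a Series with one score per individual, each bounded between -1 and 1.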
def run_permutation(data, direct='two-sided', nReps=1000, print_output=False):
    """
    Run a permutation test.
    For each permutation, a score is calculated and stored in an array.
    Once all permutations are performed for a given participant, the function
    computes the real score and compares it with the confidence interval.
    The output is a dataframe containing all important statistical information.
    """
    # Pre-allocate the output
    output = pd.DataFrame(columns=['Participant', 'Real_Score', 'Lower_CI', 'Upper_CI', 'Significance'])
    for iParticipant in range(data.shape[0]):
        # Pre-allocate the permuted scores
        scores = pd.Series(dtype='float')
        if print_output:
            print('Participant #' + str(iParticipant))
        output.loc[iParticipant, 'Participant'] = iParticipant
        # Store the initial choice distribution to compute the real score later
        initial_dat = data.loc[iParticipant, :]
        # Start the repetition loop
        for iRep in range(nReps):
            # Shuffle a copy of the data
            curr_dat = initial_dat.copy()
            np.random.shuffle(curr_dat.values)
            # Calculate the score with the shuffled data
            scores[iRep] = indiv_score(curr_dat, end_base=20, end_exp=80)
        # Sort the scores to compute the confidence interval
        scores = scores.sort_values().reset_index(drop=True)
        # Calculate the confidence interval bounds, based on the directed
        # hypothesis: a two-sided test at alpha = 0.05 splits the 5% over both
        # tails (2.5% each), a one-sided test puts 5% in a single tail
        if direct == 'two-sided':
            upper = scores.iloc[np.ceil(scores.shape[0] * 0.975).astype(int)]
            lower = scores.iloc[np.ceil(scores.shape[0] * 0.025).astype(int)]
        elif direct == 'one-sided':
            upper = scores.iloc[np.ceil(scores.shape[0] * 0.95).astype(int)]
            lower = scores.iloc[np.ceil(scores.shape[0] * 0.05).astype(int)]
        output.loc[iParticipant, 'Lower_CI'] = lower
        output.loc[iParticipant, 'Upper_CI'] = upper
        if print_output:
            print('CI = [' + str(np.round(lower, decimals=2)) + ' ; ' + str(np.round(upper, decimals=2)) + ']')
        # Calculate the real score
        real_score = indiv_score(initial_dat, end_base=20, end_exp=80)
        output.loc[iParticipant, 'Real_Score'] = real_score
        if print_output:
            print('Real score = ' + str(np.round(real_score, decimals=2)))
        # Check whether the real score falls outside the CI bounds
        if lower < real_score < upper:
            output.loc[iParticipant, 'Significance'] = 0
            if print_output:
                print('Not Significant')
        elif real_score >= upper:
            output.loc[iParticipant, 'Significance'] = 1
            if print_output:
                print('Significantly above')
        else:
            output.loc[iParticipant, 'Significance'] = -1
            if print_output:
                print('Significantly below')
        if print_output:
            print('')
    return output
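The sort-then-index step for the confidence bounds can equivalently be done with `np.percentile`, which interpolates between order statistics instead of indexing into the sorted array. A minimal sketch on a stand-in null distribution (the normal scores and seed are assumptions, not the real permutation output):

```python
import numpy as np

# Stand-in for the vector of permuted scores of one participant
rng = np.random.default_rng(42)
scores = rng.normal(size=1000)

# Two-sided test at alpha = 0.05 -> 2.5th and 97.5th percentiles
lower, upper = np.percentile(scores, [2.5, 97.5])
print(lower < 0 < upper)  # True for this roughly symmetric null
```

For a one-sided test, the same call with `[5, 95]` gives the corresponding single-tail bounds.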
output_contingent = run_permutation(df_contingent, direct='two-sided', nReps=1000, print_output=False)
output_nocontingent = run_permutation(df_no_harm, direct='two-sided', nReps=1000, print_output=False)
output_nocontingent.Significance.sum()
output_contingent.Significance.sum()
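One caveat when summing the `Significance` column: the codes are signed (+1 above, -1 below, 0 not significant), so individuals that shift in opposite directions cancel out. Counting the non-zero entries avoids that. A sketch on a hypothetical significance column (values made up):

```python
import pandas as pd

# Toy significance codes: +1 above the CI, -1 below, 0 not significant
sig = pd.Series([1, 1, -1, 0, 0])

print(sig.sum())         # 1  (signed sum: +1 and -1 cancel)
print((sig != 0).sum())  # 3  (number of significant individuals)
```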
We can see that roughly twice as many individuals changed their behavior significantly in the ContingentHarm group as in the NoHarm group. As argued in the first post, this could then be used to compare other, unrelated variables between the two groups. Check the paper if you want to know more about this story!
I hope that was helpful! You can find the data in this repo: https://github.com/juls-dotcom/permutation