Comparison of Profile Models

We draw posterior samples from two profile models:

  1. The isolated profile model (reported)
  2. The profile subcomponent of the full model

This notebook compares those two models to see what impact full-model training has on user inference.

Setup

In [1]:
import os
from pathlib import Path
In [2]:
import pandas as pd
import numpy as np
from scipy.special import expit, logit
from scipy import stats
import seaborn as sns
import plotnine as pn
import matplotlib.pyplot as plt
from statsmodels.nonparametric.kde import KDEUnivariate
import zarr
from IPython.display import display, Markdown
In [3]:
import bookgender.datatools as dt
from bookgender.nbutils import *
In [4]:
fig_dir = init_figs('ProfileModelCompare')
using figure dir figures/ProfileModelCompare

Load Data

In [5]:
datasets = list(dt.datasets.keys())
datasets
Out[5]:
['AZ', 'BX-E', 'BX-I', 'GR-E', 'GR-I']
In [6]:
def load(ds, model):
    # open the zipped Zarr sample store read-only, and wrap it in a
    # 1 GiB LRU cache so repeated array reads don't re-hit the zip file
    _zf = zarr.ZipStore(f'data/{ds}/inference/{model}/samples.zarr', mode='r')
    _c = zarr.LRUStoreCache(_zf, 2**30)
    return zarr.group(_c)
In [7]:
p_samp = {}
f_samp = {}
for ds in datasets:
    p_samp[ds] = load(ds, 'profile')
    f_samp[ds] = load(ds, 'full')

Compare Summary Parameters

The primary parameters of interest for profiles are $\mu$ and $\sigma$ - the mean and standard deviation of the (log odds) proportions.
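
To make the scale concrete, here is a minimal sketch (illustration only, not part of the pipeline) of the proportion distribution these parameters imply, assuming user log-odds are drawn from $\mathrm{N}(\mu, \sigma)$ and mapped back with expit; the point estimates are rough AZ posterior means taken from the summaries below:

mu_hat, sigma_hat = -0.51, 1.88                   # approximate AZ posterior means
rng = np.random.default_rng(0)
log_odds = rng.normal(mu_hat, sigma_hat, 10000)   # draws on the log-odds scale
props = expit(log_odds)                           # back to the proportion scale
pd.Series(props).describe()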

In [8]:
p_mu = pd.DataFrame(dict((ds, p_samp[ds]['mu']) for ds in datasets))
p_mu.index.name = 'Sample'
p_mu.describe()
Out[8]:
AZ BX-E BX-I GR-E GR-I
count 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000
mean -0.514382 -0.399709 -0.443828 -0.273208 -0.254363
std 0.029801 0.020151 0.017970 0.022427 0.021380
min -0.618320 -0.470822 -0.512055 -0.366509 -0.349278
25% -0.534410 -0.413218 -0.455956 -0.288445 -0.268497
50% -0.514355 -0.399837 -0.443576 -0.273294 -0.254363
75% -0.494401 -0.386026 -0.431779 -0.258152 -0.239769
max -0.407880 -0.317007 -0.374692 -0.187482 -0.152170
In [9]:
f_mu = pd.DataFrame(dict((ds, f_samp[ds]['mu']) for ds in datasets))
f_mu.index.name = 'Sample'
f_mu.describe()
Out[9]:
AZ BX-E BX-I GR-E GR-I
count 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000
mean -0.422042 -0.386132 -0.434960 -0.267217 -0.244974
std 0.027193 0.020111 0.016926 0.022540 0.019699
min -0.529776 -0.465225 -0.499783 -0.352164 -0.315294
25% -0.439871 -0.399545 -0.446454 -0.282246 -0.258416
50% -0.421924 -0.386269 -0.435103 -0.267251 -0.244704
75% -0.403800 -0.372363 -0.423487 -0.252198 -0.231755
max -0.312990 -0.314502 -0.368429 -0.189616 -0.165044
In [10]:
mu = pd.concat({'Separate': p_mu, 'Full': f_mu}, names=['Model']).reset_index()
mu = mu.melt(id_vars=['Model', 'Sample'], var_name='Set')
sns.boxplot(x='Set', y='value', hue='Model', data=mu)
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc8be239790>
In [11]:
p_s = pd.DataFrame(dict((ds, p_samp[ds]['sigma']) for ds in datasets))
p_s.index.name = 'Sample'
p_s.describe()
Out[11]:
AZ BX-E BX-I GR-E GR-I
count 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000
mean 1.875013 1.185160 1.083392 1.498967 1.444597
std 0.029923 0.018733 0.017045 0.017982 0.016511
min 1.755840 1.108950 1.008110 1.433390 1.381690
25% 1.854365 1.172610 1.071810 1.486530 1.433578
50% 1.874700 1.185165 1.083350 1.498410 1.444095
75% 1.894970 1.197460 1.094915 1.510945 1.455522
max 2.008700 1.257240 1.146390 1.568860 1.513180
In [12]:
f_s = pd.DataFrame(dict((ds, f_samp[ds]['sigma']) for ds in datasets))
f_s.index.name = 'Sample'
f_s.describe()
Out[12]:
AZ BX-E BX-I GR-E GR-I
count 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000
mean 1.743360 1.206495 1.073051 1.511378 1.371841
std 0.024639 0.018816 0.014803 0.018005 0.014597
min 1.652220 1.139190 1.017560 1.445010 1.316520
25% 1.726837 1.193845 1.063130 1.499287 1.362078
50% 1.743105 1.206190 1.073090 1.511300 1.371835
75% 1.759890 1.219172 1.082910 1.523432 1.381490
max 1.824560 1.285000 1.125910 1.577510 1.428880
In [13]:
sigma = pd.concat({'Separate': p_s, 'Full': f_s}, names=['Model']).reset_index()
sigma = sigma.melt(id_vars=['Model', 'Sample'], var_name='Set')
sns.boxplot(x='Set', y='value', hue='Model', data=sigma)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc8bdeebe50>

Comparing Projected Values

Let's now compare the $\theta$ values - what does each model predict for the distribution of user profile tendencies?
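
The thetaP values are presumably posterior-predictive draws for a new user: for each posterior sample $i$, draw a log-odds value from $\mathrm{N}(\mu_i, \sigma_i)$ and apply expit. Under that assumption (a guess about the sampler's generated quantities, not confirmed here), a quick consistency check looks like:

rng = np.random.default_rng(42)
check = expit(rng.normal(p_mu['AZ'], p_s['AZ']))  # one predictive draw per posterior sample
pd.Series(check).describe()                       # should roughly match p_th['AZ'] below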

In [14]:
p_th = pd.DataFrame(dict((ds, p_samp[ds]['thetaP']) for ds in datasets))
p_th.index.name = 'Sample'
p_th.describe()
Out[14]:
AZ BX-E BX-I GR-E GR-I
count 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000
mean 0.416734 0.422535 0.412365 0.452728 0.451432
std 0.299767 0.229406 0.216790 0.269713 0.262539
min 0.000527 0.007993 0.009984 0.002151 0.004607
25% 0.139004 0.232165 0.237363 0.216333 0.224470
50% 0.370690 0.402033 0.387333 0.436360 0.433280
75% 0.675581 0.597704 0.572539 0.677640 0.666217
max 0.998827 0.981249 0.975965 0.997260 0.994236
In [15]:
f_th = pd.DataFrame(dict((ds, f_samp[ds]['thetaP']) for ds in datasets))
f_th.index.name = 'Sample'
f_th.describe()
Out[15]:
AZ BX-E BX-I GR-E GR-I
count 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000
mean 0.429019 0.427444 0.411889 0.451355 0.458319
std 0.289965 0.233149 0.212814 0.270317 0.257499
min 0.001118 0.008927 0.012457 0.002978 0.003469
25% 0.170024 0.233544 0.238614 0.214576 0.240607
50% 0.392241 0.408919 0.392352 0.429607 0.439171
75% 0.676072 0.607898 0.569449 0.681477 0.667624
max 0.996645 0.982007 0.982311 0.994626 0.992481
In [16]:
thetaP = pd.concat({'Separate': p_th, 'Full': f_th}, names=['Model']).reset_index()
thetaP = thetaP.melt(id_vars=['Model', 'Sample'], var_name='Set')
In [17]:
grid = sns.FacetGrid(row='Set', hue='Model', data=thetaP, aspect=2)
grid.map(sns.kdeplot, 'value')
grid.add_legend()
Out[17]:
<seaborn.axisgrid.FacetGrid at 0x7fc8bdbbf810>
In [18]:
(pn.ggplot(thetaP, pn.aes('value', color='Model'))
 + pn.geom_line(stat='density', adjust=0.5)
 + pn.facet_grid('Set ~ .'))
Out[18]:
<ggplot: (8781259350597)>

Individual User Thetas

Now let's look at individual users' estimated theta values. How different are they? We will start by computing each user's posterior expected $\theta_u$ - by linearity of expectation, the expected difference is the difference in expected values.
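
Concretely, writing $n_u$ and $\tilde n_u$ for the profile-only and full-model log-odds draws (notation introduced here just for exposition), $\mathrm{E}[\tilde n_u - n_u] = \mathrm{E}[\tilde n_u] - \mathrm{E}[n_u]$, so subtracting the two per-user posterior means of nTheta computed below gives the posterior expected difference directly.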

In [19]:
p_thu = pd.concat(dict(
    (ds, pd.DataFrame({'nTheta': np.mean(p_samp[ds]['nTheta'], axis=0)}))
    for ds in datasets
), names=['Set', 'User'])
p_thu['Theta'] = expit(p_thu['nTheta'])
p_thu
Out[19]:
nTheta Theta
Set User
AZ 0 1.059797 0.742652
1 -0.936946 0.281518
2 -2.645083 0.066293
3 -1.533403 0.177496
4 -2.658765 0.065451
... ... ... ...
GR-I 4995 1.067174 0.744059
4996 -1.543224 0.176067
4997 -0.918932 0.285175
4998 1.510817 0.819182
4999 -2.024523 0.116652

25000 rows × 2 columns

In [20]:
f_thu = pd.concat(dict(
    (ds, pd.DataFrame({'nTheta': np.mean(f_samp[ds]['nTheta'], axis=0)}))
    for ds in datasets
), names=['Set', 'User'])
f_thu['Theta'] = expit(f_thu['nTheta'])
f_thu
Out[20]:
nTheta Theta
Set User
AZ 0 1.271355 0.780975
1 -0.704679 0.330776
2 -2.010412 0.118114
3 -2.480734 0.077220
4 -1.701148 0.154315
... ... ... ...
GR-I 4995 1.089918 0.748366
4996 -1.428438 0.193342
4997 -0.878642 0.293459
4998 1.544678 0.824144
4999 -2.015445 0.117591

25000 rows × 2 columns

In [21]:
thetaU = p_thu.join(f_thu, rsuffix='_f')
thetaU['ndiff'] = thetaU['nTheta_f'] - thetaU['nTheta']
thetaU['diff'] = thetaU['Theta_f'] - thetaU['Theta']
thetaU.describe()
Out[21]:
nTheta Theta nTheta_f Theta_f ndiff diff
count 25000.000000 25000.000000 25000.000000 25000.000000 25000.000000 25000.000000
mean -0.377055 0.425650 -0.351105 0.426963 0.025950 0.001312
std 1.310954 0.245157 1.343335 0.247269 0.442744 0.073400
min -4.611648 0.009838 -4.463211 0.011394 -3.004567 -0.591574
25% -1.302338 0.213772 -1.274608 0.218470 -0.128474 -0.021950
50% -0.409445 0.399045 -0.403842 0.400390 0.002216 0.000395
75% 0.411312 0.601402 0.414463 0.602157 0.165283 0.026456
max 5.259550 0.994829 5.000654 0.993311 3.257669 0.652368
In [22]:
thetaU['diff'].quantile([0.025, 0.975])
Out[22]:
0.025   -0.167280
0.975    0.161608
Name: diff, dtype: float64
In [23]:
thetaU['diff'].abs().describe()
Out[23]:
count    2.500000e+04
mean     4.549474e-02
std      5.761476e-02
min      2.054315e-07
25%      7.668859e-03
50%      2.407147e-02
75%      6.075029e-02
max      6.523681e-01
Name: diff, dtype: float64

What is the 95th percentile of the absolute difference?

In [24]:
thetaU['diff'].abs().quantile(0.95)
Out[24]:
0.16448341395008292
In [25]:
sns.kdeplot(thetaU['diff'].abs(), cumulative=True)
plt.axvline(thetaU['diff'].abs().quantile(0.95), color='grey')
Out[25]:
<matplotlib.lines.Line2D at 0x7fc8bd80ea10>
In [26]:
sns.kdeplot(thetaU['ndiff'].abs(), cumulative=True)
plt.axvline(thetaU['ndiff'].abs().quantile(0.95), color='grey')
Out[26]:
<matplotlib.lines.Line2D at 0x7fc8bc0f37d0>