User Profile Bayesian Analysis

This notebook analyzes the results of the Bayesian analysis of the user profile data.

We split our Bayesian analyses into three steps:

  1. Exploratory analysis of data + preparation for sampling (notebook)
  2. Sampling (Python script)
  3. Analysis of sampling results (notebook)

This is to enable the sampling process and its results to be managed and cached by DVC, and to avoid re-running the Bayesian sampling process when we only want to change how we look at its results.

Setup

In [1]:
model = 'profile'
In [2]:
# Parameters
model = "full"
In [3]:
import os
from pathlib import Path
In [4]:
import pandas as pd
import numpy as np
import scipy.special as sps
from scipy import stats
import seaborn as sns
import plotnine as p
import matplotlib.pyplot as plt
from statsmodels.nonparametric.kde import KDEUnivariate
import zarr
from IPython.display import display, Markdown
In [5]:
from bookgender.config import rng_seed
from lenskit.util import init_rng
import bookgender.datatools as dt
from bookgender.nbutils import *
In [6]:
seed = init_rng(rng_seed(), 'ProfileModelAnalysis', propagate=True)
rng = np.random.default_rng(seed)
seed
Out[6]:
SeedSequence(
    entropy=261868553827208103807548308384201786360,
    spawn_key=(2851018176,),
)
In [7]:
if model.startswith('profile'):
    fig_key = model
else:
    fig_key = 'profile-' + model
fig_key
Out[7]:
'profile-full'
In [8]:
fig_dir = init_figs(f'PMA-{fig_key}')
using figure dir figures\PMA-profile-full

Load Data

In [9]:
datasets = list(dt.datasets.keys())
datasets
Out[9]:
['AZ', 'BX-E', 'BX-I', 'GR-E', 'GR-I']

We need to load the sampling results for each data set:

In [10]:
samples = {}
summary = {}
for ds in datasets:
    _zf = zarr.ZipStore(f'data/{ds}/inference/{model}/samples.zarr', mode='r')
    _c = zarr.LRUStoreCache(_zf, 2**30)
    samples[ds] = zarr.group(_c)
    summary[ds] = pd.read_csv(f'data/{ds}/inference/{model}-summary.csv', index_col='name')
summary = pd.concat(summary, names=['Set'])
summary.head()
Out[10]:
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
Set name
AZ lp__ -523430.000000 6.698110 230.681000 -523811.000000 -523434.000000 -523047.000000 1186.09 0.171128 1.005420
mu -0.404199 0.000296 0.027883 -0.449519 -0.404244 -0.357611 8846.61 1.276380 0.999921
sigma 1.817710 0.000500 0.025899 1.775350 1.817410 1.860390 2683.58 0.387183 1.000170
nTheta[1] 1.286780 0.004743 0.404129 0.623833 1.279880 1.964260 7258.89 1.047310 0.999970
nTheta[2] -0.690522 0.003429 0.316051 -1.215020 -0.687933 -0.183888 8495.49 1.225720 1.000410

Extract and verify sample size:

In [11]:
sample_size = len(samples['AZ']['lp__'])
assert all(len(s['lp__']) == sample_size for s in samples.values())
sample_size
Out[11]:
10000

Quality Checks

Do we have parameters with bad mixing? (high $\hat{R}$)

In [12]:
summary.sort_values('R_hat', ascending=False).head()
Out[12]:
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
Set name
AZ recV[1] 0.328336 0.000381 0.010875 0.310559 0.328422 0.346121 813.49 0.117369 1.00780
recV[7] 0.591941 0.000377 0.014596 0.567788 0.592029 0.615941 1495.08 0.215708 1.00557
lp__ -523430.000000 6.698110 230.681000 -523811.000000 -523434.000000 -523047.000000 1186.09 0.171128 1.00542
BX-I log_lik[40] -11.763600 0.026478 1.550820 -14.768900 -11.450300 -9.857870 3430.33 0.728409 1.00472
AZ log_lik[1277] -21.237200 0.035902 2.023170 -25.041100 -20.892700 -18.590100 3175.55 0.458165 1.00453

Let's compute LPPD and WAIC for each of our models:

In [13]:
def ll_stats(ds):
    ll_exp = samples[ds]['ll_exp']
    ll_var = samples[ds]['ll_var']
    lppd = np.sum(ll_exp)
    pwaic = np.sum(ll_var)
    return pd.Series({'lppd': lppd, 'pWAIC': pwaic, 'WAIC': -2 * (lppd - pwaic)})
pd.Series(datasets).apply(ll_stats).assign(Set=datasets).set_index('Set')
Out[13]:
lppd pWAIC WAIC
Set
AZ -105183.705730 19923.413558 250214.238575
BX-E -35879.516478 7358.254564 86475.542083
BX-I -65591.503901 12595.357457 156373.722716
GR-E -40095.871232 7517.476264 95226.694992
GR-I -72305.932973 12602.205166 169816.276278

Extract and Summarize

Extract sample data in a form usable for plotting:

In [14]:
samp_df = pd.concat(dict((ds, pd.DataFrame({
    'Mu': samples[ds]['mu'],
    'Sigma': samples[ds]['sigma'],
    'Theta': samples[ds]['thetaP']
})) for ds in datasets), names=['Set']).copy()
samp_df
Out[14]:
Mu Sigma Theta
Set
AZ 0 -0.428029 1.84320 0.014647
1 -0.425898 1.78460 0.943089
2 -0.402148 1.80616 0.164888
3 -0.402499 1.78952 0.436446
4 -0.408635 1.79498 0.820111
... ... ... ... ...
GR-I 9995 -0.256192 1.39657 0.843665
9996 -0.269800 1.41944 0.546385
9997 -0.267748 1.41610 0.789476
9998 -0.200176 1.40139 0.204088
9999 -0.238515 1.41414 0.615352

50000 rows × 3 columns

We also want to show observed distributions:

In [15]:
profiles = pd.read_pickle('data/profile-data.pkl')
profiles.head()
Out[15]:
count linked ambiguous male female dcknown dcyes PropDC Known PropFemale PropKnown
Set user
AZ 529 8 8 2 1 4 8 3 0.375000 5 0.800000 0.625000
1723 25 24 3 15 6 25 14 0.560000 21 0.285714 0.840000
1810 14 6 0 6 0 8 1 0.125000 6 0.000000 0.428571
2781 8 8 1 5 1 8 5 0.625000 6 0.166667 0.750000
2863 6 6 0 6 0 6 4 0.666667 6 0.000000 1.000000

For each sample, we want to sample a predicted observation. We do this by:

  1. Sample $n'$ from the observed profile sizes
  2. Sample $y' \sim \mathrm{Binomial}(n', \theta')$
  3. Compute $y'/n'$
In [16]:
def _sample_ns(df):
    known = profiles.loc[df.name, 'Known']
    # return known.sample(len(df), replace=True).reset_index(drop=True)
    return pd.Series(rng.choice(known, len(df)))
samp_df.groupby('Set').apply(_sample_ns).stack()
Out[16]:
Set       
AZ    0         6
      1         5
      2         9
      3         8
      4        22
             ... 
GR-I  9995    124
      9996     27
      9997    338
      9998     33
      9999     37
Length: 50000, dtype: int32
In [17]:
samp_df['N'] = samp_df.groupby('Set').apply(_sample_ns).stack()
samp_df['Y'] = rng.binomial(samp_df['N'], samp_df['Theta'], len(samp_df))
samp_df['PropFemale'] = samp_df['Y'] / samp_df['N']
samp_df.head()
Out[17]:
Mu Sigma Theta N Y PropFemale
Set
AZ 0 -0.428029 1.84320 0.014647 5 0 0.000000
1 -0.425898 1.78460 0.943089 17 16 0.941176
2 -0.402148 1.80616 0.164888 17 2 0.117647
3 -0.402499 1.78952 0.436446 44 17 0.386364
4 -0.408635 1.79498 0.820111 11 10 0.909091

Summarize Statistics and Parameters

We want to report summary statistics of the various parameters. We will start with observed ratios:

In [18]:
profiles.groupby('Set')['PropFemale'].agg(['mean', 'std']).T
Out[18]:
Set AZ BX-E BX-I GR-E GR-I
mean 0.414445 0.418886 0.407030 0.446998 0.450201
std 0.329436 0.267354 0.254304 0.276480 0.269127
In [19]:
print(profiles.groupby('Set')['PropFemale'].agg(['mean', 'std']).T.to_latex(float_format='%.3f'))
\begin{tabular}{lrrrrr}
\toprule
Set &    AZ &  BX-E &  BX-I &  GR-E &  GR-I \\
\midrule
mean & 0.414 & 0.419 & 0.407 & 0.447 & 0.450 \\
std  & 0.329 & 0.267 & 0.254 & 0.276 & 0.269 \\
\bottomrule
\end{tabular}

What about our est. mean log odds ($\mu$) and variance ($\sigma$)? We also want to the posterior mean tendancy ($\theta$) and its variance.

It's important to distinguish between the estimated variance of users ($\sigma$) and the variance of the estimate (e.g. the variance of $\mu$). Since $\theta$ is a sample of a random unseen user, its variance should be comparable to the variance of the underlying users.

In [20]:
samp_df.groupby('Set')[['Mu', 'Sigma', 'Theta']].agg(['mean', 'std']).T
Out[20]:
Set AZ BX-E BX-I GR-E GR-I
Mu mean -0.404199 -0.387693 -0.436366 -0.268155 -0.239446
std 0.027883 0.019787 0.017108 0.022523 0.020677
Sigma mean 1.817712 1.203551 1.087071 1.510577 1.407563
std 0.025899 0.019206 0.014878 0.018145 0.015249
Theta mean 0.435515 0.424355 0.416129 0.452638 0.454304
std 0.295780 0.231262 0.217583 0.270602 0.261225

Find the 95% confidence intervals of $\mu$:

In [21]:
samp_df.groupby('Set')['Mu'].apply(lambda s: {
    'Low': s.quantile(0.025),
    'High': s.quantile(0.975)
}).unstack()
Out[21]:
Low High
Set
AZ -0.458947 -0.350098
BX-E -0.426526 -0.348723
BX-I -0.469878 -0.403145
GR-E -0.311513 -0.223671
GR-I -0.280169 -0.198707

Let's look at the STAN summary output for these parameters too:

In [22]:
summary.swaplevel().loc['mu', ['Mean', 'StdDev', '5%', '95%']].T
Out[22]:
Set AZ BX-E BX-I GR-E GR-I
Mean -0.404199 -0.387693 -0.436366 -0.268155 -0.239446
StdDev 0.027883 0.019787 0.017109 0.022523 0.020677
5% -0.449519 -0.419934 -0.464393 -0.304782 -0.273687
95% -0.357611 -0.354885 -0.408594 -0.231050 -0.205793
In [23]:
summary.swaplevel().loc['sigma', ['Mean', 'StdDev', '5%', '95%']].T
Out[23]:
Set AZ BX-E BX-I GR-E GR-I
Mean 1.817710 1.203550 1.087070 1.510580 1.407560
StdDev 0.025899 0.019206 0.014878 0.018145 0.015249
5% 1.775350 1.172050 1.062880 1.480830 1.382450
95% 1.860390 1.235130 1.111600 1.541110 1.432750
In [24]:
summary.swaplevel().loc['thetaP', ['Mean', 'StdDev', '5%', '95%']].T
Out[24]:
Set AZ BX-E BX-I GR-E GR-I
Mean 0.435515 0.424355 0.416129 0.452638 0.454304
StdDev 0.295780 0.231262 0.217583 0.270602 0.261225
5% 0.032316 0.089250 0.097087 0.062216 0.070188
95% 0.931554 0.827087 0.798405 0.900807 0.888453

Plot Distributions

Stack simulated proportions from two sources:

In [25]:
simulated_prop = pd.DataFrame({'value': samp_df[['Theta', 'PropFemale']].stack()})
simulated_prop.reset_index([0,2], inplace=True)
simulated_prop['Set'] = simulated_prop['Set'].astype('category')
simulated_prop.rename(columns={'level_2': 'Method'}, inplace=True)
simulated_prop.reset_index(drop=True, inplace=True)
simulated_prop.head()
Out[25]:
Set Method value
0 AZ Theta 0.014647
1 AZ PropFemale 0.000000
2 AZ Theta 0.943089
3 AZ PropFemale 0.941176
4 AZ Theta 0.164888
In [26]:
obs_prop = profiles.groupby('Set').apply(lambda df: df['PropFemale'].sample(sample_size, replace=True))
obs_prop.name = 'value'
obs_prop = obs_prop.reset_index(0)
obs_prop.reset_index(drop=True, inplace=True)
obs_prop['Method'] = 'Observed'
obs_prop = obs_prop[['Set', 'Method', 'value']]
obs_prop.head()
Out[26]:
Set Method value
0 AZ Observed 0.071429
1 AZ Observed 0.411765
2 AZ Observed 0.285714
3 AZ Observed 0.142857
4 AZ Observed 1.000000
In [27]:
all_prop = pd.concat([simulated_prop, obs_prop], ignore_index=True)
all_prop['Method'] = all_prop['Method'].astype('category')
all_prop['Method'].cat.reorder_categories(['Theta', 'PropFemale', 'Observed'], inplace=True)
all_prop['Method'].cat.rename_categories({'Theta': 'Smoothed', 'PropFemale': 'Predicted y/n', 'Observed': 'Observed y/n'}, inplace=True)
all_prop.head()
Out[27]:
Set Method value
0 AZ Smoothed 0.014647
1 AZ Predicted y/n 0.000000
2 AZ Smoothed 0.943089
3 AZ Predicted y/n 0.941176
4 AZ Smoothed 0.164888

We want the mean for each method to show:

In [28]:
prop_means = all_prop.groupby(['Set', 'Method'])['value'].mean().reset_index()
prop_means
Out[28]:
Set Method value
0 AZ Smoothed 0.435515
1 AZ Predicted y/n 0.434736
2 AZ Observed y/n 0.415003
3 BX-E Smoothed 0.424355
4 BX-E Predicted y/n 0.424520
5 BX-E Observed y/n 0.420246
6 BX-I Smoothed 0.416129
7 BX-I Predicted y/n 0.414873
8 BX-I Observed y/n 0.404499
9 GR-E Smoothed 0.452638
10 GR-E Predicted y/n 0.452408
11 GR-E Observed y/n 0.446901
12 GR-I Smoothed 0.454304
13 GR-I Predicted y/n 0.454035
14 GR-I Observed y/n 0.448502

Now finally we can view densities:

In [29]:
grid = sns.FacetGrid(all_prop.assign(hist=all_prop.Method == 'Observed y/n'), row='Set', hue='Method',
                     sharey=False, height=2, aspect=5)
for (r, c, h), df in grid.facet_data():
    if h == 0:
        sns.distplot(df['value'], kde=False, norm_hist=True, ax=grid.axes[r, c])
grid.map(sns.kdeplot, 'value', clip=(0,1)).add_legend()
plt.xlim(0, 1)
# plt.savefig(fig_dir / 'dist.pdf')
Out[29]:
(0.0, 1.0)
In [30]:
from bookgender.nbutils import make_plot
make_plot(all_prop, p.aes('value'),
          p.geom_histogram(p.aes(y='..density..'), data=obs_prop,
                           binwidth=0.05, fill='#CCCCCC'),
          p.geom_line(p.aes(color='Method', linetype='Method'),
                      stat='density', bw='scott', adjust=0.5),
          p.geom_vline(p.aes(xintercept='value', color='Method', linetype='Method'),
                       data=prop_means, show_legend=False),
          p.facet_grid('Set ~', scales='free_y'),
          p.scale_color_brewer('qual', 'Dark2'),
          p.xlab("Profile Proportion of Female Authors"),
          p.ylab("Density"),
          legend_position='top', legend_title=p.element_blank(),
          file='profile-dist', width=8, height=6)
C:\Users\michaelekstrand\Anaconda3\envs\bookfair\lib\site-packages\plotnine\ggplot.py:729: PlotnineWarning: Saving 8 x 6 in image.
C:\Users\michaelekstrand\Anaconda3\envs\bookfair\lib\site-packages\plotnine\ggplot.py:730: PlotnineWarning: Filename: figures\PMA-profile-full\profile-dist.pdf
C:\Users\michaelekstrand\Anaconda3\envs\bookfair\lib\site-packages\plotnine\ggplot.py:729: PlotnineWarning: Saving 8 x 6 in image.
C:\Users\michaelekstrand\Anaconda3\envs\bookfair\lib\site-packages\plotnine\ggplot.py:730: PlotnineWarning: Filename: figures\PMA-profile-full\profile-dist.png
Out[30]:
<ggplot: (-9223371869579011516)>

Robustness of Individual Inferences

The profiling process infers a $g(\theta_u)$ value for each user. What is the variance of those estimates?

In [31]:
def u_stats(ds):
    ths = samples[ds]['nTheta'][...].T
    return pd.DataFrame({
        'mean': np.mean(ths, axis=1),
        'var': np.var(ths, axis=1),
        'lo': np.quantile(ths, 0.025, axis=1),
        'hi': np.quantile(ths, 0.975, axis=1)
    })
user_biases = pd.concat(dict(
    (ds, u_stats(ds)) for ds in datasets
), names=['Set', 'uidx'])
user_biases
Out[31]:
mean var lo hi
Set uidx
AZ 0 1.286784 0.163304 0.509874 2.102445
1 -0.690522 0.099878 -1.324202 -0.083863
2 -2.271919 0.298153 -3.386939 -1.259787
3 -2.178549 0.194462 -3.054623 -1.328799
4 -1.654363 0.160850 -2.455701 -0.878510
... ... ... ... ... ...
GR-I 4995 1.085391 0.002662 0.985604 1.189203
4996 -1.429757 0.098418 -2.040010 -0.822707
4997 -0.853467 0.033602 -1.219776 -0.503246
4998 1.524129 0.026841 1.206189 1.848613
4999 -2.015394 0.023908 -2.325380 -1.719341

25000 rows × 4 columns

In [32]:
(p.ggplot(user_biases.reset_index(), p.aes('mean'))
 + p.geom_histogram()
 + p.facet_grid('Set ~'))
C:\Users\michaelekstrand\Anaconda3\envs\bookfair\lib\site-packages\plotnine\stats\stat_bin.py:93: PlotnineWarning: 'stat_bin()' using 'bins = 91'. Pick better value with 'binwidth'.
Out[32]:
<ggplot: (-9223371869577185636)>
In [33]:
(p.ggplot(user_biases.reset_index(), p.aes('var'))
 + p.geom_histogram()
 + p.facet_grid('Set ~'))
C:\Users\michaelekstrand\Anaconda3\envs\bookfair\lib\site-packages\plotnine\stats\stat_bin.py:93: PlotnineWarning: 'stat_bin()' using 'bins = 95'. Pick better value with 'binwidth'.
Out[33]:
<ggplot: (-9223371869578554436)>

This indicates quite a few $\theta_u$ values have high posterior variance, so we really do want to use the entire distribution.

In [ ]: