Question 1 (STAN Reimplementation of HW3.3 & 4).

In this exercise, we re-implement HW3 Questions 3 & 4 using STAN, given the following data from HW3:

Part I: As previously, Table 1 contains quality control measurements from \(6\) machines in a factory. Let \(y_{ij}\) denote the \(i^{\text{th}}\) quality control measurement of machine \(j\). Assume the following hierarchical Gaussian model with common variance for \(y=\left\{y_{ij}\right\}\): \[ \begin{array}{l} p(\mu, \log \sigma, \log \tau) \propto \tau \\ \theta_{j} \mid \mu, \tau^{2} \sim N\left(\mu, \tau^{2}\right) \\ y_{ij} \mid \theta_{j}, \sigma^{2} \sim N\left(\theta_{j}, \sigma^{2}\right), \quad i=1, \ldots, 5 ;\ j=1, \ldots, 6 \end{array} \]

  1. Implement this model in STAN and name the file “q1measurep1.stan”.
  2. Fit the model with the table data; report the effective sample sizes, trace plots and \(\hat{R}\), and comment on the chains’ convergence.
  3. Provide the posterior histogram plots for all model parameters.
  4. Report and plot the \(95\%\) credible intervals of the model parameters. Compare with your results from HW3.
  • For parts (b)-(d), include all the code in the R script “q1runp1.R”.

(a) \(\mathbf{Solution.}\qquad\) See q1measurep1.stan.
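
A minimal sketch of what q1measurep1.stan might contain is shown below (the \(5 \times 6\) data-matrix layout is an assumption; the submitted file may differ). Note that the improper prior \(p(\mu, \log \sigma, \log \tau) \propto \tau\) becomes \(p(\mu, \sigma, \tau) \propto 1/\sigma\) after changing variables (Jacobian \(1/(\sigma\tau)\)), which is what the target increment implements.

```stan
// Hypothetical sketch of q1measurep1.stan (assumed data layout).
data {
  int<lower=1> N;          // measurements per machine (5)
  int<lower=1> J;          // number of machines (6)
  matrix[N, J] y;          // y[i, j]: i-th measurement of machine j
}
parameters {
  real mu;
  real<lower=0> sigma;     // common measurement sd
  real<lower=0> tau;       // between-machine sd
  vector[J] theta;         // per-machine means
}
model {
  // prior p(mu, log sigma, log tau) proportional to tau,
  // i.e. p(mu, sigma, tau) proportional to 1 / sigma
  target += -log(sigma);
  theta ~ normal(mu, tau);
  for (j in 1:J)
    col(y, j) ~ normal(theta[j], sigma);
}
```

Declaring sigma and tau with <lower=0> leaves flat (improper) densities on \(\mu\), \(\sigma\), and \(\tau\), so adding only the -log(sigma) term reproduces the stated prior up to a constant.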

   

(b) \(\mathbf{Solution.}\qquad\) Below we print the results of our model fit.

## Inference for Stan model: q1measurep1.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##             mean se_mean    sd    2.5%   97.5% n_eff Rhat
## theta[1]   78.91    0.14  6.65   66.02   92.35  2425    1
## theta[2]  104.18    0.13  6.64   91.32  117.52  2589    1
## theta[3]   88.49    0.11  6.18   76.44  100.36  3067    1
## theta[4]  108.53    0.14  6.85   94.45  121.42  2401    1
## theta[5]   90.38    0.10  6.22   78.11  102.61  3615    1
## theta[6]   86.97    0.11  6.43   74.44   99.46  3481    1
## sigma      15.14    0.05  2.39   11.28   20.48  2827    1
## tau        21.26    0.38 13.82    6.27   58.61  1337    1
## mu         93.09    0.25 10.16   71.09  114.23  1696    1
## lp__     -106.54    0.07  2.44 -112.27 -102.96  1394    1
## 
## Samples were drawn using NUTS(diag_e) at Sat Mar 28 18:53:57 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

We also show the trace plots of each of the fitted parameters.

From the table and trace plots, we see that \(\tau\) has the smallest effective sample size and its chains show the most volatility. All other parameters appear to be estimated well.

   

(c) \(\mathbf{Solution.}\qquad\) Below we provide posterior histogram plots for all model parameters.

   

(d) \(\mathbf{Solution.}\qquad\) Below we show the lower and upper endpoints of the \(95\%\) posterior intervals for each \(\theta_j\) and compare them to the results from HW3. From the tables below, we see that the intervals are close to one another.

## # A tibble: 3 x 7
##   Algorithm   theta1_lo theta2_lo theta3_lo theta4_lo theta5_lo theta6_lo
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 Gibbs            66.8      91.2      76.8      93.8      78.5      75.5
## 2 MH               66.7      90.3      76.8      93.3      78.7      76.0
## 3 Stan Part I      66.0      91.3      76.4      94.5      78.1      74.4
## # A tibble: 3 x 7
##   Algorithm   theta1_up theta2_up theta3_up theta4_up theta5_up theta6_up
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 Gibbs            93.1      116.      101.      121.      103.      99.7
## 2 MH               93.7      116.      101.      121.      102.      99.2
## 3 Stan Part I      92.3      118.      100.      121.      103.      99.5

By graphically comparing our samples to those from HW3, we see that they are indeed similar.

Part II: In this part, the model is as follows: \[ \begin{array}{l} p\left(\mu, \tau, \sigma_{0}\right) \propto 1 \\ \sigma_{j}^{2} \mid \sigma_{0}^{2} \sim \text{Inv-}\chi^{2}\left(5, \sigma_{0}^{2}\right) \\ \theta_{j} \mid \mu, \tau^{2} \sim N\left(\mu, \tau^{2}\right) \\ y_{ij} \mid \theta_{j}, \sigma_{j}^{2} \sim N\left(\theta_{j}, \sigma_{j}^{2}\right), \quad i=1, \ldots, 5 ;\ j=1, \ldots, 6 \end{array} \]

  • Repeat Part I (a)-(d), naming the files “q1measurep2.stan” and “q1runp2.R” respectively. Compare with the HW3 Question 4 results.

(a) \(\mathbf{Solution.}\qquad\) See q1measurep2.stan.
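
Again, a minimal sketch of what q1measurep2.stan might contain (same assumed data layout as Part I; the submitted file may differ). Stan's scaled_inv_chi_square(\(\nu\), \(s\)) distribution is exactly Inv-\(\chi^{2}(\nu, s^2)\), and leaving \(\mu\), \(\tau\), \(\sigma_0\) without sampling statements yields the flat improper prior \(p(\mu, \tau, \sigma_0) \propto 1\).

```stan
// Hypothetical sketch of q1measurep2.stan (assumed data layout).
data {
  int<lower=1> N;               // measurements per machine (5)
  int<lower=1> J;               // number of machines (6)
  matrix[N, J] y;               // y[i, j]: i-th measurement of machine j
}
parameters {
  real mu;                      // mu, tau, sigma0 get no sampling statements,
  real<lower=0> tau;            // i.e. flat (improper) priors
  real<lower=0> sigma0;
  vector[J] theta;              // per-machine means
  vector<lower=0>[J] sigma2;    // per-machine variances
}
model {
  // scaled_inv_chi_square(nu, s) is the Inv-chi^2(nu, s^2) distribution
  sigma2 ~ scaled_inv_chi_square(5, sigma0);
  theta ~ normal(mu, tau);
  for (j in 1:J)
    col(y, j) ~ normal(theta[j], sqrt(sigma2[j]));
}
generated quantities {
  vector[J] sigma = sqrt(sigma2);  // per-machine sds, as reported in the fit
}
```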

   

(b) \(\mathbf{Solution.}\qquad\) Below we print the results of our model fit.

## Inference for Stan model: q1measurep2.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##             mean se_mean     sd     25%    75% n_eff Rhat
## mu         93.75    0.16   7.75   89.11  98.22  2227    1
## tau        16.04    0.22   8.85   10.34  19.63  1649    1
## theta[1]   81.21    0.18   8.06   75.89  86.43  2083    1
## theta[2]  103.60    0.15   6.38   99.63 107.79  1827    1
## theta[3]   89.23    0.12   6.05   85.42  92.98  2501    1
## theta[4]  108.11    0.17   6.44  104.47 112.33  1523    1
## theta[5]   91.02    0.11   5.68   87.62  94.63  2544    1
## theta[6]   88.02    0.14   7.16   83.36  92.76  2638    1
## sigma2[1] 379.91    5.55 244.10  221.85 460.49  1933    1
## sigma2[2] 228.60    3.88 167.71  130.55 272.11  1870    1
## sigma2[3] 227.64    3.70 148.50  135.14 274.29  1609    1
## sigma2[4] 193.23    3.79 158.79  100.75 229.35  1757    1
## sigma2[5] 207.88    3.40 144.55  117.49 249.16  1802    1
## sigma2[6] 345.68    4.47 217.30  211.20 418.97  2361    1
## sigma0     14.03    0.08   2.91   11.92  15.77  1421    1
## sigma[1]   18.74    0.12   5.35   14.89  21.46  1959    1
## sigma[2]   14.44    0.10   4.49   11.43  16.50  1892    1
## sigma[3]   14.51    0.10   4.14   11.63  16.56  1760    1
## sigma[4]   13.14    0.11   4.55   10.04  15.14  1676    1
## sigma[5]   13.80    0.10   4.17   10.84  15.78  1872    1
## sigma[6]   17.93    0.10   4.92   14.53  20.47  2606    1
## lp__      -98.52    0.09   3.10 -100.32 -96.30  1097    1
## 
## Samples were drawn using NUTS(diag_e) at Sat Mar 28 16:18:12 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

We also show the trace plots of each of the fitted parameters.

From the table and trace plots, we see that all chains appear to have converged.

   

(c) \(\mathbf{Solution.}\qquad\) Below we provide posterior histogram plots for all model parameters.

   

(d) \(\mathbf{Solution.}\qquad\) Below we show the lower and upper endpoints of the \(95\%\) posterior intervals for each \(\theta_j\) and compare them to the results from HW3. From the tables below, we see that the intervals are close to one another.

## # A tibble: 3 x 7
##   Algorithm    theta1_lo theta2_lo theta3_lo theta4_lo theta5_lo theta6_lo
##   <chr>            <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 Gibbs             66.8      91.2      76.8      93.8      78.5      75.5
## 2 MH                66.7      90.3      76.8      93.3      78.7      76.0
## 3 Stan Part II      66.0      90.3      77.5      93.9      79.6      73.5
## # A tibble: 3 x 7
##   Algorithm    theta1_up theta2_up theta3_up theta4_up theta5_up theta6_up
##   <chr>            <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 Gibbs             93.1      116.      101.      121.      103.      99.7
## 2 MH                93.7      116.      101.      121.      102.      99.2
## 3 Stan Part II      97.4      116.      101.      120.      103.     102.

By graphically comparing our samples to those from HW3, we see that they are indeed similar.

Question 2 (Finite Gaussian Mixture Model):

Recall from the lecture that STAN doesn’t allow any discrete model parameters. In some cases, we can work around the restriction by marginalizing these out. In this question, we are going to explore one such example. One useful model for clustering is the latent Gaussian mixture model. For \(x_{i}, \mu_{k} \in \mathbb{R}^{2}\), \(\Sigma_{k} \in \mathbb{R}^{2 \times 2}\), \(\pi \in[0,1]^{3}\), \(i=1, \ldots, 1000\) and \(k=1,2,3\): \[ \begin{array}{l} z_{i} \mid \pi \sim \text{Categorical}(z \mid \pi) \\ x_{i} \mid z_{i}=k, \mu, \Sigma \sim N\left(x \mid \mu_{k}, \Sigma_{k}\right) \\ \pi \sim \text{Dirichlet}\left(\pi \mid \tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right) \\ \mu_{k} \mid \Sigma_{k} \sim N\left(\mu \mid 0, \Sigma_{k}\right) \\ \Sigma_{k} \sim \text{Inv-Wishart}(\Sigma \mid 3, I) \end{array} \] where \(I\) is the \(2 \times 2\) identity matrix.

  1. Rewrite the above model by marginalizing out the parameters \(z_{i}\), i.e., find \(p\left(x_{i} \mid \mu, \Sigma, \pi\right)\) for \(i=1,2, \ldots, 1000\).
  2. Implement the model from part (a) in STAN and name the file “gmm.stan”. Hint: use STAN’s log_sum_exp.
  3. Fit the model with the data “hw4q2.csv” in Canvas. The data is the collection of \((x, y)\) coordinates for each data point. Report the effective sample sizes, trace plots and \(\hat{R}\).
  4. Plot the posterior histograms of the model parameters \(\mu_{k}, \Sigma_{k}, \pi_{k}\) for \(k=1,2,3\).
  5. Report and plot the \(90\%\) credible intervals of the model parameters \(\mu_{k}, \Sigma_{k}, \pi_{k}\) for \(k=1,2,3\).
  6. Use ggplot (or any other package) to visually plot the data from “hw4q2.csv” in 2D. Are the results from parts (d) and (e) in line with the data?
  • Include all the code for parts (c)-(f) in the script file “q2run.R”.

(a) \(\mathbf{Solution.}\qquad\) We define each \(p\left(x_{i} \mid \mu, \Sigma, \pi\right)\) as,

\[p\left(x_{i} \mid \mu, \Sigma, \pi\right)=\sum_{k=1}^{3} \pi_{k}\, p\left(x_{i} \mid z_{i}=k, \mu, \Sigma\right)=\sum_{k=1}^{3} \pi_{k}\, N\left(x_{i} \mid \mu_{k}, \Sigma_{k}\right)\]
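
In Stan, part (b) evaluates this sum on the log scale for numerical stability: \(\log p(x_i \mid \mu, \Sigma, \pi) = \operatorname{log\_sum\_exp}_{k}\left(\log \pi_k + \log N(x_i \mid \mu_k, \Sigma_k)\right)\). A minimal sketch of what gmm.stan might look like follows; the data layout is an assumption, and the ordered constraint that breaks label symmetry is discussed in the next part.

```stan
// Hypothetical sketch of gmm.stan (assumed structure).
data {
  int<lower=1> N;            // number of data points (1000)
  vector[2] x[N];            // observed (x, y) coordinates
}
parameters {
  simplex[3] Pi;             // mixing weights
  ordered[3] mu1;            // first mean coordinates, sorted to break label symmetry
  vector[3] mu2;             // second mean coordinates
  cov_matrix[2] Sigma[3];    // per-component covariances
}
transformed parameters {
  vector[2] mu[3];           // assemble the component means
  for (k in 1:3) {
    mu[k][1] = mu1[k];
    mu[k][2] = mu2[k];
  }
}
model {
  Pi ~ dirichlet(rep_vector(1.0 / 3, 3));
  for (k in 1:3) {
    Sigma[k] ~ inv_wishart(3, diag_matrix(rep_vector(1.0, 2)));
    mu[k] ~ multi_normal(rep_vector(0, 2), Sigma[k]);
  }
  for (n in 1:N) {
    vector[3] lps = log(Pi);                       // log pi_k
    for (k in 1:3)
      lps[k] += multi_normal_lpdf(x[n] | mu[k], Sigma[k]);
    target += log_sum_exp(lps);                    // log p(x_n | mu, Sigma, pi)
  }
}
```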

   

(b)-(e) \(\mathbf{Solution.}\qquad\) The Markov chains won’t converge due to label switching: the mixture likelihood is invariant under permutation of the component labels, so the chains wander among equivalent modes. We can fix the issue by imposing an ordering on the parameters we are estimating, as can be seen in the gmm.stan file. Below we print the effective sample size and \(\hat{R}\) for each parameter.

##                  n_eff      Rhat
## mu[1,1]       752.9944 0.9992802
## mu[1,2]      1126.9094 1.0006393
## mu[2,1]       725.5882 0.9997860
## mu[2,2]       746.5764 0.9997889
## mu[3,1]      1147.5294 0.9999796
## mu[3,2]      2645.3171 0.9996144
## sigma[1,1,1] 1584.1790 1.0006166
## sigma[1,1,2] 1096.2835 1.0004774
## sigma[1,2,1] 1096.2835 1.0004774
## sigma[1,2,2] 1562.2531 1.0012742
## sigma[2,1,1]  901.2852 0.9991153
## sigma[2,1,2] 1359.8538 1.0008705
## sigma[2,2,1] 1359.8538 1.0008705
## sigma[2,2,2]  878.3457 1.0018428
## sigma[3,1,1] 1589.2433 0.9993557
## sigma[3,1,2] 1737.7172 0.9995815
## sigma[3,2,1] 1737.7172 0.9995815
## sigma[3,2,2] 1671.1224 1.0002212
## Pi[1]         661.3025 0.9998311
## Pi[2]         608.9885 0.9995605
## Pi[3]         651.3841 0.9992514
## P[1]         1018.8506 1.0039840
## P[2]         1050.5729 1.0034844
## P[3]          959.2461 1.0039818
## lp__          765.6541 0.9994990

We also show the trace plots.

   

(f) \(\mathbf{Solution.}\qquad\) Below we provide posterior histogram plots for all model parameters.

   

(g) \(\mathbf{Solution.}\qquad\) Below we plot the \(90\%\) posterior credible intervals of each parameter.

   

(h) \(\mathbf{Solution.}\qquad\) The results from parts (f) and (g) are mostly in line with the data. From the data, there appear to be three separate clusters. However, the fitted mean and covariance for the third class seem to sit between two smaller clusters in the data.

Question 3 (Logit/Probit Regression)

During an election, voter turnout is an important political issue to investigate in detail. For this question, download “hw4q3.csv” from Canvas. This file is the turnout data from the R Zelig package. The survey data has 6 columns:

  • age: participant’s age
  • educate: participant’s education level
  • income: participant’s income
  • racewhite: 1 if raceothers \(==0\) and 0 otherwise
  • raceothers: 1 if racewhite \(==0\) and 0 otherwise
  • vote: 1 if voted and 0 otherwise

We will apply the Bayesian logit and probit regression models where “vote” is the response \(y \in\{0,1\}^{N}\) and the others are covariates. Here \(N\) is the number of data points and \(X \in \mathbb{R}^{N \times 5}\) is the design matrix.

  1. (Preprocessing) Create the design matrix \(X \in \mathbb{R}^{N \times 5}\) and response \(y \in \{0,1\}^{N}\) from “hw4q3.csv”. Standardize (center and scale) each column/covariate of \(X\).
  2. The Bayesian logit regression model can be described as follows. For \(\alpha, \eta_{i} \in \mathbb{R}\), \(x_{i}, \beta \in \mathbb{R}^{5}\) and \(i=1, \ldots, N\): \[ \begin{array}{l} y_{i} \mid \rho_{i} \sim \text{Bernoulli}\left(y \mid \rho_{i}\right) \\ \rho_{i} \mid \eta_{i} = \operatorname{inv-logit}\left(\eta_{i}\right) \\ \eta_{i} = \alpha + x_{i}^{\top} \beta \\ \alpha \sim N(\cdot \mid 0, 10) \\ \beta_{j} \sim N(\cdot \mid 0, 2.5), \quad j=1, \ldots, 5 \end{array} \] where \(\operatorname{inv-logit}(\eta)=\frac{1}{1+e^{-\eta}}\). Implement this model in STAN and name the file “q3logit.stan” (a sketch appears after this list).
  3. Fit the model with the data; report effective sample sizes, trace plots and \(95\%\) credible intervals for \(\alpha, \beta_{1}, \ldots, \beta_{5}\).
  4. The probit model is identical to the logit model in part (b) except for \(\rho_{i}\), which is now \(\rho_{i} \mid \eta_{i}=\Phi\left(\eta_{i}\right)\), where \(\Phi(\cdot)\) is the standard normal CDF. Implement the model in STAN and name the file “q3probit.stan” (see the probit sketch after this list).
  5. Repeat part (c) for the probit model.
  • Include the code for parts (c) and (e) in the script “q3run.R”.
  6. Explain how to compare the probit and logit results from parts (c) and (e).
  7. Are the posteriors of (c) and (e) comparable with the MLE estimates for \(\alpha, \beta_{1}, \cdots, \beta_{5}\)? Hint: use R’s glm function. Include the code in “q3run.R”.
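
For reference, a minimal sketch of what q3logit.stan might contain (the data block layout is an assumption, and the second argument of normal is read as a standard deviation; the actual file may differ):

```stan
// Hypothetical sketch of q3logit.stan (assumed structure).
data {
  int<lower=1> N;                // number of survey respondents
  matrix[N, 5] X;                // standardized design matrix
  int<lower=0, upper=1> y[N];    // vote indicator
}
parameters {
  real alpha;
  vector[5] beta;
}
model {
  alpha ~ normal(0, 10);         // second argument read as sd (assumption)
  beta ~ normal(0, 2.5);
  // bernoulli_logit applies inv-logit to eta = alpha + X * beta internally
  y ~ bernoulli_logit(alpha + X * beta);
}
```

Using bernoulli_logit rather than bernoulli(inv_logit(...)) avoids numerical overflow for extreme values of \(\eta_i\).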
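For the probit variant in part (d), only the likelihood changes: \(\rho_i = \Phi(\eta_i)\) via Stan's built-in Phi function. A sketch of the replacement model block for q3probit.stan, under the same assumptions (data and parameters blocks unchanged):

```stan
model {
  alpha ~ normal(0, 10);
  beta ~ normal(0, 2.5);
  // Phi is the standard normal CDF, applied elementwise
  y ~ bernoulli(Phi(alpha + X * beta));
}
```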