ESS for importance sampling.

In class, we talked about the effective sample size (ESS) of an importance sampler with \(m\) samples, defined as \[ \mathrm{ESS}(m)=\frac{m}{1+\mathrm{cv}[w]^{2}} \]

where \(\mathrm{cv}[w]\) is the sample standard deviation of the weights \(\left(w_{1}, \ldots, w_{m}\right)\) divided by their sample mean.

  1. Suppose \(\pi(x) \propto\) the standard Gaussian density, \(g(x) \propto\) the Student-t density with 2 degrees of freedom, and \(h(x)=x\). Implement an importance sampler and calculate \(\mathrm{ESS}(m)\) for \(m=50,100,200,500,1000\).
  2. Suppose \(g(x) \propto\) the standard Gaussian density, \(\pi(x) \propto\) the Student-t density with 2 degrees of freedom, and \(h(x)=x\). Implement an importance sampler and calculate \(\mathrm{ESS}(m)\) for \(m=50,100,200,500,1000\). Is \(g(x)\) a sensible choice?
  3. What do you find by comparing the results above?
  4. Suppose the target density is \[ \pi\left(\mu, \sigma^{2}\right) \propto \sigma^{-5} \exp \left[-\frac{(\mu-1)^{4}+4}{2 \sigma^{2}}\right] \] where \(\left(\mu, \sigma^{2}\right) \in[-3,5] \times[0.01,5]\)
    1. Make a contour plot of the target density in the specified range.
    2. Given the target density and the contour plot, how can you choose a good importance function? Design an importance sampling procedure and estimate the ESS for several different sample sizes. Compare your samples with that obtained from grid sampling.
    3. Optional (bonus points \(\leq 5\) ). If \(\mu\) is the only quantity of interest, does marginalization help with obtaining better importance samples? If so, can you verify it?

1 \(\mathbf{Solution.}\qquad\) Below we implement an importance sampler with target \(\pi(x)\) the standard Gaussian density and proposal \(g(x)\) the \(t_2\) density, and calculate \(\mathrm{ESS}(m)\) for the requested sample sizes.
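A minimal sketch of one way to do this; the helper function is_ess and its interface are our own names, not taken from the original code.

# Self-normalized importance sampler: returns the estimate of E_pi[h(X)]
# and the ESS based on the coefficient of variation of the weights.
is_ess <- function(m, rproposal, dtarget, dproposal, h = identity) {
  x <- rproposal(m)                    # draw m samples from the proposal g
  w <- dtarget(x) / dproposal(x)       # unnormalized importance weights
  est <- sum(w * h(x)) / sum(w)        # self-normalized estimate of E_pi[h(X)]
  cv <- sd(w) / mean(w)                # coefficient of variation of the weights
  c(estimate = est, ESS = m / (1 + cv^2))
}

set.seed(1)
# Question 1: target pi = N(0, 1), proposal g = t with 2 degrees of freedom
sapply(c(50, 100, 200, 500, 1000), function(m)
  is_ess(m,
         rproposal = function(k) rt(k, df = 2),
         dtarget   = dnorm,
         dproposal = function(x) dt(x, df = 2))["ESS"])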

|          | m = 50 | m = 100 | m = 200 | m = 500 | m = 1000 |
|----------|--------|---------|---------|---------|----------|
| ESS(m)   | 34.13  | 77.27   | 138.88  | 351.97  | 711.67   |

   

2 \(\mathbf{Solution.}\qquad\) Below we implement an importance sampler with target \(\pi(x)\) the \(t_2\) density and proposal \(g(x)\) the standard Gaussian density, and calculate \(\mathrm{ESS}(m)\).
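Reusing the hypothetical is_ess helper sketched in question 1, with the roles of \(\pi\) and \(g\) swapped:

# Question 2: target pi = t with 2 degrees of freedom, proposal g = N(0, 1)
sapply(c(50, 100, 200, 500, 1000), function(m)
  is_ess(m,
         rproposal = rnorm,
         dtarget   = function(x) dt(x, df = 2),
         dproposal = dnorm)["ESS"])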

|          | m = 50 | m = 100 | m = 200 | m = 500 | m = 1000 |
|----------|--------|---------|---------|---------|----------|
| ESS(m)   | 8.61   | 49.36   | 16.63   | 60.09   | 110.28   |

In question 2, the effective sample sizes do not grow steadily with \(m\) and are quite small compared to those in question 1. Thus \(g(x) \propto\) the standard Gaussian density is not a sensible choice: the Gaussian proposal has lighter tails than the \(t_2\) target, so occasional draws in the tails receive very large weights and dominate the estimate.

   

3 \(\mathbf{Solution.}\qquad\) We see that the effective sample sizes are smaller when \(g(x) \propto\) the standard Gaussian density than when \(g(x) \propto\) the Student-t density with 2 degrees of freedom. A heavy-tailed proposal covers a light-tailed target well, but not the other way around, so the standard Gaussian is not a sensible proposal for the \(t_2\) target.

   

4a \(\mathbf{Solution.}\qquad\) Below we create a contour plot of the target density over \([-3,5] \times [0.01,5]\).
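A minimal sketch of how such a plot can be produced; variable names are ours, and \(\sigma^{-5}\) is written as \((\sigma^2)^{-5/2}\) since the density is a function of \((\mu, \sigma^2)\).

# Unnormalized target density, with sigma^(-5) written as (sigma^2)^(-5/2)
target <- function(mu, sigma2)
  sigma2^(-5/2) * exp(-((mu - 1)^4 + 4) / (2 * sigma2))

mu_grid     <- seq(-3, 5, length.out = 200)
sigma2_grid <- seq(0.01, 5, length.out = 200)
z <- outer(mu_grid, sigma2_grid, target)
contour(mu_grid, sigma2_grid, z,
        xlab = expression(mu), ylab = expression(sigma^2),
        main = "Unnormalized target density")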

   

4b \(\mathbf{Solution.}\qquad\) From the contour plot in part (a), the target is roughly unimodal, so we approximate it with a bivariate normal importance function. First we estimate the mean and scale of each coordinate; the estimates are printed below.

##     samp1     samp2 
## 0.9943434 1.1929828
##     samp1     samp2 
## 0.4923520 0.2941133

Next we implement the importance sampling procedure with the estimated bivariate normal as the importance function, and compare its effective sample sizes with those of grid sampling, which draws uniformly from the grid of values.
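A sketch of this comparison, reusing the hypothetical target function from the 4a sketch. It assumes a diagonal-covariance bivariate normal proposal; the rounded means and standard deviations below are our reading of the printed estimates, and grid sampling is approximated here by uniform draws over the rectangle.

mu_hat <- c(0.994, 1.193)   # proposal means for (mu, sigma^2), read from above
sd_hat <- c(0.492, 0.294)   # proposal standard deviations, read from above

ess_2d <- function(m, sampler, dproposal) {
  x  <- sampler(m)                        # m x 2 matrix of (mu, sigma^2) draws
  ok <- x[, 2] > 0                        # target is zero for sigma^2 <= 0
  w  <- numeric(m)
  w[ok] <- target(x[ok, 1], x[ok, 2]) / dproposal(x[ok, , drop = FALSE])
  m / (1 + (sd(w) / mean(w))^2)
}

bvn_sampler <- function(m) cbind(rnorm(m, mu_hat[1], sd_hat[1]),
                                 rnorm(m, mu_hat[2], sd_hat[2]))
bvn_density <- function(x) dnorm(x[, 1], mu_hat[1], sd_hat[1]) *
                           dnorm(x[, 2], mu_hat[2], sd_hat[2])

grid_sampler <- function(m) cbind(runif(m, -3, 5), runif(m, 0.01, 5))
grid_density <- function(x) rep(1 / (8 * 4.99), nrow(x))   # uniform on the box

sapply(c(50, 100, 200, 500, 1000), function(m)
  c(BVN  = ess_2d(m, bvn_sampler,  bvn_density),
    grid = ess_2d(m, grid_sampler, grid_density)))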

|          | m = 50 | m = 100 | m = 200 | m = 500 | m = 1000 |
|----------|--------|---------|---------|---------|----------|
| ESS BVN  | 17.466 | 0.602   | 123.257 | 363.632 | 596.39   |
| ESS grid | 10.25  | 16.912  | 32.491  | 76.368  | 158.379  |

Overall, the effective sample sizes are larger for the bivariate normal importance sampler than for grid sampling (the \(m=100\) run is an exception). Choosing an importance function that resembles the target distribution therefore improves sampling efficiency.

MCMC sampling for a hierarchical normal model.

Table 1 contains quality control measurements from 6 machines in a factory. Quality control measurements are expensive and time-consuming, so only 5 measurements were taken for each machine. In addition to the existing machines, we are interested in the quality of another machine (the seventh machine). Let \(y_{ij}\) denote the \(i^{\text{th}}\) quality control measurement of machine \(j\). Assume the following hierarchical Gaussian model with common variance for \(y=\left\{y_{ij}\right\}\): \[ \begin{aligned} P(\mu, \log \sigma, \log \tau) & \propto \tau \\ \theta_{j} | \mu, \tau^{2} & \sim N\left(\mu, \tau^{2}\right) \\ y_{ij} | \theta_{j}, \sigma^{2} & \sim N\left(\theta_{j}, \sigma^{2}\right), \quad i=1, \ldots, 5 ; j=1, \ldots, 6 \end{aligned} \]

  1. Write down the joint distribution \(P\left(\left\{y_{i j}\right\},\left\{\theta_{j}\right\}, \sigma^{2}, \mu, \tau^{2}\right)\)
  2. Gibbs Sampling
    1. Write down the posterior conditionals required for Gibbs sampling. i.e. \[ P\left(\theta_{j} | \mu, \sigma, \tau, y\right), P\left(\mu |\left\{\theta_{j}\right\}, \sigma, \tau, y\right), P\left(\sigma^{2} |\left\{\theta_{j}\right\}, \mu, \tau, y\right), P\left(\tau^{2} |\left\{\theta_{j}\right\}, \mu, \sigma, y\right) \]
    2. Implement a Gibbs sampler to sample from the posterior distribution \(P\left(\left\{\theta_{j}\right\}, \sigma^{2}, \mu, \tau^{2} | y\right) .\) Make sure you discard burn-in samples and your Gibbs sampler has converged by examining convergence diagnostics.
    3. Compute the posterior expectation of the mean of the quality measurements of the sixth machine.
    4. Suppose a new measurement is taken for the sixth machine. Give a \(95 \%\) posterior predictive interval for this new measurement.
    5. Give a \(95 \%\) posterior interval for the mean of the quality control measurements of the seventh machine.
    6. Is it reasonable to assume a model with common variance for this data? Why or why not?
  3. Metropolis-Hastings algorithm
    1. Compute the (unnormalized) posterior distribution of \(\mu, \log \sigma,\) and \(\log \tau\) by marginalizing out the \(\theta_{j}\) ’s in the joint distribution of the data and parameters.
    2. Derive and implement a Metropolis algorithm to sample from the posterior in (a) with a normal proposal distribution for each parameter centered at the previous draw with variances \(\eta^{2}=0.01,0.1,1 .\) Comment on how the variance of the proposal distribution affects sampling efficiency.
    3. Use your samples from (b) to obtain samples from the posterior distribution of each \(\theta_{j}, j=1, \ldots, 6 .\) Compare these samples with the posterior samples obtained using the Gibbs sampler in 2 .
  4. Optional (5 bonus points). Extend the model in 1. by adding a hierarchical model for the variances of the machine quality measurements. That is, assume \[ \begin{aligned} P\left(\mu, \tau, \sigma_{0}\right) & \propto 1 \\ \sigma_{j}^{2} | \sigma_{0}^{2} & \sim \operatorname{Inv.} \chi^{2}\left(\nu, \sigma_{0}^{2}\right) \\ \theta_{j} | \mu, \tau^{2} & \sim N\left(\mu, \tau^{2}\right) \\ y_{ij} | \theta_{j}, \sigma_{j}^{2} & \sim N\left(\theta_{j}, \sigma_{j}^{2}\right), \quad i=1, \ldots, 5 ; j=1, \ldots, 6 \end{aligned} \] where Inv. \(\chi^{2}\left(\nu, \sigma_{0}^{2}\right)\) is the scaled inverse chi-squared distribution with known degrees of freedom \(\nu\) and unknown scale \(\sigma_{0}^{2}\), and has density \[ p\left(x ; \nu, \sigma_{0}^{2}\right)=\frac{\left(\sigma_{0}^{2} \nu / 2\right)^{\nu / 2}}{\Gamma(\nu / 2)} \frac{\exp \left[\frac{-\nu \sigma_{0}^{2}}{2 x}\right]}{x^{1+\nu / 2}}, \quad x>0 \]
    1. Assume \(\nu=5 .\) Repeat (a) - (e) in 1. Which posterior intervals are wider and why do you think this is the case?

1a \(\mathbf{Solution.}\qquad\) From page 288 of BDA, the joint distribution of the data and parameters is \[ P\left(\left\{ y_{ij}\right\} ,\left\{ \theta_{j}\right\} ,\sigma^{2},\mu,\tau^{2}\right)\propto\tau\prod_{j=1}^{J}\mathrm{N}\left(\theta_{j}\mid\mu,\tau^{2}\right)\prod_{j=1}^{J}\prod_{i=1}^{n_{j}}\mathrm{N}\left(y_{ij}\mid\theta_{j},\sigma^{2}\right) \] where \(J=6\) machines and \(n_j=5\) measurements per machine.

   

2a \(\mathbf{Solution.}\qquad\) From pages 289-290 of BDA, the posterior conditionals are, i). Conditional posterior distribution of each \(\theta_{j}\): \[\begin{align*} \theta_{j}\mid\mu,\sigma,\tau,y & \sim\mathrm{N}\left(\hat{\theta}_{j},V_{\theta_{j}}\right) \end{align*}\]

where, \[\begin{align*} \hat{\theta}_{j} & =\frac{\frac{1}{\tau^{2}}\mu+\frac{n_{j}}{\sigma^{2}}\bar{y}_{.j}}{\frac{1}{\tau^{2}}+\frac{n_{j}}{\sigma^{2}}}\\ V_{\theta_{j}} & =\frac{1}{\frac{1}{\tau^{2}}+\frac{n_{j}}{\sigma^{2}}} \end{align*}\]

ii). Conditional posterior distribution of \(\mu\): \[\begin{align*} \mu\mid\theta,\sigma,\tau,y & \sim \mathrm{N}\left(\hat{\mu},\tau^{2}/J\right) \end{align*}\]

where, \[ \hat{\mu}=\frac{1}{J}\sum_{j=1}^{J}\theta_{j} \]

iii). Conditional posterior distribution of \(\sigma^{2}\): \[\begin{align*} \sigma^{2}\mid\theta,\mu,\tau,y & \sim\mathrm{Inv-}\chi^{2}\left(n,\hat{\sigma}^{2}\right) \end{align*}\]

where, \[ \hat{\sigma}^{2}=\frac{1}{n}\sum_{j=1}^{J}\sum_{i=1}^{n_{j}}\left(y_{ij}-\theta_{j}\right)^{2} \]

iv). Conditional posterior distribution of \(\tau^{2}:\) \[\begin{align*} \tau^{2}|\theta,\mu,\sigma,y & \sim\textrm{Inv}-\chi^{2}\left(J-1,\hat{\tau}^{2}\right) \end{align*}\]

where, \[ \hat{\tau}^{2}=\frac{1}{J-1}\sum_{j=1}^{J}\left(\theta_{j}-\mu\right)^{2} \]

   

2b \(\mathbf{Solution.}\qquad\) Below we implement the Gibbs sampler. We run 10,000 iterations and discard the first 1,000 samples as burn-in.

# 2. Gibbs Sampling -----------
# Part B ----------
J = 6
y = t(matrix(c(83, 92, 92, 46, 67,
             117, 109, 114, 104, 87,
             101, 93, 92, 86, 67,
             105, 119, 116, 102, 116,
             79, 97, 103, 79, 92,
             57, 92, 104, 77, 100), ncol= 6))
n = rep(ncol(y), nrow(y))
ybar = rowMeans(y)
s = apply(y, 1, sd)


# Gibbs update functions
theta_update <- function(mu, sigma, tau, J, n, ybar) {
  V_theta = 1 / (1 / tau^2 + n / sigma^2)
  theta_hat = V_theta * (mu / tau^2 + (n*ybar) / sigma^2)
  rnorm(J, mean = theta_hat, sd = sqrt(V_theta))
  }

mu_update <- function(theta, tau, J) {
  mu_hat = mean(theta)
  rnorm(1, mean = mu_hat, sd = tau / sqrt(J))
  }

sigma_update <- function(theta, ybar, s, n) {
  sigma2_hat = sum((n-1)*s^2 + n*(ybar - theta)^2) / sum(n)
  sigma2 = sum(n) * sigma2_hat / rchisq(1, df=sum(n))
  sqrt(sigma2)
  }

tau_update <- function(J, theta, mu) {
  tau2_hat = sum((theta - mu)^2) / (J - 1)
  tau2 = (J - 1) * tau2_hat / rchisq(1, df = J - 1)
  sqrt(tau2)
  }

# Single chain
iter = 10000

# Initialize values
theta_chain = matrix(0, iter, J)
mu_chain = rep(0, iter)
sigma_chain = rep(0, iter)
tau_chain = rep(0, iter)
theta = ybar
mu = mean(theta)
sigma = sqrt(mean(s^2))
tau = sd(ybar)

for (t in 1:iter) {
  # Update Parameters
  theta = theta_update(mu, sigma, tau, J, n, ybar)
  mu = mu_update(theta, tau, J)
  sigma = sigma_update(theta, ybar, s, n)
  tau = tau_update(J, theta, mu)
  
  # Store Estimates
  theta_chain[t,] = theta
  mu_chain[t] = mu
  sigma_chain[t] = sigma
  tau_chain[t] = tau
  }

burnIn = 1000
sims = list(theta_chain=theta_chain[(burnIn + 1):iter,], 
            mu_chain=mu_chain[(burnIn + 1):iter], 
            sigma_chain=sigma_chain[(burnIn + 1):iter], 
            tau_chain=tau_chain[(burnIn + 1):iter])

After sampling, we create trace plots of the Gibbs draws of each \(\theta_j\) after discarding the burn-in, as sketched below.
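A minimal sketch of the trace plots; the layout choices are ours.

# Trace plots of the post-burn-in Gibbs draws of each theta_j
par(mfrow = c(2, 3))
for (j in 1:J) {
  plot(sims$theta_chain[, j], type = "l",
       xlab = "iteration", ylab = bquote(theta[.(j)]),
       main = paste("Machine", j))
}
par(mfrow = c(1, 1))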

From the trace plots of each \(\theta_j\) above, it looks like the Gibbs sampler has converged.

   

2c \(\mathbf{Solution.}\qquad\) The posterior expectation of the mean of the quality measurements of the sixth machine is estimated by the average of the sampled \(\theta_6\) values, printed below.
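A one-line sketch of this computation, assuming the sims list defined above:

# Posterior mean of theta_6 (mean quality of the sixth machine)
mean(sims$theta_chain[, 6])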

## [1] 87.5313

   

2d \(\mathbf{Solution.}\qquad\) Below we find the 95% posterior predictive interval of a new measurement for machine 6.
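A sketch of the computation, drawing one predictive value \(\tilde{y} \sim N(\theta_6, \sigma^2)\) per retained posterior draw:

# Posterior predictive draws for a new measurement on machine 6
y_new <- rnorm(length(sims$sigma_chain),
               mean = sims$theta_chain[, 6],
               sd   = sims$sigma_chain)
quantile(y_new, c(0.025, 0.975))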

##      2.5%     97.5% 
##  55.84678 119.87343

   

2e \(\mathbf{Solution.}\qquad\) Below we compute a 95% posterior interval for the mean of the quality control measurements of the seventh machine.
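A sketch of the computation: since the seventh machine is exchangeable with the others, its mean is drawn as \(\theta_7 \sim N(\mu, \tau^2)\) for each retained \((\mu, \tau)\) draw.

# Mean of a new (seventh) machine: theta_7 ~ N(mu, tau^2) for each draw
theta7 <- rnorm(length(sims$mu_chain),
                mean = sims$mu_chain,
                sd   = sims$tau_chain)
quantile(theta7, c(0.025, 0.975))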

##      2.5%     97.5% 
##  42.01938 141.17807

   

3a \(\mathbf{Solution.}\qquad\) From page 328 of BDA, the unnormalized posterior distribution is,

\[p(\mu, \log \sigma, \log \tau | y) \propto \tau \prod_{j=1}^{J} \mathrm{N}\left(\hat{\theta}_{j} | \mu, \tau^{2}\right) \prod_{j=1}^{J} \prod_{i=1}^{n_{j}} \mathrm{N}\left(y_{i j} | \hat{\theta}_{j}, \sigma^{2}\right) \prod_{j=1}^{J} V_{\theta_{j}}^{1 / 2}\] where \(\hat{\theta}_{j}\) and \(V_{\theta_{j}}\) are as defined in 2(a).

   

3b \(\mathbf{Solution.}\qquad\) Below we define the functions used in the Metropolis algorithm; the code adapts the sample code on pages 600-601 of BDA.
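A minimal sketch of the pieces involved, working on the \((\mu, \log\sigma, \log\tau)\) scale. The function names are ours, and the update below proposes all three components jointly with independent normal increments of variance \(\eta^2\), which is one reading of the problem statement rather than the exact BDA code.

# Unnormalized log posterior of (mu, log sigma, log tau) from part (a)
log_post <- function(par, y, n, ybar) {
  mu <- par[1]; sigma <- exp(par[2]); tau <- exp(par[3])
  V_theta   <- 1 / (1 / tau^2 + n / sigma^2)
  theta_hat <- V_theta * (mu / tau^2 + n * ybar / sigma^2)
  log(tau) +
    sum(dnorm(theta_hat, mu, tau, log = TRUE)) +
    # y is 6 x 5 (machines in rows), so theta_hat recycles correctly over columns
    sum(dnorm(y, theta_hat, sigma, log = TRUE)) +
    0.5 * sum(log(V_theta))
}

# Random-walk Metropolis: all three components proposed jointly with
# independent N(0, eta2) increments centered at the previous draw
run_metropolis <- function(iter, eta2, start, y, n, ybar) {
  draws  <- matrix(NA, iter, 3,
                   dimnames = list(NULL, c("mu", "log_sigma", "log_tau")))
  cur    <- start
  cur_lp <- log_post(cur, y, n, ybar)
  accept <- 0
  for (t in 1:iter) {
    prop    <- cur + rnorm(3, sd = sqrt(eta2))
    prop_lp <- log_post(prop, y, n, ybar)
    if (log(runif(1)) < prop_lp - cur_lp) {          # Metropolis acceptance
      cur <- prop; cur_lp <- prop_lp; accept <- accept + 1
    }
    draws[t, ] <- cur
  }
  list(draws = draws, accept_rate = accept / iter)
}

# e.g. fit <- run_metropolis(10000, eta2 = 0.1,
#                            start = c(mean(ybar), log(mean(s)), log(sd(ybar))),
#                            y = y, n = n, ybar = ybar)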

Now we run the Metropolis algorithm for each value of \(\eta^2\). From the table below, \({\eta}^2 = 0.01\) produces the highest acceptance rate, while \({\eta}^2 = 0.1\) and \({\eta}^2 = 1\) produce somewhat lower acceptance rates. A high acceptance rate obtained with a very small proposal variance means the chain moves in tiny steps and explores the posterior slowly, so either \({\eta}^2 = 0.1\) or \({\eta}^2 = 1\) should be more efficient.

|                 | eta^2 = 0.01 | eta^2 = 0.1 | eta^2 = 1 |
|-----------------|--------------|-------------|-----------|
| Acceptance rate | 0.5484       | 0.5059      | 0.5064    |

   

3c \(\mathbf{Solution.}\qquad\) To obtain posterior samples of each \(\theta_j\) from the Metropolis draws, we sample \(\theta_j\) from its conditional posterior given each retained \((\mu, \sigma, \tau)\) draw, as sketched below. In the output that follows, the first two rows are the posterior means of each \(\theta_j\) from the Metropolis algorithm and from the Gibbs sampler; they are very similar. The next two rows are the corresponding posterior standard deviations of each \(\theta_j\), which are also very similar for the two samplers. The last two rows compare the effective sample sizes of the \(\theta_j\) draws; the effective sizes from the Metropolis algorithm are larger overall than those from the Gibbs sampler.
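A sketch of one way to recover the \(\theta_j\) draws: for each retained \((\mu, \sigma, \tau)\) draw, sample \(\theta_j\) from its conditional \(\mathrm{N}(\hat{\theta}_j, V_{\theta_j})\) from 2(a). The object fit below is the hypothetical output of the run_metropolis sketch above.

# Draw theta_j | mu, sigma, tau, y for every retained Metropolis draw
draws <- fit$draws                 # hypothetical: columns mu, log_sigma, log_tau
theta_mh <- t(apply(draws, 1, function(par) {
  mu <- par[1]; sigma <- exp(par[2]); tau <- exp(par[3])
  V_theta   <- 1 / (1 / tau^2 + n / sigma^2)
  theta_hat <- V_theta * (mu / tau^2 + n * ybar / sigma^2)
  rnorm(J, theta_hat, sqrt(V_theta))
}))
colnames(theta_mh) <- paste0("theta", 1:J)
colMeans(theta_mh)                 # compare with colMeans(sims$theta_chain)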

##    theta1    theta2    theta3    theta4    theta5    theta6 
##  79.74332 103.24473  88.93060 107.29052  90.71297  87.54264
##    theta1    theta2    theta3    theta4    theta5    theta6 
##  79.81077 103.24471  88.89138 107.32645  90.55743  87.53130
##   theta1   theta2   theta3   theta4   theta5   theta6 
## 6.724189 6.470693 6.053743 6.829975 6.100632 6.105462
##   theta1   theta2   theta3   theta4   theta5   theta6 
## 6.603069 6.558095 6.100432 6.874321 6.074387 6.161556
##   theta1   theta2   theta3   theta4   theta5   theta6 
## 4889.657 6286.884 9442.918 4950.360 9594.773 8250.623
##   theta1   theta2   theta3   theta4   theta5   theta6 
## 3920.079 5110.971 7549.588 3760.843 8454.377 6786.224