In class, we talked about the effective sample size (ESS) of an importance sampler with \(m\) samples, defined as \[ \mathrm{ESS}(m)=\frac{m}{1+\mathrm{cv}[w]^{2}} \]
where \(\mathrm{cv}[w]\) is the sample standard deviation of the weights \(\left(w_{1}, \ldots, w_{m}\right)\) divided by their sample mean.
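For reference, this definition can be wrapped in a small helper function; the ess() below is introduced here only for illustration, while the solutions that follow inline the same computation via var() and mean():
# Effective sample size of a vector of importance weights,
# using the sample coefficient of variation cv[w] = sd(w) / mean(w)
ess = function(w) {
  cv = sd(w) / mean(w)
  length(w) / (1 + cv^2)
}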
1 \(\mathbf{Solution.}\qquad\) Below we implement the importance sampler (drawing from a \(t_2\) proposal) and calculate ESS(m) for each value of \(m\).
## QUESTION 1 ----------
library(knitr)   # for kable()
m = c(50, 100, 200, 500, 1000)
# Importance sampler: standard normal target, t_2 proposal
t_importance = function(m) {
  rs = rt(m, 2)       # draw m samples from the t_2 proposal
  p = dnorm(rs)       # target density at the draws
  g = dt(rs, 2)       # proposal density (the density dt, not the CDF pt)
  w = p / g           # importance weights
  ESS = m / (1 + var(w) / (mean(w))^2)
  ESS
}
ESS_t = data.frame(matrix(sapply(m, t_importance), nrow = 1))
colnames(ESS_t) = paste0("m=",m)
rownames(ESS_t) = c("ESS(m)")
kable(ESS_t, digits = 2)
|        | m=50  | m=100 | m=200  | m=500  | m=1000 |
|--------|-------|-------|--------|--------|--------|
| ESS(m) | 34.13 | 77.27 | 138.88 | 351.97 | 711.67 |
2 \(\mathbf{Solution.}\qquad\) Below we implement the importance sampler with a standard normal proposal and calculate ESS(m) for each value of \(m\).
## QUESTION 2 ----------
# Importance sampler: t_2 target, standard normal proposal
norm_importance = function(m) {
  rs = rnorm(m)       # draw m samples from the standard normal proposal
  p = dt(rs, 2)       # target density at the draws (the density dt, not the CDF pt)
  g = dnorm(rs)       # proposal density
  w = p / g           # importance weights
  ESS = m / (1 + var(w) / (mean(w))^2)
  ESS
}
ESS_norm = data.frame(matrix(sapply(m, norm_importance), nrow = 1))
colnames(ESS_norm) = paste0("m=",m)
rownames(ESS_norm) = c("ESS(m)")
kable(ESS_norm, digits = 2)
|        | m=50 | m=100 | m=200 | m=500 | m=1000 |
|--------|------|-------|-------|-------|--------|
| ESS(m) | 8.61 | 49.36 | 16.63 | 60.09 | 110.28 |
In question 2, we see that the effective sample sizes do not grow in proportion to \(m\), and they are quite small compared to the effective sample sizes of question 1. Thus \(g(x) \propto\) standard Gaussian density is not a sensible choice.
3 \(\mathbf{Solution.}\qquad\) We see that the effective sample sizes are smaller when \(g(x) \propto\) standard Gaussian density than when \(g(x) \propto\) Student-\(t\) density with 2 degrees of freedom. The Gaussian proposal has much lighter tails than the \(t_2\) target, so the importance weights are unbounded and a few extreme draws dominate; thus \(g(x) \propto\) standard Gaussian density is not a sensible choice.
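To make this concrete, the weight function under the Gaussian proposal is \(w(x) = t_2(x)/\phi(x)\), which grows without bound in the tails. A quick numerical check (a minimal sketch, not part of the original solution):
# The importance weight dt(x, 2) / dnorm(x) explodes for large |x|,
# which inflates cv[w] and drives ESS(m) down.
x = c(0, 1, 2, 3, 4, 5)
round(dt(x, df = 2) / dnorm(x), 2)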
4a \(\mathbf{Solution.}\qquad\) Below we create a contour plot of the target density over a grid of \((\mu, \sigma)\) values.
## QUESTION 4 ----------
# Part A --------
library(ggplot2)
library(metR)    # for geom_text_contour()
# Define the (unnormalized) target density
target = function(mu, sigma) {
  (sigma^(-5)) * exp(-((mu - 1)^4 + 4) / (2 * sigma^2))
}
# Evaluate the target over a grid of mu and sigma values
n = 100
mu = seq(-3, 5, length.out = n)
sigma = seq(0.01, 5, length.out = n)
Cmu = rep(mu, times = length(sigma))     # mu varies fastest
Csigma = rep(sigma, each = length(mu))   # sigma varies slowest
z = mapply(target, Cmu, Csigma)
z_mat = matrix(z, nrow = n, ncol = n)    # rows index mu, columns index sigma
df = data.frame(mu = Cmu, sigma = Csigma, density = z)
ggplot(df, aes(mu, sigma, z = density)) +
  geom_contour(bins = 10, colour = "steelblue") +
  ggtitle("Contour Plot of Target Density") +
  metR::geom_text_contour(aes(z = density))
4b \(\mathbf{Solution.}\qquad\) The contour plot from part a) suggests a roughly unimodal target, so we approximate it with a bivariate normal importance function. First we estimate its mean and variance parameters from the grid.
library(mvtnorm)
# Estimate the mean and (diagonal) covariance of the bivariate normal importance function
set.seed(5)
# Locate the grid cell where the target is maximized
ind = which(z_mat == max(z_mat), arr.ind = TRUE)[1,]
# Resample mu along the sigma-slice through the mode, weighted by the target
samp1 = sample(mu, n, prob = z_mat[,ind[2]] / sum(z_mat[,ind[2]]),
               replace = TRUE)
# Resample sigma along the mu-slice through the mode, weighted by the target
samp2 = sample(sigma, n, prob = z_mat[ind[1],] / sum(z_mat[ind[1],]),
               replace = TRUE)
samples = cbind(samp1, samp2)
# Estimated mean vector and marginal variances for the bivariate normal
means = colMeans(samples)
Sigma = diag(cov(samples))
means
## samp1 samp2
## 0.9943434 1.1929828
Sigma
## samp1 samp2
## 0.4923520 0.2941133
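As a cross-check on these estimates, the same quantities could be obtained from density-weighted moments over the full grid rather than by resampling along slices through the mode. A minimal sketch, assuming the grid data frame df from part a) is still in scope (not part of the original solution):
# Density-weighted means and variances of mu and sigma over the whole grid
w_grid = df$density / sum(df$density)
mu_mean = sum(w_grid * df$mu)
sigma_mean = sum(w_grid * df$sigma)
mu_var = sum(w_grid * (df$mu - mu_mean)^2)
sigma_var = sum(w_grid * (df$sigma - sigma_mean)^2)
c(mu_mean, sigma_mean, mu_var, sigma_var)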
Next we implement the importance sampling procedure using the estimated bivariate normal distribution, and compare its effective sample sizes to those of a naive sampler that draws uniformly over the grid of values.
mvnorm_importance = function(m) {
  # Draw from the estimated bivariate normal importance function
  rs = rmvnorm(m, mean = means, sigma = diag(Sigma))
  p = target(rs[,1], rs[,2])                            # unnormalized target
  g = dmvnorm(rs, mean = means, sigma = diag(Sigma))    # importance density
  w = p / g
  ESS = m / (1 + var(w) / (mean(w))^2)
  ESS
}
ESS_import = data.frame(matrix(sapply(m, mvnorm_importance), nrow = 1))
colnames(ESS_import) = paste0("m=",m)
rownames(ESS_import) = c("ESS BVN")
kable(ESS_import, digits = 3)
|         | m=50   | m=100 | m=200   | m=500   | m=1000 |
|---------|--------|-------|---------|---------|--------|
| ESS BVN | 17.466 | 0.602 | 123.257 | 363.632 | 596.39 |
# Compare to a naive sampler that draws uniformly over the grid of values
grid_unif = function(m) {
  # Uniform draws over the (mu, sigma) grid ranges used in part a)
  rs = cbind(runif(m, -3, 5), runif(m, .01, 5))
  p = target(rs[,1], rs[,2])
  g = 1    # the uniform density is constant, and ESS is invariant to its scale
  w = p / g
  ESS = m / (1 + var(w) / (mean(w))^2)
  ESS
}
ESS_grid = data.frame(matrix(sapply(m, grid_unif), nrow = 1))
colnames(ESS_grid) = paste0("m=",m)
rownames(ESS_grid) = c("ESS grid")
kable(ESS_grid, digits = 3)
|          | m=50  | m=100  | m=200  | m=500  | m=1000  |
|----------|-------|--------|--------|--------|---------|
| ESS grid | 10.25 | 16.912 | 32.491 | 76.368 | 158.379 |
We see that, overall, the effective sample sizes are larger for the bivariate normal importance sampling procedure than for uniform grid sampling. Thus, choosing an importance function that resembles the target distribution increases sampling efficiency.
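Once the weights are available, they can also be used for estimation rather than only for ESS diagnostics. A minimal sketch of a self-normalized importance-sampling estimate of the posterior means of \(\mu\) and \(\sigma\), reusing the objects means, Sigma, and target from above (the zero-weight guard for \(\sigma \le 0\) is an addition, not part of the original solution):
set.seed(1)
# Self-normalized importance sampling estimate of E[mu] and E[sigma]
rs = rmvnorm(1000, mean = means, sigma = diag(Sigma))
w_raw = target(rs[, 1], rs[, 2]) / dmvnorm(rs, mean = means, sigma = diag(Sigma))
# The target is only defined for sigma > 0, so draws with sigma <= 0 get weight 0
w = ifelse(rs[, 2] > 0, w_raw, 0)
colSums(w * rs) / sum(w)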
Table 1 contains quality control measurements from 6 machines in a factory. Quality control measurements are expensive and time-consuming, so only 5 measurements were done for each machine. In addition to the existing machines, we are interested in the quality of another machine (the seventh machine). Let \(y_{i j}\) denote the \(i^{\text {th }}\) quality control measurement of machine \(j\). Assume the following hierarchical Gaussian model with common variance for \(y=\left\{y_{i j}\right\}\) \[ \begin{aligned} P(\mu, \log \sigma, \log \tau) & \propto \tau \\ \theta_{j} | \mu, \tau^{2} & \sim N\left(\mu, \tau^{2}\right) \\ y_{i j} | \theta_{j}, \sigma^{2} & \sim N\left(\theta_{j}, \sigma^{2}\right), \quad i=1, \ldots, 5 ; j=1, \ldots, 6 \end{aligned} \]
1a \(\mathbf{Solution.}\qquad\) From page 288 of BDA, the joint posterior density of all parameters is \[ P\left(\left\{ \theta_{j}\right\} ,\mu,\sigma^{2},\tau^{2}\mid\left\{ y_{ij}\right\} \right)\propto\tau\prod_{j=1}^{J}\mathrm{N}\left(\theta_{j}\mid\mu,\tau^{2}\right)\prod_{j=1}^{J}\prod_{i=1}^{n_{j}}\mathrm{N}\left(y_{ij}\mid\theta_{j},\sigma^{2}\right) \]
2a \(\mathbf{Solution.}\qquad\) From pages 289-290 of BDA, the posterior conditionals are, i). Conditional posterior distribution of each \(\theta_{j}\): \[\begin{align*} \theta_{j}\mid\mu,\sigma,\tau,y & \sim\mathrm{N}\left(\hat{\theta}_{j},V_{\theta_{j}}\right) \end{align*}\]
where, \[\begin{align*} \hat{\theta}_{j} & =\frac{\frac{1}{\tau^{2}}\mu+\frac{n_{j}}{\sigma^{2}}\bar{y}_{.j}}{\frac{1}{\tau^{2}}+\frac{n_{j}}{\sigma^{2}}}\\ V_{\theta_{j}} & =\frac{1}{\frac{1}{\tau^{2}}+\frac{n_{j}}{\sigma^{2}}} \end{align*}\]
ii). Conditional posterior distribution of \(\mu\): \[\begin{align*} \mu\mid\theta,\sigma,\tau,y & \sim \mathrm{N}\left(\hat{\mu},\tau^{2}/J\right) \end{align*}\]
where, \[ \hat{\mu}=\frac{1}{J}\sum_{j=1}^{J}\theta_{j} \]
iii). Conditional posterior distribution of \(\sigma^{2}\): \[\begin{align*} \sigma^{2}\mid\theta,\mu,\tau,y & \sim\mathrm{Inv-}\chi^{2}\left(n,\hat{\sigma}^{2}\right) \end{align*}\]
where, \[ \hat{\sigma}^{2}=\frac{1}{n}\sum_{j=1}^{J}\sum_{i=1}^{n_{j}}\left(y_{ij}-\theta_{j}\right)^{2} \]
iv). Conditional posterior distribution of \(\tau^{2}:\) \[\begin{align*} \tau^{2}|\theta,\mu,\sigma,y & \sim\textrm{Inv}-\chi^{2}\left(J-1,\hat{\tau}^{2}\right) \end{align*}\]
where, \[ \hat{\tau}^{2}=\frac{1}{J-1}\sum_{j=1}^{J}\left(\theta_{j}-\mu\right)^{2} \]
2b \(\mathbf{Solution.}\qquad\) Below we implement the Gibbs sampler using the conditional posteriors above. We run 10,000 iterations and discard the first 1,000 as burn-in.
# 2. Gibbs Sampling -----------
# Part B ----------
J = 6   # number of machines
# Quality control data: row j = machine j, columns = its 5 measurements
y = t(matrix(c(83, 92, 92, 46, 67,
               117, 109, 114, 104, 87,
               101, 93, 92, 86, 67,
               105, 119, 116, 102, 116,
               79, 97, 103, 79, 92,
               57, 92, 104, 77, 100), ncol = 6))
n = rep(ncol(y), nrow(y))   # measurements per machine (n_j = 5)
ybar = rowMeans(y)          # machine-level sample means
s = apply(y, 1, sd)         # machine-level sample standard deviations
# Gibbs update functions
# theta_j | mu, sigma, tau, y ~ N(theta_hat_j, V_theta_j)
theta_update <- function(mu, sigma, tau, J, n, ybar) {
  V_theta = 1 / (1 / tau^2 + n / sigma^2)
  theta_hat = V_theta * (mu / tau^2 + (n * ybar) / sigma^2)
  rnorm(J, mean = theta_hat, sd = sqrt(V_theta))
}
# mu | theta, tau, y ~ N(mean(theta), tau^2 / J)
mu_update <- function(theta, tau, J) {
  mu_hat = mean(theta)
  rnorm(1, mean = mu_hat, sd = tau / sqrt(J))
}
# sigma^2 | theta, y ~ scaled Inv-chi^2(n, sigma2_hat); uses the global n
sigma_update <- function(theta, ybar, s) {
  sigma2_hat = sum((n - 1) * s^2 + n * (ybar - theta)^2) / sum(n)
  sigma2 = sum(n) * sigma2_hat / rchisq(1, df = sum(n))
  sqrt(sigma2)
}
# tau^2 | theta, mu ~ scaled Inv-chi^2(J - 1, tau2_hat)
tau_update <- function(J, theta, mu) {
  tau2_hat = sum((theta - mu)^2) / (J - 1)
  tau2 = (J - 1) * tau2_hat / rchisq(1, df = J - 1)
  sqrt(tau2)
}
# Single chain
iter = 10000
# Initialize values
theta_chain = matrix(0, iter, J)
mu_chain = rep(0, iter)
sigma_chain = rep(0, iter)
tau_chain = rep(0, iter)
theta = ybar
mu = mean(theta)
sigma = sqrt(mean(s^2))
tau = sd(ybar)
for (t in 1:iter) {
# Update Parameters
theta = theta_update(mu, sigma, tau, J, n, ybar)
mu = mu_update(theta, tau, J)
sigma = sigma_update(theta, ybar, s)
tau = tau_update(J, theta, mu)
# Store Estimates
theta_chain[t,] = theta
mu_chain[t] = mu
sigma_chain[t] = sigma
tau_chain[t] = tau
}
burnIn = 1000
sims = list(theta_chain=theta_chain[burnIn:iter,],
mu_chain=mu_chain[burnIn:iter],
sigma_chain=sigma_chain[burnIn:iter],
tau_chain=tau_chain[burnIn:iter])
After sampling, we plot the Gibbs draws of each \(\theta_j\) after discarding the burn-in.
library(latex2exp)   # for TeX() labels
library(gridExtra)   # for grid.arrange()
# Trace plots of the Gibbs samples for each theta_j (3 x 2 grid)
plot_list = list()
df = data.frame(x = burnIn:iter, sims$theta_chain)
colnames(df) = c("x", paste0("theta", 1:J))
for (i in 1:J) {
  plot_list[[i]] = local({
    i = i   # capture the current loop index in the local environment
    ggplot(data = df, aes(x = x, y = .data[[paste0("theta", i)]])) +
      geom_line(colour = "steelblue") +
      ylab(TeX(paste0("$\\theta_", i, "$"))) +
      ggtitle(TeX(paste0("gibbs samples for ", "$\\theta_", i, "$"))) +
      theme(plot.title = element_text(hjust = 0.5))
  })
}
# Arrange all six trace plots in a grid
grid.arrange(grobs = plot_list, ncol = 2)
From the trace plots of each \(\theta_j\) above, it looks like the Gibbs sampler has converged.
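Beyond the visual inspection, a quick numerical check is possible. A minimal sketch assuming the coda package (not part of the original solution):
library(coda)
# Effective sample sizes and Geweke z-scores for the post-burn-in theta chains;
# large effective sample sizes and |z| values below about 2 are consistent with convergence
theta_mcmc = mcmc(sims$theta_chain)
effectiveSize(theta_mcmc)
geweke.diag(theta_mcmc)$z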
2c \(\mathbf{Solution.}\qquad\) The posterior expectation of the mean quality of the sixth machine is estimated by the average of the post-burn-in draws of \(\theta_6\), printed below.
# Part C ----------
mean(sims$theta_chain[,6])
## [1] 87.5313
2d \(\mathbf{Solution.}\qquad\) Below we find the 95% posterior predictive interval of a new measurement for machine 6.
# Part D ----------
probs = c(.025, .975)
n_keep = length(sims$sigma_chain)   # number of post-burn-in draws
# Posterior predictive draws of a new measurement on machine 6
trial_6 = rnorm(n_keep, mean = sims$theta_chain[,6], sd = sims$sigma_chain)
res = quantile(trial_6, probs = probs)
res
## 2.5% 97.5%
## 55.84678 119.87343
2e \(\mathbf{Solution.}\qquad\) Below we compute a 95% posterior interval for the mean quality of measurements of the seventh machine.
# Part E ----------
# Mean quality of a new (seventh) machine: theta_7 | mu, tau ~ N(mu, tau^2)
theta_7 = rnorm(n_keep, mean = sims$mu_chain, sd = sims$tau_chain)
res = quantile(theta_7, probs = probs)
res
## 2.5% 97.5%
## 42.01938 141.17807
# Store the Gibbs draws of theta for the comparison in part 3c
theta_Gibbs = data.frame(theta_chain)
colnames(theta_Gibbs) = paste0("theta", 1:J)
3a \(\mathbf{Solution.}\qquad\) From page 328 of BDA, the unnormalized marginal posterior density of \((\mu, \log \sigma, \log \tau)\) is obtained by dividing the joint posterior by the conditional \(p(\theta \mid \mu, \sigma, \tau, y)\) and evaluating at \(\theta = \hat{\theta}\), which gives
\[p(\mu, \log \sigma, \log \tau | y) \propto \tau \prod_{j=1}^{J} \mathrm{N}\left(\hat{\theta}_{j} | \mu, \tau^{2}\right) \prod_{j=1}^{J} \prod_{i=1}^{n_{j}} \mathrm{N}\left(y_{i j} | \hat{\theta}_{j}, \sigma^{2}\right) \prod_{j=1}^{J} V_{\theta_{j}}^{1 / 2}\]
3b \(\mathbf{Solution.}\qquad\) Below we define the functions used in the Metropolis algorithm. The code is an adaptation of the sample code on pages 600-601 of BDA.
# 3. Metropolis-Hasting algorithm ----------
# Part B ----------
# Unnormalized log marginal posterior of (mu, log sigma, log tau) from part a)
log_post <- function(tau, mu, sigma, n, y, ybar){
  V_theta = 1 / (1 / tau^2 + n / sigma^2)
  theta = V_theta * (mu / tau^2 + (n * ybar) / sigma^2)   # theta_hat
  # y has machines in rows, so recycling theta over y pairs y[j, i] with theta[j]
  log(tau) + sum(dnorm(theta, mu, tau, log = TRUE)) +
    sum(dnorm(y, theta, sigma, log = TRUE)) + 0.5 * sum(log(V_theta))
}
MH_update <- function(tau, mu, sigma, eta, n, y, ybar){
  accept_MH = 0
  # Symmetric Gaussian random-walk proposals with common scale eta
  tau_star = rnorm(1, tau, eta)
  mu_star = rnorm(1, mu, eta)
  sigma_star = rnorm(1, sigma, eta)
  # Reject proposals outside the support (tau and sigma must be positive)
  if (tau_star <= 0 | sigma_star <= 0) {
    return(list(tau = tau, mu = mu, sigma = sigma, accept_MH = accept_MH))
  }
  # Metropolis acceptance ratio
  r = exp(log_post(tau_star, mu_star, sigma_star, n, y, ybar) -
            log_post(tau, mu, sigma, n, y, ybar))
  if (runif(1) < min(r, 1)) {
    tau = tau_star
    mu = mu_star
    sigma = sigma_star
    accept_MH = 1
  }
  return(list(tau = tau, mu = mu, sigma = sigma, accept_MH = accept_MH))
}
build_MH_Chain = function(eta) {
# Initialize Lists
mu_chain = rep(0, iter)
sigma_chain = rep(0, iter)
tau_chain = rep(0, iter)
accept_chain = rep(0, iter)
# Initial Parameters
theta = ybar
mu = mean(theta)
sigma = sqrt(mean(s^2))
tau = sd(ybar)
for(t in 1:iter) {
res = MH_update(tau, mu, sigma, eta, n, y, ybar)
tau = res$tau
mu = res$mu
sigma = res$sigma
tau_chain[t] = tau
mu_chain[t] = mu
sigma_chain[t] = sigma
accept_chain[t] = res$accept_MH
}
list(mu_chain = mu_chain, sigma_chain = sigma_chain,
tau_chain = tau_chain, accept_chain = accept_chain)
}
Now we run the Metropolis algorithm for each value of \(\eta\). We can see from the table below that \({\eta}^2 = 0.01\) produces the highest acceptance rate, while \({\eta}^2 = 0.1\) and \({\eta}^2 = 1\) produce slightly lower acceptance rates. Since the acceptance rates are comparable, the larger proposal scales move further through the posterior per iteration, so either \({\eta}^2 = 0.1\) or \({\eta}^2 = 1\) will be more efficient.
# Implement Metropolis Algorithm for each eta
eta_vals = sqrt(c(0.01, 0.1, 1))
res = lapply(eta_vals, build_MH_Chain)
# Compute the acceptance rate for each value of eta
df = data.frame(matrix(NA, nrow = 1, ncol = 3))
colnames(df) = paste0("eta^2=", c(0.01, 0.1, 1))
for (i in 1:length(res)) {
chains = res[[i]]
accept_rate = sum(chains$accept_chain) / iter
df[1,i] = accept_rate
}
rownames(df) = "Accept rate"
kable(df, digits = 5)
|             | eta^2=0.01 | eta^2=0.1 | eta^2=1 |
|-------------|------------|-----------|---------|
| Accept rate | 0.5484     | 0.5059    | 0.5064  |
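Acceptance rate alone does not measure efficiency directly; mixing can also be compared through the effective sample size of the hyperparameter chains for each \(\eta\). A minimal sketch assuming the coda package (not part of the original solution):
library(coda)
# Effective sample size of the mu chain for each proposal scale;
# a larger value at a comparable acceptance rate indicates better mixing
ess_mu = sapply(res, function(chains) effectiveSize(mcmc(chains$mu_chain)))
names(ess_mu) = paste0("eta^2=", c(0.01, 0.1, 1))
ess_mu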
3c \(\mathbf{Solution.}\qquad\) The first two blocks of output give the posterior means of each \(\theta_j\) from the Metropolis algorithm and from the Gibbs sampler; they are very similar. The next two blocks give the posterior standard deviations of each \(\theta_j\), which are also very similar for the two samplers. The last two blocks compare the effective sample sizes for each \(\theta_j\); overall, the effective sample sizes for the Metropolis draws are larger than those from the Gibbs sampler.
# Part C ----------
# Derive theta draws from the MH hyperparameter chains
# (here the eta^2 = 1 run, one of the more efficient proposals from part b)
chains_MH = res[[3]]
theta_MH = mapply(theta_update, chains_MH$mu_chain, chains_MH$sigma_chain,
                  chains_MH$tau_chain, MoreArgs = list(J, n, ybar))
theta_MH = data.frame(t(theta_MH))
colnames(theta_MH) = paste0("theta", 1:J)
# Compare posterior means, standard deviations, and effective sample sizes
# of theta_j; each quantity is printed for the Metropolis draws first, then Gibbs
colMeans(theta_MH)
colMeans(theta_Gibbs)
apply(theta_MH, 2, sd)
apply(theta_Gibbs, 2, sd)
coda::effectiveSize(theta_MH)     # effective sample sizes (coda assumed here)
coda::effectiveSize(theta_Gibbs)
## theta1 theta2 theta3 theta4 theta5 theta6
## 79.74332 103.24473 88.93060 107.29052 90.71297 87.54264
## theta1 theta2 theta3 theta4 theta5 theta6
## 79.81077 103.24471 88.89138 107.32645 90.55743 87.53130
## theta1 theta2 theta3 theta4 theta5 theta6
## 6.724189 6.470693 6.053743 6.829975 6.100632 6.105462
## theta1 theta2 theta3 theta4 theta5 theta6
## 6.603069 6.558095 6.100432 6.874321 6.074387 6.161556
## theta1 theta2 theta3 theta4 theta5 theta6
## 4889.657 6286.884 9442.918 4950.360 9594.773 8250.623
## theta1 theta2 theta3 theta4 theta5 theta6
## 3920.079 5110.971 7549.588 3760.843 8454.377 6786.224
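To see where the difference in effective sample sizes comes from, one can inspect the autocorrelation of the draws. A minimal sketch using the theta_MH and theta_Gibbs objects above (not part of the original solution):
# Lag-1 autocorrelation of the theta_6 draws from each sampler;
# lower autocorrelation corresponds to a larger effective sample size
acf(theta_MH$theta6, plot = FALSE)$acf[2]
acf(theta_Gibbs$theta6, plot = FALSE)$acf[2]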