First, repeat question 3 parts a-c from problem set 1 using data.table for all computations and data manipulations.
Then, formulate and state a question answerable using the RECS data. Your question should be similar in scope to (one of) parts a-c above and should rely on one or more variables not previously used. Answer your question (using data.table) and provide supporting evidence in the form of nicely formatted graphs and/or tables.
What percent of homes have stucco construction as the major outside wall material within each division? Which division has the highest proportion? Which the lowest?
Solution:
Census Division | % Stucco Homes (95% CI) |
---|---|
Mountain South | 64.2% (55.4, 73.0) |
Pacific | 44.6% (41.3, 47.9) |
Mountain North | 16.6% (10.2, 23.0) |
South Atlantic | 10.6% ( 7.8, 13.4) |
West North Central | 4.9% ( 0.9, 8.8) |
West South Central | 3.0% ( 1.6, 4.3) |
Middle Atlantic | 2.1% ( 0.6, 3.5) |
New England | 1.2% ( 0.0, 2.8) |
East North Central | 0.7% ( 0.1, 1.2) |
East South Central | 0.4% ( 0.0, 1.2) |
The Mountain South division has the highest proportion of homes with stucco as the major outside wall material (64.2%), while the East South Central division has the lowest proportion (0.4%).
What is the average total electricity usage in kilowatt hours in each division? Answer the same question stratified by urban and rural status.
Solution:
Census Division | Average Electricity Usage, kWh/home (95% CI) |
---|---|
East South Central | 14,536 (13,320 - 15,752) |
West South Central | 14,324 (13,495 - 15,153) |
South Atlantic | 13,447 (12,904 - 13,989) |
West North Central | 10,524 (9,635 - 11,413) |
Mountain South | 10,442 (7,950 - 12,934) |
East North Central | 9,129 (8,730 - 9,528) |
Middle Atlantic | 8,465 (8,071 - 8,860) |
Mountain North | 8,384 (7,121 - 9,648) |
Pacific | 8,100 (7,750 - 8,450) |
New England | 7,515 (6,472 - 8,557) |
From the table above, we see that East South Central had the highest average total electricity usage, at 14,536 kWh per home. Below, we compute average total electricity usage stratified by urban and rural status; a sketch of the underlying data.table computation follows the second table.
Census Division | Rural, kWh/home (95% CI) | Urban, kWh/home (95% CI) |
---|---|---|
East South Central | 16,333 (14,088 - 18,578) | 13,747 (12,197 - 15,298) |
West South Central | 16,317 (14,067 - 18,567) | 13,629 (12,852 - 14,405) |
South Atlantic | 15,942 (14,839 - 17,045) | 12,725 (12,134 - 13,316) |
West North Central | 14,174 (12,608 - 15,740) | 9,467 (8,722 - 10,211) |
East North Central | 13,500 (12,022 - 14,978) | 7,980 (7,552 - 8,408) |
Pacific | 14,115 (12,001 - 16,229) | 7,349 (6,905 - 7,793) |
Middle Atlantic | 12,223 (10,633 - 13,814) | 7,987 (7,659 - 8,316) |
Mountain South | 8,610 (6,536 - 10,685) | 10,743 (8,178 - 13,308) |
Mountain North | 9,356 (5,698 - 13,014) | 8,099 (7,396 - 8,803) |
New England | 9,001 (6,766 - 11,236) | 6,964 (5,918 - 8,010) |
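A minimal sketch of the kind of data.table computation behind these point estimates, assuming the RECS 2015 microdata has been read into a data.table named `recs` with variables KWH (total electricity use), NWEIGHT (final sample weight), DIVISION, and UATYP10 (urban type); the replicate-weight calculations behind the confidence intervals are omitted:

```r
library(data.table)
# Weighted mean kWh by division (the recs object, variable names, and the
# urban definition below are assumptions about the RECS 2015 file)
avg_kwh = recs[, .(avg_kwh = sum(NWEIGHT * KWH) / sum(NWEIGHT)), by = DIVISION]
# Stratified by urban vs rural, treating urban areas and urban clusters as urban
avg_kwh_ur = recs[, .(avg_kwh = sum(NWEIGHT * KWH) / sum(NWEIGHT)),
                  by = .(DIVISION, urban = UATYP10 %in% c("U", "C"))]
```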
Which division has the largest disparity between urban and rural areas in terms of the proportion of homes with internet access?
Solution:
Census Division | Urban Internet Access, % (95% CI) | Rural Internet Access, % (95% CI) | Difference, % (95% CI) |
---|---|---|---|
Mountain South | 85.3% (81.3, 89.2) | 66.7% (58.3, 75.2) | 18.5% ( 7.2, 29.8) |
East South Central | 78.4% (70.5, 86.2) | 69.0% (63.5, 74.6) | 9.3% (-1.4, 20.1) |
West North Central | 88.0% (84.6, 91.4) | 80.3% (71.5, 89.2) | 7.7% (-2.5, 17.8) |
Mountain North | 87.4% (82.0, 92.9) | 81.9% (73.8, 90.0) | 5.5% (-6.2, 17.2) |
West South Central | 81.6% (76.4, 86.8) | 76.5% (72.1, 80.9) | 5.1% (-2.3, 12.5) |
Pacific | 88.7% (86.2, 91.2) | 85.3% (77.4, 93.1) | 3.4% (-4.5, 11.4) |
South Atlantic | 85.3% (82.6, 88.0) | 82.0% (76.3, 87.8) | 3.3% (-3.5, 10.1) |
New England | 87.6% (82.5, 92.6) | 85.8% (82.4, 89.2) | 1.8% (-2.5, 6.0) |
East North Central | 86.3% (83.8, 88.7) | 86.2% (81.6, 90.8) | 0.0% (-5.3, 5.4) |
Middle Atlantic | 89.3% (83.9, 94.8) | 91.3% (85.3, 97.3) | -1.9% (-9.1, 5.2) |
The table above sorts divisions by the difference between the urban and rural percentages of homes with internet access. Mountain South has the largest disparity, at 18.5 percentage points, which is nearly twice the second-largest disparity, found in the East South Central division.
What percent of homes have a gross annual income above $80,000 within each division? Which divisions have the highest and lowest proportions?
Solution:
Census Division | % of Homes w/ Income > $80K (95% CI) |
---|---|
Mountain North | 37.0% (29.9, 44.1) |
Pacific | 36.9% (32.1, 41.7) |
Middle Atlantic | 33.6% (26.6, 40.6) |
West North Central | 28.4% (25.7, 31.1) |
West South Central | 28.3% (22.2, 34.3) |
South Atlantic | 28.1% (24.0, 32.2) |
New England | 27.9% (20.3, 35.5) |
East North Central | 26.1% (22.5, 29.8) |
Mountain South | 20.0% (13.6, 26.4) |
East South Central | 15.0% ( 8.2, 21.8) |
The Mountain North and Pacific divisions have the highest proportions of homes with gross annual incomes above $80,000. Mountain North also appears significantly wealthier than its counterpart, the Mountain South division, as their confidence intervals do not overlap. The East South Central division has the lowest proportion, with a point estimate below that of East North Central.
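A sketch of a data.table computation for this question, assuming (as above) a `recs` data.table with NWEIGHT and DIVISION, and assuming the RECS 2015 income variable MONEYPY is coded in $20,000 brackets so that codes 5 and above correspond to incomes of $80,000 or more (an assumption worth checking against the codebook):

```r
library(data.table)
# Weighted percent of homes with income above $80,000, by division
# (confidence intervals would additionally use the BRR replicate weights)
inc80 = recs[, .(pct_over_80k = 100 * sum(NWEIGHT * (MONEYPY >= 5)) / sum(NWEIGHT)),
             by = DIVISION][order(-pct_over_80k)]
```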
In this question you will design a Monte Carlo study in R to compare the performance of different methods that adjust for multiple comparisons.
Write a function that accepts matrices X and beta and returns a p by mc_rep matrix of p-values for the hypothesis tests \(H_0: \beta_i = 0\) versus \(H_1: \beta_i \neq 0\).
In addition to X and beta your function should have arguments sigma (σ) and mc_rep controlling the error variance of Y and number of Monte Carlo replicates, respectively. Your function should solve the least-squares problems using the QR decomposition of X′X. This decomposition should only be computed once each time your function is called.
Solution:
# Monte Carlo function that returns a p x mc_rep matrix of p-values
MonteCarloPValues = function(X, beta, sigma2, mc_rep)
{
  n = dim(X)[1]
  p = dim(X)[2]
  p_values_m = matrix(0, nrow = p, ncol = mc_rep)
  # QR decomposition, computed only once per call
  QR = qr(X)
  for (i in 1:mc_rep)
  {
    # Generate Y from X, beta, and the error variance sigma2
    Y = rnorm(n, mean = X %*% beta, sd = sqrt(sigma2))
    # Calculate betaHat from the QR decomposition: solve R betaHat = Q'Y
    betaHat = solve(qr.R(QR), t(qr.Q(QR)) %*% Y)
    # Calculate Yhat from betaHat
    Yhat = X %*% betaHat
    # Calculate the residual variance estimate, sigmaHat squared
    sigmaHat2 = (1 / (n - p)) * sum((Y - Yhat) ^ 2)
    # Variance of each betaHat: diagonal of sigmaHat2 * (X'X)^(-1), using R'R = X'X
    VAR_BetaHat = diag(sigmaHat2 * chol2inv(qr.R(QR)))
    # Calculate Z-scores
    Z = betaHat / sqrt(VAR_BetaHat)
    # Two-sided p-values
    p_values = 2 * pnorm(abs(Z), lower.tail = FALSE)
    # Add the p-value vector to our matrix, one column per Monte Carlo replicate
    p_values_m[, i] = p_values
  }
  p_values_m
}
Choose Σ and σ as you like. Use the Cholesky factorization of Σ to generate a single X. Pass X, β, and σ to your function from the previous part.
Solution:
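A minimal sketch of one possible setup for this part. The specific choices of n, Σ, β, and σ² below are illustrative assumptions, though p = 100 coefficients with k = 10 nonzero entries and mc_rep = 100 are consistent with the 9,000 null p-values described in part e.

```r
set.seed(42)
n = 1000       # number of observations
p = 100        # number of coefficients
k = 10         # number of nonzero coefficients (placed first in beta)
mc_rep = 100   # number of Monte Carlo replicates
sigma2 = 1     # error variance

# Sigma: exchangeable covariance with modest correlation between predictors
Sigma = matrix(0.1, nrow = p, ncol = p)
diag(Sigma) = 1

# beta: first k coefficients nonzero, the remaining p - k equal to zero
beta = c(rep(1, k), rep(0, p - k))

# Generate a single X using the Cholesky factorization of Sigma:
# if Z has iid N(0, 1) entries, then Z %*% R has covariance R'R = Sigma
R = chol(Sigma)
X = matrix(rnorm(n * p), nrow = n, ncol = p) %*% R

# Pass X, beta, and the error variance to the function from the previous part
p_values_m = MonteCarloPValues(X, beta, sigma2, mc_rep)
```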
Write a function evaluate that takes a set of indices where β ≠ 0 and returns Monte Carlo estimates for the following quantities:

i. The family-wise error rate
ii. The false discovery rate
iii. The sensitivity
iv. The specificity
Solution:
# Assumes the coefficients with beta != 0 occupy the first k rows of p_values_m
evaluate = function(p_values_m, alpha, numBetasNotEqual0)
{
  # Initializers
  k = numBetasNotEqual0            # number of betas != 0
  numCols = ncol(p_values_m)
  numBetasEqual0 = nrow(p_values_m) - k
  testsList = c("FWER", "FDR", "sensitivity", "specificity")
  # Set up a data frame to collect the four quantities for each replicate
  multipleTests = data.frame(matrix(nrow = length(testsList), ncol = numCols))
  row.names(multipleTests) = testsList
  # Loop through each column (Monte Carlo replicate) of p-values
  for (j in 1:numCols)
  {
    # Collect the jth column of p-values
    p_values = p_values_m[, j]
    # Indices of the rejected (significant) hypotheses
    rejections = which(p_values < alpha)
    # Count false positives, true positives, and true negatives
    FalsePos = sum(rejections > k)
    TruePos = sum(rejections <= k)
    TrueNeg = numBetasEqual0 - FalsePos
    # Family-wise error: at least one false rejection in this replicate
    FWER = any(rejections > k)
    # False discovery proportion (defined as 0 when there are no rejections)
    FDR = ifelse(length(rejections) == 0, 0, FalsePos / length(rejections))
    # Sensitivity
    sensitivity = TruePos / k
    # Specificity
    specificity = TrueNeg / (TrueNeg + FalsePos)
    # Collect the results for this replicate
    multipleTests["FWER", j] = FWER
    multipleTests["FDR", j] = FDR
    multipleTests["sensitivity", j] = sensitivity
    multipleTests["specificity", j] = specificity
  }
  # Average each quantity across the Monte Carlo replicates
  rowMeans(multipleTests)
}
Apply your function from the previous part to the matrix of uncorrected P-values generated in part B. Use the function p.adjust() to correct these p-values for multiple comparisons using ‘Bonferroni’, ‘Holm’, ‘BH’ (Benjamini-Hochberg), and ‘BY’. Use your evaluate() function for each set of adjusted p-values.
Solution:
# Assumes the coefficients with beta != 0 occupy the first k rows of p_values_m
### Perform p-value adjustments and evaluate each method
correctionsList = c("bonferroni", "holm", "BH", "BY")
testsList = c("FWER", "FDR", "sensitivity", "specificity")
alpha = 0.05
# Set up a data frame to collect the four quantities for each correction method
multipleTests = data.frame(matrix(
  nrow = length(testsList),
  ncol = length(correctionsList)))
row.names(multipleTests) = testsList
colnames(multipleTests) = correctionsList
# Loop through each method in our corrections list and produce a table with
# the output from our evaluate function for each correction method
for (methodName in correctionsList)
{
  # Adjust the p-values within each Monte Carlo replicate (column)
  p_values = apply(p_values_m, 2, p.adjust, method = methodName)
  # Run tests with our evaluate function
  test = evaluate(p_values, alpha, k)
  # Update the corresponding column of multipleTests with the results
  multipleTests[, methodName] = test
}
# Evaluate the unadjusted p-values as a baseline
multipleTests[, "Unadjusted_Pvals"] = evaluate(p_values_m, alpha, k)
multipleTests = round(multipleTests, digits = 3)
# Transpose so that each row is one method
multipleTests = as.data.frame(t(multipleTests))
# Reorder rows to be displayed in the bar graphs
multipleTests = multipleTests[c("Unadjusted_Pvals", correctionsList), ]
# Add a column naming the method used for each row
multipleTests$Tests = c("Unadjusted", correctionsList)
multipleTests$Tests = factor(multipleTests$Tests,
                             levels = multipleTests$Tests)
Produce one or more nicely formatted graphs or tables reporting your results. Briefly discuss what you found.
Solution: In this example, we use mc_rep = 100 Monte Carlo replicates. Below is a histogram of the unadjusted p-values for the coefficients with \(\beta_i = 0\), giving 9,000 observations. Ideally, most if not all of these p-values would be non-significant. The histogram shows that they are approximately uniformly distributed over [0, 1], as expected under the null hypothesis.
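A short sketch of how such a histogram might be produced with base R graphics, assuming (as above) that the nonzero coefficients occupy the first k rows of p_values_m:

```r
# Pool the unadjusted p-values for the coefficients with beta_i = 0
null_p = as.vector(p_values_m[-(1:k), ])
hist(null_p, breaks = 20,
     main = "Unadjusted p-values for coefficients with beta = 0",
     xlab = "p-value")
```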
This is a bonus question related to problem 6 from the midterm. First, review the script written in Stata available here. In this question, you will work through various options for translating this analysis into R. You may submit all or some of these, but each part must be entirely correct to earn the points listed.
Write a translation using data.table for the computations.
library(data.table)
mtcarsDT = data.table(mtcars)
mtcarsDT = mtcarsDT[order(cyl), .(mpg, cyl, disp, hp, wt)]
# Center each independent variable within its cylinder group
beta_cylDT = mtcarsDT[, .(mpg,
                          disp_gc = disp - mean(disp),
                          hp_gc = hp - mean(hp),
                          wt_gc = wt - mean(wt)),
                      by = .(cyl)]
# Cross products and sums of squared (group-centered) deviations
Xpmg = beta_cylDT[, .(dispXmpg = sum(mpg * disp_gc),
                      vdisp = sum(disp_gc ^ 2),
                      hpXmpg = sum(mpg * hp_gc),
                      vhp = sum(hp_gc ^ 2),
                      wtXmpg = sum(mpg * wt_gc),
                      vwt = sum(wt_gc ^ 2)),
                  by = .(cyl)]
# Univariate regression coefficients by group: sum(y * x_c) / sum(x_c ^ 2)
betas = Xpmg[, .(beta_cyl_disp = dispXmpg / vdisp,
                 beta_cyl_hp = hpXmpg / vhp,
                 beta_cyl_wt = wtXmpg / vwt), by = .(cyl)]
Write a function to compute the univariate regression coefficient by group for arbitrary dependent, independent, and grouping variables. Use data.table for computations within your function. Test your function by showing it produces the same results as in part a.
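A sketch of one possible implementation; the function name `groupBeta` and its string-based interface are illustrative choices. Applying it to mpg against disp, hp, and wt grouped by cyl reproduces the part a results shown in the table below.

```r
library(data.table)
# Univariate regression coefficient of dep on ind within each level of group;
# variables are passed as character strings
groupBeta = function(data, dep, ind, group) {
  dt = as.data.table(data)
  dt[, .(beta = sum(get(dep) * (get(ind) - mean(get(ind)))) /
                sum((get(ind) - mean(get(ind))) ^ 2)),
     by = group]
}
# Example checks against part a:
# groupBeta(mtcars, "mpg", "disp", "cyl")
# groupBeta(mtcars, "mpg", "hp",   "cyl")
# groupBeta(mtcars, "mpg", "wt",   "cyl")
```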
Number of Cylinders | beta Estimate: Displacement | beta Estimate: Horsepower | beta Estimate: Weight |
---|---|---|---|
4 | -0.135 | -0.113 | -5.647 |
6 | 0.004 | -0.008 | -2.780 |
8 | -0.020 | -0.014 | -2.192 |
Compute the regression coefficients using the dplyr verb summarize_at().
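One possible approach, shown as a sketch (the centered-sum formulation mirrors part a; the pipeline below is illustrative rather than the only way to use summarize_at()):

```r
library(dplyr)
# Group-wise univariate coefficients of mpg on disp, hp, and wt, computed by
# applying the same lambda to each selected column within each cyl group
mtcars %>%
  group_by(cyl) %>%
  summarize_at(vars(disp, hp, wt),
               ~ sum(mpg * (. - mean(.))) / sum((. - mean(.)) ^ 2))
```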
Write a function similar to the one in part b to compute arbitrary univariate regression coefficients by group. Use dplyr for computations within your function. You should read the “Programming with dplyr” vignette at vignette(‘programming’, ‘dplyr’) before attempting this. Warning: this may be difficult to debug!
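A sketch of one way such a function might look, using the enquo()/!! pattern from the "Programming with dplyr" vignette; the function and argument names are illustrative, and the formula is the same centered-sum computation used in part a.

```r
library(dplyr)
# Univariate regression coefficient of dep on ind within each level of group;
# variables are passed unquoted and captured with enquo()
coef_by_group = function(df, dep, ind, group) {
  dep = enquo(dep)
  ind = enquo(ind)
  group = enquo(group)
  df %>%
    group_by(!!group) %>%
    summarize(beta = sum((!!dep) * ((!!ind) - mean(!!ind))) /
                     sum(((!!ind) - mean(!!ind)) ^ 2))
}
# Example check against part a:
# coef_by_group(mtcars, mpg, disp, cyl)
```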