First, repeat question 3 parts a-c from problem set 1 using data.table for all computations and data manipulations.
Then, formulate and state a question answerable using the RECS data. Your question should be similar in scope to (one of) parts a-c above and should rely on one or more variables not previously used. Answer your question (using data.table) and provide supporting evidence in the form of nicely formatted graphs and/or tables.
What percent of homes have stucco construction as the major outside wall material within each division? Which division has the highest proportion? Which the lowest?
Solution:
Census Division | % Stucco Homes (95% CI) |
---|---|
Mountain South | 64.2% (55.4, 73.0) |
Pacific | 44.6% (41.3, 47.9) |
Mountain North | 16.6% (10.2, 23.0) |
South Atlantic | 10.6% ( 7.8, 13.4) |
West North Central | 4.9% ( 0.9, 8.8) |
West South Central | 3.0% ( 1.6, 4.3) |
Middle Atlantic | 2.1% ( 0.6, 3.5) |
New England | 1.2% ( 0.0, 2.8) |
East North Central | 0.7% ( 0.1, 1.2) |
East South Central | 0.4% ( 0.0, 1.2) |
The Mountain South division has the highest proportion of homes with stucco as the major outside wall material (64.2%), while the East South Central division has the lowest proportion (0.4%).
What is the average total electricity usage in kilowatt hours in each division? Answer the same question stratified by urban and rural status.
Solution:
Census Division | Average Electricity Usage, kWh/home (95% CI) |
---|---|
East South Central | 14,536 (13,320 - 15,752) |
West South Central | 14,324 (13,495 - 15,153) |
South Atlantic | 13,447 (12,904 - 13,989) |
West North Central | 10,524 (9,635 - 11,413) |
Mountain South | 10,442 (7,950 - 12,934) |
East North Central | 9,129 (8,730 - 9,528) |
Middle Atlantic | 8,465 (8,071 - 8,860) |
Mountain North | 8,384 (7,121 - 9,648) |
Pacific | 8,100 (7,750 - 8,450) |
New England | 7,515 (6,472 - 8,557) |
From the table above, we see that East South Central had the highest average total electricity usage, at 14,536 kWh per home. Below, we compute average total electricity usage stratified by urban and rural status; a sketch of the underlying data.table computation follows the second table.
Census Division | Rural, kWh/home (95% CI) | Urban, kWh/home (95% CI) |
---|---|---|
East South Central | 16,333 (14,088 - 18,578) | 13,747 (12,197 - 15,298) |
West South Central | 16,317 (14,067 - 18,567) | 13,629 (12,852 - 14,405) |
South Atlantic | 15,942 (14,839 - 17,045) | 12,725 (12,134 - 13,316) |
West North Central | 14,174 (12,608 - 15,740) | 9,467 (8,722 - 10,211) |
East North Central | 13,500 (12,022 - 14,978) | 7,980 (7,552 - 8,408) |
Pacific | 14,115 (12,001 - 16,229) | 7,349 (6,905 - 7,793) |
Middle Atlantic | 12,223 (10,633 - 13,814) | 7,987 (7,659 - 8,316) |
Mountain South | 8,610 (6,536 - 10,685) | 10,743 (8,178 - 13,308) |
Mountain North | 9,356 (5,698 - 13,014) | 8,099 (7,396 - 8,803) |
New England | 9,001 (6,766 - 11,236) | 6,964 (5,918 - 8,010) |
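A minimal sketch of the kind of data.table computation behind these point estimates, assuming the RECS 2015 microdata has been read into a data.table named `recs` with variables KWH (total electricity use), NWEIGHT (final sample weight), DIVISION, and UATYP10 (urban type); the replicate-weight calculations behind the confidence intervals are omitted:

```r
library(data.table)
# Weighted mean kWh by division (the recs object, variable names, and the
# urban definition below are assumptions about the RECS 2015 file)
avg_kwh = recs[, .(avg_kwh = sum(NWEIGHT * KWH) / sum(NWEIGHT)), by = DIVISION]
# Stratified by urban vs rural, treating urban areas and urban clusters as urban
avg_kwh_ur = recs[, .(avg_kwh = sum(NWEIGHT * KWH) / sum(NWEIGHT)),
                  by = .(DIVISION, urban = UATYP10 %in% c("U", "C"))]
```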
Which division has the largest disparity between urban and rural areas in terms of the proportion of homes with internet access?
Solution:
Census Division | Urban Internet Access, % (95% CI) | Rural Internet Access, % (95% CI) | Difference, % (95% CI) |
---|---|---|---|
Mountain South | 85.3% (81.3, 89.2) | 66.7% (58.3, 75.2) | 18.5% ( 7.2, 29.8) |
East South Central | 78.4% (70.5, 86.2) | 69.0% (63.5, 74.6) | 9.3% (-1.4, 20.1) |
West North Central | 88.0% (84.6, 91.4) | 80.3% (71.5, 89.2) | 7.7% (-2.5, 17.8) |
Mountain North | 87.4% (82.0, 92.9) | 81.9% (73.8, 90.0) | 5.5% (-6.2, 17.2) |
West South Central | 81.6% (76.4, 86.8) | 76.5% (72.1, 80.9) | 5.1% (-2.3, 12.5) |
Pacific | 88.7% (86.2, 91.2) | 85.3% (77.4, 93.1) | 3.4% (-4.5, 11.4) |
South Atlantic | 85.3% (82.6, 88.0) | 82.0% (76.3, 87.8) | 3.3% (-3.5, 10.1) |
New England | 87.6% (82.5, 92.6) | 85.8% (82.4, 89.2) | 1.8% (-2.5, 6.0) |
East North Central | 86.3% (83.8, 88.7) | 86.2% (81.6, 90.8) | 0.0% (-5.3, 5.4) |
Middle Atlantic | 89.3% (83.9, 94.8) | 91.3% (85.3, 97.3) | -1.9% (-9.1, 5.2) |
The table above sorts divisions by the difference between the urban and rural percentages of homes with internet access. Mountain South has the largest disparity, at 18.5 percentage points, which is nearly twice the second-largest disparity, found in the East South Central division.
What percent of homes have a gross annual income above $80,000 within each division? Which divisions have the highest and lowest proportions?
Solution:
Census Division | % of Homes w/ Income > $80K (95% CI) |
---|---|
Mountain North | 37.0% (29.9, 44.1) |
Pacific | 36.9% (32.1, 41.7) |
Middle Atlantic | 33.6% (26.6, 40.6) |
West North Central | 28.4% (25.7, 31.1) |
West South Central | 28.3% (22.2, 34.3) |
South Atlantic | 28.1% (24.0, 32.2) |
New England | 27.9% (20.3, 35.5) |
East North Central | 26.1% (22.5, 29.8) |
Mountain South | 20.0% (13.6, 26.4) |
East South Central | 15.0% ( 8.2, 21.8) |
The Mountain North and Pacific divisions have the highest proportions of homes with gross annual incomes above $80,000. Mountain North also appears significantly wealthier than its counterpart, the Mountain South division, as their confidence intervals do not overlap. The East South Central division has the lowest proportion, with a point estimate below that of East North Central.
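A sketch of a data.table computation for this question, assuming (as above) a `recs` data.table with NWEIGHT and DIVISION, and assuming the RECS 2015 income variable MONEYPY is coded in $20,000 brackets so that codes 5 and above correspond to incomes of $80,000 or more (an assumption worth checking against the codebook):

```r
library(data.table)
# Weighted percent of homes with income above $80,000, by division
# (confidence intervals would additionally use the BRR replicate weights)
inc80 = recs[, .(pct_over_80k = 100 * sum(NWEIGHT * (MONEYPY >= 5)) / sum(NWEIGHT)),
             by = DIVISION][order(-pct_over_80k)]
```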
In this question you will design a Monte Carlo study in R to compare the performance of different methods that adjust for multiple comparisons.
Write a function that accepts matrices X and beta and returns a p by mc_rep matrix of p-values for the hypothesis tests \(H_0: \beta_i = 0\) versus \(H_1: \beta_i \neq 0\).
In addition to X and beta your function should have arguments sigma (σ) and mc_rep controlling the error variance of Y and number of Monte Carlo replicates, respectively. Your function should solve the least-squares problems using the QR decomposition of X′X. This decomposition should only be computed once each time your function is called.
Solution:
# Monte Carlo function that returns a p x mc_rep matrix of p-values
MonteCarloPValues = function(X, beta, sigma2, mc_rep)
{
  n = dim(X)[1]
  p = dim(X)[2]
  p_values_m = matrix(0, nrow = p, ncol = mc_rep)
  # QR decomposition, computed only once per call
  QR = qr(X)
  for (i in 1:mc_rep)
  {
    # Generate Y from X, beta, and the error variance sigma2
    Y = rnorm(n, mean = X %*% beta, sd = sqrt(sigma2))
    # Calculate betaHat from the QR decomposition: solve R betaHat = Q'Y
    betaHat = solve(qr.R(QR), t(qr.Q(QR)) %*% Y)
    # Calculate Yhat from betaHat
    Yhat = X %*% betaHat
    # Calculate the residual variance estimate, sigmaHat squared
    sigmaHat2 = (1 / (n - p)) * sum((Y - Yhat) ^ 2)
    # Variance of each betaHat: diagonal of sigmaHat2 * (X'X)^(-1), using R'R = X'X
    VAR_BetaHat = diag(sigmaHat2 * chol2inv(qr.R(QR)))
    # Calculate Z-scores
    Z = betaHat / sqrt(VAR_BetaHat)
    # Two-sided p-values
    p_values = 2 * pnorm(abs(Z), lower.tail = FALSE)
    # Add the p-value vector to our matrix, one column per Monte Carlo replicate
    p_values_m[, i] = p_values
  }
  p_values_m
}
Choose Σ and σ as you like. Use the Cholesky factorization of Σ to generate a single X. Pass X, β, and σ to your function from the previous part.
Solution:
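A minimal sketch of one possible setup for this part. The specific choices of n, Σ, β, and σ² below are illustrative assumptions, though p = 100 coefficients with k = 10 nonzero entries and mc_rep = 100 are consistent with the 9,000 null p-values described in part e.

```r
set.seed(42)
n = 1000       # number of observations
p = 100        # number of coefficients
k = 10         # number of nonzero coefficients (placed first in beta)
mc_rep = 100   # number of Monte Carlo replicates
sigma2 = 1     # error variance

# Sigma: exchangeable covariance with modest correlation between predictors
Sigma = matrix(0.1, nrow = p, ncol = p)
diag(Sigma) = 1

# beta: first k coefficients nonzero, the remaining p - k equal to zero
beta = c(rep(1, k), rep(0, p - k))

# Generate a single X using the Cholesky factorization of Sigma:
# if Z has iid N(0, 1) entries, then Z %*% R has covariance R'R = Sigma
R = chol(Sigma)
X = matrix(rnorm(n * p), nrow = n, ncol = p) %*% R

# Pass X, beta, and the error variance to the function from the previous part
p_values_m = MonteCarloPValues(X, beta, sigma2, mc_rep)
```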
Write a function evaluate that takes a set of indices where β ≠ 0 and returns Monte Carlo estimates for the following quantities:

i. The family-wise error rate
ii. The false discovery rate
iii. The sensitivity
iv. The specificity
Solution:
# Assumes the coefficients with beta != 0 occupy the first k rows of p_values_m
evaluate = function(p_values_m, alpha, numBetasNotEqual0)
{
  # Initializers
  k = numBetasNotEqual0            # number of betas != 0
  numCols = ncol(p_values_m)
  numBetasEqual0 = nrow(p_values_m) - k
  testsList = c("FWER", "FDR", "sensitivity", "specificity")
  # Set up a data frame to collect the four quantities for each replicate
  multipleTests = data.frame(matrix(nrow = length(testsList), ncol = numCols))
  row.names(multipleTests) = testsList
  # Loop through each column (Monte Carlo replicate) of p-values
  for (j in 1:numCols)
  {
    # Collect the jth column of p-values
    p_values = p_values_m[, j]
    # Indices of the rejected (significant) hypotheses
    rejections = which(p_values < alpha)
    # Count false positives, true positives, and true negatives
    FalsePos = sum(rejections > k)
    TruePos = sum(rejections <= k)
    TrueNeg = numBetasEqual0 - FalsePos
    # Family-wise error: at least one false rejection in this replicate
    FWER = any(rejections > k)
    # False discovery proportion (defined as 0 when there are no rejections)
    FDR = ifelse(length(rejections) == 0, 0, FalsePos / length(rejections))
    # Sensitivity
    sensitivity = TruePos / k
    # Specificity
    specificity = TrueNeg / (TrueNeg + FalsePos)
    # Collect the results for this replicate
    multipleTests["FWER", j] = FWER
    multipleTests["FDR", j] = FDR
    multipleTests["sensitivity", j] = sensitivity
    multipleTests["specificity", j] = specificity
  }
  # Average each quantity across the Monte Carlo replicates
  rowMeans(multipleTests)
}
Apply your function from the previous part to the matrix of uncorrected P-values generated in part B. Use the function p.adjust() to correct these p-values for multiple comparisons using ‘Bonferroni’, ‘Holm’, ‘BH’ (Benjamini-Hochberg), and ‘BY’. Use your evaluate() function for each set of adjusted p-values.
Solution:
# Assumes the coefficients with beta != 0 occupy the first k rows of p_values_m
### Perform p-value adjustments and evaluate each method
correctionsList = c("bonferroni", "holm", "BH", "BY")
testsList = c("FWER", "FDR", "sensitivity", "specificity")
alpha = 0.05
# Set up a data frame to collect the four quantities for each correction method
multipleTests = data.frame(matrix(
  nrow = length(testsList),
  ncol = length(correctionsList)))
row.names(multipleTests) = testsList
colnames(multipleTests) = correctionsList
# Loop through each method in our corrections list and produce a table with
# the output from our evaluate function for each correction method
for (methodName in correctionsList)
{
  # Adjust the p-values within each Monte Carlo replicate (column)
  p_values = apply(p_values_m, 2, p.adjust, method = methodName)
  # Run tests with our evaluate function
  test = evaluate(p_values, alpha, k)
  # Update the corresponding column of multipleTests with the results
  multipleTests[, methodName] = test
}
# Evaluate the unadjusted p-values as a baseline
multipleTests[, "Unadjusted_Pvals"] = evaluate(p_values_m, alpha, k)
multipleTests = round(multipleTests, digits = 3)
# Transpose so that each row is one method
multipleTests = as.data.frame(t(multipleTests))
# Reorder rows to be displayed in the bar graphs
multipleTests = multipleTests[c("Unadjusted_Pvals", correctionsList), ]
# Add a column naming the method used for each row
multipleTests$Tests = c("Unadjusted", correctionsList)
multipleTests$Tests = factor(multipleTests$Tests,
                             levels = multipleTests$Tests)
Produce one or more nicely formatted graphs or tables reporting your results. Briefly discuss what you found.
Solution: In this example, we use mc_rep = 100 Monte Carlo replicates. Below is a histogram of the unadjusted p-values for the coefficients with \(\beta_i = 0\), giving 9,000 observations. Ideally, most if not all of these p-values would be non-significant. The histogram shows that they are approximately uniformly distributed over [0, 1], as expected under the null hypothesis.
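A short sketch of how such a histogram might be produced with base R graphics, assuming (as above) that the nonzero coefficients occupy the first k rows of p_values_m:

```r
# Pool the unadjusted p-values for the coefficients with beta_i = 0
null_p = as.vector(p_values_m[-(1:k), ])
hist(null_p, breaks = 20,
     main = "Unadjusted p-values for coefficients with beta = 0",
     xlab = "p-value")
```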
This is a bonus question related to problem 6 from the midterm. First, review the script written in Stata available here. In this question, you will work through various options for translating this analysis into R. You may submit all or some of these, but each part must be entirely correct to earn the points listed.
Write a translation using data.table for the computations.
library(data.table)
mtcarsDT = data.table(mtcars)
mtcarsDT = mtcarsDT[order(cyl), .(mpg, cyl, disp, hp, wt)]
# Center each independent variable within its cylinder group
beta_cylDT = mtcarsDT[, .(mpg,
                          disp_gc = disp - mean(disp),
                          hp_gc = hp - mean(hp),
                          wt_gc = wt - mean(wt)),
                      by = .(cyl)]
# Cross products and sums of squared (group-centered) deviations
Xpmg = beta_cylDT[, .(dispXmpg = sum(mpg * disp_gc),
                      vdisp = sum(disp_gc ^ 2),
                      hpXmpg = sum(mpg * hp_gc),
                      vhp = sum(hp_gc ^ 2),
                      wtXmpg = sum(mpg * wt_gc),
                      vwt = sum(wt_gc ^ 2)),
                  by = .(cyl)]
# Univariate regression coefficients by group: sum(y * x_c) / sum(x_c ^ 2)
betas = Xpmg[, .(beta_cyl_disp = dispXmpg / vdisp,
                 beta_cyl_hp = hpXmpg / vhp,
                 beta_cyl_wt = wtXmpg / vwt), by = .(cyl)]
Write a function to compute the univariate regression coefficient by group for arbitrary dependent, independent, and grouping variables. Use data.table for computations within your function. Test your function by showing it produces the same results as in part a.
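A sketch of one possible implementation; the function name `groupBeta` and its string-based interface are illustrative choices. Applying it to mpg against disp, hp, and wt grouped by cyl reproduces the part a results shown in the table below.

```r
library(data.table)
# Univariate regression coefficient of dep on ind within each level of group;
# variables are passed as character strings
groupBeta = function(data, dep, ind, group) {
  dt = as.data.table(data)
  dt[, .(beta = sum(get(dep) * (get(ind) - mean(get(ind)))) /
                sum((get(ind) - mean(get(ind))) ^ 2)),
     by = group]
}
# Example checks against part a:
# groupBeta(mtcars, "mpg", "disp", "cyl")
# groupBeta(mtcars, "mpg", "hp",   "cyl")
# groupBeta(mtcars, "mpg", "wt",   "cyl")
```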
Number of Cylinders | beta Estimate: Displacement | beta Estimate: Horsepower | beta Estimate: Weight |
---|---|---|---|
4 | -0.135 | -0.113 | -5.647 |
6 | 0.004 | -0.008 | -2.780 |
8 | -0.020 | -0.014 | -2.192 |
Compute the regression coefficients using the dplyr verb summarize_at().
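One possible approach, shown as a sketch (the centered-sum formulation mirrors part a; the pipeline below is illustrative rather than the only way to use summarize_at()):

```r
library(dplyr)
# Group-wise univariate coefficients of mpg on disp, hp, and wt, computed by
# applying the same lambda to each selected column within each cyl group
mtcars %>%
  group_by(cyl) %>%
  summarize_at(vars(disp, hp, wt),
               ~ sum(mpg * (. - mean(.))) / sum((. - mean(.)) ^ 2))
```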
Write a function similar to the one in part b to compute arbitrary univariate regression coefficients by group. Use dplyr for computations within your function. You should read the “Programming with dplyr” vignette at vignette(‘programming’, ‘dplyr’) before attempting this. Warning: this may be difficult to debug!
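A sketch of one way such a function might look, using the enquo()/!! pattern from the "Programming with dplyr" vignette; the function and argument names are illustrative, and the formula is the same centered-sum computation used in part a.

```r
library(dplyr)
# Univariate regression coefficient of dep on ind within each level of group;
# variables are passed unquoted and captured with enquo()
coef_by_group = function(df, dep, ind, group) {
  dep = enquo(dep)
  ind = enquo(ind)
  group = enquo(group)
  df %>%
    group_by(!!group) %>%
    summarize(beta = sum((!!dep) * ((!!ind) - mean(!!ind))) /
                     sum(((!!ind) - mean(!!ind)) ^ 2))
}
# Example check against part a:
# coef_by_group(mtcars, mpg, disp, cyl)
```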