Question 1

First, repeat question 3 parts a-c from problem set 1 using data.table for all computations and data manipulations.

Then, formulate and state a question answerable using the RECS data. Your question should be similar in scope to (one of) parts a-c above and should rely on one or more variables not previously used. Answer your question (using data.table) and provide supporting evidence in the form of nicely formatted graphs and/or tables.

Part a.

What percent of homes have stucco construction as the major outside wall material within each division? Which division has the highest proportion? Which the lowest?

Solution:

Table 1. Proportion of homes with stucco construction within each census division in 2015. Estimates are based on the residential energy consumption survey.
Census Division % Stucco Homes (95% CI)
Mountain South 64.2% (55.4, 73.0)
Pacific 44.6% (41.3, 47.9)
Mountain North 16.6% (10.2, 23.0)
South Atlantic 10.6% ( 7.8, 13.4)
West North Central 4.9% ( 0.9, 8.8)
West South Central 3.0% ( 1.6, 4.3)
Middle Atlantic 2.1% ( 0.6, 3.5)
New England 1.2% ( 0.0, 2.8)
East North Central 0.7% ( 0.1, 1.2)
East South Central 0.4% ( 0.0, 1.2)
**Figure 1.** Estimated percent of homes within each census division 
with major wall type of stucco.

Figure 1. Estimated percent of homes within each census division with major wall type of stucco.

Solution:

The Mountain South division has 64.2% of homes built with stucco while East South Central division has the lowest proportion of homes with stucco with only 0.4%.

Part b.

What is the average total electricity usage in kilowatt hours in each division? Answer the same question stratified by urban and rural status.

Solution:

Table 2. Average annual electricity utilization by Census Division in kwh/home.
Census Division Average Electricity Usage, kwh/home (95% CI)
East South Central 14,536, (13,320 - 15,752)
West South Central 14,324, (13,495 - 15,153)
South Atlantic 13,447, (12,904 - 13,989)
West North Central 10,524, ( 9,635 - 11,413)
Mountain South 10,442, ( 7,950 - 12,934)
East North Central 9,129, ( 8,730 - 9,528)
Middle Atlantic 8,465, ( 8,071 - 8,860)
Mountain North 8,384, ( 7,121 - 9,648)
Pacific 8,100, ( 7,750 - 8,450)
New England 7,515, ( 6,472 - 8,557)
**Figure 2.** Estimated average annual electricity usage in khw/home for
each of 10 census divisions.

Figure 2. Estimated average annual electricity usage in khw/home for each of 10 census divisions.

From the chart above, we see East South Central had the highest average total electricity usage at 14,536 kwh. Below, we compute average total electricity usage stratified by urban and rural status.

Table 3. Average electricity utilization in kwh per home for urban and rural areas witihin each census division.
Census Division Rural, kwh/home (95% CI) Urban, kwh/home (95% CI)
East South Central 16,333, (14,088 - 18,578) 13,747, (12,197 - 15,298)
West South Central 16,317, (14,067 - 18,567) 13,629, (12,852 - 14,405)
South Atlantic 15,942, (14,839 - 17,045) 12,725, (12,134 - 13,316)
West North Central 14,174, (12,608 - 15,740) 9,467, ( 8,722 - 10,211)
East North Central 13,500, (12,022 - 14,978) 7,980, ( 7,552 - 8,408)
Pacific 14,115, (12,001 - 16,229) 7,349, ( 6,905 - 7,793)
Middle Atlantic 12,223, (10,633 - 13,814) 7,987, ( 7,659 - 8,316)
Mountain South 8,610, ( 6,536 - 10,685) 10,743, ( 8,178 - 13,308)
Mountain North 9,356, ( 5,698 - 13,014) 8,099, ( 7,396 - 8,803)
New England 9,001, ( 6,766 - 11,236) 6,964, ( 5,918 - 8,010)
**Figure 3.** Estimated average annual electricity usage in khw/home for
rural and urban areas in each of 10 census divisions.

Figure 3. Estimated average annual electricity usage in khw/home for rural and urban areas in each of 10 census divisions.

Part c.

Which division has the largest disparity between urban and rural areas in terms of the proportion of homes with internet access?

Solution:

Table 4. Urban and rural disparity in internet access for the ten US Census Division in 2015.
Census Division Urban Internet Access, % (95% CI) Rural Internet Access, % (95% CI) Difference, % (95% CI)
Mountain South 85.3% (81.3, 89.2) 66.7% (58.3, 75.2) 18.5% ( 7.2, 29.8)
East South Central 78.4% (70.5, 86.2) 69.0% (63.5, 74.6) 9.3% (-1.4, 20.1)
West North Central 88.0% (84.6, 91.4) 80.3% (71.5, 89.2) 7.7% (-2.5, 17.8)
Mountain North 87.4% (82.0, 92.9) 81.9% (73.8, 90.0) 5.5% (-6.2, 17.2)
West South Central 81.6% (76.4, 86.8) 76.5% (72.1, 80.9) 5.1% (-2.3, 12.5)
Pacific 88.7% (86.2, 91.2) 85.3% (77.4, 93.1) 3.4% (-4.5, 11.4)
South Atlantic 85.3% (82.6, 88.0) 82.0% (76.3, 87.8) 3.3% (-3.5, 10.1)
New England 87.6% (82.5, 92.6) 85.8% (82.4, 89.2) 1.8% (-2.5, 6.0)
East North Central 86.3% (83.8, 88.7) 86.2% (81.6, 90.8) 0.0% (-5.3, 5.4)
Middle Atlantic 89.3% (83.9, 94.8) 91.3% (85.3, 97.3) -1.9% (-9.1, 5.2)

In the table above, we show the division with the largest absolute difference between Urban and Rural percentage. Mountain South has the largest disparity between percent of Urban and Rural Internet Users at 18.5%. This figure is nearly twice as large as the second largest disparity which is in East South Central division.

**Figure 4.** Estimated Urban and rural disparity in internet access in 
each of 10 US Census Division in 2015.

Figure 4. Estimated Urban and rural disparity in internet access in each of 10 US Census Division in 2015.

Part d.

What percent of homes have gross annual income higher than 80K within each division? Which divisions have the highest and lowest proportions?

Solution:

Table 5. Proportion of homes with income greater than 80K by Census Division.
Census Division % of Homes w/ Income > 80K, (95% CI)
Mountain North 37.0% (29.9, 44.1)
Pacific 36.9% (32.1, 41.7)
Middle Atlantic 33.6% (26.6, 40.6)
West North Central 28.4% (25.7, 31.1)
West South Central 28.3% (22.2, 34.3)
South Atlantic 28.1% (24.0, 32.2)
New England 27.9% (20.3, 35.5)
East North Central 26.1% (22.5, 29.8)
Mountain South 20.0% (13.6, 26.4)
East South Central 15.0% ( 8.2, 21.8)

Pacific and Mountain North divisions seem to have the highest proportion of homes with incomes greater than 80K. Mountain North is also significantly wealthier than its counterpart, the Mountain South division. East North Central division has a higher percent estimate than East South Central, where East South Central also happens to have the least proportion of homes with incomes greater than 80K.

**Figure 5.** Estimated proportion of homes with income greater than 80K 
for each of 10 Census Divisions.

Figure 5. Estimated proportion of homes with income greater than 80K for each of 10 Census Divisions.

Question 2

In this question you will design a Monte Carlo study in R to compare the performance of different methods that adjust for multiple comparisons.

Part b.

Choose Σ and σ as you like. Use the Cholesky factorization of Σ to generate a single X. Pass X, β, and σ to your function from the previous part.

Solution:

Part c.

Write a function evaluate that takes a set of indices where β≠0 and returns Monte Carlo estimates for the following quantities: i. The family wise error rate i. The false discovery rate i. The sensitivity i. The specificity

Solution:

Part d.

Apply your function from the previous part to the matrix of uncorrected P-values generated in part B. Use the function p.adjust() to correct these p-values for multiple comparisons using ‘Bonferroni’, ‘Holm’, ‘BH’ (Benjamini-Hochberg), and ‘BY’. Use your evaluate() function for each set of adjusted p-values.

Solution:

# Assumes NumBetas != 0 are at the first k rows of our data frame
### Perform p-value adjustments and add to data frame
correctionsList = c("bonferroni", "holm", "BH", "BY")
testsList = c("FWER", "FDR", "sensitivity", "specificity")

alpha = 0.05
# Set up our matrix to collect multiple Tests for each column
multipleTests = data.frame(matrix(
  nrow = length(testsList), 
  ncol = length(correctionsList)))
row.names(multipleTests) = testsList
colnames(multipleTests) = correctionsList

# Loop through each method from our Corrections list and produce a table with 
# the output from our evaluate function for each correction method
for (methodName in correctionsList)
{
  # Initializing p_values matrix
  p_values = matrix(nrow = nrow(p_values_m), ncol = ncol(p_values_m))
  
  # adjusting p_values given the method
  p_values = as.matrix(p.adjust(p_values_m, method = methodName))
  dim(p_values) = c(p, mc_rep)
  
  # Run tests with our evaluate function
  test = evaluate(p_values, alpha, k)
  
  # Update column of multipleTests matrix with results from evaluate()
  multipleTests[, methodName] = test
}

# Calculate Unadjusted P-Values
multipleTests[,"Unadjusted_Pvals"] = evaluate(p_values_m, alpha, k)
multipleTests = round(multipleTests, digits = 3)

# Add column containing the tests we performed
multipleTests["Tests",] = c(correctionsList, "Unadjusted")
multipleTests = as.data.frame(t(multipleTests))

# Reorder rows to be displayed in bar graphs
multipleTests = multipleTests[c("Unadjusted_Pvals",correctionsList),]
multipleTests$Tests = factor(multipleTests$Tests, 
                              levels = multipleTests$Tests)

Part e.

Produce one or more nicely formatted graphs or tables reporting your results. Briefly discuss what you found.

Solution: In this example, we take the number of monte carlo simulations: mc_rep = 100. Below is a histogram of the p-values for \(\beta_i = 0\), resulting in 9000 observations. Ideally we would like most if not all p-values to not be significant. The histogram demonstrates that the p-values are uniformly distributed across the range [0,1].

**Figure 5.** Histogram of p-values generated from our simulation

Figure 5. Histogram of p-values generated from our simulation

**Figure 6.** Monte Carlo Estimates of Family Wise Error Rate (FWER), 
False Discovery Rate (FDR), Sensitivity, and Specificity

Figure 6. Monte Carlo Estimates of Family Wise Error Rate (FWER), False Discovery Rate (FDR), Sensitivity, and Specificity

  1. After our corrections, the Family Wise Error Rate (FWER) decreased for each of the the corrections. FWER decreased the most for Bonferroni and Holm corrections.
  2. The false discovery rate also improved and decreased the most for Bonferroni and Holm corrections compared to the Unadjusted p-values.
  3. The Sensitivity of our p-values did not change.
  4. The Specificity improved for all corrections resulting in fewer False Positives.

Question 3

This is a bonus question related to problem 6 from the midterm. First, review the script written in Stata available here. In this question, you will work through various options for translating this analysis into R. You may submit all or some of these, but each part must be entirely correct to earn the points listed.

## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.

Part b.

Write a function to compute the univariate regression coefficient by group for an arbitrary dependent, independent, and grouping variables. Use data.table for computations within your function. Test your function by showing it produces the same results as in part a.

Table 6. Average annual electricity utilization by Census Division in kwh/home.
Number of Cylinders beta Estimate: Displacement beta Estimate: Horsepower beta Estimate: Weight
4 -1.35 -1.13 -56.47
6 0.02 -0.05 -16.68
8 -0.26 -0.19 -28.50

Part d.

Write a function similar to the one in part b to compute arbitrary univariate regression coefficients by group. Use dplyr for computations within your function. You should read the “Programming with dplyr” vignette at vignette(‘programming’, ‘dplyr’) before attempting this. Warning: this may be difficult to debug!