Question 1

Use the Lahman baseball data previously seen in the SQL notes to answer this question. Your answer should be a single SQL query, but may require anonymous tables created using nested queries.

Write an SQL query to construct a table showing the all-time leader in hits (“H” from the “batting” table) for each birth country (“birthCountry” in the “master” table). An all-time leader is the player (“playerID”) with the most total hits across all rows (e.g. seasons/stints).

Limit your table to players/countries with at least 200 hits and order the table by descending number of hits.

Create a nicely formatted table with the following columns as your final output: Player (nameFirst nameLast), Debut (debut), Country of Birth (birthCountry), Hits (H).

Solution

The SQL query we use to extract the all-time leaders per country is shown below:

This query produces the following table:

Table 1. All-time leaders for most total career hits for each birth country
Name Debut birthCountry Hits
Pete Rose 04-08-1963 USA 4256
Adrian Beltre 06-24-1998 D.R. 3166
Ichiro Suzuki 04-02-2001 Japan 3089
Rod Carew 04-11-1967 Panama 3053
Rafael Palmeiro 09-08-1986 Cuba 3020
Roberto Clemente 04-17-1955 P.R. 3000
Omar Vizquel 04-03-1989 Venezuela 2877
Chili Davis 04-10-1981 Jamaica 2380
Edgar Renteria 05-10-1996 Colombia 2327
Patsy Donovan 04-19-1890 Ireland 2256
Larry Walker 08-16-1989 CAN 2160
Tom Brown 07-06-1882 United Kingdom 1958
Andruw Jones 08-15-1996 Curacao 1933
Vinny Castilla 09-01-1991 Mexico 1884
John Anderson 09-08-1894 Norway 1843
Joe Quinn 04-26-1884 Australia 1804
Shin-Soo Choo 04-21-2005 South Korea 1496
Elmer Valo 09-22-1940 Czech Republic 1420
Horace Clarke 05-13-1965 V.I. 1230
Glenn Hubbard 07-14-1978 Germany 1084
Xander Bogaerts 08-20-2013 Aruba 832
Didi Gregorius 09-05-2012 Netherlands 739
Marvin Benard 09-05-1995 Nicaragua 714
Eddie Ainsmith 08-09-1910 Russia 707
Andre Rodgers 04-16-1957 Bahamas 628
Yan Gomes 05-17-2012 Brazil 553
Gerald Young 07-08-1987 Honduras 446
Reno Bertoia 09-22-1953 Italy 425
Steve Jeltz 07-17-1983 France 367
Tony Solaita 09-16-1968 American Samoa 336
Jack Quinn 04-15-1909 Slovakia 248

Question 2

In this question you will modify your answer to Problem Set 3, Question 2 (PS3 Q2) to practice parallel, asynchronous, and batch computing. Copy the functions from part a and c of PS3 Q2 to a new file ps4_q2_funcs.R

In each of the parts below, let β be defined as in PS3 Q2 and Σ be block diagonal with Σij=ρβiβj when i≠j and Σii=1.

Create a table or plot for your results from each part.

Part a.

Write an R script ps4_q2a.R that sources ps4_q2_funcs.R, and then uses mclapply to run parallel simulations for \(ρ∈{\{.25i\}}_{-3}^3\).

Let σ=1 and use 10,000 Monte Carlo replications. Reorganize the results into a long data frame results_q4a with columns: “rho”, “sigma”, “metric”, “method”, “est”, and “se”. “Metric” should contain the assessment measure: FWER, FDR, Sensitivity, or Specificity and “method” the multiple comparison method used. The columns “est” and “se” should contain the Monte Carlo estimate and its standard error, respectively.

Solution

Below we show code demostrate the use of mclapply with nested foreach loops.

The results using mclapply are presented below for the Monte Carlo Metric Estimates with parameters \(ρ∈{\{.25i\}}_{-3}^3\) and \(σ=1\).

The Metrics specificially are: Family Wise Error Rate (FWER), False Discovery Rate (FDR), Sensitivity, and Specificity

Metric Estimates for Part a
FWER

FDR

Sensitivity

Specificity

Part b.

Use your script from part a as the basis for a new script ps4_q2b.R. Setup a 4 core cluster using doParallel and then use nested foreach loops to run simulations for \(ρ∈{\{.25i\}}_{-3}^3\) and \(σ=\{.25,.5,1\}\). Reshape the results as before into results_q4b saved to a file results_q4b.RData. Use a PBS file to run this script on the Flux cluster.

Solution

Below we show code demostrate the use of doParallel with nested foreach loops.

The results using doParallel are presented below for the Monte Carlo Metric Estimates with parameters \(ρ∈{\{.25i\}}_{-3}^3\) and \(σ=\{.25,.5,1\}\).

Metric Estimates for Part b
FWER

FDR

Sensitivity

Specificity

Part c.

-Modify your script from part a to create ps4_q2c.R which reads the following arguments from the command line: sigma, mc_rep, and n_cores.

-Also modify the script to use the futures package for parallelism.

-Use a PBS file to run this script as a job array for \(σ=\{.25,.5,1\}\).

-Hint: see the answer at this page for how to convert *$PBS_ARRAYID* to sigma.

## character(0)

Solution

Below is the R code for demonstrating the use of future with nested foreach loops.

The results using future are presented below for the Monte Carlo Metric Estimates with the default parameters \(ρ∈{\{.25i\}}_{-3}^3\) and \(σ=1\). The full results of the job array with \(σ=\{.25,.5,1\}\) can be found in ps4_q2c-1.Rout, ps4_q2c-2.Rout, and ps4_q2c-4.Rout.

Metric Estimates for Part c
FWER

FDR

Sensitivity

Specificity

Question 3

For this question you should use the 2016 Medicare Provider Utilization and Payment data available here.

Part a.

Put the data into a folder ./data and then follow the instructions to read this data into SAS.

Solution

Part b.

Use one or more data steps to reduce the data set to those rows with “MRI” in the ‘hcpcs_description’ field and where ‘hcpcs_code’ starts with a 7.

Solution

Part e.

Export the results from “c” and “d” to csv and verify that they match. You do not need to produce a nice table within your solution document.

Solution