Problem Set 4

Question 1

Use the Lahman baseball data previously seen in the SQL notes to answer this question. Your answer should be a single SQL query, but may require anonymous tables created using nested queries.

Write an SQL query to construct a table showing the all-time leader in hits (“H” from the “batting” table) for each birth country (“birthCountry” in the “master” table). An all-time leader is the player (“playerID”) with the most total hits across all rows (e.g. seasons/stints).

Limit your table to players/countries with at least 200 hits and order the table by descending number of hits.

Create a nicely formatted table with the following columns as your final output: Player (nameFirst nameLast), Debut (debut), Country of Birth (birthCountry), Hits (H).

Solution

The SQL query we use to extract the all-time leaders per country is shown below:

SELECT firstName || " " || lastName as Name, 
strftime("%m-%d-%Y", Debut) as Debut, birthCountry, max(Hits) as Hits
FROM(
  SELECT bat.playerID, m.nameFirst firstName, m.nameLast lastName, m.debut Debut, 
  m.birthCountry birthCountry, sum(H) as Hits
  FROM batting bat
  INNER JOIN master m ON bat.playerID = m.playerID
  GROUP BY bat.playerID
  HAVING Hits >= 200
)
GROUP BY birthCountry
ORDER BY -Hits

This query produces the following table:

**Table 1.** All-time leaders for most total career hits for each birth country
Name	Debut	birthCountry	Hits
Pete Rose	04-08-1963	USA	4256
Adrian Beltre	06-24-1998	D.R.	3166
Ichiro Suzuki	04-02-2001	Japan	3089
Rod Carew	04-11-1967	Panama	3053
Rafael Palmeiro	09-08-1986	Cuba	3020
Roberto Clemente	04-17-1955	P.R.	3000
Omar Vizquel	04-03-1989	Venezuela	2877
Chili Davis	04-10-1981	Jamaica	2380
Edgar Renteria	05-10-1996	Colombia	2327
Patsy Donovan	04-19-1890	Ireland	2256
Larry Walker	08-16-1989	CAN	2160
Tom Brown	07-06-1882	United Kingdom	1958
Andruw Jones	08-15-1996	Curacao	1933
Vinny Castilla	09-01-1991	Mexico	1884
John Anderson	09-08-1894	Norway	1843
Joe Quinn	04-26-1884	Australia	1804
Shin-Soo Choo	04-21-2005	South Korea	1496
Elmer Valo	09-22-1940	Czech Republic	1420
Horace Clarke	05-13-1965	V.I.	1230
Glenn Hubbard	07-14-1978	Germany	1084
Xander Bogaerts	08-20-2013	Aruba	832
Didi Gregorius	09-05-2012	Netherlands	739
Marvin Benard	09-05-1995	Nicaragua	714
Eddie Ainsmith	08-09-1910	Russia	707
Andre Rodgers	04-16-1957	Bahamas	628
Yan Gomes	05-17-2012	Brazil	553
Gerald Young	07-08-1987	Honduras	446
Reno Bertoia	09-22-1953	Italy	425
Steve Jeltz	07-17-1983	France	367
Tony Solaita	09-16-1968	American Samoa	336
Jack Quinn	04-15-1909	Slovakia	248

Question 2

In this question you will modify your answer to Problem Set 3, Question 2 (PS3 Q2) to practice parallel, asynchronous, and batch computing. Copy the functions from part a and c of PS3 Q2 to a new file ps4_q2_funcs.R

In each of the parts below, let β be defined as in PS3 Q2 and Σ be block diagonal with Σij=ρβiβj when i≠j and Σii=1.

Create a table or plot for your results from each part.

Part a.

Write an R script ps4_q2a.R that sources ps4_q2_funcs.R, and then uses mclapply to run parallel simulations for $ρ∈{\{.25i\}}_{-3}^3$.

Let σ=1 and use 10,000 Monte Carlo replications. Reorganize the results into a long data frame results_q4a with columns: “rho”, “sigma”, “metric”, “method”, “est”, and “se”. “Metric” should contain the assessment measure: FWER, FDR, Sensitivity, or Specificity and “method” the multiple comparison method used. The columns “est” and “se” should contain the Monte Carlo estimate and its standard error, respectively.

Solution

Below we show code demostrate the use of mclapply with nested foreach loops.

# Values of rho for mclapply where rho = (-0.75, -0.5, -0.25,...,0.75)
rhoVec = (1/4) * c(-3:3)

# Use mclapply for each value of rho with our runSimulation function
resultsA = mclapply(rhoVec, runSimulation)

The results using mclapply are presented below for the Monte Carlo Metric Estimates with parameters $ρ∈{\{.25i\}}_{-3}^3$ and $σ=1$.

The Metrics specificially are: Family Wise Error Rate (FWER), False Discovery Rate (FDR), Sensitivity, and Specificity

Metric Estimates for Part a

FWER

FDR

Sensitivity

Specificity

Part b.

Use your script from part a as the basis for a new script ps4_q2b.R. Setup a 4 core cluster using doParallel and then use nested foreach loops to run simulations for $ρ∈{\{.25i\}}_{-3}^3$ and $σ=\{.25,.5,1\}$. Reshape the results as before into results_q4b saved to a file results_q4b.RData. Use a PBS file to run this script on the Flux cluster.

Solution

Below we show code demostrate the use of doParallel with nested foreach loops.

# Values of rho to be iterated where rho = (-0.75, -0.5, -0.25,...,0.75)
rhoVec = (1/4) * c(-3:3)
# Values of sigma to be iterated where sigma = (0.25, 0.5, 1)
sigmaVec = c(1/4, 1/2, 1)

# How many cores to use in the cluster? #
ncores = 4 

# set up a cluster called 'cl'
cl = makeCluster(ncores)

# register the cluster
registerDoParallel(cl)

## Do parallel computaitons with nested foreach on rho and sigma
resultsB = foreach(rho = rhoVec) %:%
  foreach(sigma = sigmaVec) %dopar% {
    runSimulation(rho, sigma)
  }

## Always shut the cluster down when done
stopCluster(cl)

The results using doParallel are presented below for the Monte Carlo Metric Estimates with parameters $ρ∈{\{.25i\}}_{-3}^3$ and $σ=\{.25,.5,1\}$.

Metric Estimates for Part b

FWER

FDR

Sensitivity

Specificity

Part c.

-Modify your script from part a to create ps4_q2c.R which reads the following arguments from the command line: sigma, mc_rep, and n_cores.

-Also modify the script to use the futures package for parallelism.

-Use a PBS file to run this script as a job array for $σ=\{.25,.5,1\}$.

-Hint: see the answer at this page for how to convert *$PBS_ARRAYID* to sigma.

## character(0)

Solution

Below is the R code for demonstrating the use of future with nested foreach loops.

# Parallel computations using plan(multicore) and use future for parallelism

plan(multicore, workers = as.numeric(args_list["n_cores"]))

resultsC = list()
for(i in 1:length(rhoVec)){
  resultsC[[i]] = with(args_list, 
                       future({runSimulation(rhoVec[i], sigma, mc_rep)}))
}

The results using future are presented below for the Monte Carlo Metric Estimates with the default parameters $ρ∈{\{.25i\}}_{-3}^3$ and $σ=1$. The full results of the job array with $σ=\{.25,.5,1\}$ can be found in ps4_q2c-1.Rout, ps4_q2c-2.Rout, and ps4_q2c-4.Rout.

Metric Estimates for Part c

FWER

FDR

Sensitivity

Specificity

Question 3

For this question you should use the 2016 Medicare Provider Utilization and Payment data available here.

Part a.

Put the data into a folder ./data and then follow the instructions to read this data into SAS.

Solution

INFILE './data/Medicare_Provider_Util_Payment_PUF_CY2016.txt'

Part b.

Use one or more data steps to reduce the data set to those rows with “MRI” in the ‘hcpcs_description’ field and where ‘hcpcs_code’ starts with a 7.

Solution

data medicare;
    set Medicare_PS_PUF;
    where prxmatch("/MRI/", hcpcs_description) AND prxmatch("/^7/", hcpcs_code);
run;

Part c.

Use proc means or proc summary (as needed) to determine the MRI procedures with the highest volume, highest total payment, and highest average payment among the procedures represented here.

Solution

/* Calculate Volume */
proc summary data = medicare;
    class hcpcs_description;
    output out = volumeTable
            sum(line_srvc_cnt) = volume;
            label volume = "Volume";
run;

/* Calculate Total Payments and Average Payments */
proc summary data = medicare;
    class hcpcs_description;
    var average_Medicare_payment_amt;
    weight line_srvc_cnt;
    output out = paymentsTable
        sum(average_Medicare_payment_amt) = total_payments
        mean(average_Medicare_payment_amt) = average_payments;
        label total_payments = "total_payments"
                average_payments = "average payments";
run;

/* Merge the two tables above */
data highestStatistics;
    merge volumeTable paymentsTable;
        by = hcpcs_description;
        DROP _TYPE_;
        DROP _FREQ_;
        DROP by;
run;

proc sort data = highestStatistics;
    by descending volume total_payments average_payments;
RUN;

Part d.

Repeat part b-c using PROC SQL.

Solution

/* Part B in SQL */ 
proc sql;
create table sqlTable as
    select hcpcs_description, line_srvc_cnt, average_Medicare_payment_amt, hcpcs_code
        from Medicare_PS_PUF
        where hcpcs_description like "%MRI%" and hcpcs_code like "7%";
quit;
run;

/* Part C in SQL */ 
proc sql;
create table sql_Results as
    select hcpcs_description, sum(line_srvc_cnt) as volume, sum(line_srvc_cnt*average_Medicare_Payment_amt) as total_payment,
        sum(line_srvc_cnt*average_Medicare_payment_amt)/sum(line_srvc_cnt) as average_payment
        from sql_Table
        group by hcpcs_description
        order by -average_payment;
quit;
run;

Part e.

Export the results from “c” and “d” to csv and verify that they match. You do not need to produce a nice table within your solution document.

Solution

proc export data = highestStatistics
    outfile='./ps4_q3c.csv'
    dbms=csv
    replace;
RUN;

proc export data = sql_Results
    outfile = './ps4_q3d.csv'
    dbms = csv
    replace;
run;

Problem Set 4

Israel Diego

December 10, 2018

Question 1

Question 2

Part a.

Metric Estimates for Part a

FWER

FDR

Sensitivity

Specificity

Part b.

Metric Estimates for Part b

FWER

FDR

Sensitivity

Specificity

Part c.

Metric Estimates for Part c

FWER

FDR

Sensitivity

Specificity

Question 3

Part a.

Part b.

Part c.

Part d.

Part e.