Use the Lahman baseball data previously seen in the SQL notes to answer this question. Your answer should be a single SQL query, but may require anonymous tables created using nested queries.
Write an SQL query to construct a table showing the all-time leader in hits (“H” from the “batting” table) for each birth country (“birthCountry” in the “master” table). An all-time leader is the player (“playerID”) with the most total hits across all rows (e.g. seasons/stints).
Limit your table to players/countries with at least 200 hits and order the table by descending number of hits.
Create a nicely formatted table with the following columns as your final output: Player (nameFirst nameLast), Debut (debut), Country of Birth (birthCountry), Hits (H).
Solution
The SQL query we use to extract the all-time leaders per country is shown below:
SELECT firstName || " " || lastName as Name,
strftime("%m-%d-%Y", Debut) as Debut, birthCountry, max(Hits) as Hits
FROM(
SELECT bat.playerID, m.nameFirst firstName, m.nameLast lastName, m.debut Debut,
m.birthCountry birthCountry, sum(H) as Hits
FROM batting bat
INNER JOIN master m ON bat.playerID = m.playerID
GROUP BY bat.playerID
HAVING Hits >= 200
)
GROUP BY birthCountry
ORDER BY -Hits
This query produces the following table:
Name | Debut | birthCountry | Hits |
---|---|---|---|
Pete Rose | 04-08-1963 | USA | 4256 |
Adrian Beltre | 06-24-1998 | D.R. | 3166 |
Ichiro Suzuki | 04-02-2001 | Japan | 3089 |
Rod Carew | 04-11-1967 | Panama | 3053 |
Rafael Palmeiro | 09-08-1986 | Cuba | 3020 |
Roberto Clemente | 04-17-1955 | P.R. | 3000 |
Omar Vizquel | 04-03-1989 | Venezuela | 2877 |
Chili Davis | 04-10-1981 | Jamaica | 2380 |
Edgar Renteria | 05-10-1996 | Colombia | 2327 |
Patsy Donovan | 04-19-1890 | Ireland | 2256 |
Larry Walker | 08-16-1989 | CAN | 2160 |
Tom Brown | 07-06-1882 | United Kingdom | 1958 |
Andruw Jones | 08-15-1996 | Curacao | 1933 |
Vinny Castilla | 09-01-1991 | Mexico | 1884 |
John Anderson | 09-08-1894 | Norway | 1843 |
Joe Quinn | 04-26-1884 | Australia | 1804 |
Shin-Soo Choo | 04-21-2005 | South Korea | 1496 |
Elmer Valo | 09-22-1940 | Czech Republic | 1420 |
Horace Clarke | 05-13-1965 | V.I. | 1230 |
Glenn Hubbard | 07-14-1978 | Germany | 1084 |
Xander Bogaerts | 08-20-2013 | Aruba | 832 |
Didi Gregorius | 09-05-2012 | Netherlands | 739 |
Marvin Benard | 09-05-1995 | Nicaragua | 714 |
Eddie Ainsmith | 08-09-1910 | Russia | 707 |
Andre Rodgers | 04-16-1957 | Bahamas | 628 |
Yan Gomes | 05-17-2012 | Brazil | 553 |
Gerald Young | 07-08-1987 | Honduras | 446 |
Reno Bertoia | 09-22-1953 | Italy | 425 |
Steve Jeltz | 07-17-1983 | France | 367 |
Tony Solaita | 09-16-1968 | American Samoa | 336 |
Jack Quinn | 04-15-1909 | Slovakia | 248 |
In this question you will modify your answer to Problem Set 3, Question 2 (PS3 Q2) to practice parallel, asynchronous, and batch computing. Copy the functions from part a and c of PS3 Q2 to a new file ps4_q2_funcs.R
In each of the parts below, let β be defined as in PS3 Q2 and Σ be block diagonal with Σij=ρβiβj when i≠j and Σii=1.
Create a table or plot for your results from each part.
Write an R script ps4_q2a.R
that sources ps4_q2_funcs.R
, and then uses mclapply
to run parallel simulations for \(ρ∈{\{.25i\}}_{-3}^3\).
Let σ=1 and use 10,000 Monte Carlo replications. Reorganize the results into a long data frame results_q4a
with columns: “rho”, “sigma”, “metric”, “method”, “est”, and “se”. “Metric” should contain the assessment measure: FWER, FDR, Sensitivity, or Specificity and “method” the multiple comparison method used. The columns “est” and “se” should contain the Monte Carlo estimate and its standard error, respectively.
Solution
Below we show code demostrate the use of mclapply
with nested foreach loops.
# Values of rho for mclapply where rho = (-0.75, -0.5, -0.25,...,0.75)
rhoVec = (1/4) * c(-3:3)
# Use mclapply for each value of rho with our runSimulation function
resultsA = mclapply(rhoVec, runSimulation)
The results using mclapply
are presented below for the Monte Carlo Metric Estimates with parameters \(ρ∈{\{.25i\}}_{-3}^3\) and \(σ=1\).
The Metrics specificially are: Family Wise Error Rate (FWER), False Discovery Rate (FDR), Sensitivity, and Specificity
Use your script from part a as the basis for a new script ps4_q2b.R
. Setup a 4 core cluster using doParallel
and then use nested foreach loops to run simulations for \(ρ∈{\{.25i\}}_{-3}^3\) and \(σ=\{.25,.5,1\}\). Reshape the results as before into results_q4b
saved to a file results_q4b.RData
. Use a PBS file to run this script on the Flux cluster.
Solution
Below we show code demostrate the use of doParallel
with nested foreach loops.
# Values of rho to be iterated where rho = (-0.75, -0.5, -0.25,...,0.75)
rhoVec = (1/4) * c(-3:3)
# Values of sigma to be iterated where sigma = (0.25, 0.5, 1)
sigmaVec = c(1/4, 1/2, 1)
# How many cores to use in the cluster? #
ncores = 4
# set up a cluster called 'cl'
cl = makeCluster(ncores)
# register the cluster
registerDoParallel(cl)
## Do parallel computaitons with nested foreach on rho and sigma
resultsB = foreach(rho = rhoVec) %:%
foreach(sigma = sigmaVec) %dopar% {
runSimulation(rho, sigma)
}
## Always shut the cluster down when done
stopCluster(cl)
The results using doParallel
are presented below for the Monte Carlo Metric Estimates with parameters \(ρ∈{\{.25i\}}_{-3}^3\) and \(σ=\{.25,.5,1\}\).
-Modify your script from part a to create ps4_q2c.R
which reads the following arguments from the command line: sigma
, mc_rep
, and n_cores
.
-Also modify the script to use the futures
package for parallelism.
-Use a PBS file to run this script as a job array for \(σ=\{.25,.5,1\}\).
-Hint: see the answer at this page for how to convert *$PBS_ARRAYID*
to sigma
.
## character(0)
Solution
Below is the R code for demonstrating the use of future
with nested foreach loops.
# Parallel computations using plan(multicore) and use future for parallelism
plan(multicore, workers = as.numeric(args_list["n_cores"]))
resultsC = list()
for(i in 1:length(rhoVec)){
resultsC[[i]] = with(args_list,
future({runSimulation(rhoVec[i], sigma, mc_rep)}))
}
The results using future
are presented below for the Monte Carlo Metric Estimates with the default parameters \(ρ∈{\{.25i\}}_{-3}^3\) and \(σ=1\). The full results of the job array with \(σ=\{.25,.5,1\}\) can be found in ps4_q2c-1.Rout
, ps4_q2c-2.Rout
, and ps4_q2c-4.Rout
.
For this question you should use the 2016 Medicare Provider Utilization and Payment data available here.
Put the data into a folder ./data and then follow the instructions to read this data into SAS.
Solution
Use one or more data steps to reduce the data set to those rows with “MRI” in the ‘hcpcs_description’ field and where ‘hcpcs_code’ starts with a 7.
Solution
Use proc means or proc summary (as needed) to determine the MRI procedures with the highest volume, highest total payment, and highest average payment among the procedures represented here.
Solution
/* Calculate Volume */
proc summary data = medicare;
class hcpcs_description;
output out = volumeTable
sum(line_srvc_cnt) = volume;
label volume = "Volume";
run;
/* Calculate Total Payments and Average Payments */
proc summary data = medicare;
class hcpcs_description;
var average_Medicare_payment_amt;
weight line_srvc_cnt;
output out = paymentsTable
sum(average_Medicare_payment_amt) = total_payments
mean(average_Medicare_payment_amt) = average_payments;
label total_payments = "total_payments"
average_payments = "average payments";
run;
/* Merge the two tables above */
data highestStatistics;
merge volumeTable paymentsTable;
by = hcpcs_description;
DROP _TYPE_;
DROP _FREQ_;
DROP by;
run;
proc sort data = highestStatistics;
by descending volume total_payments average_payments;
RUN;
Repeat part b-c using PROC SQL.
Solution
/* Part B in SQL */
proc sql;
create table sqlTable as
select hcpcs_description, line_srvc_cnt, average_Medicare_payment_amt, hcpcs_code
from Medicare_PS_PUF
where hcpcs_description like "%MRI%" and hcpcs_code like "7%";
quit;
run;
/* Part C in SQL */
proc sql;
create table sql_Results as
select hcpcs_description, sum(line_srvc_cnt) as volume, sum(line_srvc_cnt*average_Medicare_Payment_amt) as total_payment,
sum(line_srvc_cnt*average_Medicare_payment_amt)/sum(line_srvc_cnt) as average_payment
from sql_Table
group by hcpcs_description
order by -average_payment;
quit;
run;
Export the results from “c” and “d” to csv and verify that they match. You do not need to produce a nice table within your solution document.
Solution