STATS 509: Homework 2

Question 1

Exercise 2 on page 81 Column seven of the data set RecentFord.csv on the book’s web site contains Ford daily closing prices, adjusted for splits and dividends, for the years \(2009-2013 .\) Repeat Problem 1 using these more recent returns. One of returns is approximately \(-0.175\). For part \((\mathrm{d}),\) use that return in place of Black Monday. (Black Monday, of course, is not in this data set.) On what date did this return occur? Search the Internet for news about Ford that day. Why did the Ford price drop so precipitously that day?

Recall Problem 1: This problem uses the data set ford.csv on the book’s web site. The data were taken from the ford.s data set in R’s fEcofin package. This package is no longer on CRAN. This data set contains 2000 daily Ford returns from January 2,1984, to December 31,1991.

Find the sample mean, sample median, and standard deviation of the Ford returns.
Create a normal plot of the Ford returns. Do the returns look normally distributed? If not, how do they differ from being normally distributed?
Test for normality using the Shapiro-Wilk test? What is the \(p\) -value? Can you reject the null hypothesis of a normal distribution at \(0.01\)?
Create several \(t\) -plots of the Ford returns using a number of choices of the degrees of freedom parameter (df). What value of df gives a plot that is as linear as possible? The returns include the return on Black Monday, October 19,1987. Discuss whether or not to ignore that return when looking for the best choices of df.
Find the standard error of the sample median using formula (4.3) with the sample median as the estimate of \(F^{-1}(0.5)\) and a KDE to estimate \(f\). Is the standard error of the sample median larger or smaller than the standard error of the sample mean?

The data set RecentFord.csv is in Data directory under Files in Canvas. Note: The normal plot and t-plots correspond to QQ plots relative to those distributions.

(a) \(\mathbf{Solution.}\qquad\) Below we compute the relevant statistics from the data,

# Read Data
data = read.csv("RecentFord.csv")

# Compute Returns
log_ret = diff(log(data$Adj.Close))
ret = exp(log_ret) - 1

# Add new columns to dataframe for plotting
data["Ret"] = c(NA, ret)
data["Log_Ret"] = c(NA, log_ret)

df = data.frame("Mean" = mean(ret), "Median" = median(ret), 
                "Std. Dev" = sd(ret))

kable(df, digits = 3, format.args = list(big.mark = ","), 
      caption = "Summary Statistics for Ford Returns",
      col.names = c("Mean", "Median", "Std. Dev"), escape = FALSE) %>%
  kable_styling(latex_options = c("striped", "hold_position"))

(#tab:q1_A)Summary Statistics for Ford Returns
Mean	Median	Std. Dev
0.002	0.001	0.026

(b) \(\mathbf{Solution.}\qquad\) Below is the R code to construct a normal plot of the Ford returns.

ggplot(data = subset(data, !is.na(data$Ret)), aes(sample = Ret)) +
  stat_qq(colour = "darkred", shape = 1) + 
  stat_qq_line(colour = "steelblue") +
  ggtitle("Normal Plot of Returns") + 
  theme(plot.title = element_text(hjust = 0.5)) + 
  xlab('Theoretical Quantiles') + 
  ylab('Sample Quantiles')

The Ford returns do not look normally distributed, because the left end and right end of the QQ plot deviate away from the theoretical quantiles. In other words, returns are much more heavy tailed in comparison to the normal distribution.

(c) \(\mathbf{Solution.}\qquad\) According to Shapiro-Wilk test, its \(p\)-value is 1.504037210^{-23}, so we can certainly reject the null-hypothesis at the 1% level that the distribution is normally distributed.

shapiro.test(ret)

(d) \(\mathbf{Solution.}\qquad\) To create \(t\)-plots we will use the following function qqt. Note that we construct the reference line by interpolating between the 25th and 75th quantiles. We could have also done linear regression on the points between these two quantiles.

# Function for generating t-plot
qqt = function(x, nu){
  xsort = sort(x)
  n = length(xsort)
  q = (1 / (n + 1)) * c(1:n)
  qq = qt(q, nu)
  
  # Construct dataframe to plot in ggplot
  df = data.frame("quantiles" = qq, "data" = xsort)
  
  # Compute theoretical quantiles
  xx = quantile(qq, c(0.25, 0.75))
  yy = quantile(xsort, c(0.25, 0.75))
  slope = (yy[2] - yy[1]) / (xx[2] - xx[1])
  inter = yy[1] - slope*xx[1]
  
  g = ggplot(df, aes(x = quantiles, y = data)) +
    geom_point(colour = "darkred", shape = 1) + 
    ggtitle(TeX(paste0("t-Plot $\\nu$= ", as.character(nu)))) + 
    theme(text=element_text(size=7),
          plot.title = element_text(hjust = 0.5)) + 
    geom_abline(intercept = inter, slope = slope, colour = "steelblue") + 
    xlab('Theoretical Quantiles') + 
    ylab('Sample Quantiles')
  
  return(g)
}

We show several subplots with degrees of freedom ranging from \(\{2,\ldots ,10\}\). The R-code for constructing the \(t\)-plots is shown below.

# Construct t-distribution qq plots
plot_list = list()
nu_vals = c(2:10)

iter = 1
for (nu in nu_vals) {
  plot_list[[iter]] = local({
    nu = nu
    p1 = qqt(ret, nu)
  })
  iter = iter + 1
}

# plot all subplots
n <- length(plot_list)
nCol <- ceiling(sqrt(n))
grid.arrange(grobs = plot_list, ncol = nCol, 
             top = textGrob(TeX("t-Plots for $t(\\nu)$ of Ford Returns"), 
                            gp = gpar(fontsize = 13, font = 8)))

By inspection, it seems that the \(t\)-plot with \(3\) degrees of freedom produces the straightest line. When sorting the data, we see that the lowest return occurred on May 12, 2009.

ordered_ret = data[order(data$Ret),]
ordered_ret[1,] %>% select(c("Date", "Ret"))

##         Date        Ret
## 90 5/12/2009 -0.1748252

Below we test this observation using \(-1.5\times IQR\) as a proxy to determine whether it is an outlier.

max_loss = ordered_ret$Ret[1]
is_outlier = max_loss < -1.5*IQR(ordered_ret$Ret, na.rm = TRUE)
is_outlier

## [1] TRUE

Indeed Ford shares dropped markedly on this day after “disclosing a public offering of 300 million shares of common stock that will help it fund its health care trust for retired autoworkers and their families” (https://www.mercurynews.com/2009/05/12/ford-shares-sink-on-share-offering/). Since this type of event occurs rarely, we can ignore this return.

(e) \(\mathbf{Solution.}\qquad\) First we construct a KDE of the returns using the following R code. We will use the default setting for the KDE since lowering the bandwidth will fit too much of the noice in the histogram. Increasing the bandwidth will smooth the curve, but won’t be able to capture the peak well.

# Create dataframe in order to plot multiple densities
dens = density(ret)
df = data.frame("x" = dens$x, "y" = dens$y)

# Plot each density along with histogram
ggplot(df, aes(x = x, y = y, color = "darkred")) + 
  geom_line() + 
  geom_histogram(data = data, aes(x = Ret, y = ..density..), inherit.aes = FALSE,
               binwidth = 1 / (0.2 * nrow(data)), alpha = .5) +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "none") + 
  ggtitle("Histogram of Ford Returns with KDEs") + 
  xlab('Returns') + 
  ylab('Density')

Formula (4.3) gives us the variance of the \(q\)th sample quantile when \(n\) is large. \[\frac{q(1-q)}{n\left[f\left\{F^{-1}(q)\right\}\right]^{2}}\] Using (4.3), the standard error formula of the sample median becomes, \[{SE}_{0.5} = \sqrt{\frac{{0.5}^2}{n\left[\hat{f}\left\{F^{-1}(q)\right\}\right]^{2}}}\] From our KDE, we can estimate \(\hat{f}\left\{F^{-1}(q)\right\}\). In the R code below, we compute the standard errors of the sample median and sample mean.

# Get estimated f(F^-1(q))
f = df[ceiling(nrow(df) / 2), ]$y
n = length(ret)

# Standard Error for Sample Median
std_med = sqrt((.5^2) / (n * f^2))

# Standard Error for Sample Mean
std_mean =  sd(ret) / sqrt(n)

We find that, \[\hat{f}\left\{F^{-1}(0.5)\right\}=18.112\]

Then the SE of sample median is, \[{SE}_{0.5}=0.000779\]

Likewise, the SE of sample mean is, \[{SE}(\hat{\mu})=0.000745\]

We can see that both standard error estimates are quite close.

Question 2

Exercise 5 on page 82 Let diffbp be the changes (that is, differences) in the variable bp, the U.S. dollar to British pound exchange rate, which is in the Garch data set of R’s Ecdat package.

Create a \(3 \times 2\) matrix of normal plots of diffbp and in each plot add a reference line that goes through the \(p-\) and \((1-p)\) -quantiles, where \(p=0.25,0.1,0.05,0.025,0.01,\) and \(0.0025,\) respectively, for the six plots. Create a second set of six normal plots using \(n\) simulated \(N(0,1)\) random variables, where \(n\) is the number of changes in bp plotted in the first figure. Discuss how the reference lines change with the value of \(p\) and how the set of six different reference lines can help detect nonnormality.
Create a third set of six normal plots using changes in the logarithm of bp. Do the changes in log(bp) look closer to being normally distributed than the changes in bp?

(a) \(\mathbf{Solution.}\qquad\) For the first set of \(3 \times 2\) plots, we display normal plots of diffbp with different reference lines interpolated between \(p-\) and \((1-p)\) -quantiles.

library(Ecdat)

# i. Create 6 normal plots diffbp
data = Ecdat::Garch
diffbp = diff(data$bp)
data["diffbp"] = c(NA, diffbp)

# Construct subplots
plot_list = list()
p_vals = c(0.25, 0.1, 0.05, 0.025, 0.01, 0.0025)

# Iterate through each value of p
iter = 1
for (p in p_vals) {
  plot_list[[iter]] = local({
    p = p
    
    # Get slope and intercept of reference line
    xx = qnorm(c(p, 1 - p))
    yy = quantile(diffbp, c(p, 1 - p))
    slope = (yy[2] - yy[1]) / (xx[2] - xx[1])
    inter = yy[1] - slope*xx[1]
    
    # Create plot
    p1 = ggplot(data = subset(data, !is.na(data$diffbp)), 
                aes(sample = diffbp)) +
      stat_qq(colour = "darkred", shape = 1) + 
      geom_abline(intercept = inter, slope = slope, colour = "steelblue") + 
      ggtitle(paste0("Normal Plot with p = ", as.character(p))) + 
      theme(plot.title = element_text(size = 10, hjust = 0.5))
  })
  iter = iter + 1
}

# plot all subplots
grid.arrange(grobs = plot_list, ncol = 2, 
             top = textGrob("3 × 2 Matrix of Normal Plots of diffbp", 
                            gp = gpar(fontsize = 13, font = 8)))

Next, we display the second set of six normal plots, but using \(n\) simulated \(N(0,1)\) random variables.

# ii. Create 6 normal plots for normally distributed data
set.seed(1234)
randn = rnorm(length(diffbp))
data$randn = c(NA, randn)
plot_list = list()

# Iterate through each value of p
iter = 1
for (p in p_vals) {
  plot_list[[iter]] = local({
    p = p
    
    # Get slope and intercept of reference line
    xx = qnorm(c(p, 1 - p))
    yy = quantile(randn, c(p, 1 - p))
    slope = (yy[2] - yy[1]) / (xx[2] - xx[1])
    inter = yy[1] - slope*xx[1]
    
    # Create plot
    p1 = ggplot(data = subset(data, !is.na(data$randn)), 
                aes(sample = randn)) +
      stat_qq(colour = "darkred", shape = 1) + 
      geom_abline(intercept = inter, slope = slope, colour = "steelblue") + 
      ggtitle(paste0("Normal Plot with p = ", as.character(p))) + 
      theme(plot.title = element_text(size = 11, hjust = 0.5))
  })
  iter = iter + 1
}

# plot all subplots
grid.arrange(grobs = plot_list, ncol = 2, 
             top = textGrob(TeX("3 × 2 Matrix of Normal Plots of $N(0,1)$"), 
                      gp = gpar(fontsize = 13, font = 8)))

On the first set of plots, we notice that when we change \(p\), the slope of the reference line increases. The plot that looks the straightest is the last one with \(p=0.0025\). The lines on the second set of plots all look identical. Thus we can detect nonnormality based on how much we have to decrease the value of \(p\) in order to get a straight-looking plot. In this case, we required a very low value of \(p\) to get normal plot of diffbp to look straight, indicating probable nonnormality.

(b) \(\mathbf{Solution.}\qquad\) Below we construct a third set of normal plots using changes in log(bp).

diff_lbp = diff(log(data$bp))
data$diff_lbp = c(NA, diff_lbp)

# Iterate through each value of p
iter = 1
for (p in p_vals) {
  plot_list[[iter]] = local({
    p = p
    
    # Get slope and intercept of reference line
    xx = qnorm(c(p, 1 - p))
    yy = quantile(diff_lbp, c(p, 1 - p))
    slope = (yy[2] - yy[1]) / (xx[2] - xx[1])
    inter = yy[1] - slope*xx[1]
    
    # Create plot
    p1 = ggplot(data = subset(data, !is.na(data$diff_lbp)), 
                aes(sample = diffbp)) +
      stat_qq(colour = "darkred", shape = 1) + 
      geom_abline(intercept = inter, slope = slope, colour = "steelblue") + 
      ggtitle(paste0("Normal Plot with p = ", as.character(p))) + 
      theme(plot.title = element_text(size = 10, hjust = 0.5))
  })
  iter = iter + 1
}

# plot all subplots
grid.arrange(grobs = plot_list, ncol = 2, 
             top = textGrob(TeX("3 × 2 Matrix of Normal Plots of $\\Delta \\log(bp)$"), 
                            gp = gpar(fontsize = 13, font = 8)))

We notice that none of the individual plots look straight, even though the last plot with \(p=0.0025\) would be considered the stragightest out of all other plots. Here, changes in log(bp) look less close to being normally distributed than changes in bp.

Question 3

(a) Derive the limiting distribution of \(\mathrm{GPD}(\mu, \xi, \sigma)\) for case where \(\mu=0, \sigma>0\) and \(\xi \rightarrow 0 .\) Hint: With \(F\) being the cdf, focus on the limit of the tail probability \([1-F(x)]\) Also, can use the standard result in mathematics that \[ \lim _{M \rightarrow \infty}\left(1+\frac{\beta}{M}\right)^{M}=e^{\beta} \]

(b) Based on result in (a), what can you say about using the Pareto distribution to model the tails of a normal distribution?

(a) \(\mathbf{Solution.}\qquad\) First we start by computing \(\lim_{\xi\rightarrow0}(1-F(x))\), \[ \lim_{\xi\rightarrow0}(1-F(x))=\lim_{\xi\rightarrow0}\left(1+\frac{\xi x}{\sigma}\right)^{-1/\xi} \]

Let \(\frac{1}{\xi}=M,\) then, \[\begin{align*} \lim_{\xi\rightarrow0}(1-F(x)) & =\lim_{M\rightarrow\infty}\left[\left(1+\frac{x}{\sigma M}\right)^{M}\right]^{-1}\\ & =\left[\lim_{M\rightarrow\infty}\left(1+\frac{x}{\sigma M}\right)^{M}\right]^{-1}\\ & =e^{-\frac{x}{\sigma}} \end{align*}\]

We were able to pass the limit inside of the reciprocal function, because the reciprocal function is monotonic decreasing on the interval \((0,\infty)\).

(b) \(\mathbf{Solution.}\qquad\) From part a) we know that, \[ \lim_{\xi\rightarrow0}(1-F(x))=e^{-x/\sigma}\sim e^{-x} \]

This means that by letting the shape parameter \(\xi\rightarrow0\), the tail probabilities will, at most, decrease exponentially fast. However, it is not well suited to model tails of a normal distribution since tails of normal decrease at a faster rate of \(e^{-x^{2}}\).

STATS 509: Homework 2

Israel Diego

1/29/2020

Question 1

Question 2

Question 3