This provides a quick demo to illustrate the semi-parametric
distribution. This distribution estimates the tails of the data using
the Generalized Pareto distribution (`gpd`

) whilst the
interior is estimated using a kernel density. The tails and interior are
then smoothly joined together to form the semi-parametric
distribution.

For this demo, we’ll generate a random sample from the Generalized Hyperbolic Skew Student distribution and then model the top and bottom 1% of the data using the Generalized Pareto distribution. Choosing the correct threshold when applying the Generalized Pareto distribution is an important step to ensure that the asymptotic distribution fits the data well and the parameter estimates are stable. There are a number of methods to test for this, and numerous packages in R which can help. The mev package, which is used for estimating the GPD (except in the case of the probability weighted moments approach) provides threshold stability plots and other diagnostics for aid in choosing the right threshold.

There are a total of 4 parameters, 2 for each tail, estimated independently. As such, the standard errors are based on a block diagonal variance-covariance matrix.

```
library(tsdistributions)
library(data.table)
library(knitr)
# simulate data
set.seed(200)
n <- 6000
sim <- rghst(n, mu = 0, sigma = 1, skew = -0.6, shape = 8.01)
spec <- spd_modelspec(sim, lower = 0.01, upper = 0.99, kernel_type = 'normal')
mod <- estimate(spec, method = "pwm")
print(summary(mod))
#> SPD Model Summary
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> [lower tail] scale 0.63612 0.12620 5.041 4.64e-07 ***
#> [lower tail] shape -0.07680 0.15348 -0.500 0.617
#> [upper tail] scale 0.50206 0.09886 5.078 3.81e-07 ***
#> [upper tail] shape -0.03801 0.15033 -0.253 0.800
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> N: 6000, dof: 4,
#> LogLik: -8338, AIC: -16670, BIC: -16640
```

We’ll investigate the goodness of fit of the distribution using some visuals:

```
lower_treshold <- mod$gpd$lower$threshold
upper_treshold <- mod$gpd$upper$threshold
oldpar <- par(mfrow = c(1,1))
par(mfrow = c(3,1), mar = c(3,3,3,3))
hist(sim, breaks = 300, probability = TRUE, xlab = "r",
col = "steelblue", border = "whitesmoke",main = "PDF")
box()
abline(v = lower_treshold, col = 'mediumpurple3', lty = 2)
abline(v = upper_treshold, col = 'mediumpurple3', lty = 2)
lines(sort(sim), dspd(sort(sim), mod), col = "darkorange",lwd = 2)
plot(sort(sim), (1:length(sim)/length(sim)),
ylim = c(0, 1), pch = 19,
cex = 0.5, ylab = "p", xlab = "q", main = "CDF")
grid()
lines(sort(sim), pspd(sort(sim), mod), col = "darkorange",lwd = 2)
abline(v = lower_treshold, col = 'mediumpurple3', lty = 2)
abline(v = upper_treshold, col = 'mediumpurple3', lty = 2)
plot(sort(sim), qspd(ppoints(length(sim)), mod), ylab = "Model Simulated", xlab = "Observed", main = "Q-Q")
abline(0, 1, col = "darkorange")
grid()
abline(v = lower_treshold, col = 'mediumpurple3', lty = 2)
abline(v = upper_treshold, col = 'mediumpurple3', lty = 2)
abline(h = lower_treshold, col = 'mediumpurple3', lty = 2)
abline(h = upper_treshold, col = 'mediumpurple3', lty = 2)
```

The vertical purple lines show the region modeled by the kernel density, representing the thresholds beyond which the GPD is used.

To calculate expectations from the estimated object we can use quadrature integration, as shown below, with calculations for the mean, variance, skewness, kurtosis, value at risk (1%) and expected shortfall (1%). We choose reasonably large but finite values for the numerical integration limits.

```
f <- function(x) x * dspd(x, mod)
mu <- integrate(f, -50, 50)$value
f <- function(x) x^2 * dspd(x, mod)
variance <- integrate(f, -50, 50)$value
f <- function(x) (x - mu)^3 * dspd(x, mod)
skewness <- integrate(f, -50, 50)$value/(variance^(3/2))
f <- function(x) (x - mu)^4 * dspd(x, mod)
kurtosis <- integrate(f, -50, 50)$value/(variance^2)
value_at_risk <- qspd(0.01, mod)
f <- function(x) x * dspd(x, mod)
expected_shortfall <- integrate(f, -50, value_at_risk)$value/0.01
tab <- data.table(stat = c("mean","variance","skewness","kurtosis","VaR(1%)","ES(1%)"),
spd = c(mu, variance, skewness, kurtosis, value_at_risk, expected_shortfall),
observed = c(mean(sim), var(sim), mean((sim - mean(sim))^3)/(sd(sim)^3),
mean((sim - mean(sim))^4)/(sd(sim)^4),quantile(sim, 0.01),
mean(sim[sim <= quantile(sim, 0.01)])))
kable(tab, digits = 2)
```

stat | spd | observed |
---|---|---|

mean | -0.01 | -0.02 |

variance | 1.03 | 0.98 |

skewness | -0.21 | -0.28 |

kurtosis | 4.16 | 4.28 |

VaR(1%) | -2.79 | -2.79 |

ES(1%) | -3.28 | -3.38 |

The results appear promising and close to the observed values. Note that even though we simulated the data from a known distribution, in practice we usually don’t know which distribution the data came from. Therefore, the semi-parametric distribution provides a good trade-off between fully parametric and non-parametric approaches. The caveat is that we still need to be careful in choosing the thresholds. There is no free lunch!