# Introduction to ceser

## Summary

The ceser package contains an R implementation of the Cluster Estimated Standard Error (CESE) method proposed by Jackson (2019). The method estimates the covariance matrix of the estimated coefficients of linear models in grouped data sets with correlation among observations within groups. CESE is an alternative to the classical Cluster Robust Standard Error (CRSE) (Greene (2012), Eicker (1967), White (1980), Liang and Zeger (1986), MacKinnon and Webb (2017)), which underestimates the standard errors in most situations encountered in practice (Esarey and Menger (2018)).

## Technical Details

Jackson (2019) proposes an approach labeled CESE to estimate the standard errors in grouped data with within-group correlation in the residuals. The approach is based on the estimated expectation of the product of the residuals. Assuming that the residuals have the same variance-covariance matrix within the groups, if we denote by $$\sigma_{ig} = \sigma_{g}^{2}$$ and $$\rho_{ig} = \rho_{g}$$ the variance and covariance, respectively, of the residuals within the group $$g$$, then the expectation of the product of the residuals is given by (see Jackson (2019) for details):

\begin{align} \displaystyle\label{eq-exp-ee} \begin{split} \Sigma_g = \mathbb{E}[e_g e_g^T] = \sigma_{g}^{2} (I_g - P_g) + \rho_{g} \bigg[ \iota_{g} \iota_{g}^{T} - (I_g - P_g) - (P_g \iota_{g} \iota_{g}^{T} + \iota_{g} \iota_{g}^{T} P_g) \\ + X_g(X^{T}X)^{-1} \left( \sum_{h=1}^{G} X_{h}^{T}\iota_{h} \iota_{h}^{T} X_h \right)(X^{T}X)^{-1}X_g^{T} \bigg] \end{split} \end{align}

where $$\iota_{g}$$ is an $$n_g \times 1$$ column vector of ones, $$I_g$$ is the $$n_g \times n_g$$ identity matrix, $$P_g = X_g (X^{T}X)^{-1}X_{g}^{T}$$, and $$n_g$$ is the number of observations in group $$g$$. The equation above can be rewritten concisely as:

\begin{align} \displaystyle\label{eq-exp-ee-reduced} \Sigma_{g}^{} = \sigma_{g}^{2} Q_{1g} + \rho_{g} Q_{2g}. \end{align}

This equation explicitly shows that the expectation of the cross-product of the residuals is a function of the data through $$Q_{1g}$$ and $$Q_{2g}$$, the unknown variance $$\sigma_{g}^{2}$$, and the correlation $$\rho_{g}$$ of the residuals $$e_g$$ in each group $$g$$. The CESE solution is to exploit the linear structure of that equation and estimate $$\sigma_{g}^{2}$$ and $$\rho_{g}$$ as if the estimated values of $$e_g e_{g}^{T}$$ were random deviations from their expectations. Denote this deviation by $$\xi$$. Then

\begin{align} \displaystyle\label{eq-cese-reg} \begin{split} e_g e_{g}^{T} &= \mathbb{E}[e_g e_{g}^{T}] + \xi \\ &= \sigma_{g}^{2} Q_{1g} + \rho_{g} Q_{2g} + \xi \\ &= \Sigma_{g}^{} + \xi. \end{split} \end{align}
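The matrices $$Q_{1g}$$ and $$Q_{2g}$$ can be built directly from the design matrix. The base-R sketch below (toy data; this helper is ours for illustration, not part of ceser) constructs them for each group following the definitions above:

```r
# Toy grouped design: 3 groups of 4 observations, intercept + one covariate.
set.seed(1)
G  <- 3; ng <- 4
X  <- cbind(1, rnorm(G * ng))            # full design matrix
gr <- rep(seq_len(G), each = ng)         # group labels
XtXinv <- solve(crossprod(X))            # (X'X)^{-1}

# Sum over groups of X_h' iota_h iota_h' X_h (the term inside the parentheses).
S <- Reduce(`+`, lapply(split.data.frame(X, gr), function(Xh) {
  iota <- matrix(1, nrow(Xh), 1)
  crossprod(Xh, iota) %*% crossprod(iota, Xh)
}))

# Q_{1g} = I_g - P_g and Q_{2g} = the bracketed term, for each group g.
Qs <- lapply(split.data.frame(X, gr), function(Xg) {
  ng   <- nrow(Xg)
  iota <- matrix(1, ng, 1)
  Pg   <- Xg %*% XtXinv %*% t(Xg)
  Q1g  <- diag(ng) - Pg
  Q2g  <- iota %*% t(iota) - Q1g -
          (Pg %*% iota %*% t(iota) + iota %*% t(iota) %*% Pg) +
          Xg %*% XtXinv %*% S %*% XtXinv %*% t(Xg)
  list(Q1 = Q1g, Q2 = Q2g)
})
# Sigma_g for given sigma_g^2 and rho_g would be: sigma2 * Q1 + rho * Q2
```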

The estimates of $$\sigma_{g}^{2}$$ and $$\rho_{g}$$ are obtained using the OLS estimator. That is, denote by $$\Omega_g = (\sigma_{g}^{2} , \rho_{g} )^{T}$$, by $$q_{1g}$$ and $$q_{2g}$$ the diagonal and lower-triangle elements of $$Q_{1g}$$ and $$Q_{2g}$$, respectively, each stacked into an $$n_g(n_g + 1)/2 \times 1$$ column vector, by $$q_g = [q_{1g}, q_{2g}]$$, and by $$s_{eg}$$ the corresponding elements of $$e_g e_{g}^{T}$$ stacked into a column vector as well. Then the OLS CESE estimator $$\hat{\Omega}_{g} = (\hat{\sigma}_{g}^{2} , \hat{\rho}_{g} )^{T}$$ of the variance and correlation of the residuals in group $$g$$ is given by

\begin{align} \hat{\Omega}_g = \underset{\Omega_g}{\operatorname{arg\,min}} \; (s_{eg} - q_{g}\Omega_{g})^{T}(s_{eg} - q_{g}\Omega_{g}). \end{align}

If we assume that $$q_{g}^{T}q_{g}$$ is invertible, the first order condition gives:

\begin{align} \displaystyle\label{eq-cese-concise} \hat{\Omega}_{g} = (q_g^{T}q_g)^{-1}q_g^{T}s_{eg}. \end{align}

We can rewrite the equation above as:

\begin{align} \displaystyle\label{eq-cese-matrix} \begin{bmatrix} \hat{\sigma}_{g}^{2} \\ \hat{\rho}_{g} \end{bmatrix} = \begin{bmatrix} q_{1g}^{T}q_{1g} & q_{1g}^{T}q_{2g}\\ q_{2g}^{T}q_{1g} & q_{2g}^{T}q_{2g} \end{bmatrix}^{-1} \begin{bmatrix} q_{1g}^{T}s_{eg} \\ q_{2g}^{T}s_{eg} \end{bmatrix}. \end{align}
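The 2 × 2 system above is straightforward to solve numerically. A minimal base-R sketch, using toy symmetric matrices in place of $$Q_{1g}$$ and $$Q_{2g}$$ and a noise-free $$s_{eg}$$ so the true values are recovered exactly:

```r
# Stack the diagonal and lower triangle of a symmetric matrix into a vector.
vech <- function(M) M[lower.tri(M, diag = TRUE)]

set.seed(2)
ng <- 5
A  <- matrix(rnorm(ng^2), ng); Q1 <- crossprod(A)   # toy symmetric Q_{1g}
B  <- matrix(rnorm(ng^2), ng); Q2 <- crossprod(B)   # toy symmetric Q_{2g}

sigma2 <- 1.5; rho <- 0.4
S_e <- sigma2 * Q1 + rho * Q2       # plays the role of e_g e_g' with xi = 0

q1 <- vech(Q1); q2 <- vech(Q2); s_eg <- vech(S_e)
q  <- cbind(q1, q2)

# OLS solution of the 2x2 system: (q'q)^{-1} q' s_eg
Omega_hat <- solve(crossprod(q), crossprod(q, s_eg))
```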

The estimators of $$\sigma_{g}^{2}$$ and $$\rho_{g}$$ do not require any assumption on $$\xi$$, unless we want to construct confidence intervals for the estimates of those parameters.

Jackson (2019) shows that CESE produces larger standard errors for the coefficients and much more conservative confidence intervals than the CRSE, which is known to be biased downward. CESE is also less sensitive to the number of clusters and to the heterogeneity of the clusters, which can be a problem for both CRSE and bootstrap methods.

## Implementation

The ceser package provides the function vcovCESE(), which takes the output of lm() (or any other function that produces a compatible output) and computes the CESE. The basic structure of the function is:

```r
vcovCESE(mod, cluster = NULL, type = NULL)
```

The parameter mod receives the output of the lm() function. The parameter cluster can receive a one-sided R formula in which the variables used to cluster the standard errors are added together on the right-hand side. For instance, to cluster the standard errors by country, one can use:

```r
vcovCESE(..., cluster = ~ country, ...)
```

To cluster by country and gender, simply use:

```r
vcovCESE(..., cluster = ~ country + gender, ...)
```

Instead of a formula, the parameter cluster can also receive a character vector with the names of the variables that define the groups used to cluster the standard errors. If cluster = NULL, each observation is treated as its own group.

The parameter type receives the procedure to use for heteroskedasticity correction. Heteroskedasticity occurs when the diagonal elements of $$\Sigma$$ are not constant across observations. The correction can also be used to deal with underestimation of the true variance of the residuals due to leverage produced by outliers. We include five types of correction: type can be one of “HC0”, “HC1”, “HC2”, “HC3”, or “HC4” (Hayes and Cai (2007)). Denote by $$e_{ic}$$ the corrected residual of observation $$i$$. Each option produces the following correction:

\begin{array}{ll} \text{HC0:} & e_{ic} = e_i \\ \text{HC1:} & e_{ic} = e_i \sqrt{\dfrac{n}{n-k}} \\ \text{HC2:} & e_{ic} = \dfrac{e_i}{\sqrt{1-h_{ii}}} \\ \text{HC3:} & e_{ic} = \dfrac{e_i}{1-h_{ii}} \\ \text{HC4:} & e_{ic} = \dfrac{e_i}{(1-h_{ii})^{\delta_i/2}} \end{array}

where $$k$$ is the number of covariates, $$h_{ii}$$ is the $$i^{th}$$ diagonal element of the matrix $$P=X(X^{T}X)^{-1}X^{T}$$, and $$\delta_{i} = \min(4, h_{ii}\frac{n}{k} )$$.
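The standard HC0–HC4 corrections (Hayes and Cai 2007) are easy to compute with base R alone. The sketch below applies them to a toy lm() fit on the built-in mtcars data; it illustrates the formulas and is not ceser's internal code:

```r
# Toy fit; hatvalues() gives the diagonal of P = X (X'X)^{-1} X'.
fit <- lm(mpg ~ wt + hp, data = mtcars)
e   <- residuals(fit)
h   <- hatvalues(fit)
n   <- length(e)
k   <- length(coef(fit))                 # number of estimated parameters
delta <- pmin(4, h * n / k)              # delta_i = min(4, h_ii * n / k)

e_hc0 <- e                               # no correction
e_hc1 <- e * sqrt(n / (n - k))           # degrees-of-freedom inflation
e_hc2 <- e / sqrt(1 - h)                 # leverage correction
e_hc3 <- e / (1 - h)                     # stronger leverage correction
e_hc4 <- e / (1 - h)^(delta / 2)         # adaptive leverage correction
```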

The estimation also corrects for cases in which $$\hat{\rho}_{g} > \hat{\sigma}_{g}^{2}$$. Following Jackson (2019), we set $$\hat{\sigma}_{g}^{2} = \hat{\rho}_g + 0.02$$ in those cases.
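This adjustment amounts to a one-line check; a minimal sketch with made-up values (illustrative only, not ceser code):

```r
# Illustrative estimates where the constraint is violated.
sigma2_hat <- 0.30
rho_hat    <- 0.35

# If the estimated correlation exceeds the estimated variance,
# replace the variance estimate by rho_hat + 0.02.
if (rho_hat > sigma2_hat) sigma2_hat <- rho_hat + 0.02
```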

## Example

To illustrate how to use the ceser package, we use the data set dcese, which is provided with the package. The data set contains information on 310 observations (i = 1, …, 310) across 51 countries (g = 1, …, 51).

The outcome variable is the number of effective legislative parties (enep). The explanatory variables are: the number of presidential candidates (enpc); a measure of presidential power (fapres); the proximity of presidential and legislative elections (proximity); the effective number of ethnic groups (eneg); the log of average district magnitudes (logmag); an interaction term between the number of presidential candidates and presidential power (enpcfapres = enpc $$\times$$ fapres); and another interaction term between the log of the district magnitude and the number of ethnic groups (logmag_eneg = logmag $$\times$$ eneg). In the example below, we regress enep on enpc, fapres, their interaction, and the other controls.

```r
library(ceser)
#>
#>  ## ------------------------------------------
#>  ## Cluster Estimated Standard Errors  (ceser)
#>  ## ------------------------------------------
#>  Author(s): Diogo Ferrari and John Jackson
#>
data(dcese)
```

Let’s estimate a linear model using the lm() function.

```r
mod = lm(enep ~ enpc + fapres + enpcfapres + proximity
         + eneg + logmag + logmag_eneg, data = dcese)
```

For all the examples below, we use the HC3 correction. To compute the variance covariance matrix of the estimated coefficients $$\hat{\beta}$$ using clustered standard errors by country, just execute:

```r
vcovCESE(mod, cluster = ~country, type="HC3")
#>             (Intercept)          enpc       fapres    enpcfapres    proximity
#> (Intercept)  1.59804215 -0.3565889644 -0.326044522  0.0928614018 -0.086958688
#> enpc        -0.35658896  0.1254735451  0.104834222 -0.0354704134 -0.003332760
#> fapres      -0.32604452  0.1048342220  0.143205927 -0.0389793642 -0.017878997
#> enpcfapres   0.09286140 -0.0354704134 -0.038979364  0.0126978025  0.003217871
#> proximity   -0.08695869 -0.0033327597 -0.017878997  0.0032178709  0.139694748
#> eneg        -0.08737194  0.0028258190 -0.007080605  0.0010940285 -0.005680272
#> logmag      -0.22422347  0.0009844983  0.006687954  0.0038080340  0.009776123
#> logmag_eneg  0.08380931 -0.0058250477 -0.011500450  0.0008568949  0.004471824
#>                     eneg        logmag   logmag_eneg
#> (Intercept) -0.087371945 -0.2242234723  0.0838093142
#> enpc         0.002825819  0.0009844983 -0.0058250477
#> fapres      -0.007080605  0.0066879544 -0.0115004496
#> enpcfapres   0.001094029  0.0038080340  0.0008568949
#> proximity   -0.005680272  0.0097761226  0.0044718243
#> eneg         0.039433115  0.0481561426 -0.0231003489
#> logmag       0.048156143  0.2244236813 -0.1048418029
#> logmag_eneg -0.023100349 -0.1048418029  0.0606625677
```

We can obtain the summary table with CESE by country by running:

```r
library(lmtest)
#>
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#>
#>     as.Date, as.Date.numeric
coeftest(mod, vcov = vcovCESE, cluster = ~ country, type="HC3")
#>
#> t test of coefficients:
#>
#>              Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  2.704309   1.264137  2.1393  0.03325 *
#> enpc         0.303988   0.354222  0.8582  0.39150
#> fapres      -0.611787   0.378426 -1.6167  0.10703
#> enpcfapres   0.207791   0.112685  1.8440  0.06620 .
#> proximity   -0.022429   0.373758 -0.0600  0.95219
#> eneg        -0.065662   0.198578 -0.3307  0.74114
#> logmag      -0.181462   0.473734 -0.3830  0.70197
#> logmag_eneg  0.360513   0.246298  1.4637  0.14435
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```