# B Robust Regression Methods

This appendix is largely based on the introduction to linear robust regression presented in Ronchetti (2006) and Duncan and Guerrier (2016). In these references it is stated that the vast majority of the statistical models employed in different fields going from finance to biology and engineering, for example, are parametric models. Based on these models, assumptions are made concerning the properties of the variables of interest (and the models themselves) and optimal procedures are derived under these assumptions. Among these procedures, the least squares and maximum likelihood estimators are well known examples that, however, are only optimal when the underlying statistical assumptions are exactly satisfied. If the latter case does not hold, then these procedures can become considerably biased and/or inefficient when there exist small deviations from the model. The results obtained by classical procedures can therefore be misleading when applied to real data (see e.g. Ronchetti (2006) and Huber and Ronchetti (2009)).

In order to address the problems arising from violated parametric assumptions, robust statistics can be seen as an extension to classical parametric statistics by directly considering the deviations from the models. Indeed, while parametric models may be a good approximation of the true underlying situation, robust statistics does not assume that the model is exactly correct. A robust procedure as stated in Huber and Ronchetti (2009) therefore should have the following features:

- It should efficiently estimate the assumed model.
- It should be reliable and reasonably efficient under small deviations from the model (e.g. when the distribution lies in a neighborhood of the assumed model).
- Larger deviations from the model should not affect the estimation procedure excessively.

A robust estimation method is a compromise with respect to these three features. This compromise is illustrated by Anscombe and Guttman (1960) using an insurance metaphor: “sacrifice some efficiency at the model in order to insure against accidents caused by deviations from the model”.

It is often believed that robust procedures may be avoided by using the following two-step procedure:

- Clean the data using some rule for outlier rejection.
- Apply classical optimal procedures on the “clean” data.

Unfortunately such procedures cannot replace robust methods as discussed in Huber and Ronchetti (2009) for the following reasons:

- It is rarely possible to seperate the two steps. For example, in a parametric regression setting, outliers are difficult to recognize without reliable (i.e. robust) estimates of the model’s parameters.
- The cleaned data will not correspond to the assumed model since there will be statistical errors of both kinds (false acceptance and false rejection). Therefore in general the classical theory is not applicable to the cleaned sample.
- Empirically, the best rejection procedures do not reach the performance of the best robust procedures. The latter are apparently superior because they can make a smoother transition between the full acceptance and full rejection of an observation using weighting procedures Hampel et al. (1987).
- Empirical studies have also shown that many of the classical rejection methods are unable to deal with multiple outliers. Indeed, it is possible that a second outlier “masks” the effect of the first so that neither are rejected.

Unfortunately the least squares estimator suffers from a dramatic lack of robustness. A single outlier can have an arbitrarily large effect on the estimated parameters. In order to assess the robustness of an estimator we first need to introduce an important concept, namely the influence function. This concept was introduced in Hampel (1968) and it formalizes the bias caused by one outlier. The influence function of an estimator represents the effect of an infinitesimal contamination at the point \(x\) or (\(\mathbf{x}\), \(y\)) in the regression setting) on the estimate, standardized by the mass of contamination. Mathematically, the influence function of the estimator \(T\) for the model \(F\) is given by:

\[\text{IF}(x| T, F) = \lim_{\varepsilon \rightarrow 0} \frac{T\left((1-\varepsilon) F + \varepsilon \Delta_x \right) - T\left(F\right)}{\varepsilon}\]

where \(\Delta_x\) is a probability measure which puts mass \(1\) at the point \(x\).

## B.1 The Classical Least-Squares Estimator

The standard definition of the linear model is derived as follows. Let \({(\mathbf{x}_{i},y_{i}): i = 1, \ldots, n}\) be a sequence of independent identically distributed random variables such that:

\[y_{i} = \mathbf{x}_{i}^{T} {\boldsymbol{\beta}} + u_{i}\]

where \(y_{i} \in \mathbb{R}\) is the \(i\)-th observation, \(\mathbf{x}_{i} \in \mathbb{R}^{p}\) is the \(i\)-th row of the design matrix \(\mathbf{X} \in \mathbb{R}^{n\times p}\), \(\boldsymbol{\beta} \in \boldsymbol{\Theta} \subseteq \mathbb{R}\) is a *p*-vector of unknown parameters, \(u_{i} \in \mathbb{R}\) is the \(i\)-th error.

The least-squares estimator \(\hat{\boldsymbol{\beta}}_{LS}\) of \(\boldsymbol{\beta}\) can be expressed as an \(M\)-estimator^{1} Least-squares estimators are an example of the larger class of \(M\)-estimators. The definition of \(M\)-estimators was motivated by robust statistics which delivered new types of \(M\)-estimators. defined by the estimating equation:

This estimator is optimal under the following assumptions:

- \(u_{i}\) are normally distributed.
- \(\mathbb{E}[u_{i}] = 0\) for \(i = 1, \ldots, n\).
- \(Cov(u_{1}, \ldots, u_{n}) = \sigma^2 \, \mathbf{I}_{n}\) where \(\mathbf{I}_{n}\) denotes the identity matrix of size \(n\).

In other words, least-squares estimation is only optimal when the errors are normally distributed. Small departures from the normality assumption for the errors results in considerable loss of efficiency of the least-squares estimator (see Hampel et al. (1987), Huber (1973) and Hampel (1973)).

## B.2 Robust Estimators for Linear Regression Models

The “Huber estimator” introduced in Huber (1973) was one of the first robust estimation methods applied to linear models. Basically, this estimator is a weighted version of the least-squares estimate with weights of the form:

\[ w_{i} = \min \left(1,\frac{c}{|r_{i}|}\right) \]

where \(r_{i}\) is the \(i\)-th residual and \(c\) is a positive constant which controls the trade-off between robustness and efficiency.

Huber proposed an -estimator \(\hat{\boldsymbol{\beta}}_{H}\) of \(\boldsymbol{\beta}\) defined by the estimating equation:

\[ \sum_{i = 1}^{n} \psi_{c}\left(y_{i} - \mathbf{x}_{i}^{T} \boldsymbol{\beta} \right)\mathbf{x}_{i} = 0 \]

where \(\psi_{c}(\cdot)\) corresponds to Huber’s weight function

\[\begin{equation} w\left({x}\right) = \begin{cases} 1, &\text{if } \left|{x}\right| \le k \\ \frac{k}{\left|{x}\right|}, &\text{if} \left|{x}\right| > k \end{cases} \tag{B.2} \end{equation}\]and, thus, is defined as:

\[ \psi_{c}(r) = \left\{ \begin{array}{l l} r & \quad \\ c \cdot \text{sign}(r). & \quad \\ \end{array} \right. \]

However, the Huber estimator cannot cope with problems caused by outlying points in the design (or covariate) matrix \(X\). An estimator which was developed to address this particular issue is the one proposed by Mallows which has the important property that the influence function is bounded also for the matrix \(X\) (see Krasker (1980) for more details).

## B.3 Applications of Robust Estimation

Having rapidly highlighted the theory of robustness, we now focus on the application of robust techniques in practical settings. Over the next three examples, estimation between classical or traditional methods will be compared with robust methods to illustrate the usefulness of using the latter techniques can have in specific scenarios.

**Example B.1 ** Consider a simple linear model with Gaussian errors such as

```
# Load robust library
library("robustbase")
# Set seed for reproducibility
set.seed(867)
# Sample size
n = 20
# Model's parameters
alpha = 5 # Intercept
beta = 0.5 # Slope
sig2 = 1 # Residual variance
# Construct response variable y
x = 1:n
y = alpha + beta*x + rnorm(n,0,sqrt(sig2))
# Construct "perturbed" verison of y
y.perturbed = y
y.perturbed[20] = alpha + beta*x[20] - 10
# Compute LS estimates
LS.y = lm(y ~ x)
LS.y.fit = coef(LS.y)[1] + coef(LS.y)[2]*x
LS.y.pert = lm(y.perturbed ~ x)
LS.y.pert.fit = coef(LS.y.pert)[1] + coef(LS.y.pert)[2]*x
# Compute robust estimates
RR.y = lmrob(y ~ x)
RR.y.fit = coef(RR.y)[1] + coef(RR.y)[2]*x
RR.y.pert = lmrob(y.perturbed ~ x)
RR.y.pert.fit = coef(RR.y.pert)[1] + coef(RR.y.pert)[2]*x
# Define colors
gg_color_hue <- function(n, alpha = 1) {
hues = seq(15, 375, length = n + 1)
hcl(h = hues, l = 65, c = 100, alpha = alpha)[1:n]
}
couleurs = gg_color_hue(6)
# Compare results based on y and y.perturbed
par(mfrow = c(1,2))
plot(NA, ylim = range(cbind(y,y.perturbed)), xlim = range(x),
main = "Uncontaminated setting", xlab = "x", ylab = "y")
grid()
points(x,y, pch = 16, col = couleurs[4])
lines(x, alpha + beta*x, col = couleurs[3], lty = 2)
lines(x, LS.y.fit, col = couleurs[5])
lines(x, RR.y.fit, col = couleurs[1])
legend("topleft",c("Observations","Estimated model (least-squares)",
"Estimated model (robust)", "True model"),
lwd = c(NA,1,1,1), col = couleurs[c(4,5,1,3)],
lty = c(NA,1,1,2), bty = "n", pch = c(16,NA,NA,NA))
plot(NA, ylim = range(cbind(y,y.perturbed)), xlim = range(x),
main = "Contaminated setting", xlab = "x", ylab = "y")
grid()
points(x[1:19],y.perturbed[1:19], pch = 16, col = couleurs[4])
points(x[20], y.perturbed[20], pch = 15, col = couleurs[6])
lines(x, alpha + beta*x, col = couleurs[3], lty = 2)
lines(x, LS.y.pert.fit, col = couleurs[5])
lines(x, RR.y.pert.fit, col = couleurs[1])
legend("topleft",c("Uncontamined observations", "Contamined observation",
"Estimated model (least-squares)", "Estimated model (robust)",
"True model"), lwd = c(NA,NA,1,1,1), col = couleurs[c(4,6,5,1,3)],
lty = c(NA,NA,1,1,2), bty = "n", pch = c(16,15,NA,NA,NA))
```

**Example B.2 (Robust v. Classical Simulation Study.)**The next example presents a simulation study where the robust and classical techniques will be compared in order to show that the previous example was not due to a “lucky” sample favorable to the robust approach. To do so, the simulation study will generate 100 realizations from the model in the previous example and use the same robust and classical estimators to retrieve the parameters for each of the 100 iterations.

```
# Number of bootstrap replications
B = 10^3
# Initialisation
coef.LS.cont = coef.rob.cont = matrix(NA,B,2)
coef.LS.uncont = coef.rob.uncont = matrix(NA,B,2)
# Start Monte-carlo
for (j in 1:2){
for (i in seq_len(B)) {
# Control seed
set.seed(2*j*B + i)
# Uncontaminated case
if (j == 1){
y = alpha + beta*x + rnorm(n,0,sqrt(sig2))
coef.LS.uncont[i,] = as.numeric(lm(y ~ x)$coef)
coef.rob.uncont[i,] = as.numeric(lmrob(y ~ x)$coef)
}
# Contaminated case
if (j == 2){
y = alpha + beta*x + rnorm(n,0,sqrt(sig2))
y[20] = alpha + beta*x[20] - 10
coef.LS.cont[i,] = as.numeric(lm(y ~ x)$coef)
coef.rob.cont[i,] = as.numeric(lmrob(y ~ x)$coef)
}
}
}
# Make graph
colors = gg_color_hue(6, alpha = 0.5)
names = c("LS - uncont","Rob - uncont","LS - cont","Rob - cont")
par(mfrow = c(1,2))
boxplot(coef.LS.uncont[,1],coef.rob.uncont[,1],coef.LS.cont[,1],
coef.rob.cont[,1], main = expression(alpha),
col = colors[c(5,1,5,1)],
cex.main = 1.5, xaxt = "n")
axis(1, labels = FALSE)
text(x = seq_along(names), y = par("usr")[3] - 0.15, srt = 45, adj = 1,
labels = names, xpd = TRUE)
abline(h = alpha, lwd = 2)
boxplot(coef.LS.uncont[,2],coef.rob.uncont[,2],coef.LS.cont[,2],
coef.rob.cont[,2], main = expression(beta),
col = colors[c(5,1,5,1)],
cex.main = 1.5, xaxt = "n")
axis(1, labels = FALSE)
text(x = seq_along(names), y = par("usr")[3] - 0.015, srt = 45, adj = 1,
labels = names, xpd = TRUE)
abline(h = beta, lwd = 2)
```

It can be seen that, as underlined earlier, that the estimations resulting from the two methods appear to be quite similar, with the robust estimator performing slightly less efficiently than the classical estimator. However, under the contaminated setting it is evident how the robust estimation procedure is not affected much by the outliers in the simulated data while the classical techniques appear to be highly biased. Therefore, if there appear to be outliers in the data, a robust estimation procedure may considered preferable. In fact, the results of the two estimations could be compared to understand if there appears to be some deviation from the model assumptions.

**Example B.3**A practical example which underlines the usefulness of robust estimation procedures is given by the dataset that contains the properties of 47 stars from the Hertzsprung-Russell diagram of the star cluster CYG OB1 in the direction of Cygnus. From the plot it can be observed that there appears to be a cluster of four stars on the upper left hand-side of the plot. The rest of the data however appears to have a reasonably good linear behavior.

```
colors = gg_color_hue(6)
data(starsCYG, package = "robustbase")
par(mfrow = c(1,1))
plot(NA, xlim = range(starsCYG$log.Te) + c(-0.1,0.1),
ylim = range(starsCYG$log.light),
xlab = "Temperature at the surface of the star (log)",
ylab = "Light intensity (log)")
grid()
points(starsCYG, col = colors[4], pch = 16)
LS = lm(starsCYG$log.light ~ starsCYG$log.Te)
rob = lmrob(starsCYG$log.light ~ starsCYG$log.Te)
x = seq(from = min(starsCYG$log.Te)-0.3,
to = max(starsCYG$log.Te)+0.3, length.out = 4)
fit.LS = coef(LS)[1] + x*coef(LS)[2]
fit.rob = coef(rob)[1] + x*coef(rob)[2]
lines(x, fit.LS, col = colors[5])
lines(x, fit.rob, col = colors[1])
legend("topright",c("Observations","Estimated model (least-squares)",
"Estimated model (robust)"), lwd = c(NA,1,1),
col = colors[c(4,5,1)], lty = c(NA,1,1),
bty = "n", pch = c(16,NA,NA))
```

From Figure B.2 it can be seen that classical linear regression is considerably affected by the cloud of apparent outliers on the left hand side of the plot. As a result, the regression line has a negative slope when the majority of the data appears to have a clear positive linear trend. The latter is however the case when using a robust approach which indeed detects a positive linear trend. Having said this, there are in fact two distinct populations in the data consisting in “giants” (upper left corner) and “main sequencers” (right handside). Thus, when observing data in general it is always necessary to understand if we are dealing with real outliers (unexplained outlying data) or simply with data that can be explained by other factors (e.g. a different population as in this example).

`summary(LS)`

```
##
## Call:
## lm(formula = starsCYG$log.light ~ starsCYG$log.Te)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1052 -0.5067 0.1327 0.4423 0.9390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.7935 1.2365 5.494 1.75e-06 ***
## starsCYG$log.Te -0.4133 0.2863 -1.444 0.156
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5646 on 45 degrees of freedom
## Multiple R-squared: 0.04427, Adjusted R-squared: 0.02304
## F-statistic: 2.085 on 1 and 45 DF, p-value: 0.1557
```

`summary(rob)`

```
##
## Call:
## lmrob(formula = starsCYG$log.light ~ starsCYG$log.Te)
## \--> method = "MM"
## Residuals:
## Min 1Q Median 3Q Max
## -0.80959 -0.28838 0.00282 0.36668 3.39585
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.9694 3.4100 -1.457 0.15198
## starsCYG$log.Te 2.2532 0.7691 2.930 0.00531 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Robust residual standard error: 0.4715
## Multiple R-squared: 0.3737, Adjusted R-squared: 0.3598
## Convergence in 15 IRWLS iterations
##
## Robustness weights:
## 4 observations c(11,20,30,34) are outliers with |weight| = 0 ( < 0.0021);
## 4 weights are ~= 1. The remaining 39 ones are summarized as
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6533 0.9171 0.9593 0.9318 0.9848 0.9986
## Algorithmic parameters:
## tuning.chi bb tuning.psi refine.tol
## 1.548e+00 5.000e-01 4.685e+00 1.000e-07
## rel.tol scale.tol solve.tol eps.outlier
## 1.000e-07 1.000e-10 1.000e-07 2.128e-03
## eps.x warn.limit.reject warn.limit.meanrw
## 8.404e-12 5.000e-01 5.000e-01
## nResample max.it best.r.s k.fast.s k.max
## 500 50 2 1 200
## maxit.scale trace.lev mts compute.rd fast.s.large.n
## 200 0 1000 0 2000
## psi subsampling cov
## "bisquare" "nonsingular" ".vcov.avar1"
## compute.outlier.stats
## "SM"
## seed : int(0)
```

### C Proofs

Ronchetti, E.M. 2006. “The Historical Development of Robust Statistics.”

Duncan, I., and S. Guerrier. 2016. “Member Plan Choice and Migration in Response to Changes in Member Premiums after Massachusetts Health Insurance Reform.” *North American Actuarial Journal*.

Huber, P.J., and E.M. Ronchetti. 2009. *Robust Statistics*. John Wiley & Sons Inc.

Anscombe, F.J., and I. Guttman. 1960. “Rejection of Outliers.” *Technometrics* 2 (2). The American Society for Quality Control; The American Statistical Association: 123–47.

Hampel, F.R., E.M. Ronchetti, P.J. Rousseeuw, and W.A. Stahel. 1987. *Robust Statistics: the Approach Based on Influence Functions*. Wiley New York.

Hampel, F.R. 1968. “Contributions to the Theory of Robust Estimation.” PhD thesis, University of California, Berkeley.

Huber, P.J. 1973. “Robust Regression: Asymptotics, Conjectures and Monte Carlo.” *The Annals of Statistics* 1 (5). Institute of Mathematical Statistics: 799–821.

Hampel, F.R. 1968. “Contributions to the Theory of Robust Estimation.” PhD thesis, University of California, Berkeley.

1973. “Robust Estimation: A Condensed Partial Survey.”*Probability Theory and Related Fields*27 (2). Springer: 87–104.

Krasker, W.S. 1980. “Estimation in Linear Regression Models with Disparate Data Points.” *Econometrica* 48 (6). The Econometric Society: 1333–46.

\(M\)-estimators are obtained as the minima of sums of functions of the data.↩