Robust Statistical Methods with R, 1st Edition. Robust statistical methods were developed to supplement the classical procedures when the data violate classical assumptions.
Trimmed estimators and Winsorised estimators are general methods for making statistics more robust. L-estimators are a general class of simple statistics, often robust, while M-estimators are a general class of robust statistics that are now the preferred solution, though they can be quite involved to calculate. There are various definitions of a "robust statistic".
Usually, a statistic is considered robust if it performs well when the data are drawn from a distribution that is close to, but not exactly, the assumed model. This means that if the assumptions are only approximately met, the robust estimator will still have a reasonable efficiency, and reasonably small bias, as well as being asymptotically unbiased, meaning having a bias tending towards 0 as the sample size tends towards infinity. One of the most important cases is distributional robustness: insensitivity to small departures from the assumed distribution, which in practice usually show up as outliers. Thus, in the context of robust statistics, distributionally robust and outlier-resistant are effectively synonymous. A related topic is that of resistant statistics, which are resistant to the effect of extreme scores.
The example below concerns a classic speed-of-light data set. The data can be found via the Classic data sets page, and the source book's website contains more information on the data. Although the bulk of the data look to be more or less normally distributed, there are two obvious outliers.
These outliers have a large effect on the mean, dragging it towards them, and away from the center of the bulk of the data. Thus, if the mean is intended as a measure of the location of the center of the data, it is, in a sense, biased when outliers are present. Also, the distribution of the mean is known to be asymptotically normal due to the central limit theorem. However, outliers can make the distribution of the mean non-normal even for fairly large data sets.
Besides this non-normality, the mean is also inefficient in the presence of outliers, and less variable measures of location are available. The plot below shows a density plot of the speed-of-light data, together with a rug plot (panel (a)). Also shown is a normal Q-Q plot (panel (b)). The outliers are clearly visible in these plots.
The analysis was performed in R, and 10,000 bootstrap samples were used for each of the raw and trimmed means.
Also note that whereas the distribution of the trimmed mean appears to be close to normal, the distribution of the raw mean is quite skewed to the left. So, in this sample of 66 observations, only 2 outliers cause the central limit theorem to be inapplicable. Robust statistical methods, of which the trimmed mean is a simple example, seek to outperform classical statistical methods in the presence of outliers, or, more generally, when underlying parametric assumptions are not quite correct. Whilst the trimmed mean performs well relative to the mean in this example, better robust estimates are available.
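As a sketch of the comparison (using simulated data, not the actual speed-of-light measurements), the trimmed mean is available in base R via the trim argument of mean():

```r
## Simulated sketch: a normal bulk of data plus two gross low outliers.
set.seed(1)
x <- c(rnorm(64, mean = 27, sd = 5), -44, -2)

mean(x)              # raw mean, dragged towards the outliers
mean(x, trim = 0.1)  # 10% trimmed mean: discards the extreme 10% in each tail
median(x)            # median, for comparison
```

The trim argument gives the fraction trimmed from each end of the sorted sample, so trim = 0.1 here removes the two outliers along with five legitimate observations from each tail.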
In fact, the mean, median and trimmed mean are all special cases of M-estimators.
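For instance (a sketch, assuming the recommended MASS package is available), a Huber M-estimate of location can be obtained as:

```r
## Huber M-estimate of location; k tunes the efficiency/robustness trade-off.
library(MASS)

set.seed(1)
x <- c(rnorm(50), 10, 12)  # illustrative data with two high outliers
huber(x, k = 1.5)$mu       # M-estimate; compare with mean(x) and median(x)
```

As k grows the Huber estimate approaches the mean; as k shrinks it approaches the median, illustrating how all three sit in the M-estimator family.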
Based on more than a decade of teaching and research experience, Robust Statistical Methods with R offers a thorough, detailed overview of robust procedures. There is a rapidly increasing number of books with titles of the form "Something with R", where "Something" is some area of statistics.
Details appear in the sections below. The outliers in the speed-of-light data have more than just an adverse effect on the mean. The usual estimate of scale is the standard deviation, and this quantity is even more badly affected by outliers, because the squared deviations from the mean enter the calculation, so the outliers' effects are exacerbated. The plots below show the bootstrap distributions of the standard deviation, the median absolute deviation (MAD), and the Qn estimator of scale.
Panel (a) shows the distribution of the standard deviation, panel (b) that of the MAD, and panel (c) that of Qn. The distribution of the standard deviation is erratic and wide, a result of the outliers. This simple example demonstrates that, when outliers are present, the standard deviation cannot be recommended as an estimate of scale. Traditionally, statisticians would manually screen data for outliers and remove them, usually checking the source of the data to see whether the outliers were erroneously recorded. Indeed, in the speed-of-light example above, it is easy to see and remove the two outliers prior to proceeding with any further analysis.
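A minimal sketch of these three scale estimates on contaminated data, assuming the robustbase package is installed for Qn():

```r
## Classical versus robust scale estimates; the data are illustrative.
library(robustbase)

set.seed(1)
x <- c(rnorm(98), 15, -20)  # normal bulk plus two gross outliers

sd(x)   # standard deviation: badly inflated by the outliers
mad(x)  # median absolute deviation, scaled for consistency at the normal
Qn(x)   # Rousseeuw-Croux Qn estimator of scale
```

On such data, mad() and Qn() stay near the true scale of 1 while sd() does not.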
However, in modern times, data sets often consist of large numbers of variables being measured on large numbers of experimental units. Therefore, manual screening for outliers is often impractical. Outliers can often interact in such a way that they mask each other. As a simple example, consider a small univariate data set containing one modest and one large outlier. The estimated standard deviation will be grossly inflated by the large outlier. The result is that the modest outlier looks relatively normal. As soon as the large outlier is removed, the estimated standard deviation shrinks, and the modest outlier now looks unusual.
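The masking effect just described can be sketched in a few lines of base R (the data are illustrative):

```r
## Masking: a large outlier inflates sd(), hiding a modest outlier.
set.seed(1)
x <- c(rnorm(20), 4, 15)   # modest outlier (4) and large outlier (15)

(4 - mean(x)) / sd(x)      # z-score of the modest outlier: unremarkable

x2 <- x[x != 15]           # remove the large outlier ...
(4 - mean(x2)) / sd(x2)    # ... and the modest outlier now stands out
```

The modest outlier's z-score roughly triples once the large outlier is removed, exactly the shrinking-standard-deviation effect described above.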
This problem of masking gets worse as the complexity of the data increases. For example, in regression problems, diagnostic plots are used to identify outliers. However, it is common that once a few outliers have been removed, others become visible. The problem is even worse in higher dimensions. Robust methods provide automatic ways of detecting, downweighting (or removing), and flagging outliers, largely removing the need for manual screening.
Care must be taken; initial data showing the ozone hole first appearing over Antarctica were rejected as outliers by non-human screening. Although this article deals with general principles for univariate statistical methods, robust methods also exist for regression problems, generalized linear models, and parameter estimation of various distributions. The basic tools used to describe and measure robustness are the breakdown point, the influence function, and the sensitivity curve. Intuitively, the breakdown point of an estimator is the proportion of incorrect observations (e.g., arbitrarily large observations) an estimator can handle before giving an incorrect (e.g., arbitrarily large) result.
The higher the breakdown point of an estimator, the more robust it is. The maximum breakdown point is 0.5, since once more than half the observations are contaminated, no estimator can distinguish the good data from the bad. For example, the median has a breakdown point of 0.5. Statistics with high breakdown points are sometimes called resistant statistics. In the speed-of-light example, removing the two lowest observations changes the mean noticeably, whereas the median is hardly affected.

We structure the packages roughly into the following topics, and typically will first mention functionality in packages robustbase and robust.
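A one-line illustration of the difference in breakdown points, on toy data:

```r
## Corrupting a single observation breaks the mean but barely moves the median.
x     <- 1:10
x_bad <- c(1:9, 1e6)       # replace one value by an arbitrarily large one

mean(x);   mean(x_bad)     # the mean is carried off with the bad value
median(x); median(x_bad)   # the median is essentially unchanged
```

One arbitrarily bad observation can move the mean arbitrarily far (breakdown point 0), while the median tolerates corruption of up to half the sample.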
The ltsReg and lmrob.S functions are available in robustbase, but rather for comparison purposes. Note that Koenker's quantile regression package quantreg contains L1 (aka LAD, least absolute deviations) regression as a special case, doing so also for nonparametric regression via splines. Quantile regression (and hence L1 or LAD) for mixed effect models is available in package lqmm, whereas an MM-like approach for robust linear mixed effects modeling is available from package robustlmm.
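As a sketch of the L1 special case (assuming quantreg is installed), median regression is quantile regression at tau = 0.5, shown here on the built-in stackloss data:

```r
## L1 / LAD regression as median (tau = 0.5) quantile regression.
library(quantreg)

fit <- rq(stack.loss ~ ., tau = 0.5, data = stackloss)
coef(fit)  # compare with coef(lm(stack.loss ~ ., data = stackloss))
```

Because it minimizes absolute rather than squared residuals, the LAD fit is far less sensitive to outlying responses than least squares.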
Package mblm's function mblm fits median-based (Theil-Sen or Siegel's repeated) simple linear models. Package TEEReg provides trimmed elemental estimators for linear models. Generalized linear models (GLMs) are provided both via glmrob (robustbase) and glmRob (robust), where package robustloggamma focuses on generalized log-gamma models. Robust nonlinear model fitting is available through robustbase's nlrob. Here, the rrcov package, which builds ("Depends") on robustbase, provides nice S4-class-based methods, more methods for robust multivariate variance-covariance estimation, and adds robust PCA methodology.
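A small sketch of robust versus classical linear regression on simulated data (robustbase assumed installed; the data and formula are illustrative):

```r
## MM-regression via robustbase::lmrob() versus ordinary least squares.
library(robustbase)

set.seed(2)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30)
y[29:30] <- y[29:30] + 25   # plant two gross vertical outliers

coef(lm(y ~ x))      # least squares: pulled towards the outliers
coef(lmrob(y ~ x))   # MM-estimator: stays close to the true (2, 0.5)
```

The same pattern carries over to GLMs, where glmrob() plays the role of glm().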
It is extended by rrcovNA, providing robust multivariate methods for incomplete or missing (NA) data, and by rrcovHD, providing robust multivariate methods for high-dimensional data. High-dimensional data, with an emphasis on functional data, are also treated robustly by roahd. Historically, note that robust PCA can be performed by using standard R's princomp, e.g., by supplying a robust covariance matrix (such as one from MASS::cov.rob) via its covmat argument. GSE estimates multivariate location and scatter in the presence of missing data.
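For example (a sketch using the hbk data set shipped with robustbase), robust multivariate location and scatter can be estimated with the Minimum Covariance Determinant (MCD):

```r
## MCD estimate of multivariate location and scatter.
library(robustbase)

data(hbk)                  # Hawkins-Bradu-Kass data with known outliers
mcd <- covMcd(hbk[, 1:3])  # robust location and covariance estimate
mcd$center                 # compare with colMeans(hbk[, 1:3])
```

The MCD-based distances computed by covMcd are also the usual starting point for flagging multivariate outliers that classical Mahalanobis distances would mask.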