# diagnostics

## Overview

The `diagnostics` module has classes and functions to examine the fit of OLS models and the extreme observations in datasets.

The main class is `BadApples`, which consumes an OLS model object and is used to examine the outliers, high-leverage points and influential points in a model. In essence it is used to examine the 'bad apples' that may be stinking up a model's results.
The main methods are:

- `variance_inflation_factors`
- `heteroskedasticity_test`
- `partial_regression_plot`
- `wald_test`
There are also methods for diagnostic plots, such as `pp_plot`, but they are exposed more conveniently as methods on an OLS model object:

- `pp_plot`: P-P plot
- `qq_plot`: Q-Q plot
- `rvf_plot`: plot of residuals against fitted values
- `rvp_plot`: plot of residuals against values of a predictor
## BadApples

The 10 Minutes To Appelpy notebook fits a `BadApples` instance, consuming a model of the California Test Score dataset.

- Interactive experience of the 10 Minutes to Appelpy tutorial via Binder.
- Static render of the 10 Minutes to Appelpy notebook.

```python
from appelpy.diagnostics import BadApples

bad_apples = BadApples(model_hc1).fit()
```
### Attributes

- Measures: `measures_influence`, `measures_leverage` and `measures_outliers`.
- Indices: `indices_high_influence`, `indices_high_leverage` and `indices_outliers`.
INFLUENCE:

- `dfbeta` (one per independent variable): DFBETA diagnostic. Extreme if |val| > 2 / sqrt(n)
- `dffits`: DFFITS diagnostic. Extreme if |val| > 2 * sqrt(k / n)
- `cooks_d`: Cook's distance. Extreme if val > 4 / n

LEVERAGE:

- `leverage`: value from the hat matrix diagonal. Extreme if val > (2*k + 2) / n

OUTLIERS:

- `resid_standardized`: standardized residual. Extreme if |val| > 2, i.e. approx. 5% of observations will be flagged.
- `resid_studentized`: studentized residual. Extreme if |val| > 2, i.e. approx. 5% of observations will be flagged.
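To build intuition for the leverage and influence measures, here is a minimal numpy sketch (illustrative only, not appelpy's implementation) that computes the hat-matrix diagonal and Cook's distance for a dataset with one deliberately extreme x-value, flagging observations above the `(2*k + 2) / n` leverage threshold:

```python
import numpy as np

# Illustrative sketch, not appelpy's implementation.
rng = np.random.default_rng(3)
n, k = 50, 2                                   # n observations, k regressors
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
X[0, 1] = 8.0                                  # plant one high-leverage point
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)          # hat (projection) matrix
leverage = np.diag(H)                          # hat matrix diagonal

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
p = X.shape[1]                                 # parameters incl. intercept
mse = resid @ resid / (n - p)
# Cook's distance combines residual size and leverage
cooks_d = (resid ** 2 / (p * mse)) * leverage / (1 - leverage) ** 2

# Flag observations above the (2*k + 2) / n leverage threshold
high_leverage = np.where(leverage > (2 * k + 2) / n)[0]
print(high_leverage)
```

The planted observation (index 0) dominates the leverage values; note also that the leverage values always sum to the number of model parameters, since the hat matrix is a projection.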
### Methods

The `plot_leverage_vs_residuals_squared` method plots leverage values (y-axis) against the residuals squared (x-axis). The plot can be annotated with the index values.
## Variance inflation factors

The `variance_inflation_factors` method takes a dataframe and calculates the variance inflation factors of its regressors.
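Under the hood, the VIF for regressor j is 1 / (1 − R²_j), where R²_j comes from regressing that column on all the other regressors. A self-contained numpy sketch of that calculation (illustrative only, not appelpy's implementation):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the regressor matrix X."""
    n, k = X.shape
    ones = np.ones((n, 1))
    vifs = []
    for j in range(k):
        y = X[:, j]
        # Regress column j on the remaining columns (with a constant)
        others = np.hstack([ones, np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)              # independent of the others
X = np.column_stack([x1, x2, x3])
print(vif(X))  # x1 and x2 get large VIFs; x3 stays near 1
```

A common rule of thumb treats a VIF above 10 as a sign of problematic multicollinearity.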
## Heteroskedasticity test

The `heteroskedasticity_test` method takes an OLS model object and returns the results of a heteroskedasticity test (the test statistic and p-value). Examples of heteroskedasticity tests include:

- Breusch-Pagan test (`breusch_pagan`)
- Breusch-Pagan studentized test (`breusch_pagan_studentized`)
- White test (`white`)
The 10 Minutes To Appelpy notebook shows the results of heteroskedasticity tests, given a model fitted to the California Test Score dataset.

- Interactive experience of the 10 Minutes to Appelpy tutorial via Binder.
- Static render of the 10 Minutes to Appelpy notebook.
Here is a code snippet for a heteroskedasticity test:

```python
from appelpy.diagnostics import heteroskedasticity_test

ep, pval = heteroskedasticity_test('breusch_pagan_studentized', model_nonrobust)
print('Breusch-Pagan test (studentized)')
print('Test statistic: {:.4f}'.format(ep))
print('Test p-value: {:.4f}'.format(pval))
```
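For intuition about what the studentized (Koenker) Breusch-Pagan test measures: it is n·R² from regressing the squared OLS residuals on the regressors, which is asymptotically chi-squared under homoskedasticity. A self-contained numpy sketch (illustrative only, not appelpy's implementation), applied to data whose error variance grows with x:

```python
import numpy as np

def breusch_pagan_studentized(resid, X):
    """Koenker's studentized Breusch-Pagan LM statistic: n * R^2 from
    regressing squared residuals on the regressors (plus a constant).
    Asymptotically chi2 with k degrees of freedom under homoskedasticity."""
    n = len(resid)
    u2 = resid ** 2
    Z = np.column_stack([np.ones(n), X])
    gamma, *_ = np.linalg.lstsq(Z, u2, rcond=None)
    fitted = Z @ gamma
    r2 = 1.0 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)
    return n * r2

# Simulated data that is heteroskedastic by construction
rng = np.random.default_rng(42)
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n) * np.exp(0.5 * x)  # error sd grows with x

Z = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
resid = y - Z @ beta
lm = breusch_pagan_studentized(resid, x.reshape(-1, 1))
print(lm)  # well above the chi2(1) 5% critical value of 3.84
```

A large statistic relative to the chi-squared critical value leads to rejecting the null of homoskedasticity, which is the cue to use robust standard errors.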
## Partial regression plot

Also known as the added variable plot, the partial regression plot shows the effect of adding another regressor (independent variable) to a regression model.

The method requires these parameters:

- `appelpy_model_object`: a fitted OLS model object.
- `df`: the dataframe used in the model.
- `regressor`: the additional variable in the partial regression.
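The plot rests on the Frisch-Waugh-Lovell idea: residualize both the response and the added regressor on the existing regressors, then plot one set of residuals against the other; the slope of that cloud equals the added regressor's coefficient in the full model. A numpy sketch of the algebra (variable names are illustrative, not appelpy's API):

```python
import numpy as np

def ols_resid(y, X):
    """Residuals from regressing y on X (with a constant)."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return y - Z @ beta

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)                 # correlated regressors
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

# Residualize y and the added regressor x2 on the existing regressor x1
ey = ols_resid(y, x1.reshape(-1, 1))
ex2 = ols_resid(x2, x1.reshape(-1, 1))

# Slope of ey on ex2 equals x2's coefficient in the full y ~ x1 + x2 model
slope = (ex2 @ ey) / (ex2 @ ex2)
print(slope)
```

Plotting `ey` against `ex2` is exactly the added variable plot: a clear linear pattern indicates the candidate regressor adds explanatory power beyond the existing regressors.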
## Wald test

The Wald test lets you do joint testing of hypotheses, e.g.

- Are the coefficients of dummy columns for a categorical variable significantly different from 0?
- Is the difference between two regressor coefficients, e.g. `beta_u - beta_v`, significantly different from 2?

Pass a list of variables to the function for straightforward joint hypothesis testing of whether a coefficient is significantly different from zero.

Pass a dict for testing of hypotheses against non-zero scalars, where the values are scalars and the keys are either strings or two-item tuples, e.g.

```python
hypotheses_object = {('col_u', 'col_v'): 2}  # for the difference between two coefficients
hypotheses_object = {'col_a': 2}  # for a single coefficient
```
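For intuition, a linear hypothesis `R @ beta = r` is tested with the statistic `W = (R b - r)' [R V R']^{-1} (R b - r)`, where `b` is the estimate and `V` its covariance matrix; under the null, W is asymptotically chi-squared with one degree of freedom per restriction. A self-contained numpy sketch (illustrative only, not appelpy's implementation) for the hypothesis that the difference between two coefficients equals 2:

```python
import numpy as np

# Simulate data where the true coefficient difference beta1 - beta2 = 2
rng = np.random.default_rng(7)
n = 400
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 3.0 * x1 + 1.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
V = sigma2 * np.linalg.inv(X.T @ X)            # covariance of the estimates

# H0: beta1 - beta2 = 2, written as R @ beta = r
R = np.array([[0.0, 1.0, -1.0]])
r = np.array([2.0])
diff = R @ beta - r
W = diff @ np.linalg.solve(R @ V @ R.T, diff)  # chi2(1) under H0
print(W)  # small here, since H0 is true in the simulated data
```

Because the null holds in this simulation, W stays near the centre of the chi2(1) distribution rather than out in its rejection region.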