Mon Jan 18, 2016 at 11:37   with lessR version 3.4.1

Output Options: explain=TRUE, interpret=TRUE, results=TRUE, document=TRUE, code=TRUE


Multiple Regression of Reading

The variable of primary interest is the response variable, Reading, labeled Reading Ability. The purpose of this analysis is to account for the values of Reading in terms of the values of the predictor variables Verbal, Absent and Income.

The Data

Read the data with the lessR function Read. The corresponding data values for the variables in the model comprise the training data, from which the model is estimated.

mydata <- rd("Reading", format = "lessR")
## 
## Data File:   Reading 
## ----------------------------------------------------------
## Dimensions: 4 variables over 100 rows of data
## 
## First two row names: 1     2 
## Last two row names:  99     100 
## ----------------------------------------------------------
## 
## 
## Data Types
## ------------------------------------------------------------
## integer: Numeric data values, but integers only
## numeric: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
## 
##  Variable                  Missing  Unique 
##      Name     Type  Values  Values  Values   First and last values
## ----------------------------------------------------------------------------------------
##   Reading   integer    100       0      45   73  77  42 ... 51  77  67
##    Verbal   integer    100       0      49   64  80  54 ... 39  87  74
##    Absent   integer    100       0      12   4  2  9 ... 7  0  5
##    Income   integer    100       0      70   36  103  4 ... 26  70  28
## ----------------------------------------------------------------------------------------
## 
## 
## No missing data
## 
## 
## Variable Names   Variable Labels
## -------------------------
## Reading   Reading Ability 
## Verbal    Verbal Aptitude 
## Absent    Number of Absences 
## Income    Family Income ($1000) 
## -------------------------
## 
## No variable units

Data from the following variables are available for analysis: Reading, Verbal, Absent and Income.

The Model

Specified Model

Express Reading as a linear function of three predictor variables: Verbal, Absent and Income. Within the context of the model, indicate the response variable with a Y, subscripted by the variable name, \(Y_{Reading}\). Write each predictor variable as a subscript to an X. From the training data compute \(\hat Y_{Reading}\), the fitted value of the response variable from the model for a specific set of values for Verbal, Absent and Income. \[\hat Y_{Reading} = b_0 + b_1 X_{Verbal} + b_2 X_{Absent} + b_3 X_{Income}\] The intercept, \(b_0\), indicates the fitted value of Reading when the values of Verbal, Absent and Income all equal zero. Each slope coefficient, \(b_1\) through \(b_3\), is the average change in the value of the response variable, Reading, for a one-unit increase in the value of the corresponding predictor variable, with the values of all remaining predictor variables held constant. The values of these estimated coefficients apply only to the interpretation of the training data from which they were estimated.

To compute \(\hat Y_{Reading}\) from a specific set of values for Verbal, Absent and Income requires the estimated values of the coefficients of the model, the values of each regression coefficient, \(b_j\). This estimation procedure depends on the residual, the difference between the actual value of Reading for each row of data and the corresponding value fitted by the model. Define the residual as a variable across all rows of data. Use the subscript i to indicate the \(i^{th}\) row of data to emphasize that the expression applies to each row of training data. The name of the response variable in this notation is understood and so is omitted for simplicity. \[e_i = Y_i - \hat Y_i\] Estimate the coefficients with ordinary least squares (OLS), which provides the one set of estimates that minimize the sum of the squared residuals, \(\sum e^2_i\), across all the rows of the training data.
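
As a concrete sketch of this estimation, the normal equations can be solved directly in base R, assuming the data have been read into mydata as above; the lm() call at the end serves only as a check that the same least squares estimates result.

X <- cbind(1, mydata$Verbal, mydata$Absent, mydata$Income)  # design matrix
y <- mydata$Reading
b <- solve(crossprod(X), crossprod(X, y))  # b = (X'X)^(-1) X'y
e <- y - X %*% b                           # residuals, one per row
sum(e^2)                                   # SS_Residual, the value OLS minimizes
coef(lm(Reading ~ Verbal + Absent + Income, data = mydata))  # same estimates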

Accomplish the estimation and related computations with the lessR function Regression, abbreviated as reg. Keep the graphics separate from the text output, so generate them later.

r <- reg(Reading ~ Verbal + Absent + Income, graphics = FALSE)

The output begins with a specification of the variables in the model and a brief description of the data.

r$out_background
## Data Frame:  mydata 
##  
## Response Variable: Reading, Reading Ability 
## Predictor Variable 1: Verbal, Verbal Aptitude 
## Predictor Variable 2: Absent, Number of Absences 
## Predictor Variable 3: Income, Family Income ($1000) 
##  
## Number of cases (rows) of data:  100 
## Number of cases retained for analysis:  100

Of the 100 cases presented for analysis, 100 are retained, so the number of deleted cases due to missing data is 0.

Estimated Model

The analysis of the model begins with the estimation of each sample regression coefficient, \(b_j\), from the training data. Of greater interest is each corresponding population value, \(\beta_j\), in the population model. \[Y_{Reading} = \beta_0 + \beta_1 X_{Verbal} + \beta_2 X_{Absent} + \beta_3 X_{Income} + \epsilon\] The associated inferential analyses for each estimate are the hypothesis test and confidence interval. Each t-test evaluates the null hypothesis that the corresponding individual population regression coefficient is 0, here for the \(j^{th}\) coefficient. \[H_0: \beta_j=0\] \[H_1: \beta_j \ne 0\] The confidence interval provides the range of likely values of the corresponding \(\beta_j\). Each 95% confidence interval is the margin of error on either side of the corresponding estimated intercept or slope coefficient, \(b_j\).

r$out_estimates
##              Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95% 
## (Intercept)     64.44       8.35    7.713    0.000       47.85       81.02 
##      Verbal      0.20       0.10    2.120    0.037        0.01        0.39 
##      Absent     -2.04       0.49   -4.176    0.000       -3.01       -1.07 
##      Income      0.03       0.04    0.912    0.364       -0.04        0.11

This estimated model is the specific linear function that yields a fitted value of Reading from the provided values of Verbal, Absent and Income. \[\hat Y_{Reading} = 64.44 + 0.20 X_{Verbal}- 2.04 X_{Absent}+ 0.03 X_{Income}\] One predictor variable has a p-value larger than \(\alpha\) = 0.05: Income. The null hypothesis of no relationship could not be rejected, so there is a reasonable possibility that the predictor variable may not contribute to explaining the values of Reading, with the values of all remaining predictor variables held constant.
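
For example, applying the rounded coefficients to the first row of the training data (Verbal=64, Absent=4, Income=36, with an actual Reading of 73) yields the fitted value directly.

# fitted Reading for Verbal=64, Absent=4, Income=36, the first row of the data
64.44 + 0.20*64 - 2.04*4 + 0.03*36   # 70.16, versus the actual value of 73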

These predictor variables each have a p-value less than or equal to \(\alpha\) = 0.05: Verbal and Absent. The remainder of this analysis further explores the possibility that these two variables alone may form an equally effective but more parsimonious model, in terms of their cumulative contribution to explaining the values of Reading, compared to the current model with three predictor variables.

To extend the results beyond this sample to the population from which the sample was obtained, interpret the meaning of these corresponding coefficients in terms of their confidence intervals.

  • Verbal: With 95% confidence, for each additional unit of the value of Verbal Aptitude, on average, Reading Ability changes somewhere between 0.01 and 0.39, with the values of Absent and Income held constant.
  • Absent: With 95% confidence, for each additional unit of the value of Number of Absences, on average, Reading Ability changes somewhere between -3.01 and -1.07, with the values of Verbal and Income held constant.

Model Fit

An estimated model is not necessarily a useful model. A preliminary consideration is the fit of the model, based on the extent that the values of Reading fitted by the model to the training data match the corresponding training data values of Reading. Are the residuals typically close to their mean of zero, or are they scattered about the regression surface with relatively large positive and negative values?

Partitioning of Variance

The analysis of fit depends on the adequacy of the model to account for the variability of the data values of Reading, expressed in model notation as \(Y_{Reading}\). The core component of variability is the sum of squares, short for the sum of some type of squared deviations. The total variability of \(Y_{Reading}\) depends on the deviations of its data values from its mean, \(Y_{Reading} - \bar Y_{Reading}\), and the resulting sum of squares, \(SS_{Reading}\).

The analysis of the residuals, \(e = Y_{Reading} - \hat Y_{Reading}\), follows from the corresponding sum of squares, the value minimized by the least squares estimation procedure, \(\sum e^2_i\) = \(SS_{Residual}\). This residual sum of squares represents variation of \(Y_{Reading}\) not accounted for by \(\hat Y_{Reading}\). The complement to the residual sum of squares is the Model (or Regression) sum of squares, the deviations of the fitted values about the mean, \(\hat Y_{Reading} - \bar Y_{Reading}\).

The analysis of variance (ANOVA) partitions this total sum of squares into the residual variability, \(\sum e^2_i\), and the Model sum of squares, \(SS_{Model}\). The ANOVA table displays these various sources of variation.

r$out_anova
##             df    Sum Sq   Mean Sq   F-value   p-value 
##    Verbal    1   5134.43   5134.43     49.43     0.000 
##    Absent    1   2861.27   2861.27     27.54     0.000 
##    Income    1     86.40     86.40      0.83     0.364 
## Model        3   8082.10   2694.03     25.93     0.000 
##  
## Residuals   96   9972.65    103.88 
##  
## Reading     99  18054.75    182.37

\[SS_{Reading} = SS_{Model} + SS_{Residual} = 8,082.10 + 9,972.65 = 18,054.75 \]

The ANOVA further partitions the overall model sum of squares by predictor variable. \[SS_{Model} = SS_{Verbal} + SS_{Absent} + SS_{Income} = 5,134.43 + 2,861.27 + 86.40 = 8,082.10\]

The sum of squares for a predictor variable is called a sequential sum of squares. It represents the effect of a predictor variable after the effects of all previously entered variables in the model have already been accounted for. Unless the predictor variables are uncorrelated, its value depends on the order of the variables as specified in the model, as the sketch below illustrates. Usually the interpretation of a sequential effect is more useful when the variables are entered in order of perceived importance.
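
To see this order dependence directly, compare the sequential (Type I) sums of squares from base R's anova() for two orderings of the same predictors; a minimal sketch, assuming the same mydata as read above.

anova(lm(Reading ~ Verbal + Absent + Income, data = mydata))  # order as specified
anova(lm(Reading ~ Income + Absent + Verbal, data = mydata))  # reordered predictors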

Progressing through the table of the sequential sums of squares for each predictor variable, from the first entry, Verbal, through the last entry, Income, forms a sequence of increasingly larger nested models that successively contain more variables. For example, the p-value of 0.364 for the last variable entered into the model, Income, is the same in both the ANOVA table and the table of regression coefficients because in both situations the effects of all other predictor variables are partialled out.

This fundamental relationship of these various sums of squares components provides the basis for assessing fit.

Fit Indices

From the ANOVA, two primary types of fit indices are derived: the standard deviation of the residuals and several \(R^2\) statistics.

r$out_fit
## Standard deviation of residuals:  10.19 for 96 degrees of freedom 
##  
## R-squared:  0.448    Adjusted R-squared:  0.430    PRESS R-squared:  0.399 
## 
## Null hypothesis that all population slope coefficients are 0:
##   F-statistic: 25.934     df: 3 and 96     p-value:  0.000

The standard deviation of the residuals, \(s_e\), directly assesses the variability of the data values of Reading about the corresponding fitted values for the training data, the particular data set from which the model was estimated. Each mean square in the ANOVA table is a variance, a sum of squares divided by the corresponding degrees of freedom, df. By definition, the standard deviation, \(s_e\), is the square root of the mean square of the residuals. \[s_e = \sqrt{MS_{Residual}} = \sqrt{103.88} = 10.19\] To interpret \(s_e\) = 10.19, consider the estimated range of 95% of the values of a normally distributed variable, which depends on the corresponding 2.5% cutoff from the \(t\)-distribution for df=96: 1.985. \[95\% \; Range: 2 * t_{cutoff} * s_e = 2 * 1.985 * 10.19 = 40.46\]

This range of the residuals for the fitted values is the lower limit of the range of forecasting error presented later.
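
The arithmetic for this range follows directly in R, with qt() supplying the t-cutoff.

se <- sqrt(103.88)           # s_e, the square root of MS_Residual
t_cut <- qt(0.975, df = 96)  # 2.5% cutoff of the t-distribution, 1.985
2 * t_cut * se               # estimated 95% range of residuals, about 40.46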

A second type of fit index is \(R^2\), the proportion of the overall variability of response variable Reading that is accounted for by the model, applied to the training data, expressed either in terms of \(SS_{Residual}\) or \(SS_{Model}\). \[R^2 = 1 - \frac{SS_{Residual}}{SS_{Reading}} = \frac{SS_{Model}}{SS_{Reading}} = \frac{8,082.10} {18,054.75} = 0.448 \] Unfortunately, when any new predictor variable is added to a model, useful or not, \(R^2\) necessarily increases. Use the adjusted version, \(R^2_{adj}\), to more appropriately compare models estimated from the same training data with different numbers of predictors. \(R^2_{adj}\) helps to avoid overfitting a model because it only increases if a new predictor variable added to the model improves the fit more than would be expected by chance. The adjustment considers the number of predictor variables relative to the number of rows of data (cases). Accomplish this adjustment with the degrees of freedom, to shift from the Sum of Squares to the corresponding Mean Squares. \[R^2_{adj} = 1 - \frac{SS_{Residual} \; / \; 96}{SS_{Reading} \; / \; 99} = 1 - \frac{MS_{Residual}}{MS_{Reading}} = 1 - \frac{103.88} {182.37} = 0.430\] From this analysis compare \(R^2\) = 0.448 to the adjusted value of \(R^2_{adj}\) = 0.430, a difference of 0.017. A large difference indicates that the model contains too many predictor variables for the available data, that is, an overfitted model.
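
Both statistics reduce to a few lines of arithmetic on the sums of squares from the ANOVA table.

1 - 9972.65/18054.75               # R-squared: 0.448
1 - (9972.65/96)/(18054.75/99)     # adjusted R-squared: 0.430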

Both \(R^2\) and \(R^2_{adj}\) describe the fit of the model to the training data. Base the fit of the model to forecasts from new data on the predictive residual (PRE). To calculate this residual for a row of data (case), first estimate a model with the case deleted, that is, from only all the remaining cases in the training data, what is called a case-deletion statistic. Repeat for all rows of data. \(SS_{PRE}\), or PRESS, is the sum of squares of all the predictive residuals in a data set. From \(SS_{PRE}\) define the predictive \(R^2\). \[R^2_{PRESS} = 1 - \frac{SS_{PRE}}{SS_{Reading}} = 1 - \frac{10,842.49} {18,054.75} = 0.399 \]

Because an estimated model at least to some extent overfits the training data, the more useful \(R^2_{PRESS}\) = 0.399 is lower than both \(R^2\) and \(R^2_{adj}\).
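
The predictive residuals need not be computed by refitting the model 100 times: for least squares, the residual for case i from the model estimated with case i deleted equals \(e_i / (1 - h_i)\), where \(h_i\) is the leverage of the case. A base R sketch, assuming mydata as read above:

fit <- lm(Reading ~ Verbal + Absent + Income, data = mydata)
pre <- residuals(fit) / (1 - hatvalues(fit))                 # predictive residuals
SS_PRE <- sum(pre^2)                                         # PRESS, about 10,842.49
1 - SS_PRE / sum((mydata$Reading - mean(mydata$Reading))^2)  # predictive R-squared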

Hypothesis Tests of Multiple Coefficients

The ANOVA table presents the overall hypothesis test that evaluates if all the predictor variables as a set – Verbal, Absent and Income – are related to Reading as specified by the model. \[ \begin{aligned} H_0&: \;\beta_{Verbal} = \beta_{Absent} = \beta_{Income}= 0 \\\\ H_1&: \; at \; least \; one \; \beta_j \ne 0 \end{aligned} \] From the sums of squares for the Model and Residuals, with degrees of freedom of 3 and 96, the test statistic is F = 25.93, with a p-value of 0.000.

To help identify predictors that contribute little beyond those variables previously included in the model, generally list the more important variables first in the model specification. Add together the sequential sums of squares from the ANOVA table for variables listed last in the table to form a nested model. Then test whether the designated subset of regression coefficients all equal zero. To illustrate, consider the hypothesis test that the slope coefficients for the last two variables, Absent and Income, are both equal to zero. \[ \begin{aligned} H_0&: \;\beta_{Absent} = \beta_{Income}= 0 \\\\ H_1&: \; at \; least \; one \; of \; \beta_{Absent}, \beta_{Income} \ne 0 \end{aligned} \]

Compare two nested models with the lessR function Nest. Specify the response variable Reading, the variables in the reduced model, and then the additional variables in the full model. Nest also ensures that the same data values are compared when there is missing data that might otherwise leave more data in the analysis of the reduced model.

n <- Nest(Reading, c(Verbal), c(Absent, Income))

First verify that the reduced and full models are properly specified.

n$out_models
## Reduced Model: Reading ~ Verbal
## Full Model   : Reading ~ Verbal + Absent + Income

Compare nested models to evaluate whether the additional variables in the full model provide a detectable increase in fit beyond that of the reduced model. Evidence for retaining the reduced model is a non-significant test, which means that the coefficients of the additional variables in the full model are not detectably different from 0, and so perhaps those variables need not be included in the model.

n$out_anova
##            df    Sum Sq   Mean Sq  F-value   p-value 
## Tested      2   2947.67   1473.84    14.19     0.000 
## Residual   96   9972.65    103.88

Reject the null hypothesis that the tested regression coefficients equal 0 because of the small p-value of 0.000. Realize that if the reduced model were constructed solely from the regression output on the initial training data, and then analyzed with the same data, the analysis is post hoc. The p-value in this situation is an interesting and perhaps useful heuristic, but cannot be literally interpreted.
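
The same nested-model test is available from base R's anova() applied to two lm() fits; a sketch that should reproduce F = 14.19 on 2 and 96 degrees of freedom.

reduced <- lm(Reading ~ Verbal, data = mydata)
full    <- lm(Reading ~ Verbal + Absent + Income, data = mydata)
anova(reduced, full)  # F-test of the two coefficients added by the full model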

Relations Among the Variables

Scatter Plot

How do the variables in the model relate to each other? The correlations of response variable Reading with predictor variables Verbal, Absent and Income should be high. The correlations of the predictor variables with each other should be relatively small.

Visually summarize all the relationships among the variables in the model with a scatterplot. Because the model has multiple predictor variables, the different scatterplots are presented in a scatterplot matrix. Each scatterplot in the matrix also contains a non-linear best-fit curve. Express the linear relationships among the numeric variables in the model with their correlations. Plot the scatterplot matrix separately with the lessR function regPlot. Specify option 1 to indicate this specific plot.

regPlot(r, 1)  # 1: scatter plot matrix

Collinearity Analysis

The collinearity analysis assesses the extent to which the predictor variables – Verbal, Absent and Income – linearly depend upon each other, which in the simplest case is a high pairwise correlation. Although collinearity diminishes neither the fit of the model nor forecasting efficacy, it typically indicates an overly complex model. Collinear variables have relatively large standard errors of their estimated slope coefficients, which yield unstable estimates. The result is that any unique effects of collinear variables cannot be statistically disentangled without a very large sample size.

To assess collinearity for predictor variable \(X_j\), regress that predictor onto all of the remaining predictor variables. A high resulting \(R^2_j\) indicates collinearity for that predictor. Usually express this result in terms of the tolerance of the predictor, \(1 - R^2_j\), the proportion of variance for \(X_j\) not due to the remaining predictor variables. Because each \(R^2_j\) should be low, presumably less than 0.8, tolerance should be high, larger than approximately 0.20. An equivalent statistic is the variance inflation factor (VIF), which indicates the extent to which collinearity inflates the variance of the estimated coefficient. VIF is the reciprocal of tolerance, so it should usually be less than approximately 5.
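
Tolerance and VIF follow directly from these definitions; a sketch for the predictor Verbal, assuming mydata as read above.

R2_j <- summary(lm(Verbal ~ Absent + Income, data = mydata))$r.squared
1 - R2_j        # tolerance for Verbal, about 0.600
1 / (1 - R2_j)  # VIF for Verbal, about 1.667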

r$out_collinear
##          Tolerance       VIF 
##   Verbal     0.600     1.667 
##   Absent     0.463     2.158 
##   Income     0.667     1.500

No collinearity is indicated according to the tolerance cutoff of approximately 0.20 given above. The predictor variable with the lowest tolerance is Absent at 0.463.

All Possible Subset Models

Especially when collinearity is present, can a simpler model be more, or at least almost, as effective as the current model?

To investigate, assess the fit for the models that correspond to all possible combinations of the predictor variables. Each row of the analysis defines a different model. A 1 means the predictor variable is in the model, a 0 means it is excluded from the model.

r$out_subsets
##  Verbal Absent Income    R2adj    X's 
##       1      1      0    0.431      2 
##       1      1      1    0.430      3 
##       0      1      0    0.411      1 
##       0      1      1    0.410      2 
##       1      0      1    0.334      2 
##       1      0      0    0.277      1 
##       0      0      1    0.177      1 
##  
## [based on Thomas Lumley's leaps function from the leaps package] 
## 

The goal of this analysis is parsimony, to obtain the most explanatory power, here assessed with \(R^2_{adj}\), with the least number of predictor variables, presumably guided also by the content and meaningfulness of the variables in the model.

Note that this analysis only describes the available data. This subset analysis is a descriptive heuristic that can effectively help eliminate unnecessary predictor variables from your model, but all resulting inferential statistics such as p-values are no longer valid. Ultimately a model revised from the training data requires cross-validation on a new data set.
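
This analysis can be reproduced directly with the leaps package, here through its regsubsets() formula interface; a minimal sketch, assuming mydata as read above, that yields the same seven candidate models.

library(leaps)
all <- regsubsets(Reading ~ Verbal + Absent + Income, data = mydata, nbest = 3)
summary(all)$which   # TRUE/FALSE pattern of included predictors per model
summary(all)$adjr2   # corresponding adjusted R-squared values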

Analysis of Residuals and Influence

Values of Reading fitted by the estimated model do not generally equal the corresponding data values. Which cases (rows of data) contribute the most to this lack of fit? The identification of cases that have a large residual and/or undue influence on the estimation of the model helps detect potential outliers. For each case, in addition to the data values, fitted value and corresponding residual, the analysis provides the following values.

  • residual: Value of the response variable Reading minus its fitted value, \(e = Y_{Reading} - \hat Y_{Reading}\)
  • rstudent: Externally Studentized residual, standardized value of the residual from a model estimated without the case present
  • dffits: Standardized difference between a fitted value with and without the case present
  • cooks: Cook’s Distance, the aggregate influence of the case on all the fitted values with each fitted value calculated with the case deleted
r$out_residuals
## Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance 
##    [sorted by Cook's Distance] 
##    [res.rows = 20, out of 100 rows of data, or do res.rows="all"] 
## ------------------------------------------------------------------- 
##      Verbal Absent Income Reading fitted  resid rstdnt dffits cooks 
##   47     49      0     82      98  77.17  20.83   2.19   0.72  0.12 
##   18     64      8     89      81  64.12  16.88   1.74   0.48  0.06 
##   63     63      2    127      57  77.44 -20.44  -2.09  -0.47  0.05 
##   93     70      0     17      66  79.27 -13.27  -1.38  -0.46  0.05 
##   95     77      5     51      49  71.62 -22.62  -2.30  -0.41  0.04 
##    3     54      9      4      42  57.19 -15.19  -1.55  -0.39  0.04 
##   31     56      6     39      88  64.90  23.10   2.35   0.37  0.03 
##    9     50     12      2      40  50.18 -10.18  -1.05  -0.35  0.03 
##   10     60     12     35      45  53.32  -8.32  -0.87  -0.32  0.03 
##   59     98      0    130     100  88.76  11.24   1.15   0.32  0.03 
##   15     86      4     59      90  75.77  14.23   1.44   0.31  0.02 
##   72     53      3     45      57  70.62 -13.62  -1.37  -0.30  0.02 
##   71     66      4     28      87  70.65  16.35   1.64   0.29  0.02 
##   21     49      8     15      71  58.58  12.42   1.25   0.28  0.02 
##   38     94      0     92     100  86.67  13.33   1.34   0.28  0.02 
##   19     73      1     22      88  78.01   9.99   1.02   0.28  0.02 
##   11     90      0     83     100  85.56  14.44   1.45   0.27  0.02 
##   42     92      0     71      73  85.56 -12.56  -1.26  -0.27  0.02 
##   65     72      2     99      57  78.34 -21.34  -2.15  -0.27  0.02 
##   14     63      9      0      50  58.89  -8.89  -0.91  -0.25  0.02

From this analysis, the five largest Cook's distances are 0.12, 0.06, 0.05, 0.05 and 0.04.

An informal criterion for evaluating the size of Cook's distance is a cutoff value of 1 to indicate a distance too large. No cases have a Cook's distance larger than 1 in this analysis.
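
These diagnostics are also available as base R case-deletion statistics; a sketch assuming an lm() fit of the same model.

fit <- lm(Reading ~ Verbal + Absent + Income, data = mydata)
head(sort(cooks.distance(fit), decreasing = TRUE), 5)  # five largest Cook's distances
rstudent(fit)[47]  # externally Studentized residual for row 47, about 2.19
dffits(fit)[47]    # standardized change in fitted value for row 47, about 0.72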

Prediction Intervals

Ultimately the analysis moves beyond the training sample. Prediction is from new data values of Verbal, Absent and Income, what may be called the prediction sample. Applying these new data values to the estimated model yields the predicted values. For data values from the training sample, the fitted value and predicted value are the same, calculated from the same estimated model, but are different concepts with different interpretations.

Unfortunately, prediction is not perfect. The range of values likely to contain the actual data value for Reading predicted from specific values of Verbal, Absent and Income quantifies the forecasting error. The standard deviation of the residuals, \(s_e\), assumed to be the same for all sets of values of the predictor variables, specifies the modeling error of the fitted values from the training data, error due to imperfections in the model. However, for predictions of future values of Reading, new data are collected. So sampling error of a value on the regression line, indicated with \(s_{\hat Y}\), must also be considered in the assessment of forecasting error. Consideration of both sources of error results in the standard error of forecast.

\[s_f = \sqrt{s^2_e + s^2_{\hat Y}}\]

Unlike modeling error, the amount of sampling error varies depending on the values of the predictor variables, so each row of data has its own value of \(s_f\). Each prediction interval is the margin of error, the t-cutoff multiplied by the corresponding \(s_f\), added and subtracted on either side of \(\hat Y_{Reading}\).
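
Base R's predict() returns the same intervals: residual.scale is \(s_e\), se.fit is \(s_{\hat Y}\), and their combination gives \(s_f\); a sketch assuming an lm() fit of the same model.

fit <- lm(Reading ~ Verbal + Absent + Income, data = mydata)
p <- predict(fit, interval = "prediction", se.fit = TRUE)
head(p$fit)                                  # predicted value with lower and upper bounds
sf <- sqrt(p$residual.scale^2 + p$se.fit^2)  # standard error of forecast per row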

Prediction Intervals from the Training Data

The analysis provides each row of data values, as if they were new data, with a predicted value based on the model estimated from the training data, as well as the standard error of forecast. From these values obtain the lower and upper bounds of the corresponding 95% prediction interval. By default, only the first three, middle three and last three rows of data are presented, sufficient to indicate the ranges of prediction error encountered throughout the ranges of data values.

r$out_predict
## Data, Predicted, Standard Error of Forecast, 95% Prediction Intervals 
##    [sorted by lower bound of prediction interval] 
##    [to see all intervals do pred.rows="all"] 
## ----------------------------------------------------------------- 
##      Verbal Absent Income Reading  pred    sf pi:lwr pi:upr width 
##    9     50     12      2      40 50.18 10.69  28.96  71.41 42.45 
##   10     60     12     35      45 53.32 10.79  31.90  74.75 42.85 
##   70     56     11     41      55 54.75 10.63  33.65  75.86 42.21 
## ... 
##   74     74      3     47      66 74.96 10.30  54.52  95.41 40.89 
##   75     68      3     81      65 74.88 10.25  54.54  95.22 40.68 
##   51     65      2     48      72 75.20 10.34  54.68  95.73 41.06 
## ... 
##   44     93      0    116      80 87.27 10.43  66.58 107.97 41.39 
##   94     96      0    141      83 88.72 10.58  67.72 109.72 42.01 
##   59     98      0    130     100 88.76 10.55  67.81 109.71 41.90

The sizes of the prediction intervals for the range of data found in the input data table vary from a minimum of 40.68 for Row 75 to a maximum of 42.85 for Row 10. Plot the scatterplot with prediction intervals separately with the lessR function regPlot. Specify option 1 to indicate this specific plot.

Prediction Intervals from New Data

New data values from which to obtain a forecast, different from the training data, can be entered with the options X1.new, X2.new, up to X6.new, where each option name refers to the position of the corresponding predictor variable in the specification of the regression model. Any number of values can be specified for each predictor variable. Suppose, for example, that there are two values of interest for each predictor variable from which to make a forecast, listed below.

Verbal: 55, 89
Absent: 0, 10
Income: 50, 97

Re-run the analysis to obtain the prediction intervals with these specified values.

r <- reg(Reading ~ Verbal + Absent + Income,
         X1.new=c(55,89), X2.new=c(0,10), X3.new=c(50,97),
         graphics = FALSE)

The new data values are specified for each variable separately, but a row of data consists of data values for all the predictor values. Accordingly, calculate a prediction interval for each combination of the specified new values for each predictor variable.

r$out_predict
## Data, Predicted, Standard Error of Forecast, 95% Prediction Intervals 
##    [sorted by lower bound of prediction interval] 
## ---------------------------------------------------------------- 
##     Verbal Absent Income Reading  pred    sf pi:lwr pi:upr width 
##   3  55.00  10.00  50.00         56.89 10.54  35.97  77.81 41.84 
##   7  55.00  10.00  97.00         58.47 10.80  37.02  79.91 42.88 
##   4  89.00  10.00  50.00         63.82 11.18  41.63  86.01 44.37 
##   8  89.00  10.00  97.00         65.39 11.42  42.72  88.07 45.35 
##   1  55.00   0.00  50.00         77.32 10.65  56.18  98.45 42.27 
##   5  55.00   0.00  97.00         78.89 10.53  57.98  99.80 41.82 
##   2  89.00   0.00  50.00         84.25 10.47  63.47 105.03 41.56 
##   6  89.00   0.00  97.00         85.82 10.35  65.28 106.36 41.08

The sizes of the prediction intervals for the newly specified values vary from a minimum of 41.08 for Row 6 to a maximum of 45.35 for Row 8. The rows in the output display, however, are re-ordered, sorted by the lower bound of the prediction interval.
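
In base R, expand.grid() forms the same eight combinations of new values, and predict() produces the corresponding intervals; a sketch assuming an lm() fit of the same model.

fit <- lm(Reading ~ Verbal + Absent + Income, data = mydata)
new <- expand.grid(Verbal = c(55, 89), Absent = c(0, 10), Income = c(50, 97))
predict(fit, newdata = new, interval = "prediction")  # eight 95% prediction intervals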

Model Validity

The residuals should be independent, normal random variables with a mean of zero and constant variance.

Distribution of Residuals

For the inferential analyses to be valid, the residuals should be normally distributed. Violation of normality does not bias the estimates, but it does render the inferential tests invalid. Plot the distribution of residuals separately with the lessR function regPlot. Specify option 2 to indicate this specific plot.

regPlot(r, 2)  # 2: distribution of residuals

Fitted Values vs Residuals

The residuals should represent random variation, free of any pattern or structure. They should satisfy the homoscedasticity assumption: randomly scattered about 0, with approximately the same level of variability across the range of the fitted values, within a horizontal band around the zero-line. Otherwise they demonstrate heteroscedasticity. Plot the scatterplot of fitted values with residuals separately with the lessR function regPlot. Specify option 3 to indicate this specific plot.

regPlot(r, 3)  # 3: scatter plot of fitted with residuals
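
Base R offers quick versions of both diagnostic plots; a sketch assuming an lm() fit of the same model.

fit <- lm(Reading ~ Verbal + Absent + Income, data = mydata)
hist(residuals(fit))               # distribution of the residuals
plot(fitted(fit), residuals(fit))  # fitted values vs residuals
abline(h = 0, lty = "dashed")      # zero line for reference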