
Chapter 8 - Linear Regression

Learning Outcomes

  • Define the explanatory variable as the independent variable (predictor), and the response variable as the dependent variable (predicted).
  • Plot the explanatory variable ($x$) on the x-axis and the response variable ($y$) on the y-axis, and fit a linear regression model $y = \beta_0 + \beta_1 x$, where $\beta_0$ is the intercept and $\beta_1$ is the slope.
    • Note that the point estimates (estimated from observed data) for $\beta_0$ and $\beta_1$ are $b_0$ and $b_1$, respectively.
  • When describing the association between two numerical variables, evaluate
    • direction: positive (as x increases, y increases) or negative (as x increases, y decreases)
    • form: linear or not
    • strength: determined by the scatter around the underlying relationship
  • Define correlation as the linear association between two numerical variables.
    • Note that a relationship that is nonlinear is simply called an association.
  • Note that the correlation coefficient (r, also called Pearson’s r) has the following properties:
    • the magnitude (absolute value) of the correlation coefficient measures the strength of the linear association between two numerical variables
    • the sign of the correlation coefficient indicates the direction of association
    • the correlation coefficient is always between -1 and 1, inclusive, with -1 indicating perfect negative linear association, +1 indicating perfect positive linear association, and 0 indicating no linear relationship
    • the correlation coefficient is unitless
    • since the correlation coefficient is unitless, it is not affected by changes in the center or scale of either variable (such as unit conversions)
    • the correlation of X with Y is the same as of Y with X
    • the correlation coefficient is sensitive to outliers
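The properties listed above can be checked numerically. The following is a minimal sketch in plain Python; the helper name `pearson_r` and the data are made up for illustration, and the rescaling uses a positive scale factor (as in a unit conversion).

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, made up for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)

# Symmetry: the correlation of x with y equals that of y with x.
r_swapped = pearson_r(y, x)

# Unitless: shifting and rescaling x (positive scale) leaves r unchanged.
x_rescaled = [2.54 * a + 10 for a in x]
r_rescaled = pearson_r(x_rescaled, y)

# Sensitivity to outliers: a single extreme point flips the sign here.
r_outlier = pearson_r(x + [6], y + [-10])

print(round(r, 4), round(r_swapped, 4), round(r_rescaled, 4), r_outlier < 0)
```

With this data, one added point is enough to turn a fairly strong positive correlation into a negative one, which is why outliers deserve a look before r is interpreted.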
  • Recall that correlation does not imply causation.
  • Define residual ($e$) as the difference between the observed ($y$) and predicted ($\hat{y}$) values of the response variable: $e_i = y_i - \hat{y}_i$.
  • Define the least squares line as the line that minimizes the sum of the squared residuals, and list the conditions necessary for fitting such a line:
    1. linearity
    2. nearly normal residuals
    3. constant variability
  • Define an indicator variable as a binary explanatory variable (with two levels).
  • Calculate the estimate for the slope ($b_1$) as $b_1 = r \frac{s_y}{s_x}$, where $r$ is the correlation coefficient, $s_y$ is the standard deviation of the response variable, and $s_x$ is the standard deviation of the explanatory variable.
  • Interpret the slope as
    • “For each unit increase in $x$, we would expect $y$ to increase/decrease on average by $|b_1|$ units” when $x$ is numerical.
    • “The average increase/decrease in the response variable when going from the baseline level to the other level of the explanatory variable is $|b_1|$” when $x$ is categorical.
    • Note that whether the response variable increases or decreases is determined by the sign of $b_1$.
  • Note that the least squares line always passes through the average of the explanatory and response variables, $(\bar{x}, \bar{y})$.
  • Use the above property to calculate the estimate for the intercept ($b_0$) as $b_0 = \bar{y} - b_1 \bar{x}$, where $b_1$ is the slope, $\bar{y}$ is the average of the response variable, and $\bar{x}$ is the average of the explanatory variable.
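As a worked illustration of the slope and intercept formulas, the sketch below computes b1 = r * s_y / s_x and b0 = y_bar - b1 * x_bar on a small made-up data set, then checks that the fitted line passes through (x_bar, y_bar) and that nudging either estimate increases the sum of squared residuals. The function names and the data are assumptions for illustration only.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def fit_line(x, y):
    """Return (b0, b1) using b1 = r * s_y / s_x and b0 = y_bar - b1 * x_bar."""
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = sample_sd(x), sample_sd(y)
    cross = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    r = cross / ((len(x) - 1) * s_x * s_y)
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sum_sq_residuals(x, y, b0, b1):
    """Sum of squared residuals e_i = y_i - (b0 + b1 * x_i)."""
    return sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))

# Hypothetical data, made up for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
print(round(b0, 3), round(b1, 3))  # 2.2 0.6

# The line passes through the point of averages.
through_mean = b0 + b1 * mean(x)

# Any nearby line has a larger sum of squared residuals.
sse_best = sum_sq_residuals(x, y, b0, b1)
sse_shifted = sum_sq_residuals(x, y, b0 + 0.1, b1)
sse_tilted = sum_sq_residuals(x, y, b0, b1 + 0.1)
```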
  • Interpret the intercept as
    • “When $x = 0$, we would expect $y$ to equal, on average, $b_0$.” when $x$ is numerical.
    • “The expected average value of the response variable for the reference level of the explanatory variable is $b_0$.” when $x$ is categorical.
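The categorical interpretations of the slope and intercept can be seen directly with an indicator variable coded 0 for the reference level and 1 for the other level. This is a small sketch with made-up data; the slope is computed as Sxy / Sxx, which is algebraically the same as r * s_y / s_x.

```python
def fit_line(x, y):
    """Least squares (b0, b1); b1 = Sxy / Sxx is equivalent to r * s_y / s_x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data: x = 0 marks the reference level, x = 1 the other level.
x = [0, 0, 0, 1, 1, 1]
y = [10, 12, 14, 15, 17, 19]

b0, b1 = fit_line(x, y)

# Group averages of the response, computed directly.
baseline_mean = sum(v for g, v in zip(x, y) if g == 0) / x.count(0)
other_mean = sum(v for g, v in zip(x, y) if g == 1) / x.count(1)

print(b0, baseline_mean)             # intercept vs. reference-level average
print(b1, other_mean - baseline_mean)  # slope vs. difference in averages
```

Here the intercept matches the average response at the reference level, and the slope matches the difference between the two group averages.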
  • Predict the value of the response variable for a given value of the explanatory variable, $x$, by plugging $x$ into the linear model: $\hat{y} = b_0 + b_1 x$
    • Only predict for values of x that are in the range of the observed data.
    • Do not extrapolate beyond the range of the data, unless you are confident that the linear pattern continues.
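One way to make the extrapolation warning concrete is to have the prediction step refuse values of x outside the observed range unless explicitly told otherwise. The sketch below is one possible design, not a standard API; the function `predict`, its `allow_extrapolation` flag, and the coefficient values are hypothetical.

```python
def predict(b0, b1, x_new, x_observed, allow_extrapolation=False):
    """Return y_hat = b0 + b1 * x_new, guarding against extrapolation."""
    lo, hi = min(x_observed), max(x_observed)
    if not allow_extrapolation and not (lo <= x_new <= hi):
        raise ValueError(
            f"x = {x_new} is outside the observed range [{lo}, {hi}]"
        )
    return b0 + b1 * x_new

# Hypothetical fitted coefficients and observed x values.
b0, b1 = 2.2, 0.6
x_observed = [1, 2, 3, 4, 5]

y_hat = predict(b0, b1, 3.5, x_observed)  # inside the range: allowed

try:
    predict(b0, b1, 10, x_observed)        # outside the range: rejected
    extrapolation_blocked = False
except ValueError:
    extrapolation_blocked = True

# Opting in is possible when there is reason to trust the linear pattern.
y_far = predict(b0, b1, 10, x_observed, allow_extrapolation=True)
```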
  • Define $R^2$ as the percentage of the variability in the response variable explained by the explanatory variable.
    • For a good model, we would like this number to be as close to 100% as possible.
    • This value is calculated as the square of the correlation coefficient, and is between 0 and 1, inclusive.
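The two descriptions of R^2 (the square of the correlation coefficient, and the share of variability explained) can be compared numerically. In this sketch, "variability explained" is computed as 1 - SSE / SST, where SSE is the sum of squared residuals and SST is the total sum of squares of y around its mean; these names and the data are assumptions for illustration.

```python
def r_squared_two_ways(x, y):
    """Return (r squared, 1 - SSE / SST) for a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)  # SST: total variability in y
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))

    # Route 1: square of the correlation coefficient.
    from_r = sxy ** 2 / (sxx * syy)

    # Route 2: proportion of variability not left in the residuals.
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    from_ss = 1 - sse / syy
    return from_r, from_ss

# Hypothetical data, made up for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

from_r, from_ss = r_squared_two_ways(x, y)
print(round(from_r, 3), round(from_ss, 3))  # 0.6 0.6
```

For this data, about 60% of the variability in y is explained by x.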
  • Define a leverage point as a point that lies away from the center of the data in the horizontal direction.
  • Define an influential point as a point that influences (changes) the slope of the regression line.
    • This is usually a leverage point that is away from the trajectory of the rest of the data.
  • Do not remove outliers from an analysis without good reason.
  • Be cautious about using a categorical explanatory variable when one of the levels has very few observations, as these may act as influential points.
  • Determine whether an explanatory variable is a significant predictor for the response variable using the t-test and the associated p-value in the regression output.
  • Set the null hypothesis for testing the significance of the predictor as $H_0: \beta_1 = 0$, and recognize that the standard software output yields the p-value for the two-sided alternative hypothesis.
    • Note that $\beta_1 = 0$ means the regression line is horizontal, hence suggesting that there is no relationship between the explanatory and the response variables.
  • Calculate the T score for the hypothesis test as $T_{df} = \frac{b_1 - \text{null value}}{SE_{b_1}}$ with $df = n - 2$.
    • Note that the T score has $n - 2$ degrees of freedom since we lose one degree of freedom for each parameter we estimate, and in this case we estimate the intercept and the slope.
  • Note that a hypothesis test for the intercept is often irrelevant, since x = 0 usually lies outside the range of the data, which makes the intercept an extrapolation.
  • Calculate a confidence interval for the slope as $b_1 \pm t^\star_{df} SE_{b_1}$, where $df = n - 2$ and $t^\star_{df}$ is the critical score associated with the given confidence level at the desired degrees of freedom.
    • Note that the standard error of the slope estimate, $SE_{b_1}$, can be found on the regression output.
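The T score and confidence interval for the slope can be sketched as follows. In practice SE_b1 is read off the regression output; here it is computed with the standard formula sqrt((SSE / (n - 2)) / Sxx) so that the example is self-contained. The critical value t_star is passed in by hand (3.182 is the t-table value for 95% confidence with df = 3), and the function name and data are hypothetical.

```python
import math

def slope_inference(x, y, t_star, null_value=0.0):
    """Return (b1, se_b1, t_score, ci_low, ci_high) for the regression slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # Residual variability, with df = n - 2 (intercept and slope estimated).
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    df = n - 2
    se_b1 = math.sqrt((sse / df) / sxx)

    # T score: (point estimate - null value) / standard error.
    t_score = (b1 - null_value) / se_b1

    # Confidence interval: point estimate +/- critical value * standard error.
    margin = t_star * se_b1
    return b1, se_b1, t_score, b1 - margin, b1 + margin

# Hypothetical data with n = 5, so df = 3.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b1, se_b1, t_score, ci_low, ci_high = slope_inference(x, y, t_star=3.182)
print(round(b1, 3), round(se_b1, 3), round(t_score, 3))
print(round(ci_low, 3), round(ci_high, 3))
```

With this small data set, |T| is below the critical value and the 95% interval contains 0, so the two approaches agree that the data do not provide convincing evidence of a nonzero slope at the 5% level.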

Supplemental Readings

Videos