HOW TO INTERPRET REGRESSION RESULTS
THE SPSS OUTPUT FOR REGRESSION ANALYSIS
Adjusted R-squared
Adjusted R-squared is a statistical measure that is closely related to the more commonly known R-squared (R²) value in the context of linear regression analysis. While R-squared measures the proportion of the variance in the dependent variable (the variable being predicted) that is explained by the independent variables (the predictors) in a regression model, Adjusted R-squared takes into account the number of independent variables used in the model, providing a more conservative and useful assessment of model fit.
R-squared (R²): R-squared is a measure of how well the independent variables in a regression model explain the variability in the dependent variable. It ranges from 0 to 1, with higher values indicating a better fit. Specifically, R-squared represents the proportion of the total variation in the dependent variable that is explained by the model. However, as you add more independent variables to the model, R-squared tends to increase, even if the additional variables do not significantly improve the model’s predictive power. This can lead to overfitting, where the model fits the training data extremely well but may not generalize well to new data.
Adjusted R-squared: Adjusted R-squared addresses the issue of overfitting by penalizing the inclusion of unnecessary independent variables in the model. It takes into account the number of predictors in the model and adjusts R-squared accordingly.
The formula for Adjusted R-squared is:
Adjusted R² = 1 – [(1 – R²) * (n – 1) / (n – k – 1)]
where:
- R² is the regular R-squared.
- n is the number of data points (observations).
- k is the number of independent variables in the model.
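For instance, taking the R² of 0.762 reported in the regression example below, and assuming purely for illustration a sample of n = 100 observations with k = 1 predictor:
Adjusted R² = 1 – [(1 – 0.762) * (100 – 1) / (100 – 1 – 1)] = 1 – (0.238 * 99 / 98) ≈ 0.760.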
Adjusted R-squared will always be lower than R-squared when you have multiple independent variables, and it tends to decrease as you add irrelevant or redundant variables to the model. It provides a more realistic assessment of the model’s fit by accounting for model complexity.
In summary, while R-squared tells you how well your regression model fits the data, Adjusted R-squared helps you determine whether the improvement in model fit achieved by adding more independent variables is justified by the increased complexity. It is a useful tool for model selection and comparison, as it encourages the use of simpler models that explain the data adequately without unnecessary complexity.
SPSS Statistics will generate quite a few tables of output for a linear regression. In this section, we show you only the three main tables required to understand your results from the linear regression procedure, assuming that no assumptions have been violated. A complete explanation of the output you have to interpret when checking your data for the six assumptions required to carry out linear regression is provided in our enhanced guide. This includes relevant scatterplots, histogram (with superimposed normal curve), Normal P-P Plot, casewise diagnostics and the Durbin-Watson statistic. Below, we focus on the results for the linear regression analysis only.
The first table of interest is the Model Summary table, as shown below:
[SPSS output: Model Summary table]
This table provides the R and R² values. The R value represents the simple correlation and is 0.873 (the “R” column), which indicates a high degree of correlation. The R² value (the “R Square” column) indicates how much of the total variation in the dependent variable, Price, can be explained by the independent variable, Income. In this case, 76.2% can be explained, which is very large.
The next table is the ANOVA table, which reports how well the regression equation fits the data (i.e., predicts the dependent variable) and is shown below:
[SPSS output: ANOVA table]
This table indicates that the regression model predicts the dependent variable significantly well. How do we know this? Look at the “Regression” row and go to the “Sig.” column. This indicates the statistical significance of the regression model that was run. Here, p < 0.0005, which is less than 0.05, and indicates that, overall, the regression model statistically significantly predicts the outcome variable (i.e., it is a good fit for the data).
The Coefficients table provides us with the necessary information to predict price from income, as well as determine whether income contributes statistically significantly to the model (by looking at the “Sig.” column). Furthermore, we can use the values in the “B” column under the “Unstandardized Coefficients” column, as shown below:
[SPSS output: Coefficients table]
to present the regression equation as:
Price = 8287 + 0.564(Income)
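For illustration, a hypothetical case with an Income of 30,000 would therefore have a predicted Price of 8287 + 0.564 × 30,000 = 25,207.
If you prefer to work with syntax, a minimal sketch of the command that produces the Model Summary, ANOVA and Coefficients tables discussed above is as follows (this assumes the dependent and independent variables are named Price and Income in the data file):
REGRESSION
  /DEPENDENT Price
  /METHOD=ENTER Income.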
CORRELATION ANALYSIS
Pearson Correlation
The bivariate Pearson Correlation produces a sample correlation coefficient, r, which measures the strength and direction of linear relationships between pairs of continuous variables. By extension, the Pearson Correlation evaluates whether there is statistical evidence for a linear relationship among the same pairs of variables in the population, represented by a population correlation coefficient, ρ (“rho”). The Pearson Correlation is a parametric measure.
This measure is also known as:
- Pearson’s correlation
- Pearson product-moment correlation (PPMC)
Common Uses
The bivariate Pearson Correlation is commonly used to measure the following:
- Correlations among pairs of variables
- Correlations within and between sets of variables
The bivariate Pearson correlation indicates the following:
- Whether a statistically significant linear relationship exists between two continuous variables
- The strength of a linear relationship (i.e., how close the relationship is to being a perfectly straight line)
- The direction of a linear relationship (increasing or decreasing)
Note: The bivariate Pearson Correlation cannot address non-linear relationships or relationships among categorical variables. If you wish to understand relationships that involve categorical variables and/or non-linear relationships, you will need to choose another measure of association.
Note: The bivariate Pearson Correlation only reveals associations among continuous variables. The bivariate Pearson Correlation does not provide any inferences about causation, no matter how large the correlation coefficient is.
Data Requirements
To use Pearson correlation, your data must meet the following requirements:
- Two or more continuous variables (i.e., interval or ratio level)
- Cases must have non-missing values on both variables
- Linear relationship between the variables
- Independent cases (i.e., independence of observations)
  - There is no relationship between the values of variables between cases. This means that:
    - the values for all variables across cases are unrelated
    - for any case, the value for any variable cannot influence the value of any variable for other cases
    - no case can influence another case on any variable
  - The bivariate Pearson correlation coefficient and corresponding significance test are not robust when independence is violated.
- Bivariate normality
  - Each pair of variables is bivariately normally distributed
  - Each pair of variables is bivariately normally distributed at all levels of the other variable(s)
  - This assumption ensures that the variables are linearly related; violations of this assumption may indicate that non-linear relationships among variables exist. Linearity can be assessed visually using a scatterplot of the data.
- Random sample of data from the population
- No outliers
Hypotheses
The null hypothesis (H0) and alternative hypothesis (H1) of the significance test for correlation can be expressed in the following ways, depending on whether a one-tailed or two-tailed test is requested:
Two-tailed significance test:
H0: ρ = 0 (“the population correlation coefficient is 0; there is no association”)
H1: ρ ≠ 0 (“the population correlation coefficient is not 0; a nonzero correlation could exist”)
One-tailed significance test:
H0: ρ = 0 (“the population correlation coefficient is 0; there is no association”)
H1: ρ > 0 (“the population correlation coefficient is greater than 0; a positive correlation could exist”)
OR
H1: ρ < 0 (“the population correlation coefficient is less than 0; a negative correlation could exist”)
where ρ is the population correlation coefficient.
The sample correlation coefficient between two variables x and y is denoted r or rxy, and can be computed as:
r = Σ(xi – x̄)(yi – ȳ) / √[Σ(xi – x̄)² · Σ(yi – ȳ)²]
where x̄ and ȳ are the sample means of x and y, and the sums run over the n paired observations. Equivalently, r is the sample covariance of x and y divided by the product of their sample standard deviations.
Run a Bivariate Pearson Correlation
To run a bivariate Pearson Correlation in SPSS, click Analyze > Correlate > Bivariate.
The Bivariate Correlations window opens, where you will specify the variables to be used in the analysis. All of the variables in your dataset appear in the list on the left side. To select variables for the analysis, select the variables in the list on the left and click the blue arrow button to move them to the right, in the Variables field.
A  Variables: The variables to be used in the bivariate Pearson Correlation. You must select at least two continuous variables, but may select more than two. The test will produce correlation coefficients for each pair of variables in this list.
B  Correlation Coefficients: There are multiple types of correlation coefficients. By default, Pearson is selected. Selecting Pearson will produce the test statistics for a bivariate Pearson Correlation.
C  Test of Significance: Click Two-tailed or One-tailed, depending on your desired significance test. SPSS uses a two-tailed test by default.
D  Flag significant correlations: Checking this option will include asterisks (**) next to statistically significant correlations in the output. By default, SPSS marks statistical significance at the alpha = 0.05 and alpha = 0.01 levels, but not at the alpha = 0.001 level (which is treated as alpha = 0.01).
E  Options: Clicking Options will open a window where you can specify which Statistics to include (i.e., Means and standard deviations, Cross-product deviations and covariances) and how to address Missing Values (i.e., Exclude cases pairwise or Exclude cases listwise). Note that the pairwise/listwise setting does not affect your computations if you are only entering two variables, but can make a very large difference if you are entering three or more variables into the correlation procedure.
Example: Understanding the linear association between weight and height
PROBLEM STATEMENT
Perhaps you would like to test whether there is a statistically significant linear relationship between two continuous variables, weight and height (and by extension, infer whether the association is significant in the population). You can use a bivariate Pearson Correlation to test whether there is a statistically significant linear relationship between height and weight, and to determine the strength and direction of the association.
BEFORE THE TEST
In the sample data, we will use two variables: “Height” and “Weight.” The variable “Height” is a continuous measure of height in inches and exhibits a range of values from 55.00 to 84.41 (Analyze > Descriptive Statistics > Descriptives). The variable “Weight” is a continuous measure of weight in pounds and exhibits a range of values from 101.71 to 350.07.
Before we look at the Pearson correlations, we should look at the scatterplots of our variables to get an idea of what to expect. In particular, we need to determine if it’s reasonable to assume that our variables have linear relationships. Click Graphs > Legacy Dialogs > Scatter/Dot. In the Scatter/Dot window, click Simple Scatter, then click Define. Move variable Height to the X Axis box, and move variable Weight to the Y Axis box. When finished, click OK.
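The same scatterplot can also be produced with syntax; a minimal sketch, assuming the variables are named Height and Weight in the data file, is:
GRAPH
  /SCATTERPLOT(BIVAR)=Height WITH Weight
  /MISSING=LISTWISE.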
To add a linear fit like the one depicted, double-click on the plot in the Output Viewer to open the Chart Editor. Click Elements > Fit Line at Total. In the Properties window, make sure the Fit Method is set to Linear, then click Apply. (Notice that adding the linear regression trend line will also add the R-squared value in the margin of the plot. If we take the square root of this number, it should match the value of the Pearson correlation we obtain.)
From the scatterplot, we can see that as height increases, weight also tends to increase. There does appear to be some linear relationship.
RUNNING THE TEST
To run the bivariate Pearson Correlation, click Analyze > Correlate > Bivariate. Select the variables Height and Weight and move them to the Variables box. In the Correlation Coefficients area, select Pearson. In the Test of Significance area, select your desired significance test, two-tailed or one-tailed. We will select a two-tailed significance test in this example. Check the box next to Flag significant correlations.
Click OK to run the bivariate Pearson Correlation. Output for the analysis will display in the Output Viewer.
Syntax
CORRELATIONS
  /VARIABLES=Weight Height
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.
OUTPUT
Tables
The results will display the correlations in a table, labeled Correlations.
A Correlation of Height with itself (r=1), and the number of non-missing observations for height (n=408).
B Correlation of height and weight (r=0.513), based on n=354 observations with pairwise nonmissing values.
C Correlation of height and weight (r=0.513), based on n=354 observations with pairwise nonmissing values.
D Correlation of weight with itself (r=1), and the number of nonmissing observations for weight (n=376).
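Since the output screenshot is not reproduced here, the following is an approximate plain-text reconstruction of the Correlations table based on the values described above (A and D are the diagonal cells, B and C the Height–Weight cells):
Correlations
                                Height      Weight
Height   Pearson Correlation    1           .513**
         Sig. (2-tailed)                    < .001
         N                      408         354
Weight   Pearson Correlation    .513**      1
         Sig. (2-tailed)        < .001
         N                      354         376
** Correlation is significant at the 0.01 level (2-tailed).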
The important cells we want to look at are either B or C. (Cells B and C are identical, because they include information about the same pair of variables.) Cells B and C contain the correlation coefficient for the correlation between height and weight, its p-value, and the number of complete pairwise observations that the calculation was based on.
The correlations in the main diagonal (cells A and D) are all equal to 1. This is because a variable is always perfectly correlated with itself. Notice, however, that the sample sizes are different in cell A (n=408) versus cell D (n=376). This is because of missing data — there are more missing observations for variable Weight than there are for variable Height.
If you have opted to flag significant correlations, SPSS will mark a 0.05 significance level with one asterisk (*) and a 0.01 significance level with two asterisks (**). In cell B (repeated in cell C), we can see that the Pearson correlation coefficient for height and weight is .513, which is significant (p < .001 for a two-tailed test), based on 354 complete observations (i.e., cases with nonmissing values for both height and weight).
DECISION AND CONCLUSIONS
Based on the results, we can state the following:
- Weight and height have a statistically significant linear relationship (r=.513, p < .001).
- The direction of the relationship is positive (i.e., height and weight are positively correlated), meaning that these variables tend to increase together (i.e., greater height is associated with greater weight).
- The magnitude, or strength, of the association is approximately moderate (.3 < | r | < .5).
HOW TO CARRY OUT DESCRIPTIVE STATISTICS IN SPSS
Variable View:
Data View:
The data can be summarized via the ‘Descriptive Statistics’ part of the ‘Analyze’ menu. Let’s explore: click Analyze > Descriptive Statistics > Frequencies.
This will open up a new dialogue box:
We can use this to make frequency tables of the variables. Statistics can be calculated and graphs can be drawn. Let’s make frequency tables for the categorical data: Passenger Class, Passenger Gender and Passenger Survived.
Ensure that ‘Display frequency tables’ is checked. Click on one of the three categories required in the leftmost box and then press the arrow button in the middle to move it to the right-hand box. Do the same for the other two categories.
Click on the ‘Charts’ button and choose the option to draw a bar chart (or a pie chart if you prefer). Press ‘Continue’, then ‘OK’. Output will now be written to the output window.
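The equivalent syntax is roughly as follows (a sketch only; it assumes the three variables are named Class, Gender and Survived in the data file, so substitute your own variable names):
FREQUENCIES VARIABLES=Class Gender Survived
  /BARCHART FREQ
  /ORDER=ANALYSIS.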
The frequency tables summarising the data are:
You should also find the charts that were selected.
The descriptive statistics feature of SPSS can also give summary statistics such as the mean, median and standard deviation. We have some scale data in the form of the passenger’s age. Go back to Analyze > Descriptive Statistics > Frequencies and return the previously moved categories back to the left box. Move the ‘Passenger’s Age’ variable over to the right box.
Choose the measures that you would like to get by clicking the check boxes.
Click ‘Continue’. The frequency table is not needed for these data, so uncheck ‘Display frequency tables’ and uncheck the chart options inside the ‘Charts’ menu.
When ready, click ‘OK’.
The options chosen above were Mean, Median, Mode, Std. Deviation, Minimum and Maximum.
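The equivalent syntax is sketched below (assuming the age variable is named Age in the data file; /FORMAT=NOTABLE suppresses the frequency table):
FREQUENCIES VARIABLES=Age
  /FORMAT=NOTABLE
  /STATISTICS=MEAN MEDIAN MODE STDDEV MINIMUM MAXIMUM
  /ORDER=ANALYSIS.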
Continuous variables can also be analysed using the ‘Descriptives’ menu in SPSS. Go to Analyze > Descriptive Statistics > Descriptives.
First move the variable ‘Passenger Age’ to the ‘Variable(s)’ section.
Click on the ‘Options’ button. We can choose the statistics we want computed; here we select the mean, standard deviation, skewness and kurtosis.
The mean is a measure of average (the sum of the values divided by the number of values). The standard deviation measures the spread of the data and can be used to describe normal distributions. Skewness is a measure of how symmetrical the distribution is: values close to 0 indicate symmetry, positive values indicate some high-valued outliers (a longer right tail) and negative values indicate some low-valued outliers (a longer left tail).
[Skewness illustration: image from www.managedfuturesinvesting.com]
Kurtosis values refer to how peaked the distribution is; a normal distribution has a value of 0. Negative values mean that the distribution is relatively flat, with cases spread more evenly across the range, while positive values mean that the distribution is more sharply clustered in the centre.
Choose ‘Continue’ and then ‘OK’.
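As before, the equivalent syntax is sketched below (again assuming the age variable is named Age in the data file):
DESCRIPTIVES VARIABLES=Age
  /STATISTICS=MEAN STDDEV SKEWNESS KURTOSIS.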
The Descriptives output provides the requested information about the ages of the passengers on the Titanic.