CROSS TABULATION:
Cross tabulation helps us understand whether and how one categorical variable is related to another.
For example
| Age | Education | Gender | District |
| 22 | Primary | Male | Tororo |
| 22 | Secondary | Male | Tororo |
| 24 | Tertiary | Female | Iganga |
| 31 | Primary | Male | Tororo |
| 26 | Primary | Female | Jinja |
| 26 | Secondary | Female | Tororo |
| 31 | University | Male | Iganga |
| 26 | Secondary | Male | Jinja |
| 22 | Primary | Male | Jinja |
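Outside SPSS, the same cross tabulation can be produced with a few lines of code. Below is a minimal Python sketch using pandas, based on the Education and District columns of the table above (the DataFrame construction is just one way of entering the data):

```python
import pandas as pd

# The nine survey records from the table above.
data = pd.DataFrame({
    "Education": ["Primary", "Secondary", "Tertiary", "Primary", "Primary",
                  "Secondary", "University", "Secondary", "Primary"],
    "District":  ["Tororo", "Tororo", "Iganga", "Tororo", "Jinja",
                  "Tororo", "Iganga", "Jinja", "Jinja"],
})

# Cross tabulation: counts of respondents in each Education x District cell.
table = pd.crosstab(data["Education"], data["District"])
print(table)
```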
STEPS
- Analyze
- Descriptive Statistics
- Crosstabs
- Select variables for rows
- Select variables for columns
CORRELATION ANALYSIS
Interpretation
The correlation coefficient is -0.259, which implies a weak negative correlation between highest year of school completed and age of the respondent. The correlation is significant at the 1% level of significance since the p-value (0.000) < 0.01; thus the null hypothesis is rejected and we conclude that there is a significant relationship between highest year of school completed and age of the respondent.
- Spearman: deals with ranked (ordinal) data.
- Kendall's tau: used for categorical variables with some order, such as education level.
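As an illustration, both coefficients can be computed in Python with scipy. The education codes below (1 = Primary, 2 = Secondary, 3 = Tertiary, 4 = University) are an assumed numeric coding of the table shown earlier, paired with the ages from that table:

```python
from scipy import stats

# Education coded 1-4 (assumed coding) and age, from the earlier table.
education = [1, 2, 3, 1, 1, 2, 4, 2, 1]
age       = [22, 22, 24, 31, 26, 26, 31, 26, 22]

rho, p_rho = stats.spearmanr(education, age)    # rank-based correlation
tau, p_tau = stats.kendalltau(education, age)   # ordinal (ordered categorical) correlation
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.3f})")
print(f"Kendall tau  = {tau:.3f} (p = {p_tau:.3f})")
```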
Example
Using the confidence interval: if the confidence interval does not include the hypothesized value of the population parameter, the null hypothesis is rejected; otherwise it is not rejected. For example, if a 95% confidence interval for a mean were (48.2, 55.6) and the hypothesized value were 52, the null hypothesis would not be rejected because 52 lies inside the interval.
Chi-square test
It is a test of dependence or association between two variables, which must be categorical, such as marital status, education level, religion, etc.
Example
Does religion of the respondent depend on marital status?
Procedure
Analyze >> Descriptive Statistics >> Crosstabs
Select one variable for a row and another for a column.
Click Statistics >> Chi-square >> Continue
Click Cells >> row and column percentages >> Continue
Press OK
NOTE:
First state the Hypothesis
Ho: religion of the respondent does not depend on marital status.
Ha: Religion of the respondent depends on marital status.
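For comparison, the same chi-square test of independence can be run in Python with scipy. The contingency table below is hypothetical, invented purely to show the mechanics:

```python
from scipy.stats import chi2_contingency

# Hypothetical counts: religion (rows) by marital status (columns).
observed = [[30, 15, 5],    # Religion A: married, single, divorced
            [20, 25, 10],   # Religion B
            [10, 10, 15]]   # Religion C

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, df = {dof}, p-value = {p:.4f}")
# If p < 0.05, reject Ho and conclude religion depends on marital status.
```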
SAMPLE T-TESTS
This is used for testing means.
SAMPLE TESTS IN SPSS
- One sample t-tests.
- Paired sample t-test.
- Independent sample t-test.
- ANOVA test.
Please always remember that;
- One sample t-test is used to compare the mean of one variable with a target value.
- Paired sample t-test is used to compare the means of two variables for a single group.
- Independent sample t-test is used to compare means of two distinct groups of cases, e.g. alive or dead, on or off, men or women, etc.
- ANOVA is used for testing several means.
One sample t-test
A one sample t-test is performed when you want to determine if the mean value of a target variable differs from a hypothesized value.
Examples
- A researcher might want to test whether the average age of respondents differs from 52.
- A researcher might want to test whether the average mark of students differs from 75.
Assumptions for the one sample t-test
- The dependent variable is normally distributed within the population.
- The data are independent.
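As a sketch of the first example above (testing whether the average age differs from 52), here is how a one sample t-test could be run in Python with scipy; the ages are invented for illustration:

```python
from scipy import stats

# Hypothetical ages of respondents (values are assumptions).
ages = [48, 55, 51, 60, 47, 52, 58, 49, 54, 61]

# Test whether the mean age differs from the hypothesized value 52.
t_stat, p_value = stats.ttest_1samp(ages, popmean=52)
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
# If p < 0.05, reject Ho: the mean age differs from 52.
```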
SPSS
Example 1
A study of physical strength, measured in kilograms, on 7 subjects before and after a specified training period gave the following results.
| Subject | Before | After | diff |
| 1 | 100 | 115 | |
| 2 | 110 | 125 | |
| 3 | 90 | 105 | |
| 4 | 110 | 130 | |
| 5 | 125 | 140 | |
| 6 | 130 | 140 | |
| 7 | 105 | 125 | |
Is there a difference in the physical strength before and after a specified training period?
- State the hypothesis.
- Use t-test to show that there is no mean difference in the physical strength before and after a specified training period.
Solution
First compute a new variable diff: the difference between the after value and the before value.
STEPS
- Transform > Compute.
- For Target Variable type diff; for Numeric Expression type after - before.
- Click OK.
- Analyze > Compare Means > One-Sample T Test.
- Select diff as the test variable and set the Test Value to 0.
- Click Options and set the confidence interval to 95%.
- Under missing values select "Exclude cases analysis by analysis".
- Click Continue, then OK.
Interpretation of the results
Ho: there is no significant mean difference in physical strength before and after a specified training period.
Ha: there is a significant mean difference in physical strength before and after a specified training period.
Since the p-value (0.000) < 0.05, the null hypothesis is rejected, implying that there is a significant mean difference in physical strength before and after a specified training period.
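The same conclusion can be checked outside SPSS. Below is a minimal Python sketch using scipy and the before/after data from the table above; note that the one-sample test on diff with test value 0 is equivalent to a paired t-test on the two columns:

```python
from scipy import stats

# Strength measurements for the 7 subjects from the table above.
before = [100, 110, 90, 110, 125, 130, 105]
after  = [115, 125, 105, 130, 140, 140, 125]

# Equivalent of the SPSS one-sample test on diff = after - before with test value 0.
diff = [a - b for a, b in zip(after, before)]
t_stat, p_value = stats.ttest_1samp(diff, popmean=0)
print(f"one-sample test on diff: t = {t_stat:.3f}, p-value = {p_value:.5f}")

# The same result comes from a paired t-test on the two columns directly.
t_stat2, p_value2 = stats.ttest_rel(after, before)
print(f"paired t-test:           t = {t_stat2:.3f}, p-value = {p_value2:.5f}")
```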
THE PAIRED T-TEST
- The paired sample t-test procedure compares the means of two variables for a single group.
- It computes the difference between values of the two variables for each case and tests whether the average difference differs from 0.
- It is usually used in matched-pairs or case-control studies.
STEPS
- Analyze > Compare Means > Paired-Samples T Test.
- Select a pair of variables as follows: click each of the two variables. The first variable appears in the Current Selections group as Variable 1 and the second appears as Variable 2.
- After you have selected a pair of variables, click the arrow button to move the pair into the Paired Variables list, then click OK.
- You may select more pairs of variables.
- To remove a pair of variables from the analysis, select the pair in the Paired Variables list and click the arrow button.
- Click Options to control the treatment of missing data and the level of the confidence interval.
Independent sample t-test
This is used for testing means of a variable across two distinct groups of cases, e.g. is there a significant difference in income between male and female respondents?
Assumptions for the independent sample t-test
- The variances of the dependent variable in the two populations are equal.
- The dependent variable is normally distributed within the population.
- The data are independent (scores of one participant are not dependent on scores of others).
The independent sample t-test procedure
- It compares means for groups of cases.
- The subjects should be randomly assigned to two groups so that any difference in the response is due to the treatment or lack of treatment but not to other factors.
- Always ensure that differences in other factors are not creating or enhancing a significant difference in means.
Example
- A researcher is interested in whether, in the population, men and women have the same scores on a test.
- Whether there is a difference in the highest year of school completed between males and females.
STEPS
- Analyze > Compare Means > Independent-Samples T Test.
- Select one or more quantitative test variables.
- Select a single grouping variable.
- Click Define Groups to specify two codes for the groups you want to compare.
- Click Options to control the treatment of missing data and the level of the confidence interval.
Example
Procedure
- Analyze > Compare Means > Independent-Samples T Test.
- Select highest year of school completed for test variable.
- Select sex for grouping variable.
- Click Define Groups, use specified values, and put 1 for Group 1 and 2 for Group 2. This is because 1 stands for male and 2 stands for female.
The results show two sets of test statistics:
- Equal variance assumed.
- Equal variance not assumed.
- If the F-statistic is significant (null is rejected), we use the row for equal variance not assumed when interpreting the t-test.
- If the F-statistic is not significant (null is not rejected), we use the row for equal variance assumed when interpreting the t-test.
NOTE
Levene's test helps to determine which row to use when making the decision to accept or reject the null hypothesis.
Results are interpreted as below
Ho: there is no significant mean difference in the highest year of school completed between the male and female respondents.
Ha: there is a significant mean difference in the highest year of school completed between the male and female respondents.
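A minimal Python sketch of this workflow is shown below, using scipy; the scores for the two groups are invented for illustration. Levene's test is run first to decide which form of the t-test to use, mirroring the two rows of SPSS output:

```python
from scipy import stats

# Hypothetical highest year of school completed, by sex (values are assumptions).
male   = [12, 14, 10, 16, 12, 11, 13, 15]
female = [13, 12, 16, 14, 11, 15, 12, 14]

# Levene's test decides which t-test row to use (equal variances or not).
lev_stat, lev_p = stats.levene(male, female)
equal_var = lev_p >= 0.05   # not significant -> assume equal variances

t_stat, p_value = stats.ttest_ind(male, female, equal_var=equal_var)
print(f"Levene p = {lev_p:.3f}, t = {t_stat:.3f}, p-value = {p_value:.4f}")
# If p < 0.05, reject Ho: the group means differ significantly.
```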
Analysis of variance (ANOVA)
One way Analysis of variance
- The one way ANOVA procedure produces a one way analysis of variance for a quantitative dependent variable by a single factor (independent) variable.
- Analysis of variance is used to test the hypothesis that several means are equal.
- This technique is an extension of the two sample t-test.
- In addition to determining that differences exist among the means, you may want to know which means differ.
- Here we can use post hoc tests, which are run after the experiment has been conducted.
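As a sketch, a one way ANOVA can be run in Python with scipy; the three groups below are hypothetical:

```python
from scipy import stats

# Hypothetical scores for three independent groups (values are assumptions).
group_a = [75, 80, 72, 78, 74]
group_b = [68, 70, 65, 72, 69]
group_c = [82, 85, 79, 88, 84]

# One-way ANOVA tests Ho: all group means are equal.
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p-value = {p_value:.4f}")
# If p < 0.05, at least one mean differs; a post hoc test (e.g. Tukey HSD)
# can then identify which pairs of means differ.
```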
REGRESSION ANALYSIS
LINEAR REGRESSION ANALYSIS USING SPSS STATISTICS
Linear regression is the next step up after correlation. It is used when we want to predict the value of a variable based on the value of another variable. The variable we want to predict is called the dependent variable (or sometimes, the outcome variable). The variable we are using to predict the other variable’s value is called the independent variable (or sometimes, the predictor variable). For example, you could use linear regression to understand whether exam performance can be predicted based on revision time; whether cigarette consumption can be predicted based on smoking duration; and so forth. If you have two or more independent variables, rather than just one, you need to use multiple regression analysis.
This “quick start” guide shows you how to carry out linear regression using SPSS Statistics, as well as interpret and report the results from this test. However, before we introduce you to this procedure, you need to understand the different assumptions that your data must meet in order for linear regression to give you a valid result. We discuss these assumptions next.
SPSS Statistics
Assumptions
When you choose to analyse your data using linear regression, part of the process involves checking to make sure that the data you want to analyse can actually be analysed using linear regression. You need to do this because it is only appropriate to use linear regression if your data “passes” seven assumptions that are required for linear regression to give you a valid result. In practice, checking for these seven assumptions just adds a little bit more time to your analysis, requiring you to click a few more buttons in SPSS Statistics when performing your analysis, as well as think a little bit more about your data, but it is not a difficult task.
Before we introduce you to these seven assumptions, do not be surprised if, when analysing your own data using SPSS Statistics, one or more of these assumptions is violated (i.e., not met). This is not uncommon when working with real-world data rather than textbook examples, which often only show you how to carry out linear regression when everything goes well! However, don’t worry. Even when your data fails certain assumptions, there is often a solution to overcome this. First, let’s take a look at these seven assumptions:
Assumption one: Your dependent variable should be measured at the continuous level (i.e., it is either an interval or ratio variable). Examples of continuous variables include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so forth.
Assumption two: Your independent variable should also be measured at the continuous level (i.e., it is either an interval or ratio variable). See the bullet above for examples of continuous variables.
Assumption three: There needs to be a linear relationship between the two variables. Whilst there are a number of ways to check whether a linear relationship exists between your two variables, we suggest creating a scatterplot using SPSS Statistics where you can plot the dependent variable against your independent variable and then visually inspect the scatterplot to check for linearity. Your scatterplot may look something like one of the following:
If the relationship displayed in your scatterplot is not linear, you will have to either run a non-linear regression analysis, perform a polynomial regression or “transform” your data, which you can do using SPSS Statistics. In our enhanced guides, we show you how to:
(a) create a scatterplot to check for linearity when carrying out linear regression using SPSS Statistics;
(b) interpret different scatterplot results; and
(c) transform your data using SPSS Statistics if there is not a linear relationship between your two variables.
Assumption four: There should be no significant outliers. An outlier is an observed data point that has a dependent variable value that is very different to the value predicted by the regression equation. As such, an outlier will be a point on a scatterplot that is (vertically) far away from the regression line, indicating that it has a large residual, as highlighted below:
The problem with outliers is that they can have a negative effect on the regression analysis (e.g., reduce the fit of the regression equation) that is used to predict the value of the dependent (outcome) variable based on the independent (predictor) variable. This will change the output that SPSS Statistics produces and reduce the predictive accuracy of your results. Fortunately, when using SPSS Statistics to run a linear regression on your data, you can easily include criteria to help you detect possible outliers. In our enhanced linear regression guide, we: (a) show you how to detect outliers using “casewise diagnostics”, which is a simple process when using SPSS Statistics; and (b) discuss some of the options you have in order to deal with outliers.
Assumption five: You should have independence of observations, which you can easily check using the Durbin-Watson statistic, which is a simple test to run using SPSS Statistics. We explain how to interpret the result of the Durbin-Watson statistic in our enhanced linear regression guide.
Assumption six: Your data needs to show homoscedasticity, which is where the variances along the line of best fit remain similar as you move along the line. Whilst we explain more about what this means and how to assess the homoscedasticity of your data in our enhanced linear regression guide, take a look at the three scatterplots below, which provide three simple examples: two of data that fail the assumption (called heteroscedasticity) and one of data that meets this assumption (called homoscedasticity):
Whilst these help to illustrate the differences in data that meets or violates the assumption of homoscedasticity, real-world data can be a lot messier and illustrate different patterns of heteroscedasticity. Therefore, in our enhanced linear regression guide, we explain: (a) some of the things you will need to consider when interpreting your data; and (b) possible ways to continue with your analysis if your data fails to meet this assumption.
Assumption seven: Finally, you need to check that the residuals (errors) of the regression line are approximately normally distributed (we explain these terms in our enhanced linear regression guide). Two common methods to check this assumption include using either a histogram (with a superimposed normal curve) or a Normal P-P Plot. Again, in our enhanced linear regression guide, we: (a) show you how to check this assumption using SPSS Statistics, whether you use a histogram (with superimposed normal curve) or Normal P-P Plot; (b) explain how to interpret these diagrams; and (c) provide a possible solution if your data fails to meet this assumption.
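For readers working outside SPSS, assumptions five and seven can also be checked numerically. The sketch below uses statsmodels and scipy on invented data: the Durbin-Watson statistic for independence of observations, and a normality test on the residuals to complement the histogram or Normal P-P Plot:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

# Invented data: a roughly linear relationship with random noise.
rng = np.random.default_rng(0)
x = np.arange(1, 21, dtype=float)
y = 2.0 * x + rng.normal(scale=1.5, size=x.size)

# Fit the ordinary least squares regression and extract the residuals.
model = sm.OLS(y, sm.add_constant(x)).fit()
residuals = model.resid

# Assumption five (independence): Durbin-Watson close to 2 suggests no autocorrelation.
print(f"Durbin-Watson = {durbin_watson(residuals):.2f}")

# Assumption seven (normal residuals): numeric check alongside the histogram/P-P plot.
stat, p = stats.normaltest(residuals)
print(f"residual normality test p-value = {p:.3f}")
```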
Example
A salesperson for a large car brand wants to determine whether there is a relationship between an individual’s income and the price they pay for a car. As such, the individual’s “income” is the independent variable and the “price” they pay for a car is the dependent variable. The salesperson wants to use this information to determine which cars to offer potential customers in new areas where average income is known.
SPSS Statistics
Setup in SPSS Statistics
In SPSS Statistics, we created two variables so that we could enter our data: Income (the independent variable), and Price (the dependent variable). It can also be useful to create a third variable, caseno, to act as a chronological case number. This third variable is used to make it easy for you to eliminate cases (e.g., significant outliers) that you have identified when checking for assumptions. However, we do not include it in the SPSS Statistics procedure that follows because we assume that you have already checked these assumptions. In our enhanced linear regression guide, we show you how to correctly enter data in SPSS Statistics to run a linear regression when you are also checking for assumptions. You can learn about our enhanced data setup content on our Features: Data Setup page. Alternately, see our generic, “quick start” guide: Entering Data in SPSS Statistics.
SPSS Statistics
Test Procedure in SPSS Statistics
The four steps below show you how to analyse your data using linear regression in SPSS Statistics when none of the seven assumptions in the previous section, Assumptions, have been violated. At the end of these four steps, we show you how to interpret the results from your linear regression. If you are looking for help to make sure your data meets assumptions #3, #4, #5, #6 and #7, which are required when using linear regression and can be tested using SPSS Statistics, you can learn more about our enhanced guides on our Features: Overview page.
Note: The procedure that follows is identical for SPSS Statistics versions 18 to 28, as well as the subscription version of SPSS Statistics, with version 28 and the subscription version being the latest versions of SPSS Statistics. However, in version 27 and the subscription version, SPSS Statistics introduced a new look to their interface called “SPSS Light”, replacing the previous look for versions 26 and earlier versions, which was called “SPSS Standard”. Therefore, if you have SPSS Statistics versions 27 or 28 (or the subscription version of SPSS Statistics), the images that follow will be light grey rather than blue. However, the procedure is identical.
- Click Analyze > Regression > Linear... on the top menu, as shown below:
You will be presented with the Linear Regression dialogue box:
- Transfer the independent variable, Income, into the Independent(s): box and the dependent variable, Price, into the Dependent: box. You can do this by either drag-and-dropping the variables or by using the appropriate buttons. You will end up with the following screen:
- You now need to check four of the assumptions discussed in the Assumptions section above: no significant outliers (assumption #4); independence of observations (assumption #5); homoscedasticity (assumption #6); and normal distribution of errors/residuals (assumption #7). You can do this by using the Statistics and Plots features, and then selecting the appropriate options within these two dialogue boxes. In our enhanced linear regression guide, we show you which options to select in order to test whether your data meets these four assumptions.
- Click on the OK button. This will generate the results.
THE OUTPUT OF SPSS ON REGRESSION ANALYSIS
SPSS Statistics will generate quite a few tables of output for a linear regression. In this section, we show you only the three main tables required to understand your results from the linear regression procedure, assuming that no assumptions have been violated. A complete explanation of the output you have to interpret when checking your data for the seven assumptions required to carry out linear regression is provided in our enhanced guide. This includes relevant scatterplots, histogram (with superimposed normal curve), Normal P-P Plot, casewise diagnostics and the Durbin-Watson statistic. Below, we focus on the results for the linear regression analysis only.
The first table of interest is the Model Summary table, as shown below:
This table provides the R and R² values. The R value represents the simple correlation and is 0.873 (the “R” column), which indicates a high degree of correlation. The R² value (the “R Square” column) indicates how much of the total variation in the dependent variable, Price, can be explained by the independent variable, Income. In this case, 76.2% can be explained, which is very large.
The next table is the ANOVA table, which reports how well the regression equation fits the data (i.e., predicts the dependent variable) and is shown below:
This table indicates that the regression model predicts the dependent variable significantly well. How do we know this? Look at the “Regression” row and go to the “Sig.” column. This indicates the statistical significance of the regression model that was run. Here, p < 0.0005, which is less than 0.05, and indicates that, overall, the regression model statistically significantly predicts the outcome variable (i.e., it is a good fit for the data).
The Coefficients table provides us with the necessary information to predict price from income, as well as determine whether income contributes statistically significantly to the model (by looking at the “Sig.” column). Furthermore, we can use the values in the “B” column under the “Unstandardized Coefficients” column, as shown below:
to present the regression equation as:
Price = 8287 + 0.564(Income)
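For comparison, a simple linear regression of this kind can be fitted in Python with scipy. The income and price values below are invented and will not reproduce the coefficients above, but the output has the same structure (intercept, slope, R, R² and a p-value):

```python
from scipy import stats

# Hypothetical income and car-price data (values are assumptions).
income = [15000, 22000, 31000, 45000, 52000, 60000, 75000, 90000]
price  = [16000, 20500, 26000, 33500, 38000, 42000, 50500, 59000]

# Fit the simple linear regression Price = intercept + slope * Income.
result = stats.linregress(income, price)
print(f"Price = {result.intercept:.0f} + {result.slope:.3f}(Income)")
print(f"R = {result.rvalue:.3f}, R^2 = {result.rvalue**2:.3f}, p = {result.pvalue:.4f}")
```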
CORRELATIONS
CORRELATION ANALYSIS
Pearson Correlation
The bivariate Pearson Correlation produces a sample correlation coefficient, r, which measures the strength and direction of linear relationships between pairs of continuous variables. By extension, the Pearson Correlation evaluates whether there is statistical evidence for a linear relationship among the same pairs of variables in the population, represented by a population correlation coefficient, ρ (“rho”). The Pearson Correlation is a parametric measure.
This measure is also known as:
- Pearson’s correlation
- Pearson product-moment correlation (PPMC)
Common Uses
The bivariate Pearson Correlation is commonly used to measure the following:
- Correlations among pairs of variables
- Correlations within and between sets of variables
The bivariate Pearson correlation indicates the following:
- Whether a statistically significant linear relationship exists between two continuous variables
- The strength of a linear relationship (i.e., how close the relationship is to being a perfectly straight line)
- The direction of a linear relationship (increasing or decreasing)
Note: The bivariate Pearson Correlation cannot address non-linear relationships or relationships among categorical variables. If you wish to understand relationships that involve categorical variables and/or non-linear relationships, you will need to choose another measure of association.
Note: The bivariate Pearson Correlation only reveals associations among continuous variables. The bivariate Pearson Correlation does not provide any inferences about causation, no matter how large the correlation coefficient is.
Data Requirements
To use Pearson correlation, your data must meet the following requirements:
- Two or more continuous variables (i.e., interval or ratio level)
- Cases must have non-missing values on both variables
- Linear relationship between the variables
- Independent cases (i.e., independence of observations)
  - There is no relationship between the values of variables between cases. This means that:
    - the values for all variables across cases are unrelated
    - for any case, the value for any variable cannot influence the value of any variable for other cases
    - no case can influence another case on any variable
  - The bivariate Pearson correlation coefficient and corresponding significance test are not robust when independence is violated.
- Bivariate normality
  - Each pair of variables is bivariately normally distributed
  - Each pair of variables is bivariately normally distributed at all levels of the other variable(s)
  - This assumption ensures that the variables are linearly related; violations of this assumption may indicate that non-linear relationships among variables exist. Linearity can be assessed visually using a scatterplot of the data.
- Random sample of data from the population
- No outliers
Hypotheses
The null hypothesis (H0) and alternative hypothesis (H1) of the significance test for correlation can be expressed in the following ways, depending on whether a one-tailed or two-tailed test is requested:
Two-tailed significance test:
H0: ρ = 0 (“the population correlation coefficient is 0; there is no association”)
H1: ρ ≠ 0 (“the population correlation coefficient is not 0; a nonzero correlation could exist”)
One-tailed significance test:
H0: ρ = 0 (“the population correlation coefficient is 0; there is no association”)
H1: ρ > 0 (“the population correlation coefficient is greater than 0; a positive correlation could exist”)
OR
H1: ρ < 0 (“the population correlation coefficient is less than 0; a negative correlation could exist”)
where ρ is the population correlation coefficient.
The sample correlation coefficient between two variables x and y is denoted r or rxy, and can be computed as:

$$ r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^{2}}} $$
Run a Bivariate Pearson Correlation
To run a bivariate Pearson Correlation in SPSS, click Analyze > Correlate > Bivariate.
The Bivariate Correlations window opens, where you will specify the variables to be used in the analysis. All of the variables in your dataset appear in the list on the left side. To select variables for the analysis, select the variables in the list on the left and click the blue arrow button to move them to the right, in the Variables field.
A. Variables: The variables to be used in the bivariate Pearson Correlation. You must select at least two continuous variables, but may select more than two. The test will produce correlation coefficients for each pair of variables in this list.
B. Correlation Coefficients: There are multiple types of correlation coefficients. By default, Pearson is selected. Selecting Pearson will produce the test statistics for a bivariate Pearson Correlation.
C. Test of Significance: Click Two-tailed or One-tailed, depending on your desired significance test. SPSS uses a two-tailed test by default.
D. Flag significant correlations: Checking this option will include asterisks (**) next to statistically significant correlations in the output. By default, SPSS marks statistical significance at the alpha = 0.05 and alpha = 0.01 levels, but not at the alpha = 0.001 level (which is treated as alpha = 0.01).
E. Options: Clicking Options will open a window where you can specify which Statistics to include (i.e., Means and standard deviations, Cross-product deviations and covariances) and how to address Missing Values (i.e., Exclude cases pairwise or Exclude cases listwise). Note that the pairwise/listwise setting does not affect your computations if you are only entering two variables, but can make a very large difference if you are entering three or more variables into the correlation procedure.
Example: Understanding the linear association between weight and height
PROBLEM STATEMENT
Perhaps you would like to test whether there is a statistically significant linear relationship between two continuous variables, weight and height (and by extension, infer whether the association is significant in the population). You can use a bivariate Pearson Correlation to test whether there is a statistically significant linear relationship between height and weight, and to determine the strength and direction of the association.
BEFORE THE TEST
In the sample data, we will use two variables: “Height” and “Weight.” The variable “Height” is a continuous measure of height in inches and exhibits a range of values from 55.00 to 84.41 (Analyze > Descriptive Statistics > Descriptives). The variable “Weight” is a continuous measure of weight in pounds and exhibits a range of values from 101.71 to 350.07.
Before we look at the Pearson correlations, we should look at the scatterplots of our variables to get an idea of what to expect. In particular, we need to determine if it’s reasonable to assume that our variables have linear relationships. Click Graphs > Legacy Dialogs > Scatter/Dot. In the Scatter/Dot window, click Simple Scatter, then click Define. Move variable Height to the X Axis box, and move variable Weight to the Y Axis box. When finished, click OK.
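The same linearity check and correlation test can be sketched in Python. The height and weight values below are invented (within the ranges described above) purely to show the mechanics:

```python
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical height (inches) and weight (pounds) values (assumptions).
height = [60.2, 63.5, 65.1, 66.8, 68.0, 69.4, 71.2, 73.5]
weight = [115.3, 128.7, 140.2, 152.9, 160.4, 172.8, 185.1, 210.6]

# Visual check for linearity, analogous to the SPSS Simple Scatter plot.
plt.scatter(height, weight)
plt.xlabel("Height (inches)")
plt.ylabel("Weight (pounds)")
plt.show()

# Bivariate Pearson correlation (two-tailed by default).
r, p = stats.pearsonr(height, weight)
print(f"r = {r:.3f}, p-value = {p:.4f}")
```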