Data analysis
CROSS TABULATION:
Cross tabulation shows how the categories of one variable are distributed across the categories of another, which helps us judge whether the two variables are related.
For example
Age | Education | Gender | District |
22 | Primary | Male | Tororo |
22 | Secondary | Male | Tororo |
24 | Tertiary | Female | Iganga |
31 | Primary | Male | Tororo |
26 | Primary | Female | Jinja |
26 | Secondary | Female | Tororo |
31 | University | Male | Iganga |
26 | Secondary | Male | Jinja |
22 | Primary | Male | Jinja |
STEPS
- Analyze
- Descriptive statistics
- Cross tabulations
- Then select variables for rows
- Select variables for columns
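As a rough illustration outside SPSS, the cross tabulation of Education by Gender for the example table can be sketched in plain Python. The function name cross_tabulate and the tuple layout are assumptions made for this sketch; only the nine records come from the notes.

```python
from collections import Counter

# Records copied from the example table: (age, education, gender, district).
records = [
    (22, "Primary", "Male", "Tororo"),
    (22, "Secondary", "Male", "Tororo"),
    (24, "Tertiary", "Female", "Iganga"),
    (31, "Primary", "Male", "Tororo"),
    (26, "Primary", "Female", "Jinja"),
    (26, "Secondary", "Female", "Tororo"),
    (31, "University", "Male", "Iganga"),
    (26, "Secondary", "Male", "Jinja"),
    (22, "Primary", "Male", "Jinja"),
]


def cross_tabulate(rows, row_index, col_index):
    """Count the records that fall in each (row category, column category) cell."""
    return Counter((r[row_index], r[col_index]) for r in rows)


# Education (index 1) in the rows, Gender (index 2) in the columns.
table = cross_tabulate(records, 1, 2)

row_labels = sorted({key[0] for key in table})
col_labels = sorted({key[1] for key in table})

print("Education".ljust(12) + "".join(c.rjust(8) for c in col_labels))
for r in row_labels:
    # A Counter returns 0 for cells that never occurred.
    print(r.ljust(12) + "".join(str(table[(r, c)]).rjust(8) for c in col_labels))
```

Each printed cell is the number of respondents with that combination of education level and gender, which is what the Crosstabs count table reports.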
DATA ANALYSIS
This involves five major steps:
- Enter your data in the Data Editor.
- Select the procedure from the menu.
- Select variables for the analysis.
- Examine the results in the output window.
- Interpret the results in a Word document.
NOTE
Before any data analysis is done, first identify whether each variable is quantitative or categorical.
Data analysis includes univariate, bivariate and multivariate analysis.
Univariate analysis: a single variable is analyzed at a time, e.g. What is the average age of students?
Bivariate analysis: two variables are analyzed together, e.g.
Does the income of the respondent depend on age?
Multivariate analysis: more than two variables are analyzed together, e.g.
Does the income of the respondent depend on the age, education level and sex of the respondent?
UNIVARIATE ANALYSIS
Descriptive statistics
These are computed only for quantitative (continuous) variables such as age, height and weight.
Procedure
- Analyze >> Descriptive Statistics >> Descriptives.
- Move the variables from the left-hand box into the right-hand box.
- Specify the particular statistics required by clicking the Options (or Statistics) button.
- Press OK.
Interpret the results, i.e. mean, median, mode, frequency, quantiles, sum, variance, standard deviation, minimum, maximum, range, kurtosis and skewness.
Example
To get the minimum, maximum, mean, standard deviation and variance, click
Analyze >> Descriptive Statistics >> Descriptives.
The output appears in the output window.
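For comparison, the same descriptive statistics can be sketched in plain Python using the Age column of the cross-tabulation example. This is only an illustration, not SPSS output; the dictionary name summary is an assumption of the sketch.

```python
import statistics

# Ages taken from the cross-tabulation example table.
ages = [22, 22, 24, 31, 26, 26, 31, 26, 22]

summary = {
    "minimum": min(ages),
    "maximum": max(ages),
    "range": max(ages) - min(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "variance": statistics.variance(ages),    # sample variance (divides by n - 1)
    "std_deviation": statistics.stdev(ages),  # square root of the sample variance
}

for name, value in summary.items():
    print(f"{name:>14}: {value:.3f}")
```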
Frequency Distribution
This is done for categorical (qualitative) variables such as sex, marital status and age group.
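A frequency distribution is simply a count, and usually a percentage, for each category. The sketch below tallies the Education column of the cross-tabulation example in plain Python; the names counts and frequency_table are chosen here for illustration.

```python
from collections import Counter

# Education column from the cross-tabulation example table.
education = [
    "Primary", "Secondary", "Tertiary", "Primary", "Primary",
    "Secondary", "University", "Secondary", "Primary",
]

counts = Counter(education)
total = sum(counts.values())

# Map each category to (frequency, percent of all cases), most frequent first.
frequency_table = {
    level: (n, 100 * n / total) for level, n in counts.most_common()
}

for level, (n, percent) in frequency_table.items():
    print(f"{level:<12}{n:>4}{percent:>8.1f}%")
```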
GRAPHING
Steps
- Graphs
- Legacy Dialogs
- Then choose the type of diagram you want.
A PIE CHART
This is done for categorical (qualitative) variables such as sex and education level.
BOX PLOT
- Graphs >> Legacy Dialogs >> Boxplot >> Simple
- Select summaries of separate variables
- Define
- Select the continuous variables to be charted
- Press OK
HISTOGRAM
- Graphs >> Legacy Dialogs >> Histogram
- Select variables
- Select display normal curve
- Press OK
LINE GRAPH
- Graphs >> Legacy Dialogs >> Line >> Simple
- Select values of individual cases
- Define
- Select Y and X-axis variables
- Press OK
Multiple responses/open ended questions
This is the case where each respondent may give more than one answer to a particular question, e.g. What are the reasons for high dropout rates among children in some parts of Uganda?
Example
Opinion poll about Rwanda elections
People were asked to give their opinion as to why Museveni won the recent presidential election. Below are the coded reasons and the responses given.
- 1 = Best candidate
- 2 = Rigged the election
- 3 = He is a dictator
- 4 = He was the only candidate
Responses
- 1
- 2,3
- 2,3,4
- 3,4
- 2,3,4
- 2,4
- 3,2
According to the survey, why did Museveni win the election?
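Before turning to the SPSS methods, the tally behind a multiple-response frequency table can be sketched in plain Python. The sketch assumes the four reasons are coded 1 to 4 in the order listed, and it uses only the seven responses listed in these notes, so its percentages need not match the SPSS output quoted in the interpretation further on.

```python
from collections import Counter

# Assumed coding of the four reasons, in the order they are listed.
reasons = {
    1: "Best candidate",
    2: "Rigged the election",
    3: "He is a dictator",
    4: "He was the only candidate",
}

# One list of codes per respondent, as listed under the responses.
responses = [[1], [2, 3], [2, 3, 4], [3, 4], [2, 3, 4], [2, 4], [3, 2]]

# Every mention counts once, so one respondent can add to several reasons.
mentions = Counter(code for answer in responses for code in answer)
total_mentions = sum(mentions.values())
n_cases = len(responses)

for code, label in reasons.items():
    n = mentions[code]
    pct_responses = 100 * n / total_mentions  # share of all mentions
    pct_cases = 100 * n / n_cases             # share of respondents
    print(f"{label:<28}{n:>3}{pct_responses:>8.1f}%{pct_cases:>8.1f}%")
```

Percent of responses divides by the total number of mentions, while percent of cases divides by the number of respondents, so the latter can add up to more than 100%.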
Method 1 / Dichotomies method
STEPS
DATA ENTRY
Best Candidate | Rigged election | Dictator | Only ……..
|
1 |
| ||
2 | 3 |
| |
2 | 3 |
| |
2 | 3 | 4 | |
2 | 3 | 4 | |
2 | 3 | 4 | |
2 | 3 | 4 |
- Analyze
- Multiple Response
- Define sets
- Move the desired variables from the Set Definition box to the Variables in Set box
- Select Dichotomies
- For the counted value, put 1
- In Name, type Reasons
- In Label, type Why Museveni won
- Click on add
- Close
- Go to analyze
- Multiple responses
- Frequencies
The output appears as below.
Interpretation
The above analysis shows that the main reasons for Museveni's win were that he rigged the election and that he is a dictator, each reported in 30.8% of the responses. This is followed by the reason that he was the only candidate, reported in 15.4% of the responses. The minor reason was that he was the best candidate, reported in only 7.7% of the responses.
Method 2 / Categories method
STEPS
Data Entry
Best candidate | Rigged election | dictator | Only candidate | |
1 | 1 | |||
2 | 2 | |||
3 | 2 | 3 | ||
4 | 3 | 4 | ||
5 | 2 | 3 | 4 | |
6 | 2 | 3 | 4 | |
7 | 2 | 4 | ||
8 | 2 | 3 | ||
9 |
- Analyze
- Multiple Response
- Define sets
- Move the desired variables from the Set Definition box to the Variables in Set box
- Select Categories
- Put the range 1 through 4 (it depends on the number of reasons you have)
- In Name, type Reasons
- In Label, type Why Museveni won
- Click Add
- Close
- Go to Analyze
- Multiple Response
- Frequencies
The table will appear.
NB: The interpretation is the same as for table 1.
Bivariate Analysis
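This examines the relationship between two variables. Cross tabulation and the chi-square test are used for categorical variables, while correlation is used for quantitative variables.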
CORRELATION ANALYSIS
This is a measure of the relationship between two variables.
It tells us how strong the correlation between the two variables is.
The relationship could be negative (-) or positive (+). If the correlation coefficient (r) = 1, there is a perfect positive correlation between the variables, and if r = -1 there is a perfect negative correlation between the variables.
If |r| > 0.5, there is a strong relationship between the variables.
If |r| = 0.5, the relationship is moderate.
If |r| < 0.5, there is a weak relationship between the variables.
If r = 0, there is no linear relationship. A negative value of r shows the direction of the relationship, not that it is weak.
NOTE
In correlation analysis, we analyze the strength, direction and significance of the relationship.
DIRECTION
If the correlation coefficient is negative, the two variables move in opposite directions: as one variable increases, the other decreases. If the correlation coefficient is positive, the two variables move in the same direction: as one variable increases, the other also increases.
Significance of the relationship
If the P-value is less than the level of significance, such as 0.05 or 0.01, then the relationship is significant; otherwise it is insignificant.
Testing for correlation
- Graphical approach: a scatter plot is used.
The scatter plot illustrates the relationship between the variables, which can be positive, negative or non-existent.
- Graphs >> Legacy Dialogs >> Scatter >> Simple >> Define
- Select the Y and X-axis variables
- Press OK
- To add the line of best fit, double-click the plot and click on add a reference line from equation.
The following are scatter plots for visual interpretation of the types of correlation
Example
Using the data
Y | X |
2441.1 | 3776.3 |
2476.9 | 3843.1 |
2503.7 | 3760.3 |
2619.4 | 3906.6 |
2746.1 | 4148.5 |
2865.8 | 4279.8 |
2969.1 | 4404.5 |
3052.2 | 4539.9 |
3162.4 | 4718.6 |
3223.3 | 4838 |
3260.4 | 4877.5 |
3240.8 | 4821 |
Is there any relationship between the variables?
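One way to check the answer numerically, outside SPSS, is to compute the Pearson correlation coefficient for the Y and X values above. The sketch below is plain Python; the function name pearson_r is an assumption, and the strength labels apply the 0.5 rule of thumb from these notes to the absolute value of r.

```python
import math

# Y and X values copied from the example table.
y = [2441.1, 2476.9, 2503.7, 2619.4, 2746.1, 2865.8,
     2969.1, 3052.2, 3162.4, 3223.3, 3260.4, 3240.8]
x = [3776.3, 3843.1, 3760.3, 3906.6, 4148.5, 4279.8,
     4404.5, 4539.9, 4718.6, 4838.0, 4877.5, 4821.0]


def pearson_r(a, b):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    co_variation = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    spread_a = sum((ai - mean_a) ** 2 for ai in a)
    spread_b = sum((bi - mean_b) ** 2 for bi in b)
    return co_variation / math.sqrt(spread_a * spread_b)


r = pearson_r(x, y)

direction = "positive" if r > 0 else "negative" if r < 0 else "none"
strength = "strong" if abs(r) > 0.5 else "moderate" if abs(r) == 0.5 else "weak"

print(f"r = {r:.3f} ({strength}, {direction})")
```

The sketch reports direction and strength only; the P-value that SPSS prints for the significance of the relationship is not computed here.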
Statistical tests
Pearson correlation coefficient
This is used for quantitative variables such as age and income.
For example
Is there any significant correlation between the age and income of the respondent?
The hypotheses are stated as follows
Ho: There is no significant correlation between age and income of the respondent.
Ha: There is a significant correlation between age and income of the respondent.
STEPS
- Analyze >> Correlate >> Bivariate.
- Move the variables from the left-hand box into the right-hand box.
- Select flag significant correlations.
- Select the type of correlation coefficient: Pearson.
- Press OK.
- Interpret the result.
Example
Interpretation
The correlation coefficient is -0.259, which implies that there is a weak negative correlation between the highest year of school completed and the age of the respondent. The correlation is significant at the 1% level of significance since the P-value (0.000) < 0.01; thus the null hypothesis is rejected and the conclusion is made that there is a significant relationship between the highest year of school completed and the age of the respondent.
- Spearman: deals with ranked data.
- Kendall's: used for categorical variables with some order, such as education level.
Example
Using the
If the confidence interval does not include the hypothesized value of the population parameter, the null hypothesis is rejected; otherwise it is not rejected.
Chi-square test
It is a test of dependence or association between two variables, which must be categorical, such as marital status, education level and religion.
Example
Does the religion of the respondent depend on marital status?
Procedure
Analyze >> Descriptive Statistics >> Crosstabs
Select one variable for a row and another for a column.
Click Statistics >> ………..square >> Continue
Cells >> Row and Column percentages >> Continue
Press OK
NOTE:
First state the hypotheses
Ho: Religion of the respondent does not depend on marital status.
Ha: Religion of the respondent depends on marital status.
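The chi-square statistic that Crosstabs reports compares the observed cell counts with the counts expected if the two variables were independent. The notes contain no religion or marital status data, so the sketch below uses Gender by District counted from the cross-tabulation example as a stand-in. With only nine cases the expected counts are far too small for a real test, so this illustrates the arithmetic only; the function name chi_square is an assumption.

```python
# Observed counts of Gender by District, counted from the example table.
observed = {
    "Male": {"Tororo": 3, "Iganga": 1, "Jinja": 2},
    "Female": {"Tororo": 1, "Iganga": 1, "Jinja": 1},
}


def chi_square(table):
    """Return the Pearson chi-square statistic and its degrees of freedom."""
    rows = list(table)
    cols = list(next(iter(table.values())))
    row_totals = {r: sum(table[r].values()) for r in rows}
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
    n = sum(row_totals.values())

    statistic = 0.0
    for r in rows:
        for c in cols:
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[r] * col_totals[c] / n
            statistic += (table[r][c] - expected) ** 2 / expected

    degrees_of_freedom = (len(rows) - 1) * (len(cols) - 1)
    return statistic, degrees_of_freedom


chi2, dof = chi_square(observed)
print(f"chi-square = {chi2:.3f}, df = {dof}")
```

A statistic close to zero, as here, means the observed counts are close to the expected counts, which is consistent with the null hypothesis of no dependence.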
one-tailed test or a two-tailed test
Should you use a one-tailed test or a two-tailed test for your data analysis?
When creating your data analysis plan or working on your results, you may have to decide if your statistical test should be a one-tailed test or a two-tailed test (also known as “directional” and “non-directional” tests respectively). So, what exactly is the difference between the two? First, it may be helpful to know what the term “tail” means in this context.
The tail refers to the end of the distribution of the test statistic for the particular analysis that you are conducting. For example, a t-test uses the t distribution, and an analysis of variance (ANOVA) uses the F distribution. The distribution of the test statistic can have one or two tails depending on its shape (see the figure below). The black-shaded areas of the distributions in the figure are the tails. Symmetrical distributions like the t and z distributions have two tails. Asymmetrical distributions like the F and chi-square distributions have only one tail. This means that analyses such as ANOVA and chi-square tests do not have a “one-tailed vs. two-tailed” option, because the distributions they are based on have only one tail.
SAMPLE T-TESTS
These are used for testing means.
SAMPLE TESTS IN SPSS
- One sample t-tests.
- Paired sample t-test.
- Independent sample t-test.
- ANOVA test.
Always remember that:
- One sample t-test is used to compare the mean of one variable with a target value.
- Paired sample t-test is used to compare the means of two variables for a single group.
- Independent sample t-test is used to compare the means of two distinct groups of cases, e.g. alive or dead, on or off, men or women.
- ANOVA is used for testing several means.
One sample t-test
One sample t-test is performed when you want to determine if the mean value of a target variable is different from a hypothesized value.
Examples
- A researcher might want to test whether the average age of respondents differs from 52.
- A researcher might want to test whether the average mark of students differs from 75.
Assumptions for the one sample t-test
- The dependent variable is normally distributed within the population.
- The data are independent.
Example 1
A study of physical strength, measured in kilograms, on 7 subjects before and after a specified training period gave the following results.
Subject | Before | After | diff |
1 | 100 | 115 | |
2 | 110 | 125 | |
3 | 90 | 105 | |
4 | 110 | 130 | |
5 | 125 | 140 | |
6 | 130 | 140 | |
7 | 105 | 125 |
Is there a difference in the physical strength before and after a specified training period?
- State the hypothesis.
- Use a t-test to determine whether there is a mean difference in the physical strength before and after a specified training period.
Solution
First compute a new variable diff, the difference between the after value and the before value.
STEPS
- Transform >> Compute Variable.
- For Target Variable type diff; for Numeric Expression type after - before.
- Click OK.
- Analyze >> Compare Means >> One-Sample T Test.
- Select diff as the test variable and set the test value to 0.
- Click Options and set the confidence interval to 95%.
- Under missing values select “exclude cases analysis by analysis”
- Continue
Interpretation of the results
Ho: there is no significant mean difference in physical strength before and after a specified training period.
Ha: there is a significant mean difference in physical strength before and after a specified training period.
Since the P-value (0.000)<0.05, the null hypothesis is rejected implying that there is a significant mean difference in physical strength before and after a specified training period.
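The same test can be sketched by hand in plain Python using the before and after values from the table. The sketch computes diff, then the t statistic for a test value of 0. The Python standard library has no t distribution, so instead of a P-value it compares the statistic with 2.447, the two-tailed 5% critical value for 6 degrees of freedom taken from a standard t table.

```python
import math
import statistics

# Before and after values for the 7 subjects in the example table.
before = [100, 110, 90, 110, 125, 130, 105]
after = [115, 125, 105, 130, 140, 140, 125]

# diff = after - before, as in the Compute step.
diff = [a - b for a, b in zip(after, before)]

n = len(diff)
test_value = 0
mean_diff = statistics.mean(diff)
std_error = statistics.stdev(diff) / math.sqrt(n)  # sample sd divided by sqrt(n)

t_statistic = (mean_diff - test_value) / std_error
degrees_of_freedom = n - 1

critical_t = 2.447  # two-tailed 5% critical value for 6 df, from a t table
reject_null = abs(t_statistic) > critical_t

print(f"mean diff = {mean_diff:.3f}, t = {t_statistic:.3f}, df = {degrees_of_freedom}")
print("Reject Ho" if reject_null else "Do not reject Ho")
```

Because the test is run on the differences, this is the same calculation a paired sample t-test performs on the before and after variables.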
THE PAIRED T-TEST
- The paired sample t-test procedure compares the means of two variables for a single group.
- It computes the difference between the values of the two variables for each case and tests whether the average differs from 0.
- It is usually used in the mate….. pairs or case-control study.
STEPS
- Analyze >> Compare Means >> Paired-Samples T Test.
- Select a pair of variables as follows: click each of the two variables. The first variable appears in the current selection group as variable 1 and the second appears as variable 2.
- After you have selected a pair of variables, click the arrow button to move the pair into the paired variables list.
- You may select more pairs of variables.
- To remove a pair of variables from the analysis, select the pair in the paired variables list and click the arrow button.
- Click Options to control the treatment of missing data and the level of the confidence interval.
- Click OK.
Example (page 48)
Independent sample t-test
This is used for testing the means of a variable across two distinct groups of cases, e.g. Is there a significant difference in income between male and female respondents?
Assumptions for the independent sample t-test
- The variances of the dependent variable in the two populations are equal.
- The dependent variable is normally distributed within the population.
- The data are independent (scores of one participant are not dependent on scores of others).
The independent sample t-test procedure
- It compares means for groups of cases.
- The subjects should be randomly assigned to two groups so that any difference in the response is due to the treatment or lack of treatment but not to other factors.
- Always ensure that differences in other factors are not masking or enhancing a significant difference in means.
Example
- The researcher is interested in whether men and women in the population have the same scores in a test.
- The researcher wants to know whether there is a difference in the highest year of school completed between males and females.
STEPS
- Analyze >> Compare Means >> Independent-Samples T Test.
- Select one or more quantitative test variables.
- Select a single grouping variable.
- Click Define Groups to specify two codes for the groups you want to compare.
- Click Options to control the treatment of missing data and the level of the confidence interval.
Example
Procedure
- Analyze >> Compare Means >> Independent-Samples T Test.
- Select highest year of school completed as the test variable.
- Select sex as the grouping variable.
- Click Define Groups, choose use specified values, and put 1 for group 1 and 2 for group 2. This is because 1 stands for male and 2 stands for female.
The results show two sets of test statistics
- Equal variances assumed.
- Equal variances not assumed.
- If the F-statistic is significant (the null hypothesis of equal variances is rejected), we use the row of equal variances not assumed for interpreting the t-test.
- If the F-statistic is not significant (the null hypothesis is not rejected), we use the row of equal variances assumed for interpreting the t-test.
NOTE
Levene's test helps to determine which row to use to make a decision on accepting or rejecting the null hypothesis.
Results are interpreted as below
Ho: there is no significant mean difference in the highest year of school completed between the male and female respondents.
Ha: there is a significant mean difference in the highest year of school completed between the male and female respondents.
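The two rows of the output correspond to two ways of computing the standard error of the difference in means. The sketch below shows both in plain Python. The two groups of values are hypothetical, invented only to show the arithmetic (they are not the survey data), and Levene's test itself is not computed.

```python
import math
import statistics

# Hypothetical years of schooling for two groups (illustration only).
group_1 = [12, 14, 16, 12, 16]
group_2 = [10, 12, 11, 13, 9]

n1, n2 = len(group_1), len(group_2)
mean_1, mean_2 = statistics.mean(group_1), statistics.mean(group_2)
var_1, var_2 = statistics.variance(group_1), statistics.variance(group_2)

# Row 1: equal variances assumed (pooled variance, df = n1 + n2 - 2).
pooled_var = ((n1 - 1) * var_1 + (n2 - 1) * var_2) / (n1 + n2 - 2)
se_pooled = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_pooled = (mean_1 - mean_2) / se_pooled
df_pooled = n1 + n2 - 2

# Row 2: equal variances not assumed (Welch t-test with adjusted df).
se_welch = math.sqrt(var_1 / n1 + var_2 / n2)
t_welch = (mean_1 - mean_2) / se_welch
df_welch = (var_1 / n1 + var_2 / n2) ** 2 / (
    (var_1 / n1) ** 2 / (n1 - 1) + (var_2 / n2) ** 2 / (n2 - 1)
)

print(f"Equal variances assumed:     t = {t_pooled:.3f}, df = {df_pooled}")
print(f"Equal variances not assumed: t = {t_welch:.3f}, df = {df_welch:.2f}")
```

With equal group sizes the two t statistics coincide and only the degrees of freedom differ; with unequal group sizes and unequal variances the two rows can lead to different conclusions, which is why the choice of row matters.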
Analysis of variance (ANOVA)
One way Analysis of variance
- The one way ANOVA procedure produces a one way analysis of variance for a quantitative dependent variable by a single factor (independent) variable.
- Analysis of variance is used to test the hypothesis that several means are equal.
- This technique is an extension of the two sample t-test.
- In addition to determining that differences exist among the means, you may want to know which means differ.
- Here we can use post hoc tests, which are run after the experiment has been conducted.
Assumptions
- Independent random samples: the groups being compared are regarded as distinct populations, so samples from such populations are said to be independent.
- The populations are normally distributed.
- The population variances are equal.
STEPS FOR A ONE-WAY ANALYSIS OF VARIANCE
- Analyze >> Compare Means >> One-Way ANOVA.
- Select one or more dependent variables.
- Select a single independent factor variable.
Post hoc
If we want to know which groups are significantly different from each other, use Post Hoc and select the Bonferroni test.
Which group would you recommend?
Under Options, select Means plot and Descriptive, then click Continue.
STEPS
- Analyze >> Compare Means >> Means.
- Select the dependent variable (continuous/quantitative).
- Select the independent variable (categorical).
- Options >> select Mean >> Continue >> OK.
Using the GSS 93 subset data,
- Is there any significant difference in the highest year of school completed (education) by religion?
- Which religion would you recommend to a respondent and why?
- Which religions are significantly different from each other?
STEPS
- Analyze >> Compare Means >> One-Way ANOVA.
- Dependent list (Educ).
- Factor (relig).
- Select Post Hoc >> Bonferroni.
- Select Options, then click Means plot and Descriptive.
Interpretation:
Ho: there is no significant mean difference in the highest year of school completed by different religions.
Ha: there is a significant mean difference in the highest year of school completed by different religions.
Since the P-value (0.000) < 0.05, the null hypothesis is rejected and the conclusion is made that there is a significant mean difference in the highest year of school completed by different religions.
Answer to Question 2
There are two approaches: using the Descriptives table and using the means plot.
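The F statistic in the ANOVA table is the ratio of the variation between the group means to the variation within the groups. The sketch below works through that arithmetic in plain Python. The three groups are hypothetical values invented for illustration (they are not the GSS data), and no P-value or post hoc test is computed.

```python
import statistics

# Hypothetical scores for three groups (illustration only).
groups = {
    "A": [10, 12, 14],
    "B": [13, 15, 17],
    "C": [18, 20, 22],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = statistics.mean(all_values)

# Between-groups sum of squares: group size times squared distance
# of each group mean from the grand mean.
ss_between = sum(
    len(values) * (statistics.mean(values) - grand_mean) ** 2
    for values in groups.values()
)

# Within-groups sum of squares: squared distance of each value
# from its own group mean.
ss_within = sum(
    (v - statistics.mean(values)) ** 2
    for values in groups.values()
    for v in values
)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_statistic = ms_between / ms_within

print(f"F({df_between}, {df_within}) = {f_statistic:.2f}")
```

A large F means the group means differ by more than the spread inside the groups would suggest; the group means themselves are what the Descriptives table and means plot display.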
Regression Analysis
In regression analysis we consider two sets of variables.
- Independent variables.
- Dependent variables.
Regression is important in enabling the prediction of the dependent variable, given the values of the independent variables, using the formulated regression equation.
SIMPLE LINEAR REGRESSION MODEL
There are only two variables, which must be quantitative in nature.
It always takes the form Y = B0 + B1X + ε, where ε is the error term.
Where Y = dependent variable / endogenous variable / regressand / explained variable.
X = independent variable / exogenous variable / regressor / explanatory variable.
B0 and B1 = coefficients of the regression.
NOTE:
- Independent variables (X) are variables that drive or determine other variables or relationships.
- Dependent variables (Y) are variables that are caused by or influenced by the independent variables.
- B0 is the intercept: the value of Y when the independent variable is zero (X = 0).
- B1 is the slope: the change in Y for a unit change in X.
- ε contains other factors that affect the dependent variable, e.g. seasonality or the product sold.
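The intercept, slope and R-square described above can be sketched with the ordinary least squares formulas in plain Python. The five (x, y) pairs are hypothetical values chosen so the arithmetic is easy to follow; they are not taken from any dataset in these notes.

```python
# Hypothetical data (illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares estimates: slope = Sxy / Sxx, intercept = mean_y - slope * mean_x.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)

slope = s_xy / s_xx                    # B1: change in Y per unit change in X
intercept = mean_y - slope * mean_x    # B0: predicted Y when X = 0

# R-square: share of the variation in Y explained by the fitted line.
fitted = [intercept + slope * xi for xi in x]
ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ss_total = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_residual / ss_total

print(f"Y = {intercept:.2f} + {slope:.2f} X, R-square = {r_squared:.2f}")
```

The residuals, y minus fitted, play the role of the error term: they hold whatever the single independent variable does not explain.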
STEPS
- Analyze- Regression- linear.
- In the Linear Regression dialog box, select a dependent variable: highest year of school completed.
- Select an independent variable: age.
- Interpret the Result.
Model summary
The R- Square value = 0.067, this implies 6.7% of the variations in the highest year of school completed can be explained by age of the respondent hence it is a poor fit.
Coefficient
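Note that since both variables are measured in the same units, the standardized coefficients are used.
The standardized coefficient (-0.259) shows that an increase of one standard deviation in the age of the respondent is associated with a decrease of 0.259 standard deviations in the highest year of school completed.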
MULTIPLE LINEAR REGRESSION MODELS
Under this regression, there is more than one independent variable. Both the independent and dependent variables are quantitative.
Model specification
Suppose one wants to predict a salesperson's total yearly sales (the dependent variable) from independent variables such as years of education (Ed), age (A) and years of experience (Ex).
S = f(Ed, A, Ex)
S = B0 + B1 Ed + B2 A + B3 Ex + ε
Example
Use the data below to answer the questions that follow.
Consumption | Income | Price | Age |
88.9 | 57.5 | 91.7 | 26 |
88.9 | 59.3 | 92 | 36 |
89.1 | 62 | 93.1 | 25 |
88.7 | 56.3 | 90.1 | 12 |
88 | 52.7 | 82.3 | 36 |
85.9 | 44.4 | 76.3 | 25 |
86 | 43.8 | 78.3 | 14 |
87.1 | 47.8 | 84.3 | 52 |
85.4 | 52.1 | 88.1 | 63 |
88.5 | 58 | 88 | 45 |
- Regress consumption on income, price and age.
- Specify the model.
- Interpret all your results.
Ho 1: consumption does not depend on income.
Ha 1: consumption depends on income.
Ho 2: consumption does not depend on price.
Ha 2: consumption depends on price.
Ho 3: consumption does not depend on age.
Ha 3: consumption depends on age.
Procedure
- Analyze >> Regression >> Linear.
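For readers who want to see what the procedure estimates, the regression of consumption on income, price and age can be sketched in plain Python using the ten rows of the table above. This is an illustrative least squares fit, not SPSS output: the helper names (centered, solve) are assumptions, the variables are centered so that only a 3 by 3 system has to be solved, and no standard errors or P-values are computed, so the hypotheses still have to be judged from the SPSS coefficient table.

```python
# Data copied from the example table.
consumption = [88.9, 88.9, 89.1, 88.7, 88.0, 85.9, 86.0, 87.1, 85.4, 88.5]
income = [57.5, 59.3, 62.0, 56.3, 52.7, 44.4, 43.8, 47.8, 52.1, 58.0]
price = [91.7, 92.0, 93.1, 90.1, 82.3, 76.3, 78.3, 84.3, 88.1, 88.0]
age = [26, 36, 25, 12, 36, 25, 14, 52, 63, 45]


def mean(values):
    return sum(values) / len(values)


def centered(values):
    """Subtract the mean from every value."""
    m = mean(values)
    return [v - m for v in values]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x


predictors = [income, price, age]
xc = [centered(p) for p in predictors]
yc = centered(consumption)

# Normal equations on the centered data: (X'X) b = X'y.
xtx = [[sum(u * v for u, v in zip(xi, xj)) for xj in xc] for xi in xc]
xty = [sum(u * v for u, v in zip(xi, yc)) for xi in xc]

slopes = solve(xtx, xty)  # B1 (income), B2 (price), B3 (age)
intercept = mean(consumption) - sum(b * mean(p) for b, p in zip(slopes, predictors))

fitted = [
    intercept + sum(b * p[i] for b, p in zip(slopes, predictors))
    for i in range(len(consumption))
]
residuals = [obs - fit for obs, fit in zip(consumption, fitted)]
r_squared = 1 - sum(e * e for e in residuals) / sum(v * v for v in yc)

print("B0 =", round(intercept, 4))
for name, b in zip(["income", "price", "age"], slopes):
    print(f"B ({name}) =", round(b, 4))
print("R-square =", round(r_squared, 4))
```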