Regression analysis

CORRELATION ANALYSIS

This is a measure of the relationship between two variables.

It tells us how strong the correlation between the two variables is.

The relationship could be negative (-) or positive (+). If the correlation coefficient (r) = 1, then there is a perfect positive correlation between the variables, and if r = -1 then there is a perfect negative correlation between the variables.

If |r| > 0.5, there is a strong relationship between the variables.

If |r| = 0.5, the relationship is moderate.

If |r| < 0.5, there is a weak relationship between the variables.

If r is close to 0, the relationship is very weak or non-existent.

NOTE

In correlation analysis, we analyze the strength, direction and significance of the relationship.

DIRECTION

If the correlation coefficient is negative, it implies that the two variables are moving in opposite directions: as one variable increases, the other decreases. If the correlation coefficient is positive, it implies that the two variables are moving in the same direction: as one variable increases, the other also increases.

Significance of the relationship

If the P-value is less than the level of significance (such as 0.05 or 0.01), then the relationship is significant; otherwise it is not significant.

Testing for correlation

  1. Graphical approach: a scatter plot is used.

The scatter plot illustrates the relationship between the variables, which can be positive, negative or non-existent.

  • Graphs > Legacy Dialogs > Scatter/Dot > Simple Scatter > Define
  • Select the Y-axis and X-axis variables
  • Press OK
  • To add the line of best fit, double-click on the plot and click on add a reference line from equation. A syntax sketch for the plot follows below.
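
Equivalently, the scatter plot can be produced with command syntax. This is a minimal sketch that assumes the two variables are simply named X and Y in the dataset, as in the example data that follows:

* Scatter plot of Y (vertical axis) against X (horizontal axis).
GRAPH
  /SCATTERPLOT(BIVAR)=X WITH Y
  /MISSING=LISTWISE.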

The following are scatter plots for visual interpretation of the types of correlation.

Example

Using the data

Y         X
2441.1    3776.3
2476.9    3843.1
2503.7    3760.3
2619.4    3906.6
2746.1    4148.5
2865.8    4279.8
2969.1    4404.5
3052.2    4539.9
3162.4    4718.6
3223.3    4838
3260.4    4877.5
3240.8    4821

 

Is there any relationship between the variables?

Statistical tests

Pearson correlation coefficient

This is used for quantitative variables such as age and income.

For example

Is there any significant correlation between the age and income of the respondents?

The Hypotheses are stated as follows

Ho: There is no significant correlation between age and income of the respondent.

Ha: There is a significant correlation between age and income of the respondent.

STEPS

  • Analyze > Correlate > Bivariate.
  • Select the variables from the left-hand box into the right-hand box.
  • Select Flag significant correlations.
  • Select the type of correlation coefficient: Pearson.
  • Press OK.
  • Interpret the results. A syntax sketch follows below.
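
The same test can be run from syntax. A sketch, assuming the GSS variables are named age and educ (replace these with the names in your own file):

* Bivariate Pearson correlation between age and highest year of school completed.
CORRELATIONS
  /VARIABLES=age educ
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.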

Example

Interpretation

The correlation coefficient is -0.259; this implies that there is a weak negative correlation between highest year of school completed and age of the respondent. The correlation is significant at the 1% level of significance since the P-value (0.000) < 0.01; thus the null hypothesis is rejected and the conclusion made that there is a significant relationship between highest year of school completed and age of the respondent.

  1. Spearman: deals with ranked data.
  2. Kendall's tau: categorical variables with some order, such as education level. A syntax sketch for both follows below.
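
Both rank-based coefficients are requested from the same Bivariate dialogue, or with syntax along these lines (the variable names educ and age are assumptions):

* Spearman and Kendall's tau-b correlations.
NONPAR CORR
  /VARIABLES=educ age
  /PRINT=BOTH TWOTAIL NOSIG
  /MISSING=PAIRWISE.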

Example

Using the

If the confidence interval does not include the hypothesized value of the population parameter, the null hypothesis is rejected; otherwise it is accepted.

 

 

Chi-square test

It is a test of dependence or association between two variables, which must be categorical, such as marital status, education level, religion, etc.

Example

Does religion of the respondent depend on marital status?

Procedure

Analyze >> Descriptive Statistics >> Crosstabs

Select one variable for a row and another for a column.

Click Statistics >> Chi-square >> Continue

Cells >> row and column percentages >> Continue

Press OK. The equivalent syntax is sketched below.
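
A syntax sketch of the same procedure, assuming the variables are named relig and marital:

* Chi-square test of association between religion and marital status.
CROSSTABS
  /TABLES=relig BY marital
  /STATISTICS=CHISQ
  /CELLS=COUNT ROW COLUMN.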

NOTE:

First state the Hypothesis

Ho: religion of the respondent does not depend on marital status.

Ha: Religion of the respondent depends on marital status.

one-tailed test or a two-tailed test

Should you use a one-tailed test or a two-tailed test for your data analysis?

Quantitative Methodology

Quantitative Results

When creating your data analysis plan or working on your results, you may have to decide if your statistical test should be a one-tailed test or a two-tailed test (also known as “directional” and “non-directional” tests respectively). So, what exactly is the difference between the two? First, it may be helpful to know what the term “tail” means in this context.

The tail refers to the end of the distribution of the test statistic for the particular analysis that you are conducting. For example, a t-test uses the t distribution, and an analysis of variance (ANOVA) uses the F distribution. The distribution of the test statistic can have one or two tails depending on its shape (see the figure below). The black-shaded areas of the distributions in the figure are the tails. Symmetrical distributions like the t and z distributions have two tails. Asymmetrical distributions like the F and chi-square distributions have only one tail. This means that analyses such as ANOVA and chi-square tests do not have a “one-tailed vs. two-tailed” option, because the distributions they are based on have only one tail.

SAMPLE T-TESTS

This is used for testing means.

SAMPLE TESTS IN SPSS

  • One sample t-tests.
  • Paired sample t-test.
  • Independent sample t-test.
  • ANOVA test.

Please always remember that;

  • One sample t-test is used to compare the mean of one variable against a target value.
  • Paired sample t-test is used to compare the means of two variables for a single group.
  • Independent sample t-test is used to compare the means of two distinct groups of cases, e.g. alive or dead, on or off, men or women, etc.
  • ANOVA is used for testing several means.

One sample t-test

One sample t-test is performed when you want to determine if the mean value of a target variable differs from a hypothesized value.

Examples

  • A researcher might want to test whether the average age of respondents differs from 52.
  • A researcher might want to test whether the average mark of students differs from 75.

Assumptions for the one sample t-test

  • The dependent variable is normally distributed within the population.
  • The data are independent.

 

Example 1

A study on the physical strength measured in Kilograms on 7 subjects before and after a specified training period gave the following results.

Subject   Before   After   diff
1         100      115
2         110      125
3         90       105
4         110      130
5         125      140
6         130      140
7         105      125

 

Is there a difference in the physical strength before and after a specified training period?

  • State the hypothesis.
  • Use t-test to show that there is no mean difference in the physical strength before and after a specified training period.

Solution

First compute a new variable diff, the difference between the after value and the before value.

STEPS

  • Transform > Compute Variable.
  • For Target Variable type diff; for Numeric Expression type after - before.
  • Click OK.
  • Analyze > Compare Means > One-Sample T Test.
  • Select diff as the test variable and set the Test Value to 0.
  • Click Options and set the confidence interval to 95%.
  • Under missing values select “exclude cases analysis by analysis”.
  • Click Continue, then OK. A syntax sketch follows below.
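
The same steps expressed as command syntax, assuming the two measurements are stored in variables named before and after:

* Compute the difference and test whether its mean differs from 0.
COMPUTE diff = after - before.
EXECUTE.
T-TEST
  /TESTVAL=0
  /VARIABLES=diff
  /CRITERIA=CI(.95)
  /MISSING=ANALYSIS.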

Interpretation of the results

Ho: there is no significant mean difference in physical strength before and after a specified training period.

Ha: there is a significant mean difference in physical strength before and after a specified training period.

Since the P-value (0.000) < 0.05, the null hypothesis is rejected, implying that there is a significant mean difference in physical strength before and after a specified training period.

THE PAIRED T-TEST

  • The paired sample t-test procedure compares the means of two variables for a single group.
  • It computes the difference between the values of the two variables for each case and tests whether the average difference differs from 0.
  • It is usually used in matched-pairs or case-control studies.

STEPS

  • Analyze > Compare Means > Paired-Samples T Test.
  • Select a pair of variables, as follows:
  • Click each of the two variables.

The first variable appears in the Current Selections group as Variable 1 and the second appears as Variable 2.

  • After you have selected a pair of variables, click the arrow button to move the pair into the Paired Variables list, then click OK.
  • You may select more pairs of variables.
  • To remove a pair of variables from the analysis, select the pair in the Paired Variables list and click the arrow button.
  • Click Options to control the treatment of missing data and the level of the confidence interval. A syntax sketch follows below.
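
A syntax sketch of the paired test, again assuming the two measurements are named before and after:

* Paired-samples t-test on the before/after measurements.
T-TEST PAIRS=before WITH after (PAIRED)
  /CRITERIA=CI(.95)
  /MISSING=ANALYSIS.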

 

Example (page 48)

Independent sample t-test

This is used for testing the means of a variable across two distinct groups of cases, e.g. is there a significant difference in income between male and female respondents?

Assumptions for the independent sample t-test

  • The variances of the dependent variable in the two populations are equal.
  • The dependent variable is normally distributed with in the population.
  • The data are independent (scores of one participant are not dependent on scores of others).

The independent sample t-test procedure

  • It compares means for groups of cases.
  • The subjects should be randomly assigned to two groups so that any difference in the response is due to the treatment or lack of treatment but not to other factors.
  • Always ensure that differences in other factors are not creating or enhancing a significant difference in means.

Example

  • The researcher is interested in seeing whether, in the population, men and women have the same scores on a test.
  • Whether there is a difference in the highest year of school completed between males and females.

STEPS

  • Analyze > Compare Means > Independent-Samples T Test.
  • Select one or more quantitative test variables.
  • Select a single grouping variable.
  • Click Define Groups to specify two codes for the groups you want to compare.
  • Click Options to control the treatment of missing data and the level of the confidence interval.

Example

Page 50

Procedure

  • Analyze > Compare Means > Independent-Samples T Test.
  • Select highest year of school completed as the test variable.
  • Select sex as the grouping variable.
  • Click on Define Groups, use specified values, and put 1 for Group 1 and 2 for Group 2. This is because 1 stands for male and 2 stands for female. The equivalent syntax is sketched below.
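
The equivalent syntax, assuming the GSS variables are named educ and sex, with 1 = male and 2 = female:

* Independent-samples t-test of education by sex.
T-TEST GROUPS=sex(1 2)
  /VARIABLES=educ
  /CRITERIA=CI(.95)
  /MISSING=ANALYSIS.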

The results show two sets of test statistics:

  • Equal variances assumed.
  • Equal variances not assumed.
  • If the F-statistic is significant (the null is rejected), we use the row for equal variances not assumed when interpreting the t-test.
  • If the F-statistic is not significant (the null is accepted), we use the row for equal variances assumed when interpreting the t-test.

NOTE

Levene’s test helps to determine which row to use to make a decision on accepting or rejecting the null hypothesis.

Results are interpreted as below

Ho: there is no significant mean difference in the highest year of school completed between the male and female respondents.

Ha: there is a significant mean difference in the highest year of school completed between the male and female respondents.

Page 51

Put a diagram down.

Page 52

Analysis of variance (ANOVA)

One way Analysis of variance

  • The one way ANOVA procedure produces a one way analysis of variance for a quantitative dependent variable by a single factor (independent) variable.
  • Analysis of variance is used to test the hypothesis that several means are equal.
  • This technique is an extension of the two sample t-test.
  • In addition to determining that differences exist among the means, you may want to know which means differ.
  • Here we can use post hoc tests, which are run after the experiment has been conducted.

Assumption (page 53)

  • Independent random samples: the groups being compared are regarded as distinct populations, so samples from such populations are said to be independent.
  • The populations are normally distributed.
  • The population variances are equal.

 

STEPS FOR A ONE-WAY ANALYSIS OF VARIANCE

  • Analyze- compare means- one way ANOVA.
  • Select one or more dependent variables.
  • Select a single independent factor variable.

Post hoc

If we want to know which groups are significantly different from each other, use Post Hoc and select the Bonferroni test.

Which group would you recommend?

Under Options, click on Means plot; also select Descriptive, then Continue.

 

STEPS

  • Analyze > Compare Means > Means.
  • Select the dependent variable (continuous/quantitative).
  • Select the independent variable (categorical).
  • Options > select Mean > Continue > OK. A syntax sketch follows below.
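
A sketch of the equivalent Means syntax, with educ and relig used as assumed variable names:

* Mean education by religion.
MEANS TABLES=educ BY relig
  /CELLS=MEAN COUNT STDDEV.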

Page 54

Using the GSS 93 subject data,

  • Is there any significant difference in the highest year of school completed (education) by religion?
  • Which religion would you recommend to a respondent and why?
  • Which religions are significantly different from each other?

STEPS (Page 54)

  • Analyze > Compare Means > One-Way ANOVA.
  • Dependent list: Educ.
  • Factor: Relig.
  • Options.
  • Select Post Hoc > Bonferroni.
  • Under Options, click Means plot and Descriptive. The equivalent syntax is sketched below.
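
The equivalent one-way ANOVA syntax, again assuming the GSS variable names educ and relig:

* One-way ANOVA of education by religion with Bonferroni post hoc tests and a means plot.
ONEWAY educ BY relig
  /STATISTICS DESCRIPTIVES
  /PLOT MEANS
  /MISSING ANALYSIS
  /POSTHOC=BONFERRONI ALPHA(0.05).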

Interpretation:

Ho: there is no significant mean difference in the highest year of school completed by different religions.

Ha: there is a significant mean difference in the highest year of school completed by different religions.

Page (55)

Since the P-value (0.000) < 0.05, the null hypothesis is rejected and the conclusion made that there is a significant mean difference in the highest year of school completed by different religions.

Answer to Questions 2

This has two approaches: using the Descriptives table and the means plot.

Regression Analysis

In regression Analysis we consider two sets of variables.

  1. Independent variables.
  2. Dependent variables.

Regression is important in enabling prediction of the dependent variable, given the values of the independent variables, using the formulated regression equation.

SIMPLE LINEAR REGRESSION MODEL

There are only two variables which must be quantitative in nature.

It always takes the form Y = B0 + B1X + ε, where ε is the error term.

Where Y = dependent variable / endogenous variable / regressand / explained variable.

X = independent variable / exogenous variable / regressor / explanatory variable.

B1 and B0 = coefficients of the regression.

NOTE:

  • Independent variables (X) are variables that drive or determine other variables or relationships.
  • Dependent variables (Y) are variables that are caused by or influenced by the independent variables.
  • B0 is the intercept, i.e. the value of Y when the independent variable is not in play (when X = 0).
  • B1 is the change in Y caused by a unit change in X.
  • ε captures other factors that affect the dependent variable, e.g. seasonality or the product sold.

STEPS

  • Analyze > Regression > Linear.
  • In the Linear Regression dialogue box, select highest year of school completed as the dependent variable.
  • Select age as the independent variable.
  • Interpret the result. A syntax sketch follows below.
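
A syntax sketch of the same regression, assuming the GSS variables are named educ and age:

* Simple linear regression of education on age.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT educ
  /METHOD=ENTER age.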

Page 61

Model summary

The R-Square value = 0.067; this implies that 6.7% of the variation in the highest year of school completed can be explained by the age of the respondent, hence it is a poor fit.

Coefficient

Note the standardized coefficients, which allow comparison even when the variables are not measured in the same units.

The coefficient (-0.259) shows that as the age of the respondent increases by one unit, the highest year of school completed decreases.

MULTIPLE LINEAR REGRESSION MODELS

Under this regression, there is more than one independent variable. Both the independent and dependent variables are quantitative.

Model specification

If one wants to predict a salesperson's total yearly sales (dependent variable) from independent variables such as years of education (Ed), age (A) and years of experience (Ex):

S = f(Ed, A, Ex)

S = B0 + B1Ed + B2A + B3Ex + ε

Example

Use the data below to answer the questions that follow.

Consumption   Income   Price   Age
88.9          57.5     91.7    26
88.9          59.3     92      36
89.1          62       93.1    25
88.7          56.3     90.1    12
88            52.7     82.3    36
85.9          44.4     76.3    25
86            43.8     78.3    14
87.1          47.8     84.3    52
85.4          52.1     88.1    63
88.5          58       88      45

 

  1. Regress consumption on income, price and age.
  2. Specify the model.
  3. Interpret all your results.

Ho 1: consumption does not depend on income.

Ha 1: consumption depends on income.

Ho 2: consumption does not depend on price.

Ha 2: consumption depends on price.

Ho 3: consumption does not depend on age.

Ha 3: consumption depends on age.

Procedure

  • Analyze >> Regression >> Linear. The equivalent syntax is sketched below.
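
A syntax sketch, assuming the four columns above are entered as variables named consumption, income, price and age:

* Multiple linear regression of consumption on income, price and age.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT consumption
  /METHOD=ENTER income price age.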

 

 

 

 

REGRESSION ANALYSIS

 

HOW TO PERFORM REGRESSION IN SPSS

LINEAR REGRESSION ANALYSIS USING SPSS STATISTICS

Introduction

Linear regression is the next step up after correlation. It is used when we want to predict the value of a variable based on the value of another variable. The variable we want to predict is called the dependent variable (or sometimes, the outcome variable). The variable we are using to predict the other variable’s value is called the independent variable (or sometimes, the predictor variable). For example, you could use linear regression to understand whether exam performance can be predicted based on revision time; whether cigarette consumption can be predicted based on smoking duration; and so forth. If you have two or more independent variables, rather than just one, you need to use multiple regression analysis.

This “quick start” guide shows you how to carry out linear regression using SPSS Statistics, as well as interpret and report the results from this test. However, before we introduce you to this procedure, you need to understand the different assumptions that your data must meet in order for linear regression to give you a valid result. We discuss these assumptions next.

SPSS Statistics

Assumptions

When you choose to analyse your data using linear regression, part of the process involves checking to make sure that the data you want to analyse can actually be analysed using linear regression. You need to do this because it is only appropriate to use linear regression if your data “passes” seven assumptions that are required for linear regression to give you a valid result. In practice, checking for these seven assumptions just adds a little bit more time to your analysis, requiring you to click a few more buttons in SPSS Statistics when performing your analysis, as well as think a little bit more about your data, but it is not a difficult task.

Before we introduce you to these seven assumptions, do not be surprised if, when analysing your own data using SPSS Statistics, one or more of these assumptions is violated (i.e., not met). This is not uncommon when working with real-world data rather than textbook examples, which often only show you how to carry out linear regression when everything goes well! However, don’t worry. Even when your data fails certain assumptions, there is often a solution to overcome this. First, let’s take a look at these seven assumptions:

Assumption one: Your dependent variable should be measured at the continuous level (i.e., it is either an interval or ratio variable). Examples of continuous variables include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so forth.

Assumption two: Your independent variable should also be measured at the continuous level (i.e., it is either an interval or ratio variable). See the bullet above for examples of continuous variables.

Assumption three: There needs to be a linear relationship between the two variables. Whilst there are a number of ways to check whether a linear relationship exists between your two variables, we suggest creating a scatterplot using SPSS Statistics where you can plot the dependent variable against your independent variable and then visually inspect the scatterplot to check for linearity. Your scatterplot may look something like one of the following:

If the relationship displayed in your scatterplot is not linear, you will have to either run a non-linear regression analysis, perform a polynomial regression or “transform” your data, which you can do using SPSS Statistics. In our enhanced guides, we show you how to:

  1. a) Create a scatterplot to check for linearity when carrying out linear regression using SPSS Statistics;

(b) Interpret different scatterplot results; and

(c) Transform your data using SPSS Statistics if there is not a linear relationship between your two variables.

Assumption Four: There should be no significant outliers. An outlier is an observed data point that has a dependent variable value that is very different to the value predicted by the regression equation. As such, an outlier will be a point on a scatterplot that is (vertically) far away from the regression line indicating that it has a large residual, as highlighted below:

The problem with outliers is that they can have a negative effect on the regression analysis (e.g., reduce the fit of the regression equation) that is used to predict the value of the dependent (outcome) variable based on the independent (predictor) variable. This will change the output that SPSS Statistics produces and reduce the predictive accuracy of your results. Fortunately, when using SPSS Statistics to run a linear regression on your data, you can easily include criteria to help you detect possible outliers. In our enhanced linear regression guide, we: (a) show you how to detect outliers using “casewise diagnostics”, which is a simple process when using SPSS Statistics; and (b) discuss some of the options you have in order to deal with outliers.

  • Assumption FIVE: You should have independence of observations, which you can easily check using the Durbin-Watson statistic, which is a simple test to run using SPSS Statistics. We explain how to interpret the result of the Durbin-Watson statistic in our enhanced linear regression guide.
  • Assumption SIX: Your data needs to show homoscedasticity, which is where the variances along the line of best fit remain similar as you move along the line. Whilst we explain more about what this means and how to assess the homoscedasticity of your data in our enhanced linear regression guide, take a look at the three scatterplots below, which provide three simple examples: two of data that fail the assumption (called heteroscedasticity) and one of data that meets this assumption (called homoscedasticity):

Whilst these help to illustrate the differences in data that meets or violates the assumption of homoscedasticity, real-world data can be a lot messier and illustrate different patterns of heteroscedasticity. Therefore, in our enhanced linear regression guide, we explain: (a) some of the things you will need to consider when interpreting your data; and (b) possible ways to continue with your analysis if your data fails to meet this assumption.

  • Assumption SEVEN: Finally, you need to check that the residuals (errors) of the regression line are approximately normally distributed (we explain these terms in our enhanced linear regression guide). Two common methods to check this assumption include using either a histogram (with a superimposed normal curve) or a Normal P-P Plot. Again, in our enhanced linear regression guide, we: (a) show you how to check this assumption using SPSS Statistics, whether you use a histogram (with superimposed normal curve) or Normal P-P Plot; (b) explain how to interpret these diagrams; and (c) provide a possible solution if your data fails to meet this assumption.

 

EXAMPLES

SPSS Statistics

Example

A salesperson for a large car brand wants to determine whether there is a relationship between an individual’s income and the price they pay for a car. As such, the individual’s “income” is the independent variable and the “price” they pay for a car is the dependent variable. The salesperson wants to use this information to determine which cars to offer potential customers in new areas where average income is known.

SPSS Statistics

Setup in SPSS Statistics

In SPSS Statistics, we created two variables so that we could enter our data: Income (the independent variable), and Price (the dependent variable). It can also be useful to create a third variable, caseno, to act as a chronological case number. This third variable is used to make it easy for you to eliminate cases (e.g., significant outliers) that you have identified when checking for assumptions. However, we do not include it in the SPSS Statistics procedure that follows because we assume that you have already checked these assumptions. In our enhanced linear regression guide, we show you how to correctly enter data in SPSS Statistics to run a linear regression when you are also checking for assumptions.

SPSS Statistics\Test Procedure in SPSS Statistics

The steps below show you how to analyse your data using linear regression in SPSS Statistics when none of the seven assumptions in the previous section, Assumptions, have been violated. At the end of these steps, we show you how to interpret the results from your linear regression. If you are looking for help to make sure your data meets assumptions #3, #4, #5, #6 and #7, which are required when using linear regression and can be tested using SPSS Statistics, you can learn more about our enhanced guides on our Features: Overview page.

Note: The procedure that follows is identical for SPSS Statistics versions 18 to 28, as well as the subscription version of SPSS Statistics, with version 28 and the subscription version being the latest versions of SPSS Statistics. However, in version 27 and the subscription version, SPSS Statistics introduced a new look to their interface called “SPSS Light”, replacing the previous look for versions 26 and earlier versions, which was called “SPSS Standard”. Therefore, if you have SPSS Statistics versions 27 or 28 (or the subscription version of SPSS Statistics), the images that follow will be light grey rather than blue. However, the procedure is identical.

  1. Click Analyze > Regression > Linear... on the top menu, as shown below:

Published with written permission from SPSS Statistics, IBM Corporation.

You will be presented with the Linear Regression dialogue box:

Published with written permission from SPSS Statistics, IBM Corporation.

  1. Transfer the independent variable, Income, into the Independent(s): box and the dependent variable, Price, into the Dependent: box. You can do this by either drag-and-dropping the variables or by using the appropriate buttons. You will end up with the following screen:

Published with written permission from SPSS Statistics, IBM Corporation.

  1. You now need to check four of the assumptions discussed in the Assumptions section above: no significant outliers (assumption #4); independence of observations (assumption #5); homoscedasticity (assumption #6); and normal distribution of errors/residuals (assumption #7). You can do this by using the relevant buttons in this dialogue box and then selecting the appropriate options within the dialogue boxes they open. In our enhanced linear regression guide, we show you which options to select in order to test whether your data meets these four assumptions.
  2. Click on the OK button. This will generate the results.

HOW TO INTERPRET REGRESSION RESULTS

THE OUTPUT OF SPSS ON REGRESSION ANALYSIS

Adjusted R-squared

Adjusted R-squared is a statistical measure that is closely related to the more commonly known R-squared (R²) value in the context of linear regression analysis. While R-squared measures the proportion of the variance in the dependent variable (the variable being predicted) that is explained by the independent variables (the predictors) in a regression model, Adjusted R-squared takes into account the number of independent variables used in the model, providing a more conservative and useful assessment of model fit.

R-squared (R²): R-squared is a measure of how well the independent variables in a regression model explain the variability in the dependent variable. It ranges from 0 to 1, with higher values indicating a better fit. Specifically, R-squared represents the proportion of the total variation in the dependent variable that is explained by the model. However, as you add more independent variables to the model, R-squared tends to increase, even if the additional variables do not significantly improve the model’s predictive power. This can lead to overfitting, where the model fits the training data extremely well but may not generalize well to new data. Adjusted R-squared: Adjusted R-squared addresses the issue of overfitting by penalizing the inclusion of unnecessary independent variables in the model. It takes into account the number of predictors in the model and adjusts R-squared accordingly.

The formula for Adjusted R-squared is; Adjusted R² = 1 – [(1 – R²) * (n – 1) / (n – k – 1)]

R² is the regular R-squared.

n is the number of data points (observations).

k is the number of independent variables in the model.
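
As a worked illustration with made-up figures: if R² = 0.762 from a model with n = 20 observations and k = 1 predictor, then Adjusted R² = 1 – [(1 – 0.762) × (20 – 1) / (20 – 1 – 1)] = 1 – (0.238 × 19/18) ≈ 0.749, slightly below the unadjusted R².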

Adjusted R-squared will always be lower than R-squared when you have multiple independent variables, and it tends to decrease as you add irrelevant or redundant variables to the model. It provides a more realistic assessment of the model’s fit by accounting for model complexity.

 

In summary, while R-squared tells you how well your regression model fits the data, Adjusted R-squared helps you determine whether the improvement in model fit achieved by adding more independent variables is justified by the increased complexity. It is a useful tool for model selection and comparison, as it encourages the use of simpler models that explain the data adequately without unnecessary complexity.

 

 

 

 

 

 

 

SPSS Statistics will generate quite a few tables of output for a linear regression. In this section, we show you only the three main tables required to understand your results from the linear regression procedure, assuming that no assumptions have been violated. A complete explanation of the output you have to interpret when checking your data for the six assumptions required to carry out linear regression is provided in our enhanced guide. This includes relevant scatterplots, histogram (with superimposed normal curve), Normal P-P Plot, casewise diagnostics and the Durbin-Watson statistic. Below, we focus on the results for the linear regression analysis only.

The first table of interest is the Model Summary table, as shown below:

Published with written permission from SPSS Statistics, IBM Corporation.

This table provides the R and R2 values. The R value represents the simple correlation and is 0.873 (the “R” Column), which indicates a high degree of correlation. The R2 value (the “R Square” column) indicates how much of the total variation in the dependent variable, Price, can be explained by the independent variable, Income. In this case, 76.2% can be explained, which is very large.

The next table is the ANOVA table, which reports how well the regression equation fits the data (i.e., predicts the dependent variable) and is shown below:

Published with written permission from SPSS Statistics, IBM Corporation.

This table indicates that the regression model predicts the dependent variable significantly well. How do we know this? Look at the “Regression” row and go to the “Sig.” column. This indicates the statistical significance of the regression model that was run. Here, p < 0.0005, which is less than 0.05, and indicates that, overall, the regression model statistically significantly predicts the outcome variable (i.e., it is a good fit for the data).

The Coefficients table provides us with the necessary information to predict price from income, as well as determine whether income contributes statistically significantly to the model (by looking at the “Sig.” column). Furthermore, we can use the values in the “B” column under the “Unstandardized Coefficients” column, as shown below:

Published with written permission from SPSS Statistics, IBM Corporation.

to present the regression equation as:

Price = 8287 + 0.564(Income)
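
The equation can then be used for prediction. For instance, for a hypothetical customer with an income of 30,000, the predicted price would be 8287 + (0.564 × 30,000) = 25,207, in the same units as the original data.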

 

CORRELATION ANALYSIS

Pearson Correlation

The bivariate Pearson Correlation produces a sample correlation coefficient, r, which measures the strength and direction of linear relationships between pairs of continuous variables. By extension, the Pearson Correlation evaluates whether there is statistical evidence for a linear relationship among the same pairs of variables in the population, represented by a population correlation coefficient, ρ (“rho”). The Pearson Correlation is a parametric measure.

This measure is also known as:

  • Pearson’s correlation
  • Pearson product-moment correlation (PPMC)

Common Uses

The bivariate Pearson Correlation is commonly used to measure the following:

  • Correlations among pairs of variables
  • Correlations within and between sets of variables

The bivariate Pearson correlation indicates the following:

  • Whether a statistically significant linear relationship exists between two continuous variables
  • The strength of a linear relationship (i.e., how close the relationship is to being a perfectly straight line)
  • The direction of a linear relationship (increasing or decreasing)

Note: The bivariate Pearson Correlation cannot address non-linear relationships or relationships among categorical variables. If you wish to understand relationships that involve categorical variables and/or non-linear relationships, you will need to choose another measure of association.

Note: The bivariate Pearson Correlation only reveals associations among continuous variables. The bivariate Pearson Correlation does not provide any inferences about causation, no matter how large the correlation coefficient is.

Data Requirements

To use Pearson correlation, your data must meet the following requirements:

  1. Two or more continuous variables (i.e., interval or ratio level)
  2. Cases must have non-missing values on both variables
  3. Linear relationship between the variables
  4. Independent cases (i.e., independence of observations)
    • There is no relationship between the values of variables between cases. This means that:
      • the values for all variables across cases are unrelated
      • for any case, the value for any variable cannot influence the value of any variable for other cases
      • no case can influence another case on any variable
    • The bivariate Pearson correlation coefficient and corresponding significance test are not robust when independence is violated.
  5. Bivariate normality
    • Each pair of variables is bivariately normally distributed
    • Each pair of variables is bivariately normally distributed at all levels of the other variable(s)
    • This assumption ensures that the variables are linearly related; violations of this assumption may indicate that non-linear relationships among variables exist. Linearity can be assessed visually using a scatterplot of the data.
  6. Random sample of data from the population
  7. No outliers

Hypotheses

The null hypothesis (H0) and alternative hypothesis (H1) of the significance test for correlation can be expressed in the following ways, depending on whether a one-tailed or two-tailed test is requested:

Two-tailed significance test:

H0: ρ = 0 (“the population correlation coefficient is 0; there is no association”)
H1: ρ ≠ 0 (“the population correlation coefficient is not 0; a nonzero correlation could exist”)

One-tailed significance test:

H0: ρ = 0 (“the population correlation coefficient is 0; there is no association”)
H1: ρ > 0 (“the population correlation coefficient is greater than 0; a positive correlation could exist”)
OR
H1: ρ < 0 (“the population correlation coefficient is less than 0; a negative correlation could exist”)

where ρ is the population correlation coefficient.

Test Statistic

The sample correlation coefficient between two variables x and y is denoted r or rxy, and can be computed as:

r = Σ(xi − x̄)(yi − ȳ) / √[ Σ(xi − x̄)² × Σ(yi − ȳ)² ]

where x̄ and ȳ are the sample means of x and y, and the sums run over the paired observations.

Run a Bivariate Pearson Correlation

To run a bivariate Pearson Correlation in SPSS, click Analyze > Correlate > Bivariate.

The Bivariate Correlations window opens, where you will specify the variables to be used in the analysis. All of the variables in your dataset appear in the list on the left side. To select variables for the analysis, select the variables in the list on the left and click the blue arrow button to move them to the right, in the Variables field.

A. Variables: The variables to be used in the bivariate Pearson Correlation. You must select at least two continuous variables, but may select more than two. The test will produce correlation coefficients for each pair of variables in this list.

B. Correlation Coefficients: There are multiple types of correlation coefficients. By default, Pearson is selected. Selecting Pearson will produce the test statistics for a bivariate Pearson Correlation.

C. Test of Significance: Click Two-tailed or One-tailed, depending on your desired significance test. SPSS uses a two-tailed test by default.

D. Flag significant correlations: Checking this option will include asterisks (**) next to statistically significant correlations in the output. By default, SPSS marks statistical significance at the alpha = 0.05 and alpha = 0.01 levels, but not at the alpha = 0.001 level (which is treated as alpha = 0.01).

E. Options: Clicking Options will open a window where you can specify which Statistics to include (i.e., Means and standard deviations; Cross-product deviations and covariances) and how to address Missing Values (i.e., Exclude cases pairwise or Exclude cases listwise). Note that the pairwise/listwise setting does not affect your computations if you are only entering two variables, but can make a very large difference if you are entering three or more variables into the correlation procedure.

Example: Understanding the linear association between weight and height

PROBLEM STATEMENT

Perhaps you would like to test whether there is a statistically significant linear relationship between two continuous variables, weight and height (and by extension, infer whether the association is significant in the population). You can use a bivariate Pearson Correlation to test whether there is a statistically significant linear relationship between height and weight, and to determine the strength and direction of the association.

BEFORE THE TEST

In the sample data, we will use two variables: “Height” and “Weight.” The variable “Height” is a continuous measure of height in inches and exhibits a range of values from 55.00 to 84.41 (Analyze > Descriptive Statistics > Descriptives). The variable “Weight” is a continuous measure of weight in pounds and exhibits a range of values from 101.71 to 350.07.

Before we look at the Pearson correlations, we should look at the scatterplots of our variables to get an idea of what to expect. In particular, we need to determine if it’s reasonable to assume that our variables have linear relationships. Click Graphs > Legacy Dialogs > Scatter/Dot. In the Scatter/Dot window, click Simple Scatter, then click Define. Move variable Height to the X Axis box, and move variable Weight to the Y Axis box. When finished, click OK.

To add a linear fit like the one depicted, double-click on the plot in the Output Viewer to open the Chart Editor. Click Elements > Fit Line at Total. In the Properties window, make sure the Fit Method is set to Linear, then click Apply. (Notice that adding the linear regression trend line will also add the R-squared value in the margin of the plot. If we take the square root of this number, it should match the value of the Pearson correlation we obtain.)

From the scatterplot, we can see that as height increases, weight also tends to increase. There does appear to be some linear relationship.

RUNNING THE TEST

To run the bivariate Pearson Correlation, click Analyze > Correlate > Bivariate. Select the variables Height and Weight and move them to the Variables box. In the Correlation Coefficients area, select Pearson. In the Test of Significance area, select your desired significance test, two-tailed or one-tailed. We will select a two-tailed significance test in this example. Check the box next to Flag significant correlations.

Click OK to run the bivariate Pearson Correlation. Output for the analysis will display in the Output Viewer.

Syntax

CORRELATIONS   /VARIABLES=Weight Height   /PRINT=TWOTAIL NOSIG   /MISSING=PAIRWISE.

OUTPUT

Tables

The results will display the correlations in a table, labeled Correlations.

A Correlation of Height with itself (r=1), and the number of non-missing observations for height (n=408).

B Correlation of height and weight (r=0.513), based on n=354 observations with pairwise nonmissing values.

C Correlation of height and weight (r=0.513), based on n=354 observations with pairwise nonmissing values.

D Correlation of weight with itself (r=1), and the number of nonmissing observations for weight (n=376).

The important cells we want to look at are either B or C. (Cells B and C are identical, because they include information about the same pair of variables.) Cells B and C contain the correlation coefficient for the correlation between height and weight, its p-value, and the number of complete pairwise observations that the calculation was based on.

The correlations in the main diagonal (cells A and D) are all equal to 1. This is because a variable is always perfectly correlated with itself. Notice, however, that the sample sizes are different in cell A (n=408) versus cell D (n=376). This is because of missing data — there are more missing observations for variable Weight than there are for variable Height.

If you have opted to flag significant correlations, SPSS will mark a 0.05 significance level with one asterisk (*) and a 0.01 significance level with two asterisks (**). In cell B (repeated in cell C), we can see that the Pearson correlation coefficient for height and weight is .513, which is significant (p < .001 for a two-tailed test), based on 354 complete observations (i.e., cases with nonmissing values for both height and weight).

DECISION AND CONCLUSIONS

Based on the results, we can state the following:

  • Weight and height have a statistically significant linear relationship (r = .513, p < .001).
  • The direction of the relationship is positive (i.e., height and weight are positively correlated), meaning that these variables tend to increase together (i.e., greater height is associated with greater weight).
  • The magnitude, or strength, of the association is approximately moderate (.3 < |r| < .5).

 

 

HOW TO CARRY OUT DESCRIPTIVE STATISTICS IN SPSS

Variable View:

Data View:

The data can be summarized via the ‘Descriptive Statistics’ part of the ‘Analyse’ menu. Let’s explore:

Analyze > Descriptive Statistics > Frequencies.

This will open up a new dialogue box:

We can use this to make frequency tables of the variables. Statistics can be calculated and graphs can be drawn. Let’s make frequency tables for the categorical data: Passenger Class, Passenger Gender and Passenger Survived.

Ensure that ‘Display frequency tables’ is checked. Click on one of the three categories required in the left-most box and then press the arrow button in the middle to move it to the right-hand box. Do the same for the other two categories.

Click on the ‘Charts’ button and choose the option to draw a bar chart (or a pie chart if you prefer). Press ‘Continue’ then ‘OK’. Output will now be written to the output window. The equivalent syntax is sketched below.
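
The same output can also be produced with syntax along these lines (the variable names Class, Gender and Survived are assumptions; use the names shown in your Variable View):

* Frequency tables and bar charts for the three categorical variables.
FREQUENCIES VARIABLES=Class Gender Survived
  /BARCHART FREQ
  /ORDER=ANALYSIS.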

The frequency tables summarising the data are:

You should also find the charts that were selected.

The descriptive statistics feature of SPSS can also give summary statistics such as the mean, median and standard deviation. We have some scale data in the form of the passenger’s age. Go back to:

Analyse > Descriptive Statistics > Frequencies, and return the previously moved categories back to the left box. Move the ‘Passenger’s Age’ variable to the right box.

Choose the measures that you would like to get by clicking the check boxes.

Click Continue. The frequency table is not needed for these data, so uncheck ‘Display frequency tables’ and uncheck the chart options inside the Charts menu.

When ready click ‘OK’

Above the chosen options were Mean, Median, Mode, Std. Deviation, Minimum and Maximum.

 

Continuous variables can also be analysed using the ‘Descriptives’ menu in SPSS. Go to Analyse -> Descriptive Statistics -> Descriptives.

First move the variable ‘Passenger Age’ to the ‘Variable(s)’ section.

Click on the ‘Options’ button. We can choose the statistics we want computed. Select those shown in the picture below:

The mean is a measure of average (sum of the values divided by the number of values). Standard deviation measures the spread of the data and can be used to describe normal distributions. Skewness is a measure of how symmetrical the distribution is. Values of skewness close to 0 represent symmetry, positive values mean that there are some high valued outliers and a negative value means some low valued outliers.

(image from www.managedfuturesinvesting.com)

Kurtosis values refer to how peaked the distribution is. A normal distribution would have a value of 0. Negative values mean that the distribution is flat (i.e., many cases in the extremes) and positive values mean that the distribution is clustered in the centre.

Choose ‘Continue’ and ‘OK’.

The descriptives provide the requested information about the passenger age on the Titanic. A syntax sketch follows below.
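
A syntax sketch of the same Descriptives request, assuming the age variable is named Age in the dataset:

* Summary statistics for passenger age.
DESCRIPTIVES VARIABLES=Age
  /STATISTICS=MEAN STDDEV MIN MAX SKEWNESS KURTOSIS.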

 

SPSS BAR CHART

SPSS BAR CHARTS FOR CATEGORICAL VARIABLE

Our frequency table provides us with the necessary information but we need to look at it carefully for drawing conclusions. Doing so is greatly facilitated by creating a simple bar chart with bars representing frequencies. The fastest way to do so is including it in our FREQUENCIES command but this doesn’t allow us to add a title. We’ll therefore do it differently, as shown below.
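
One option is the legacy GRAPH command, which accepts a title. A minimal sketch, assuming a categorical variable named marital:

* Simple bar chart of counts with a chart title.
GRAPH
  /BAR(SIMPLE)=COUNT BY marital
  /TITLE='Marital status of respondents'.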

 

 

 

 

Categorical variables

Summary Measures for Categorical Data

Last Updated: 2021-03-22

For categorical data, the most typical summary measure is the number or percentage of cases in each category. The mode is the category with the greatest number of cases. For ordinal data, the median (the value at which half of the cases fall above and below) may also be a useful summary measure if there is a large number of categories.

The Frequencies procedure produces frequency tables that display both the number and percentage of cases for each observed value of a variable.

From the menus choose:

Analyze > Descriptive Statistics > Frequencies…

Note: This feature requires Statistics Base Edition.

Select Owns PDA [ownpda] and Owns TV [owntv] and move them into the Variable(s) list.

Figure 1. Categorical variables selected for analysis

Click OK to run the procedure.

Figure 2. Frequency tables

The frequency tables are displayed in the Viewer window. The frequency tables reveal that only 20.4% of the people own PDAs, but almost everybody owns a TV (99.0%). These might not be interesting revelations, although it might be interesting to find out more about the small group of people who do not own televisions.

 

 

 

 

 

 

 

 

 

 

 

 

 

HOW TO ENTER QUESTIONNAIRE ON SPSS

 

EXAMPLE OF QUESTIONNAIRE

 

Section A:      Respondent Details:

Gender:    Male                                      Female

Level of Education:

Postgraduate             Bachelor Degree Diploma               Certificate           None of These

Department

Clinical               Records Management/M&E                    Other(specify)

Cadre

Medical Officer     Records/Data Officer    Nurse/Midwife    Clinical Officer    Other

 

Steps in entering the questionnaire in SPSS

JUST TYPE

DIFFERENT COLUMNS

NAME: This is the short name you give to a specific variable, for example Gender, Age, Education. You must not leave any space when writing the name.

Type: this indicates the type of data you are going to enter into the system.

There are different types of data: Numeric, Comma, Dot, Scientific notation, Date, Dollar, Custom currency, String and Restricted Numeric (integer).

You choose according to the type of data you have. Based on the type of data I have, I will choose Numeric because I am going to give the values labels, but I could as well choose String if I have the energy of typing.

 

Question entered in SPSS file

DECIMALS: This indicates the number of decimal places in a specific number; if the number does not have decimals, indicate zero (0).

Label: This is the descriptive name of the variable, for example Gender of respondents.

Values: These are the value labels assigned to the categories of categorical variables, for example gender.

 

Technical capacity: Here you are requested to indicate the level at which you agree with the statement by circling a number on a scale of 1-5. The keys have been displayed below, where SA=Strongly Agree, A=Agree, NS=Not Sure, D=Disagree, SD=Strongly Disagree.

No | Question | SA A NS D SD
1 | Ibanda district employs competent staff | 5 4 3 2 1
2 | The staff have required technical capacity to execute their tasks | 5 4 3 2 1
3 | Staff are placed in departments for which they qualify | 5 4 3 2 1
4 | The staff employed have the skills that match their job | 5 4 3 2 1
5 | Employees are taken through professional development courses periodically | 5 4 3 2 1
6 | Employees are given enough orientation to ensure that they are well equipped for their roles | 5 4 3 2 1
7 | The staff have enough experience to execute the tasks at hand | 5 4 3 2 1
8 | The staff in the public level III and IV health facilities in Ibanda District have the technical capacity to collect M&E data | 5 4 3 2 1
9 | The staff in the public level III and IV health facilities in Ibanda District have the technical capacity to analyze M&E data | 5 4 3 2 1
10 | The staff in the public level III and IV health facilities in Ibanda District have the technical capacity to prepare M&E reports using the M&E findings | 5 4 3 2 1
11 | The number of staff at the public level III and IV health facilities for the different tasks is adequate | 5 4 3 2 1
12 | The staff in the public level III and IV health facilities in Ibanda District have the technical capacity to interpret M&E reports using the M&E findings | 5 4 3 2 1

 

 

 

 

 

VALUES

 

 

 

 

 

 

HOW TO ADD IN SPSS AND CREATE A NEW VARIABLE

Click Transform > Compute Variable.

The Compute Variable dialogue box opens; here you type the target variable name and the numeric expression.
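
A syntax sketch of the same operation, using hypothetical item variables q1, q2 and q3 summed into a new total score:

* Create a new variable as the sum of three existing items (variable names are assumptions).
COMPUTE total_score = q1 + q2 + q3.
EXECUTE.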

 

 

 

 

 

 

 

Analysis of addition in SPSS

 

 

 

 

 

 

 

 

 

CREATE A FREQUENCY TABLE IN SPSS

In SPSS, the Frequencies procedure can produce summary measures for categorical variables in the form of frequency tables, bar charts, or pie charts.

 

 

 

TO RUN THE FREQUENCIES PROCEDURE, CLICK ANALYZE > DESCRIPTIVE STATISTICS > FREQUENCIES.

 

A Variable(s): The variables to produce Frequencies output for. To include a variable for analysis, double-click on its name to move it to the Variables box. Moving several variables to this box will create several frequency tables at once.

B Statistics: Opens the Frequencies: Statistics window, which contains various descriptive statistics.

The vast majority of the descriptive statistics available in the Frequencies: Statistics window are never appropriate for nominal variables, and are rarely appropriate for ordinal variables in most situations. There are two exceptions to this:

The Mode (which is the most frequent response) has a clear interpretation when applied to most nominal and ordinal categorical variables.

The Values are group midpoints option can be applied to certain ordinal variables that have been coded in such a way that their value takes on the midpoint of a range. For example, this would be the case if you had measured subjects’ ages and had coded anyone between the ages of 20 and 29 as 25, or anyone between the ages of 30 and 39 as 35.

If your categorical variables are coded numerically, it is very easy to mis-use measures like the mean and standard deviation. SPSS will compute those statistics if they are requested, regardless of whether or not they are meaningful. It is up to the researcher to determine if these measures are appropriate for their data. In general, you should never use any of these statistics for dichotomous variables or nominal variables, and should only use these statistics with caution for ordinal variables.

 

C Charts: Opens the Frequencies: Charts window, which contains various graphical options. Options include bar charts, pie charts, and histograms. For categorical variables, bar charts and pie charts are appropriate. Histograms should only be used for continuous variables; they should not be used for ordinal variables, and should never be used with nominal variables.

  • Bar chart displays the categories on the graph’s x-axis, and either the frequencies or the percentages on the y-axis.
  • Pie chart depicts the categories of a variable as “slices” of a circular “pie”.

Note that the options in the Chart Values area apply only to bar charts and pie charts. In particular, these options affect whether the labeling for the pie slices or the y-axis of the bar chart uses counts or percentages. This setting will be greyed out if Histograms is selected.

D Format: Opens the Frequencies: Format window, which contains options for how to sort and organize the table output.

The Order by options affect only categorical variables:

  • Ascending values arranges the rows of the frequency table in increasing order with respect to the category values: (alphabetically if string, or by numeric code if numeric)
  • Descending values arranges the rows of the frequency table in decreasing order with respect to the category values.
    • Note: If your categorical variable is coded numerically as 0, 1, 2, …, sorting by ascending or descending value will arrange the rows with respect to the numeric code, not with respect to any assigned labels.
  • Ascending counts orders the rows of the frequency table from least frequent (lowest count) to most frequent (highest count).
  • Descending counts orders the rows of the frequency table from most frequent (highest count) to least frequent (lowest count).

When working with two or more categorical variables, the Multiple Variables options only affects the order of the output. If Compare variables is selected, then the frequency tables for all of the variables will appear first, and all of the graphs for the variables will appear after. If Organize output by variables is selected, then the frequency table and graph for the first variable will appear together; then the frequency table and graph for the second variable will appear together; etc.

E Display frequency tables: When checked, frequency tables will be printed. (This box is checked by default.) If this check box is not checked, no frequency tables will be produced, and the only output will come from supplementary options from Statistics or Charts. For categorical variables, you will usually want to leave this box checked.

What if my frequency table has a blank row in it?

What should I do if I create a frequency table in SPSS and one of the rows is blank?

If you are creating a frequency table using a string variable and notice that the first row has a blank category label, similar to this example:

This particular issue is specific to frequency tables created from string variables. The blank row represents observations with missing values. SPSS does not automatically recognize blank (i.e., empty) strings as missing values, so the blank values appear as one of the “Valid” (i.e., non-missing) categories.

This issue should not be ignored! When missing values are treated as valid values, it causes the “Valid Percent” columns to be calculated incorrectly. If the blank values were correctly treated as missing values, the valid, non-missing sample size for this table would be 314 + 94 = 408 — not 435! — and the valid percent values would change to 314/408 = 76.9% and 94/408 = 23.0%. Depending on the number of missing values in your sample, the differences could be even more dramatic.

To fix this problem: To get SPSS to recognize blank strings as missing values, you’ll need to run the variable through the Automatic Recode procedure. This procedure takes a string variable and converts it to a new, coded numeric variable with value labels attached. During this process, blank string values are recoded to a special missing value code.

Example: Summarizing a Categorical Variable

Using the sample dataset, let’s create a frequency table and a corresponding bar chart for the class rank (variable Rank), and let’s also request the Mode statistic for this variable.

RUNNING THE PROCEDURE

Using the Frequencies Dialog Window

  1. Open the Frequencies window (Analyze > Descriptive Statistics > Frequencies) and double-click on variable Rank.
  2. To request the mode statistic, click Statistics. Check the box next to Mode, then click Continue.
  3. To turn on the bar chart option, click Charts. Select the radio button for Bar Charts. Then click Continue.
  4. When finished, click OK.

Using Syntax

FREQUENCIES VARIABLES=Rank
  /STATISTICS=MODE
  /BARCHART FREQ
  /ORDER=ANALYSIS.

OUTPUT

Two tables appear in the output: Statistics, which reports the number of missing and nonmissing observations in the dataset, plus any requested statistics; and the frequency table for variable Rank. The table title for the frequency table is determined by the variable’s label (or the variable name, if a label is not assigned).

Here, the Statistics table shows that there are 406 valid and 29 missing values. It also shows the Mode statistic: here, the mode value is “1”, which is the numeric code for the category Freshman. Notice that the Mode statistic isn’t displaying the value labels, even though they have been assigned. (For this reason, we recommend not requesting the mode statistic; instead, determine the mode from the frequency table.)

Notice how the rows are grouped into “Valid” and “Missing” sections. This grouping allows for easy comparison of missing versus nonmissing observations. Note that “System” missing responses are observations that use SPSS’s default symbol, a period (.), for indicating missing values. If a user has assigned special codes for missing values in the Variable View window, those codes would appear here.
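If you do want to declare your own codes as user-missing, the MISSING VALUES command is the syntax equivalent of the missing-value cells in the Variable View window; a minimal sketch, assuming a hypothetical code of 9 for Rank:

* Declare 9 as a user-missing code for Rank (9 is a hypothetical code).
MISSING VALUES Rank (9).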

ANALYSIS OF CATEGORICAL VARIABLES

The basic statistics available for categorical variables are counts and percentages.

Count

Number of cases in each cell of the table or number of responses for multiple response sets. If weighting is in effect, this value is the weighted count.

  • The weighted count is the same whether weighting is applied through global dataset weighting (Data > Weight Cases…) or through an effective base weight variable.
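For reference, global dataset weighting is switched on and off in syntax with the WEIGHT command; a minimal sketch using a hypothetical weight variable named finalweight:

* Turn on global dataset weighting (finalweight is a hypothetical variable).
WEIGHT BY finalweight.
* Tables produced from here on report weighted counts.
FREQUENCIES VARIABLES=Rank.
* Turn weighting off again.
WEIGHT OFF.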

Unweighted Count

Unweighted number of cases in each cell of the table. This only differs from count if weighting is in effect.

Adjusted Count

The adjusted count used in effective base weight calculations. If you do not use an effective base weight variable, the adjusted count is the same as the count.

Row Percent

Percentages within each row. The percentages in each row of a subtable (for simple percentages) sum to 100%. Row percentages are typically useful only if you have a categorical column variable.

Column Percent

Percentages within each column. The percentages in each column of a subtable (for simple percentages) sum to 100%. Column percentages are typically useful only if you have a categorical row variable.

Subtable Percent

Percentages in each cell are based on the subtable. All cell percentages in the subtable are based on the same total number of cases and sum to 100% within the subtable. In nested tables, the variable that precedes the innermost nesting level defines subtables. For example, in a table of Marital status within Gender within Age category, Gender defines subtables.

Table Percent

Percentages for each cell are based on the entire table. All cell percentages are based on the same total number of cases and sum to 100% (for simple percentages) over the entire table.

Percentages are affected by the base (denominator) used to calculate them, and there are a number of options for determining the base.
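As a rough illustration of how these count and percentage statistics are requested in syntax, here is a minimal Custom Tables sketch; it assumes the optional Custom Tables module is installed and uses hypothetical categorical variables marital and gender:

* Cross-tabulate two categorical variables, showing the count,
* row percentage, and column percentage in each cell.
CTABLES
  /TABLE marital [COUNT F40.0, ROWPCT.COUNT PCT40.1, COLPCT.COUNT PCT40.1] BY gender.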

Confidence intervals

  • Lower and upper confidence limits are available for counts, percentages, mean, median, percentiles, and sum.
  • The text string “&[Confidence Level]”, when included in the column label, inserts the confidence level into the label shown in the table.
  • Standard error is available for counts, percentages, mean, and sum.
  • Confidence intervals and standard error are not available for multiple response sets.

Level (%)

The confidence level for confidence intervals, expressed as a percentage. The value must be greater than 0 and less than 100. The default value is 95.

EPIDATA

 

EpiData is a suite of free and open-source software tools for epidemiological data management and basic statistical analysis. It is widely used by organizations and individuals to create and analyze large amounts of data; for example, the World Health Organization (WHO) uses EpiData in its STEPS method of collecting epidemiological, medical, and public health data, for biostatistics, and for other quantitative projects.

EpiData is widely used in the field of epidemiology and public health for various purposes. Here are some common uses of EpiData:

Data Entry: EpiData provides a user-friendly interface for data entry, making it easy to input data collected during epidemiological studies. It supports various data types, including text, numbers, dates, and more.

Data Validation: EpiData allows researchers to set up data validation rules, ensuring that the entered data is accurate and consistent. It can check for out-of-range values, missing data, and other errors.

Data Cleaning: After data entry, researchers often need to clean and prepare the data for analysis. EpiData provides tools to detect and correct errors in the dataset.

Data Documentation: Proper documentation of the dataset is crucial for reproducibility and transparency. EpiData enables researchers to create data documentation files that describe the dataset’s structure, variables, and codes.

Data Export: Researchers can export the cleaned and validated data from EpiData to various formats, including Excel, CSV, and more. This makes it easier to perform further analyses using statistical software like R or SPSS.
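For example, a CSV file exported from EpiData could be read into SPSS with the GET DATA command; a minimal sketch, with a hypothetical file path and variable list:

* Read a comma-delimited file whose first line holds variable names
* (the path and variables below are hypothetical).
GET DATA
  /TYPE=TXT
  /FILE='C:\exports\epidata_export.csv'
  /ARRANGEMENT=DELIMITED
  /DELIMITERS=","
  /QUALIFIER='"'
  /FIRSTCASE=2
  /VARIABLES=
    id F8.0
    age F3.0
    sex A1.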

Descriptive Statistics: While EpiData is not a full-fledged statistical analysis tool, it can generate basic descriptive statistics such as frequencies, means, and standard deviations to summarize the data.

Data Security: EpiData allows users to implement data security measures, such as password protection and user access controls, to safeguard sensitive epidemiological data.

Questionnaire Design: Researchers can design questionnaires for data collection within EpiData, ensuring that the data collected aligns with the study’s objectives.

 

Longitudinal Data Management: EpiData supports the management of longitudinal data, where data is collected from the same subjects at multiple time points. This is essential for tracking changes in health outcomes over time.

Data Quality Assurance: EpiData facilitates the implementation of quality assurance protocols to maintain data integrity throughout the study.

Epidemiological Studies: EpiData is particularly useful for managing data in various epidemiological studies, including cohort studies, case-control studies, and cross-sectional studies.

Public Health Surveillance: Public health agencies often use EpiData to manage and analyze data related to disease outbreaks and health trends in populations.

Teaching and Training: EpiData is also used in educational settings to teach students and train professionals in epidemiological data management and analysis.

Remember that while EpiData is valuable for data management and basic statistical analysis, more advanced statistical analyses may require the use of dedicated statistical software packages like R, SPSS, or Stata. EpiData can complement these tools by providing a robust data management platform.