In this section we discuss correlation analysis, which is a technique used to quantify the association between two continuous variables. For example, we might want to quantify the association between body mass index and systolic blood pressure, or between hours of exercise per week and percent body fat. Regression analysis is a related technique to assess the relationship between an outcome variable and one or more risk factors or confounding variables (confounding is discussed later). The outcome variable is also called the response or dependent variable, and the risk factors and confounders are called the predictors, or explanatory or independent variables. In regression analysis, the dependent variable is denoted "Y" and the independent variables are denoted by "X".

[NOTE: The term "predictor" can be misleading if it is interpreted as the ability to predict even beyond the limits of the data. Also, the term "explanatory variable" might give an impression of a causal effect in a situation in which inferences should be limited to identifying associations. The terms "independent" and "dependent" variable are less subject to these interpretations as they do not strongly imply cause and effect.]

Learning Objectives

After completing this module, the student will be able to:

Define and provide examples of dependent and independent variables in a study of a public health problem
Compute and interpret a correlation coefficient
Compute and interpret coefficients in a linear regression analysis

Correlation Analysis

In correlation analysis, we estimate a sample correlation coefficient, more specifically the Pearson Product Moment correlation coefficient. The sample correlation coefficient, denoted r, ranges between -1 and +1 and quantifies the direction and strength of the linear association between the two variables. The correlation between two variables can be positive (i.e., higher levels of one variable are associated with higher levels of the other) or negative (i.e., higher levels of one variable are associated with lower levels of the other).

The sign of the correlation coefficient indicates the direction of the association. The magnitude of the correlation coefficient indicates the strength of the association.

For example, a correlation of r = 0.9 suggests a strong, positive association between two variables, whereas a correlation of r = -0.2 suggests a weak, negative association. A correlation close to zero suggests no linear association between two continuous variables.

It is important to note that there may be a non-linear association between two continuous variables, but computation of a correlation coefficient does not detect this. Therefore, it is always important to evaluate the data carefully before computing a correlation coefficient. Graphical displays are particularly useful to explore associations between variables.
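To make this caution concrete, the following minimal sketch computes the correlation for a small set of hypothetical points (invented for illustration, not taken from this module) in which Y is completely determined by X through a U-shaped curve. The helper name `pearson_r` is an assumption of this sketch.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y depends perfectly on x, but the relationship is curved.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curved = pearson_r(xs, ys)
print(f"r for the U-shaped data: {r_curved:.3f}")
```

Here the correlation coefficient is essentially zero even though the two variables are strongly related, which is why a scatter plot should be inspected before r is interpreted.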

The figure below shows four hypothetical scenarios in which one continuous variable is plotted along the X-axis and the other along the Y-axis.

[Figure: four scatter plots, Scenarios 1 through 4]

Scenario 1 depicts a strong positive association (r=0.9), similar to what we might see for the correlation between infant birth weight and birth length.
Scenario 2 depicts a weaker association (r=0.2) that we might expect to see between age and body mass index (which tends to increase with age).
Scenario 3 might depict the lack of association (r approximately = 0) between the extent of media exposure in adolescence and age at which adolescents initiate sexual activity.
Scenario 4 might depict the strong negative association (r= -0.9) generally observed between the number of hours of aerobic exercise per week and percent body fat.

Example - Correlation of Gestational Age and Birth Weight

A small study is conducted involving 17 infants to investigate the association between gestational age at birth, measured in weeks, and birth weight, measured in grams.

Infant ID #    Gestational Age (weeks)    Birth Weight (grams)
1              34.7                       1895
2              36.0                       2030
3              29.3                       1440
4              40.1                       2835
5              35.7                       3090
6              42.4                       3827
7              40.3                       3260
8              37.3                       2690
9              40.9                       3285
10             38.3                       2920
11             38.5                       3430
12             41.4                       3657
13             39.7                       3685
14             39.7                       3345
15             41.1                       3260
16             38.0                       2680
17             38.7                       2005

We wish to estimate the association between gestational age and infant birth weight. In this example, birth weight is the dependent variable and gestational age is the independent variable. Thus y=birth weight and x=gestational age. The data are displayed in a scatter diagram in the figure below.

[Figure: scatter diagram of birth weight (grams) against gestational age (weeks)]

Each point represents an (x,y) pair (in this case the gestational age, measured in weeks, and the birth weight, measured in grams). Note that the independent variable (gestational age) is on the horizontal axis (or X-axis), and the dependent variable (birth weight) is on the vertical axis (or Y-axis). The scatter plot shows a positive or direct association between gestational age and birth weight. Infants with shorter gestational ages are more likely to be born with lower weights and infants with longer gestational ages are more likely to be born with higher weights.

Computing the Correlation Coefficient

The formula for the sample correlation coefficient is:

$$ r = \frac{\mathrm{Cov}(x,y)}{\sqrt{s_x^2 \, s_y^2}} $$

where Cov(x,y) is the covariance of x and y, defined as

$$ \mathrm{Cov}(x,y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n-1} $$

and $s_x^2$ and $s_y^2$ are the sample variances of x and y, defined as follows:

$$ s_x^2 = \frac{\sum (X - \bar{X})^2}{n-1} \quad \text{and} \quad s_y^2 = \frac{\sum (Y - \bar{Y})^2}{n-1} $$

The variances of x and y measure the variability of the x scores and y scores around their respective sample means of X and Y, considered separately. The covariance measures the variability of the (x,y) pairs around the mean of x and mean of y, considered simultaneously.

To compute the sample correlation coefficient, we need to compute the variance of gestational age, the variance of birth weight, and also the covariance of gestational age and birth weight.

We first summarize the gestational age data. The mean gestational age is:

$$ \bar{X} = \frac{\sum X}{n} = \frac{652.1}{17} = 38.4 $$

To compute the variance of gestational age, we need to sum the squared deviations (or differences) between each observed gestational age and the mean gestational age. The computations are summarized below.

Infant ID #    Gestational Age (weeks)    $X - \bar{X}$    $(X - \bar{X})^2$
1              34.7                       -3.7             13.69
2              36.0                       -2.4             5.76
3              29.3                       -9.1             82.81
4              40.1                       1.7              2.89
5              35.7                       -2.7             7.29
6              42.4                       4.0              16.0
7              40.3                       1.9              3.61
8              37.3                       -1.1             1.21
9              40.9                       2.5              6.25
10             38.3                       -0.1             0.01
11             38.5                       0.1              0.01
12             41.4                       3.0              9.0
13             39.7                       1.3              1.69
14             39.7                       1.3              1.69
15             41.1                       2.7              7.29
16             38.0                       -0.4             0.16
17             38.7                       0.3              0.09
Total          652.1                                       159.45

The variance of gestational age is:

$$ s_x^2 = \frac{\sum (X - \bar{X})^2}{n-1} = \frac{159.45}{16} = 10.0 $$

Next, we summarize the birth weight data. The mean birth weight is:

$$ \bar{Y} = \frac{\sum Y}{n} = \frac{49{,}334}{17} = 2902 $$

The variance of birth weight is computed just as we did for gestational age, as shown in the table below.

Infant ID #    Birth Weight (grams)    $Y - \bar{Y}$    $(Y - \bar{Y})^2$
1              1895                    -1007            1,014,049
2              2030                    -872             760,384
3              1440                    -1462            2,137,444
4              2835                    -67              4,489
5              3090                    188              35,344
6              3827                    925              855,625
7              3260                    358              128,164
8              2690                    -212             44,944
9              3285                    383              146,689
10             2920                    18               324
11             3430                    528              278,784
12             3657                    755              570,025
13             3685                    783              613,089
14             3345                    443              196,249
15             3260                    358              128,164
16             2680                    -222             49,284
17             2005                    -897             804,609
Total          49,334                  0                7,767,660

The variance of birth weight is:

$$ s_y^2 = \frac{\sum (Y - \bar{Y})^2}{n-1} = \frac{7{,}767{,}660}{16} = 485{,}478.8 $$

Next we compute the covariance:

To compute the covariance of gestational age and birth weight, we need to multiply the deviation from the mean gestational age by the deviation from the mean birth weight for each participant, that is:

$$ (X - \bar{X})(Y - \bar{Y}) $$

The computations are summarized below. Notice that we simply copy the deviations from the mean gestational age and birth weight from the two tables above into the table below and multiply.

Infant ID #    $X - \bar{X}$    $Y - \bar{Y}$    $(X - \bar{X})(Y - \bar{Y})$
1              -3.7             -1007            3725.9
2              -2.4             -872             2092.8
3              -9.1             -1462            13,304.2
4              1.7              -67              -113.9
5              -2.7             188              -507.6
6              4.0              925              3700.0
7              1.9              358              680.2
8              -1.1             -212             233.2
9              2.5              383              957.5
10             -0.1             18               -1.8
11             0.1              528              52.8
12             3.0              755              2265.0
13             1.3              783              1017.9
14             1.3              443              575.9
15             2.7              358              966.6
16             -0.4             -222             88.8
17             0.3              -897             -269.1
Total                                            28,768.4

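The covariance of gestational age and birth weight is:

$$ \mathrm{Cov}(x,y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n-1} = \frac{28{,}768.4}{16} = 1798.0 $$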

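Finally, we can now compute the sample correlation coefficient:

$$ r = \frac{\mathrm{Cov}(x,y)}{\sqrt{s_x^2 \, s_y^2}} = \frac{1798.0}{\sqrt{10.0 \times 485{,}478.8}} = \frac{1798.0}{2203.4} = 0.82 $$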

Not surprisingly, the sample correlation coefficient indicates a strong positive correlation.
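The hand computation above can be checked with a short script. This is a minimal sketch using the 17 observations from the table; the function names are assumptions of this sketch. It uses the unrounded mean gestational age, so intermediate values differ slightly from the tables (which round the mean to 38.4), but the correlation still rounds to 0.82.

```python
import math

# Data from the gestational age / birth weight table (17 infants).
gest_age = [34.7, 36.0, 29.3, 40.1, 35.7, 42.4, 40.3, 37.3, 40.9,
            38.3, 38.5, 41.4, 39.7, 39.7, 41.1, 38.0, 38.7]
birth_wt = [1895, 2030, 1440, 2835, 3090, 3827, 3260, 2690, 3285,
            2920, 3430, 3657, 3685, 3345, 3260, 2680, 2005]

def sample_mean(values):
    return sum(values) / len(values)

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    m = sample_mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def sample_covariance(xs, ys):
    """Sum of cross-products of deviations, divided by n - 1."""
    mx, my = sample_mean(xs), sample_mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

var_x = sample_variance(gest_age)
var_y = sample_variance(birth_wt)
cov_xy = sample_covariance(gest_age, birth_wt)
r = cov_xy / math.sqrt(var_x * var_y)

print(f"variance of gestational age: {var_x:.2f}")
print(f"variance of birth weight:    {var_y:.1f}")
print(f"covariance:                  {cov_xy:.1f}")
print(f"correlation r:               {r:.2f}")
```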

As we noted, sample correlation coefficients range from -1 to +1. In practice, meaningful correlations (i.e., correlations that are clinically or practically important) can be as small as 0.4 (or -0.4) for positive (or negative) associations. There are also statistical tests to determine whether an observed correlation is statistically significant or not (i.e., statistically significantly different from zero). Procedures to test whether an observed sample correlation is suggestive of a statistically significant correlation are described in detail in Kleinbaum, Kupper and Muller.1
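As an illustration only, one commonly used test converts r into a t statistic, t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom. This is a standard textbook approach and is not necessarily the exact procedure given in the cited reference. The sketch below applies it to the gestational age example (r = 0.82, n = 17); the function name and the critical value taken from a standard t table are assumptions of this sketch.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing whether a population correlation is zero."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if not -1 < r < 1:
        raise ValueError("r must be strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t_stat = correlation_t_statistic(0.82, 17)

# Assumption: two-sided 5% critical value for 15 degrees of freedom,
# read from a standard t table.
T_CRITICAL_DF15 = 2.131
significant = abs(t_stat) > T_CRITICAL_DF15

print(f"t = {t_stat:.2f}, significant at the 5% level: {significant}")
```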

Regression Analysis

Regression analysis is a widely used technique which is useful for many applications. We introduce the technique here and expand on its uses in subsequent modules.

Simple Linear Regression

Simple linear regression is a technique that is appropriate to understand the association between one independent (or predictor) variable and one continuous dependent (or outcome) variable. For example, suppose we want to assess the association between total cholesterol (in milligrams per deciliter, mg/dL) and body mass index (BMI, measured as the ratio of weight in kilograms to height in meters squared) where total cholesterol is the dependent variable, and BMI is the independent variable. In regression analysis, the dependent variable is denoted Y and the independent variable is denoted X. So, in this case, Y=total cholesterol and X=BMI.

When there is a single continuous dependent variable and a single independent variable, the analysis is called a simple linear regression analysis. This analysis assumes that there is a linear association between the two variables. (If a different relationship is hypothesized, such as a curvilinear or exponential relationship, alternative regression analyses are performed.)

The figure below is a scatter diagram illustrating the relationship between BMI and total cholesterol. Each point represents the observed (x, y) pair, in this case, BMI and the corresponding total cholesterol measured in each participant. Note that the independent variable (BMI) is on the horizontal axis and the dependent variable (Total Serum Cholesterol) on the vertical axis.

[Figure: scatter diagram of BMI and Total Cholesterol]

The graph shows that there is a positive or direct association between BMI and total cholesterol; participants with lower BMI are more likely to have lower total cholesterol levels and participants with higher BMI are more likely to have higher total cholesterol levels. In contrast, suppose we examine the association between BMI and HDL cholesterol. The graph below depicts the relationship between BMI and HDL cholesterol in the same sample of n=20 participants.

[Figure: scatter diagram of BMI and HDL Cholesterol]

This graph shows a negative or inverse association between BMI and HDL cholesterol, i.e., those with lower BMI are more likely to have higher HDL cholesterol levels and those with higher BMI are more likely to have lower HDL cholesterol levels.

For either of these relationships we could use simple linear regression analysis to estimate the equation of the line that best describes the association between the independent variable and the dependent variable. The simple linear regression equation is as follows:

$$ \hat{Y} = b_0 + b_1 X $$

where $\hat{Y}$ is the predicted or expected value of the outcome, X is the predictor, b0 is the estimated Y-intercept, and b1 is the estimated slope. The Y-intercept and slope are estimated from the sample data, and they are the values that minimize the sum of the squared differences between the observed and the predicted values of the outcome, i.e., the estimates minimize:

$$ \sum (Y - \hat{Y})^2 $$

These differences between observed and predicted values of the outcome are called residuals. The estimates of the Y-intercept and slope minimize the sum of the squared residuals, and are called the least squares estimates.1

Residuals

Conceptually, if the values of X provided a perfect prediction of Y then the sum of the squared differences between observed and predicted values of Y would be 0. That would mean that variability in Y could be completely explained by differences in X. However, if the differences between observed and predicted values are not 0, then we are unable to entirely account for differences in Y based on X, and there are residual errors in the prediction. The residual error could result from inaccurate measurements of X or Y, or there could be other variables besides X that affect the value of Y.

Based on the observed data, the best estimate of a linear relationship will be obtained from an equation for the line that minimizes the sum of the squared differences between observed and predicted values of the outcome. The Y-intercept of this line is the value of the dependent variable (Y) when the independent variable (X) is zero. The slope of the line is the change in the dependent variable (Y) relative to a one unit change in the independent variable (X). The least squares estimates of the y-intercept and slope are computed as follows:

$$ b_1 = r \frac{s_y}{s_x} $$

and

$$ b_0 = \bar{Y} - b_1 \bar{X} $$

where r is the sample correlation coefficient, the sample means are $\bar{X}$ and $\bar{Y}$, and $s_x$ and $s_y$ are the standard deviations of the independent variable x and the dependent variable y, respectively.
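The sketch below shows one way to turn these formulas into code. Because the individual BMI and cholesterol measurements are not listed in this module, it reuses the gestational age and birth weight data from the correlation example; fitting a regression line to those data is an illustration added here, not a result reported in the module, and the function names are assumptions. The last lines check numerically that nudging the intercept or slope away from the least squares estimates increases the sum of squared residuals.

```python
import statistics

# Data from the gestational age / birth weight table (17 infants).
gest_age = [34.7, 36.0, 29.3, 40.1, 35.7, 42.4, 40.3, 37.3, 40.9,
            38.3, 38.5, 41.4, 39.7, 39.7, 41.1, 38.0, 38.7]
birth_wt = [1895, 2030, 1440, 2835, 3090, 3827, 3260, 2690, 3285,
            2920, 3430, 3657, 3685, 3345, 3260, 2680, 2005]

def least_squares_from_r(xs, ys):
    """Estimate (b0, b1) using b1 = r * s_y / s_x and b0 = mean_y - b1 * mean_x."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # n - 1 denominators
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    r = cov_xy / (s_x * s_y)
    b1 = r * s_y / s_x
    b0 = mean_y - b1 * mean_x
    return b0, b1

def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of squared differences between observed and predicted outcomes."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

b0, b1 = least_squares_from_r(gest_age, birth_wt)
ssr_fit = sum_squared_residuals(gest_age, birth_wt, b0, b1)

# Any other line should fit worse in the least squares sense.
ssr_steeper = sum_squared_residuals(gest_age, birth_wt, b0, b1 + 10)
ssr_shifted = sum_squared_residuals(gest_age, birth_wt, b0 + 100, b1)

print(f"b0 = {b0:.1f}, b1 = {b1:.1f}")
print(ssr_fit < ssr_steeper and ssr_fit < ssr_shifted)
```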

BMI and Total Cholesterol

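The least squares estimates of the regression coefficients, b0 and b1, describing the relationship between BMI and total cholesterol are b0 = 28.07 and b1 = 6.49. They are obtained by substituting the sample correlation, the sample means, and the sample standard deviations of BMI and total cholesterol into the formulas for b1 and b0.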

The estimate of the Y-intercept (b0 = 28.07) represents the estimated total cholesterol level when BMI is zero. Because a BMI of zero is meaningless, the Y-intercept is not informative. The estimate of the slope (b1 = 6.49) represents the change in total cholesterol relative to a one unit change in BMI. For example, if we compare two participants whose BMIs differ by 1 unit, we would expect their total cholesterols to differ by approximately 6.49 units (with the person with the higher BMI having the higher total cholesterol).

The equation of the regression line is as follows:

$$ \hat{Y} = 28.07 + 6.49 \, \mathrm{BMI} $$

The graph below shows the estimated regression line superimposed on the scatter diagram.

[Figure: estimated regression line superimposed on the scatter diagram of BMI and Total Cholesterol]

The regression equation can be used to estimate a participant's total cholesterol as a function of his/her BMI. For example, suppose a participant has a BMI of 25. We would estimate their total cholesterol to be 28.07 + 6.49(25) = 190.32. The equation can also be used to estimate total cholesterol for other values of BMI. However, the equation should only be used to estimate cholesterol levels for persons whose BMIs are in the range of the data used to generate the regression equation. In our sample, BMI ranges from 20 to 32, thus the equation should only be used to generate estimates of total cholesterol for persons with BMI in that range.
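A minimal sketch of this prediction rule follows. The coefficients and the BMI range of 20 to 32 come from the text; the function name and the choice to raise an error for out-of-range values are assumptions of this sketch.

```python
# Estimated coefficients and observed BMI range from the text.
B0 = 28.07
B1 = 6.49
BMI_MIN, BMI_MAX = 20, 32

def predict_total_cholesterol(bmi):
    """Estimate total cholesterol (mg/dL) from BMI, refusing to extrapolate."""
    if not BMI_MIN <= bmi <= BMI_MAX:
        raise ValueError(
            f"BMI {bmi} is outside the range of the data ({BMI_MIN} to {BMI_MAX})"
        )
    return B0 + B1 * bmi

print(f"{predict_total_cholesterol(25):.2f}")
```

Refusing to return a value outside the observed range is one way to enforce the warning against extrapolation; a gentler design could return the estimate together with a warning flag.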

There are statistical tests that can be performed to assess whether the estimated regression coefficients (b0 and b1) are statistically significantly different from zero. The test of most interest is usually H0: β1=0 versus H1: β1≠0, where β1 is the population slope. If the test indicates that the population slope is significantly different from zero, we conclude that there is a statistically significant association between the independent and dependent variables.

BMI and HDL Cholesterol

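The least squares estimates of the regression coefficients, b0 and b1, describing the relationship between BMI and HDL cholesterol are as follows: b0 = 111.77 and b1 = -2.35. They are obtained by substituting the sample correlation, the sample means, and the sample standard deviations of BMI and HDL cholesterol into the formulas for b1 and b0.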

Again, the Y-intercept is uninformative because a BMI of zero is meaningless. The estimate of the slope (b1 = -2.35) represents the change in HDL cholesterol relative to a one unit change in BMI. If we compare two participants whose BMIs differ by 1 unit, we would expect their HDL cholesterols to differ by approximately 2.35 units (with the person with the higher BMI having the lower HDL cholesterol). The figure below shows the regression line superimposed on the scatter diagram for BMI and HDL cholesterol.

[Figure: estimated regression line superimposed on the scatter diagram of BMI and HDL Cholesterol]

Linear regression analysis rests on the assumption that the dependent variable is continuous and that the distribution of the dependent variable (Y) at each value of the independent variable (X) is approximately normally distributed. Note, however, that the independent variable can be continuous (e.g., BMI) or can be dichotomous (see below).

Comparing Mean HDL Levels With Regression Analysis

Consider a clinical trial to evaluate the efficacy of a new drug to increase HDL cholesterol. We could compare the mean HDL levels between treatment groups statistically using a two independent samples t test. Here we consider an alternative approach. Summary data for the trial are shown below:

              Sample Size    Mean HDL    Standard Deviation of HDL
New Drug      50             40.16       4.46
Placebo       50             39.21       3.91

HDL cholesterol is the continuous dependent variable and treatment assignment (new drug versus placebo) is the independent variable. Suppose the data on n=100 participants are entered into a statistical computing package. The outcome (Y) is HDL cholesterol in mg/dL and the independent variable (X) is treatment assignment. For this analysis, X is coded as 1 for participants who received the new drug and as 0 for participants who received the placebo. A simple linear regression equation is estimated as follows:

$$ \hat{Y} = 39.21 + 0.95 X $$

where $\hat{Y}$ is the estimated HDL level and X is a dichotomous variable (also called an indicator variable, in this case indicating whether the participant was assigned to the new drug or to placebo). The estimate of the Y-intercept is b0=39.21. The Y-intercept is the value of Y (HDL cholesterol) when X is zero. In this example, X=0 indicates assignment to the placebo group. Thus, the Y-intercept is exactly equal to the mean HDL level in the placebo group. The slope is estimated as b1=0.95. The slope represents the estimated change in Y (HDL cholesterol) relative to a one unit change in X. A one unit change in X represents a difference in treatment assignment (placebo versus new drug). The slope represents the difference in mean HDL levels between the treatment groups. Thus, the mean HDL for participants receiving the new drug is:

$$ \hat{Y} = 39.21 + 0.95(1) = 40.16 $$
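The module reports only summary statistics for the trial, so the sketch below uses a tiny set of hypothetical HDL values (invented for illustration) to show why a 0/1 indicator variable makes the intercept equal to the placebo mean and the slope equal to the difference in group means. All names are assumptions of this sketch.

```python
# Hypothetical HDL values (mg/dL), invented for illustration only.
placebo_hdl = [38, 40, 39, 41]
new_drug_hdl = [40, 42, 41, 43]

# Code treatment as an indicator: 0 = placebo, 1 = new drug.
xs = [0] * len(placebo_hdl) + [1] * len(new_drug_hdl)
ys = placebo_hdl + new_drug_hdl

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least squares slope and intercept (algebraically the same as
# b1 = r * s_y / s_x and b0 = mean_y - b1 * mean_x).
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

mean_placebo = sum(placebo_hdl) / len(placebo_hdl)
mean_new_drug = sum(new_drug_hdl) / len(new_drug_hdl)

print(f"b0 = {b0}, placebo mean = {mean_placebo}")
print(f"b1 = {b1}, difference in means = {mean_new_drug - mean_placebo}")
```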

A study was conducted to evaluate the association between a person's intelligence and the size of their brain. Participants completed a standardized IQ test and researchers used Magnetic Resonance Imaging (MRI) to determine brain size. Demographic information, including the patient's gender, was also recorded.

The Controversy Over Environmental Tobacco Smoke Exposure

There is convincing evidence that active smoking is a cause of lung cancer and heart disease. Many studies done in a wide variety of circumstances have consistently demonstrated a strong association and also indicate that the risk of lung cancer and cardiovascular disease (i.e., heart attacks) increases in a dose-related way. These studies have led to the conclusion that active smoking is causally related to lung cancer and cardiovascular disease. Studies in active smokers have had the advantage that the lifetime exposure to tobacco smoke can be quantified with reasonable accuracy, since the unit dose is consistent (one cigarette) and the habitual nature of tobacco smoking makes it possible for most smokers to provide a reasonable estimate of their total lifetime exposure quantified in terms of cigarettes per day or packs per day. Frequently, average daily exposure (cigarettes or packs) is combined with duration of use in years in order to quantify exposure as "pack-years".

It has been much more difficult to establish whether environmental tobacco smoke (ETS) exposure is causally related to chronic diseases like heart disease and lung cancer, because the total lifetime exposure dosage is lower, and it is much more difficult to accurately estimate total lifetime exposure. In addition, quantifying these risks is also complicated because of confounding factors. For example, ETS exposure is usually classified based on parental or spousal smoking, but these studies are unable to quantify other environmental exposures to tobacco smoke, and inability to quantify and adjust for other environmental exposures such as air pollution makes it difficult to demonstrate an association even if one existed. As a result, there continues to be controversy over the risk imposed by environmental tobacco smoke (ETS). Some have gone so far as to claim that even very brief exposure to ETS can cause a myocardial infarction (heart attack), but a very large prospective cohort study by Enstrom and Kabat was unable to demonstrate significant associations between exposure to spousal ETS and coronary heart disease, chronic obstructive pulmonary disease, or lung cancer. (It should be noted, however, that the report by Enstrom and Kabat has been widely criticized for methodological problems, and these authors also had financial ties to the tobacco industry.)

Correlation analysis provides a useful tool for thinking about this controversy. Consider data from the British Doctors Cohort. They reported the annual mortality for a variety of diseases at four levels of cigarette smoking per day: never smoked, 1-14/day, 15-24/day, and 25+/day. In order to perform a correlation analysis, I rounded the exposure levels to 0, 10, 20, and 30 respectively.

Cigarettes Smoked Per Day    CVD Mortality Per 100,000 Men Per Year    Lung Cancer Mortality Per 100,000 Men Per Year
0                            572                                       14
10 (actually 1-14)           802                                       105
20 (actually 15-24)          892                                       208
30 (actually >24)            1025                                      355

The figures below show the two estimated regression lines superimposed on the scatter diagram. The correlation with amount of smoking was strong for both CVD mortality (r= 0.98) and for lung cancer (r = 0.99). Note also that the Y-intercept is a meaningful number here; it represents the predicted annual death rate from these diseases in individuals who never smoked. The Y-intercept for prediction of CVD is slightly higher than the observed rate in never smokers, while the Y-intercept for lung cancer is lower than the observed rate in never smokers.

The linearity of these relationships suggests that there is an incremental risk with each additional cigarette smoked per day, and the additional risk is estimated by the slopes. This perhaps helps us think about the consequences of ETS exposure. For example, the risk of lung cancer in never smokers is quite low, but there is a finite risk; various reports suggest a risk of 10-15 lung cancers/100,000 per year. If an individual who never smoked actively was exposed to the equivalent of one cigarette's smoke in the form of ETS, then the regression suggests that their risk would increase by 11.26 lung cancer deaths per 100,000 per year. However, the risk is clearly dose-related. Therefore, if a non-smoker was employed by a tavern with heavy levels of ETS, the risk might be substantially greater.
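The correlations and the lung cancer slope quoted above can be reproduced from the four rows of the mortality table. The following is a minimal sketch; the function and variable names are assumptions of this sketch, and the exposure levels are the rounded values 0, 10, 20, and 30.

```python
import math

# Data from the mortality table (rates per 100,000 men per year).
cigs_per_day = [0, 10, 20, 30]
cvd_mortality = [572, 802, 892, 1025]
lung_mortality = [14, 105, 208, 355]

def fit_line(xs, ys):
    """Return (r, b0, b1) for a simple linear regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    b1 = sxy / sxx               # algebraically equal to r * s_y / s_x
    b0 = mean_y - b1 * mean_x
    return r, b0, b1

r_cvd, b0_cvd, b1_cvd = fit_line(cigs_per_day, cvd_mortality)
r_lung, b0_lung, b1_lung = fit_line(cigs_per_day, lung_mortality)

print(f"CVD:         r = {r_cvd:.2f}, intercept = {b0_cvd:.1f}, slope = {b1_cvd:.2f}")
print(f"Lung cancer: r = {r_lung:.2f}, intercept = {b0_lung:.1f}, slope = {b1_lung:.2f}")
```

With only four grouped data points, these estimates are best read as a rough summary of the dose-response pattern rather than as precise risk estimates.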

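[Figure: estimated regression line for CVD mortality by cigarettes smoked per day]

[Figure: estimated regression line for lung cancer mortality by cigarettes smoked per day]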

Finally, it should be noted that some findings suggest that the association between smoking and heart disease is non-linear at the very lowest exposure levels, meaning that non-smokers have a disproportionate increase in risk when exposed to ETS due to an increase in platelet aggregation.

Summary

Correlation and linear regression analysis are statistical techniques to quantify associations between an independent, sometimes called a predictor, variable (X) and a continuous dependent outcome variable (Y). For correlation analysis, the independent variable (X) can be continuous (e.g., gestational age) or ordinal (e.g., increasing categories of cigarettes per day). Regression analysis can also accommodate dichotomous independent variables.

The procedures described here assume that the association between the independent and dependent variables is linear. With some adjustments, regression analysis can also be used to estimate associations that follow another functional form (e.g., curvilinear, quadratic). Here we consider associations between one independent variable and one continuous dependent variable. The regression analysis is called simple linear regression - simple in this case refers to the fact that there is a single independent variable. In the next module, we consider regression analysis with several independent variables, or predictors, considered simultaneously.