Business in Practice: Data Analytics
Week 8
Dr. Markos Kyritsis
OBJECTIVES – LEARNING OUTCOMES
• By the end of this week, you should be able to:
• Critically discuss the difference between correlation and causation
• Use a correlation test (Pearson’s r or Spearman’s rho), and argue when each one is suitable
• Report the results of a correlation test
CORRELATIONS
WHAT IS A CORRELATION?
• A symmetrical relationship between two numeric variables
POSITIVE CORRELATION. AS X GOES UP, Y GOES UP
NEGATIVE CORRELATION. AS X GOES UP, Y GOES DOWN
STRONG VS WEAK CORRELATION
CORRELATION AND CAUSALITY
• Often confused. A symmetrical relationship does not imply a causal effect. A third variable (a mediator or confounder) may be the actual cause of the observed bivariate relationship.
[Diagram nodes: GDP, Declining Poverty, Quality of Life]
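The point about a third variable can be illustrated with a small simulation. This is only a sketch with made-up data: `z`, `x` and `y` are hypothetical variables, and the noise levels and seed are arbitrary assumptions. Here `x` and `y` never influence each other, yet they correlate strongly because both depend on `z`.

```python
import random
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson's r: sample covariance divided by the product of the standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))


random.seed(42)  # fixed seed so the sketch is repeatable
n = 2000

z = [random.gauss(0, 1) for _ in range(n)]    # hidden third variable
x = [zi + random.gauss(0, 0.5) for zi in z]   # driven by z, not by y
y = [zi + random.gauss(0, 0.5) for zi in z]   # driven by z, not by x

r_xy = pearson_r(x, y)  # strong, even though x and y do not affect each other

# Removing z's contribution leaves only the independent noise terms.
x_resid = [xi - zi for xi, zi in zip(x, z)]
y_resid = [yi - zi for yi, zi in zip(y, z)]
r_resid = pearson_r(x_resid, y_resid)  # close to zero

print(f"r(x, y) = {r_xy:.2f}, r after removing z = {r_resid:.2f}")
```

Once the shared dependence on `z` is taken out, the apparent relationship between `x` and `y` largely disappears.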
DIRECTIONAL RELATIONSHIPS
• Clearly there is a direction here (which of the two has a causative effect?):
[Diagram nodes: EU Refugee/Migrant crisis; Increase in % votes for far right]
UNCLEAR RELATIONSHIPS
• The direction of this relationship may not be so clear:
• Are you more likely to become frustrated with an increase in the number of errors you make while using a
system (e.g., a CRM)?
• Does frustration increase the number of errors you make while using a system?
[Diagram nodes: Number of errors; Frustration]
NOT ALL CORRELATIONS ARE MEANINGFUL
• There are pages dedicated to finding correlations between seemingly unrelated variables.
Source URL: https://www.tylervigen.com/spurious-correlations
CORRELATION
• Standardising the covariance gives a measure of the effect size of the relationship that is independent of the variables’ scales, so relationships between variables measured on any scale can be compared.
• The correlation coefficient (denoted as r) ranges from -1 to 1, with 0 indicating no linear relationship, 1 the strongest possible positive relationship, and -1 the strongest possible negative relationship.
• The formula for standardising the covariance is: $r = \frac{\mathrm{cov}(x, y)}{s_x s_y}$
• Where $s_x$ and $s_y$ are the standard deviations of x and y.
• This coefficient is called the Pearson Correlation Coefficient.
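As a rough illustration of standardising the covariance, the sketch below computes Pearson's r by hand using only the Python standard library. The data are invented for the example (they are not the salaries dataset), and the function names are assumptions.

```python
from statistics import mean, stdev


def covariance(x, y):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def pearson_r(x, y):
    """Pearson's r: the covariance divided by the product of the standard deviations."""
    return covariance(x, y) / (stdev(x) * stdev(y))


# Hypothetical data: years of service and salary in thousands.
years = [1, 3, 5, 7, 9]
salary_k = [30, 34, 41, 45, 52]

# The same salaries expressed in single units instead of thousands.
salary = [s * 1000 for s in salary_k]

cov_thousands = covariance(years, salary_k)
cov_units = covariance(years, salary)      # 1000 times larger: covariance depends on scale

r_thousands = pearson_r(years, salary_k)
r_units = pearson_r(years, salary)         # unchanged: r does not depend on scale

print(cov_thousands, cov_units)
print(round(r_thousands, 4), round(r_units, 4))
```

Changing the unit of salary changes the covariance by a factor of 1000 but leaves r the same, which is why r can be compared across variables on different scales.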
PARAMETRIC ASSUMPTION FOR HYPOTHESIS TEST (PEARSON’S R)
• For the hypothesis test, we use a z or t distribution. We therefore assume bivariate normality (test both variables using Shapiro-Wilk, or inspect plots for large n)
• If the assumption is violated, switch to a non-parametric test (Spearman’s rho is the most popular).
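Spearman's rho can be understood as Pearson's r computed on the ranks of the data, with tied values sharing the average of their ranks. The sketch below shows that idea using only the standard library. The data and helper names are hypothetical, and the hypothesis test (p-value) is left out.

```python
from statistics import mean, stdev


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson's r: sample covariance divided by the product of the standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))


def spearman_rho(x, y):
    """Spearman's rho as Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical data: y always increases with x, but not in a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 1000]

rho = spearman_rho(x, y)   # ranks agree perfectly
r = pearson_r(x, y)        # pulled down by the extreme value

print(f"rho = {rho:.2f}, r = {r:.2f}")
```

Because it works on ranks, rho is not affected by extreme values or skew in the way that r is, which is why it is a common fallback when the normality assumption is in doubt.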
SALARY AND YEARS OF SERVICE
• In the salaries dataset, let’s check if there is a correlation between salary and years of service
BIVARIATE NORMALITY
PEARSON’S R OR SPEARMAN’S RHO?
• The bivariate normality assumption is violated (it’s actually not terrible, but let’s play it safe)
• So let’s use Spearman’s rho
STEP 1
STEP 2
Hold ‘ctrl’ key to select both variables
STEP 3
REPORTING THE RESULTS
• There was a medium correlation between salary and years of service [rs = 0.43, p < 0.05] *
EFFECT SIZE ESTIMATES
Coefficient (absolute value, i.e., positive or negative)    Effect Size
|r| < 0.3                                                   Small
0.3 <= |r| < 0.5                                            Medium
|r| >= 0.5                                                  Large
It is possible to have a negligible but significant correlation, especially as sample size increases
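The thresholds above, and the note about sample size, can be sketched in a few lines. The function names are assumptions. The t statistic shown is the standard one for testing Pearson's r against zero (n - 2 degrees of freedom), and it is used here only to show how the same tiny coefficient becomes significant as n grows.

```python
from math import sqrt


def effect_size_label(r):
    """Label a correlation coefficient using the thresholds in the table above."""
    size = abs(r)  # the sign gives direction, not strength
    if size < 0.3:
        return "small"
    if size < 0.5:
        return "medium"
    return "large"


def t_statistic(r, n):
    """t statistic for testing Pearson's r against zero, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))


# The coefficient reported for salary and years of service.
label = effect_size_label(0.43)

# The same negligible coefficient at two hypothetical sample sizes.
r = 0.05
t_small_n = t_statistic(r, 30)       # well below the usual cut-off of roughly 2
t_large_n = t_statistic(r, 10_000)   # well above it

print(label, round(t_small_n, 2), round(t_large_n, 2))
```

With r fixed at 0.05, the test statistic grows roughly with the square root of n, so a large enough sample makes even a negligible correlation statistically significant. This is why the effect size should be reported alongside the p-value.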
SUMMARY
• Correlation is the symmetrical relationship between two variables
• Correlation is not necessarily causation
• Pearson’s r is the parametric test
• Spearman’s rho is non-parametric
• Report r or rs along with the p-value