From the course: Complete Guide to Generative AI for Data Analysis and Data Science

Correlation analysis

- [Instructor] Correlation analysis is a set of techniques for analyzing how two numeric variables may be related. With correlation analysis, we can measure both the strength and the direction of relationships between quantitative variables. This is a little different from working with categorical variables using chi-squared tests, where we can only determine whether or not two variables are related. With quantitative measures, correlation analysis lets us determine how changes in one variable are associated with changes in another. For example, as salary increases, we may find that disposable income also increases, assuming expenses don't increase at the same time. Or, as inflation rises in the macroeconomic environment, retailers might see a downward trend in the number of products purchased: if someone's purchasing power decreases as inflation increases, that may correlate with their buying fewer items. That's the kind of question correlation analysis can help us answer. Now, the various types of correlation analysis generally produce a correlation coefficient, typically denoted by the letter R, whose value ranges from negative one to positive one. If R equals zero, that indicates there is no linear relationship between the variables. And I want to emphasize linear: there may be a non-linear relationship, but these techniques are not going to give us insight into it. When R is greater than zero, we refer to that as a positive relationship, one where as one variable increases, so does the other.
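To make the coefficient concrete, here is a minimal sketch of computing R for the salary example, assuming NumPy is available; the figures are invented for illustration.

```python
import numpy as np

# Hypothetical data: salary vs. disposable income (a positive relationship)
salary = np.array([40_000, 55_000, 60_000, 75_000, 90_000], dtype=float)
disposable_income = np.array([8_000, 12_000, 13_500, 18_000, 22_000], dtype=float)

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is R for the pair
r = np.corrcoef(salary, disposable_income)[0, 1]
print(f"R = {r:.3f}")  # close to +1: a strong positive relationship
```

Because the two made-up series move almost perfectly together, R lands very close to positive one.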
So for example, as salary increases, disposable income increases; that's a positive relationship. When the correlation coefficient R is less than zero, that indicates a negative relationship: as one variable increases, the other decreases. An example here is the rate of inflation going up while the number of products purchased in a particular store over a period of time goes down. That's a negative relationship, and we would expect the correlation coefficient to be less than zero in that case. Now, as the absolute value of R gets closer to one, it indicates an increasingly strong relationship. The absolute value of negative one is one, and the absolute value of positive one is one, so the closer R gets to either extreme, the stronger the relationship: an R value close to negative one indicates a strong negative relationship, and an R value close to positive one indicates a strong positive relationship. Now, there are a few different statistics we can use for correlation analysis. One is Pearson's correlation coefficient, which measures the linear correlation between two continuous variables that are normally distributed. And that's really important: we can only use Pearson's correlation when our two variables are normally distributed, so we want to check that both are reasonably normally distributed before we use it. It's okay if they're not, because we have other techniques. For example, Spearman's rank correlation measures the monotonic relationship between two non-normally distributed variables. Monotonic means consistently increasing or decreasing, not going up and down.
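One way to apply that check-then-choose workflow is sketched below, assuming SciPy is installed; the Shapiro-Wilk normality test, the `correlate` helper, and the data are all illustrative choices, not prescribed by the course.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(50, 10, size=200)                    # roughly normal
y = np.exp(x / 20) + rng.normal(0, 0.5, size=200)   # skewed, but monotonic in x

def correlate(a, b, alpha=0.05):
    """Use Pearson's r if both variables look normal, else fall back to Spearman."""
    if stats.shapiro(a).pvalue < alpha or stats.shapiro(b).pvalue < alpha:
        r, p = stats.spearmanr(a, b)
        return "spearman", r, p
    r, p = stats.pearsonr(a, b)
    return "pearson", r, p

method, r, p = correlate(x, y)
print(method, round(r, 3))  # y is heavily skewed, so this picks Spearman
```

Because `y` is built from an exponential transform, its distribution is skewed, the normality test fails for it, and the helper switches to Spearman's rank correlation, which still detects the strong monotonic relationship.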
So Spearman's rank correlation is really useful when we're dealing with non-normally distributed variables: if we check our variables and see that they're not normally distributed, we cannot use Pearson's, but we can switch to Spearman's. There is a third correlation statistic we can use called Kendall's tau, the Kendall tau correlation coefficient, and that's used with small data sets whose variables are not normally distributed. So keep that in mind: if your data set is small, Kendall's tau can be a good choice, but in general Spearman's rank correlation is a good one to use when we're dealing with non-normally distributed variables. Now, I want to point out some limitations to keep in mind when doing correlation analysis. These techniques assume a linear, or at least monotonic, relationship; if the relationship has some other non-linear shape, they are not going to be helpful. We also need to watch out for outliers. This is the case where maybe there was an error in a measurement, a faulty sensor, or a data-loading error, so we didn't get the correct data. If outliers that really don't belong are in our data when we do correlation analysis, our results can really be thrown off. This is why, in data analysis, data science, or machine learning, one of the first things we often do is a data quality assessment: we check for outliers, and if there are outliers that really don't belong, because of something like a measurement error, we want to eliminate those from our data set. Now, it is important not to eliminate outliers simply because they don't fit our hypothesis. That is something we definitely need to avoid. We don't want to doctor our data set just because it isn't matching what we're expecting.
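Both points above, Kendall's tau on a small sample and the sensitivity of Pearson's r to a single bad value, can be sketched as follows, assuming SciPy; the inflation figures and the simulated data-loading error are invented.

```python
import numpy as np
from scipy import stats

# Small data set (n = 8): a setting where Kendall's tau is appropriate
inflation = np.array([1.2, 1.8, 2.1, 2.5, 3.0, 3.4, 4.1, 4.8])
items_bought = np.array([52, 50, 49, 47, 45, 44, 41, 38])
tau, p = stats.kendalltau(inflation, items_bought)
print(f"tau = {tau:.2f}")  # -1.0: perfectly monotonic, negative relationship

# A single faulty measurement can swing Pearson's r dramatically
x = np.arange(10, dtype=float)
y = 2 * x + 1                              # perfectly linear
r_clean = stats.pearsonr(x, y).statistic   # essentially 1.0 on clean data
y[-1] = 500                                # simulate one data-loading error
r_dirty = stats.pearsonr(x, y).statistic   # falls well below 1
print(round(r_clean, 3), round(r_dirty, 3))
```

One bad value in ten is enough to pull r from a perfect 1.0 down substantially, which is why an outlier check belongs in the data quality assessment before running these statistics.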
So there are different kinds of outliers, and we need to be careful about which kind we eliminate: we want to remove the ones associated with things like measurement errors. If outliers do remain, these statistical tests may not work very well; they can be adversely influenced by them. Another thing to keep in mind is that correlation does not imply causation, and that's important. It may feel intuitive that a rising rate of inflation is the cause of people purchasing fewer items, and that may be the case, but the correlation statistics don't tell us that; it's not something it is safe to assume. There has been a lot of work in statistics and in areas of AI on developing techniques for detecting causation, but those are not the correlation statistics we are talking about here. So it's important to remember that correlation doesn't mean a change in one variable necessarily causes the change in the other. We need to do more investigation beyond this kind of correlation analysis to come to that kind of conclusion.