From the course: Complete Guide to Generative AI for Data Analysis and Data Science

Linear regression

- [Instructor] Linear regression is a method for predicting relationships between variables. Now, there are two types of variables that we're going to be talking about with regard to linear relationships, and they are independent and dependent variables. Independent variables are thought to influence a dependent variable, and a dependent variable is one that we presume is influenced by one or more independent variables. An example can really help distinguish what this means. Let's think about the amount of fertilizer that we might put into a growing area. That will influence the number of crops, or the size of the crops, that are growing. So in this case, the size or the number of the crops is the dependent variable. The amount of fertilizer we choose to put in is an independent variable. In other words, it doesn't depend on, say, the size of the crop that finally comes out. Instead, the size of the crop itself is highly influenced by how much fertilizer we put in. Another example is how much we exercise and what our diet is. Those are independent variables that can influence things about our health, including things like our blood pressure. So in that case, our blood pressure would be a dependent variable, and exercise and diet would be a couple of independent variables.

Now, there are a couple of types of linear regression. There's simple linear regression, where we have one continuous dependent variable, so we're trying to predict one continuous value, and one independent variable, which can be continuous, binary, or categorical. We also have multiple linear regression. Here we're still trying to predict just one continuous dependent variable, like the average size of the crops or our average blood pressure over time, but there may be two or more independent variables, and those independent variables can be continuous, binary, or categorical. We're going to focus primarily on simple linear regression because we just want to get the ideas across.

But I do want to point out an example of a simple linear regression, because here's essentially what we're trying to do. What we've done here is plot some data points. When we talk about building a regression model, what we're doing is fitting a straight line, so there are no curves or bends in the line, that fits those data points as closely as possible. We measure, or determine, the overall closeness by looking at the distance between where a data point actually is and where the corresponding point is on the line. So for example, at the X value of about six, we see that the prediction is about 164 or so, but the actual value is slightly higher, maybe 166 or 167 on the Y axis, so there's a difference. If the predicted value is 164 and the actual value is 167, our difference, or the residual, is three. What we try to do when we fit this line is minimize the overall size of those residuals, technically the sum of their squares, and that's what linear regression does for us. This is useful because now we can start making predictions. We see that our X axis goes out to 16, but we might want to predict the value when the X value, the independent variable, is 18 or 20 or 30. We now have a formula, in a sense, because we have a line, so we can make predictions by extending the line out and then calculating the Y value for any given X by using that line, or the formula for that line.
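To make that concrete, here is a minimal sketch in Python of fitting a simple linear regression with NumPy. The data points below are invented for illustration (they are not the exact points from the chart), but the steps mirror what was just described: fit a line, look at the residuals, then predict beyond the observed X values.

```python
# Simple linear regression sketch on made-up data (for illustration only).
import numpy as np

# Independent variable (X) and dependent variable (y), e.g. fertilizer vs. crop size.
X = np.arange(1, 17, dtype=float)
y = np.array([152, 154, 157, 159, 160, 167, 166, 170,
              172, 175, 178, 179, 183, 185, 188, 190], dtype=float)

# Fit a straight line y = intercept + slope * X by least squares.
slope, intercept = np.polyfit(X, y, deg=1)

# Residuals: actual value minus the value the fitted line predicts.
predictions = intercept + slope * X
residuals = y - predictions
print("slope:", round(slope, 2), "intercept:", round(intercept, 2))
print("sum of squared residuals:", round(np.sum(residuals ** 2), 2))

# Because we now have the line's formula, we can predict beyond the observed range.
for new_x in (18, 20, 30):
    print(f"predicted y at X={new_x}:", round(intercept + slope * new_x, 1))
```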
So there are some components in the linear regression equation, or formula. There is the dependent variable, on the Y axis; that's the thing we're trying to predict. There's the independent variable or variables; in this case it's just one, and it's on the X axis. There's the Y intercept, which is where the line that we fit, the regression line, crosses the Y axis. In our example it looks pretty close to zero; it's going to cross somewhere near there, maybe a little above or a little below zero. Then, for each of the independent variables, each of the X values, there is a coefficient, a number that we're trying to learn or calculate, and that coefficient is the thing we tweak to minimize the sum of the squared residuals, the total distance of the errors. And finally there is an error term, which accounts for random error.

Now, for those of you who like equations, the general equation for linear regression says that Y, our dependent variable, is equal to beta sub zero, our Y intercept, plus beta sub one times X sub one. So for our independent variable, X sub one, we learn or calculate a coefficient called beta sub one. If this is multiple linear regression, we might have an X sub two, X sub three, X sub four, and so on up through X sub K, and each of those would have its own coefficient associated with it, beta sub two, beta sub three, and so on up through beta sub K. So that's the general formula for a linear regression.

Now, about those coefficients, the beta values, some things to keep in mind. A coefficient is the change in the mean of the dependent variable, the Y, associated with a one unit increase in the independent variable. So a one unit increase in the independent variable X corresponds to a change of that coefficient in the mean of the dependent variable. Positive coefficients indicate a positive relationship, so as X goes up, Y goes up, and as X goes down, Y goes down. Negative coefficients indicate a negative relationship, so as X goes up, Y goes down, and so forth. Those are some general properties of the coefficients to keep in mind, and again, what we're trying to do when we build a model is figure out what those coefficients are.

Now, some other assumptions we want to keep in mind. We're assuming that the data has a linear relationship between the independent and dependent variables; that means there are no curves, no seasonality, no peaks and troughs if you were to graph it out. The observations are independent; that is, each data point is independent. So for example, if we're building a linear regression model around the selling price of houses, we assume the selling events are independent: if I sell my house and you sell your house, they're not tied together, so we consider them independent. We also assume that for any value of X, Y is normally distributed. And the final assumption is that the variance of the residuals is constant across all values of the independent variables, so we're not seeing wide swings in the spread of the residuals as the values of the independent variables change.
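Written out, that general equation is Y = β0 + β1X1 + β2X2 + ... + βkXk + ε. Below is a small sketch, using scikit-learn, of fitting a multiple linear regression and reading off the intercept and coefficients. The exercise-and-diet numbers are made up purely to illustrate how to read the sign and magnitude of each coefficient.

```python
# Multiple linear regression sketch on invented data (hypothetical exercise
# hours and daily calorie intake predicting blood pressure).
import numpy as np
from sklearn.linear_model import LinearRegression

# Two independent variables: weekly exercise hours and daily calories.
X = np.array([
    [1, 2600], [2, 2400], [3, 2500], [4, 2200],
    [5, 2300], [6, 2000], [7, 2100], [8, 1900],
], dtype=float)
# Dependent variable: systolic blood pressure.
y = np.array([138, 134, 136, 130, 132, 126, 128, 122], dtype=float)

model = LinearRegression().fit(X, y)
print("intercept (beta_0):", round(model.intercept_, 2))

# Each coefficient is the expected change in y for a one-unit increase in
# that variable, holding the other variable fixed.
for name, coef in zip(["exercise_hours", "daily_calories"], model.coef_):
    direction = "negative" if coef < 0 else "positive"
    print(f"{name}: {coef:.4f} ({direction} relationship)")
```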
And the term for that is homoscedasticity. You'll hear that sometimes when you're talking about data, both in statistics and in machine learning, so if you come across that term, that's basically what we're talking about: we want the variance of the residuals to be roughly the same across the values of the independent variables, or in the case of machine learning, the features.
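As a rough illustration (not a formal test such as Breusch-Pagan), here is one informal way to eyeball homoscedasticity in Python: fit the line, then compare the spread of the residuals in the lower and upper halves of the fitted values. The simulated data below is an assumption for demonstration; in practice you would also plot the residuals against the fitted values.

```python
# Informal homoscedasticity check on simulated data (sketch, not a formal test).
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(1, 16, 80)
y = 150 + 2.5 * X + rng.normal(0, 2, size=X.size)  # constant-variance noise

slope, intercept = np.polyfit(X, y, deg=1)
fitted = intercept + slope * X
residuals = y - fitted

# If the variance is constant, these two spreads should be roughly equal.
lower = residuals[fitted <= np.median(fitted)]
upper = residuals[fitted > np.median(fitted)]
print("residual std, lower half of fitted values:", round(lower.std(), 2))
print("residual std, upper half of fitted values:", round(upper.std(), 2))
```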
