📉 Why Do Linear Models Fail to Fit a Linear Trend with a Change Point?

When fitting a linear regression to a time series with a linear trend, the results are often surprisingly good. But what happens when the underlying trend is still linear—yet includes a change point? In practice, a single linear regression will usually introduce bias. If you use that model for forecasting, the error becomes systematic and typically grows with the forecast horizon (assuming the most recent trend continues).

🤔 Why does this happen? The mathematical intuition

A linear trend with one or more change points effectively turns the series into a shape that closely resembles a convex function. From basic properties of convex functions, a straight line can intersect a convex curve at most twice. After the second intersection, the distance between the line and the curve grows as time moves forward—exactly what we observe as increasing forecast error.

🧠 A more intuitive explanation (via linear regression properties)

Simple linear regression has a few key constraints:
➡️ It is a weighted average of all observations
➡️ It must pass through the point defined by the mean of x and the mean of y

When a change point is present, these constraints force the model to “compromise” between different regimes. As a result, the fitted line struggles to represent either trend accurately—especially near the boundaries—leading to biased fits and poor extrapolation. The plot below illustrates this effect clearly 👇

🛠️ A practical fix

A simple and effective solution is piecewise regression with knot features, which allows the model to estimate separate linear trends before and after the change point—without abandoning interpretability.

Want to learn more about time series analysis and forecasting? 📬 Subscribe to 𝐓𝐡𝐞 𝐅𝐨𝐫𝐞𝐜𝐚𝐬𝐭𝐞𝐫 newsletter: https://lnkd.in/gwsWSD_d Follow Rami Krispin for more content on AI, MLOps, and forecasting. #datascience #forecasting #stats
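A minimal sketch of the knot-feature fix on synthetic, noiseless data (scikit-learn and the change point at t = 50 are illustrative assumptions): a hinge term max(t − knot, 0) lets the fitted slope change at the knot while the model stays a plain linear regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic series: slope 1 before t = 50, slope 3 after (hypothetical change point).
t = np.arange(100, dtype=float)
knot = 50.0
y = t + 2.0 * np.maximum(t - knot, 0.0)  # piecewise-linear trend, no noise

# Knot feature: the hinge term max(t - knot, 0) allows a slope change at the knot.
X = np.column_stack([t, np.maximum(t - knot, 0.0)])
model = LinearRegression().fit(X, y)

# The fit recovers both regimes: slope before = coef[0],
# slope after = coef[0] + coef[1].
print(model.coef_)  # approximately [1.0, 2.0]
```

In practice the knot location is either known from domain context or estimated (e.g., by scanning candidate knots); the point here is only that one interpretable linear model can hold two trends.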
Regression Analysis and Modeling
Summary
Regression analysis and modeling help researchers understand and predict relationships between variables by fitting mathematical models to data. Whether you’re working with continuous, binary, or complex interconnected data, these methods guide decision-making and uncover patterns that would otherwise be hard to see.
- Map relationships: Start by clarifying your research question and use domain knowledge to identify which variables are connected and worth investigating with regression models.
- Choose wisely: Select the type of regression or modeling approach based on the nature of your data, whether it’s continuous, categorical, or involves complex indirect effects.
- Validate and interpret: Always check that your model fits the data well and focus on making results understandable and useful for your specific context or audience.
-
I’ve seen these formulas so many times, but I never noticed this until now... Why is there an explicit error term in multiple linear regression (OLS) but not in binary logistic regression? 🤔

This difference is more than just a minor detail; it reflects fundamental distinctions in how these models operate and what they’re designed to predict.

In multiple linear regression, we predict a continuous outcome and include an explicit error term to capture the difference between the observed and predicted values. This error term, or residual, helps measure how well the model fits the data, aiming to minimize these errors for better accuracy.

But in binary logistic regression, the model predicts probabilities of binary outcomes, such as yes/no or true/false. Instead of using an explicit error term, logistic regression employs the logistic function to map predictions to probabilities between 0 and 1. The "error" here is implicit, captured through the likelihood of the observed outcomes given the model parameters, and is optimized using maximum likelihood estimation.

✔️ Key takeaway: Linear regression uses an error term to minimize errors in predicting continuous outcomes, while logistic regression optimizes for probabilities in binary outcomes without an explicit error term. It's fascinating how these subtle differences in formulas can have a big impact on how we model and understand data.

Additional Notes: I've received some interesting insights from comments and discussions on this topic that may offer further clarity. Here’s a brief summary:

🔹 GLM Framework: Linear and logistic regression both fit within the Generalized Linear Model (GLM) family. Linear regression models continuous outcomes directly, while logistic regression models the probability of binary outcomes using a link function (logit) to linearize probabilities.
🔹 Zero Mean Error Assumption: In linear regression, the error term accounts for deviations, with the assumption of a zero mean, allowing accurate estimation of the expected outcome. In logistic regression, no explicit error term is needed, as the model directly estimates probabilities through the link function.
🔹 Residuals in Log-Odds: While logistic regression doesn’t have traditional residuals as in linear regression, model fit can still be assessed by examining differences between predicted and observed values in log-odds.
🔹 Single vs. Two-Parameter Models: Logistic regression is a single-parameter model focused on the predictor, while linear regression typically includes a variance parameter, making it more flexible for data with additional variability.

More: https://lnkd.in/d-UAgcYf #analysisskills #database #datastruc…
-
Beyond Regression: Why Structural Equation Modeling (SEM) is a Game Changer for UX Research

In user experience research, the tools we choose can shape the depth and clarity of our findings. Regression analysis, a reliable workhorse for many researchers, often provides an excellent starting point for identifying relationships between variables. But when it comes to uncovering the nuanced and interconnected dynamics of user behavior, regression may not always be enough. This realization hit home during a project on optimizing visual design for memory recall using the Rule of Thirds, where structural equation modeling (SEM) proved invaluable.

Initially, regression helped us establish a direct relationship between the visual alignment of elements and memory performance. It was quick and clear, showing a correlation that seemed actionable. However, the more we probed, the more evident it became that we were missing the full picture. Turning to SEM, we were able to model not just the direct effects but also the indirect relationships, like how visual attention mediated the link between alignment and memory. SEM also allowed us to explore latent variables, such as user focus, which regression couldn’t adequately address. The insights were richer, more actionable, and far better aligned with the complexity of real-world user interactions.

So, what’s the real difference between regression and SEM? Regression shines when the relationships between variables are straightforward and linear. It’s efficient and excellent for testing direct effects. But UX research often deals with interconnected systems where user satisfaction, cognitive load, and task completion influence each other in intricate ways. SEM steps in here as a more advanced method that models these complexities. It allows you to include latent variables, account for indirect effects, and visualize the interplay between multiple factors, all within a single framework.

One of the most valuable aspects of SEM is its ability to uncover relationships you might not even think to test using regression. For example, while regression can tell you that a particular design change improves task completion rates, SEM can show how that improvement is mediated by reduced cognitive load or increased user trust. This kind of insight is critical for designing experiences that go beyond surface-level success metrics and truly resonate with users.

To be clear, this isn’t an argument to abandon regression altogether. Each method has its place in a researcher’s toolkit. Regression is great for quick analyses and when the problem is relatively simple. But when your research involves complex systems or layered relationships, SEM provides the depth and clarity needed to make sense of it all. Yes, it’s more resource-intensive and requires a steeper learning curve, but the payoff in terms of actionable insights makes it worth the effort.
-
📊 How NOT to Select Variables in a Regression Model

Let’s start with a truth that’s often overlooked: 🔍 Do you even need a regression model in the first place?

Many researchers think they must include a regression model to publish. But that’s far from reality. Take MMWR—one of the most influential journals in public health (Impact Factor: 70.2 for the Recommendations and Reports series, ranked #1 in its category). MMWR is almost entirely based on descriptive analysis, and regression models are rare. I’ve published maybe a dozen articles in MMWR—only one or two included a regression model. So yes, you can produce impactful, policy-shaping work without regression.

🎯 But let’s say you must use a regression model—how do you select variables? Here are four wrong approaches (sadly, very common):

🚫 1. Backward Elimination
You start with a huge model, then drop variables one by one until only the significant ones remain.
📉 Why it’s wrong: It’s like gaming the system. There’s no theoretical basis—you’re outsourcing your brain to a computer.

🚫 2. Forward Selection
Start with one variable, keep adding more while checking model performance.
📉 Same problem: The machine drives the science, not your subject matter expertise.

These methods:
❌ Are sensitive to sample size and quirks in the data
❌ Often lead to overfitting or spurious results
❌ Ignore conditional confounding, colliders, and mediators

🚫 3. Blindly Copying Confounders from the Literature
Just because something was a confounder in their study doesn’t mean it’s a confounder in yours. Confounding is contextual, especially in social and behavioral research. You must evaluate confounders in the context of your data, your population, and your exposure-outcome relationship.

🚫 4. Outsourcing Your Thinking
It’s okay to seek advice—but if this is your study, you should have a solid grasp of the relationships you're examining. You’re the tailor; the model is the dress. You need to know what kind of stitches (variables) will fit well. A good tailor doesn’t pick threads by lucky draw.

✅ So What’s the Right Way?
👉 Use subject matter expertise as your foundation.
👉 Use tools like Directed Acyclic Graphs (DAGs) to map and visualize relationships between your exposure, outcome, and potential confounders.
👉 Always ask: “What variables lie on the causal path? What blocks bias? What opens a backdoor?”

If you don’t have a firm understanding of the causal structure, don’t fake it with machine-driven variable selection. Stick to descriptive analysis. That’s not inferior—it’s honest.

📌 Regression models are not scripture. As the famous quote goes: “All models are wrong, but some are useful.” Let’s make sure ours are useful.

🧠 Scope of Applicability
This guidance applies most directly to epidemiological causal inference studies. If you’re doing predictive modeling (e.g., machine learning for forecasting), different variable selection techniques may apply—but always with context and clarity.
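The DAG step can be sketched in code. This uses networkx and entirely hypothetical variables; the rule shown (a confounder is a common ancestor of exposure and outcome that is not a descendant of the exposure) is a crude approximation of proper backdoor-path analysis, offered only to show how encoding the causal structure lets a program check it for you.

```python
import networkx as nx

# A toy causal DAG with hypothetical variables:
# age confounds smoking -> cancer; biomarker mediates it.
dag = nx.DiGraph([
    ("age", "smoking"), ("age", "cancer"),              # confounding paths
    ("smoking", "biomarker"), ("biomarker", "cancer"),  # mediated path
])
assert nx.is_directed_acyclic_graph(dag)

exposure, outcome = "smoking", "cancer"

# Crude confounder check: ancestors of both exposure and outcome
# that are not downstream of the exposure itself.
confounders = {
    v for v in dag.nodes
    if v not in (exposure, outcome)
    and nx.has_path(dag, v, exposure)
    and nx.has_path(dag, v, outcome)
    and not nx.has_path(dag, exposure, v)
}
print(confounders)  # age should be adjusted for; the mediator should not
```

Here the check flags age but not biomarker: adjusting for the mediator would block part of the causal path, which is precisely the mistake automated variable selection cannot see.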
-
*** Selecting the Statistical Model for Health Research ***

Choosing the correct statistical model for health research involves several key steps:

1. Define the Research Question: Clearly articulate your investigation’s aim, identifying risk factors, evaluating treatment effectiveness, or forecasting outcomes.
2. Understand the Data Type: Determine the nature of your dependent variable—continuous (e.g., blood pressure), categorical (e.g., disease presence), or time-dependent (e.g., survival data)—as this will influence your model choice.
3. Explore the Data: Perform exploratory data analysis (EDA) to uncover insights. Check for missing values, analyze variable distributions, and visualize relationships using scatterplots and histograms.
4. Choose a Model Based on Data Structure: Select an appropriate model:
- Regression Models: Linear regression is used for continuous outcomes, and logistic regression is used for binary outcomes.
- Survival Analysis: Use Cox proportional hazards models for time-to-event data.
- Longitudinal Models: Consider mixed-effects models for repeated measures.
- Machine Learning: Deploy techniques like random forests or neural networks for complex datasets.
5. Check Assumptions: Verify your model meets necessary assumptions (normality, homoscedasticity, independence). If not, consider data transformations or alternative models.
6. Model Selection Criteria: Use evaluation metrics like adjusted R², AIC, and BIC to assess model performance and apply cross-validation techniques for generalization.
7. Interpretability and Clinical Significance: Ensure the model’s outcomes are clinically relevant and interpretable, as practical applicability is crucial for making a meaningful impact.

By following these steps, researchers can effectively select statistical models that yield valuable insights to improve healthcare.

--- B. Noted
-
Choosing the right model shouldn’t be a guessing game. Not all regression models "see" data the same way. While a Linear Regression seeks a straight path, an XGBoost or Decision Tree navigates the noise through segments. In this visual guide, we’ve broken down 9 essential regression techniques to help you visualize:
✅ Bias vs. Variance: See which models overfit vs. which remain smooth.
✅ Linear vs. Non-Linear: Compare the rigid lines of Linear Regression to the flexible curves of Polynomial and Neural Networks.
✅ Model Architecture: Notice how Tree-based models (Random Forest, XGBoost) create "steps" while distance-based models (k-NN) react to local density.
Whether you're a student or a seasoned ML Engineer, this chart serves as a quick mental map for your next project. Created by Antara and Aditya at Neuroxsentinel. #DataScience #MachineLearning #ArtificialIntelligence #Regression #Neuroxsentinel #Coding #DeepLearning #Statistics
-
𝗟𝗶𝗻𝗲𝗮𝗿 𝗥𝗲𝗴𝗿𝗲𝘀𝘀𝗶𝗼𝗻 𝗘𝘅𝗽𝗹𝗮𝗶𝗻𝗲𝗱 (𝗹𝗶𝗸𝗲 𝗮 𝗿𝗲𝗮𝗹 𝗲𝗻𝗴𝗶𝗻𝗲𝗲𝗿)

Most beginners think Linear Regression is just a formula. It’s not. It’s your first real predictive system.

𝗪𝗵𝗮𝘁 𝗶𝘁 𝗮𝗰𝘁𝘂𝗮𝗹𝗹𝘆 𝗱𝗼𝗲𝘀
Linear Regression learns a straight-line relationship between input and output. Example:
House size → Price
Study hours → Marks
Experience → Salary
You give historical data. It learns a rule. Then it predicts future values.

𝗧𝗵𝗲 𝗰𝗼𝗿𝗲 𝗶𝗱𝗲𝗮
𝒚 = 𝒎𝒙 + 𝒃
Where:
y ⤷ prediction
x ⤷ input feature
m ⤷ slope (how strongly x affects y)
b ⤷ intercept (base value)
In ML terms: Prediction = (weight × input) + bias. That’s it. Everything else is optimization.

𝗛𝗼𝘄 𝘁𝗵𝗲 𝗺𝗼𝗱𝗲𝗹 𝗹𝗲𝗮𝗿𝗻𝘀
It starts with random weights. Then repeats this loop: Predict → Measure error → Adjust weights → Repeat. The error is calculated using Mean Squared Error. Weights are updated using Gradient Descent. In simple words: it keeps nudging the line until predictions fit the data.

𝗙𝗹𝗼𝘄
Data ↳ Model guesses ↳ Error calculated ↳ Weights updated ↳ Better guesses ↳ Repeat
Eventually: best-fit line achieved.

𝗪𝗵𝗲𝗿𝗲 𝗶𝘁 𝗶𝘀 𝘂𝘀𝗲𝗱 𝗶𝗻 𝗿𝗲𝗮𝗹 𝗹𝗶𝗳𝗲
Salary prediction
Sales forecasting
Demand estimation
Risk scoring
Baseline ML models
Almost every ML pipeline starts here. Even deep learning engineers use Linear Regression as a sanity check.

𝗦𝗶𝗺𝗽𝗹𝗲 𝗣𝘆𝘁𝗵𝗼𝗻 𝗘𝘅𝗮𝗺𝗽𝗹𝗲
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
prediction = model.predict([[5]])
print(prediction)
That’s production-grade regression in five lines.

𝗪𝗵𝗲𝗻 𝗶𝘁 𝘄𝗼𝗿𝗸𝘀 𝘄𝗲𝗹𝗹
⤷ Relationship is roughly linear
⤷ Data is clean
⤷ Outliers are controlled

𝗪𝗵𝗲𝗻 𝗶𝘁 𝗯𝗿𝗲𝗮𝗸𝘀
⤷ Complex nonlinear patterns
⤷ Heavy outliers
⤷ Feature interactions
That’s when trees or neural nets step in.

𝗧𝗵𝗲 𝗯𝗶𝗴 𝗹𝗲𝘀𝘀𝗼𝗻
Linear Regression teaches you: how models learn, how loss works, how optimization behaves, how features influence predictions. If you truly understand this… everything else in ML becomes easier.

𝗧𝗟;𝗗𝗥
Linear Regression isn’t basic. It’s foundational. It shows how machines turn data into decisions. Master this properly, and half of ML stops feeling mysterious.

---
📕 400+ 𝗗𝗮𝘁𝗮 𝗦𝗰𝗶𝗲𝗻𝗰𝗲 𝗥𝗲𝘀𝗼𝘂𝗿𝗰𝗲𝘀: https://lnkd.in/gv9yvfdd
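The sklearn snippet above assumes X_train and y_train already exist. A self-contained version, with hypothetical study-hours data invented for illustration, might look like:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: study hours -> exam marks.
X_train = np.array([[1], [2], [3], [4], [6]], dtype=float)
y_train = np.array([52, 58, 65, 70, 83], dtype=float)

model = LinearRegression()
model.fit(X_train, y_train)

# The learned rule: weight (m) and bias (b) from y = mx + b.
print(model.coef_, model.intercept_)

# Predict the mark for 5 hours of study (a value not in the training set).
prediction = model.predict(np.array([[5.0]]))
print(prediction)
```

Note the 2-D shape of the input to predict: sklearn expects a matrix of samples × features, which is why the original snippet passes [[5]] rather than [5].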
-
Statistics - The Generalised Linear Model (GLM).

The GLM is a very useful and powerful modeling tool. When we try to model some response variable Y, the first thing that usually comes to mind is Linear Regression. However, sometimes Y doesn't fit a linear regression, because the process that generated the data may come from a different distribution, and modeling Y directly becomes difficult.

Instead, the GLM models the average response given some predictors, E(Y|X). The average response may be modelled as a direct linear combination of predictors, or some transformation of the average response is linear. Written as: g[E(Y|X)] = XB. Here g() is called the link function, and it is chosen based on the distribution of Y.

If Y is normally distributed, an identity link is chosen, and the GLM becomes ordinary Linear Regression. If Y is Binomial, i.e. binary outcomes, a logit link is required to transform the average response (the estimated probability) into logit space, which can then be modelled as a linear combination of X. This is Logistic Regression. If Y is Poisson, i.e. count data, then a log link is required to transform the average response (the estimated Poisson mean) into a linear model. This is Poisson Regression. The Poisson distribution carries a strong assumption: that the mean and variance are equal. More generally for count data, a Negative Binomial family may be used. For binary data, the logit link is the clear choice. In general, the distribution of Y must be established and tested first.

Uses:
1. Identity link - ordinary linear regression. This is common.
2. Logit link - logistic regression. Used for Yes/No or Good/Bad modeling, for example credit risk, fraud, or churn models.
3. Log link with Poisson / Negative Binomial families - count data. Common in the insurance industry, to model claims counts and premium pricing.

These are only a few common link functions; there are plenty more. Remember, when studying data, we often forget about Y. The behaviour of the response variable Y guides us in terms of what model to use.

PS: The Generalised Linear Model isn't exactly the same as the General Linear Model. Don't confuse the two.
-
𝐃𝐚𝐭𝐚 𝐒𝐜𝐢𝐞𝐧𝐜𝐞 𝐈𝐧𝐭𝐞𝐫𝐯𝐢𝐞𝐰 𝐐𝐮𝐞𝐬𝐭𝐢𝐨𝐧: After fitting a linear regression model, you notice that the residuals are not normally distributed. How would you diagnose the cause of this issue?

A go-to plot for Linear Regression is the Residual vs Fitted plot. It plots the residuals (the differences between the observed values and the predicted values) against the fitted values (the values predicted by the model). It's a good tool to check for non-linearity, heteroscedasticity, and outliers. A few simple patterns to check for -

𝘙𝘢𝘯𝘥𝘰𝘮 𝘚𝘤𝘢𝘵𝘵𝘦𝘳 (𝘐𝘥𝘦𝘢𝘭 𝘊𝘢𝘴𝘦): If the residuals are randomly scattered around 0, this suggests that the model fits the data well and that the assumptions of linearity and homoscedasticity (constant variance of residuals) are reasonably satisfied.

𝘊𝘶𝘳𝘷𝘦𝘥 𝘗𝘢𝘵𝘵𝘦𝘳𝘯 (𝘕𝘰𝘯-𝘓𝘪𝘯𝘦𝘢𝘳𝘪𝘵𝘺): A curved or systematic pattern in the residuals suggests that the relationship between the predictors and the target variable is not linear, and that the model may be missing key non-linear relationships. Solution: Consider adding polynomial terms (e.g., square or cubic terms) or trying transformations (e.g., log or square root) on either the dependent or independent variables.

𝘍𝘶𝘯𝘯𝘦𝘭 𝘚𝘩𝘢𝘱𝘦 (𝘏𝘦𝘵𝘦𝘳𝘰𝘴𝘤𝘦𝘥𝘢𝘴𝘵𝘪𝘤𝘪𝘵𝘺): A funnel shape in the residuals (where the spread of residuals increases or decreases as fitted values increase) indicates heteroscedasticity, meaning the residual variance changes across the range of fitted values. Solution: Apply transformations to the dependent variable (e.g., log-transforming the target) or use Weighted Least Squares (WLS) regression.

𝘖𝘶𝘵𝘭𝘪𝘦𝘳𝘴 𝘰𝘳 𝘏𝘪𝘨𝘩 𝘓𝘦𝘷𝘦𝘳𝘢𝘨𝘦 𝘗𝘰𝘪𝘯𝘵𝘴: Residuals far from the horizontal axis, or points that significantly deviate from the bulk of other points, may be outliers or influential points that disproportionately affect the model. Solution: Investigate these points further using measures like Cook's distance or leverage values. Outliers might be removed or treated depending on the context.

𝘏𝘰𝘳𝘪𝘻𝘰𝘯𝘵𝘢𝘭 𝘉𝘢𝘯𝘥𝘴 𝘸𝘪𝘵𝘩 𝘕𝘰 𝘚𝘵𝘳𝘶𝘤𝘵𝘶𝘳𝘦: Points are uniformly spread with no visible clustering or patterns, which is the desired case. If residuals form random, horizontal bands around 0 with no discernible pattern, the model is likely correctly specified, and the linearity and homoscedasticity assumptions are likely satisfied.

For more questions, grab a copy of Decoding ML Interviews - a book with 100+ ML questions here - https://lnkd.in/gc76-4eP 𝐋𝐢𝐤𝐞/𝐂𝐨𝐦𝐦𝐞𝐧𝐭 to see more such content. 𝗙𝗼𝗹𝗹𝗼𝘄 Karun Thankachan for all things Data Science.
-
A/B Testing/Experimentation is almost always just regression under the hood. Basic A/B tests, multivariate tests/ANOVA, interaction checks between simultaneous A/B tests, checks of whether different user segments respond differently to treatments, and variance reduction - all of them can be performed with regression. Based on the choice of data coding scheme and what (if any) additional data to include in the model, the following can be easily recast as basic regression:

1) Pairwise t-tests - this is the basic difference-in-means A/B test. Design matrix coded with an intercept and a single dummy variable, with 'Control' as the reference class.
2) MVT/Factorial ANOVA/F-tests - design matrix with effects coding, such that the grand mean is the 'reference'.
3a) User segment treatment effects via partial F-test. Two design matrices: a) main-effect dummy variables only; b) a fully interacted design matrix. The F-test is the ratio of the residual sum of squares of the main-effects model over the interacted model.
3b) Pairwise A/B test interaction via partial F-test (see 3a).
4) Covariate adjustment - CUPED/ANCOVA. Design matrix includes the treatment dummy, mean-centered pre-experiment covariate(s), and a dummy interaction term of covariate with treatment.*

AND, it turns out that ALL of these test design matrices can be constructed from k-anonymous data! That means that not only do all of these disparate tasks roll up into one general approach, but with a bit of upfront thought, all of them can be performed when following privacy by design.

* Technical note: unless Na = Nb, one probably should account for heteroskedastic errors rather than use the OLS standard errors.