Timeline for What regression method should I use for non-normal, outlier-heavy biomedical data with a continuous outcome?
Current License: CC BY-SA 4.0
26 events
| when | what | action | by | license / comment | |
|---|---|---|---|---|---|
| Mar 25, 2025 at 16:28 | answer | added | EdM | timeline score: 5 | |
| Mar 23, 2025 at 17:54 | comment | added | Nikimiskata | @Ed Thanks! I’ve removed individuals with conditions that could potentially distort the analysis. I also reviewed some of the outliers to see if there were any underlying diagnoses that might explain the extreme values, but that wasn’t the case. So for now, I’m assuming these outliers could carry some biological relevance and I’ve decided to retain them for further analysis. | |
| Mar 23, 2025 at 15:18 | comment | added | EdM | The y-axis values in your leftmost boxplot seem much more consistent with platelet rather than erythrocyte counts. See this StatPearls page. If that's the case, then the extremely high value in that boxplot (>1000) would seem to be from someone with thrombocytosis that might be sufficiently severe to be clinically actionable. If your interest is in the normal variability of platelet counts that might be associated with levels of interleukins, etc., you probably want to remove that case like you did with pregnant individuals. | |
| Mar 23, 2025 at 1:03 | history | edited | Nikimiskata | CC BY-SA 4.0; added 131 characters in body | |
| Mar 23, 2025 at 0:57 | comment | added | Nikimiskata | @jginestet That said, the residuals are still not perfectly normally distributed — the distribution is better, but still shows some deviation from the diagonal, likely due to remaining skewness or heavy tails. I’ve uploaded the updated version in my original post. Really appreciate your guidance — if you have suggestions for further improving the residual distribution, I’d be happy to try them! | |
| Mar 23, 2025 at 0:56 | comment | added | Nikimiskata | @jginestet Thanks a lot for pointing this out — you were absolutely right. I checked again and realized that my outcome variable (platelet counts) had a very wide scale with extreme values, which completely distorted the residuals. I’ve now applied a log₂ transformation to the platelet counts, and the Q–Q plot looks much more reasonable — no more vertical alignment of residuals, and the axes now fall within a normal expected range (roughly -5 to 5). | |
| Mar 23, 2025 at 0:17 | comment | added | jginestet | With ~ 50k data points, your X axis should go from about -5 to about 5 (rough order of magnitude), not -200 to 800... The identity (diagonal) line is completely off. | |
| Mar 23, 2025 at 0:03 | comment | added | jginestet | @Nikimiskata, your residuals go from below -200 to over 800. Their mean is 0 (as it should be!), but they are not centered on 0 (median), as confirmed by the regression plot. It is the theoretical quantiles which are clustered around 0, with almost no spread; that is not normal, particularly given your large sample size (~50k). Take a look at the issue described in the link I provided; I am not saying it is the exact same issue, but there is a glitch nonetheless. No matter how non-normal, skewed and heavy-tailed your residuals are, you should not see such a vertical line... | |
| Mar 22, 2025 at 20:45 | comment | added | Nikimiskata | @jginestet I think the reason my Q–Q plot looks like that is because most of the residuals are very small and tightly clustered around zero, while a few are extremely large — both positive and negative. This creates a sharp central peak and long tails, which makes the Q–Q plot appear compressed or almost vertical in the middle. I also checked the residual histogram (Right-skewed) and the kurtosis, which was slightly above 3 — so I assume the residuals are leptokurtic. Could that be the reason why my Q–Q plot looks like this? | |
| Mar 22, 2025 at 20:06 | comment | added | jginestet | @Nikimiskata, thanks for the plots. Informative. There is something very odd about the Q-Q plot. You may want to look at this post, specifically the first point of the accepted answer. Your residuals are skewed, but should not give you a vertical line; something is not right... | |
| Mar 22, 2025 at 17:43 | history | edited | Nikimiskata | CC BY-SA 4.0; added 745 characters in body | |
| Mar 22, 2025 at 17:38 | comment | added | jginestet | @Nikimiskata, thanks for the response. Could you update your original post with the plots (Q-Q of residuals, residuals vs. fitted, and at least 1 plot of "outliers" impacting the regression). Contributors may not bother reading all comments, and this info will really frame your problem (which seems real!), and help generate better answers. I will need to think a bit about it... | |
| Mar 22, 2025 at 17:31 | comment | added | Nikimiskata | @jginestet So even though the majority of observations seem centered, the OLS line is visibly influenced by those extreme cases. This confirms that outliers are having a real impact on the model, and the effect estimate may not reflect the central tendency for most patients. I think this strengthens the case for using quantile regression (e.g., at the median) or a robust regression method. Would you agree — or do you have a different recommendation based on this plot? | |
| Mar 22, 2025 at 17:30 | comment | added | Nikimiskata | @jginestet Thanks again. I followed your suggestion and plotted the OLS regression line over the raw data. The result supports what you mentioned: although most of the data is concentrated in a central cluster, there are a few extreme values (especially one very high erythrocyte count) that appear to pull the regression line upward. | |
| Mar 22, 2025 at 17:24 | comment | added | Nikimiskata | @jginestet Thanks for the suggestion! I checked the residual diagnostics as recommended. The Q–Q plot of the residuals shows a strong deviation from normality: the points are tightly clustered vertically and don’t follow the diagonal line, indicating a non-normal error distribution. The Residuals vs Fitted plot clearly shows heteroscedasticity: there’s a cone-shaped spread, and quite a few large residuals (outliers), especially at higher fitted values. So while OLS gives me a point estimate, the assumptions required for valid p-values and confidence intervals are not met in my case. | |
| Mar 22, 2025 at 17:01 | comment | added | jginestet | Also, you mention "outliers". I am glad you resist any transformation, as I would argue that there is no such thing as outliers. They are just extreme values, but they belong to the DGP (data generating process). The only way to see if they affect your regression in a major way is to plot the regression line; is it pulled by the extreme values? | |
| Mar 22, 2025 at 16:59 | comment | added | jginestet | You say non-normal, and heteroscedastic. But the marginal distributions (i.e. your IVs and DVs) do not need to be normal, or homoscedastic; it is the residuals which need to be! And they only need to be if you need p-values and/or CIs for your regression coefficients. The OLS regression will still give unbiased estimates of your outcome against the predictors (it is the Best Linear Unbiased Estimator, BLUE, without any normality assumption, although the "best" part does require equal error variances). So try an OLS, and look at the Q-Q plot of the residuals, and the residuals against the fitted values; this is where you can make your diagnostic of normality and equal variances. | |
| Mar 22, 2025 at 16:54 | comment | added | Nikimiskata | Thank you! The cell counts are recorded in 10⁹ cells per liter. I’ve already excluded patients with relevant diseases as well as those who are pregnant. I'm not adjusting for sex or age at this stage — my goal is to first analyze the patterns within the current dataset as it is, before adding additional layers like demographic covariates. | |
| Mar 22, 2025 at 16:42 | comment | added | EdM | Thanks for updating with the extra information. What are the units of your erythrocyte counts? They don't seem to be in the units of millions of cells per microliter that I'm familiar with. Also, are you adjusting for sex (including pregnancy status) or age? | |
| Mar 22, 2025 at 16:07 | comment | added | Nikimiskata | @StephanKolassa Thank you! I’ve edited the question to clarify that I have 3 protein expression predictors and that the outcome variable is erythrocyte count (not platelets). I also added that the outliers are present in both the predictors and the outcome variable. | |
| Mar 22, 2025 at 16:04 | history | edited | Nikimiskata | CC BY-SA 4.0; added 379 characters in body | |
| Mar 22, 2025 at 15:38 | comment | added | EdM | Could you please edit the question to include the number of predictors? Information in comments is easy to overlook and can even be deleted. Also, is your outcome erythrocyte count (as in the question) or platelet count (as in your comment)? Are the outliers in the outcome values or in the protein levels? Please clarify by editing the question. | |
| Mar 22, 2025 at 15:25 | comment | added | Nikimiskata | @StephanKolassa Thanks! I currently have 4 protein expression predictors and one continuous outcome (platelet count). | |
| Mar 22, 2025 at 15:21 | comment | added | Stephan Kolassa | Welcome to CV! How many predictors (protein expression levels) do you have? | |
| Mar 22, 2025 at 15:02 | review | First questions | | Mar 22, 2025 at 15:41 | |
| Mar 22, 2025 at 15:02 | history | asked | Nikimiskata | CC BY-SA 4.0 | |