Fitting a Straight Line: A Detailed Guide to Machine Learning for O&G Engineers

Hai Hung Vu

Published Jul 28, 2025

I’ve been wondering: why do we engineers love fitting a straight line so much?

Remember the semi-log or log-log plots? We’d bend, twist, and transform our data just to see that beautiful line. And honestly, I still find it satisfying. There is something about how clean and elegant it looks to the eye.

But here’s the thing:

In oil and gas, we often stop at “it looks good” or “R² > 0.7.”

In machine learning, that’s just the beginning.

Fitting a straight line is formally known as linear regression, and while it’s familiar, ML treats it with more scrutiny. Behind the simplicity lies a set of statistical assumptions that must hold for the model’s outputs to be valid and trusted.

A simple linear regression model typically takes the form:

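y = β₀ + β₁x + ε

where y is the dependent variable, x is the independent variable, β₀ is the intercept, β₁ is the slope, and ε is the error term.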
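As a minimal sketch of what "fitting a straight line" means in code, the snippet below fits y = b0 + b1·x by ordinary least squares with NumPy and computes R² from the residuals. The data are synthetic and the true values (intercept 2.0, slope 0.5) are assumptions chosen purely for illustration.

```python
import numpy as np

# Synthetic data (hypothetical): y = 2.0 + 0.5 * x plus small Gaussian noise
rng = np.random.default_rng(seed=0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)

# Ordinary least squares for y = b0 + b1 * x
# np.polyfit returns coefficients from the highest degree to the lowest
b1, b0 = np.polyfit(x, y, deg=1)

# Residuals and coefficient of determination
fitted = b0 + b1 * x
residuals = y - fitted
ss_res = float(np.sum(residuals**2))
ss_tot = float(np.sum((y - y.mean()) ** 2))
r_squared = 1.0 - ss_res / ss_tot

print(f"intercept={b0:.3f}, slope={b1:.3f}, R^2={r_squared:.3f}")
```

The `residuals` array is the raw material for most of the assumption checks that follow.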

In O&G, we typically don't think much about "fitting a straight line" as long as it looks good and R² is above roughly 0.7 to 0.8. In ML, the validity of the model hinges on several critical assumptions. If they’re violated, the estimated coefficients may be biased, inefficient, or misleading. So let's see what we can learn from that.

Note: in simple linear regression with a single predictor (X), some of the assumptions below do not apply (for example, multicollinearity requires at least two predictors).

📌 Assumption 1: Linearity

The relationship between the independent variables and the dependent variable is linear.

❓ Why it matters:

If the relationship isn’t linear, you’re misrepresenting reality, which means bad predictions.

✅ How to check:

Plot residuals vs. fitted values: should look like random scatter.
Partial residual plots: to isolate individual variable effects.

🔧 Remedy:

Engineers do this all the time! Apply log, square root, or polynomial transformations.
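A rough illustration of the check and the remedy together, assuming a hypothetical exponential decline q = qi·exp(-D·t) with made-up values for qi and D. A straight line fitted to the raw rates leaves curved residuals, while the same fit on ln(q) does not. The "curvature" number here is just the correlation between the residuals and (t - mean)², an ad hoc stand-in for eyeballing the residual plot.

```python
import numpy as np

# Hypothetical exponential decline with 5% multiplicative noise
rng = np.random.default_rng(seed=1)
t = np.linspace(0.0, 36.0, 37)      # months
qi, D = 1000.0, 0.08                # assumed initial rate and decline rate
q = qi * np.exp(-D * t) * np.exp(rng.normal(scale=0.05, size=t.size))

def fit_line(x, y):
    """Return slope, intercept, and residuals of an OLS straight-line fit."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept, y - (intercept + slope * x)

def curvature(x, residuals):
    """Correlation of residuals with (x - mean)^2; near 0 means no obvious bend."""
    return float(np.corrcoef(residuals, (x - x.mean()) ** 2)[0, 1])

# Raw scale: the straight line misses the bend in the data
_, _, res_raw = fit_line(t, q)

# Semi-log scale: ln(q) = ln(qi) - D * t is linear in t
slope_log, intercept_log, res_log = fit_line(t, np.log(q))

curv_raw = curvature(t, res_raw)
curv_log = curvature(t, res_log)
D_est = -slope_log
qi_est = float(np.exp(intercept_log))

print(f"residual curvature, raw      : {curv_raw:+.2f}")
print(f"residual curvature, semi-log : {curv_log:+.2f}")
print(f"estimated D={D_est:.4f}, estimated qi={qi_est:.0f}")
```

Note that R² values on the raw and log scales are not directly comparable, which is why the sketch looks at residual structure instead.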

📌 Assumption 2: Independence of Errors

The residuals (errors) should be independent of each other.

❓ Why it matters:

Autocorrelated residuals (especially in time series data) can lead to underestimated standard errors and overly optimistic p-values.

✅ How to check:

Durbin-Watson test: Detects autocorrelation (values near 2 indicate no autocorrelation).
Plot residuals vs. time/order: Patterns (like waves or trends) suggest autocorrelation.
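The Durbin-Watson statistic is simple enough to compute by hand: the sum of squared successive differences of the residuals divided by the sum of squared residuals. The sketch below applies it to two synthetic series standing in for regression residuals, one independent and one with assumed AR(1) correlation of 0.8. In practice, `statsmodels.stats.stattools.durbin_watson` computes the same statistic.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic; ranges from 0 to 4, with about 2 meaning
    no first-order autocorrelation."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e**2))

rng = np.random.default_rng(seed=2)
n = 500

# Case 1: independent errors
e_independent = rng.normal(size=n)

# Case 2: AR(1) errors, e[t] = 0.8 * e[t-1] + noise (positive autocorrelation)
noise = rng.normal(size=n)
e_ar1 = np.zeros(n)
for i in range(1, n):
    e_ar1[i] = 0.8 * e_ar1[i - 1] + noise[i]

dw_independent = durbin_watson(e_independent)
dw_ar1 = durbin_watson(e_ar1)

print(f"DW, independent errors: {dw_independent:.2f}")
print(f"DW, AR(1) errors      : {dw_ar1:.2f}")
```

Values well below 2 point to positive autocorrelation, and values well above 2 point to negative autocorrelation.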

🔧 Remedy:

Use time series models (e.g., ARIMA)

📌 Assumption 3: Homoscedasticity (Constant Variance of Errors) - What a term! I can't pronounce it

The variance of the residuals is the same across all levels of the independent variables.

❓ Why it matters:

If heteroscedasticity exists (i.e., non-constant variance), standard errors of coefficients become biased, affecting confidence intervals and hypothesis tests.

✅ How to check:

Residuals vs. fitted values plot: Look for a funnel shape; a widening or narrowing spread indicates heteroscedasticity.
Breusch-Pagan test: A formal statistical test.
White test: Another robust option for detecting heteroscedasticity.
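To make the Breusch-Pagan idea concrete, here is a sketch of its studentized (Koenker) form for a single predictor: regress the squared residuals on x and use LM = n·R², which is approximately chi-square with 1 degree of freedom. The function name and the synthetic data are assumptions for illustration. For real work, `statsmodels.stats.diagnostic.het_breuschpagan` handles the general multi-predictor case.

```python
import math
import numpy as np

def breusch_pagan_single(x, residuals):
    """Studentized Breusch-Pagan test for one predictor.
    Returns (LM statistic, p-value) with LM = n * R^2 of e^2 regressed on x."""
    x = np.asarray(x, dtype=float)
    e2 = np.asarray(residuals, dtype=float) ** 2
    slope, intercept = np.polyfit(x, e2, deg=1)
    aux_res = e2 - (intercept + slope * x)
    r2 = 1.0 - np.sum(aux_res**2) / np.sum((e2 - e2.mean()) ** 2)
    lm = max(float(x.size * r2), 0.0)
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value

rng = np.random.default_rng(seed=3)
x = np.linspace(1.0, 10.0, 200)

# Constant noise vs. noise whose spread grows with x (the funnel shape)
datasets = {
    "constant": 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size),
    "funnel": 1.0 + 2.0 * x + rng.normal(scale=0.3 * x, size=x.size),
}

results = {}
for label, y in datasets.items():
    b1, b0 = np.polyfit(x, y, deg=1)
    results[label] = breusch_pagan_single(x, y - (b0 + b1 * x))
    lm, p = results[label]
    print(f"{label:8s} LM={lm:6.2f}  p={p:.4f}")
```

A small p-value is evidence against constant variance; a large one does not prove homoscedasticity, it only fails to reject it.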

🔧 Remedy:

Transform the response variable, or use weighted regression.
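A small weighted least squares sketch, assuming (hypothetically) that the noise standard deviation is known to grow in proportion to x. `np.polyfit` accepts a `w` argument that multiplies each residual before squaring, so passing 1/sigma down-weights the noisier points. In real data sigma is rarely known and has to be estimated or modelled.

```python
import numpy as np

rng = np.random.default_rng(seed=4)
x = np.linspace(1.0, 10.0, 200)
sigma = 0.3 * x   # assumed: noise standard deviation proportional to x
y = 1.0 + 2.0 * x + rng.normal(scale=sigma, size=x.size)

# Ordinary least squares: every point counts equally
b1_ols, b0_ols = np.polyfit(x, y, deg=1)

# Weighted least squares: weights of 1 / sigma (not 1 / sigma**2),
# because polyfit applies w to the residuals before squaring them
b1_wls, b0_wls = np.polyfit(x, y, deg=1, w=1.0 / sigma)

print(f"OLS: intercept={b0_ols:.3f}, slope={b1_ols:.3f}")
print(f"WLS: intercept={b0_wls:.3f}, slope={b1_wls:.3f}")
```

Both estimators are unbiased here; the benefit of weighting is lower variance in the estimates and standard errors that can be trusted.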

📌 Assumption 4: Normality of Errors

The residuals should be normally distributed (especially for inference, like p-values and confidence intervals).

❓ Why it matters:

Normality is not required for point estimates, but it’s crucial for valid hypothesis testing and confidence intervals, particularly in small samples.
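One possible way to check this assumption, offered here as an illustration rather than as the article's own recommendation, is the Jarque-Bera test. It combines the skewness and excess kurtosis of the residuals into a statistic that is approximately chi-square with 2 degrees of freedom under normality. The approximation is known to be rough in small samples, so a Q-Q plot remains a useful companion. `scipy.stats.jarque_bera` provides a library version.

```python
import math
import numpy as np

def jarque_bera(residuals):
    """Jarque-Bera statistic and p-value.
    JB = n/6 * (S^2 + K^2/4), with S the skewness and K the excess kurtosis.
    The chi-square(2) survival function is exp(-JB / 2)."""
    e = np.asarray(residuals, dtype=float)
    e = e - e.mean()
    n = e.size
    m2 = np.mean(e**2)
    skew = np.mean(e**3) / m2**1.5
    excess_kurt = np.mean(e**4) / m2**2 - 3.0
    jb = float(n / 6.0 * (skew**2 + excess_kurt**2 / 4.0))
    return jb, math.exp(-jb / 2.0)

# Synthetic stand-ins for regression residuals
rng = np.random.default_rng(seed=5)
normal_res = rng.normal(size=1000)
skewed_res = rng.exponential(size=1000) - 1.0   # strongly right-skewed

jb_normal, p_normal = jarque_bera(normal_res)
jb_skewed, p_skewed = jarque_bera(skewed_res)

print(f"normal residuals: JB={jb_normal:.2f}, p={p_normal:.3f}")
print(f"skewed residuals: JB={jb_skewed:.2f}, p={p_skewed:.3g}")
```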

📌 Assumption 5: No Multicollinearity

Independent variables should not be highly correlated with each other.

❓ Why it matters:

Multicollinearity inflates the variances of the coefficient estimates, making it hard to determine the effect of individual predictors.

✅ How to check:

Variance Inflation Factor (VIF): A VIF > 5 (sometimes 10) indicates problematic multicollinearity.
Correlation matrix: Check for high pairwise correlations between features.
Condition index: Part of the collinearity diagnostics.
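The VIF for predictor j is 1 / (1 - R²ⱼ), where R²ⱼ comes from regressing that predictor on all the others. The sketch below computes it with NumPy for three hypothetical predictors, where the pressure values and their relationship are invented for illustration: wellhead pressure and bottomhole pressure are nearly collinear, and choke size is independent. `statsmodels.stats.outliers_influence.variance_inflation_factor` offers a library equivalent.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X: regress column j on the other columns
    (plus an intercept) and return 1 / (1 - R_j^2)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - np.sum(resid**2) / np.sum((target - target.mean()) ** 2)
        vifs.append(float(1.0 / (1.0 - r2)))
    return vifs

# Hypothetical predictors: whp and bhp move together, choke does not
rng = np.random.default_rng(seed=6)
n = 300
whp = rng.normal(loc=500.0, scale=50.0, size=n)
bhp = 3.0 * whp + rng.normal(scale=20.0, size=n)
choke = rng.normal(loc=32.0, scale=4.0, size=n)

X = np.column_stack([whp, bhp, choke])
vifs = variance_inflation_factors(X)
for name, vif in zip(["whp", "bhp", "choke"], vifs):
    print(f"{name:5s} VIF = {vif:.1f}")
```

With this setup the two pressure columns should show VIFs far above the usual thresholds of 5 or 10, while choke stays close to 1.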

🔧 Remedy:

Remove or reduce correlated features, or apply Principal Component Analysis (PCA).

🧠 Insight: Use GLMs or Robust Methods When Assumptions Break

If your data violates normality or homoscedasticity, Generalised Linear Models (GLMs) can be a better fit:

Logistic regression for binary outcomes
Poisson regression for count data
Use robust standard errors or bootstrapped confidence intervals if normality/homoscedasticity is weak
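As a sketch of the bootstrapped confidence interval idea, the snippet below uses a pairs bootstrap: resample (x, y) rows with replacement, refit the line each time, and take percentiles of the resulting slopes. The data are synthetic with deliberately skewed, non-constant noise, and the number of resamples (2000) is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
n = 200
x = np.linspace(1.0, 10.0, n)

# Hypothetical data: true slope 2.0, with skewed noise whose spread grows with x
noise = (rng.exponential(size=n) - 1.0) * 0.3 * x
y = 1.0 + 2.0 * x + noise

slope_hat, intercept_hat = np.polyfit(x, y, deg=1)

# Pairs bootstrap: resample rows with replacement and refit each time
n_boot = 2000
boot_slopes = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)
    boot_slopes[b], _ = np.polyfit(x[idx], y[idx], deg=1)

# Percentile 95% confidence interval for the slope
ci_low, ci_high = np.percentile(boot_slopes, [2.5, 97.5])

print(f"slope estimate: {slope_hat:.3f}")
print(f"95% bootstrap CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

Because the interval comes from the resampled fits themselves, it does not lean on the normality or constant-variance assumptions in the way textbook standard errors do. It still assumes the observations are independent.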

🛠️ Real-World Use Cases of Linear Regression in Oil & Gas

💡 Production Forecasting:

Used to predict production rates from time-series data, especially in unconventional wells. Regression models are often benchmarked against Arps decline curves and can outperform them in certain cases.

💡 Gas Lift Rate Prediction:

ML models, including MLP regression and linear regression, have been used to predict oil, water, and gas rates based on surface and downhole parameters like BHP, WHP, choke size, and gas injection rates.

💡 Petrophysical Property Estimation:

Multiple linear regression is used to estimate log-derived properties like permeability or saturation from other well log inputs.

💡 Well Performance Classification & Planning:

In field development, linear regression is used after PCA to rank and classify well performance, helping to place new wells optimally based on analogues.

🎯 Final Thoughts for Engineers

Fitting a straight line may look simple, but doing it right takes rigour.

As engineers, we often lean on physical intuition and well-established equations. Many of the relationships we use daily were derived from first principles. But even physics-based models come with assumptions, simplifications, and hidden variables we may not fully account for.

In data-driven modelling, we face the same challenge:

Assumptions matter, and trust must be earned, not assumed.

So the next time you draw a trendline in Excel, run a regression in Python, or build a dashboard in Power BI, pause and ask yourself:

“Have I earned the right to trust this model?”

Let’s carry the same engineering discipline we apply to pipelines, wells, and reservoirs into our data science workflows. 👷♂️📊

👋 I’d love to hear from others applying regression in real-world O&G settings. What’s worked (or not) for you?

#MachineLearning #OilAndGas #DataScience #LinearRegression #GLM #EngineeringAnalytics #ProductionEngineering #ReservoirEngineering

Paul Skoczylas 10mo

You missed an important assumption. The traditional least-squares methods of calculating the coefficients for a linear regression *assume* that all the error is in the dependent variable, i.e., you have perfect measurements for the independent variable(s). It's possible to calculate the regression with error in both dependent and independent variables, but it's a lot more complicated!

Hai Hung Vu 10mo

Absolutely critical to be mindful of the data range. Many relationships appear linear within a narrow window, and linear approximations can be useful for quick insights or localised modeling. But the danger lies in extrapolation. Just because a trend holds within a small range doesn’t mean it will hold beyond it. Without grounding in physical principles or validation outside the training range, long-range predictions can be highly misleading. It’s a good reminder that a good fit isn’t always a good model, especially when we step outside the data we used to build it.
