As data engineers, we often talk about scalability, performance, and automation, but one thing silently determines the success or failure of every pipeline: data quality. No matter how advanced your stack, if your data is inconsistent, incomplete, or inaccurate, your downstream dashboards, ML models, and decisions will all be compromised. Here's a detailed list of 21 critical checks that every modern data engineer should implement 👇

🔹 1. Null or Missing Value Checks
Ensure no essential field (like customer_id or transaction_id) contains missing data.
🔹 2. Primary Key Uniqueness Validation
Verify that key columns (like IDs) remain unique to prevent duplicate business entities or double-counted revenue.
🔹 3. Duplicate Record Detection
Detect duplicates across ingestion stages.
🔹 4. Referential Integrity Validation
Confirm that all foreign key relationships hold true.
🔹 5. Data Type Validation
Ensure incoming data matches schema definitions: no strings in numeric fields, no invalid dates.
🔹 6. Numeric Range Validation
Catch impossible values (e.g., negative ages, >100% percentages, invalid ratings).
🔹 7. String Length & Pattern Checks
Enforce length constraints and validate formats (emails, phone numbers, IDs) with regex rules.
🔹 8. Allowed Value / Domain Validation
Ensure categorical columns only contain valid entries, e.g., gender ∈ {'M', 'F', 'Other'}.
🔹 9. Business Rule Consistency
Check rules like order_amount = item_price * quantity or revenue = sum(product_sales).
🔹 10. Cross-Column Consistency
Validate logical dependencies, e.g., delivery_date ≥ order_date.
🔹 11. Timeliness / Freshness Checks
Detect data delays and SLA breaches, especially important for near-real-time systems.
🔹 12. Completeness Check
Verify all partitions, expected files, or dates are present: no missing data slices.
🔹 13. Volume Check Against Historical Data
Compare record counts or data sizes against previous runs to detect ingestion anomalies.
🔹 14. Statistical Distribution Checks
Validate the stability of metrics like mean, median, and standard deviation to catch silent drift.
🔹 15. Outlier Detection
Identify records that deviate significantly from normal ranges.
🔹 16. Schema Drift Detection
Automatically detect added, removed, or renamed columns, common in dynamic source systems.
🔹 17. Duplicate File Ingestion Check
Prevent reprocessing of already-loaded files or data across multiple sources.
🔹 18. Negative / Invalid Value Checks
Block impossible values like negative prices or zero quantities where not allowed.
🔹 19. Percentage / Total Consistency Check
Ensure calculated percentages sum to 100% and totals match their constituent values.
🔹 20. Hierarchy Validation
Validate hierarchical consistency, e.g., every child record maps to a valid parent level.
🔹 21. Audit Column Consistency
Confirm audit columns like created_by, updated_at, and load_date are properly populated.

#DataEngineering #DataQuality #Databricks #ETL #DataPipelines #DataGovernance
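A few of the checks above can be sketched in plain Python. The column names (customer_id, transaction_id, order_amount, item_price, quantity) mirror the examples in the list and are illustrative only; a real pipeline would run equivalent logic in SQL, Spark, or a DQ framework:

```python
def run_basic_checks(rows):
    """Run a handful of the listed checks over rows (a list of dicts).

    Returns a dict mapping check name -> True (passed) / False (failed).
    Column names are illustrative.
    """
    ids = [r["transaction_id"] for r in rows]
    return {
        # Check 1: null / missing values in an essential field
        "no_null_customer_id": all(r.get("customer_id") is not None for r in rows),
        # Check 2: primary key uniqueness
        "unique_transaction_id": len(ids) == len(set(ids)),
        # Check 6 / 18: numeric range, no negative amounts
        "non_negative_amount": all(r["order_amount"] >= 0 for r in rows),
        # Check 9: business rule, order_amount = item_price * quantity
        "amount_consistent": all(
            abs(r["order_amount"] - r["item_price"] * r["quantity"]) < 1e-9
            for r in rows
        ),
    }
```

Returning named results rather than raising on the first failure makes it easy to log every violation per batch and feed a scorecard downstream.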
Data Quality Assessment
Summary
Data quality assessment is the process of checking whether data is accurate, complete, and reliable enough to support decision-making. By examining various aspects like consistency, timeliness, and validity, organizations can avoid costly mistakes and build trustworthy data systems.
- Define clear rules: Set up specific checks for missing values, valid ranges, and unique identifiers to catch errors early.
- Monitor changes: Regularly review your data for unexpected shifts, such as sudden increases in duplicates or schema changes, to spot issues before they impact analysis.
- Summarize results: Organize data quality scores in a way that's easy to understand, so everyone involved can quickly see where improvements are needed.
The dashboard didn't lie. The data did, quietly. And that's how bad decisions get made.

Most data issues don't show up as errors. They show up as slightly wrong numbers that snowball into wrong strategy, wrong forecasts, and wrong outcomes. Here are the data quality checks that keep your business from steering off-course:

1. Row Count Drift Check
Catches sudden jumps or drops in record counts before they distort metrics.
2. Null Values in Critical Fields
Ensures key identifiers and revenue fields are never missing.
3. Duplicate Record Detection
Flags repeated data caused by retries or broken idempotency.
4. Referential Integrity Validation
Checks whether all foreign keys correctly map to parent records.
5. Schema Change Monitoring
Alerts you when columns are added, removed, or renamed so pipelines don't break silently.
6. Freshness & Latency Checks
Confirms dashboards are showing timely data within agreed SLAs.
7. Value Range Validation
Detects impossible values like negative revenue or unrealistic outliers.
8. Historical Trend Comparison
Surfaces metric shifts that don't match past behavior or known events.
9. Source-to-Target Reconciliation
Validates that transformed totals match upstream source data.
10. Late-Arriving Data Detection
Prevents delayed events from corrupting historical reporting.
11. Business Rule Validation
Ensures domain rules, like order states or status transitions, are always respected.
12. Aggregation Consistency Checks
Confirms daily, weekly, and monthly totals all align.
13. Cardinality Anomaly Detection
Catches unexpected drops or spikes in unique users, products, or transactions.
14. Data Completeness by Segment
Ensures every region, product line, or channel is fully represented.

Bad data rarely screams; it whispers. These checks make sure you hear it before your business does.
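The row count drift check (#1) is simple to sketch: compare the current run against a baseline of recent runs. The 30% tolerance below is an arbitrary illustrative threshold; in practice you would tune it per table, or use a statistical band instead of a fixed percentage:

```python
def row_count_drift(current, history, tolerance=0.3):
    """Flag a run whose row count deviates from the recent average
    by more than `tolerance` (0.3 = 30%).

    current:  row count of the latest run
    history:  list of row counts from previous runs
    Returns True when drift is detected.
    """
    if not history:
        return False  # first run: nothing to compare against yet
    baseline = sum(history) / len(history)
    # Relative deviation from the historical average
    return abs(current - baseline) / baseline > tolerance
```

A typical deployment records each run's count in a metadata table and alerts (rather than failing the pipeline) when this returns True, since legitimate volume changes do happen.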
-
Data quality isn't boring, it's the backbone of data outcomes! Let's dive into some real-world examples that highlight why these six dimensions of data quality are crucial in our day-to-day work.

1. Accuracy: I once worked on a retail system where a misplaced minus sign in the ETL process led to inventory levels being subtracted instead of added. The result? A dashboard showing negative inventory, causing chaos in the supply chain and a very confused warehouse team. This small error highlighted how critical accuracy is in data processing.

2. Consistency: In a multi-cloud environment, we had customer data stored in AWS and GCP. The AWS system used 'customer_id' while GCP used 'cust_id'. This inconsistency led to mismatched records and duplicate customer entries. Standardizing field names across platforms saved us countless hours of data reconciliation and significantly improved our data integrity.

3. Completeness: At a financial services company, we were building a credit risk assessment model. We noticed the model was unexpectedly approving high-risk applicants. Upon investigation, we found that many customer profiles had incomplete income data, exposing the company to significant financial losses.

4. Timeliness: Consider a real-time fraud detection system for a large bank, where every transaction is analyzed for potential fraud within milliseconds. One day, we noticed a spike in fraudulent transactions slipping through our defenses. We discovered that our real-time data stream was experiencing intermittent delays of up to two minutes. By the time some transactions were analyzed, the fraudsters had already moved on to their next target.

5. Uniqueness: A healthcare system I worked on had duplicate patient records due to slight variations in name spelling or date format. This not only wasted storage but, more critically, could have led to dangerous situations like conflicting medical histories. Ensuring data uniqueness was not just about efficiency; it was a matter of patient safety.

6. Validity: In a financial reporting system, we once had a rogue data entry that put a company's revenue in billions instead of millions. The invalid data passed through several layers before causing a major scare in the quarterly report. Implementing strict data validation rules at ingestion saved us from potential regulatory issues.

Remember, as data engineers, we're not just moving data from A to B. We're the guardians of data integrity. So next time someone calls data quality boring, remind them: without it, we'd be building castles on quicksand. It's not just about clean data; it's about trust, efficiency, and ultimately, the success of every data-driven decision our organizations make. It's the invisible force keeping our data-driven world from descending into chaos, as well depicted by Dylan Anderson. #data #engineering #dataquality #datastrategy
-
DQ score calculations are not as straightforward as one might think. Typically, there is a DQ rule score, calculated as the number of records that passed the rule divided by the total number of records. However, almost everyone wants an aggregated score for a dataset: a single number to measure data quality. This is where it gets interesting.

Some DQ tools offer a DQ score calculated as an average of all DQ rule scores. This is just a number and often lacks meaningful interpretation. Other tools provide more sophisticated score calculations at the record level and subject level. These scores are more insightful: the record-level score shows the number of error-free records, while the subject-level score shows the number of error-free subjects. Subjects are the high-level entities whose data is being assessed, such as customers or accounts (or loans, in the example here). Interestingly, different calculation methods can yield different results!

Which method is the best? It's the one that is understandable to the people involved in reviewing the DQ assessment results. Personally, I prefer calculating all kinds of scores and organizing them into a neat DQ scorecard. Examining scores from various perspectives gives me valuable information that I can use to draw actionable conclusions and run data quality improvement exercises. What methods do you use?
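A minimal sketch of the three calculation methods described above, using hypothetical loan records (the `loan_id` subject key and the rules are illustrative). Run on the same data, the three methods can indeed disagree, which is exactly the point the post makes:

```python
def dq_scores(records, rules, subject_key="loan_id"):
    """Compute DQ scores at three levels.

    records:     list of dicts
    rules:       {rule_name: predicate(record) -> bool}
    subject_key: field grouping records into subjects (e.g. loans)

    Returns (rule_scores, record_score, subject_score) where each score
    is a pass rate between 0 and 1.
    """
    # Rule-level: fraction of records passing each rule
    rule_scores = {
        name: sum(rule(r) for r in records) / len(records)
        for name, rule in rules.items()
    }
    # Record-level: a record is error-free only if it passes ALL rules
    record_pass = [all(rule(r) for rule in rules.values()) for r in records]
    record_score = sum(record_pass) / len(records)
    # Subject-level: a subject is error-free only if ALL its records are
    subjects = {}
    for r, ok in zip(records, record_pass):
        key = r[subject_key]
        subjects[key] = subjects.get(key, True) and ok
    subject_score = sum(subjects.values()) / len(subjects)
    return rule_scores, record_score, subject_score
```

For example, with four records across two loans where one record fails one rule, the average rule score is 0.875, the record-level score is 0.75, and the subject-level score drops to 0.5: three different answers from one dataset.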
-
A comprehensive data quality framework encompasses design-time, run-time, AND consumption-time quality checks, yet most companies focus only on the last. First, some definitions:

1. Design-time quality focuses on code. Imagine that the software / pipeline code is the 'machine' producing a data product. If defects in the data are caused by the machine itself, no amount of post-deployment monitoring will ever fix them. Design-time quality brings DQ best practices into the software development lifecycle with unit and integration tests. Catching a DQ issue at design time is the cheapest to mitigate.

2. Run-time quality focuses on evaluating data produced at run time. This is essential because it is the first moment data can be analyzed by a producer for problems. Run-time checks allow software teams to diagnose and treat problems at the source instead of spending countless hours root-causing downstream impact. Catching an issue at run time is not inexpensive (it requires RCA and potentially refactoring code) but is much less expensive than the alternative.

3. Consumption-time quality is what we are most familiar with in the data space: anomaly detection, aggregations, and other forms of trend analysis. While exceptionally useful for identifying problems after the fact and catching unexpected errors outside our control, it can be expensive and reactive. Consumption-time quality is the most costly and should generally be reserved for the long tail of unexpected errors.

Holistic data quality therefore relies on combining these three types of checks to identify problems at the points in time they are MOST necessary. The ideal framework puts the most resources into design-time checks, which are the cheapest to deal with, a moderate amount into run-time checks, and comparatively fewer into consumption-time checks. Better still, this framework lets upstream engineers take ownership of the systems they actually manage (code and run-time events) rather than needing to be educated about downstream systems they are totally unfamiliar with. Good luck!
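As a concrete sketch of what a design-time check looks like in practice: a unit test on the transformation logic itself, written before any production data flows through it. The transform and its tests below are hypothetical:

```python
def normalize_currency(amount_cents):
    """Pipeline transform under test: convert integer cents to dollars."""
    return round(amount_cents / 100, 2)

def test_normalize_currency():
    """Design-time check: runs in CI on every commit, so a defect in the
    'machine' is caught before it ever produces bad data."""
    assert normalize_currency(199) == 1.99
    assert normalize_currency(0) == 0.0
    # Regression test pinning down sign handling, the kind of defect that
    # run-time monitoring would only surface as mysteriously negative totals
    assert normalize_currency(-500) == -5.0
```

Wired into CI (e.g., pytest on every pull request), this is the cheapest point in the lifecycle to catch a DQ defect, exactly as the post argues.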
-
If data quality is about being fit for purpose, then why don't data leaders use business KPIs as data quality metrics?

Most DQ frameworks still obsess over the attributes of data (completeness, accuracy, timeliness) without ever asking the most important question: did the data help the business perform better? We've had the tools for decades (regression analysis, causal inference), yet few organizations connect DQ to the efficiency of the business processes the data supports. That's a huge miss. Because until you tie data quality to real-world business outcomes, your governance remains incomplete. Worse yet, it may be misleading. Bad data in analytics? Maybe. But in operations? That exact same data might be perfectly fit for purpose. A rigid, one-size-fits-all DQ standard leads to finger-pointing ("this data is garbage!") when the real issue is a lack of contextual awareness. What's fit for one use may not be fit for another, and vice versa.

It's time we evolve:

✅ Our governance frameworks must become more adaptive, with different sets of data quality rules and policies depending on how the data is used. At a minimum, our policies should adapt to support three contexts: functional/domain, cross-functional, and enterprise-wide. The data mesh movement was all about empowering domains, which is fine, but we cannot ignore the need to govern data at 'higher' levels of the organization.

✅ Quality metrics that reflect how data impacts business performance must exist, and must also be connected to more 'traditional' DQ metrics like consistency and accuracy. For example: if there is a duplicate customer record, how does that negatively affect marketing effectiveness?

✅ Recognition that DQ must support both operational and analytical use cases, and that what is 'fit' for one purpose may not be fit for the other.

We are quickly approaching a point where quality data is no longer negotiable. Yet our DQ frameworks, and our general mindset around data quality, are insufficient to support our rapidly evolving business needs. What is necessary is a change of perspective, where the 'quality' of data is measured, in part, by its ability to support our business goals. So... what would it take for your org to start measuring data quality in terms of business outcomes? #dataquality #datagovernance #datamanagement
-
This systematic review examines how #data_quality is assessed in healthcare, highlighting its critical role in clinical decision-making, patient outcomes, and research. The authors analyzed 44 studies, identifying significant variability in the definitions and number of data quality dimensions (DQDs) evaluated. The most frequently assessed dimensions are completeness, plausibility, and conformance. Diverse methodologies are used, including rule-based systems, statistical analyses, enhanced definitions, and comparisons with external standards. The review also catalogs a wide range of tools and software applications supporting data quality assessment (DQA), such as R and Python-based toolkits, web-based dashboards, and SQL solutions. The authors recommend developing a practical framework to harmonize definitions, assessment methods, and tool design, aiming to improve the consistency and efficiency of healthcare data quality evaluation. Reference: Hosseinzadeh, E., Afkanpour, M., Momeni, M. et al. Data quality assessment in healthcare, dimensions, methods and tools: a systematic review. BMC Med Inform Decis Mak 25, 296 (2025). https://lnkd.in/ejr3mtir
-
Data Quality Checks in Medallion Architecture (Databricks)

Data flows through Bronze → Silver → Gold, but without quality checks at each layer, even the best Lakehouse becomes unreliable. Here's how modern teams implement DQ the right way 👇

🥉 BRONZE (Raw Layer)
Goal: capture everything, validate nothing. But still add basic checks to prevent corruption:
✔ File format validation
✔ Schema detection
✔ Row count logging
✔ Bad records quarantine
Outcome: raw but trusted.

🥈 SILVER (Cleaned & Enriched Layer)
This is where the real quality checks happen:
✔ Schema enforcement
✔ Null / duplicate checks
✔ Referential integrity
✔ Data type standardization
✔ Deduplication & late data handling
✔ Business rule validation (thresholds, patterns)
Outcome: analytics-ready data.

🥇 GOLD (Curated / Business Layer)
Here the checks are business-driven:
✔ KPI validation
✔ Surrogate key consistency
✔ SCD validations
✔ Aggregation-level checks
✔ Reconciliation with source systems
Outcome: trusted, governed data for dashboards & ML.

🔥 Why this matters
Good pipelines load data. Great pipelines validate data at every step. That's what earns business trust.

#Databricks #DataEngineering #DataQuality #DeltaLake #MedallionArchitecture #ETL #Lakehouse #PySpark #DataGovernance #Azure #BigData
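One way to sketch layer-specific gates is a small check registry per layer, with failing batches quarantined instead of promoted. This is plain Python for illustration (the check names and `id` column are hypothetical); on Databricks the same idea is typically expressed with Delta Live Tables expectations or equivalent:

```python
# Illustrative per-layer check registries for a medallion pipeline.
BRONZE_CHECKS = {
    # Bronze: capture everything, but refuse obviously corrupt input
    "non_empty_batch": lambda batch: len(batch) > 0,
}
SILVER_CHECKS = {
    # Silver: real quality gates before data becomes analytics-ready
    "no_null_keys": lambda batch: all(r.get("id") is not None for r in batch),
    "no_duplicates": lambda batch: len({r["id"] for r in batch}) == len(batch),
}

def promote(batch, checks):
    """Promote a batch to the next layer only if every check passes;
    otherwise raise so the batch can be quarantined for inspection."""
    failures = [name for name, check in checks.items() if not check(batch)]
    if failures:
        raise ValueError(f"quarantined: failed checks {failures}")
    return batch
```

Usage would be `silver = promote(bronze_batch, SILVER_CHECKS)` at each layer boundary; keeping the checks in a registry makes the per-layer policy explicit and easy to extend.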
-
💡 Mastering Data Quality in Modern Data Pipelines

🎯 "Your analytics are only as good as the data feeding them."

In today's fast-paced data ecosystems, bad data isn't just an inconvenience; it's an invisible cost that compounds over time. Whether you're building on Spark, Airflow, or dbt, data quality should be treated as a first-class citizen in your architecture. Here's what separates resilient data platforms from reactive ones 👇

🔹 1. Shift-Left Data Validation
Don't wait until your dashboards break. Validate early, at ingestion. Use tools like Great Expectations, Soda, or Deequ to catch schema drift and anomalies before loading data downstream.

🔹 2. Observability as a Core Component
Treat data like infrastructure. 📊 Implement end-to-end monitoring for freshness, volume, and schema consistency. Platforms like Monte Carlo, Databand, or OpenMetadata can help you see your data flows.

🔹 3. Version Control for Data Models
Use Git + CI/CD for your transformation logic. ⚙️ dbt tests + automated checks = fewer surprises in production.

🔹 4. Feedback Loops from Consumers
Your downstream users (analysts, ML teams, BI tools) are your best sensors. 💬 Create Slack- or Jira-based feedback loops for data issues.

🔹 5. Golden Data Contracts
Define schemas, SLAs, and ownership before data starts flowing. 📄 Data contracts reduce chaos between producers and consumers, aligning expectations around latency, structure, and quality.

💬 Final Thought: Data quality isn't a one-time project; it's a culture. Build trust by designing your pipelines to detect, prevent, and communicate quality issues automatically.

👇 How are you ensuring data reliability in your pipelines today?

#DataEngineering #DataQuality #DataObservability #ETL #DataOps #GreatExpectations #dbt #DataContracts #BigData #Airflow #DataTrust #AnalyticsEngineering #DataPipeline
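Shift-left validation at ingestion can be as simple as checking each incoming record against a declared contract before it is loaded downstream. The sketch below is illustrative plain Python, not a real Great Expectations or Soda configuration, and the field names in the contract are hypothetical:

```python
# A declarative contract: field -> expected type and whether it is required.
CONTRACT = {
    "user_id": {"type": int, "required": True},
    "email": {"type": str, "required": True},
    "age": {"type": int, "required": False},
}

def validate_record(record, contract):
    """Validate one record against the contract at ingestion time.

    Returns a list of violation messages; an empty list means the record
    conforms and may proceed downstream.
    """
    errors = []
    for field, spec in contract.items():
        value = record.get(field)
        if value is None:
            # Only flag absence when the contract marks the field required
            if spec["required"]:
                errors.append(f"missing required field: {field}")
        elif not isinstance(value, spec["type"]):
            errors.append(f"wrong type for {field}: {type(value).__name__}")
    return errors
```

The same contract document can then double as the producer-consumer agreement the post describes under "Golden Data Contracts": one source of truth for schema, checked automatically at the boundary.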