From the course: Data Quality: Measure, Improve, and Enforce Reliable Systems
Profiling data for quality - Python Tutorial
- [Instructor] Before you clean, fix, or improve any dataset, you have to know what's actually inside it, and that's where data profiling comes in. Data profiling is the process of scanning, summarizing, and analyzing your data to uncover what's normal, what's missing, and what might be going wrong. Think of it like a routine health checkup for your data: it doesn't fix anything yet, but it tells you what needs attention. I'm in Databricks right now showing a quick example of what data profiling looks like using PySpark in Python.

Let's drill down into a few things profiling can uncover:

- Null values. How many fields are blank or missing?
- Distinct values. Are there more categories than expected?
- Outliers. Are there any values that clearly don't belong?
- Inconsistent formats. Are dates stored in five different ways?
- Duplicates. Are the same records showing up more than once?

Let's say you have just received a dataset from a colleague for an upcoming product launch. It's a list of leads collected from an online giveaway, and your marketing team wants to run an email campaign tomorrow. Before sending anything out, you run a quick data profile in Databricks, and here's what jumps out. Over 17% of email addresses are missing or clearly fake, like test@test.com or no@email.com. These are often the result of users trying to claim a reward without providing real contact info, and of a lack of email validation during form entry. The country column includes 10 different spellings of United States, like USA, America, US, United States of America, and even lowercase versions, because the form used a free-text field instead of a dropdown or a standardized list. Over 190 phone numbers are simply 0. The field was optional, but the backend still stored a default of 0 when it was left blank, masking the fact that the value is really missing.
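These checks can be sketched in a few lines of pandas. The tiny leads table below is invented for illustration (all column names and values are hypothetical), but it reproduces the three issues just described:

```python
import pandas as pd

# Hypothetical leads data; values are invented for this sketch.
leads = pd.DataFrame({
    "email":   ["ana@example.com", "test@test.com", None, "bo@example.com"],
    "country": ["USA", "United States", "usa", "US"],
    "phone":   ["555-0100", "0", "0", "555-0199"],
})

# Null values: how many fields are blank or missing?
null_counts = leads.isna().sum()

# Distinct values: more country spellings than expected?
raw_variants = leads["country"].nunique()                 # 4 raw spellings
country_variants = leads["country"].str.lower().nunique() # 3 after lowercasing

# Disguised missing values: an optional field stored as a default "0"
default_phones = (leads["phone"] == "0").sum()

print(null_counts, raw_variants, country_variants, default_phones)
```

A real profile would run the same kinds of checks across every column, but even this handful of one-liners surfaces the missing emails, the inconsistent country spellings, and the phone numbers that are really placeholders.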
Without profiling, the marketing team would've blasted an expensive campaign to fake, incomplete, or poorly labeled leads, wasting budget and risking email reputation. Profiling doesn't just show you what's wrong. It protects your time, your tools, and your brand.

So how do you actually perform data profiling? The good news is you don't have to do it manually. Most tools you already use have built-in ways to scan and summarize data quickly. Here are a few common ways data profiling shows up in real workflows.

In Excel or spreadsheets, you can use filters, PivotTables, and conditional formatting to highlight missing values, outliers, or unexpected categories.

Now I'm back in Databricks to show you how we can do it quickly in SQL and Python. In SQL, you can run simple queries using COUNT, DISTINCT, MIN, MAX, and IS NULL to check data distributions, gaps, and uniqueness. For example, SELECT COUNT(*) FROM table WHERE email IS NULL tells you how many customers skipped the email field, as you can see on the screen.

In Python, libraries like pandas, pandas-profiling, or its successor YData Profiling can give you quick summaries, visualizations, and alerts. For example, df.describe() gives you a snapshot of numerical columns, as you can see on the screen.

In BI platforms or data quality tools, such as Power BI, Tableau, and Great Expectations, profiling dashboards often come out of the box. A data source summary in Power BI shows column-level stats, missing data rates, and type mismatches.

In a nutshell, whether you are using code, clicks, or dashboards, profiling is something you can do in nearly any tool you work with.
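The SQL-style checks above can be tried end to end with Python's built-in sqlite3 module. The leads table, its columns, and its rows here are invented for this sketch:

```python
import sqlite3

# In-memory table standing in for the leads dataset
# (table name, columns, and rows are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leads (email TEXT, country TEXT, signup_date TEXT)")
conn.executemany(
    "INSERT INTO leads VALUES (?, ?, ?)",
    [
        ("ana@example.com", "USA",           "2024-01-03"),
        (None,             "US",            "2024-01-04"),
        (None,             "usa",           "2024-01-05"),
        ("bo@example.com", "United States", "2024-01-06"),
    ],
)

# IS NULL: how many customers skipped the email field?
(missing_emails,) = conn.execute(
    "SELECT COUNT(*) FROM leads WHERE email IS NULL"
).fetchone()

# DISTINCT: how many spellings of the country are there?
(country_variants,) = conn.execute(
    "SELECT COUNT(DISTINCT country) FROM leads"
).fetchone()

# MIN/MAX: what date range does the data cover?
first, last = conn.execute(
    "SELECT MIN(signup_date), MAX(signup_date) FROM leads"
).fetchone()

print(missing_emails, country_variants, first, last)
```

The same queries run unchanged against most SQL engines; only the connection setup differs from one platform to another.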