From the course: Data Preparation, Feature Engineering, and Augmentation for AI Models

Detecting and managing missing data

- [Instructor] When we're interested in detecting and managing missing data, we want to start with a dataset overview analysis. In this process, we calculate key metadata about our data. This includes metrics like record count, column count, and the time range that our dataset covers. We also want to identify data types across the different columns, because that will inform the kinds of statistical analysis we do. And we want to determine the overall data completeness and come up with some quality metrics. When we do this, we're establishing a baseline for understanding the dataset. In particular, we want to understand the scope, or the range of information that's covered, and the limitations with regard to missing data.

Now, when we're doing missing data analysis, we want to start by quantifying missing values in each column, and for that, we want to get counts and percentages. We also want to identify records with multiple missing fields. Those are candidates for possible exclusion, because multiple missing fields indicate there's potentially a significant problem with those records. We also want to analyze patterns of missingness. This will help us detect potential problems, either with the source system or with the way we're processing data. We also want to provide visual representations of data completeness across the dataset to help us identify issues. And we want to be sure that we flag critical columns that have excessive missing data, because we may need to do additional work with those columns.

Now, depending on the data type that we're working with, we may perform different operations. So for example, with numeric columns, we want to calculate core statistics like the mean, which is the average, and the median, which is the halfway mark in the values if you order them from smallest to largest.
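Before going further into the numeric statistics, the overview and missing-data steps described so far can be sketched in code. This is a minimal sketch using pandas, not the course's own implementation. The sample data, the column names, the function names, and the 40% threshold for a "critical" column are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-01", "2024-01-02", None, "2024-01-04"]),
    "amount": [10.0, np.nan, np.nan, 40.0],
    "region": ["east", None, None, "west"],
})

def overview(frame, time_col):
    """Baseline metadata: record count, column count, time range, dtypes, completeness."""
    return {
        "records": len(frame),
        "columns": frame.shape[1],
        "time_range": (frame[time_col].min(), frame[time_col].max()),
        "dtypes": frame.dtypes.astype(str).to_dict(),
        # Share of all cells that are not missing.
        "completeness": float(frame.notna().to_numpy().mean()),
    }

def missing_report(frame, critical_threshold=0.4):
    """Missing counts and percentages per column, flagging columns above the threshold."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": counts / len(frame) * 100,
    })
    # The threshold is an assumed cutoff, not a standard value.
    report["critical"] = report["missing_pct"] > critical_threshold * 100
    return report

def rows_with_multiple_missing(frame, min_missing=2):
    """Records with at least `min_missing` empty fields: candidates for exclusion."""
    return frame[frame.isna().sum(axis=1) >= min_missing]

summary = overview(df, "order_date")
report = missing_report(df)
suspect_rows = rows_with_multiple_missing(df)
```

Printing `report` gives one row per column, which is a reasonable starting point for a completeness chart. The right threshold for flagging a column depends on the dataset and on how important that column is to the model.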
Median and mean are used in different cases. We use the mean when we're working with normally distributed, or bell-shaped, data, and we use the median when we're working with non-normally distributed data, because the median is less sensitive to extreme values. Also, when we're working with normally distributed data, we often want to calculate things like the standard deviation. And in any case, we always want to think about the range, which is defined by the minimum and maximum values.

Now, we also want to identify outliers, and we can often use multiple detection methods. These include things like the Z-score and the interquartile range. We also want to analyze distribution characteristics. These include things like skewness and the coefficient of variation. We'll also compute quantiles, which help us understand the spread of the data and the concentration of different data points. This is all useful because it gives us a statistical foundation for data normalization decisions.

Now, when we're working with categorical data, we're going to work with different kinds of statistics. First of all, we want to determine the frequency distributions for all of our categorical variables, and we want to identify the most common values and their prevalence in our datasets. We want some simple statistics here. We want to calculate things like the unique value counts, because that'll help us understand cardinality. Cardinality refers to the number of unique values that a column can have. We often use the term low cardinality to refer to columns or features that have relatively few unique values, and high cardinality to refer to features that have a large number of unique values. We also want to detect things like empty strings and make sure that we distinguish them from explicit null values in our categorical variables. And we want to watch for potential inconsistencies in our category naming conventions as well.

Now, correlation analysis is another important operation.
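As a rough illustration of the numeric and categorical checks described above, here is a minimal pandas sketch. The function names, the sample values, and the thresholds are assumptions made for this example, not part of the course material. The Z-score cutoff is lowered to 2.5 here because in a very small sample a single extreme value inflates the standard deviation and keeps its own Z-score small.

```python
import pandas as pd

def numeric_profile(s):
    """Core statistics for one numeric column (pandas skips missing values)."""
    return {
        "mean": s.mean(),
        "median": s.median(),
        "std": s.std(),
        "min": s.min(),
        "max": s.max(),
        "skewness": s.skew(),
        "coef_variation": s.std() / s.mean(),
        "quartiles": s.quantile([0.25, 0.5, 0.75]).tolist(),
    }

def zscore_outliers(s, threshold=3.0):
    """Values whose distance from the mean exceeds `threshold` standard deviations."""
    z = (s - s.mean()) / s.std()
    return s[z.abs() > threshold]

def iqr_outliers(s, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]

def categorical_profile(s):
    """Frequencies, cardinality, nulls vs. empty strings, and naming variants."""
    present = s.dropna()
    normalized = present.str.strip().str.lower()
    # Count how many distinct raw spellings map to each normalized label.
    variants = present.groupby(normalized).nunique()
    return {
        "frequencies": s.value_counts(normalize=True).to_dict(),
        "cardinality": int(s.nunique()),
        "nulls": int(s.isna().sum()),
        "empty_strings": int((present.str.strip() == "").sum()),
        "inconsistent_labels": variants[variants > 1].index.tolist(),
    }

# Hypothetical sample columns.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 100])
categories = pd.Series(["red", "Red", "blue", "", None, "blue"])

numeric_summary = numeric_profile(values)
z_flagged = zscore_outliers(values, threshold=2.5)
iqr_flagged = iqr_outliers(values)
category_summary = categorical_profile(categories)
```

In this sample both methods flag the value 100, but they will not always agree. The interquartile range method does not depend on the mean or standard deviation, so it tends to be the safer choice for skewed columns.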
In correlation analysis, what we're trying to do is map the relationships between numeric variables to identify any dependencies. We also want to highlight strong correlations, because they can indicate redundant features. It often helps to eliminate one or more redundant features when we're building machine learning models. Correlation analysis is also useful for helping us detect counterintuitive correlations, because again, these might be a signal that we have data quality issues, either in our source system or in our data-processing pipeline. In general, this kind of correlation analysis can help support feature selection decisions. It also provides insights into business relationships between different metrics that may not have been obvious to us before.

Now, after we're done doing missing data analysis, we want to end up with a knowledge base that we can share. So we want to, for example, be able to deliver a description of the overall missing data quality and be able to score that based on things like completeness and consistency. We also want to provide specific recommendations for handling missing values, and we want to suggest appropriate normalization methods, which we do based on distribution analysis. And then finally, we want to be able to recommend feature engineering opportunities. These show themselves based on the statistical analysis that we have done.
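To make the correlation step described earlier more concrete, here is a minimal sketch that lists strongly correlated pairs of numeric columns as candidates for redundancy. It assumes pandas and Pearson correlation. The function name, the sample columns, and the 0.9 threshold are illustrative assumptions, not values from the course.

```python
import pandas as pd

def strong_correlations(frame, threshold=0.9):
    """List numeric column pairs whose absolute Pearson correlation meets the threshold."""
    corr = frame.select_dtypes(include="number").corr()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Hypothetical data: price_cents is just price_usd in different units.
df = pd.DataFrame({
    "price_usd": [10.0, 20.0, 30.0, 40.0, 50.0],
    "price_cents": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0],
    "units": [3, 1, 4, 1, 5],
    "label": list("abcde"),  # non-numeric, ignored by the correlation step
})

redundant_candidates = strong_correlations(df)
```

A pair that shows up here is only a candidate. Whether to drop one of the two features, and which one, is still a judgment call that depends on the model and on what the columns mean in the business context.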
