From the course: Learning Amazon SageMaker AI

Analyzing and preparing data

- [Instructor] In your SageMaker Studio workshop, your task is to prepare the raw materials, the data. Just like in any workshop, you wouldn't start building with rough, uncut materials. You'd first clean, measure, and shape everything to the correct size. Let's talk about how to collect data from multiple sources, clean and pre-process it, and prepare it for model training. When we're done, you'll have the skills to prepare Dataville's traffic data for the machine learning system you'll build to optimize the traffic flow.

Before diving into building models, we need to make sure the data is in the best possible shape. In machine learning, there's a saying, "Garbage in, garbage out." In other words, if the data is messy, incomplete, or irrelevant, the model you build won't perform well no matter how sophisticated it is.

Step one, collecting data from multiple sources. When working on a machine learning project, you'll often gather data from various sources, whether it's sensors, historical data, or even external systems. For our smart city traffic management system, we'll pull data from multiple traffic sensors installed throughout Dataville. These sensors provide important information. Timestamp is the time and date the reading was collected. Sensor_id is a unique identifier for the sensor providing the reading. Vehicle_count is the number of vehicles passing through an area at a given time. Avg_speed is the average speed of traffic flow in different conditions. Weather_conditions includes environmental factors like rain, snow, or clear skies. And traffic incidents is the data on accidents, roadblocks, or other disruptions.

Step two is cleaning the data. Now that we've collected our data, it's time to process it. This involves several steps. First, managing missing data. It's common for sensor data to have gaps or missing values.
You'll need to decide whether to fill in missing data using averages, medians, or other methods, or remove those records entirely. Next, removing duplicates. In some cases, sensors may log the same data multiple times. Duplicate entries can skew your model's accuracy, so they should be deleted. Then correcting data types. Ensure each column in the dataset has the correct data type. For instance, timestamps should be recognized as dates and numerical values should be integers or floats. And finally, standardizing formats. Standardizing the format of the data ensures it's consistent across all records. For example, you want all weather conditions to be categorized in the same way, rather than mixing Rain with a capital R and rain with a lowercase r. By cleaning up the data, you're ensuring that your model will learn from high-quality inputs.

The next step is pre-processing the data. Once the data is clean, it's time to pre-process it. That means transforming the data into a format your machine learning model can easily understand. These are the key pre-processing tasks you need to know. Normalization or scaling. If your data includes values with vastly different scales, such as vehicle count versus average speed, it can negatively impact your model's performance. Normalizing or scaling the data puts the features on a comparable range, so no feature dominates simply because its values are larger. Feature engineering. This is where you create new features from your existing data to help the model learn better. For instance, you might create a feature that combines traffic count with weather data to see how rain affects vehicle speed. Encoding categorical variables. If you have categorical data such as weather conditions, it needs to be converted into numerical values before feeding it into the model. This can be done using techniques like one-hot encoding.

Data is the foundation of any machine learning project. And for even the best algorithms to succeed, you need to prepare that data properly.
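The four cleaning steps described above (missing data, duplicates, data types, and formats) can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code. It uses only the standard library, and the sample readings, the lowercase field names, the timestamp format, the function name `clean_readings`, and the choice to treat a repeated sensor and timestamp pair as a duplicate are all assumptions made for the example.

```python
from datetime import datetime
from statistics import median

# Hypothetical raw readings, as strings the way a CSV export might deliver them.
RAW_READINGS = [
    {"timestamp": "2024-05-01 08:00:00", "sensor_id": "S1",
     "vehicle_count": "120", "avg_speed": "42.5", "weather_conditions": "Rain"},
    # Same sensor and timestamp logged twice (assumed to be a duplicate).
    {"timestamp": "2024-05-01 08:00:00", "sensor_id": "S1",
     "vehicle_count": "120", "avg_speed": "42.5", "weather_conditions": "rain "},
    # Missing speed value.
    {"timestamp": "2024-05-01 08:05:00", "sensor_id": "S1",
     "vehicle_count": "95", "avg_speed": "", "weather_conditions": "CLEAR"},
    {"timestamp": "2024-05-01 08:00:00", "sensor_id": "S2",
     "vehicle_count": "60", "avg_speed": "55.0", "weather_conditions": "clear"},
]


def clean_readings(raw_rows):
    """Return typed, de-duplicated rows with missing speeds filled by the median."""
    seen = set()
    cleaned = []
    for raw in raw_rows:
        row = {
            # Correct data types: parse timestamps and numbers from strings.
            "timestamp": datetime.strptime(raw["timestamp"], "%Y-%m-%d %H:%M:%S"),
            "sensor_id": raw["sensor_id"].strip(),
            "vehicle_count": int(raw["vehicle_count"]),
            "avg_speed": float(raw["avg_speed"]) if raw["avg_speed"].strip() else None,
            # Standardize formats: one spelling per weather category.
            "weather_conditions": raw["weather_conditions"].strip().lower(),
        }
        # Remove duplicates: keep the first reading per (sensor, timestamp).
        key = (row["sensor_id"], row["timestamp"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)

    # Manage missing data: fill gaps with the median of the observed speeds.
    observed = [r["avg_speed"] for r in cleaned if r["avg_speed"] is not None]
    fill_value = median(observed) if observed else None
    for row in cleaned:
        if row["avg_speed"] is None:
            row["avg_speed"] = fill_value
    return cleaned


cleaned_rows = clean_readings(RAW_READINGS)
for row in cleaned_rows:
    print(row)
```

Filling with the median is only one of the options mentioned above. Dropping the incomplete records instead would be a matter of skipping rows whose speed is missing.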
Next, we'll dive deeper into SageMaker Data Wrangler and some advanced data preparation techniques to refine your dataset further.
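As a rough illustration of the three pre-processing tasks covered in this video (scaling, feature engineering, and one-hot encoding), here is a small standard-library Python sketch. It is a sketch under stated assumptions rather than a SageMaker API: the helper names `min_max_scale` and `one_hot`, the sample rows, and the rain interaction feature are all hypothetical, and min-max scaling is just one common way to normalize.

```python
def min_max_scale(values):
    """Scale numbers linearly into [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Return (sorted categories, one list of 0/1 indicators per value)."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


# Hypothetical cleaned readings (values invented for illustration).
rows = [
    {"vehicle_count": 120, "avg_speed": 40.0, "weather_conditions": "rain"},
    {"vehicle_count": 60, "avg_speed": 60.0, "weather_conditions": "clear"},
    {"vehicle_count": 90, "avg_speed": 50.0, "weather_conditions": "snow"},
]

# Normalization or scaling: bring both numeric columns into the same range.
scaled_count = min_max_scale([r["vehicle_count"] for r in rows])
scaled_speed = min_max_scale([r["avg_speed"] for r in rows])

# Encoding categorical variables: one indicator column per weather category.
categories, weather_onehot = one_hot([r["weather_conditions"] for r in rows])

features = []
for row, count, speed, weather in zip(rows, scaled_count, scaled_speed, weather_onehot):
    is_rain = 1 if row["weather_conditions"] == "rain" else 0
    # Feature engineering: an assumed interaction feature that carries the
    # scaled vehicle count only when it is raining.
    features.append([count, speed, count * is_rain] + weather)

print(categories)
for vector in features:
    print(vector)
```

Each row ends up as a purely numeric vector: two scaled values, one engineered feature, and one indicator per weather category.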
