From the course: Machine Learning with SageMaker by Pearson
Ingesting data with Amazon S3 and SageMaker Data Wrangler - Amazon SageMaker Tutorial
As I mentioned in the previous lesson, our first step in training an ML model is to feed it some data. So why is this so critical? We've all heard the term garbage in, garbage out. If we do not ensure that our data is clean, accessible, and available to the training session, we're going to get a subpar, substandard output. S3 and SageMaker Data Wrangler are available to help us with this process, to simplify the cleaning and transformation of the data we feed into our models. So what is S3? S3 is the Simple Storage Service, which gives us scalable, durable, and secure object storage. Think of an object as a file, and think of S3 like a hard disk in the cloud. You can store the actual contents of your files as well as metadata, such as content type, custom metadata, last modified date, things like that. You can store structured, unstructured, and semi-structured data; S3 does not care. Again, think of it like a hard disk in the cloud. Whether you have a binary data format like Parquet or ORC, a CSV file, or a file that contains JSON dictionaries, it doesn't matter, S3 will store it. SageMaker Data Wrangler, I mentioned this a few moments ago as well, but what is it? It is a tool for data preparation and feature engineering. We'll talk about features in a subsequent lesson; just know that Data Wrangler allows you to pre-process and manipulate the source data that is going to be used to train your model. It integrates with S3, as well as other AWS services, so you can use S3 as the source of your data into Data Wrangler and then have a step-by-step processing of your data in order to clean, transform, and visualize that data. So why would we use S3? Why not store the data on an EC2, or Elastic Compute Cloud, instance in AWS? S3 integrates with IAM, and this is one of the key reasons to use it.
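As a quick sketch of what storing an object with metadata looks like, here is a hypothetical upload in Python; the bucket name, key, CSV contents, and custom metadata are all made up for illustration, and the actual AWS call is left commented out because it requires boto3 and valid credentials.

```python
# Hypothetical names for illustration only.
BUCKET = "my-ml-training-data"
KEY = "raw/weather.csv"

# S3 stores the object's bytes plus metadata: a standard Content-Type
# and arbitrary custom key/value pairs.
put_kwargs = {
    "Bucket": BUCKET,
    "Key": KEY,
    "Body": b"date,temp\n2024-01-01,3.2\n",  # sample CSV bytes
    "ContentType": "text/csv",
    "Metadata": {"source": "weather-station-7"},  # custom metadata
}

# With the AWS SDK for Python (boto3) installed and credentials set up,
# this would be sent via:
#   import boto3
#   boto3.client("s3").put_object(**put_kwargs)
```

Note that the custom metadata travels with the object, so anything reading it back later (Data Wrangler included) can see where the file came from.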
IAM is Identity and Access Management, and it has permissions and policies that can be associated with the data within S3. S3 also provides server-side encryption for your data, as well as versioning, which allows you to, for example, train your model based on a piece of data; when that data gets updated, it creates a new version in S3 that can then trigger a pipeline to execute, which would retrain the model using the new data. Also, S3 is highly durable, to some ridiculous number of nines. You hear the term five nines for network availability; S3 is designed for eleven nines (99.999999999%) of data durability with the S3 Standard storage class, wherein your data is replicated across availability zones. So the next question you ask yourself is, why should I use SageMaker Data Wrangler? Well, it's there, it's easy to access, and it works great. It is a visual interface for data preparation. You say, take this data from S3, and it will pull in a preview, say the first 100 rows of your original data, and it allows you to create a step-by-step transformation of your data that will then be used to feed the training of your model. It's also highly integrated with SageMaker Pipelines, so you can say: here's the data, pull it in, transform it like this, spin up your training instances, that results in a model, and that model then gets deployed to production, all using SageMaker Pipelines. So how does data get from S3 into SageMaker? Well, we create a bucket in S3 and we put our data in that bucket, and this can all be done programmatically as well. It's not part of this course to cover programmatic access to AWS; take a look at the AWS Developer certification if you want to learn more about that.
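The versioning idea can be sketched as a single configuration change, assuming a hypothetical bucket name; once applied, an overwrite of the training data creates a new object version rather than replacing the old one, which is what gives a pipeline something to react to.

```python
# Versioning configuration for an S3 bucket: once enabled, every
# overwrite of an object creates a new version instead of replacing
# the old one, so updated training data can trigger retraining.
versioning_config = {"Status": "Enabled"}

# With boto3 installed and AWS credentials configured:
#   import boto3
#   boto3.client("s3").put_bucket_versioning(
#       Bucket="my-ml-training-data",  # hypothetical bucket
#       VersioningConfiguration=versioning_config,
#   )
```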
But with programmatic access to AWS, you could, through a single program, create a bucket, store data in that bucket, create a training job within SageMaker, interact with Data Wrangler, et cetera, all using secure access provided by IAM. The point being that you put data in S3, you manipulate it with Data Wrangler, and then that processed data goes into SageMaker for training. As for formats that are compatible with S3 and Data Wrangler, we touched on this slightly in the previous lesson: CSV, JSON, and tab-separated values are supported in Data Wrangler, and Parquet, ORC, Avro, RecordIO, and TFRecord are supported as well. Specifying a connection from Data Wrangler to an S3 bucket is particularly easy, and we'll see this in the demo, where I am going to manipulate data in Data Wrangler from S3. You specify the S3 bucket and associate an IAM role with Data Wrangler; this allows Data Wrangler to interact with the data in that bucket. So here's another little tangent for you: nothing within AWS trusts anything else. For SageMaker AI and Data Wrangler to pull data from an S3 bucket, they need to be explicitly permitted to do so. That means assuming an IAM role, and that role is going to have a permission policy associated with it that allows Data Wrangler to talk to S3. You select a dataset that is going to be used for ingestion, and then you start manipulating and pre-processing in Data Wrangler. Some common pre-processing tasks that we do in Data Wrangler: handling missing values, scaling and normalizing data, and encoding categorical variables. Visualizing data transformations in real time is possible as well.
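As a sketch of what such a permission policy might look like, here is a minimal read-only S3 policy in the standard IAM JSON policy format, built as a Python dictionary. The bucket name is hypothetical, and a real role for Data Wrangler may need additional permissions beyond these two read actions.

```python
import json

# Least-privilege sketch: the role may list one bucket and read its
# objects, nothing else. The bucket name is a made-up example.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::my-ml-training-data",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-ml-training-data/*",
        },
    ],
}

print(json.dumps(policy, indent=2))
```

Notice the asymmetry: ListBucket applies to the bucket itself, while GetObject applies to the objects inside it, which is why the two statements have different resource ARNs.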
So whenever you get those first 100 rows, or whatever the case may be, and you're defining the step-by-step rules that you're going to apply in Data Wrangler, you could have a rule in there that says: if you're missing a value, such as the temperature, fill it in, or impute that data, based on something like a mean value or a static value, whatever the case may be. It's up to you how you do that. Once your data has been processed by Data Wrangler, you can export it back to S3. This is a common thing to do, wherein you have a bucket for pre-processed data and a bucket for post-processed data, and then your training job sources the data from the post-processed bucket. Data Wrangler supports multiple output formats, as we've seen, like CSV, Parquet, and JSON. For optimizing data ingestion, it is recommended, based on various things like dataset size or distributed processing, to choose an appropriate data format, such as Parquet or RecordIO. If you have a large dataset, you can split it into shards, of course, with TFRecord. And it is recommended to apply IAM policies to control access to the data within S3. There is a principle of least privilege, P-O-L-P, that applies whenever you are creating policies within IAM that will be associated with something like an IAM role. It is recommended to follow this principle of least privilege to ensure that something cannot access data which it should not; it's sort of a for-your-eyes-only type of policy. Some potential issues and possible solutions: slow data transfer speeds. If you're putting a terabyte of data into S3, it could take a while. There is a feature called Transfer Acceleration, wherein you can use an edge location, which is typically used with the CloudFront CDN from AWS, to ingest data into S3 at those edge locations.
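The mean-imputation rule described above can be sketched outside Data Wrangler in a few lines of pandas; the column name and values here are invented for illustration, and the export line to a hypothetical post-processed bucket is left commented out since it needs S3 access.

```python
import numpy as np
import pandas as pd

# A tiny stand-in for the sampled rows; "temperature" has missing values.
df = pd.DataFrame({"temperature": [21.0, np.nan, 19.0, np.nan, 24.0]})

# Impute missing temperatures with the column mean; a static value
# would work the same way with fillna(some_constant).
df["temperature"] = df["temperature"].fillna(df["temperature"].mean())

# Exporting back to S3 into a post-processed bucket would look like:
#   df.to_parquet("s3://my-post-processed-bucket/weather.parquet")
```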
And it's possible that it's faster for you to get data into that edge location than it would be to go to a standard region-based ingestion point. Inconsistent data formats are also a problem; you can use Data Wrangler's validation tools in order to transform that data and arrive at a common, consistent data format. Some common questions that we see with data ingestion: what do I do with large datasets? You can use pipe mode in order to stream data into SageMaker. And can I pre-process data without exporting it? Yes, you can process it in place, meaning that it grabs the data from S3, transforms it, and puts it back into the same place. Key points of what we touched on here: S3 and SageMaker Data Wrangler can simplify your data ingestion and pre-processing, and they support a wide range of formats for scalable workflows. Again, S3 is just a file storage location, a hard drive in the cloud. Integrate these tools into your pipelines to save time and ensure data quality. Some best practices with respect to ensuring data quality: make sure you address missing values and handle outliers and inconsistencies early. There are tools within Data Wrangler to visualize your data, so when you're looking at a curve of your data with outliers sitting below the 5th percentile and above the 95th percentile, Data Wrangler can actually handle that for you and clean up the outliers. We will be talking in a subsequent lesson about SageMaker Pipelines, and Data Wrangler can be used with those pipelines in order to pre-process data in an automated fashion: data changes in S3, we pre-process it, that triggers a new model training, and then deployment, all using that pipeline. Keep an eye on Data Wrangler: do continuous monitoring to check for data drift, and update your pre-processing steps in the event your dataset evolves over time.
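The outlier handling mentioned above, trimming values that fall below the 5th or above the 95th percentile, can be sketched in pandas as a clip to those quantiles; the numbers here are invented for illustration and the exact behavior of Data Wrangler's built-in transform may differ.

```python
import pandas as pd

# Invented sample with one extreme outlier (100).
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 100])

# Clip everything outside the 5th..95th percentile range.
lo, hi = values.quantile(0.05), values.quantile(0.95)
clipped = values.clip(lower=lo, upper=hi)
```

Clipping keeps the row count the same (no data is dropped), which matters when other columns in the same rows are still needed for training.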