From the course: Machine Learning with SageMaker by Pearson

Overview of common data formats

The first step in our machine learning lifecycle is data preparation. We need to ensure that our data is consistent, because it's going to affect how well the model gets trained. So if we have, for example, the size of an engine in cubic centimeters, that's a number, an integer, while a car make is most likely going to be a string. Now we need to talk about the data formats that AWS can understand in order to ingest this data.

We have three common text-based formats: comma-separated values (CSV), JavaScript Object Notation (JSON), and Extensible Markup Language (XML). CSV is what you commonly see in a spreadsheet, if you will. It's just ASCII-based, and you could have some Unicode characters in there, but let's not talk about that right now. Just simple comma-separated values: you have a value, comma, value, comma. Widely used, easy to parse. You can open this in something like Excel, or Numbers on the Mac. JSON is flexible and hierarchical, but it does require more space to store. And then there's XML, which is pretty verbose and less common, but it is structured.

I have examples of those here. With CSV, you can see the first row is our column headers: Make, Model, Year, Color; then Toyota, Camry, 2020, Blue. JSON gives us dictionaries, arrays, strings, numbers, and Booleans. Here we have a dictionary that contains four keys, make, model, year, and color, and then the values for those keys, Toyota, Camry, 2020, and Blue, just like we saw with CSV. And you'll notice that the year here, 2020, does not have quotes around it, so that tells us it is a number. Over in XML, we have the same data represented with elements: CAR, MAKE, MODEL, YEAR, and COLOR. Those elements MAKE, MODEL, YEAR, and COLOR define the CAR.

Parquet is a binary data format created by the Apache project, and it is commonly used with ML workloads over in AWS.
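The typing difference called out above, the unquoted year in JSON versus everything-is-text in CSV, can be reproduced in a few lines of Python. This is a small sketch reconstructing the on-screen car record; the variable names are my own.

```python
import csv
import io
import json

# The same car record in CSV and JSON, reconstructed from the
# on-screen examples in the video.
csv_text = "Make,Model,Year,Color\nToyota,Camry,2020,Blue\n"
json_text = '{"make": "Toyota", "model": "Camry", "year": 2020, "color": "Blue"}'

# CSV carries no type information: every field parses as a string,
# so the year comes back as "2020".
row = next(csv.DictReader(io.StringIO(csv_text)))
print(row["Year"], type(row["Year"]).__name__)  # 2020 str

# JSON preserves types: 2020 has no quotes, so it parses as an int.
car = json.loads(json_text)
print(car["year"], type(car["year"]).__name__)  # 2020 int
```

This is exactly why consistency matters in data preparation: the same logical value can arrive as a string or a number depending on the format it was stored in.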
Parquet uses columnar storage, optimized for analytical workloads. We also have ORC, which is also from the Apache project; ORC stands for Optimized Row Columnar, and it's another binary data format. And finally, we have Avro: row-based, compact, and driven by a schema.

Typically, when dealing with machine learning in Amazon, you're going to be dealing with code a lot. You can do SageMaker no-code in Canvas, or low-code, if you will, but the real power of doing ML workloads within AWS comes from interacting with the services programmatically. So using SageMaker from Boto3 in Python is commonly what you're going to be doing.

Here's an example of using Python to write a Parquet file. We have some Star Trek data here: captain James T. Kirk, actor William Shatner, ship USS Enterprise, quote "Beam me up, Scotty." So this is just a dictionary of Star Trek data that we're writing into a Parquet file. Here, I execute that Python script and create the Parquet file. We can see that it takes 3,558 bytes to store that data in a Parquet file. And if I were to cat that file, or type it if you're on Windows, you're just going to see binary data. Also, I have this URL at the bottom of the screen here. This code example, the Star Trek data, did not come from me; I did not author it, so I wanted to give credit to Thunderboot here. This is Thunderboot's example of writing Parquet data.

Back to ORC, Optimized Row Columnar: again, this is another Apache project, and similarly it stores data in binary format. And Avro also comes from Apache. So all three of these, Avro, ORC, as well as Parquet, are ingestible data formats within AWS services for machine learning.

Next up is RecordIO. RecordIO was built on Protobuf from Google, and it's a file format designed for efficient data storage and retrieval. Protobuf is used for moving data between applications across a network, or for interprocess communication within a single computer. RecordIO is a way of storing data using Protobuf.
RecordIO is designed for TensorFlow and MXNet for training your ML models. It's efficient when you're doing sequential data access, and it does split data sets into shards if you are doing distributed processing. So if you're doing learning across multiple GPU-enabled instances within AWS, for example, then RecordIO can be helpful for distributing that data across the various instances, and it's recommended for optimizing training speed.

So what is TensorFlow? I just mentioned this a moment ago. TensorFlow is an open-source machine learning framework developed by Google, with its purpose being to build, train, and deploy models at scale. When we say scale, we're talking about distributing those models across multiple compute instances, and having those models run on a CPU, a GPU (Graphics Processing Unit), or a TPU (Tensor Processing Unit).

There are a lot of use cases for TensorFlow, from deep learning to classic machine learning, where you are simply trying to predict an outcome by feeding it data. We talk about machine learning these days like it's a new thing, but it's been around for a long time. Look at weather forecasting models: how long have humans been trying to forecast the weather? And all of that comes from feeding a bunch of data into something that's trying to algorithmically determine the outcome. So, say it's January 20th, and at noon last year on January 20th the temperature was X. Today, based on various factors, maybe it's going to be similar. And we try to predict that using some sort of algorithm that is trying to intelligently determine, or forecast, or predict what the outcome is going to be.

TensorFlow is easy to use, with intuitive APIs. It's cross-platform: you can run it on Linux, Mac, Windows, whatever the case may be. And natural language processing, computer vision, reinforcement learning, and time series analysis are common use cases for TensorFlow. As for the core components of TensorFlow, we have the core itself.
TensorFlow Core is a low-level API for building custom ML workflows. Keras is a high-level API for fast prototyping and development. Then we have TensorFlow Lite for running on embedded compute-type systems like cell phones. And then there's TensorFlow.js, which can run directly in the browser.

Formats like TFRecord, for sequential data, are available in TensorFlow, optimized for streaming in large data sets. What this means is that we don't have to read the entire data set into memory in order to process it; we're operating on chunks. When you're working with structured sequences, as well as time series data, something like TFRecord can be particularly useful, especially with a large data set.

Now, why would we want to use TFRecord in SageMaker? As I mentioned, it can be used for streaming, meaning you don't have to process the entire data set at once, and you don't have to hold the entire data set in memory in order to process it. So this does allow for efficient data loading, by reading in pages, or chunks, or shards of data. And it allows you to distribute that data across multiple compute instances, which can speed things up, since now you're doing parallel processing of your data. It's also integrated with TensorFlow: we have the TFRecordDataset API, which allows you to pre-process and batch that data.

When should you use it with SageMaker? Again, it comes down to distribution of processing across multiple nodes: how big is your data set, and are you actually able to chunk up that data set for distribution across multiple training nodes? If you are, then TFRecord can speed things up quite a bit. For a data set that takes 10 hours to process on a single compute instance, could you potentially get it done in one hour by distributing it across 10 instances? It's that classic question of whether 10 humans can birth a baby in one month, rather than nine months with one human. So which data format do you choose?
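The chunking and sharding ideas above can be sketched in plain Python. This is not an actual TFRecord implementation, just the access pattern: a generator that streams fixed-size batches without holding the whole data set in memory, and a round-robin split into shards for distribution across training nodes.

```python
def batched(records, batch_size):
    """Yield lists of up to batch_size records at a time, so the
    full data set never has to sit in memory at once."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


def shard(records, num_shards):
    """Assign records round-robin to shards, one list per training node."""
    shards = [[] for _ in range(num_shards)]
    for i, record in enumerate(records):
        shards[i % num_shards].append(record)
    return shards


records = range(10)  # stand-in for a large data set
batches = list(batched(records, 4))
shards = shard(range(10), 3)
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(shards)   # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

In TensorFlow itself, the TFRecordDataset API handles this kind of batching for you; the sketch just makes the streaming and sharding behavior concrete.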
It really comes down to what you're trying to do. Is it a large data set? If so, take a look at the binary formats like Parquet or Avro. If it's just a spreadsheet of columns of data, like date/time series type stuff, then of course we have CSV, JSON, et cetera. Do you need to be compatible with ML frameworks like TensorFlow? If so, take a look at TFRecord as well; SageMaker does support TFRecord. What about compression and read/write speeds? Those can come into play as well.

AWS S3 supports all major data formats. S3 is the Simple Storage Service; in AWS, it is just object-based storage, a place to store a file. There is some metadata associated with these files, but the underlying contents of those files are opaque to S3; it doesn't care. So it will support storage of a CSV file, a Parquet file, whatever the case may be. Glue does support transformations between formats; for example, you could convert from CSV to Parquet in AWS Glue. Finally, SageMaker does support CSV, JSON, RecordIO, and TFRecord for training your models.
