From the course: AWS Certified Data Engineer Associate (DEA-C01) Cert Prep

Data formats

- [Instructor] There are many considerations when selecting data formats for data pipelines. Certain formats are optimized for writing data, and others for reading it. For storage, there are compression options that make your data consume less space while preserving performance. In this lesson, we'll go through some of the common options and when to use each. When we think of data, we typically think of records in a table format. On disk, this can be stored in either row or column format. Row-based formats store the data row by row in the file. This is the fastest way to write the data to disk, but not necessarily the fastest option for reads, because you have to skip over irrelevant data. For example, if we want to select just the prices from this table and compute the average price, we have to skip over the blocks containing the SKUs and inventory numbers. Column-based formats store all of the values for a column together in the file. This leads to better compression, because values of the same data type are now grouped together, and it also typically provides better read performance, because you can skip the columns you aren't interested in and read all the values you need from contiguous blocks. As a result, row format is typically used for OLTP transactional databases, while column format is used for OLAP data warehouses. Text files can be stored in various formats, depending on considerations including simplicity, performance, readability, and compatibility. Systems that generate text data usually write the files in row-based formats such as comma-separated values or fixed width. And while this is good for write performance, read performance for row-based formats is usually slow. Apache Avro is a binary, row-based format that improves on some of the limitations of plain text formats by including the schema in the file. This makes the format more tolerant of schema changes.
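The row-versus-column trade-off above can be sketched in plain Python. This is a minimal in-memory illustration, not a real file format; the table and its field names (sku, price, inventory) are made up to mirror the example in the lesson.

```python
# Row-based layout: each record keeps all of its fields together,
# the way a CSV file or an OLTP database page stores data.
rows = [
    {"sku": "A100", "price": 9.99, "inventory": 12},
    {"sku": "B200", "price": 4.50, "inventory": 30},
    {"sku": "C300", "price": 7.25, "inventory": 8},
]

# To average prices, a row-oriented reader must visit every record
# and skip over the sku and inventory fields in each one.
avg_from_rows = sum(r["price"] for r in rows) / len(rows)

# Column-based layout: all values of one column are stored
# contiguously, the way ORC or Parquet lays data out on disk.
columns = {
    "sku": ["A100", "B200", "C300"],
    "price": [9.99, 4.50, 7.25],
    "inventory": [12, 30, 8],
}

# The same query now scans a single contiguous list and never
# touches the other columns at all.
avg_from_columns = sum(columns["price"]) / len(columns["price"])

print(avg_from_rows, avg_from_columns)  # same result, different access pattern
```

Both layouts hold identical data; the columnar one simply puts the values a column-oriented query needs next to each other, which is why OLAP engines prefer it.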
This format also supports multiple options for compression, and like other row-based formats, Avro excels as a format for OLTP systems. Apache ORC, which stands for Optimized Row Columnar, uses a columnar storage format. ORC is specifically designed for Hadoop workloads and delivers very good read performance, especially with Apache Hive. Parquet is another columnar storage format that supports highly efficient compression and decompression. Parquet files include the schema, so they are better suited for schema evolution. Both ORC and Parquet are designed for OLAP applications, as they provide good read performance. You'll need to know which file formats are compatible with, and provide the best performance for, your pipeline components. Compressing your data can save storage costs and maximize the throughput of your pipeline. There are several common, standard compression algorithms. The differences between them are significant, and you'll need to understand which best meets your needs and is compatible with your storage and processing components. Data pipelines with large throughput requirements employ parallel processing techniques that distribute data across multiple nodes. To take advantage of parallel processing, you'll need to design your files with an optimal size for each node, which may mean splitting larger files into multiple smaller ones. A disadvantage of two popular compression algorithms, Gzip and Snappy, is that files compressed with them are no longer splittable, so you need to split your data before applying these compression techniques. Bzip2 and LZO are both splittable, but while bzip2 achieves very high compression, it is slow to compress and decompress, while LZO is fast but doesn't compress the data nearly as much.
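The compression trade-off can be observed directly with codecs from the Python standard library. This is a small sketch under stated assumptions: `bz2` implements the bzip2 algorithm discussed in the lesson, while Snappy and LZO have no stdlib bindings, so `gzip` stands in as the second codec here purely to illustrate the size comparison. The sample payload is invented repetitive CSV-like text.

```python
import bz2
import gzip

# A highly repetitive text payload, like a log or CSV extract.
payload = b"sku,price,inventory\n" + b"A100,9.99,12\n" * 10_000

gz_size = len(gzip.compress(payload))
bz_size = len(bz2.compress(payload))

print(f"original: {len(payload)} bytes")
print(f"gzip:     {gz_size} bytes")
print(f"bzip2:    {bz_size} bytes")
# bzip2 tends to compress tighter than gzip on data like this, at
# the cost of noticeably slower compression and decompression.
```

Note that size is only one axis: for a pipeline you also have to weigh speed and, as the lesson points out, whether the resulting files are splittable across parallel workers.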