Generally, it is preferable to use one of the Hadoop-specific container formats for storing data in Hadoop. Still, in many instances you will want to store source data in its raw form. One of the most robust features of Hadoop is its ability to store all of your data, irrespective of format. Keeping data available in its raw, source form means you can always perform new processing and analytics against it as requirements change.
Text Data:-
A widespread use of Hadoop is the storage and analysis of logs such as weblogs and server logs. Such text data also comes in many other forms, for example CSV files or unstructured data. Since text files can quickly consume significant space on your Hadoop cluster, you will want to select a compression format for them. Also note that storing data in text format carries type-conversion overhead: a number stored as text, for instance, must be parsed back into a numeric type every time it is processed.
Which compression format to choose depends on how the data will be used. For archival purposes, you can choose the most compact compression available; but if the data will be used in processing jobs such as MapReduce, you will want a format that can be split. Splittable formats enable Hadoop to break files into chunks for processing, which is critical to efficient parallel processing.
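For example, here is a minimal sketch of configuring a MapReduce job so that its text output is compressed with bzip2, one of the few codecs that remains splittable for plain text. The job name and surrounding class are illustrative only:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SplittableTextOutput {
        public static Job configureJob(Configuration conf) throws Exception {
            Job job = Job.getInstance(conf, "splittable-text-output");
            // Compress the job's text output...
            FileOutputFormat.setCompressOutput(job, true);
            // ...with bzip2, so downstream jobs can still split the files
            FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
            return job;
        }
    }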
That said, in many cases a container format such as SequenceFile or Avro is the preferred choice for most file types, since these container formats provide compression that is also splittable.
Structured Text Data:-
Structured formats such as XML and JSON present unique challenges in Hadoop, because splitting XML and JSON files for processing is tricky and Hadoop does not provide a built-in InputFormat for either. JSON presents even greater difficulties than XML, because no tokens mark the beginning or end of a record. For these formats, you have a few options:
- Transforming the data into Avro, which gives a compact and efficient way to store and process the data (a sketch of this approach follows this list).
- Using a library designed for processing XML or JSON files. For XML, an example is the XMLLoader in Pig's PiggyBank library; for JSON, the Elephant Bird project provides LzoJsonInputFormat.
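To illustrate the first option, here is a minimal sketch of writing records to an Avro data file with the Avro Java API. The LogEvent schema, its fields, and the hardcoded values are hypothetical placeholders; in practice the fields would be parsed out of your JSON or XML source.

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import java.io.File;

    public class JsonToAvro {
        // Hypothetical schema for a web-log record
        private static final String SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"LogEvent\",\"fields\":["
          + "{\"name\":\"host\",\"type\":\"string\"},"
          + "{\"name\":\"status\",\"type\":\"int\"}]}";

        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
            // One record; in practice the values come from the parsed source data
            GenericRecord event = new GenericData.Record(schema);
            event.put("host", "example.com");
            event.put("status", 200);
            try (DataFileWriter<GenericRecord> writer = new DataFileWriter<>(
                    new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, new File("events.avro")); // schema is stored in the file header
                writer.append(event);
            }
        }
    }

Once written this way, the file is splittable, compressible, and readable from any language with an Avro binding.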
Binary Data:-
Text is the most common source data format stored in Hadoop, but you can also use Hadoop to process binary files. To store and process binary files in Hadoop, you should use a container format such as SequenceFile. If the splittable unit of binary data is larger than 64 MB, you might consider putting the data in its own file, without using a container format.
Hadoop File Types:-
Several Hadoop-specific file formats were created to work well with MapReduce. They include file-based data structures such as SequenceFiles, serialization formats such as Avro, and columnar formats such as RCFile and Parquet. These file formats share the following characteristics that are important for Hadoop applications:
Splittable Compression:-
These formats support common compression formats and are also splittable. Splittability is a crucial consideration for storing data in Hadoop, because it allows large files to be divided for input to MapReduce and other types of tasks, and the ability to split a file for processing by multiple tasks is an essential part of parallel processing.
Agnostic Compression:-
You can compress the file with any compression codec without readers having to know the codec, because the codec is stored in the header metadata of the file format.
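As a small illustration, using the SequenceFile format (covered in the next section) as the example: the reader below never names a codec, because the codec is discovered from the file header. The command-line path argument is the only input:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;

    public class ShowCodec {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Key/value classes and the compression codec are all read
            // from the header; nothing codec-specific is supplied here.
            try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                    SequenceFile.Reader.file(new Path(args[0])))) {
                System.out.println("codec: " + reader.getCompressionCodec()); // null if uncompressed
            }
        }
    }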
File-based Data Structures:-
The SequenceFile format is one of the most commonly used file-based formats in Hadoop, but you will also find other file-based types such as MapFiles, SetFiles, ArrayFiles, and BloomMapFiles. Because these formats were designed to work with MapReduce, they offer a high level of integration for all forms of MapReduce jobs.
SequenceFiles store data as binary key-value pairs. There are three formats for records within SequenceFiles:
- Uncompressed-
Uncompressed SequenceFiles do not offer any advantages over their compressed alternatives, since they are less efficient for input/output (I/O) and occupy more space on disk than the same data in compressed form.
- Record-compressed-
This format compresses each record as it is added to the file.
- Block-compressed-
This format waits until data reaches block size before compressing. Block compression provides better compression ratios than record compression and is generally the preferred option for SequenceFiles. Note that a block here refers to a group of records compressed together within the SequenceFile; it is unrelated to the HDFS block size.
Irrespective of the format, every SequenceFile uses a common header containing basic metadata about the file, such as the compression codec used, key and value class names, user-defined metadata, and a randomly generated sync marker. This sync marker is also written into the body of the file, which allows seeking to random points in the data and is key to facilitating splits.

A common use case for SequenceFiles is as a container for smaller files. Storing a large number of small files in Hadoop can cause problems such as excessive memory use on the NameNode, because metadata for each file stored in HDFS is held in memory. Many small files can also lead to many processing tasks, causing excessive overhead. Because Hadoop is optimized for large files, packing smaller files into a SequenceFile makes their storage and processing much more efficient.
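Here is a minimal sketch of that packing pattern using the SequenceFile writer API, storing each small file as one record with the file name as the key and the raw bytes as the value (the same BytesWritable approach works for the binary data discussed earlier). The output path is illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class PackSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path("packed.seq")),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class),
                    // block compression: records are buffered and compressed together
                    SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
                for (String name : args) { // each small local file becomes one record
                    byte[] bytes = Files.readAllBytes(Paths.get(name));
                    writer.append(new Text(name), new BytesWritable(bytes));
                }
            }
        }
    }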
Serialization Formats:-
Serialization refers to transforming data structures into byte streams, either for storage or for transmission over a network; deserialization is the reverse process of converting a byte stream back into data structures. Serialization is core to a distributed processing system such as Hadoop, because it allows data to be converted into a format that can be stored as well as transferred over a network connection. Serialization touches two areas of data processing in distributed systems: interprocess communication and data storage.
The main serialization format used by Hadoop is Writables. Writables are compact and fast, but they are difficult to extend and can be used only from Java. Other serialization frameworks are also used within the Hadoop ecosystem, including Thrift, Protocol Buffers, and Avro. Of these, Avro is the best option, because it was created specifically to address the limitations of Hadoop Writables.
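As a small illustration of the Writable contract, the following sketch serializes an IntWritable to raw bytes by hand (the value 163 is arbitrary). This compactness is the strength of Writables; the Java-only API is their limitation:

    import org.apache.hadoop.io.IntWritable;
    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;

    public class WritableDemo {
        public static void main(String[] args) throws Exception {
            IntWritable value = new IntWritable(163);
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                value.write(out); // a Writable serializes itself to any DataOutput
            }
            System.out.println("serialized to " + bytes.size() + " bytes"); // 4 bytes for an int
        }
    }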
Thrift:-
Facebook developed Thrift as a framework for implementing cross-language interfaces to services. Thrift uses an Interface Definition Language (IDL) to define interfaces, and the IDL file is used to generate stub code for implementing RPC clients and servers across languages. Thrift lets you define a single interface that can be used from many languages to access a wide range of underlying systems. Note that Hadoop does not offer native support for Thrift as a data storage format.
Protocol Buffers:-
The Protocol Buffers format was developed at Google to facilitate data exchange between services written in different languages. Like Thrift, its structures are defined via an IDL, which is used to generate stub code for multiple languages. Also like Thrift, Protocol Buffers do not support internal compression of records, are not splittable, and have no native MapReduce support.
Avro:-
Avro is a language-neutral data serialization system designed to address the lack of language portability in Hadoop Writables. Like Thrift and Protocol Buffers, Avro data is described by a language-independent schema, but unlike them, code generation is optional with Avro. Since Avro stores the schema in the header of every file, the files are self-describing: you can read an Avro file later even with a language different from the one used to write it. Avro also provides better native support for MapReduce, because Avro data files are compressible and splittable. Another essential feature of Avro is its support for schema evolution: the schema used to read a file does not need to match the schema used to write it, which makes it possible to add new fields to a schema as requirements change.
Avro schemas are usually written in JSON, but they can also be written in Avro IDL, which is a C-like language. Along with metadata, the file header contains a unique sync marker. Following the header, an Avro file contains a series of blocks holding serialized Avro objects. These blocks can optionally be compressed, and within them, types are stored in their native format, giving an additional boost to compression. At the time of writing, Avro supports Snappy and Deflate compression.
Avro defines a small number of primitive types such as boolean, int, float, and string, and it also supports complex types such as array, map, and enum.
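To make schema evolution concrete, here is a minimal sketch that reads the hypothetical events.avro file from the earlier example with a newer reader schema. The added referrer field and its default value are assumptions for illustration; Avro fills in the default for records written before the field existed:

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import java.io.File;

    public class ReadWithNewSchema {
        // Reader schema adds a "referrer" field with a default value
        private static final String READER_SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"LogEvent\",\"fields\":["
          + "{\"name\":\"host\",\"type\":\"string\"},"
          + "{\"name\":\"status\",\"type\":\"int\"},"
          + "{\"name\":\"referrer\",\"type\":\"string\",\"default\":\"unknown\"}]}";

        public static void main(String[] args) throws Exception {
            Schema readerSchema = new Schema.Parser().parse(READER_SCHEMA_JSON);
            // The writer schema is taken from the file header at open time
            GenericDatumReader<GenericRecord> datumReader =
                new GenericDatumReader<>(null, readerSchema);
            try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(new File("events.avro"), datumReader)) {
                for (GenericRecord rec : reader) {
                    // Records written with the old schema report the default "unknown"
                    System.out.println(rec.get("host") + " " + rec.get("referrer"));
                }
            }
        }
    }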
Columnar Formats:-
Recently, many databases have introduced columnar storage that provides several benefits over row-oriented systems:
- It skips I/O and decompression on columns that are not in the query.
- It works well for queries that access only a small subset of columns. If your queries access many columns, a row-oriented format may serve you better.
- It is very efficient with compression on columns because entropy within a column is lower than entropy within a block of rows.
- It is well suited for data-warehousing-type applications where users want to aggregate specific columns over a collection of records.
Columnar file formats are also instrumental in Hadoop applications. Columnar formats supported in Hadoop include RCFile, Optimized Row Columnar (ORC), and Parquet.
RCFile:-
The RCFile format was created to provide efficient processing for MapReduce applications, although in practice it is seen almost exclusively as a Hive storage format. RCFile offers fast data loading, quick query processing, and highly efficient storage space utilization. It breaks files into row splits, then uses column-oriented storage within each split.
The RCFile format provides advantages in query and compression performance compared to SequenceFiles, but it also has drawbacks that prevent optimal query times and compression. Newer columnar formats such as ORC and Parquet address many of these drawbacks, and for most new applications they will likely replace the use of RCFile.
ORC:-
The ORC format addresses some of the shortcomings with the RCFile form, especially around query performance and storage efficiency. The ORC format comes with the following features:
- ORC offers lightweight, always-on compression provided by type-specific readers and writers.
- ORC facilitates the use of zlib, LZO, or Snappy for additional compression.
- It allows you to push down predicates to the storage layer so that only the necessary data is brought back in queries.
- It supports the Hive type model, including decimal and complex types.
- It is a storage format that you can split.
A drawback of ORC is that it was designed specifically for Hive, so it is not a general-purpose storage format that you can use with non-Hive MapReduce interfaces such as Pig or Java, or with other query engines such as Impala.
Parquet:-
Parquet shares similar design goals with ORC, but it aims to be a general-purpose storage format for Hadoop. The goal is a format suitable for different MapReduce interfaces such as Java, Hive, and Pig, and also for other processing engines such as Impala and Spark. Parquet offers the following advantages (a short write sketch follows this list):
- Parquet permits returning just the required data fields, reducing I/O and enhancing performance.
- It enables efficient compression that you can specify on a per-column level.
- It supports complex nested data structures.
- It stores full metadata at the end of the file, making Parquet files self-documenting.
- It uses efficient and extensible encoding schemas, for example bit-packing and run-length encoding.
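As a minimal sketch of writing Parquet from Java, the following uses the parquet-avro module so the hypothetical LogEvent schema from the Avro examples can be reused; the output path and Snappy codec choice are illustrative:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class ParquetWriteDemo {
        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"LogEvent\",\"fields\":["
              + "{\"name\":\"host\",\"type\":\"string\"},"
              + "{\"name\":\"status\",\"type\":\"int\"}]}");
            GenericRecord event = new GenericData.Record(schema);
            event.put("host", "example.com");
            event.put("status", 200);
            try (ParquetWriter<GenericRecord> writer =
                     AvroParquetWriter.<GenericRecord>builder(new Path("events.parquet"))
                         .withSchema(schema)
                         .withCompressionCodec(CompressionCodecName.SNAPPY)
                         .build()) {
                writer.write(event); // rows are buffered and laid out column by column on disk
            }
        }
    }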
This covers the major file formats for storing data in Hadoop.