Big Data Platforms
While you can download Hadoop, Spark, Pig, Hive, Solr, etc. separately and configure them to work with each other, you can instead download a ready-made Big Data platform. Here are some commonly used platforms:
Downloadable platforms
- Cloudera - one of the earliest Hadoop distributions, with many proprietary tools in the mix. Also provides a cloud version
- Hortonworks - Hortonworks started as a 100% open source company but later merged with Cloudera. The Hortonworks platform is still available as a separate download. Also provides a cloud version
- MapR - another popular stack that replaces HDFS with its own proprietary storage. Also provides a cloud version
Cloud-only platforms
- Google Dataproc - this stack consists of Hadoop with Spark, Hive, and Pig, and runs on GCP
- Amazon Elastic MapReduce (EMR) - an AWS solution for MapReduce in the cloud; one of the earliest managed Hadoop services
- Microsoft Azure HDInsight - Microsoft's cloud solution
A point to note here is that while all the above platforms use the open source Hadoop, Spark, Pig, Hive, etc., they may also mix in some proprietary components.
Big Data File Formats
Traditional file formats like CSV, XML, and JSON work well for small data (their biggest advantage is that they are human readable), but they are not optimal for Big Data. Several binary (not human-readable) file formats were invented to address this; here are some popular ones:
Apache Avro: This is a serialization framework developed for use with Hadoop that optimizes reading from and writing to file systems. The Avro format is compact, and the schema is stored separately from the payload. Today Avro is used not only with Hadoop but with other systems as well. This is a row-based storage format.
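The row-based idea can be illustrated with a small sketch. This is not Avro's actual binary encoding (real Avro uses a compact binary format and a JSON schema language); it is a plain-Python, JSON-backed illustration of the key point: the schema is written once, separate from the payload, and each record's values are stored together.

```python
import json

# Illustrative schema, kept separate from the payload (as Avro does):
# field names are written once, not repeated in every record.
schema = {"fields": [("id", "int"), ("name", "string")]}

records = [(1, "alice"), (2, "bob")]

# Row-based layout: each record's values are stored together.
payload = [list(row) for row in records]

blob = json.dumps({"schema": schema["fields"], "rows": payload})

# Reading back a full record is cheap: all of its fields are adjacent.
decoded = json.loads(blob)
first = dict(zip((name for name, _ in schema["fields"]), decoded["rows"][0]))
```

Because field names live only in the schema, the payload stays compact; compare this with JSON records, where every row repeats every key.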
Apache Parquet: This is a serialization framework that uses a columnar storage format, meaning values in a column are stored together. This format is most efficient when you need to query a subset of columns, in contrast to reading complete rows across all columns. Here too the schema is stored as metadata alongside the data.
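A minimal sketch of the columnar layout (pure Python, not Parquet's actual on-disk encoding) shows why querying a subset of columns is cheap: each column's values sit together, so a scan over one column never touches the others.

```python
# Rows as they arrive from an application.
rows = [
    {"id": 1, "name": "alice", "score": 90},
    {"id": 2, "name": "bob",   "score": 85},
    {"id": 3, "name": "carol", "score": 88},
]

# Columnar layout: all values of one column are stored together.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An analytical query over one column reads only that column's data;
# the "id" and "name" columns are never touched.
avg_score = sum(columns["score"]) / len(columns["score"])
```

Columnar grouping also compresses well, since values of the same type and similar range sit next to each other.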
Apache ORC: ORC stands for Optimized Row Columnar. It stores collections of rows in one file, and within each collection, called a stripe, the row data is stored in columnar format. ORC supports ACID transactions (when used with Hive) and is highly compression efficient.
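ORC's hybrid layout, rows grouped into stripes with columnar storage inside each stripe, can be sketched as follows. The stripe size here is a toy value for illustration; real ORC stripes default to tens of megabytes.

```python
# Toy stripe size (number of rows per stripe) for illustration only.
STRIPE_SIZE = 2

rows = [
    {"id": 1, "city": "NYC"},
    {"id": 2, "city": "LA"},
    {"id": 3, "city": "SF"},
]

# Group rows into stripes, then store each stripe column-wise,
# mirroring ORC's "row groups, columnar within each group" layout.
stripes = []
for start in range(0, len(rows), STRIPE_SIZE):
    chunk = rows[start:start + STRIPE_SIZE]
    stripes.append({key: [r[key] for r in chunk] for key in chunk[0]})
```

This hybrid gives a reader both options: skip whole stripes that can't match a query, and within a stripe, read only the columns it needs.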