Big Data Platforms
While you can download Hadoop, Spark, Pig, Hive, Solr, etc. separately and configure them to work with each other, you can instead download a ready-made Big Data platform. Here are some commonly used platforms:
Downloadable platforms
- Cloudera - one of the earliest Hadoop distributions, with many proprietary tools in the mix. Also provides a cloud version
- Hortonworks - Hortonworks started as a 100% open source company but later merged with Cloudera. The Hortonworks platform is still available as a separate download. Also provides a cloud version
- MapR - another popular stack that replaces HDFS with its own proprietary storage. Also provides a cloud version
Cloud-only platforms
- Google Dataproc - this stack consists of Hadoop with Spark, Hive, and Pig, and runs on GCP
- Amazon Elastic MapReduce (EMR) - an AWS solution for MapReduce in the cloud; one of the earliest managed Hadoop services
- Microsoft Azure HDInsight - Microsoft's cloud solution
A point to note here is that while all the above platforms use the open source Hadoop, Spark, Pig, Hive, etc., they may also mix in some proprietary components.
Big Data File Formats
Traditional file formats like CSV, XML, and JSON work well for small data (their biggest advantage is that they are human readable), but they are not optimal for Big Data. Several binary (not human-readable) file formats were invented to address this; here are some popular ones:
Apache Avro: This is a serialization framework developed for use with Hadoop that optimizes reading from and writing to file systems. The Avro format is compact, and the schema is stored separately from the payload. Today Avro is used not only with Hadoop but with other systems as well. This is a row-based storage format.
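The row-based idea can be illustrated with a small sketch. This is not Avro's actual binary encoding (real Avro uses a compact binary format and a JSON schema language); it is a plain-Python, JSON-backed illustration of the key point: the schema is written once, separate from the payload, and each record's values are stored together.

```python
import json

# Illustrative schema, kept separate from the payload (as Avro does):
# field names are written once, not repeated in every record.
schema = {"fields": [("id", "int"), ("name", "string")]}

records = [(1, "alice"), (2, "bob")]

# Row-based layout: each record's values are stored together.
payload = [list(row) for row in records]

blob = json.dumps({"schema": schema["fields"], "rows": payload})

# Reading back a full record is cheap: all of its fields are adjacent.
decoded = json.loads(blob)
first = dict(zip((name for name, _ in schema["fields"]), decoded["rows"][0]))
```

Because field names live only in the schema, the payload stays compact; compare this with JSON records, where every row repeats every key.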
Apache Parquet: This is a serialization framework that uses a columnar storage format, meaning values in a column are stored together. This format is most efficient when you need to query a subset of columns, in contrast to reading complete rows across all columns. Here too the schema is stored as metadata alongside the data.
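A minimal sketch of the columnar layout (pure Python, not Parquet's actual on-disk encoding) shows why querying a subset of columns is cheap: each column's values sit together, so a scan over one column never touches the others.

```python
# Rows as they arrive from an application.
rows = [
    {"id": 1, "name": "alice", "score": 90},
    {"id": 2, "name": "bob",   "score": 85},
    {"id": 3, "name": "carol", "score": 88},
]

# Columnar layout: all values of one column are stored together.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An analytical query over one column reads only that column's data;
# the "id" and "name" columns are never touched.
avg_score = sum(columns["score"]) / len(columns["score"])
```

Columnar grouping also compresses well, since values of the same type and similar range sit next to each other.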
Apache ORC: ORC stands for Optimized Row Columnar. It stores collections of rows in one file, and within each collection, called a stripe, the row data is stored in columnar format. ORC supports ACID transactions (when used with Hive) and is highly compression efficient.
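ORC's hybrid layout, rows grouped into stripes with columnar storage inside each stripe, can be sketched as follows. The stripe size here is a toy value for illustration; real ORC stripes default to tens of megabytes.

```python
# Toy stripe size (number of rows per stripe) for illustration only.
STRIPE_SIZE = 2

rows = [
    {"id": 1, "city": "NYC"},
    {"id": 2, "city": "LA"},
    {"id": 3, "city": "SF"},
]

# Group rows into stripes, then store each stripe column-wise,
# mirroring ORC's "row groups, columnar within each group" layout.
stripes = []
for start in range(0, len(rows), STRIPE_SIZE):
    chunk = rows[start:start + STRIPE_SIZE]
    stripes.append({key: [r[key] for r in chunk] for key in chunk[0]})
```

This hybrid gives a reader both options: skip whole stripes that can't match a query, and within a stripe, read only the columns it needs.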