Data Pipeline

Data Scientists and Analysts are focused on building models, but deploying those models to production typically requires the help of Data Engineers. Data in different formats may need to be pulled from multiple sources, and here again a Data Engineer will help set up the data ingestion mechanisms (for example, with a tool such as Apache Sqoop).

And finally, deploying to production oftentimes involves deploying a web application. This entire setup of data ingestion, model building, and deploying the model to production forms the data pipeline, and the architectural structure and components of the pipeline are built by Data Engineers.

Data Engineers typically have a strong understanding of software engineering principles and possess the architectural tenets necessary for building a data pipeline.

Once a pipeline is created, it helps with:

  • end-to-end transformation of raw data coming from various sources
  • storing the data appropriately in a variety of data stores: HDFS, cloud storage, SQL, NoSQL, etc.
  • processing streaming data (a short sketch follows this list)
  • building, testing, and deploying models on the pipeline using CI methods as necessary
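
As a minimal sketch of the streaming case, the snippet below uses Spark Structured Streaming to read newline-delimited events from a directory, parse them, and continuously write aggregates out. The path, the "user,action" line format, and the console sink are assumptions for illustration, not part of the original text.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streaming-pipeline-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical source: newline-delimited events arriving in a directory.
    val raw = spark.readStream
      .format("text")
      .load("/data/incoming/events") // assumed path

    // Transform: parse a simple "user,action" line format and count actions per user.
    val counts = raw
      .select(split($"value", ",").as("cols"))
      .select($"cols"(0).as("user"), $"cols"(1).as("action"))
      .groupBy($"user")
      .count()

    // Load: continuously write the aggregates to the console (swap for a real sink).
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```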

Data pipelines typically fall under one of the following paradigms (a Spark sketch of the classic ETL flow appears after the list):

  • Extract-Load
  • Extract-Load-Transform
  • Extract-Transform-Load
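
Here is a minimal sketch of the Extract-Transform-Load pattern in Spark, assuming a hypothetical CSV source and Parquet destination (the paths and column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object EtlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("etl-sketch").getOrCreate()

    // Extract: pull raw records from a source system (paths are assumptions).
    val raw = spark.read
      .option("header", "true")
      .csv("/data/raw/orders.csv")

    // Transform: clean and reshape before the data lands in the warehouse.
    val cleaned = raw
      .filter(col("order_id").isNotNull)
      .withColumn("amount", col("amount").cast("double"))

    // Load: write the transformed data to the target store.
    cleaned.write.mode("overwrite").parquet("/data/warehouse/orders")

    spark.stop()
  }
}
```

In the Extract-Load-Transform variant, the `cleaned` step would instead run after the raw data has already been loaded into the target store; Extract-Load defers transformation entirely to downstream consumers.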

Real World Spark Use Cases

There are many data pipelines set up using Spark that process trillions of records on a daily basis. Here are some examples.

NASA JPL

NASA's Jet Propulsion Laboratory receives 10+ TB of data daily from instrument and ground systems for Earth monitoring and runs many kinds of jobs, ranging from long-running to sub-second. JPL created the SciSpark library to make interactive computation and exploration possible for scientific processing. SciSpark provides support for scientific data formats and introduces a new type of RDD called a scientific RDD (sRDD).
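
SciSpark's own API is beyond the scope of this note, but the idea behind an sRDD can be sketched with plain Spark: load whole scientific files into an RDD and map each one through a decoder. The `parseNetcdf` helper, the `Grid` type, and the path below are hypothetical placeholders, not SciSpark calls.

```scala
import org.apache.spark.sql.SparkSession

object SciRddSketch {
  // Hypothetical stand-in for a scientific-format decoder (e.g., NetCDF).
  case class Grid(variable: String, values: Array[Double])
  def parseNetcdf(bytes: Array[Byte]): Grid =
    Grid("TotCldLiqH2O_A", Array.empty) // placeholder decode

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sRDD-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Read whole files as (path, bytes) pairs, then decode each into an
    // in-memory grid -- roughly the role an sRDD plays in SciSpark.
    val grids = sc.binaryFiles("/data/earth/*.nc") // assumed path
      .map { case (_, stream) => parseNetcdf(stream.toArray()) }

    println(s"loaded ${grids.count()} grids")
    spark.stop()
  }
}
```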

eBay

eBay uses Spark on clusters of close to 2,000 nodes, with 100 TB of RAM and 20,000 cores. eBay leverages Spark for interrogation of complex data, data modeling, and data scoring, among other things. eBay uses MLlib to cluster sellers together via k-means. By clustering sellers together, they're able to improve the user experience, helping users find products they may like and providing alternatives or recommendations. In addition, eBay uses SQL with Spark to increase the performance of their queries; they report that the queries run at least 5x faster than their Hive counterparts.
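
A minimal sketch of the k-means idea with Spark MLlib follows, assuming a hypothetical sellers table with a couple of numeric features; the column names, path, and k value are illustrative, not eBay's actual setup.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans

object SellerClusters {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("seller-kmeans").getOrCreate()

    // Hypothetical seller features: sales volume and average rating.
    val sellers = spark.read.parquet("/data/sellers") // assumed path

    // MLlib's estimators expect a single vector column of features.
    val assembled = new VectorAssembler()
      .setInputCols(Array("sales_volume", "avg_rating"))
      .setOutputCol("features")
      .transform(sellers)

    // Cluster sellers into k groups; k here is purely illustrative.
    val model = new KMeans().setK(8).setSeed(42L).fit(assembled)

    // Each seller gets a cluster id, usable for recommendations downstream.
    model.transform(assembled)
      .select("seller_id", "prediction")
      .show(10)

    spark.stop()
  }
}
```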

Conviva

Conviva provides monitoring and optimization for online video providers. Customers include ESPN, Yahoo, Microsoft, and Comcast, among many others. They use Spark to process 150 GB/week of compressed summary data and found Spark to be 30x faster than Hive: processing time for their weekly Geo Report went from 24 hours to 45 minutes. The biggest speedup came from reducing disk reads and storing only relevant data in memory. As of 2012, 30% of their reports use Spark.
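
The "store only relevant data in memory" tactic maps directly onto Spark's column pruning plus caching. A minimal sketch, with a hypothetical summary dataset and report columns standing in for Conviva's actual schema:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object GeoReportSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("geo-report").getOrCreate()

    // Read the compressed weekly summaries (path and columns are assumptions).
    val summaries = spark.read.parquet("/data/summaries/week")

    // Keep only the columns the report needs, then pin them in memory so
    // repeated aggregations avoid re-reading from disk.
    val relevant = summaries
      .select("geo", "session_count", "buffering_ratio")
      .cache()

    // Multiple report queries reuse the cached, pruned data.
    relevant.groupBy("geo").agg(sum("session_count").as("sessions")).show()
    relevant.groupBy("geo").agg(avg("buffering_ratio").as("avg_buffering")).show()

    spark.stop()
  }
}
```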

Yahoo!

Yahoo has a cluster with over 35,000 servers and 150 PB of data spanning 800 million users. Yahoo needs a way to quickly learn about users and serve a personalized homepage to improve the user experience. Yahoo's data scientists leveraged Spark to create models that find which news stories would appeal to each user. These models need to run fast, really fast. With Spark, they were able to create models in under an hour, which greatly enhanced Yahoo's ability to provide personalized news stories to users.
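
As a hedged sketch of the kind of model involved, the snippet below trains a logistic regression on hypothetical (user, story) click data with MLlib. The feature columns, label, and path are assumptions for illustration; the source does not describe Yahoo's actual features or algorithm.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.LogisticRegression

object StoryRelevanceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("story-relevance").getOrCreate()

    // Hypothetical training data: one row per (user, story) impression,
    // with "clicked" as a 0.0/1.0 label column.
    val impressions = spark.read.parquet("/data/impressions")

    val features = new VectorAssembler()
      .setInputCols(Array("user_age", "story_topic_score", "recency_hours"))
      .setOutputCol("features")
      .transform(impressions)

    // A simple click-through model; training a model like this on a cluster
    // is the "under an hour" workload described above.
    val model = new LogisticRegression()
      .setLabelCol("clicked")
      .setFeaturesCol("features")
      .fit(features)

    // Score candidate stories; higher probability => more likely to appeal.
    model.transform(features).select("user_id", "story_id", "probability").show(5)

    spark.stop()
  }
}
```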

Facebook

Facebook has published a detailed write-up on running Apache Spark at scale for a 60 TB+ production use case: https://engineering.fb.com/2016/08/31/core-data/apache-spark-scale-a-60-tb-production-use-case/
