Dataproc

Google provides Dataproc, a managed big data platform that bundles Hadoop, Spark, Pig, and Hive. You can spin up a Dataproc cluster with as many nodes as you want on Google's infrastructure.

Steps to create a Dataproc Cluster

  • Go to https://console.cloud.google.com and select Dataproc from the left navigation menu. You can also find it by typing ‘Dataproc’ into the search bar.

  • On the cluster creation page, choose your resources - the number of nodes, CPUs, memory, etc. - from the appropriate dropdowns.
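
If you prefer the command line, the same cluster can be created with the gcloud CLI. The cluster name, region, and machine types below are placeholder choices, not recommendations - adjust them to your workload:

    # Create a cluster with 2 workers (name, region, and sizes are illustrative)
    gcloud dataproc clusters create my-cluster \
        --region us-central1 \
        --num-workers 2 \
        --master-machine-type n1-standard-4 \
        --worker-machine-type n1-standard-4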

Pricing

The final cost of your Dataproc cluster combines the costs of the different types of resources you consume. These pages help you estimate it: https://cloud.google.com/dataproc/pricing and https://cloud.google.com/products/calculator/
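
As a rough worked example - the rates on the pricing page above are authoritative and change over time, so the $0.010 per vCPU-hour used here is only an assumed illustration - the Dataproc fee is charged per vCPU in the cluster, on top of the normal Compute Engine cost of the VMs:

    1 master + 2 workers, each n1-standard-4 (4 vCPUs) = 12 vCPUs
    12 vCPUs x $0.010 per vCPU-hour x 24 hours = $2.88 per day (Dataproc fee only)
    + the Compute Engine charges for the VMs, disks, and any network egress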

The default settings are a reasonable starting point; add resources only if the data you are processing is large enough to exhaust them.

Commands you can use

  • pyspark - opens an interactive Python shell for Spark -- use exit() to get out of pyspark
  • spark-submit filename.py - to submit a batch job (see the sketch after this list)
  • hdfs - to use the Hadoop Distributed File System
  • hive - to start a Hive session -- use quit; to get out of hive
  • spark-shell - to start an interactive Spark shell in Scala -- use :quit (or System.exit(0)) to get out
  • pig - to start the interactive Pig (Grunt) shell -- use quit; to get out of pig
  • yarn - to submit and manage Hadoop jobs on YARN
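
As a minimal sketch of the spark-submit workflow mentioned above, the snippet below writes a tiny PySpark word count to wordcount.py and submits it as a batch job. The filename, app name, and HDFS paths are all placeholders:

    # Write a minimal PySpark word-count script (illustrative only)
    cat > wordcount.py <<'EOF'
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
    lines = spark.sparkContext.textFile("/text.txt")   # default FS on Dataproc is HDFS
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("/tmp/pyspark-output")       # fails if the directory already exists
    spark.stop()
    EOF

    # Submit it as a batch job
    spark-submit wordcount.py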

Running the sample wordcount program

  • SSH into the Dataproc master node and upload a file containing the text you want to count - e.g., text.txt - into your home directory.
  • Copy the file into HDFS: hdfs dfs -put text.txt / -- this places the uploaded file in the HDFS root (/) directory.
  • Run the wordcount example: yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /text.txt /tmp/output

  • The MapReduce job runs and writes its output to the /tmp/output directory in HDFS.

  • Copy the HDFS output to the local filesystem: hdfs dfs -get /tmp/output -- this fetches the whole output directory.
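
To sanity-check the job before copying anything down, you can list and peek at the output directly in HDFS. The part file name below (part-r-00000) is the usual MapReduce naming convention, but confirm it with the ls first:

    hdfs dfs -ls /tmp/output                        # expect a _SUCCESS marker plus part-r-* files
    hdfs dfs -cat /tmp/output/part-r-00000 | head   # first few "word <tab> count" lines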

References:

Dataproc FAQ: https://cloud.google.com/dataproc/docs/resources/faq
