Dataproc
Google Cloud provides Dataproc, a managed big data platform that bundles Hadoop, Spark, Pig, and Hive. You can spin up a Dataproc cluster with as many nodes as you want on Google's infrastructure.
Steps to create a Dataproc Cluster
Go to https://console.cloud.google.com and select Dataproc from the left navigation bar. You can also find it by typing 'Dataproc' into the search bar.
On the cluster creation page, choose your resources (number of nodes, CPUs, memory, etc.) from the appropriate dropdowns.
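If you prefer the command line, you can create an equivalent cluster with the gcloud CLI. A minimal sketch is below; the cluster name, region, worker count, and machine types are placeholders to adjust for your project:

  gcloud dataproc clusters create my-cluster \
      --region=us-central1 \
      --num-workers=2 \
      --master-machine-type=n1-standard-4 \
      --worker-machine-type=n1-standard-4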
Pricing
The final cost of your Dataproc cluster includes the cost of the different types of resources that you consume. Here are some resources to help you understand the cost:
- https://cloud.google.com/dataproc/pricing
- https://cloud.google.com/products/calculator/
You can go with the default settings, or add resources if the data you are processing is large enough to need them.
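To sketch how the charge adds up (the rate and machine types here are assumptions for illustration; check the pricing page for current numbers): Dataproc charges a per-vCPU fee on top of the regular Compute Engine cost of the VMs. A cluster with 1 master and 2 workers, each an n1-standard-4 with 4 vCPUs, has 12 vCPUs in total. At a Dataproc rate of $0.010 per vCPU per hour, that works out to 12 × $0.010 = $0.12 per hour in Dataproc fees, plus the cost of the three VMs themselves.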
Commands you can use
- pyspark - opens a Python shell for Spark; use exit() to get out of pyspark
- spark-submit filename.py - submits a batch job (see the sample script after this list)
- hdfs - to interact with the Hadoop Distributed File System (HDFS)
- hive - starts a Hive session; use quit; to get out of Hive
- spark-shell - starts an interactive Spark shell in Scala; use :quit (or System.exit(0)) to get out
- pig - starts the interactive Pig interface; use quit; to get out of Pig
- yarn - runs Hadoop jobs
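As an illustration of submitting a batch job, here is a minimal PySpark word count script; the filename wordcount.py and the input path are assumptions for this sketch:

  import sys
  from pyspark.sql import SparkSession

  if __name__ == "__main__":
      if len(sys.argv) != 2:
          sys.exit("Usage: spark-submit wordcount.py <hdfs-input-path>")

      spark = SparkSession.builder.appName("wordcount").getOrCreate()

      # Split each line of the input into words and count each word
      lines = spark.sparkContext.textFile(sys.argv[1])
      counts = (lines.flatMap(lambda line: line.split())
                     .map(lambda word: (word, 1))
                     .reduceByKey(lambda a, b: a + b))

      # Print every (word, count) pair; fine for small inputs
      for word, count in counts.collect():
          print(word, count)

      spark.stop()

Submit it with spark-submit wordcount.py /text.txt once the input file is in HDFS.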
Running the sample wordcount program
- SSH into the Dataproc master node and upload a file, e.g., text.txt, into the home directory, containing the text for which you want to find the word count.
- Then copy the file into HDFS with the command below:
  hdfs dfs -put text.txt /
  This copies the newly created file into the HDFS root (/) directory.
- Run the command below to run the wordcount example:
  yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /text.txt /tmp/output
  The MapReduce job runs and writes its output to the /tmp/output folder in HDFS.
- Copy the HDFS output back to the local filesystem with the command below (this fetches the whole output directory, including the part-* result files):
  hdfs dfs -get /tmp/output
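Putting the whole walkthrough together, the session looks roughly like this (the cluster name and zone in the SSH command are placeholders; on Dataproc the master node is named <cluster-name>-m):

  gcloud compute ssh my-cluster-m --zone=us-central1-a   # SSH into the master node
  hdfs dfs -put text.txt /                               # copy the local file into HDFS
  yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /text.txt /tmp/output
  hdfs dfs -cat /tmp/output/part-r-00000                 # inspect the result in HDFS
  hdfs dfs -get /tmp/output                              # copy the output directory locally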
References:
Dataproc FAQ: https://cloud.google.com/dataproc/docs/resources/faq