Dataproc
Google Cloud provides Dataproc, a managed big data platform that bundles Hadoop, Spark, Pig, and Hive. You can spin up a Dataproc cluster with as many nodes as you want on Google's infrastructure.
Steps to create a Dataproc Cluster
Go to https://console.cloud.google.com and select Dataproc from the left navigation bar. You can also find it by typing 'Dataproc' into the search bar.
On the cluster creation page, choose your resources (number of nodes, CPUs, memory, etc.) from the appropriate dropdowns.
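If you prefer the command line, you can create an equivalent cluster with the gcloud CLI. A minimal sketch is below; the cluster name, region, worker count, and machine types are placeholders to adjust for your project:

  gcloud dataproc clusters create my-cluster \
      --region=us-central1 \
      --num-workers=2 \
      --master-machine-type=n1-standard-4 \
      --worker-machine-type=n1-standard-4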
Pricing
The final cost of your Dataproc cluster includes the cost of the different types of resources that you consume. Here are some resources to help you understand the cost:
- https://cloud.google.com/dataproc/pricing
- https://cloud.google.com/products/calculator/
You can go with the default settings, or add resources if the data you are processing is large enough to need them.
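To sketch how the charge adds up (the rate and machine types here are assumptions for illustration; check the pricing page for current numbers): Dataproc charges a per-vCPU fee on top of the regular Compute Engine cost of the VMs. A cluster with 1 master and 2 workers, each an n1-standard-4 with 4 vCPUs, has 12 vCPUs in total. At a Dataproc rate of $0.010 per vCPU per hour, that works out to 12 × $0.010 = $0.12 per hour in Dataproc fees, plus the cost of the three VMs themselves.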
Commands you can use
- pyspark - opens a Python shell for Spark; use exit() to get out of pyspark
- spark-submit filename.py - submits a batch job (see the sample script after this list)
- hdfs - to interact with the Hadoop Distributed File System (HDFS)
- hive - starts a Hive session; use quit; to get out of Hive
- spark-shell - starts an interactive Spark shell in Scala; use :quit (or System.exit(0)) to get out
- pig - starts the interactive Pig interface; use quit; to get out of Pig
- yarn - runs Hadoop jobs
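As an illustration of submitting a batch job, here is a minimal PySpark word count script; the filename wordcount.py and the input path are assumptions for this sketch:

  import sys
  from pyspark.sql import SparkSession

  if __name__ == "__main__":
      if len(sys.argv) != 2:
          sys.exit("Usage: spark-submit wordcount.py <hdfs-input-path>")

      spark = SparkSession.builder.appName("wordcount").getOrCreate()

      # Split each line of the input into words and count each word
      lines = spark.sparkContext.textFile(sys.argv[1])
      counts = (lines.flatMap(lambda line: line.split())
                     .map(lambda word: (word, 1))
                     .reduceByKey(lambda a, b: a + b))

      # Print every (word, count) pair; fine for small inputs
      for word, count in counts.collect():
          print(word, count)

      spark.stop()

Submit it with spark-submit wordcount.py /text.txt once the input file is in HDFS.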
Running the sample wordcount program
- SSH into the Dataproc master node and upload a file, e.g., text.txt, into the home directory, containing the text for which you want to find the word count.
- Then copy the file into HDFS with the command below:
  hdfs dfs -put text.txt /
  This copies the newly created file into the HDFS root (/) directory.
- Run the command below to run the wordcount example:
  yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /text.txt /tmp/output
  The MapReduce job runs and writes its output to the /tmp/output folder in HDFS.
- Copy the HDFS output back to the local filesystem with the command below (this fetches the whole output directory, including the part-* result files):
  hdfs dfs -get /tmp/output
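Putting the whole walkthrough together, the session looks roughly like this (the cluster name and zone in the SSH command are placeholders; on Dataproc the master node is named <cluster-name>-m):

  gcloud compute ssh my-cluster-m --zone=us-central1-a   # SSH into the master node
  hdfs dfs -put text.txt /                               # copy the local file into HDFS
  yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /text.txt /tmp/output
  hdfs dfs -cat /tmp/output/part-r-00000                 # inspect the result in HDFS
  hdfs dfs -get /tmp/output                              # copy the output directory locally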
References:
Dataproc FAQ: https://cloud.google.com/dataproc/docs/resources/faq