Hadoop: Data Storage, Resource Management, Data Processing
A. Data Storage: HDFS
- Based on the Google File System (GFS)
- Immutability – WORM (Write Once, Read Many) filesystem
- A large file (e.g. 500MB) is split into multiple 128MB blocks (default block size)
- Blocks are distributed among nodes and replicated 3 times (default replication factor)
- Performs best with a modest number (millions rather than billions) of large files (100MB or more)
- Optimized for large streaming reads of files rather than random reads
Note: HDFS commands are very similar to Linux commands, but bear in mind that there is no current directory (no pwd/cd); relative paths are resolved against the user's HDFS home directory (/user/<username>)
Basic Commands
- To get help, run hdfs dfs or hadoop fs with no arguments (the two are used interchangeably)
- hdfs dfs -ls / (list the HDFS root directory)
- hdfs dfs -put test.txt test.txt (upload a local file into the home directory)
- hdfs dfs -put test /testing/ (upload the local directory test into /testing/)
- hdfs dfs -cat test/test1.txt | head -n 30 (print the first 30 lines)
- hdfs dfs -cat test/test1.txt | tail -n 30 (print the last 30 lines)
- hdfs dfs -get /testing/test/test1.txt test1.txt (download a file to the local filesystem)
- less test1.txt (press q to exit)
- hdfs dfs -rm test.txt (delete a file)
- hdfs dfs -rm -r /testing/ (delete a directory recursively)
B. Resource Manager: YARN
Access the YARN ResourceManager web UI at http://localhost:8088
Submit a Spark application to the YARN cluster:
spark2-submit $testing/yarn/wordcount.py /testing/test/*
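The wordcount.py script itself isn't reproduced in these notes; the following is only a minimal sketch of what such a PySpark script might look like (the logic here is an assumption, not the course's actual file):

# hypothetical stand-in for wordcount.py
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="WordCount")
    counts = (sc.textFile(",".join(sys.argv[1:]))    # accept one or more input paths
                .flatMap(lambda line: line.split())  # split lines into words
                .map(lambda word: (word, 1))         # pair each word with a count of 1
                .reduceByKey(lambda a, b: a + b))    # sum the counts per word
    for word, count in counts.take(10):
        print(word, count)
    sc.stop()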
C. Data Processing: Spark
Spark Streaming
- Kafka
- Kinesis/Firehose
- Flume
Spark SQL
- Hive
Spark GraphX
- HBase + Titan
Spark MLlib
Other relevant Tools
- Acquisition – Nutch (crawler), Solr (search), Gora (in-memory data model)
- Moving – Sqoop, Flume
- Scheduling – Oozie
- Storage – HBase, Hive
Structured, Unstructured, Semi-structured, Streaming -> Data Lake (S3) -> Sqoop/Flume/Kafka -> Hive/HBase/Spark Streaming/Spark SQL/Spark GraphX/Redshift -> Tableau, D3.js, MicroStrategy
Spark – Scala (Functional Programming) & Python (Object Oriented Programming)
Modes:
- Interactive (spark-shell for Scala, pyspark for Python)
- Batch (spark-submit)
Spark Sessions – the SparkSession is the entry point to DataFrame and SQL functionality
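The interactive shells create a SparkSession for you as the variable spark; in a standalone script you build it yourself. A minimal sketch (the app name is arbitrary):

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession
spark = SparkSession.builder.appName("notes-example").getOrCreate()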
DataFrames vs Datasets (the Dataset API is Scala/Java only; Python uses DataFrames)
.printSchema()
.show()
Transformations vs Actions – transformations are lazy (they only build up a query plan); actions trigger actual execution
Transformations
.select()
.where()
.orderBy()
.join()
.limit()
Actions
.count()
.first()
.take(n)
.show()
.collect()
.write.save() (df.write returns a DataFrameWriter; saving is what triggers execution)
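A quick illustration of the lazy/eager split, assuming a DataFrame df with hypothetical name and age columns:

# Transformations only build a query plan; nothing executes yet
adults = df.select("name", "age").where(df.age >= 18).orderBy("age").limit(10)
# Actions trigger execution
adults.count()            # number of rows
adults.show()             # print rows to the console
rows = adults.collect()   # bring all rows to the driver as a list of Row objects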
Try to read a JSON file into a Spark DataFrame and print the DataFrame schema
Show the data
Query the DataFrame (all three steps are sketched below)
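A sketch of those steps (the path and the name/age columns are hypothetical):

df = spark.read.json("/testing/people.json")   # schema is inferred from the JSON records
df.printSchema()                               # print the inferred schema
df.show(5)                                     # display the first 5 rows
df.where(df.age > 30).select("name").show()    # query: names of people over 30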
DataFrame Data Sources:
- Commonly used file formats: CSV, JSON, and Parquet (Parquet is Spark's default format)
- DataFrameReader / DataFrameWriter (spark.read / df.write)
- inferSchema – let Spark sample the data to guess column types (costs an extra pass over the data)
- Manually defined schema – declare a StructType yourself (sketched below)
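A minimal sketch of defining a schema by hand (path and columns are hypothetical); this avoids the extra pass that inferSchema requires:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Declare column names, types, and nullability up front
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df = spark.read.csv("/testing/people.csv", schema=schema, header=True)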
Read data from a Hive table:
Write data from the DataFrame to CSV
Check whether the CSV file has a header
Read the CSV file back with inferSchema and compare against the Hive schema (see the sketch below)
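A sketch of those four steps (the table name and paths are hypothetical):

hive_df = spark.read.table("default.people")            # read a Hive table into a DataFrame
hive_df.write.csv("/testing/people_csv", header=True)   # write to CSV with a header row
csv_df = spark.read.csv("/testing/people_csv", header=True, inferSchema=True)
hive_df.printSchema()   # Hive's declared column types...
csv_df.printSchema()    # ...versus what inferSchema guessed from the CSV text

Note that without inferSchema, every CSV column is read back as a string.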
To get help in pyspark, use Python's built-in help(), e.g. help(spark.read.csv)
Press "q" to quit the help pager
Use parquet-tools to view the schema of the saved file (parquet-tools schema <file>.parquet); running parquet-tools with no arguments prints its available commands
Read the Parquet file we just wrote and check the schema
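A sketch of the Parquet round trip (paths are hypothetical):

df = spark.read.csv("/testing/people_csv", header=True, inferSchema=True)
df.write.parquet("/testing/people_parquet")        # write the DataFrame as Parquet
pq = spark.read.parquet("/testing/people_parquet")
pq.printSchema()                                   # Parquet stores the schema with the data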
RDD (Resilient Distributed Dataset)
- Pair RDD (elements are key/value tuples; see the sketch after the code below)
- Double RDD (elements are numeric, enabling statistical actions such as mean() and stdev())
# Read the web log files into an RDD of lines
logsRdd = sc.textFile("/loudacre/weblogs")
# Keep only the lines containing ".jpg"
jpgLogsRdd = logsRdd.filter(lambda line: ".jpg" in line)
jpgLines = jpgLogsRdd.take(5)
for line in jpgLines: print(line)
# Map each line to its length
lineLengthsRdd = logsRdd.map(lambda line: len(line))
lineLengthsRdd.take(5)
# Split each line into fields on spaces
lineFieldRdd = logsRdd.map(lambda line: line.split(' '))
lineFields = lineFieldRdd.take(5)
for line in lineFields: print(line)
# Extract the first field (the IP address) from each line
ipRdd = logsRdd.map(lambda line: line.split(' ')[0])
for ip in ipRdd.take(5): print(ip)
# Save the IP list back to HDFS (one part file per partition)
ipRdd.saveAsTextFile("/loudacre/iplist")
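Pair RDDs (noted above) come in when aggregating by key; a small sketch building on ipRdd (assumes the code above has run):

# Count requests per IP using a pair RDD of (ip, 1) tuples
ipCountsRdd = ipRdd.map(lambda ip: (ip, 1)).reduceByKey(lambda a, b: a + b)
for ip, count in ipCountsRdd.take(5): print(ip, count)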
To exit the shell:
Scala > sys.exit
Python > exit() or Ctrl+D
Due to limited time, this will be continued in the future.