Hadoop – Spark Training

Hadoop: Data Storage, Resource Management, Data Processing

A. Data Storage: HDFS

  • Based on the Google File System (GFS)
  • Immutability – WORM (Write Once Read Many) filesystem
  • Large files are split into 128MB blocks by default (e.g. a 500MB file becomes three 128MB blocks plus one 116MB block)
  • Blocks are distributed among nodes and replicated 3 times (default replication factor)
  • Performs best with a modest number (millions rather than billions) of large files (100MB or more)
  • Optimized for large streaming reads of files rather than random reads
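To actually see the blocks and replicas behind a file, HDFS ships an fsck utility (the path below is just an example):

hdfs fsck /testing/test/test1.txt -files -blocks -locations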

Note: HDFS commands are very similar to Linux commands, but bear in mind that there is no current working directory (no pwd or cd); relative paths are resolved against your HDFS home directory, /user/<username>

Basic Command

  1. To get help, run hdfs dfs or hadoop fs with no arguments (the two are used interchangeably)
  2. hdfs dfs -ls / – list the HDFS root directory
  3. hdfs dfs -put test.txt test.txt – copy a local file into your HDFS home directory
  4. hdfs dfs -put test /testing/ – copy the local directory test into /testing/
  5. hdfs dfs -cat test/test1.txt | head -n 30 – print the first 30 lines of a file
  6. hdfs dfs -cat test/test1.txt | tail -n 30 – print the last 30 lines of a file
  7. hdfs dfs -get /testing/test/test1.txt test1.txt – copy a file from HDFS to the local disk
  8. less test1.txt – view the downloaded file (press q to exit)
  9. hdfs dfs -rm test.txt – remove a file
  10. hdfs dfs -rm -r /testing/ – remove a directory recursively

B. Resource Manager: YARN

The YARN ResourceManager web UI is available at localhost:8088.

Submit a Spark application to the YARN cluster:

spark2-submit $testing/yarn/wordcount.py /testing/test/*
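The wordcount.py script itself is not reproduced in these notes; a minimal sketch of what such a script might contain (only the script name and the input-path argument come from the command above, the rest is an assumption):

from pyspark.sql import SparkSession
import sys

# Minimal word-count sketch: read the input path given on the command line,
# split lines into words, and count occurrences per word.
spark = SparkSession.builder.appName("WordCount").getOrCreate()

counts = (spark.sparkContext.textFile(sys.argv[1])
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

for word, count in counts.take(10): print(word, count)

spark.stop()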

C. Data Processing: Spark

Spark Streaming

  • Kafka
  • Kinesis/Firehose
  • Flume
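None of these source integrations are shown here; as a rough sketch of the DStream API only, the example below uses a plain socket source (Kafka/Kinesis/Flume need extra connector libraries, so a socket stands in for them):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Count words arriving on a local socket in 10-second batches.
sc = SparkContext(appName="StreamingWordCount")
ssc = StreamingContext(sc, 10)

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()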

Spark SQL

  • Hive

Spark GraphX

  • HBase + Titan

Spark MLlib

Other relevant Tools

  • Acquisition – Nutch (crawler), Solr (search), Gora (in-memory data model)
  • Moving – Sqoop, Flume
  • Scheduling – Oozie
  • Storage – HBase, Hive

Structured, Unstructured, Semi-structured, Streaming -> Data Lake (S3) -> Sqoop/Flume/Kafka -> Hive/HBase/Spark Streaming/Spark SQL/Spark GraphX/Redshift -> Tableau, D3.js, MicroStrategy

Spark – Scala (Functional Programming) & Python (Object Oriented Programming)

Mode:

  • Interactive (spark-shell for Scala, pyspark for Python)
  • spark-submit (run a packaged application on the cluster)

Spark Sessions
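In the interactive shells a SparkSession is pre-created as the variable spark; in a spark-submit application you build one yourself (a minimal sketch, the application name is arbitrary):

from pyspark.sql import SparkSession

# enableHiveSupport() is needed later for reading Hive tables.
spark = SparkSession.builder.appName("Training").enableHiveSupport().getOrCreate()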

DataFrames vs Datasets (Datasets are Scala/Java only; the Python API uses DataFrames)

.printSchema() – print the DataFrame schema as a tree

.show() – display the first rows in tabular form

Transformations vs Actions

Transformations

.select()

.where()

.orderBy()

.join()

.limit()

Actions

.count()

.first()

.take(n)

.show()

.collect()

.write()
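A small sketch tying the two lists together (the data and column names are made up so the example is self-contained): transformations only build an execution plan; nothing runs until an action is called.

# Hypothetical in-memory data
accountsDF = spark.createDataFrame(
    [("Alice", 34), ("Bob", 17), ("Carol", 25)], ["first_name", "age"])

# Transformations are lazy: each returns a new DataFrame, nothing executes yet
adults = (accountsDF.select("first_name", "age")
                    .where(accountsDF.age >= 18)
                    .orderBy("age")
                    .limit(10))

# Actions trigger execution of the accumulated plan
adults.count()           # number of rows
adults.first()           # first Row
adults.show()            # print the first rows
rows = adults.collect()  # bring every row back to the driver (careful with large data)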

Try to read a JSON file into a Spark DataFrame and print the DataFrame schema

Show the data

Query the DataFrame
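A sketch of those three steps in pyspark (the file name is hypothetical; point it at any JSON file in HDFS):

# Read a JSON file into a DataFrame and print the inferred schema
peopleDF = spark.read.json("/testing/people.json")
peopleDF.printSchema()

# Show the data
peopleDF.show(5)

# Query the DataFrame
peopleDF.select("name").where(peopleDF.age > 30).show()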

DataFrame Data Sources:

Most commonly used file formats for DataFrame data

DataFrameReader / DataFrameWriter

InferSchema

Manually defined schema
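A sketch of both approaches with the DataFrameReader (the file path and column names are assumptions):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Option 1: let Spark infer the column types (costs an extra pass over the data)
df1 = spark.read.csv("/testing/accounts.csv", header=True, inferSchema=True)

# Option 2: define the schema manually (faster, and the types are guaranteed)
schema = StructType([
    StructField("acct_num", IntegerType()),
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
])
df2 = spark.read.csv("/testing/accounts.csv", header=True, schema=schema)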

Read data from a Hive table:

Write data from a DataFrame to CSV

Check whether the CSV file was written with a header

Read the CSV file with inferSchema and compare against the Hive schema
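A sketch of that round trip (the table and path names are assumptions; reading Hive tables requires a session with Hive support enabled):

# Read a Hive table into a DataFrame and inspect its schema
hiveDF = spark.read.table("default.accounts")
hiveDF.printSchema()

# Write the DataFrame out as CSV, including a header row
hiveDF.write.csv("/testing/accounts_csv", header=True)

# Check the header from the shell: hdfs dfs -cat /testing/accounts_csv/part-* | head -n 1

# Read the CSV back with inferSchema and compare against the Hive schema
csvDF = spark.read.csv("/testing/accounts_csv", header=True, inferSchema=True)
csvDF.printSchema()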

To get help in the pyspark shell, use Python's built-in help(), e.g. help(spark.read.csv); press "q" to quit the help pager

Use parquet-tools to view the schema of the saved file, and to get help in parquet-tools:

Read the Parquet file we just wrote and check the schema:
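A sketch of these steps (the paths and part-file name are placeholders; csvDF comes from the Hive sketch above):

# Save the DataFrame as Parquet first
csvDF.write.parquet("/testing/accounts_parquet")

Then from the shell (running parquet-tools with no arguments should print its usage/help):

parquet-tools schema /testing/accounts_parquet/part-00000.parquet

And read it back in pyspark to check the schema:

parquetDF = spark.read.parquet("/testing/accounts_parquet")
parquetDF.printSchema()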

RDD

  • Pair RDD – an RDD of key/value tuples (see the sketch at the end of this section)
  • Double RDD – an RDD of numeric values with statistical operations (mean, stdev, …)

logsRdd = sc.textFile("/loudacre/weblogs")

# Keep only the lines that reference .jpg files
jpgLogsRdd = logsRdd.filter(lambda line: ".jpg" in line)

jpgLines = jpgLogsRdd.take(5)

for line in jpgLines: print(line)

# Map each line to its length
lineLengthsRdd = logsRdd.map(lambda line: len(line))

lineLengthsRdd.take(5)

# Split each line into its space-separated fields
lineFieldRdd = logsRdd.map(lambda line: line.split(' '))

lineFields = lineFieldRdd.take(5)

for line in lineFields: print(line)

# The first field of each line is the client IP address
ipRdd = logsRdd.map(lambda line: line.split(' ')[0])

for ip in ipRdd.take(5): print(ip)

ipRdd.saveAsTextFile("/loudacre/iplist")
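As a sketch of the Pair RDD concept from the list above (key/value tuples unlock the byKey aggregations), counting requests per IP in the same weblogs:

# Pair RDD: key each line by its IP field, then aggregate counts by key
ipCountRdd = logsRdd.map(lambda line: (line.split(' ')[0], 1)).reduceByKey(lambda a, b: a + b)

for ip, count in ipCountRdd.take(5): print(ip, count)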

To exit the interactive shell:

Scala > sys.exit

Python > exit() or Ctrl + D

Due to limited time, this will be continued in a future post.
