Hadoop: Data Storage, Resource Management, Data Processing
A. Data Storage: HDFS
- Based on the Google File System (GFS)
- Immutability – WORM (Write Once, Read Many) filesystem
- A large file (e.g. 500MB) is split into multiple 128MB blocks (default block size)
- Blocks are distributed among nodes and replicated 3 times (default replication factor)
- Performs best with a modest number (millions rather than billions) of large files (100MB or more)
- Optimized for large streaming reads of files rather than random reads
Note: HDFS commands are very similar to Linux commands, but bear in mind that there is no current directory (no pwd/cd); relative paths are resolved against the user's HDFS home directory (/user/<username>)
Basic Commands
- To get help, run hdfs dfs or hadoop fs with no arguments (the two are used interchangeably)
- hdfs dfs -ls / (list the HDFS root directory)
- hdfs dfs -put test.txt test.txt (upload a local file into the home directory)
- hdfs dfs -put test /testing/ (upload the local directory test into /testing/)
- hdfs dfs -cat test/test1.txt | head -n 30 (print the first 30 lines)
- hdfs dfs -cat test/test1.txt | tail -n 30 (print the last 30 lines)
- hdfs dfs -get /testing/test/test1.txt test1.txt (download a file to the local filesystem)
- less test1.txt (press q to exit)
- hdfs dfs -rm test.txt (delete a file)
- hdfs dfs -rm -r /testing/ (delete a directory recursively)
B. Resource Manager: YARN
Access the YARN ResourceManager web UI at http://localhost:8088
Submit a Spark application to the YARN cluster:
spark2-submit $testing/yarn/wordcount.py /testing/test/*
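The wordcount.py script itself isn't reproduced in these notes; the following is only a minimal sketch of what such a PySpark script might look like (the logic here is an assumption, not the course's actual file):

# hypothetical stand-in for wordcount.py
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="WordCount")
    counts = (sc.textFile(",".join(sys.argv[1:]))    # accept one or more input paths
                .flatMap(lambda line: line.split())  # split lines into words
                .map(lambda word: (word, 1))         # pair each word with a count of 1
                .reduceByKey(lambda a, b: a + b))    # sum the counts per word
    for word, count in counts.take(10):
        print(word, count)
    sc.stop()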
C. Data Processing: Spark
Spark Streaming
- Kafka
- Kinesis/Firehose
- Flume
Spark SQL
- Hive
Spark GraphX
- HBase + Titan
Spark MLlib
Other relevant Tools
- Acquisition – Nutch (crawler), Solr (search), Gora (in-memory data model)
- Moving – Sqoop, Flume
- Scheduling – Oozie
- Storage – HBase, Hive
Structured, Unstructured, Semi-structured, Streaming -> Data Lake (S3) -> Sqoop/Flume/Kafka -> Hive/HBase/Spark Streaming/Spark SQL/Spark GraphX/Redshift -> Tableau, D3.js, MicroStrategy
Spark – Scala (Functional Programming) & Python (Object Oriented Programming)
Modes:
- Interactive (spark-shell for Scala, pyspark for Python)
- Batch (spark-submit)
Spark Sessions – the SparkSession is the entry point to DataFrame and SQL functionality
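The interactive shells create a SparkSession for you as the variable spark; in a standalone script you build it yourself. A minimal sketch (the app name is arbitrary):

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession
spark = SparkSession.builder.appName("notes-example").getOrCreate()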
DataFrames vs Datasets (the Dataset API is Scala/Java only; Python uses DataFrames)
.printSchema()
.show()
Transformations vs Actions – transformations are lazy (they only build up a query plan); actions trigger actual execution
Transformations
.select()
.where()
.orderBy()
.join()
.limit()
Actions
.count()
.first()
.take(n)
.show()
.collect()
.write.save() (df.write returns a DataFrameWriter; saving is what triggers execution)
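A quick illustration of the lazy/eager split, assuming a DataFrame df with hypothetical name and age columns:

# Transformations only build a query plan; nothing executes yet
adults = df.select("name", "age").where(df.age >= 18).orderBy("age").limit(10)
# Actions trigger execution
adults.count()            # number of rows
adults.show()             # print rows to the console
rows = adults.collect()   # bring all rows to the driver as a list of Row objects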
Try to read a JSON file into a Spark DataFrame and print the DataFrame schema
Show the data
Query the DataFrame (all three steps are sketched below)
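A sketch of those steps (the path and the name/age columns are hypothetical):

df = spark.read.json("/testing/people.json")   # schema is inferred from the JSON records
df.printSchema()                               # print the inferred schema
df.show(5)                                     # display the first 5 rows
df.where(df.age > 30).select("name").show()    # query: names of people over 30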
DataFrame Data Sources:
- Commonly used file formats: CSV, JSON, and Parquet (Parquet is Spark's default format)
- DataFrameReader / DataFrameWriter (spark.read / df.write)
- inferSchema – let Spark sample the data to guess column types (costs an extra pass over the data)
- Manually defined schema – declare a StructType yourself (sketched below)
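A minimal sketch of defining a schema by hand (path and columns are hypothetical); this avoids the extra pass that inferSchema requires:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Declare column names, types, and nullability up front
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df = spark.read.csv("/testing/people.csv", schema=schema, header=True)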
Read data from a Hive table:
Write data from the DataFrame to CSV
Check whether the CSV file has a header
Read the CSV file back with inferSchema and compare against the Hive schema (see the sketch below)
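A sketch of those four steps (the table name and paths are hypothetical):

hive_df = spark.read.table("default.people")            # read a Hive table into a DataFrame
hive_df.write.csv("/testing/people_csv", header=True)   # write to CSV with a header row
csv_df = spark.read.csv("/testing/people_csv", header=True, inferSchema=True)
hive_df.printSchema()   # Hive's declared column types...
csv_df.printSchema()    # ...versus what inferSchema guessed from the CSV text

Note that without inferSchema, every CSV column is read back as a string.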
To get help in pyspark, use Python's built-in help(), e.g. help(spark.read.csv)
Press "q" to quit the help pager
Use parquet-tools to view the schema of the saved file (parquet-tools schema <file>.parquet); running parquet-tools with no arguments prints its available commands
Read the Parquet file we just wrote and check the schema
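A sketch of the Parquet round trip (paths are hypothetical):

df = spark.read.csv("/testing/people_csv", header=True, inferSchema=True)
df.write.parquet("/testing/people_parquet")        # write the DataFrame as Parquet
pq = spark.read.parquet("/testing/people_parquet")
pq.printSchema()                                   # Parquet stores the schema with the data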
RDD (Resilient Distributed Dataset)
- Pair RDD (elements are key/value tuples; see the sketch after the code below)
- Double RDD (elements are numeric, enabling statistical actions such as mean() and stdev())
# Read the web log files into an RDD of lines
logsRdd = sc.textFile("/loudacre/weblogs")
# Keep only the lines containing ".jpg"
jpgLogsRdd = logsRdd.filter(lambda line: ".jpg" in line)
jpgLines = jpgLogsRdd.take(5)
for line in jpgLines: print(line)
# Map each line to its length
lineLengthsRdd = logsRdd.map(lambda line: len(line))
lineLengthsRdd.take(5)
# Split each line into fields on spaces
lineFieldRdd = logsRdd.map(lambda line: line.split(' '))
lineFields = lineFieldRdd.take(5)
for line in lineFields: print(line)
# Extract the first field (the IP address) from each line
ipRdd = logsRdd.map(lambda line: line.split(' ')[0])
for ip in ipRdd.take(5): print(ip)
# Save the IP list back to HDFS (one part file per partition)
ipRdd.saveAsTextFile("/loudacre/iplist")
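Pair RDDs (noted above) come in when aggregating by key; a small sketch building on ipRdd (assumes the code above has run):

# Count requests per IP using a pair RDD of (ip, 1) tuples
ipCountsRdd = ipRdd.map(lambda ip: (ip, 1)).reduceByKey(lambda a, b: a + b)
for ip, count in ipCountsRdd.take(5): print(ip, count)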
To exit the shell:
Scala > sys.exit
Python > exit() or Ctrl+D
Due to limited time, this will be continued in the future.