Set up AWS EMR with Hadoop and Spark
Create a new EMR cluster with Hadoop and Spark selected as applications.
Once the cluster is provisioned, you should see the instances running in the EC2 dashboard.
Select the master instance, click the "Actions" button and choose "Connect"; all the SSH connection details will be shown.
Alternatively, the SSH connection details are also available from the EMR dashboard.
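If you prefer to script the cluster creation instead of clicking through the console, the following is a minimal sketch using boto3 (assumed installed and configured with AWS credentials); the cluster name, key pair, region and instance settings are placeholders, not values from this walkthrough.
import boto3
# create an EMR client (region is an example)
emr = boto3.client("emr", region_name="us-east-1")
# launch a small cluster with Hadoop and Spark installed
response = emr.run_job_flow(
    Name="demo-cluster",                      # placeholder name
    ReleaseLabel="emr-4.7.0",                 # any release that bundles Hadoop and Spark
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "Ec2KeyName": "my-keypair",           # key pair you will later use for SSH
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])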
We will use the MovieLens 100k dataset for demonstration purposes.
#download the MovieLens 100k data from the URL below
[hadoop@ip-xx ~]$ sudo wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
#unzip the file
[hadoop@ip-xx ~]$ unzip ml-100k.zip
#change directory
[hadoop@ip-xx ~]$ cd ml-100k
#check the first 5 rows of a particular file
[hadoop@ip-1xx ~]$ head -5 u.user
#copy the files into HDFS
[hadoop@ip-xx ~]$ hadoop fs -put /home/hadoop/ml-100k /user/hadoop/
#check that all the files were copied correctly
[hadoop@ip-1xx ~]$ hadoop fs -ls /user/hadoop/ml-100k
Found 23 items
-rw-r--r-- 1 hadoop hadoop 6750 2016-05-28 14:33 /user/hadoop/ml-100k/README
-rw-r--r-- 1 hadoop hadoop 716 2016-05-28 14:33 /user/hadoop/ml-100k/allbut.pl
-rw-r--r-- 1 hadoop hadoop 643 2016-05-28 14:33 /user/hadoop/ml-100k/mku.sh
-rw-r--r-- 1 hadoop hadoop 1979173 2016-05-28 14:33 /user/hadoop/ml-100k/u.data
-rw-r--r-- 1 hadoop hadoop 202 2016-05-28 14:33 /user/hadoop/ml-100k/u.genre
-rw-r--r-- 1 hadoop hadoop 36 2016-05-28 14:33 /user/hadoop/ml-100k/u.info
-rw-r--r-- 1 hadoop hadoop 236344 2016-05-28 14:33 /user/hadoop/ml-100k/u.item
-rw-r--r-- 1 hadoop hadoop 193 2016-05-28 14:33 /user/hadoop/ml-100k/u.occupation
-rw-r--r-- 1 hadoop hadoop 22628 2016-05-28 14:33 /user/hadoop/ml-100k/u.user
-rw-r--r-- 1 hadoop hadoop 1586544 2016-05-28 14:33 /user/hadoop/ml-100k/u1.base
-rw-r--r-- 1 hadoop hadoop 392629 2016-05-28 14:33 /user/hadoop/ml-100k/u1.test
-rw-r--r-- 1 hadoop hadoop 1583948 2016-05-28 14:33 /user/hadoop/ml-100k/u2.base
-rw-r--r-- 1 hadoop hadoop 395225 2016-05-28 14:33 /user/hadoop/ml-100k/u2.test
-rw-r--r-- 1 hadoop hadoop 1582546 2016-05-28 14:33 /user/hadoop/ml-100k/u3.base
-rw-r--r-- 1 hadoop hadoop 396627 2016-05-28 14:33 /user/hadoop/ml-100k/u3.test
-rw-r--r-- 1 hadoop hadoop 1581878 2016-05-28 14:33 /user/hadoop/ml-100k/u4.base
-rw-r--r-- 1 hadoop hadoop 397295 2016-05-28 14:33 /user/hadoop/ml-100k/u4.test
-rw-r--r-- 1 hadoop hadoop 1581776 2016-05-28 14:33 /user/hadoop/ml-100k/u5.base
-rw-r--r-- 1 hadoop hadoop 397397 2016-05-28 14:33 /user/hadoop/ml-100k/u5.test
-rw-r--r-- 1 hadoop hadoop 1792501 2016-05-28 14:33 /user/hadoop/ml-100k/ua.base
-rw-r--r-- 1 hadoop hadoop 186672 2016-05-28 14:33 /user/hadoop/ml-100k/ua.test
-rw-r--r-- 1 hadoop hadoop 1792476 2016-05-28 14:33 /user/hadoop/ml-100k/ub.base
-rw-r--r-- 1 hadoop hadoop 186697 2016-05-28 14:33 /user/hadoop/ml-100k/ub.test
#start a pyspark console
[hadoop@ip-1xx~]$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/
Using Python version 2.7.10 (default, Dec 8 2015 18:25:23)
SparkContext available as sc, HiveContext available as sqlContext.
#load u.user into the RDD
>>> user_data = sc.textFile("ml-100k/u.user")
#check the first record of the RDD; you will see the fields are separated by "|"
>>> user_data.first()
u'1|24|M|technician|85711'
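# the fields in u.user are: user id | age | gender | occupation | zip code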
# split the columns by “|”
>>> user_fields = user_data.map(lambda line: line.split("|"))
#count number of users
>>> num_users = user_fields.map(lambda fields: fields[0]).count()
# distinct count by gender
>>> num_genders = user_fields.map(lambda fields:fields[2]).distinct().count()
# distinct count by occupation
>>> num_occupations = user_fields.map(lambda fields:fields[3]).distinct().count()
# distinct count by zip codes
>>> num_zipcodes = user_fields.map(lambda fields:fields[4]).distinct().count()
# print all the results
>>> print "Users: %d, genders: %d, occupations: %d, Zip Codes: %d" % (num_users, num_genders, num_occupations, num_zipcodes)
Users: 943, genders: 2, occupations: 21, Zip Codes: 795
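As a further sketch in the same console (assuming user_fields from above is still defined), a per-occupation breakdown can be obtained directly with countByValue; the variable name here is illustrative:
# count users per occupation and list the five most common occupations
>>> count_by_occupation = user_fields.map(lambda fields: fields[3]).countByValue()
>>> sorted(count_by_occupation.items(), key=lambda kv: -kv[1])[:5]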
The above is just a simple example of how we can do quick data analysis in the PySpark console. Now we use IPython instead of the standard Python shell for further demonstration:
>>> exit()
[hadoop@xx ~]$ IPYTHON=1 pyspark
In [1]: import re
In [2]: from operator import add
# count the number of lines in the file
In [3]: file_in = sc.textFile("ml-100k/README")
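The session above stops after loading the README; using the re module and the add operator imported earlier, it could continue with the line count and a simple word count. The following lines are only a sketch of one way to do this, with illustrative variable names:
In [4]: file_in.count()
# split each line into lower-case words and drop empty tokens
In [5]: words = file_in.flatMap(lambda line: re.split('\W+', line.lower()))
# count occurrences of each word
In [6]: word_counts = words.filter(lambda w: w != '').map(lambda w: (w, 1)).reduceByKey(add)
# show the ten most frequent words
In [7]: word_counts.takeOrdered(10, key=lambda pair: -pair[1])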