Process and Analyse Big Data Using Hive on Amazon EMR and MicroStrategy Suite

This is an Amazon big data tutorial that combines Amazon Elastic MapReduce (EMR) with the MicroStrategy BI suite for data exploration.

Note: the procedure below will incur charges for AWS EMR, S3 storage, and the MicroStrategy software.

To start an interactive Hive session using the AWS Management Console
  1. Select the Elastic MapReduce tab.
  2. Select your Region.
  3. Click “Create New Cluster.”
  4. Choose a descriptive name for your cluster, for example, “xxxx.”
  5. In the “Type of Instance” list, select “Large (m1.large)”; smaller instances will also work.
  6. Specify an EC2 key pair in the “Advanced Options” and click “Continue”.
  7. Accept the default selection to “Proceed with no Bootstrap Actions” and click “Continue”.
  8. On the job flows page, click “Refresh” until the job flow enters the “WAITING” state.
  9. Select the job flow and find the DNS name of the master node in the detail pane in the bottom half of the screen.
  10. Save the master DNS name. You’ll use it to ssh to the master node.

As an alternative to the console configuration above, you may also use an AWS CLI script to set up the EMR cluster.
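A minimal sketch of such a script is shown here; the cluster name, key pair name, release label, and instance settings are placeholders, so adjust them to your account and region (the m1.large type used in the console steps may not be offered on newer release labels, so a newer type is used):
  # Create an EMR cluster with Hive installed (placeholder names and values)
  aws emr create-cluster \
    --name "hive-tutorial" \
    --release-label emr-5.36.0 \
    --applications Name=Hive \
    --ec2-attributes KeyName=my-key-pair \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles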
You may use either of the following options to ssh to the EMR master node (a plain ssh equivalent is sketched after this list):
a) ssh with PuTTY, with proxy management tools set up

b) ssh with PuTTY, without proxy management tools set up
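On Linux or macOS the PuTTY steps can be replaced by a plain ssh command; this is only a sketch, and the key file name and host are placeholders for your own key pair and the master DNS name you saved earlier:
  # Connect to the EMR master node as the hadoop user (hypothetical key/host names)
  ssh -i my-key-pair.pem hadoop@<master-public-dns-name>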

Note: Ensure that both DNS resolution and DNS hostnames are set to Yes for your VPC.

Otherwise, do the following to enable them:
a) Open the Amazon VPC console at https://console.aws.amazon.com/vpc/.
b) In the navigation pane, choose Your VPCs.
c) Select the VPC from the list.
d) Review the DNS settings in the Summary tab; both DNS resolution and DNS hostnames should show as enabled.
e) To update these settings, choose Actions and either Edit DNS Resolution or Edit DNS Hostnames. In the dialog box that opens, choose Yes, then Save.
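The same check and update can also be scripted with the AWS CLI; a minimal sketch follows, where the VPC ID is a placeholder for your own:
  # Check the current values of both DNS attributes (hypothetical VPC ID)
  aws ec2 describe-vpc-attribute --vpc-id vpc-1234567890abcdef0 --attribute enableDnsSupport
  aws ec2 describe-vpc-attribute --vpc-id vpc-1234567890abcdef0 --attribute enableDnsHostnames
  # Enable them if either attribute is false
  aws ec2 modify-vpc-attribute --vpc-id vpc-1234567890abcdef0 --enable-dns-support "{\"Value\":true}"
  aws ec2 modify-vpc-attribute --vpc-id vpc-1234567890abcdef0 --enable-dns-hostnames "{\"Value\":true}"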

Once you are connected to the master node via ssh, run Hive with the following command. (Replace mybucket with your own Amazon S3 bucket.)
  hadoop@domU-12-31-39-07-D2-14:~$ hive \
    -d SAMPLE=s3://elasticmapreduce/samples/hive-ads \
    -d DAY=2009-04-13 -d HOUR=08 \
    -d NEXT_DAY=2009-04-13 -d NEXT_HOUR=09 \
    -d OUTPUT=s3://mybucket/samples/output
  hive>
  -- Register the JSON SerDe shipped with the sample data
  ADD JAR ${SAMPLE}/libs/jsonserde.jar;

  -- External table over the raw impression logs in S3
  CREATE EXTERNAL TABLE impressions (
    requestBeginTime string, adId string, impressionId string, referrer string,
    userAgent string, userCookie string, ip string
  )
  PARTITIONED BY (dt string)
  ROW FORMAT
    SERDE 'com.amazon.elasticmapreduce.JsonSerde'
    WITH SERDEPROPERTIES ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' )
  LOCATION '${SAMPLE}/tables/impressions';

  ALTER TABLE impressions ADD PARTITION (dt='2009-04-13-08-05');
  -- External table over the raw click logs in S3
  CREATE EXTERNAL TABLE clicks (
    impressionId string
  )
  PARTITIONED BY (dt string)
  ROW FORMAT
    SERDE 'com.amazon.elasticmapreduce.JsonSerde'
    WITH SERDEPROPERTIES ( 'paths'='impressionId' )
  LOCATION '${SAMPLE}/tables/clicks';
  
  CREATE EXTERNAL TABLE joined_impressions (
    requestBeginTime string, adId string, impressionId string, referrer string,
    userAgent string, userCookie string, ip string, clicked boolean
  )
  PARTITIONED BY (day string, hour string)
  STORED AS SEQUENCEFILE
  LOCATION '${OUTPUT}/joined_impressions';
  -- Temporary table holding the impressions for the chosen hour
  CREATE TABLE tmp_impressions (
    requestBeginTime string, adId string, impressionId string, referrer string,
    userAgent string, userCookie string, ip string
  )
  STORED AS SEQUENCEFILE;

  INSERT OVERWRITE TABLE tmp_impressions
    SELECT
      from_unixtime(cast((cast(i.requestBeginTime as bigint) / 1000) as int)) requestBeginTime,
      i.adId, i.impressionId, i.referrer, i.userAgent, i.userCookie, i.ip
    FROM
      impressions i
    WHERE
      i.dt >= '${DAY}-${HOUR}-00' and i.dt < '${NEXT_DAY}-${NEXT_HOUR}-00';
  -- Temporary table holding the clicks for the same window
  -- (the window extends 20 minutes into the next hour to catch late clicks)
  CREATE TABLE tmp_clicks (
    impressionId string
  ) STORED AS SEQUENCEFILE;

  INSERT OVERWRITE TABLE tmp_clicks
    SELECT
      impressionId
    FROM
      clicks c
    WHERE
      c.dt >= '${DAY}-${HOUR}-00' AND c.dt < '${NEXT_DAY}-${NEXT_HOUR}-20';
  -- Join impressions with clicks; clicked is true when a matching click exists
  INSERT OVERWRITE TABLE joined_impressions PARTITION (day='${DAY}', hour='${HOUR}')
  SELECT
    i.requestBeginTime, i.adId, i.impressionId, i.referrer, i.userAgent, i.userCookie,
    i.ip, (c.impressionId is not null) clicked
  FROM
    tmp_impressions i LEFT OUTER JOIN tmp_clicks c ON i.impressionId = c.impressionId;
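At this point the joined data is available in the joined_impressions table. A quick sanity check, shown here as a minimal sketch using only the columns defined above, is to count rows by click status for the partition just loaded:
  -- Hypothetical check: row counts per clicked flag for the hour just loaded
  SELECT day, hour, clicked, count(*) AS cnt
  FROM joined_impressions
  WHERE day = '${DAY}' AND hour = '${HOUR}'
  GROUP BY day, hour, clicked;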
Note: Referenced from http://aws.amazon.com/articles/2855?_encoding=UTF8&jiveRedirect=1

Next, set up a MicroStrategy instance from an AWS Marketplace AMI.

Once the MicroStrategy AMI instance is launched, set up SSH port forwarding as follows:
  1. Start PuTTY.
  2. Specify master node DNS name as the Host Name.
  3. Make sure the SSH port is set to 22.
  4. Under Category, expand Connection, then SSH, and select Tunnels to open the Options controlling SSH port forwarding page.
  5. Specify 10003 as the Source port. (This assumes you used the default version of Hive, 0.8.1, on the Amazon EMR cluster. Otherwise specify the correct port for your version.)
  6. Specify localhost:10003 as the Destination.
  7. In the section with radio buttons, select Local and Auto.
  8. Click the Add button. This should create an entry L10003 localhost:10003.
  9. Now navigate to Connection >> SSH >> Auth to set up the private key settings.
  10. Click Browse under Private key for authentication and choose the location where you saved the .ppk key file.
  11. Navigate to Connection >> Data and type hadoop as the Auto-login username.
  12. In the Category pane, click Session to go back to the main page.
  13. In Saved Sessions, specify a name for this setting and click Save. Next time you can select this setting and click Load.
  14. Click Open to create the SSH tunnel.
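If you prefer the command line over PuTTY, the same tunnel can be opened with plain ssh; this is a sketch, and the key file and host are placeholders for your own key pair and the master DNS name:
  # Forward local port 10003 to the Hive server on the master node (hypothetical key/host names)
  ssh -i my-key-pair.pem -N -L 10003:localhost:10003 hadoop@<master-public-dns-name>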
Finally, use MicroStrategy’s Freeform SQL to define a database connection. These settings assume the SSH tunnel to the Hive job flow runs over port 10003 on localhost.

Use a Hive query like the one below to extract the data for reporting (the feature_index table is built in the referenced AWS article):
  select feature, avg(clicked_percent) as avg_click_percent, count(*) as views
  from feature_index
  where feature like 'page:%'
  group by feature;
Data referenced from: https://aws.amazon.com/articles/7795181155430209#VPC
