AWS Setup

Setting up your AWS account

  1. Go to http://aws.amazon.com/ and sign up:
    1. You may sign in using your existing Amazon account or you can create a new account by selecting "I am a new user."
    2. Enter your contact information and confirm your acceptance of the AWS Customer Agreement.
    3. Once you have created an Amazon Web Services account, check your email and complete the confirmation step. You need Access Identifiers to make valid web service requests.
  2. Go to http://aws.amazon.com/ and sign in. At the top of the page, click on Sign in to the AWS Management Console. You need to sign up for three of their services: Simple Storage Service (S3), Elastic Compute Cloud (EC2), and Amazon Elastic MapReduce.
  3. Get your AWS credit code (see the course assignment), then go to http://aws.amazon.com/awscredits/ to redeem it. This gives you $100 of credit towards AWS. Be aware that if you exceed it, Amazon will charge your credit card without warning. Normally, this credit is more than enough for this homework assignment (if you are interested in their charges, see AWS charges: currently, AWS charges about 8 cents/node/hour for the default "small" node size). However, you must remember to terminate the AWS cluster (called a Job Flow) manually when you are done: if you just close the browser, the job flow continues to run, and Amazon will continue to charge you for days and weeks, exhausting your credit and then charging a huge amount to your credit card. Remember to terminate the AWS cluster.

Setting up an EC2 key pair

To connect to an Amazon EC2 node, such as the master nodes for the Hadoop clusters you will be creating, you need an SSH key pair. To create and install one, do the following:

  1. After setting up your account, follow Amazon's instructions to create a key pair: see the section "Having AWS create the key pair for you," subsection "AWS Management Console." (Don't do this in Internet Explorer, or you might not be able to download the .pem private key file.)
  2. Download and save the .pem private key file to disk. We will reference the .pem file as </path/to/saved/keypair/file.pem> in the following instructions.
  3. Make sure only you can access the .pem file, just to be safe:
    $ chmod 600 </path/to/saved/keypair/file.pem>
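
Once your cluster is running (see the next section), this key pair is what you will use to connect to the Hadoop master node. As a sketch, using the same placeholders as the rest of these instructions:

  $ ssh -i </path/to/saved/keypair/file.pem> hadoop@<master.public-dns-name.amazonaws.com>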

Starting an AWS cluster and running Pig interactively

To run a Pig job on AWS, you need to start up an AWS cluster using the web Management Console, then connect to the Hadoop master node, as follows:

  1. Complete Section 1, Section 2, and Section 3.1 (only) in Amazon's interactive Pig tutorial. The screenshots are slightly outdated, but most of the directions are the same.

  2. Once you have completed Sections 1, 2, and 3.1, you will have a Pig prompt:
      grunt>
    
    This is the interactive mode, where you type in Pig queries. Here you will copy and paste example.pig, but only after you read "Managing the results of your Pig queries" below. In this homework we will use Pig only interactively. (The alternative is to have Pig read the program from a file, as sketched below.)
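
    For reference only (you will not need it for this homework), the non-interactive alternative would be run from the master node's shell and would look roughly like this, assuming the script has been saved on the master node as example.pig:

      % pig example.pig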

Terminating an AWS cluster

After you are done, shut down the AWS cluster:

  1. Go to the Management Console.
  2. Select the job in the list.
  3. Click the Terminate button (it should be right below "Your Elastic MapReduce Job Flows").
  4. Wait for a while (may take minutes) and recheck until the job state becomes TERMINATED.

Pay attention to this step. If you fail to terminate your job and only close the browser or log off AWS, your cluster will continue to run, and AWS will continue to charge you: for hours, days, weeks, and when your credit is exhausted, it charges your credit card. Make sure you don't leave the console until you have confirmation that the job is terminated.

Monitoring Hadoop jobs

You are required in this homework to monitor the running Hadoop jobs on your AWS cluster using the master node's job tracker web UI. There are two ways to do this: using lynx or using your own browser with a SOCKS proxy.

  1. Using lynx. This is very easy; you don't need to download anything. Open a separate ssh connection to the AWS master node and type:

    % lynx http://localhost:9100/

    Lynx is a text browser. Navigate as follows: up/down arrows = move through the links (the current link is highlighted); Enter = follow a link; left arrow = return to the previous page.

    Examine the webpage carefully while your Pig program is running. You should find information about the map tasks and the reduce tasks; you should be able to drill down into each map task (for example, to monitor its progress) and to look at the log files of the map tasks (if there are runtime errors, you will see them only in these log files).

  2. Using a SOCKS proxy and your own browser. This requires more setup, but the nicer interface makes it worth the extra effort.
    1. Set up your browser to use a proxy when connecting to the master node.
      • Firefox:
        1. Install the FoxyProxy extension for Firefox.
        2. Copy the foxyproxy.xml configuration file from the hw6/ folder into your Firefox profile folder.
        3. If the previous step doesn't work for you, try deleting the foxyproxy.xml you copied into your profile, and using Amazon's instructions to set up FoxyProxy manually.
      • Chrome:
        1. Install ProxySwitch! by clicking "Add to Chrome" on the extension's page.
        2. After clicking the link, you should land on the ProxySwitch! options page; if not, click the Tools wrench icon (upper right corner), go to Options, then Extensions, find ProxySwitch!, and click Options next to it.
        3. Create a new Proxy Profile: Manual Configuration, Profile name = Amazon Elastic MapReduce (any name you want), SOCKS Host = localhost, Port = 8888 (you can choose any port you want; another favorite is 8157), SOCKS v5.
        4. Create two new switch rules (give them any names, say AWS1 and AWS2). Rule 1: pattern=*.amazonaws.com:*/*; Rule 2: pattern=*.ec2.internal:*/*. For both, Type=wildcard, Proxy profile=[the profile you created in the previous step].
    2. Open a new local terminal window and create the SSH SOCKS tunnel to the master node using the following:
      $ ssh -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> -ND 8888 hadoop@<master.public-dns-name.amazonaws.com>
      (The -N option tells ssh not to start a shell, and the -D 8888 option tells ssh to start the proxy and have it listen on port 8888.)

      The resulting SSH window will appear to hang, without any output; this is normal as SSH has not started a shell on the master node, but just created the tunnel over which proxied traffic will run.

      Keep this window running in the background (minimize it) until you are finished with the proxy, then close the window to shut the proxy down.
    3. Open your browser, and type one of the following URLs:
      • For the job tracker: http://<master.public-dns-name.amazonaws.com>:9100/
      • For HDFS management: http://<master.public-dns-name.amazonaws.com>:9101/
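
      If these pages do not load, first check that the SSH tunnel from step 2 is still running and listening on the port you chose; for example, assuming netstat is available on your machine and you used port 8888:

        $ netstat -an | grep 8888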

Killing a Hadoop Job

If you want to kill Pig, first type Ctrl-C, which kills Pig only. Next, kill the Hadoop job, as follows. From the job tracker interface, find the Hadoop job_id, then type:

% hadoop job -kill job_id
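
Alternatively, you can list the running Hadoop jobs (and their ids) directly from the master node rather than from the job tracker UI; the job ids appear in the first column of the output, and you pass one to hadoop job -kill as above:

% hadoop job -list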

Managing the results of your Pig queries

Your pig program stores the results in several files in a directory. You have two options: (1) store these files in the Hadoop File System, or (2) store these files in S3. In both cases you need to copy them to your local machine.

1. Storing Files in the Hadoop File System

This is done through the following pig command (used in example.pig):

	store count_by_object_ordered into '/user/hadoop/example-results' using PigStorage();

Before you run the Pig query, you need to (A) create the /user/hadoop directory. After you run the query, you need to (B) copy the results directory from HDFS to the local file system of the AWS master node, and then (C) copy it from the AWS master node to your local machine.

1.A. Create the "/user/hadoop" Directory in the Hadoop Filesystem

To create a /user/hadoop directory on the AWS cluster's HDFS file system run this from the AWS master node:

% hadoop dfs -mkdir /user/hadoop

Check that the directory was created by listing it with this command:

% hadoop dfs -ls /user/hadoop

You may see some output from either command, but you should not see any errors. Now you can run example.pig.

1.B. Copying files from the Hadoop Filesystem

The result of a pig script is stored in the hadoop directory specified by the store command. That is, for example.pig, the output will be stored at /user/hadoop/example-results, as specified in the script. HDFS is separate from the master node's file system, so before you can copy this to your local machine, you must copy the directory from HDFS to the master node's Linux file system:

% hadoop dfs -copyToLocal /user/hadoop/example-results example-results

This will create a directory example-results with part-* files in it, which you can copy to your local machine with scp (see 1.C below). For this example there may be only one part-* file, but in general you will have several. You can then concatenate all the part-* files to get a single results file, perhaps sorting the results if you like.
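
For example, once the part-* files are on your local machine, a minimal sketch of combining and sorting them (assuming they are plain text):

$ cat example-results/part-* > example-results.txt
$ sort example-results.txt > example-results-sorted.txt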

Use hadoop dfs -help or see the hadoop dfs guide to learn how to manipulate HDFS. (Note that hadoop fs is the same as hadoop dfs.)

1.C. Copying files to or from the AWS master node
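
To copy files between the master node and your own computer, run scp on your local machine with the same key pair you use for ssh. As a sketch, assuming the example-results directory from step 1.B sits in the hadoop user's home directory on the master node, this copies it into the current local directory:

$ scp -r -i </path/to/saved/keypair/file.pem> hadoop@<master.public-dns-name.amazonaws.com>:example-results .

(To copy in the other direction, swap the source and destination arguments.)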

2. Storing Files in S3

This is often easier to use. Go to your AWS Management Console, click on Create Bucket, and create a new bucket (i.e., a directory). Give it a name, keeping in mind that bucket names are globally visible; let's say you call it superman-hw6. Click on the Properties button, then the Permissions tab. Make sure you have all the permissions.

Modify the store command of example.pig to:

	store count_by_object_ordered into 's3n://superman-hw6/example-results';

Run your Pig program. When it terminates, you should see the new directory example-results in your S3 console. Click on individual files to download them. The number of files depends on the number of reduce tasks and may vary from one to a few dozen. The only disadvantage of using S3 is that you have to download each file separately.
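
If you would rather fetch the whole directory from the command line than click each file, a tool such as s3cmd can do it; a sketch, assuming s3cmd is installed and configured with your AWS credentials, and using the bucket name from the example above:

$ s3cmd get --recursive s3://superman-hw6/example-results/ .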

Note that S3 is permanent storage, and you are charged for it. You can safely store all your query answers for several weeks without exceeding your credit, but remember to delete them at some point.