
Project 4: Hadoop and Pig

Due date: March 11, 2011 (last day of class)

ESTIMATED TIME: Up to 18 hours, though likely less.

RIGHT NOW:

Find a partner, and email Rita with your names and CSE usernames by 5 pm on Tuesday, February 22. If you choose to work alone, you must also email Rita by the same date and time. Otherwise, you may be assigned to a group randomly!
Once you have your group, complete as many of the steps to set up your Amazon Web Services (AWS) account as you are able.
Resume AWS setup once your team receives an AWS credit coupon from us; this should happen on Wednesday, February 23. The setup process can be delayed by up to 2 days on Amazon's side, so start it as soon as you can!
Once you have AWS access, read through and complete all the Preliminaries, to make sure you can access AWS and run Pig scripts there. You will start the project on your local machine and use an AWS cluster for large runs at the end.

STARTER CODE: Download the project archive, project4.tar.gz. It contains Hadoop, Pig, and the data files and scripts you need for this project. Note: it is about 20 MB compressed.

TURN IN INSTRUCTIONS: Turn in eight files (details on the problems page) to the Catalyst dropbox.

GROUPS: We strongly recommend you work with a partner on this assignment. If you do work with a partner, one member of the group should turn in a single project with everyone's name on it, and all members of the group will receive the same score. Also include a short readme.txt file that lists the members of the group and briefly summarizes who did what. Everyone in the group is responsible for the material, regardless of how you organize the work.

Where to go from here

The remainder of the instructions for this project consists of three parts:

Preliminaries contains a basic tutorial on using Pig and the AWS Hadoop cluster.
AWS and cluster setup contains instructions for the tasks you need to do to set up your AWS account and an AWS Hadoop cluster.
Problems contains the actual statements of each problem in the assignment.

NOTE: Keep your AWS usage low so that you do not exhaust the AWS credits we give you and get charged real money. Once you finish the Pig tutorial on AWS, we recommend you do the following:

Run all your Pig scripts on your local machine using the copies of Hadoop and Pig in the project4.tar.gz archive, against the small data files in the archive.
Then copy the scripts to an AWS cluster and execute them there against the large data files in Amazon S3.
Once your scripts finish running in AWS, copy the required output files back to your local machine, and remember to shut down the cluster.
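
The three steps above can be sketched as a small dry-run shell script. It only prints the Pig command for each stage, so it is safe to run without Pig installed. The script name example.pig and the helper pig_command are hypothetical, and the sketch assumes the pig launcher is on your PATH. The -x flag selects Pig's execution mode (local or mapreduce); the exact commands to use on the cluster are the ones given in the Preliminaries, which may differ from this sketch.

```shell
#!/bin/sh
# Dry-run sketch of the local-first workflow: prints commands, runs nothing.
# ASSUMPTIONS: example.pig is a hypothetical script name, and `pig` is on PATH.

pig_command() {
    # $1 = Pig execution mode ("local" or "mapreduce"), $2 = Pig script file
    printf 'pig -x %s %s\n' "$1" "$2"
}

script="example.pig"

echo "Step 1 (local machine, small data files from the archive):"
pig_command local "$script"

echo "Step 2 (AWS cluster, large data files in Amazon S3):"
pig_command mapreduce "$script"

echo "Step 3: copy the required output files back, then shut down the cluster."
```

Debugging in local mode first costs no AWS credits, so the cluster only needs to stay up for the final large runs.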