Project 4: Hadoop and Pig
Due date: March 11, 2011 (last day of class)
ESTIMATED TIME: Up to 18 hours, though likely less.
RIGHT NOW:
- Find a partner, and email Rita with your names
and CSE usernames by 5 pm on Tuesday, February 22. If you choose
to work alone, you must also email Rita by the same date and time.
Otherwise, you may be assigned to a group randomly!
- Once you have your group, complete as many of the steps to set up
your Amazon Web Services (AWS) account as you are able.
- Resume AWS setup once your team receives an AWS credit coupon from us;
this should happen on Wednesday, February 23. Amazon can take up to 2 days
to complete the setup, so start as soon as the coupon arrives!
- Once you have AWS access, read through and complete all the
Preliminaries, to make sure you can access AWS and run Pig scripts
there. You will start the project on your local machine and use an
AWS cluster for large runs at the end.
STARTER CODE: Download the project archive,
project4.tar.gz. It contains Hadoop, Pig, and the data files and
scripts you need for this project. Note: it is about 20 MB compressed.
TURN IN INSTRUCTIONS: Turn in eight files (details on the problems
page) to the Catalyst dropbox.
GROUPS: We strongly recommend you work with a partner on this assignment.
If you do, one member of the group should turn in a single project with
everyone's name on it; all members of the group will receive the same
score. Also include a short readme.txt file listing the members of the
group and summarizing who did what. Everyone in the group is
responsible for the material regardless of how you organize the work.
Where to go from here
The remainder of the instructions for this project consists of three parts:
- Preliminaries contains a basic tutorial
on using Pig and the AWS Hadoop cluster.
- AWS and cluster usage contains
instructions for the various tasks you need to do to set up your AWS
account and an AWS Hadoop cluster.
- Problems contains the actual
statements of each problem in the assignment.
NOTE: Keep your AWS usage low so that you do not exhaust the AWS
credits we give you and get charged real money.
Once you finish the Pig tutorial on AWS, we recommend you do the following:
- Run all your Pig scripts on your local machine using the copies
of Hadoop and Pig in the project4.tar.gz archive, against the small
data files in the archive.
- Then copy the scripts to an AWS cluster
and execute them there against the large data files in Amazon S3.
- Once your scripts finish running in AWS, copy back the required output files
to your local machine, and remember to shut down the cluster.