You should submit partner selections by email, as described below.
You should submit Phase A, Phase B, and your Writeup using the Catalyst submission system linked on the course page:
https://catalyst.uw.edu/collectit/dropbox/jaf1978/19171
You are strongly encouraged, but not required, to work with a partner of your own choosing for this project. You may work with somebody you already know, or may use the course message board to find potential partners (we have created a Discussion Area for Potential Partners). No more than two students total may be in a group. You may divide the work however you wish, under three conditions:
Other logistics:
Test all of your code together to be sure it properly integrates. Start this early, and do not attempt to merge your code on the due date. You will very likely experience problems with integration. You may wish to set up an SVN repository to make collaboration easier. If so, contact the course staff regarding project space.
Group members will receive the same project grade, unless there is an extreme circumstance and you notify us in advance of the deadline.
If you plan to work with a partner, one partner MUST send cse332-staff an email. For each partner, provide:
Partner emails are due before class on Wednesday, February 22. This is the absolute latest you can email us. We strongly recommend you select a partner much earlier, then get started.
The availability of electronic data is revolutionizing how governments, businesses, and organizations make decisions. But the idea of collecting demographic data is not new. For example, the United States Constitution has required since 1789 that a census be performed every 10 years. In this project, you will process some data from the 2010 census in order to efficiently answer certain queries about population density. These queries will ask for the population in some rectangular area of the country. The input consists of "only" around 220,000 data points, so any desktop computer has plenty of memory. This manageable data size makes parallelism less compelling, but nonetheless required and educational.
You will implement the desired functionality in several ways that vary in their simplicity and efficiency. Your simplest implementations are entirely sequential. You will also implement approaches that use fork-join parallelism in Java's ForkJoin Framework and an approach that uses explicit threads, a shared data structure, and lock-based synchronization.
A final portion of this project involves comparing execution times for different approaches and parameter settings. You will want to write scripts to collect timing data for you, and you will want to use a machine that has at least 4 processors.
This project is an experiment where we will describe the algorithms you must use, but much of the coding details and experimentation are left to you. Will parallelism help or hurt? Does it matter given that most of your code runs only in a pre-processing step? The answers may or may not surprise you, but you should learn about parallelism along the way.
For this project, you will:
You need several files provided in this single zip file:
You will also find the Introduction to the ForkJoin Framework (JSR 166) useful.
Note the provided code will not do anything without your contributions. The code provided to you is believed to be correct (let us know of problems).
The file CenPop2010.txt (distributed with the project files) contains real data published by the U.S. Census Bureau. The data divides the U.S. into 220,333 geographic areas called "census-block-groups" and reports for each such group the population in 2010 and the latitude/longitude of the group. It actually reports the average latitude/longitude of the people in the group, but that will not concern us: just assume everyone in the group lived on top of each other at this single point.
Given this data, we can imagine the entire U.S. as a giant rectangle bounded by the minimum and maximum latitude/longitude of all the census-block-groups. Most of this rectangle will not have any population:
Note that the code we provide you reads in the input data and changes the latitude for each census group. That is because the Earth is spherical but our grid is rectangular. Our code uses the Mercator Projection to map a portion of a sphere onto a rectangle. It stretches latitudes more as you move North. You do not have to understand this except to know that the latitudes you will compute with are not the same latitudes that are in the input file. If you find it helpful to do so, you can change the code to disable this projection during your testing.
We can next imagine answering queries related to areas inside the U.S.:
Such questions can reveal that population density varies dramatically in different regions, which explains, for example, how a presidential candidate can win despite losing the states that account for most of the country's geographic area. By supporting only rectangles as queries, we can answer queries more quickly. A different shape can be approximated using multiple rectangles, but this is "Above & Beyond."
Your program will first process the data to find the four corners of the rectangle containing the United States. Some versions of the program will then further preprocess the data to build a data structure that can efficiently answer the queries described above. The program will then prompt the user for such queries and answer them until the user chooses to quit. For testing and timing purposes, you may also wish to provide an alternative where queries are read from a second file. We also provide you a graphical interface that makes asking queries more fun.
The first three command-line arguments to your program will be:
Suppose the values for x and y are 100 and 50. That would mean we want to think of the rectangle containing the entire U.S. as being a grid with 100 columns (the x-axis) numbered 1 through 100 from West to East and 50 rows (the y-axis) numbered 1 through 50 from South to North (note our use of one-based indexing). So the grid would have 5000 little rectangles in it. Larger x and y will let us answer queries more precisely but will also require more time and/or space.
A query describes a rectangle within the U.S. using the grid. It is simply four numbers:
Your program should print a single one-line prompt asking for these four numbers and then read them in. Any illegal input (i.e., not 4 integers on one line) indicates the user is done and the program should end. Otherwise, you should output two numbers:
You should then repeat the prompt for another query.
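The "four integers or quit" check described above could be sketched as follows. The class name QueryPrompt and the method parseQuery are hypothetical, chosen only for illustration; the sketch checks only that a line holds exactly four integers and does no range checking.

```java
// Hypothetical helper, not part of the provided code.
final class QueryPrompt {
    // Returns {west, south, east, north}, or null when the line is not
    // exactly four integers (which the caller treats as "user is done").
    static int[] parseQuery(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 4) {
            return null;
        }
        int[] query = new int[4];
        try {
            for (int i = 0; i < 4; i++) {
                query[i] = Integer.parseInt(parts[i]);
            }
        } catch (NumberFormatException e) {
            return null;
        }
        return query;
    }
}
```

A main loop would print the prompt, read one line, call parseQuery, and stop as soon as it returns null.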
To implement your program, you will need to determine within which grid position each census-block-group lies. This requires computing the minimum and maximum latitude and longitude over all the census-block-groups. Note that smaller latitudes are farther South and smaller longitudes are farther West. Also note all longitudes are negative, but this should not cause any problems.
In the unlikely case that a census-block-group falls exactly on the border of more than one grid position, tie-break by assigning it to the north/east.
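One possible way to turn a coordinate into a one-based grid index, including the north/east tie-break, is sketched here. The name GridIndex.indexOf and its parameters are assumptions made for this example. Clamping the maximum coordinate into the last cell is also an assumption: the point with the largest longitude or latitude would otherwise land one position outside the grid.

```java
// Hypothetical helper, not part of the provided code.
final class GridIndex {
    // Returns the one-based column (or row) of `value` when the range
    // [min, max] is cut into `cells` equal slices. Pass longitudes with
    // x to get a column, latitudes with y to get a row.
    static int indexOf(float value, float min, float max, int cells) {
        float cellSize = (max - min) / cells;
        // Truncation sends a point lying exactly on an interior border
        // to the higher index, i.e. to the north or east.
        int index = (int) ((value - min) / cellSize) + 1;
        // The maximum value itself would compute cells + 1; clamp it.
        return Math.min(index, cells);
    }
}
```

Because only differences from the minimum are used, negative longitudes need no special handling.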
You will implement 5 versions of your program. There are significant opportunities to share code among the different versions and you should seize these opportunities. Dividing the work with a partner by splitting up the versions is therefore also probably a poor strategy.
Before processing any queries, process the data to find the four corners of the U.S. rectangle. Use a sequential O(n) algorithm where n is the number of census-block-groups. Then for each query do another sequential O(n) traversal to answer the query (determining for each census-block-group whether or not it is in the query rectangle). The simplest and most reusable approach for each census-block-group is probably to first compute what grid position it is in and then see if this grid position is in the query rectangle.
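A rough sketch of this version follows, under the assumption that the data has been copied into three parallel arrays (lon, lat, pop); the provided code may store it differently. All names are hypothetical.

```java
// Sketch of version 1; the flat arrays and all names are hypothetical.
final class SequentialQuery {
    private final float[] lon, lat; // one entry per census-block-group
    private final int[] pop;
    private float minLon, maxLon, minLat, maxLat;

    SequentialQuery(float[] lon, float[] lat, int[] pop) {
        this.lon = lon;
        this.lat = lat;
        this.pop = pop;
        minLon = maxLon = lon[0];
        minLat = maxLat = lat[0];
        // One O(n) pass finds the four corners.
        for (int i = 1; i < pop.length; i++) {
            minLon = Math.min(minLon, lon[i]);
            maxLon = Math.max(maxLon, lon[i]);
            minLat = Math.min(minLat, lat[i]);
            maxLat = Math.max(maxLat, lat[i]);
        }
    }

    // One-based grid index; border points go north/east, the maximum is clamped.
    private static int cell(float v, float min, float max, int cells) {
        int index = (int) ((v - min) / ((max - min) / cells)) + 1;
        return Math.min(index, cells);
    }

    // Another O(n) pass per query over an x-by-y grid (bounds inclusive).
    long population(int x, int y, int west, int south, int east, int north) {
        long total = 0;
        for (int i = 0; i < pop.length; i++) {
            int col = cell(lon[i], minLon, maxLon, x);
            int row = cell(lat[i], minLat, maxLat, y);
            if (col >= west && col <= east && row >= south && row <= north) {
                total += pop[i];
            }
        }
        return total;
    }
}
```

Computing the grid position first and then comparing it against the query keeps the border tie-break in a single place that later versions can reuse.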
This version is the same as version 1 except both the initial corner-finding and the traversal for each query should use the ForkJoin Framework effectively. The work will remain O(n), but the span should lower to O(log n). Finding the corners should require only one data traversal, and each query should require only one additional data traversal.
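The corner-finding half of this version might look roughly like the sketch below. It assumes the data sits in two parallel float arrays and returns the corners as a four-element array; the class name, the array layout, and the CUTOFF value are all illustrative assumptions, and the cutoff in particular should be chosen by experiment.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sketch only: returns {minLon, maxLon, minLat, maxLat} for lon/lat[lo, hi).
final class CornerTask extends RecursiveTask<float[]> {
    static final int CUTOFF = 1000; // assumed sequential cutoff
    static final ForkJoinPool POOL = new ForkJoinPool();

    private final float[] lon, lat;
    private final int lo, hi;

    CornerTask(float[] lon, float[] lat, int lo, int hi) {
        this.lon = lon;
        this.lat = lat;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected float[] compute() {
        if (hi - lo <= CUTOFF) {
            // Base case: one sequential pass over a small, non-empty range.
            float[] r = {lon[lo], lon[lo], lat[lo], lat[lo]};
            for (int i = lo + 1; i < hi; i++) {
                r[0] = Math.min(r[0], lon[i]);
                r[1] = Math.max(r[1], lon[i]);
                r[2] = Math.min(r[2], lat[i]);
                r[3] = Math.max(r[3], lat[i]);
            }
            return r;
        }
        int mid = lo + (hi - lo) / 2;
        CornerTask left = new CornerTask(lon, lat, lo, mid);
        CornerTask right = new CornerTask(lon, lat, mid, hi);
        left.fork();                 // run the left half asynchronously
        float[] b = right.compute(); // do the right half in this thread
        float[] a = left.join();
        return new float[] {
            Math.min(a[0], b[0]), Math.max(a[1], b[1]),
            Math.min(a[2], b[2]), Math.max(a[3], b[3])
        };
    }

    // Entry point; the arrays must be non-empty.
    static float[] corners(float[] lon, float[] lat) {
        return POOL.invoke(new CornerTask(lon, lat, 0, lon.length));
    }
}
```

All four extremes are computed in the same task, so the data is traversed only once. A query task can follow the same divide-and-combine shape, summing populations instead of taking minima and maxima.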
This version will, like version 1, not use any parallelism, but it will perform additional preprocessing so that each query can be answered in O(1) time. This involves two additional steps:
For example, suppose after step 1 we have this grid:
 0 11  1  9
 1  7  4  3
 2  2  0  0
 9  1  1  1
Then step 2 would update the grid to be:
 0 11 12 21
 1 19 24 36
 3 23 28 40
12 33 39 52
There is an arithmetic trick to completing the second step in a single pass over the grid. Suppose our grid positions are labeled starting from (1,1) in the bottom-left corner. You can implement it differently, but this is how queries are given. So our grid is:
(1,4) (2,4) (3,4) (4,4)
(1,3) (2,3) (3,3) (4,3)
(1,2) (2,2) (3,2) (4,2)
(1,1) (2,1) (3,1) (4,1)
Now, using standard Java array notation, notice that after step 2, for any element not on the left or top edge: grid[i][j]=orig+grid[i-1][j]+grid[i][j+1]-grid[i-1][j+1] where orig is grid[i][j] after step 1. So you can do all of step 2 in O(x*y) by simply proceeding one row at a time top to bottom. You could also do it one column at a time from left to right, or in other more complicated ways. The key is that you update (i-1 , j), (i , j+1) and (i-1 , j+1) before (i , j).
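The single-pass update can be sketched as follows. The sketch makes one assumption beyond the text: the array has an extra all-zero border (dimensions x+2 by y+2), so that the same formula also works for cells on the left and top edges. The names PrefixGrid, fromRows, and accumulate are hypothetical.

```java
// Sketch only. grid is indexed [column][row], one-based, with row y at
// the top (north) and an all-zero border at indices 0 and x+1 / y+1.
final class PrefixGrid {
    // Builds the padded grid from rows listed top (north) row first.
    static int[][] fromRows(int[][] rowsTopFirst) {
        int y = rowsTopFirst.length;
        int x = rowsTopFirst[0].length;
        int[][] grid = new int[x + 2][y + 2];
        for (int r = 0; r < y; r++) {
            for (int i = 0; i < x; i++) {
                grid[i + 1][y - r] = rowsTopFirst[r][i];
            }
        }
        return grid;
    }

    // Step 2: after this call, grid[i][j] holds the total of every cell
    // that is neither farther east nor farther south than (i, j).
    static void accumulate(int[][] grid, int x, int y) {
        for (int j = y; j >= 1; j--) {       // top row first
            for (int i = 1; i <= x; i++) {   // left to right
                grid[i][j] += grid[i - 1][j] + grid[i][j + 1] - grid[i - 1][j + 1];
            }
        }
    }
}
```

Applied to the 4-by-4 example above, this produces the updated grid shown after step 2, with 52 (the total population) in the bottom-right cell.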
Given this unusual grid, we can use a similar trick to answer queries in O(1) time. Remember a query gives us the corners of the query rectangle. In our example above, suppose the query rectangle has corners (3,3), (4,3), (3,2), and (4,2). The initial grid would give us the answer 7, but we would have to do work proportional to the size of the query rectangle (small in this case, potentially large in general). After the second step, we can instead get 7 as 40 - 21 - 23 + 11. In general, the trick is to:
Notice this is O(1) work. Draw a picture or two to convince yourself it is also correct.
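A sketch of the constant-time lookup, with the four-term rule read off the worked example above (40 - 21 - 23 + 11). It assumes the grid is indexed [column][row], one-based from the bottom-left, with an all-zero border so that neighbors outside the grid read as 0. The name RectQuery is hypothetical.

```java
// Sketch only. `grid` is the step-2 grid, indexed [column][row] with an
// all-zero border at indices 0 and x+1 / y+1.
final class RectQuery {
    static int population(int[][] grid, int west, int south, int east, int north) {
        return grid[east][south]          // bottom-right corner of the query
             - grid[east][north + 1]      // just above the top-right corner
             - grid[west - 1][south]      // just left of the bottom-left corner
             + grid[west - 1][north + 1]; // above and left of the top-left corner
    }
}
```

The last term is added back because the region above and to the left of the query rectangle is subtracted twice by the two middle terms.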
Note that another approach to answering queries in O(1) time would be to simply pre-compute the answer to every possible query. But that would require O(x^2 y^2) space and pre-processing time, and is not acceptable for version 3 of your program.
As in version 2, the initial corner finding should be done in parallel. As in version 3, you should create the grid that allows O(1) queries. The first step of building the grid should be done in parallel using the ForkJoin Framework. The second step should remain sequential; just use the code you wrote in version 3. Parallelizing it is part of the Above & Beyond.
To parallelize the first grid-building step, you will need each parallel subproblem to return a grid. To combine the results from two subproblems, you'll need to add the contents of one grid to the other. The grids may be small enough that doing this sequentially is okay, but for larger grids you will want to parallelize this as well using another ForkJoin computation. To test that this works correctly, you may need to set a sequential-cutoff that is lower than your final setting.
Note that your ForkJoin tasks will need several values that are the same for all tasks: the input array, the grid size, and the overall corners. Rather than passing many unchanging arguments in every constructor call, it may be cleaner and perhaps faster to pass an object that has fields for all these unchanging values.
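The first grid-building step with a shared "unchanging values" object might be sketched like this. To keep the example short it assumes each census-block-group's one-based grid position has already been computed into col and row arrays, and it merges the two sub-grids sequentially rather than with a second ForkJoin computation. Shared, GridBuildTask, and every field name are hypothetical.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sketch only: each task returns its own grid, indexed [column][row].
final class GridBuildTask extends RecursiveTask<int[][]> {
    // Hypothetical holder for the values that are the same for all tasks.
    static final class Shared {
        final int[] col, row, pop; // assumed precomputed grid positions
        final int x, y, cutoff;    // cutoff must be at least 1

        Shared(int[] col, int[] row, int[] pop, int x, int y, int cutoff) {
            this.col = col;
            this.row = row;
            this.pop = pop;
            this.x = x;
            this.y = y;
            this.cutoff = cutoff;
        }
    }

    static final ForkJoinPool POOL = new ForkJoinPool();

    private final Shared s;
    private final int lo, hi;

    GridBuildTask(Shared s, int lo, int hi) {
        this.s = s;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected int[][] compute() {
        if (hi - lo <= s.cutoff) {
            int[][] grid = new int[s.x + 1][s.y + 1]; // index 0 unused
            for (int i = lo; i < hi; i++) {
                grid[s.col[i]][s.row[i]] += s.pop[i];
            }
            return grid;
        }
        int mid = lo + (hi - lo) / 2;
        GridBuildTask left = new GridBuildTask(s, lo, mid);
        GridBuildTask right = new GridBuildTask(s, mid, hi);
        left.fork();
        int[][] b = right.compute();
        int[][] a = left.join();
        // Sequential merge; for large grids this loop is the part that
        // could itself become a ForkJoin computation.
        for (int i = 1; i <= s.x; i++) {
            for (int j = 1; j <= s.y; j++) {
                a[i][j] += b[i][j];
            }
        }
        return a;
    }

    static int[][] build(Shared s) {
        return POOL.invoke(new GridBuildTask(s, 0, s.pop.length));
    }
}
```

Setting cutoff to a very small value forces many merges, which is one way to exercise the combining code during testing.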
Version 4 may suffer from doing a lot of grid-copying in the first grid-building step. An alternative is to have just one shared grid to which different threads add as they process different census-block-groups. But to avoid losing any of the data, that means grid elements need to be protected by locks. To allow simultaneous updates to distinct grid elements, each element should have a different lock.
In version 5, you will implement this strategy. You should not use the ForkJoin Framework; it is not designed to allow synchronization operations inside of it other than join. Instead you will need to take the "old-fashioned" approach of using explicit threads. It is okay to set the number of threads to be a static constant, such as 4.
How you manage locks is up to you. You could have the grid store objects and lock those, or you could have a separate grid of just locks. Note that after the first grid building step, you will not need to acquire locks anymore (use Thread.join to make sure your other threads are also done!).
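As one illustration of the "separate grid of just locks" option, the sketch below gives every grid element its own lock object and splits the input evenly among explicit threads. It assumes grid positions have already been computed into col and row arrays; LockedGridBuilder and its parameters are hypothetical names.

```java
// Sketch only: one shared grid, one lock per element, explicit threads.
final class LockedGridBuilder {
    static int[][] build(final int[] col, final int[] row, final int[] pop,
                         int x, int y, int numThreads) {
        final int[][] grid = new int[x + 1][y + 1];      // index 0 unused
        final Object[][] locks = new Object[x + 1][y + 1];
        for (int i = 1; i <= x; i++) {
            for (int j = 1; j <= y; j++) {
                locks[i][j] = new Object();
            }
        }

        Thread[] workers = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            // Each thread handles one contiguous slice of the input.
            final int lo = (int) ((long) t * pop.length / numThreads);
            final int hi = (int) ((long) (t + 1) * pop.length / numThreads);
            workers[t] = new Thread(new Runnable() {
                @Override
                public void run() {
                    for (int i = lo; i < hi; i++) {
                        // Lock only the element being updated.
                        synchronized (locks[col[i]][row[i]]) {
                            grid[col[i]][row[i]] += pop[i];
                        }
                    }
                }
            });
            workers[t].start();
        }

        // Wait for every worker; after join, the grid can be read without locks.
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
        }
        return grid;
    }
}
```

Locking the grid's int values directly is not possible in Java, which is why the sketch keeps a parallel array of lock objects.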
Note you do not need to re-implement the code for finding corners of the country. Use the ForkJoin Framework code from versions 2 and 4. You also do not need to re-implement the second grid-building step. You are just re-implementing the first grid-building step using Java threads, a shared data structure, and locks.
The provided code will take care of parsing the input file into memory, performing the Mercator Projection, and putting the data you need in a large array. The provided code uses float instead of double because the former is precise enough for the purpose of representing latitude/longitude and takes only half the space.
You should avoid timing the parsing since it is slow but not interesting. The rest is up to you. Make good design decisions.
Your main method should be in a class called PopulationQuery and it should take at least 4 command-line arguments in this order:
You are welcome to add additional command-line arguments after these four for your own experimentation, testing, and timing purposes. A cleaner approach would be to use a different main method in another class.
Regardless of any extra command line parameters you may add, you should ensure that your program will work when called from the main method in PopulationQuery.
To facilitate the grading process, your program should EXACTLY match the format shown below (note that >> indicates user input, but should not appear in your output):
>>java PopulationQuery CenPop2010.txt 100 500 -v1
Please give west, south, east, north coordinates of your query rectangle:
>>1 1 100 500
population of rectangle: 312471327
percent of total population: 100.00
Please give west, south, east, north coordinates of your query rectangle:
>>1 1 50 500
population of rectangle: 27820072
percent of total population: 8.90
Please give west, south, east, north coordinates of your query rectangle:
>>exit
When your program sees any input that is not 4 integers on a line, exit the program without printing any additional output.
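The two result lines could be produced along these lines. The labels and the two-decimal percentage are taken from the sample session; the class and method names are hypothetical, and the rounding behavior (that of Java's %.2f) is an assumption, since the handout only shows example values.

```java
import java.util.Locale;

// Hypothetical helper for formatting the two result lines.
final class QueryOutput {
    static String populationLine(long population) {
        return "population of rectangle: " + population;
    }

    static String percentLine(long population, long total) {
        double percent = 100.0 * population / total;
        // Locale.US keeps the decimal point a '.' on every machine.
        return String.format(Locale.US, "percent of total population: %.2f", percent);
    }
}
```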
Also see below for how to write methods that the graphical interface can call; it does not call your main method. You can use the graphical interface with some of your versions even if others are not yet implemented.
The write-up requires you to measure the performance (running time) of various implementations with different parameter settings. To report interesting results properly, you should use a machine with at least four processors and report relevant machine characteristics.
You will also need to report interesting numbers more relevant to long-running programs. In particular you need to:
For guidelines on what experiments to run, see the Write-Up Questions. Note you may not have the time or resources to experiment with every combination of every parameter; you will need to choose wisely to reach appropriate conclusions in an effective way.
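A small timing helper along the following lines may be a reasonable starting point for a timing script. The name TimingHarness and the warm-up and trial counts are assumptions, not requirements of the project. The untimed warm-up runs are there because the Java virtual machine typically runs code more slowly before it has been compiled just-in-time.

```java
// Hypothetical timing helper; parse the input before calling it so that
// parsing time is excluded from the measurement.
final class TimingHarness {
    // Runs `work` `warmups` times untimed, then returns the average
    // wall-clock milliseconds over `trials` timed runs.
    static double averageMillis(Runnable work, int warmups, int trials) {
        for (int i = 0; i < warmups; i++) {
            work.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < trials; i++) {
            work.run();
        }
        long elapsed = System.nanoTime() - start;
        return elapsed / 1e6 / trials;
    }
}
```

Wrapping the preprocessing step and a batch of queries in separate Runnable objects makes it possible to time them independently.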
Turn in all your new files, including any additional Java files you created for testing, and any provided files you modified. Make sure your code is properly documented, etc. We are not specifying how to test your code, but we still want you to test your programs and show us how you did so.
Submit all of your files for Version 1, Version 2, and Version 3. This should include any additional files you created for testing. Style must be reasonable in this submission, but it will be primarily graded for correctness.
Submit all of your files for all five versions (including files you submitted with Phase A). This should include any additional files you created for testing. It will be graded for correctness and style. Submit any "Above & Beyond" code in an extracredit.zip file.
You may submit a txt file with separate image files for your charts. You may also insert your charts inline in an html or pdf file. Formatting is not a component of your grade, as long as the report is clear and readable.
You must submit all of the code you used for timing in the preparation of your report. This code will not be graded and is only to help us understand what you did if there are any questions. The need to time your code should not impact the quality of your design.
As described in Grading Policies, Phase A and Phase B will be graded for both correctness and style. Our goal is to encourage appropriate attention to Phase A, while also incentivizing Phase B improvements whenever necessary.
Correctness points will be allocated to elements of Phase A and Phase B.
Style will be primarily graded using the final Phase B version of your entire project. Phase A submissions must be reasonable, but we expect that you may reconsider elements of style in the course of Phase B.
You will submit your timing code with your writeup, but we will not grade the style of that code.
The provided graphical user interface (GUI) for the program is intended to be fun, easy to use, and useful for checking your program against some geographical intuition (e.g., nobody lives in the ocean and many people live in Southern California).
The GUI presents a map of the U.S. as a background image with a grid overlaid on it. You can select consecutive grid squares to highlight arbitrary rectangles over the map. When you select run, the GUI will invoke your solution code with the selected rectangle and display the result.
To run the GUI, run the main method of the class USMaps. If you are using Java 6 instead of Java 7, you still need the VM argument -Xbootclasspath/p:jsr166.jar.
In the GUI, you can "zoom in" to the continental U.S. When zoomed, keep in mind two things:
Naturally, the GUI needs to call your code and it can do so only if you implement an API that the GUI expects. To use the GUI, you must write two methods in the class PopulationQuery with the following signatures:
The arguments to the preprocess method are the same arguments that should be passed via the command line to the main method in PopulationQuery , only parsed into their datatypes and not as Strings. This method should read the file and prepare any data structures necessary for the given version of the program. The arguments to the singleInteraction method are the arguments that are passed to the program when it prompts for query input. This method should determine the population size and the population percentage of the U.S. given the parameters, just as your program should when given integers at the prompt.
Turn in a report answering the following questions. Note there is a fair amount of data collection for comparing timing, so do not wait until the last minute. Prepare an actual report, preferably as a PDF file.
You may do any or all of the following; pick ones you find interesting.
This project was created in Spring 2010 by Dan Grossman. Brent Sandona created the GUI and Jacob Sanders added the support for zooming in on the continental U.S. Dan got several good ideas and pieces of feedback from other faculty members (Alan Borning, James Fogarty, Hal Perkins, Larry Snyder, maybe others) while he was figuring out the project. Martin Tompa improved the write-up in Fall 2010.