Homework #3: Classified Ads
Your assignment is to write a classifier which, given the IMG tag of an
image (including, among other things, the URL of the image and ALT
text), determines whether the image is an advertisement. We encourage
you to work in groups of up to three people for this assignment (but
note that the number of features and examples is per person!).
The assignment is broken into two parts which you will submit
separately.
Useful Files
The following files are linked elsewhere on the page, but you might
as well have one coalesced list of the useful stuff:
Part 1: Picking Features and Examples
This part of the assignment is due Monday, March 1st at 5PM.
First, select images from a random distribution, collect them, and tag
them as ads or not ads.
Then, select a set of image features over which to learn, and write
functions to extract the features given the IMG tag.
As in the search assignment, we will use calling conventions to buffer
the input and features from the learning engine. Here's a more
complete description of this part with explicit format information:
- Training set - select the training set from the "random"
distribution provided by http://www.uroulette.com. When you
reach the random page, open the HTML source of the page and collect
each IMG tag as a data point (for pages with frames, open the source
of each frame). Each individual should collect a total of at least 25
data points. If the same IMG tag is used twice on a given page, just
list it once as a data point, but make note of the repetition in the
data for that image. Treat separate frames as separate pages. To keep
things somewhat random, use only the first-level random link URouLette
gives you, and use every page you can (try not to skip pages).
- Data points - each data point is an association list. The list
should associate the following symbols with the data listed:
- :IMG - the IMG tag of this image as it appears on the page.
- :NUM - the number of times the tag appears on the page.
- :URL - the URL of the page on which the image occurs.
- :AD - either T if the image is an ad or NIL if it is
not. You may not use this field in calculating any feature!
Note: use your discretion to decide what is and what is not an ad.
- Features - a feature is some quality of the image which you can
measure. Each feature should have a symbolic name, an English
description, a domain of possible values, and an extraction
function. You will add your feature to the data point by associating
the symbolic name of the feature with the feature's domain value for
the data point. A feature will be represented by a list which contains
the following elements (in the order listed):
- symbolic name - starts with ':', then your login id and a
dash, then a descriptive identifier for the feature which is different
from all your other features' identifiers.
- description - string containing an English description of
the feature.
- domain - list of possible values; should list all possible
values the feature might take on. Each value should be a symbol
starting with a ':'.
- extractor - function which should take just one argument, a
data point (as described above), and return the value from the feature
domain to which the given data point corresponds.
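To make the four-element format concrete, here is one hypothetical feature in that shape (the login id "wolf" and the ALT-attribute check are just for illustration; note the crude substring test would also match "alt" appearing anywhere in the tag):

```lisp
;; Hypothetical example feature: does the IMG tag mention ALT?
;; (A simple case-insensitive substring search -- it will also match
;; "alt" occurring inside a URL or other attribute.)
(setf sample-feature
      (list :WOLF-HAS-ALT
            "Does the IMG tag contain an ALT attribute?"
            '(:YES :NO)
            #'(lambda (data-point)
                (if (search "alt" (cdr (assoc :IMG data-point))
                            :test #'char-equal)
                    :YES
                    :NO))))
```

The extractor takes a whole data point (the association list), so it is free to look at the :IMG tag, the :URL, or the :NUM count, but never the :AD field.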
Note: there are lots of difficulties with gathering the data: what do
you do with broken images? What about automatically linked images
(e.g., linked with JavaScript)? How should you count multiple copies
of the same image link with different whitespace?
Make the best decision that you can under these circumstances, but
explain the decision. For instance, you can't really tell whether a
broken image is an ad, so leave it out, and automatically generated
links may be partially specified, so leave them out. Post to the
newsgroup to discuss other problems!!
Part 1 Submission
Each individual should gather at least 25 data points and produce at
least 3 features. You will submit your assignment electronically to
the "\\howl\Submit473" share we used for the first
assignment. Submission details coming soon (but basically, you will
just submit the data points and features including the feature
extraction code, and, as always, insightful answers to my questions).
Questions:
- Different people have different concepts of what an ad is. What's
yours, and why is that a sensible definition?
- If you really wanted to snip ads from your browsing, would you use
URouLette to gather the images? What would be a better way to gather
training examples?
- For each of your features, explain why you chose it, and why you
think it might help discriminate the examples.
Note on the final question: I'm not testing you in this part on
whether the features are actually useful, so try one or two really
wacky features and try to think how they might correlate to whether
the image is an ad!
Part 1 example
Here's an example of a page of data points and a feature. The page is
www.uroulette.com, and the feature determines whether there is one,
several, or many occurrences of the image on a page. Note the escaped
quotation marks; you'll have to do the same!
Data points:
(setf data-points
'(((:IMG . "<IMG SRC=\"images/spinner.gif\" WIDTH=432 HEIGHT=254 BORDER=0>")
(:NUM . 1)
(:URL . "http://www.uroulette.com/")
(:AD . NIL))
((:IMG . "<IMG SRC=\"images/uroulette.gif\">")
(:NUM . 1)
(:URL . "http://www.uroulette.com/")
(:AD . NIL))
((:IMG . "<img width=440 height=40 border=1 ismap
alt=\"LinkExchange\"
src=\"http://ad.linkexchange.com/1/X515989/logoshowad?free\">")
(:NUM . 1)
(:URL . "http://www.uroulette.com/")
(:AD . T))))
Feature:
;; Return the domain element corresponding to the first border check
;; that succeeds (assumes the domain is one longer than borders, so
;; there is a default domain value when all checks fail).
(defun extract-from-list (value borders domain comparator)
(if (null borders)
(car domain)
(if (funcall comparator value (car borders))
(car domain)
(extract-from-list value (cdr borders) (cdr domain) comparator))))
;; Categorize the number of occurrences as ONE, SEVERAL, or MANY.
(defun num-extractor (data-point)
(let ((ranges '(1 5))
(values '(:ONE :SEVERAL :MANY)))
;; The "assoc" pulls out the number of occurrences from the data
;; point.
(extract-from-list (cdr (assoc :NUM data-point)) ranges values
#'(lambda (x y) (<= x y)))))
(setf features
`((:WOLF-FREQUENCY "Classifies the number of occurrences into discrete
qualitative categories, putting a value with infinite
domain into a small, finite domain."
(:ONE :SEVERAL :MANY)
,#'num-extractor)))
Part 2: Building and Running the Decision Tree
This portion will be due Monday, March 8.
For this part of the assignment you will create a general purpose
decision tree learner which you will then apply to the examples and
features gathered by you and your classmates. We provide you with
a core set of features which you must use in your decision tree along
with a set of helper functions which should make constructing the
decision tree much simpler. We also provide a description of the
format of the output (the decision tree itself).
Note: you do not have to use all your features from the first part,
and you can add new ones!
We will produce a learner which uses only the core attributes (and
will not be at all fancy). After the learners have been submitted, we
will run each group's learner on a new set of data and measure the
accuracy. Scores will be based on percent of correct identifications
(false positives and false negatives will be counted the same). Any
learner that beats ours will receive bonus points. There will be
prizes for the top learners!
Required Features and Given Data Points
There are two files you need for the features. The first contains the
extractor
functions; the second contains the feature
list. Load the extractors first and then the feature list. Note:
the extractor file is very large; I do not recommend adding
to it. If you want to write more extractors, put them in your own
file.
This is the set of all data
points that were submitted. Again, this file is huge. Don't even
try to open it in Allegro, just load it. It sets the variable
*data-points*
to the data point list.
Format Specifications
First of all, make sure you are conforming to the feature and data
point specifications given above. A feature should be a list
containing name, description string, domain, and extractor function. A
data point should be an association list between features and domain
elements and a few special attributes (including the one you want to
classify on which should take the values T and NIL only).
Decision tree learner
Your decision tree learner should use the following interface:
(defun my-dectree-learner (data-points features key-attr &optional
(guess T)))
where the arguments are defined as
- data-points
- A list of data points in the format described above. Each data
point should have had all the features in the feature list run on it
already and the result added as a new association from the feature
name to the domain element.
- features
- A list of features in the format described above. All the features
should already have been applied to the data points (and the result
recorded in the points).
- key-attr
- The attribute which the decision tree should classify. This should
be a special feature in the data-points which takes on only the values
T and NIL.
- guess
- An initial guess at whether an arbitrary data point is T or NIL
(used initially and recursively to decide the classification in case
of a tie or a lack of training examples)
The learner should return a decision tree.
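As a rough illustration of the interface, here is a minimal ID3-style sketch (using the tree representation described in the next section). The entropy and majority helpers here are self-contained stand-ins, not the information-theoretic code provided in dectree.lsp:

```lisp
(defun feature-name (feature) (first feature))
(defun feature-domain (feature) (third feature))
(defun point-value (point attr) (cdr (assoc attr point)))

;; T or NIL, whichever label is more common; GUESS breaks ties.
(defun majority (points key-attr guess)
  (let ((pos (count-if (lambda (p) (point-value p key-attr)) points))
        (neg (count-if-not (lambda (p) (point-value p key-attr)) points)))
    (cond ((> pos neg) T) ((> neg pos) NIL) (t guess))))

;; Entropy (in bits) of the key attribute over POINTS.
(defun entropy (points key-attr)
  (let* ((n (length points))
         (pos (count-if (lambda (p) (point-value p key-attr)) points)))
    (if (or (zerop n) (zerop pos) (= pos n))
        0.0
        (let ((p (/ pos n)) (q (/ (- n pos) n)))
          (- 0 (* p (log p 2)) (* q (log q 2)))))))

;; Alist from each domain value of FEATURE to the points taking it.
(defun split-on (points feature)
  (mapcar (lambda (v)
            (cons v (remove-if-not
                     (lambda (p)
                       (eq (point-value p (feature-name feature)) v))
                     points)))
          (feature-domain feature)))

;; Expected entropy remaining after splitting on FEATURE.
(defun remainder-after (points feature key-attr)
  (loop for (v . subset) in (split-on points feature)
        sum (* (/ (length subset) (max 1 (length points)))
               (entropy subset key-attr))))

(defun my-dectree-learner (data-points features key-attr &optional (guess T))
  (cond ((null data-points) guess)
        ;; All points agree: return that label as a leaf.
        ((zerop (entropy data-points key-attr))
         (point-value (first data-points) key-attr))
        ((null features) (majority data-points key-attr guess))
        ;; Otherwise branch on the lowest-remainder (highest-gain) feature.
        (t (let* ((best (first (sort (copy-list features) #'<
                                     :key (lambda (f)
                                            (remainder-after
                                             data-points f key-attr)))))
                  (default (majority data-points key-attr guess)))
             (cons best
                   (mapcar (lambda (pair)
                             (cons (car pair)
                                   (my-dectree-learner
                                    (cdr pair) (remove best features)
                                    key-attr default)))
                           (split-on data-points best)))))))
```

Note that this sketch looks features up by name in the data point's association list, which is why all the features must already have been run on every point before the learner is called.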
Decision tree
A node in a decision tree is either a branch point (defined in a
second) or a leaf. Leaves are either T (the data point falls into the
category classified) or NIL (the point does not fall in the category).
A branch point is a cons. The car is a feature (a list containing
name, description, domain, and extractor); the cdr is an association
list which has an entry for each element of the feature's domain. The
entries are themselves decision trees (either leaves or further branch
points).
Sample decision tree
Here's a decision tree for the restaurant problem:
'((:PATRONS "Is there anyone in the restaurant?" (:FULL :SOME :NONE)) .
((:FULL . ((:HUNGRY "Are you hungry?" (:Y :N)) .
((:Y . ((:TYPE "What type of restaurant is it?" (:FRENCH :THAI :BURGER :ITALIAN)) .
((:FRENCH . NIL)
(:THAI . ((:FRI-SAT "Is it a Friday or Saturday evening?" (:Y :N)) .
((:Y . T)
(:N . NIL))))
(:BURGER . T)
(:ITALIAN . NIL))))
(:N . NIL))))
(:SOME . T)
(:NONE . NIL)))
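A walk over this representation is all a runner needs to do (this is a sketch, not the provided decision tree runner; it assumes, as above, that the feature's value has already been recorded in the data point under the feature's name):

```lisp
;; Classify one data point with a decision tree in the representation
;; above: a leaf is T or NIL; a branch point is a cons of a feature
;; and an alist from domain values to subtrees.
(defun classify (tree data-point)
  (if (atom tree)
      tree                              ; leaf: T or NIL
      (let* ((feature (car tree))
             (value (cdr (assoc (first feature) data-point))))
        (classify (cdr (assoc value (cdr tree))) data-point))))
```

For instance, running this on the tree above with a point whose :PATRONS value is :SOME yields T.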
Helper Functions
Use this loader
file to load the helper code and features/data-points in the
correct order. Just make sure you have all the relevant files and then
load the loader file (but fix the directory in the file to point to your
own directory).
dectree.lsp
contains some of the nasty information theoretic code you'll need for
your decision tree learner and some other framework functions. It
should be loaded before the other files.
For debugging purposes, here's a visualization
tool that graphically displays your decision trees. Many thanks to
Shawna for the visualization! We also have two sample domains with
dummy features, the mushroom
domain and the restaurant
domain.
Here's a decision
tree runner; this is what we will use to run your decision tree.
There is also some new helper code in the features
and data
points files.