Homework #3: Classified Ads
Your assignment is to write a classifier which, given the IMG tag of an
image (including, among other things, the URL of the image and ALT
text), determines whether the image is an advertisement. We encourage
you to work in groups of up to three people for this assignment (but
note that the number of features and examples is per person!).
The assignment is broken into two parts which you will submit
separately.
Useful Files
The following files are linked elsewhere on the page, but you might
as well have one coalesced list of the useful stuff:
Part 1: Picking Features and Examples
This part of the assignment is due Monday, March 1st at 5PM.
First, select images from a random distribution, collect them, and tag
them as ads or not ads.
Then, select a set of image features over which to learn, and write
functions to extract the features given the IMG tag.
As in the search assignment, we will use calling conventions to buffer
the input and features from the learning engine. Here's a more
complete description of this part with explicit format information:
- Training set - select the training set from the "random"
distribution provided by http://www.uroulette.com. When you
reach the random page, open the HTML source of the page and collect
each IMG tag as a data point (for pages with frames, open the source
of each frame). Each individual should collect a total of at least 25
data points. If the same IMG tag is used twice on a given page, just
list it once as a data point, but make note of the repetition in the
data for that image. Treat separate frames as separate pages. To keep
things somewhat random, use only the first-level random link URouLette
gives you, and use every page you can (try not to skip pages).
- Data points - each data point is an association list. The list
should associate the following symbols with the data listed:
- :IMG - the IMG tag of this image as it appears on the page.
- :NUM - the number of times the tag appears on the page.
- :URL - the URL of the page on which the image occurs.
- :AD - either T if the image is an ad or NIL if it is
not. You may not use this field in calculating any feature!
Note: use your discretion to decide what is and what is not an ad.
- Features - a feature is some quality of the image which you can
measure. Each feature should have a symbolic name, an English
description, a domain of possible values, and an extraction
function. You will add your feature to the data point by associating
the symbolic name of the feature with the feature's domain value for
the data point. A feature will be represented by a list which contains
the following elements (in the order listed):
- symbolic name - starts with ':', then your login id and a
dash, then a descriptive identifier for the feature which is different
from all your other features' identifiers.
- description - string containing an English description of
the feature.
- domain - list of possible values; should list all possible
values the feature might take on. Each value should be a symbol
starting with a ':'.
- extractor - function which should take just one argument, a
data point (as described above), and return the value from the feature
domain to which the given data point corresponds.
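To make the four-element format concrete, here is one hypothetical feature in that shape (the login id "wolf" and the ALT-attribute check are just for illustration; note the crude substring test would also match "alt" appearing anywhere in the tag):

```lisp
;; Hypothetical example feature: does the IMG tag mention ALT?
;; (A simple case-insensitive substring search -- it will also match
;; "alt" occurring inside a URL or other attribute.)
(setf sample-feature
      (list :WOLF-HAS-ALT
            "Does the IMG tag contain an ALT attribute?"
            '(:YES :NO)
            #'(lambda (data-point)
                (if (search "alt" (cdr (assoc :IMG data-point))
                            :test #'char-equal)
                    :YES
                    :NO))))
```

The extractor takes a whole data point (the association list), so it is free to look at the :IMG tag, the :URL, or the :NUM count, but never the :AD field.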
Note: there are lots of difficulties with gathering the data: what do
you do with broken images? What about automatically linked images
(e.g., linked with JavaScript)? How should you count multiple copies
of the same image link with different whitespace?
Make the best decision that you can under these circumstances, but
explain the decision. For instance, you can't really tell whether a
broken image is an ad, so leave it out, and automatically generated
links may be partially specified, so leave them out. Post to the
newsgroup to discuss other problems!!
Part 1 Submission
Each individual should gather at least 25 data points and produce at
least 3 features. You will submit your assignment electronically to
the "\\howl\Submit473" share we used for the first
assignment. Submission details coming soon (but basically, you will
just submit the data points and features including the feature
extraction code, and, as always, insightful answers to my questions).
Questions:
- Different people have different concepts of what an ad is. What's
yours, and why is that a sensible definition?
- If you really wanted to snip ads from your browsing, would you use
URouLette to gather the images? What would be a better way to gather
training examples?
- For each of your features, explain why you chose it, and why you
think it might help discriminate the examples.
Note on the final question: I'm not testing you in this part on
whether the features are actually useful, so try one or two really
wacky features and try to think how they might correlate to whether
the image is an ad!
Part 1 example
Here's an example of a page of data points and a feature. The page is
www.uroulette.com, and the feature determines whether there is one,
several, or many occurrences of the image on a page. Note the escaped
quotation marks; you'll have to do the same!
Data points:
(setf data-points
'(((:IMG . "<IMG SRC=\"images/spinner.gif\" WIDTH=432 HEIGHT=254 BORDER=0>")
(:NUM . 1)
(:URL . "http://www.uroulette.com/")
(:AD . NIL))
((:IMG . "<IMG SRC=\"images/uroulette.gif\">")
(:NUM . 1)
(:URL . "http://www.uroulette.com/")
(:AD . NIL))
((:IMG . "<img width=440 height=40 border=1 ismap
alt=\"LinkExchange\"
src=\"http://ad.linkexchange.com/1/X515989/logoshowad?free\">")
(:NUM . 1)
(:URL . "http://www.uroulette.com/")
(:AD . T))))
Feature:
;; Return the domain element corresponding to the first border check
;; that succeeds (assumes the domain is one longer than borders, so
;; there is a default domain value when all checks fail).
(defun extract-from-list (value borders domain comparator)
(if (null borders)
(car domain)
(if (funcall comparator value (car borders))
(car domain)
(extract-from-list value (cdr borders) (cdr domain) comparator))))
;; Categorize the number of occurrences as ONE, SEVERAL, or MANY.
(defun num-extractor (data-point)
(let ((ranges '(1 5))
(values '(:ONE :SEVERAL :MANY)))
;; The "assoc" pulls out the number of occurrences from the data
;; point.
(extract-from-list (cdr (assoc :NUM data-point)) ranges values
#'(lambda (x y) (<= x y)))))
(setf features
`((:WOLF-FREQUENCY "Classifies the number of occurrences into discrete
qualitative categories, putting a value with infinite
domain into a small, finite domain."
(:ONE :SEVERAL :MANY)
,#'num-extractor)))
Part 2: Building and Running the Decision Tree
This portion will be due Monday, March 8.
For this part of the assignment you will create a general purpose
decision tree learner which you will then apply to the examples and
features gathered by you and your classmates. We provide you with
a core set of features which you must use in your decision tree along
with a set of helper functions which should make constructing the
decision tree much simpler. We also provide a description of the
format of the output (the decision tree itself).
Note: you do not have to use all your features from the first part,
and you can add new ones!
We will produce a learner which uses only the core attributes (and
will not be at all fancy). After the learners have been submitted, we
will run each group's learner on a new set of data and measure the
accuracy. Scores will be based on percent of correct identifications
(false positives and false negatives will be counted the same). Any
learner that beats ours will receive bonus points. There will be
prizes for the top learners!
Required Features and Given Data Points
There are two files you need for the features. The first contains the
extractor
functions; the second contains the feature
list. Load the extractors first and then the feature list. Note:
the extractor file is very large; I do not recommend adding
to it. If you want to write more extractors, put them in your own
file.
This is the set of all data
points that were submitted. Again, this file is huge. Don't even
try to open it in Allegro, just load it. It sets the variable
*data-points*
to the data point list.
Format Specifications
First of all, make sure you are conforming to the feature and data
point specifications given above. A feature should be a list
containing name, description string, domain, and extractor function. A
data point should be an association list between features and domain
elements and a few special attributes (including the one you want to
classify on which should take the values T and NIL only).
Decision tree learner
Your decision tree learner should use the following interface:
(defun my-dectree-learner (data-points features key-attr &optional
(guess T)))
where the arguments are defined as
- data-points
- A list of data points in the format described above. Each data
point should have had all the features in the feature list run on it
already and the result added as a new association from the feature
name to the domain element.
- features
- A list of features in the format described above. All the features
should already have been applied to the data points (and the result
recorded in the points).
- key-attr
- The attribute which the decision tree should classify. This should
be a special feature in the data-points which takes on only the values
T and NIL.
- guess
- An initial guess at whether an arbitrary data point is T or NIL
(used initially and recursively to decide the classification in case
of a tie or a lack of training examples)
The learner should return a decision tree.
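As a rough illustration of the interface, here is a minimal ID3-style sketch (using the tree representation described in the next section). The entropy and majority helpers here are self-contained stand-ins, not the information-theoretic code provided in dectree.lsp:

```lisp
(defun feature-name (feature) (first feature))
(defun feature-domain (feature) (third feature))
(defun point-value (point attr) (cdr (assoc attr point)))

;; T or NIL, whichever label is more common; GUESS breaks ties.
(defun majority (points key-attr guess)
  (let ((pos (count-if (lambda (p) (point-value p key-attr)) points))
        (neg (count-if-not (lambda (p) (point-value p key-attr)) points)))
    (cond ((> pos neg) T) ((> neg pos) NIL) (t guess))))

;; Entropy (in bits) of the key attribute over POINTS.
(defun entropy (points key-attr)
  (let* ((n (length points))
         (pos (count-if (lambda (p) (point-value p key-attr)) points)))
    (if (or (zerop n) (zerop pos) (= pos n))
        0.0
        (let ((p (/ pos n)) (q (/ (- n pos) n)))
          (- 0 (* p (log p 2)) (* q (log q 2)))))))

;; Alist from each domain value of FEATURE to the points taking it.
(defun split-on (points feature)
  (mapcar (lambda (v)
            (cons v (remove-if-not
                     (lambda (p)
                       (eq (point-value p (feature-name feature)) v))
                     points)))
          (feature-domain feature)))

;; Expected entropy remaining after splitting on FEATURE.
(defun remainder-after (points feature key-attr)
  (loop for (v . subset) in (split-on points feature)
        sum (* (/ (length subset) (max 1 (length points)))
               (entropy subset key-attr))))

(defun my-dectree-learner (data-points features key-attr &optional (guess T))
  (cond ((null data-points) guess)
        ;; All points agree: return that label as a leaf.
        ((zerop (entropy data-points key-attr))
         (point-value (first data-points) key-attr))
        ((null features) (majority data-points key-attr guess))
        ;; Otherwise branch on the lowest-remainder (highest-gain) feature.
        (t (let* ((best (first (sort (copy-list features) #'<
                                     :key (lambda (f)
                                            (remainder-after
                                             data-points f key-attr)))))
                  (default (majority data-points key-attr guess)))
             (cons best
                   (mapcar (lambda (pair)
                             (cons (car pair)
                                   (my-dectree-learner
                                    (cdr pair) (remove best features)
                                    key-attr default)))
                           (split-on data-points best)))))))
```

Note that this sketch looks features up by name in the data point's association list, which is why all the features must already have been run on every point before the learner is called.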
Decision tree
A node in a decision tree is either a branch point (defined in a
second) or a leaf. Leaves are either T (the data point falls into the
category classified) or NIL (the point does not fall in the category).
A branch point is a cons. The car is a feature (a list containing
name, description, domain, and extractor); the cdr is an association
list which has an entry for each element of the feature's domain. The
entries are themselves decision trees (either leaves or further branch
points).
Sample decision tree
Here's a decision tree for the restaurant problem:
'((:PATRONS "Is there anyone in the restaurant?" (:FULL :SOME :NONE)) .
((:FULL . ((:HUNGRY "Are you hungry?" (:Y :N)) .
((:Y . ((:TYPE "What type of restaurant is it?" (:FRENCH :THAI :BURGER :ITALIAN)) .
((:FRENCH . NIL)
(:THAI . ((:FRI-SAT "Is it a Friday or Saturday evening?" (:Y :N)) .
((:Y . T)
(:N . NIL))))
(:BURGER . T)
(:ITALIAN . NIL))))
(:N . NIL))))
(:SOME . T)
(:NONE . NIL)))
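A walk over this representation is all a runner needs to do (this is a sketch, not the provided decision tree runner; it assumes, as above, that the feature's value has already been recorded in the data point under the feature's name):

```lisp
;; Classify one data point with a decision tree in the representation
;; above: a leaf is T or NIL; a branch point is a cons of a feature
;; and an alist from domain values to subtrees.
(defun classify (tree data-point)
  (if (atom tree)
      tree                              ; leaf: T or NIL
      (let* ((feature (car tree))
             (value (cdr (assoc (first feature) data-point))))
        (classify (cdr (assoc value (cdr tree))) data-point))))
```

For instance, running this on the tree above with a point whose :PATRONS value is :SOME yields T.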
Helper Functions
Use this loader
file to load the helper code and features/data-points in the
correct order. Just make sure you have all the relevant files and then
load the loader file (but fix the directory in the file to point to your
own directory).
dectree.lsp
contains some of the nasty information theoretic code you'll need for
your decision tree learner and some other framework functions. It
should be loaded before the other files.
For debugging purposes, here's a visualization
tool that graphically displays your decision trees. Many thanks to
Shawna for the visualization! We also have two sample domains with
dummy features, the mushroom
domain and the restaurant
domain.
Here's a decision
tree runner; this is what we will use to run your decision tree.
There is also some new helper code in the features
and data
points files.