
Project 3: e2undel

Out: Monday, November 27, 2011
Due Date: Wednesday, December 7, 11:59pm


Download

project3.tar.gz

Assignment Goals


Project Introduction

This assignment simulates the following common situation: you've got some incredibly valuable file (say a photo of the sasquatch you met on a recent hike) that you manage to delete by pressing the right camera button at the wrong time. It would be handy for situations like that to have a program that can undelete files. That's our goal, to write a program that can undelete.

Implementations of undelete are very file system-type specific. We're going to write undelete for the ext2 file system. (Cameras typically use fat32.) The reason is that the file system layout in ext2 is very similar to that of its successors, ext3 and ext4, so it's pertinent to today's systems even if no one is using ext2 any longer. (undelete is not possible for ext3 and ext4, or at least not without applying a lot of complicated heuristics that make the job a lot bigger than a 1.5 week assignment.)

This project leaves almost everything up to you. The coding portion is small; figuring out what to do is about half the assignment; getting it implemented is the other half; the final report is the third half (but a small half). Details of what to do follow.


The Scenario

When you delete an ext2 file, the directory entry for that file is removed, and the inode and data blocks are marked as free. If you were to list all files on the file system, the deleted file would no longer appear. However, the contents of the deleted file are still on disk, at least for a while, because the data blocks that were used to hold them are not wiped clean when the file is deleted. So long as subsequent file activity does not allocate those same blocks for some other use, the data just sits there. Similar comments apply to the deleted file's inode.

For this assignment you're provided with an ext2 file system in which some files have been deleted. You're asked to recover those files. For two of the file systems we distribute with the assignment (filesysFile-easy and filesysFile-medium), the files were deleted and then no further activity of any sort took place on that file system. (The third file system is a more complicated scenario, discussed a bit later.) As a practical matter, your code should easily be able to recover the deleted files from those two file systems. One can imagine writing code that is more robust than what is required by this assignment, for instance, recognizing that only a portion of a deleted file can be recovered, and that a portion is permanently lost. It's great if you want to try implementing those sorts of heuristics for this assignment, but it's not required - you can make a simple, "best effort" to recover deleted files and leave it at that.


Distribution of ext2 File Systems

An issue that comes up in designing this project is how to distribute a test ext2 file system with deleted files. Do I hand out hard drives, or USB drives, or SDHC cards?

It turns out that the basic Linux philosophy of unifying concepts makes this easy: a regular file can be treated as a device. So, the test ext2 file systems are distributed as files. You should think of them as hardware, e.g., a hard disk partition. Because they're files, though, you can read them using standard file operations. (Also, it doesn't matter what kind of file system these files are stored in -- their contents are ext2 file systems, but they're just regular old files.)

You need to do "raw IO" on the contents of the file. That means you should use the open, lseek, read, write, and close library routines, operating on "the file system file", to deal with it. To get the man pages for those routines, you may need to specify the man section number, because those names are fairly common. For instance:

$ man 2 read
Note that the .h files required to use those methods are listed at the tops of their man pages.

The file fileIOExample.c in the distribution provides an example that uses these file operations.

Note that while you will be using a system .h file to define some useful symbolic names (e.g., fields of an inode), you won't be using any of the (non-macro) functions declared in those .h files. That is, you won't be using any ext2-specific library routines to read ext2 metadata. All your IO should be done using generic open, lseek, read, and close calls.
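For orientation, here's a minimal sketch in that style; fileIOExample.c in the distribution is the authoritative example, and the offset and buffer size below were picked just for illustration:

#include <fcntl.h>      /* open */
#include <unistd.h>     /* lseek, read, close */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s filesysFile\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    /* Open the file system file for raw, byte-oriented access. */
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    /* Seek to an absolute byte offset and read some bytes. 1024 happens
       to be where the superblock starts, but the point here is just the
       open/lseek/read pattern. */
    unsigned char buf[1024];
    if (lseek(fd, 1024, SEEK_SET) == (off_t)-1) {
        perror("lseek");
        exit(EXIT_FAILURE);
    }
    ssize_t n = read(fd, buf, sizeof buf);
    if (n < 0) {
        perror("read");
        exit(EXIT_FAILURE);
    }
    printf("read %zd bytes at offset 1024\n", n);

    close(fd);
    return 0;
}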


What Is An ext2 File System?

Let's talk using device terminology (and let's talk only about the common case, ignoring the many generalizations that have been implemented). A partition is a fixed-size piece of a disk device, for example, 200GB of disk storage. You format a disk to create a partition table, which describes how many partitions there are on it and how big they are.

Once you have a partition, you initialize a file system on it. That initialization sets up the on-disk data structures the file system implementation uses to keep track of files and their contents. Those structures are things like inodes and directories. Once initialized, the file system contains only the root directory, and is ready for use: creating other files, opening them, reading their contents, and the like.

So, part of what defines a file system is the decision about how to use the partition's storage to lay out the file system's on-disk data structures. That's the only aspect we're concerned with in this assignment.

The Second Extended File System web page describes exactly what the data structures are that ext2 uses, and exactly how they're located on the available storage. Where the document talks about "the volume", you should think "the file system file".

The Disk Organization section of that document describes the disk layout. You might skim the couple of examples at the beginning of that section to get a general sense of what everything up to now has meant, before going through the document in more detail.

The document contains various symbolic names for things (e.g., i_atime). Nearly all of these names are names you can use in your program - they're fields of structs defined in ext2fs.h. I did encounter some isolated differences, though, and those are noted in a later section.


Programming Details

ext2's on-disk structures can be represented in main memory by C structs. For example, a struct ext2_inode is laid out in memory exactly like an inode is on disk.

Definitions for those structs are found in include files that are located (usually, depending on your system) in /usr/include/ext2fs/. File ext2fs.h is the main include, and the only one you need to explicitly include in your code:

#include <ext2fs/ext2fs.h>
It includes some of the other files in that directory, though, so if you want to have a look at a struct definition, you might have to snoop around a bit.
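If you want a quick check that the headers are installed and visible to your compiler, a throwaway program like this (not part of the distribution) should compile and run; in the classic ext2 layout the sizes printed should be 1024 and 128 bytes:

#include <stdio.h>
#include <ext2fs/ext2fs.h>

int main(void) {
    /* If these sizes look wildly different, check which headers the
       compiler is actually picking up. */
    printf("sizeof(struct ext2_super_block) = %zu\n",
           sizeof(struct ext2_super_block));
    printf("sizeof(struct ext2_inode)       = %zu\n",
           sizeof(struct ext2_inode));
    return 0;
}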
The 451 and CSE home Linux VMs don't have the include files needed for Project 3 installed. The symptom will be a file-not-found error on the C code line:

#include <ext2fs/ext2fs.h>

To fix this on the 451 or CSE VMs:

$ sudo yum install e2fsprogs-devel

Other systems: install the equivalent ext2 development headers using your distribution's package manager (the package name varies by distribution).

Note: You will have to give your VM access to the Internet while doing this. In the lower-right margin of the VMware Player window, click on the network icon. Switch from "host only" to NAT networking. Click again and select Connect. Reboot your Linux machine. (When you're done installing the package, it's recommended that you return the machine to host-only networking.)

Other details: your executable should be named e2undel, it takes the name of the file system file as a command line argument, and it should put the files it recovers in a subdirectory named recoveredFiles/ (see What to Turn In, below).


How Do I Start?

I recommend this:
  1. Look at The Second Extended File System, then the contents of /usr/include/ext2fs/ext2fs.h and neighboring include files, and repeat for a while.

  2. Now write some pseudo-code describing what you're going to do to find and recover deleted files.

  3. Now start writing C code.
Note that the two sources just mentioned agree on just about all names. The only ones I encountered that differed were like this: the web page says EXT2_S_IFDIR but the include file says LINUX_S_IFDIR. Also, the include file defines a useful macro, LINUX_S_ISDIR(), that is not mentioned in the web page.
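For instance, a tiny helper using that macro might look like this (the function name is mine, not something from the headers):

#include <ext2fs/ext2fs.h>

/* Nonzero if this inode's i_mode says "directory". The web page calls
   the underlying flag EXT2_S_IFDIR; the headers call it LINUX_S_IFDIR,
   and LINUX_S_ISDIR() does the mode test for you. */
int inode_is_dir(const struct ext2_inode *ino) {
    return LINUX_S_ISDIR(ino->i_mode);
}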

Also note that ext2fs.h establishes defined constants SUPERBLOCK_OFFSET and SUPERBLOCK_SIZE, not mentioned on the web page. I'd use them. (I did use them.)
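As an unofficial sketch of how those constants get used, reading the superblock and deriving a couple of layout parameters might look roughly like this (error handling is abbreviated, and the derived quantities follow from superblock fields described on the web page):

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <ext2fs/ext2fs.h>

int main(int argc, char *argv[]) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s filesysFile\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); exit(EXIT_FAILURE); }

    /* The superblock starts SUPERBLOCK_OFFSET (1024) bytes into the volume
       and is SUPERBLOCK_SIZE bytes long; it maps onto struct ext2_super_block. */
    struct ext2_super_block sb;
    lseek(fd, SUPERBLOCK_OFFSET, SEEK_SET);
    if (read(fd, &sb, SUPERBLOCK_SIZE) != SUPERBLOCK_SIZE) {
        fprintf(stderr, "short read of superblock\n");
        exit(EXIT_FAILURE);
    }

    /* A couple of layout parameters that everything else depends on. */
    unsigned int blockSize = 1024 << sb.s_log_block_size;
    unsigned int nGroups =
        (sb.s_blocks_count - sb.s_first_data_block + sb.s_blocks_per_group - 1)
        / sb.s_blocks_per_group;

    printf("inodes %u, blocks %u, block size %u, block groups %u\n",
           sb.s_inodes_count, sb.s_blocks_count, blockSize, nGroups);

    close(fd);
    return 0;
}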


Interesting Usage Notes

  1. The distribution includes a bash script, mkFilesysFile.sh, that creates a file system file. You don't absolutely need to do that -- the distribution includes already prepared file system files -- but it might help with debugging if you can create your own test cases.

    The script's contents are:

    
    #!/bin/bash                                                                     
    
    echo args should be: filename size_in_1KB_blocks
    
    dd if=/dev/zero bs=1k count=$2 of=$1
    mke2fs -F $1
    tune2fs -c0 -i0 $1
    mkdir -p mnt
    sudo mount -o loop $1 mnt
    
    echo "'sudo umount mnt'" to unmount
    
    Sample usage to create a 32MB file system file: $ ./mkFilesysFile.sh mytestfile 32768

    The dd command copies bytes from an artificial device that supplies any number of zeroes to an output file.
    The mke2fs command initializes an ext2 file system on "the volume" (file).
    The tune2fs command prevents the system from running fsck on the volume, ever.
    The mkdir command makes sure you have a directory named mnt.
    The sudo mount command mounts the ext2 file system on top of subdirectory mnt. If you cd mnt and create some files, you're creating them in the ext2 file system. You can delete some as well. Note that you must be able to acquire root privilege to mount. You won't be able to do this on a shared CSE machine. (You can use a shared CSE machine for debugging against the supplied test files, though.) You will have root on your own machine, or a VM.
    The final echo command tells you how to unmount the volume.

    One more detail: To create your own test file system, use the script, then create some files, then unmount. Then remount (by typing the mount command, as in the script), delete the files, and unmount. That reliably flushes the files you created to the volume before they get deleted. If they're not flushed, there's a chance the data for the files never makes it onto the volume, so the recovered files look mangled.

  2. The dumpe2fs program prints some information that might be useful as you debug your code. To use it,
    $ dumpe2fs filesysFile-test

    $ man dumpe2fs

  3. I suggest you start by implementing some basic access routines, before implementing undelete itself. Debug those access routines by printing enough debugging information that you can compare what your code thinks the file system looks like to what dumpe2fs thinks it looks like (and then assume that dumpe2fs is right). A sketch of one such routine follows this list.
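Here is the flavor of access routine I mean, as a sketch only; the function name and the use of globals are illustrative, not requirements:

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ext2fs/ext2fs.h>

/* Illustrative globals: set after opening the volume and reading the superblock. */
static int fsFd;                /* fd for the file system file */
static unsigned int blockSize;  /* 1024 << sb.s_log_block_size  */

/* Read file system block 'blockNo' into 'buf' (at least blockSize bytes).
   A block number becomes a byte offset just by multiplying by the block size. */
static void readBlock(unsigned int blockNo, void *buf) {
    off_t offset = (off_t)blockNo * blockSize;
    if (lseek(fsFd, offset, SEEK_SET) == (off_t)-1 ||
        read(fsFd, buf, blockSize) != (ssize_t)blockSize) {
        fprintf(stderr, "readBlock(%u) failed\n", blockNo);
        exit(EXIT_FAILURE);
    }
}

int main(int argc, char *argv[]) {
    if (argc != 2) { fprintf(stderr, "usage: %s filesysFile\n", argv[0]); exit(1); }
    fsFd = open(argv[1], O_RDONLY);
    if (fsFd < 0) { perror("open"); exit(1); }

    struct ext2_super_block sb;
    lseek(fsFd, SUPERBLOCK_OFFSET, SEEK_SET);
    if (read(fsFd, &sb, SUPERBLOCK_SIZE) != SUPERBLOCK_SIZE) { perror("read"); exit(1); }
    blockSize = 1024 << sb.s_log_block_size;

    /* The group descriptor table starts in the block just after the one
       holding the superblock (block 2 for 1KB blocks, block 1 otherwise). */
    unsigned char *buf = malloc(blockSize);
    readBlock(sb.s_first_data_block + 1, buf);

    struct ext2_group_desc gd;
    memcpy(&gd, buf, sizeof gd);

    /* These numbers should match what dumpe2fs prints for group 0. */
    printf("group 0: block bitmap %u, inode bitmap %u, inode table %u, "
           "free blocks %u, free inodes %u\n",
           gd.bg_block_bitmap, gd.bg_inode_bitmap, gd.bg_inode_table,
           gd.bg_free_blocks_count, gd.bg_free_inodes_count);

    free(buf);
    close(fsFd);
    return 0;
}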

The Test Files

filesysFile-easy is a 1MB file system that has had a single file written to it and then deleted. The file's contents are:
This deleted file has one line of text, ending with this period.
Obviously, it's intended for debugging. (The file contents actually end with a 0x0a character.)

filesysFile-medium is a 64MB file system that has had many files written to it and then some deleted. It shouldn't cause any serious problems for any program that successfully recovered the filesysFile-easy file. It's not ideal to begin debugging on, though, simply because it's larger, and so more tedious to track down things that go wrong.

Your code should successfully recover all deleted files from those two file systems (no false negatives). It should also not recover anything that wasn't in fact deleted (no false positives). For the first test file system, you know what you're trying to recover. For the second, the hint is that all the files are either some viewable multimedia file or else simple text. That means all these files are viewable, one way or another. (The file command applies heuristics to try to determine the format of a file's contents.)

filesysFile-hard is a 512MB file system. Unlike the previous two cases, activity here didn't end immediately when files were deleted -- there may have been additional file creations and deletions after the first deletions. You have two primary goals and one secondary goal for this file system:

  1. (Primary) Don't crash.
  2. (Primary) Recover some files that you're certain were once files on the file system.
  3. (Secondary) Don't recover something that you know is corrupted (i.e., unrecoverable as an exact copy of a file that once existed).
The secondary goal is secondary because it's hard to define just what is required. There are unlimited heuristics (of unlimited complexity) that could be applied. Your code should recognize at least some cases where what it's recovering can't be a file. When it does, it shouldn't leave a recovered file in recoveredFiles/ - assume we want either a correct file or nothing, not a mix of correct and incorrect recovered content. At the same time, it's okay to recover some things that were never files - maybe even a lot of them. Your decision about whether or not it's a file should be based solely on the file system's metadata, and not on the contents of the potential file. (For instance, don't use a jpeg library to try to parse the contents to figure out if it's a valid jpeg file.) Don't try to do something you can't get done in 10 days.
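To make the "metadata only" point concrete, here is one flavor of check, written as a sketch under some assumptions: you've already read the superblock and each block group's block bitmap into memory, and the helper names are mine. The idea is that a deleted file's data blocks should still be marked free; if one has since been reallocated, the recovered contents can't be trusted.

#include <ext2fs/ext2fs.h>

/* Is this block number plausible for a recoverable deleted file?
   'blockBitmaps[g]' is the block bitmap you've read for group g. */
static int blockLooksRecoverable(const struct ext2_super_block *sb,
                                 unsigned char **blockBitmaps,
                                 unsigned int blockNo) {
    if (blockNo == 0)
        return 1;   /* a hole / unused slot: nothing to recover, nothing wrong */
    if (blockNo < sb->s_first_data_block || blockNo >= sb->s_blocks_count)
        return 0;   /* not a legal block number */

    unsigned int idx   = blockNo - sb->s_first_data_block;
    unsigned int group = idx / sb->s_blocks_per_group;
    unsigned int bit   = idx % sb->s_blocks_per_group;

    /* Bit set => the block is currently allocated to something else. */
    return (blockBitmaps[group][bit / 8] & (1 << (bit % 8))) == 0;
}

/* Only the 15 block pointers stored in the inode itself are checked here;
   a fuller solution would follow the indirect blocks as well. */
int inodeLooksRecoverable(const struct ext2_super_block *sb,
                          unsigned char **blockBitmaps,
                          const struct ext2_inode *ino) {
    for (int i = 0; i < EXT2_N_BLOCKS; i++) {
        if (!blockLooksRecoverable(sb, blockBitmaps, ino->i_block[i]))
            return 0;
    }
    return 1;
}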

What to Turn In

Hand in a file with a name like zahorjan.tar.gz that contains everything needed to build your solution, as well as a makefile that will actually build it. You should also include a short report that indicates what your code does, and explains any problems/bugs that remain, if there are any.

We will do basically this:

$ tar xzf zahorjan.tar.gz
$ cd project3
$ make
$ ./e2undel /ourPath/ourTestFile
$ diff -r ./recoveredFiles/ /ourPath/ourRecoveredFiles/
You should make a strong attempt to ensure that this procedure will work on what you submit.