
Project 3: e2undel

Out: Monday, November 27, 2011
Due Date: Wednesday, December 7, 11:59pm


Download

project3.tar.gz

Assignment Goals


Project Introduction

This assignment simulates the following common situation: you've got some incredibly valuable file (say a photo of the sasquatch you met on a recent hike) that you manage to delete by pressing the right camera button at the wrong time. It would be handy for situations like that to have a program that can undelete files. That's our goal, to write a program that can undelete.

Implementations of undelete are very file system-type specific. We're going to write undelete for the ext2 file system. (Cameras typically use fat32.) The reason is that the file system layout in ext2 is very similar to that of its successors, ext3 and ext4, so it's pertinent to today's systems even if no one is using ext2 any longer. (undelete is not possible for ext3 and ext4, or at least not without applying a lot of complicated heuristics that make the job a lot bigger than a 1.5 week assignment.)

This project leaves almost everything up to you. The coding portion is small; figuring out what to do is about half the assignment; getting it implemented is the other half; the final report is the third half (but a small half). Details of what to do follow.


The Scenario

When you delete an ext2 file, the directory entry for that file is removed, and the inode and data blocks are marked as free. If you were to list all files on the file system, the deleted file would no longer appear. However, the contents of the deleted file are still on disk, at least for a while, because the data blocks that were used to hold them are not wiped clean when the file is deleted. So long as subsequent file activity does not allocate those same blocks for some other use, the data just sits there. Similar comments apply to the deleted file's inode.

For this assignment you're provided with an ext2 file system in which some files have been deleted. You're asked to recover those files. For two of the file systems we distribute with the assignment (filesysFile-easy and filesysFile-medium), the files were deleted and then no further activity of any sort took place on that file system. (The third file system is a more complicated scenario, discussed a bit later.) As a practical matter, your code should easily be able to recover the deleted files from those two file systems. One can imagine writing code that is more robust than what is required by this assignment, for instance, recognizing that only a portion of a deleted file can be recovered, and that a portion is permanently lost. It's great if you want to try implementing those sorts of heuristics for this assignment, but it's not required - you can make a simple, "best effort" to recover deleted files and leave it at that.


Distribution of ext2 File Systems

An issue that comes up in designing this project is how to distribute a test ext2 file system with deleted files. Do I hand out hard drives, or USB drives, or SDHC cards?

It turns out that the basic Linux philosophy of unifying concepts makes this easy: a regular file can be treated as a device. So, the test ext2 file systems are distributed as files. You should think of them as hardware, e.g., a hard disk partition. Because they're files, though, you can read them using standard file operations. (Also, it doesn't matter what kind of file system these files are stored in -- their contents are ext2 file systems, but they're just regular old files.)

You need to do "raw IO" on the contents of the file. That means you should use the open, lseek, read, write, and close library routines, operating on "the file system file", to deal with it. To get the man pages for those routines, you may need to specify the man section number, because those names are fairly common. For instance:

$ man 2 read
Note that the .h files required to use those methods are listed at the tops of their man pages.

The file fileIOExample.c in the distribution provides an example that uses these file operations.

Note that while you will be using a system .h file to define some useful symbolic names (e.g., fields of an inode), you won't be using any of the (non-macro) functions declared in those .h files. That is, you won't be using any ext2-specific library routines to read ext2 metadata. All your IO should be done using generic open, lseek, read, and close calls.
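For orientation, here's a minimal sketch in that style; fileIOExample.c in the distribution is the authoritative example, and the offset and buffer size below were picked just for illustration:

#include <fcntl.h>      /* open */
#include <unistd.h>     /* lseek, read, close */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s filesysFile\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    /* Open the file system file for raw, byte-oriented access. */
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    /* Seek to an absolute byte offset and read some bytes. 1024 happens
       to be where the superblock starts, but the point here is just the
       open/lseek/read pattern. */
    unsigned char buf[1024];
    if (lseek(fd, 1024, SEEK_SET) == (off_t)-1) {
        perror("lseek");
        exit(EXIT_FAILURE);
    }
    ssize_t n = read(fd, buf, sizeof buf);
    if (n < 0) {
        perror("read");
        exit(EXIT_FAILURE);
    }
    printf("read %zd bytes at offset 1024\n", n);

    close(fd);
    return 0;
}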


What Is An ext2 File System?

Let's talk using device terminology (and let's talk only about the common case, ignoring the many generalizations that have been implemented). A partition is a fixed-size piece of a disk device, for example, 200GB of disk storage. You format a disk to create a partition table, which describes how many partitions there are on it and how big they are.

Once you have a partition, you initialize a file system on it. That initialization sets up the on-disk data structures the file system implementation uses to keep track of files and their contents. Those structures are things like inodes and directories. Once initialized, the file system contains only the root directory, and is ready for use: creating other files, opening them, reading their contents, and the like.

So, part of what defines a file system is the decision about how to use the partition's storage to lay out the file system's on-disk data structures. That's the only aspect we're concerned with in this assignment.

The Second Extended File System web page describes exactly what the data structures are that ext2 uses, and exactly how they're located on the available storage. Where the document talks about "the volume", you should think "the file system file".

The Disk Organization section of that document describes the disk layout. You might skim the couple of examples at the beginning of that section to get a general sense of what everything up to now has meant, before going through the document in more detail.

The document contains various symbolic names for things (e.g., i_atime). Nearly all of these names are names you can use in your program - they're fields of structs defined in ext2fs.h. I did encounter some isolated differences, though, and those are noted in a later section.


Programming Details

ext2's on-disk structures can be represented in main memory by C structs. For example, a struct ext2_inode is laid out in memory exactly like an inode is on disk.

Definitions for those structs are found in include files that are located (usually, depending on your system) in /usr/include/ext2fs/. File ext2fs.h is the main include, and the only one you need to explicitly include in your code:

#include <ext2fs/ext2fs.h>
It includes some of the other files in that directory, though, so if you want to have a look at a struct definition, you might have to snoop around a bit.
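If you want a quick check that the headers are installed and visible to your compiler, a throwaway program like this (not part of the distribution) should compile and run; in the classic ext2 layout the sizes printed should be 1024 and 128 bytes:

#include <stdio.h>
#include <ext2fs/ext2fs.h>

int main(void) {
    /* If these sizes look wildly different, check which headers the
       compiler is actually picking up. */
    printf("sizeof(struct ext2_super_block) = %zu\n",
           sizeof(struct ext2_super_block));
    printf("sizeof(struct ext2_inode)       = %zu\n",
           sizeof(struct ext2_inode));
    return 0;
}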
The 451 and CSE home Linux VMs don't have the include files needed for Project 3 installed. The symptom will be a file-not-found error on the C code line:

#include <ext2fs/ext2fs.h>

To fix this on the 451 or CSE VMs:

$ sudo yum install e2fsprogs-devel

Other systems: install the equivalent ext2 development headers using your distribution's package manager (the package name varies by distribution).

Note: You will have to give your VM access to the Internet while doing this. In the lower-right margin of the VMware Player window, click on the network icon. Switch from "host only" to NAT networking. Click again and select Connect. Reboot your Linux machine. (When you're done installing the package, it's recommended that you return the machine to host-only networking.)

Other details: your executable should be named e2undel, it takes the name of the file system file as a command line argument, and it should put the files it recovers in a subdirectory named recoveredFiles/ (see What to Turn In, below).


How Do I Start?

I recommend this:
  1. Look at The Second Extended File System, then the contents of /usr/include/ext2fs/ext2fs.h and neighboring include files, and repeat for a while.

  2. Now write some pseudo-code describing what you're going to do to find and recover deleted files.

  3. Now start writing C code.
Note that the two sources just mentioned agree on just about all names. The only ones I encountered that differed were like this: the web page says EXT2_S_IFDIR but the include file says LINUX_S_IFDIR. Also, the include file defines a useful macro, LINUX_S_ISDIR(), that is not mentioned in the web page.
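For instance, a tiny helper using that macro might look like this (the function name is mine, not something from the headers):

#include <ext2fs/ext2fs.h>

/* Nonzero if this inode's i_mode says "directory". The web page calls
   the underlying flag EXT2_S_IFDIR; the headers call it LINUX_S_IFDIR,
   and LINUX_S_ISDIR() does the mode test for you. */
int inode_is_dir(const struct ext2_inode *ino) {
    return LINUX_S_ISDIR(ino->i_mode);
}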

Also note that ext2fs.h establishes defined constants SUPERBLOCK_OFFSET and SUPERBLOCK_SIZE, not mentioned on the web page. I'd use them. (I did use them.)
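As an unofficial sketch of how those constants get used, reading the superblock and deriving a couple of layout parameters might look roughly like this (error handling is abbreviated, and the derived quantities follow from superblock fields described on the web page):

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <ext2fs/ext2fs.h>

int main(int argc, char *argv[]) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s filesysFile\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); exit(EXIT_FAILURE); }

    /* The superblock starts SUPERBLOCK_OFFSET (1024) bytes into the volume
       and is SUPERBLOCK_SIZE bytes long; it maps onto struct ext2_super_block. */
    struct ext2_super_block sb;
    lseek(fd, SUPERBLOCK_OFFSET, SEEK_SET);
    if (read(fd, &sb, SUPERBLOCK_SIZE) != SUPERBLOCK_SIZE) {
        fprintf(stderr, "short read of superblock\n");
        exit(EXIT_FAILURE);
    }

    /* A couple of layout parameters that everything else depends on. */
    unsigned int blockSize = 1024 << sb.s_log_block_size;
    unsigned int nGroups =
        (sb.s_blocks_count - sb.s_first_data_block + sb.s_blocks_per_group - 1)
        / sb.s_blocks_per_group;

    printf("inodes %u, blocks %u, block size %u, block groups %u\n",
           sb.s_inodes_count, sb.s_blocks_count, blockSize, nGroups);

    close(fd);
    return 0;
}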


Interesting Usage Notes

  1. The distribution includes a bash script, mkFilesysFile.sh, that creates a file system file. You don't absolutely need to do that -- the distribution includes already prepared file system files -- but it might help with debugging if you can create your own test cases.

    The script's contents are:

    
    #!/bin/bash                                                                     
    
    echo args should be: filename size_in_1KB_blocks
    
    dd if=/dev/zero bs=1k count=$2 of=$1
    mke2fs -F $1
    tune2fs -c0 -i0 $1
    mkdir -p mnt
    sudo mount -o loop $1 mnt
    
    echo "'sudo umount mnt'" to unmount
    
    Sample usage to create a 32MB file system file: $ ./mkFilesysFile.sh mytestfile 32768

    The dd command copies bytes from an artificial device that supplies any number of zeroes to an output file.
    The mke2fs command initializes an ext2 file system on "the volume" (file).
    The tune2fs command prevents the system from running fsck on the volume, ever.
    The mkdir command makes sure you have a directory named mnt.
    The sudo mount command mounts the ext2 file system on top of subdirectory mnt. If you cd mnt and create some files, you're creating them in the ext2 file system. You can delete some as well. Note that you must be able to acquire root privilege to mount. You won't be able to do this on a shared CSE machine. (You can use a shared CSE machine for debugging against the supplied test files, though.) You will have root on your own machine, or a VM.
    The final echo command tells you how to unmount the volume.

    One more detail: To create your own test file system, use the script, then create some files, then unmount. Then remount (by typing the mount command, as in the script), delete the files, and unmount. That reliably flushes the files you created to the volume before they get deleted. If they're not flushed, there's a chance the data for the files never makes it onto the volume, so the recovered files look mangled.

  2. The dumpe2fs program prints some information that might be useful as you debug your code. To use it,
    $ dumpe2fs filesysFile-test

    $ man dumpe2fs

  3. I suggest you start by implementing some basic access routines, before implementing undelete itself. Debug those access routines by printing enough debugging information that you can compare what your code thinks the file system looks like to what dumpe2fs thinks it looks like (and then assume that dumpe2fs is right). A sketch of one such routine follows this list.
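Here is the flavor of access routine I mean, as a sketch only; the function name and the use of globals are illustrative, not requirements:

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ext2fs/ext2fs.h>

/* Illustrative globals: set after opening the volume and reading the superblock. */
static int fsFd;                /* fd for the file system file */
static unsigned int blockSize;  /* 1024 << sb.s_log_block_size  */

/* Read file system block 'blockNo' into 'buf' (at least blockSize bytes).
   A block number becomes a byte offset just by multiplying by the block size. */
static void readBlock(unsigned int blockNo, void *buf) {
    off_t offset = (off_t)blockNo * blockSize;
    if (lseek(fsFd, offset, SEEK_SET) == (off_t)-1 ||
        read(fsFd, buf, blockSize) != (ssize_t)blockSize) {
        fprintf(stderr, "readBlock(%u) failed\n", blockNo);
        exit(EXIT_FAILURE);
    }
}

int main(int argc, char *argv[]) {
    if (argc != 2) { fprintf(stderr, "usage: %s filesysFile\n", argv[0]); exit(1); }
    fsFd = open(argv[1], O_RDONLY);
    if (fsFd < 0) { perror("open"); exit(1); }

    struct ext2_super_block sb;
    lseek(fsFd, SUPERBLOCK_OFFSET, SEEK_SET);
    if (read(fsFd, &sb, SUPERBLOCK_SIZE) != SUPERBLOCK_SIZE) { perror("read"); exit(1); }
    blockSize = 1024 << sb.s_log_block_size;

    /* The group descriptor table starts in the block just after the one
       holding the superblock (block 2 for 1KB blocks, block 1 otherwise). */
    unsigned char *buf = malloc(blockSize);
    readBlock(sb.s_first_data_block + 1, buf);

    struct ext2_group_desc gd;
    memcpy(&gd, buf, sizeof gd);

    /* These numbers should match what dumpe2fs prints for group 0. */
    printf("group 0: block bitmap %u, inode bitmap %u, inode table %u, "
           "free blocks %u, free inodes %u\n",
           gd.bg_block_bitmap, gd.bg_inode_bitmap, gd.bg_inode_table,
           gd.bg_free_blocks_count, gd.bg_free_inodes_count);

    free(buf);
    close(fsFd);
    return 0;
}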

The Test Files

filesysFile-easy is a 1MB file system that has had a single file written to it and then deleted. The file's contents are:
This deleted file has one line of text, ending with this period.
Obviously, it's intended for debugging. (The file contents actually end with a 0x0a character.)

filesysFile-medium is a 64MB file system that has had many files written to it and then some deleted. It shouldn't cause any serious problems for any program that successfully recovered the filesysFile-easy file. It's not ideal to begin debugging on, though, simply because it's larger, and so more tedious to track down things that go wrong.

Your code should successfully recover all deleted files from those two file systems (no false negatives). It should also not recover anything that wasn't in fact deleted (no false positives). For the first test file system, you know what you're trying to recover. For the second, the hint is that all the files are either some viewable multimedia file or else simple text. That means all these files are viewable, one way or another. (The file command applies heuristics to try to determine the format of a file's contents.)

filesysFile-hard is a 512MB file system. Unlike the previous two cases, activity here didn't end immediately when files were deleted -- there may have been additional file creations and deletions after the first deletions. You have two primary goals and one secondary goal for this file system:

  1. (Primary) Don't crash.
  2. (Primary) Recover some files that you're certain were once files on the file system.
  3. (Secondary) Don't recover something that you know is corrupted (i.e., unrecoverable as an exact copy of a file that once existed).
The secondary goal is secondary because it's hard to define just what is required. There are unlimited heuristics (of unlimited complexity) that could be applied. Your code should recognize at least some cases where what it's recovering can't be a file. When it does, it shouldn't leave a recovered file in recoveredFiles/ - assume we want either a correct file or nothing, not a mix of correct and incorrect recovered content. At the same time, it's okay to recover some things that were never files - maybe even a lot of them. Your decision about whether or not it's a file should be based solely on the file system's metadata, and not on the contents of the potential file. (For instance, don't use a jpeg library to try to parse the contents to figure out if it's a valid jpeg file.) Don't try to do something you can't get done in 10 days.
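To make the "metadata only" point concrete, here is one flavor of check, written as a sketch under some assumptions: you've already read the superblock and each block group's block bitmap into memory, and the helper names are mine. The idea is that a deleted file's data blocks should still be marked free; if one has since been reallocated, the recovered contents can't be trusted.

#include <ext2fs/ext2fs.h>

/* Is this block number plausible for a recoverable deleted file?
   'blockBitmaps[g]' is the block bitmap you've read for group g. */
static int blockLooksRecoverable(const struct ext2_super_block *sb,
                                 unsigned char **blockBitmaps,
                                 unsigned int blockNo) {
    if (blockNo == 0)
        return 1;   /* a hole / unused slot: nothing to recover, nothing wrong */
    if (blockNo < sb->s_first_data_block || blockNo >= sb->s_blocks_count)
        return 0;   /* not a legal block number */

    unsigned int idx   = blockNo - sb->s_first_data_block;
    unsigned int group = idx / sb->s_blocks_per_group;
    unsigned int bit   = idx % sb->s_blocks_per_group;

    /* Bit set => the block is currently allocated to something else. */
    return (blockBitmaps[group][bit / 8] & (1 << (bit % 8))) == 0;
}

/* Only the 15 block pointers stored in the inode itself are checked here;
   a fuller solution would follow the indirect blocks as well. */
int inodeLooksRecoverable(const struct ext2_super_block *sb,
                          unsigned char **blockBitmaps,
                          const struct ext2_inode *ino) {
    for (int i = 0; i < EXT2_N_BLOCKS; i++) {
        if (!blockLooksRecoverable(sb, blockBitmaps, ino->i_block[i]))
            return 0;
    }
    return 1;
}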

What to Turn In

Hand in a file with a name like zahorjan.tar.gz that contains everything needed to build your solution, as well as a makefile that will actually build it. You should also include a short report that indicates what your code does, and explains any problems/bugs that remain, if there are any.

We will do basically this:

$ tar xzf zahorjan.tar.gz
$ cd project3
$ make
$ ./e2undel /ourPath/ourTestFile
$ diff -r ./recoveredFiles/ /ourPath/ourRecoveredFiles/
You should make a strong attempt to ensure that this procedure will work on what you submit.