CSE 451: Operating Systems, Spring 2011
  CSE Home   About Us   Search   Contact Info 
 
Course Home
Home
Administrivia
Overview
Course email
 
Materials
Course Calendar
Sections
goPost Board
Anonymous feedback
View feedback
 
Assignments
Homework
Projects
Turnin
Gradebook
 
Information
Home Virtual Machine
Linux information
   

Project 3: e2undel

Out:Monday, May 23, 2010
Due Date:Wednesday, June 1 Friday, June 3, 11:59pm


The 451 (and most likely the CSE home) VMs don't have the include files needed for Project 3 installed. The symptom will be a a file not found error on the C code line:

#include <ext2fs/ext2fs.h>

To fix this on the 451 or CSE VMs (or other 32-bit Fedora systems):

$ sudo yum install e2fsprogs-devel

Other systems:

  • attu and forkbomb: the include files are already installed
  • Ubuntu: $ sudo apt-get install e2fslibs-dev
Note: You will have to give your VM access to the Internet while doing this. On the lower, right margin of the VMware Player window, click on the network icon. Switch from "host only" to NAT networking. CLick again and select Connect. Reboot your Linux machine. (When you're done installing the package, it's recommended you return the machine to host-only networking.)

Skeleton Stuff

project3.tar.gz

Assignment Goals

  • File system disk structures (for the ext family of file systems)
  • Use of file systems
  • Familiarity with raw devices vs. raw files vs. cooked files
  • "More real" programming

Project Introduction

This assignment simulates the following common situation: you've got some incredibly valuable file (say a photo of the sasquatch you met on a recent hike) that you manage to delete by pressing the right camera button at the wrong time. It would be handy for situations like that to have a program that can undelete files. That's our goal, to write a program that can undelete.

Implementations of undelete are very file system-type specific. We're going to write undelete for the ext2 file system. (Cameras typically use fat32.) The reason is that the file system layout in ext2 is very similar to that of its successors, ext3 and ext4, so it's pertinent to today's systems even if no one is using ext2 any longer. (undelete is not possible for ext3 and ext4, or at least not without applying a lot of complicated heuristics that make the job a lot bigger than a 1.5 week assignment.)

This project leaves most everything up to you. The coding portion is small; figuring out what to do is about half the assignment; getting it implemented the other half; the final report the third half. Details of what to do follow.


The Scenario

When you delete an ext2 file, the directory entry for that file is removed, and the inode and data blocks are marked as free. If you were to list all files on the file system, the deleted file no longer exists. However, the contents of the deleted file are still on disk, at least for a while, because the data blocks that were used to hold them are not wiped clean when the file is deleted. So long as subsequent file activity does not allocate those same blocks for some other use, the data just sits there. Similar comments apply to the deleted file's inode.

For this assignment you're provided with an ext2 file system in which some files have been deleted. You're asked to recover those files. For the file systems we distribute with the assignment, the files were deleted and then no further activity of any sort took place on that file system. As a practial matter, your code should easily be able to recover these files. One can imagine writing code that is more robust than what is required here, for instance, recognizing that only a portion of a deleted file can be recovered, and that a portion is permanently lost. Doing that sort of thing can greatly increase the complexity of the code. It's okay to implement those sorts of heuristics for this assignment, but it's not required - you can make a simple, "best effort" to recover deleted files and leave it at that.


Distribution of ext2 File Systems

An issue that comes up in designing this project is how to distribute a test ext2 file system with deleted files. Do I hand out hard drives, or USB drives, or SDHC cards?

It turns out that the basic Linux philosophy of unifying concepts makes this easy: a regular file can be treated as a device. So, the test ext2 file systems are distributed as files. You should think of them as hardware, e.g., a hard disk partition. Because they're files, though, you can read them using standard file operations. (Also, it doesn't matter what kind of file system these files are stored in -- their contents are ext2 file systems, but they're just regular old files.)

You need to do "raw IO" on the contents of the file. That means you should use the open, lseek, read, write, and close library routines, operating on "the file system file", to deal with them. To obtain man pages about those methods, you might need to mention the man section, because those names are fairly common. For instance:

$ man 2 read
Note that the .h files required to use those methods are listed at the tops of their man pages.

The file fileIOExample.c in the distribution provides an example that uses these file operations.


What Is An ext2 File System?

Let's talk using device terminology (and let's talk only about the common case, ignoring the many generalizations that have been implemented). A partition is a fixed sized piece of a disk device, for example, 200GB of disk storage. You format a disk to create a partition table, which describes how many partitions there are on it, and how big they are.

Once you have a partition, you initialize a file system on it. That initialization sets up the on-disk data structures the file system implementation uses to keep track of files and their contents. Those structures are things like inodes and directories. Once initialized, the file system contains only the root directory, and is ready for use creating other files, opening them, reading their contents, and the like.

So, one part of what a file system is is the decision about how to use the storage of the parition to lay out the file system's on disk data structures. That's the only thing we're concerned with in this assignment.

The Second Extended File System web page describes exactly what the data structures are that ext2 uses, and exactly how they're located on the available storage. Where the document talks about "the volume", you should think "the file system file".

The Disk Organization section of that document describes the disk layout. You might skim the couple of examples at the beginning of that section to get a general sense of what everything up to now has meant, before going through the document in more detail.

The document contains various symbolic names for things (e.g., i_atime). Nearly all of these names are names you can use in your program - they're fields of structs defined in ext2fs.h. I did encounter some isolated differences, though, and those are noted in a later section.


Programming Details

ext2's on disk structures can be represented in main memory by C struct's. For example, a struct ext2_inode is laid out in memory exactly like an inode is on disk.

Definitions for those structs are found in include files that are located (usually, depending on your system) in /usr/include/ext2fs/. File ext2fs.h is the main include, and the only one you need to explictly include in your code:

#include <ext2fs/ext2fs.h>
It includes some of the other files in that directory, though, so if you want to have a look at a struct definition, you might have to snoop around a bit.

Other details:

  • Your executable should be called e2undel.
    It should be invocable with a command line like this:
    $ ./e2undel filesysFile-test
    The sole argument is the name of the file that holds the ext2 file system.

  • You should place recovered files in subdirectory recoveredFiles/.
    For example, if the executable is ~/project3/e2undel, the recovered files should go in ~/project3/recoveredFiles/.

  • A recovered file's name should be file-nnnnn, where nnnnn is the inode number the file once resided at.
    For example, file-00182. You can use sprintf to format the file name.

  • You should recover only things that might plausibly be deleted files.
    You shouldn't produce a possibly deleted file for every subset of blocks on the volume, for instance. Given the data distributed with this assignment, you shouldn't end up with thousands of recovered files.

  • You should restore the last accessed and last modified meta data for the files.
    recoveredFiles/file-nnnnn should have those timestamps equal to those of the deleted file (which necessarily is earlier than the time at which it was recovered).
    $ man 3 futimes

  • You can have at most a small constant number of inodes or data blocks in memory at a time.
    The sample ext2 file systems are unrealistically small -- 1MB and 64MB. In a more realistic situation, the total space of the volume (partition) would exceed the size of main memory. For that reason, it's not allowed in this assignment to simply read the entire volume into some in-memory buffer, even though physically that's possible for the two test cases.

    You are allowed to keep in memory much of the file system data structures. Just what to keep is up to you, with the restriction that you can have only a small number of inodes or data blocks in memory at a time. ("A small number" is whatever you want, so long as it's constant and small relative to the total number of inodes and data blocks on the volume.)


How Do I Start?

I recommend this.
  1. Look at The Second Extended File System, then the contents of /usr/include/ext2fs/ext2fs.h and neighboring include files, and repeat for a while.

  2. Now write some pseudo-code describing what you're going to do to find and recover deleted files.

  3. Now start writing C code.
Note that the two sources just mentioned agree on just about all names. The only ones I encountered that differed were like this: the web page says EXT2_S_IFDIR but the include file says LINUX_S_IFDIR. Also, the include file defines a useful macro, LINUX_S_ISDIR(), that is not mentioned in the web page.

Also note that ext2fs.h establishes defined constants SUPERBLOCK_OFFSET and SUPERBLOCK_SIZE, not mentioned on the web page. I'd use them. (I did use them.)


Interesting Usage Notes

  1. The distribution includes a bash script, mkFilesysFile.sh, that creates a file system file. It's contents are:
    
    #!/bin/bash                                                                     
    
    echo args should be: filename size_in_1KB_blocks
    
    dd if=/dev/zero bs=1k count=$2 of=$1
    mke2fs -F $1
    tune2fs -c0 -i0 $1
    mkdir -p mnt
    sudo mount -o loop $1 mnt
    
    echo "'sudo umount mnt'" to unmount
    
    Sample usage to create a 32MB file system file: $ ./mkFilesysFile.sh mytestfile 32768

    The dd command copies bytes from a device that supplies any number of zeroes to an output file.
    The mke2fs command initializes an ext2 file system on "the volume" (file).
    The tune2fs command prevents the system from running fsck on the volume, ever.
    The mkdir command makes sure you have a directory named mnt.
    The sudo mount command mounts the ext2 file system on top of subdirectory mnt. If you cd mnt and create some files, you're creating them in the ext2 file system. You can delete some as well. Note that you must be able to acquire root privilege to mount. You won't be able to do this on a shared CSE machine. You will on your own machine, or a VM.
    The final echo command tells you how to unmount the volume.

    One more detail: To create your own test file system, use the script, then create some files, then unmount. Then remount (by typing the mount command, as in the script), delete the files, and unmount. That reliably flushes the files you created to the volume before they get deleted. If they're not flushed, there's a chance the data for the files never makes it onto the volume, so the recovered files look mangled.

  2. The dumpe2fs program prints some information that might be useful as you debug your code. To use it,
    $ dumpe2fs filesysFile-test

    $ man dumpe2fs


FAQ

  1. Q: Blah blah blah Mac blah blah blah blah?
    A: I don't know. Please post your question to the goPost forum and with luck another Mac user will be able to help.

  2. Q: Is it okay to just google this functionality, and either copy some existing implementation or else read it and re-type it?
    A: No. We can google too, and can recognize production quality solutions.

  3. Q: Okay, how about Bing?
    A: Sure. Good luck.
    (That's just a joke. Bing is out too.)

What to Turn In

Hand in a file with name like zahorjan.tar.gz that contains everything needed to build your solution, as well as a makefile that will actually build it. You should also include a short (1-2 paragraph, most likely) report that indicates what you code does.

We will do basically this:

$ tar xzf zahorjan.tar.gz
$ cd project3
$ make
$ ./e2undel /ourPath/ourTestFile
$ diff -r ./recoveredFiles/ /ourPath/ourRecoveredFiles/
You should make a strong attempt to ensure that that procedure is going to work for what you submit.

Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA  98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX
[comments to zahorjan at cs.washington.edu]