|
|
|
|
Project 3: e2undel
Out: | Monday, May 23, 2010 |
Due Date: | Wednesday, June 1 Friday, June 3, 11:59pm |
|
The 451 (and most likely the CSE home) VMs don't have the include files
needed for Project 3 installed. The symptom will be a a file not found
error on the C code line:
#include <ext2fs/ext2fs.h>
To fix this on the 451 or CSE VMs (or other 32-bit Fedora systems):
$ sudo yum install e2fsprogs-devel
Other systems:
- attu and forkbomb: the include files are already installed
- Ubuntu: $ sudo apt-get install e2fslibs-dev
Note: You will have to give your VM access to the Internet while doing this. On the lower,
right margin of the VMware Player window, click on the network icon. Switch from "host only"
to NAT networking. CLick again and select Connect. Reboot your Linux machine. (When
you're done installing the package, it's recommended you return the machine to host-only
networking.)
Skeleton Stuff
project3.tar.gz
Assignment Goals
- File system disk structures (for the ext family of file systems)
- Use of file systems
- Familiarity with raw devices vs. raw files vs. cooked files
- "More real" programming
Project Introduction
This assignment simulates the following common situation: you've got some
incredibly valuable file (say a photo of the sasquatch you met on a recent hike)
that you manage to delete by pressing the right camera button at the wrong time.
It would be handy for situations like that to have a program that can
undelete files.
That's our goal, to write a program that can undelete.
Implementations of undelete are very file system-type specific. We're going
to write undelete for the ext2 file system. (Cameras
typically use fat32.) The reason is that the
file system layout in ext2 is very similar to that of its successors,
ext3 and ext4, so it's pertinent to today's systems
even if no one is using ext2 any longer.
(undelete is not possible for ext3 and ext4, or at least
not without applying a lot of complicated heuristics that make the job
a lot bigger than a 1.5 week assignment.)
This project leaves most everything up to you. The coding portion is small;
figuring out what to do is about half the assignment; getting it implemented
the other half; the final report the third half.
Details of what to do follow.
The Scenario
When you delete an ext2 file, the directory entry for that
file is removed, and the inode and data blocks are marked as free.
If you were to list all files on the file system, the deleted file no
longer exists. However, the contents of the deleted file are still on disk, at least
for a while, because the data blocks that were used to hold them are not wiped clean
when the file is deleted. So long as subsequent file activity does
not allocate those same blocks for some other use, the data just sits there.
Similar comments apply to the deleted file's inode.
For this assignment you're provided with an ext2 file system
in which some files have been deleted. You're asked to recover those files.
For the file systems we distribute with the assignment, the files were deleted
and then no further activity of any sort took place on that file system.
As a practial matter, your code should easily be able to recover these files.
One can imagine writing code that is more robust than what is required here,
for instance, recognizing that only a portion of a deleted file can be
recovered, and that a portion is permanently lost. Doing that sort of thing
can greatly increase the complexity of the code. It's okay to implement
those sorts of heuristics for this assignment, but it's not required - you
can make a simple, "best effort" to recover deleted files and leave it at that.
Distribution of ext2 File Systems
An issue that comes up in designing this project is how to distribute a test
ext2 file system with deleted files. Do I hand out hard drives,
or USB drives, or SDHC cards?
It turns out that the basic Linux philosophy of unifying concepts makes
this easy: a regular file can be treated as a device. So, the test
ext2 file systems are distributed as files. You should think
of them as hardware, e.g., a hard disk partition. Because they're
files, though, you can read them using standard file operations.
(Also, it doesn't matter what kind of file system these files are
stored in -- their contents are ext2 file systems,
but they're just regular old files.)
You need to do "raw IO" on the contents of the file. That means
you should use the open, lseek,
read, write,
and close library routines, operating on "the file system file",
to deal with them.
To obtain man pages about those methods, you might need to mention
the man section, because those names are fairly common.
For instance:
$ man 2 read
Note that the .h files required to use those methods
are listed at the tops of their man pages.
The file fileIOExample.c in the distribution provides an
example that uses these file operations.
What Is An ext2 File System?
Let's talk using device terminology (and let's talk only
about the common case, ignoring the many generalizations
that have been implemented). A partition
is a fixed sized piece of a disk device, for example,
200GB of disk storage.
You format a disk to create a partition table,
which describes how many partitions there are on it, and how
big they are.
Once you have a partition, you initialize a file system on it.
That initialization sets up the on-disk data structures the
file system implementation uses to keep track of files and their
contents. Those structures are things like inodes and
directories. Once initialized, the file system contains
only the root directory, and is ready for use creating other files, opening them,
reading their contents, and the like.
So, one part of what a file system is is the decision
about how to use the storage of the parition to lay out
the file system's on disk data structures. That's the only
thing we're concerned with in this assignment.
The Second Extended File System
web page describes exactly what the data structures are that ext2
uses, and exactly how they're located on the available storage.
Where the document talks about "the volume", you should think
"the file system file".
The Disk Organization
section of that document describes the disk layout. You might skim
the couple of examples at the beginning of that section
to get a general sense of what everything
up to now has meant, before going through the document in more detail.
The document contains various symbolic names for things (e.g., i_atime).
Nearly all of these names are names you can use in your program - they're
fields of structs defined in ext2fs.h.
I did encounter some isolated differences, though, and those are noted in a later section.
Programming Details
ext2's on disk structures can be represented in main memory by C struct's.
For example, a struct ext2_inode is laid out in memory exactly like an inode is on disk.
Definitions for those structs are found in include files that are located (usually, depending on your
system) in /usr/include/ext2fs/. File ext2fs.h is the main include, and the
only one you need to explictly include in your code:
#include <ext2fs/ext2fs.h>
It includes some of the other files in that directory, though, so if you want to
have a look at a
struct definition, you might have to snoop around a bit.
Other details:
- Your executable should be called e2undel.
It should be invocable with a command line like this:
$ ./e2undel filesysFile-test
The sole argument is the name of the file that holds the ext2 file system.
- You should place recovered files in subdirectory recoveredFiles/.
For example, if the executable is ~/project3/e2undel, the recovered files should
go in ~/project3/recoveredFiles/.
- A recovered file's name should be file-nnnnn, where nnnnn is
the inode number the file once resided at.
For example, file-00182. You can use sprintf to format the file
name.
- You should recover only things that might plausibly be deleted files.
You shouldn't produce a possibly deleted file for every subset of blocks on the volume, for instance.
Given the data distributed with this assignment, you shouldn't end up with thousands of
recovered files.
- You should restore the last accessed and last modified meta data for the files.
recoveredFiles/file-nnnnn should have those timestamps equal to those of the deleted file (which
necessarily is earlier than the time at which it was recovered).
$ man 3 futimes
- You can have at most a small constant number of inodes or data blocks
in memory at a time.
The sample ext2 file systems are unrealistically small -- 1MB and 64MB. In a more realistic
situation, the total space of the volume (partition) would exceed the size of main memory.
For that reason, it's not allowed in this assignment to simply read the entire volume into some
in-memory buffer, even though physically that's possible for the two test cases.
You are allowed to keep in memory much of the file system data structures. Just what
to keep is up to you, with the restriction that you can have only a small number of inodes
or data blocks in memory at a time. ("A small number" is whatever you want, so long as it's constant
and small relative to the total number of inodes and data blocks on the volume.)
How Do I Start?
I recommend this.
- Look at The Second Extended File System,
then the contents of /usr/include/ext2fs/ext2fs.h and neighboring include files, and repeat for a while.
- Now write some pseudo-code describing what you're going to do to find and recover deleted files.
- Now start writing C code.
Note that the two sources just mentioned agree on just about all names. The only ones I
encountered that differed were like this: the web page says EXT2_S_IFDIR
but the include file says LINUX_S_IFDIR. Also, the include file defines
a useful macro, LINUX_S_ISDIR(), that is not mentioned in the web page.
Also note that ext2fs.h establishes defined constants SUPERBLOCK_OFFSET and
SUPERBLOCK_SIZE, not mentioned on the web page. I'd use them. (I did use them.)
Interesting Usage Notes
- The distribution includes a bash script, mkFilesysFile.sh,
that creates a file system file. It's contents are:
#!/bin/bash
echo args should be: filename size_in_1KB_blocks
dd if=/dev/zero bs=1k count=$2 of=$1
mke2fs -F $1
tune2fs -c0 -i0 $1
mkdir -p mnt
sudo mount -o loop $1 mnt
echo "'sudo umount mnt'" to unmount
Sample usage to create a 32MB file system file: $ ./mkFilesysFile.sh mytestfile 32768
The dd command copies bytes from a device that supplies any number of zeroes to
an output file.
The mke2fs command initializes an ext2 file system on "the volume" (file).
The tune2fs command prevents the system from running fsck on the volume, ever.
The mkdir command makes sure you have a directory named mnt.
The sudo mount command mounts the ext2 file system on top of subdirectory mnt.
If you cd mnt and create some files, you're creating them in the ext2 file system. You
can delete some as well. Note that you must be able to acquire root privilege to mount. You
won't be able to do this on a shared CSE machine. You will on your own machine, or a VM.
The final echo command tells you how to unmount the volume.
One more detail: To create your own test file system, use the script, then create some files, then unmount.
Then remount (by typing the mount command, as in the script), delete the files, and unmount.
That reliably flushes the files you created to the volume before they get deleted. If they're not flushed,
there's a chance the data for the files never makes it onto the volume, so the recovered files look mangled.
- The dumpe2fs program prints some information that might be useful as you debug your code.
To use it,
$ dumpe2fs filesysFile-test
$ man dumpe2fs
FAQ
- Q: Blah blah blah Mac blah blah blah blah?
A: I don't know. Please post your question to the goPost forum and with luck
another Mac user will be able to help.
- Q: Is it okay to just google this functionality, and either copy some existing implementation
or else read it and re-type it?
A: No. We can google too, and can recognize production quality solutions.
- Q: Okay, how about Bing?
A: Sure. Good luck.
(That's just a joke. Bing is out too.)
What to Turn In
Hand in a file with name like zahorjan.tar.gz that contains
everything needed to build your solution, as well as a makefile that
will actually build it. You should also include a short (1-2 paragraph, most likely)
report that indicates what you code does.
We will do basically this:
$ tar xzf zahorjan.tar.gz
$ cd project3
$ make
$ ./e2undel /ourPath/ourTestFile
$ diff -r ./recoveredFiles/ /ourPath/ourRecoveredFiles/
You should make a strong attempt to ensure that that procedure is going to work for what you submit.
|