Because LFS stores new copies of data in another place on the disk, implementing a snapshot is relatively simple: we must not mark as FREE any data that are dead but still used by a snapshot. The difficulties lie in implementing this in Linux while reusing as much of the already written code as possible.
The simplest way to implement a snapshot in Linux appears to be creating a new file-system type. This file-system has no backing device; during its initialization it references a host file-system instead. It can therefore use the host file-system's structures, such as lfs_sb_info, without fear that they will be freed.
A snapshot finds its host file-system by an id that is provided at mount time. It grabs the host with code similar to the kernel function grab_super(). From that point on the host file-system cannot be unmounted: even if it is unmounted from all of its mount points, it remains mounted in the kernel. When the snapshot is unmounted, deactivate_super() is called, which drops the active reference to the host file-system.
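The effect of holding an active reference can be illustrated with a small user-space model. This is only a sketch under simplifying assumptions: struct host_model, host_grab() and host_release() are hypothetical names, and the model assumes one active reference per mount point plus one per snapshot. It does not reproduce the kernel's struct super_block or its locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for a super block; not the kernel's struct super_block. */
struct host_model {
    int  active;     /* active references: one per mount, one per snapshot */
    bool shut_down;  /* set once the last active reference is dropped */
};

/* Take an active reference, as a mount or a snapshot would. */
static void host_grab(struct host_model *h)
{
    assert(!h->shut_down);
    h->active++;
}

/* Drop an active reference; the host goes away only when none remain. */
static void host_release(struct host_model *h)
{
    assert(h->active > 0);
    if (--h->active == 0)
        h->shut_down = true;
}
```

In this model, unmounting the host from its only mount point while a snapshot still holds a reference leaves shut_down false. Only the release performed by the snapshot's unmount shuts the host down.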
When mounting a snapshot we must sync the host file-system and remember the segment_counter of the last segment and the address of the ifile inode. The snapshot creates its own ifile inode with the same address; since the snapshot is read-only, this inode never changes.
The last thing the snapshot does is stop garbage collection of any segments that were not empty at the last sync before the snapshot was taken. The message LFS_GC_INFO_WINDOW with the last segment_counter is sent to the garbage collecting process. It is not an error to collect data that were created before the snapshot, but since these segments are not freed while the snapshot is mounted, such an operation would only waste space without freeing any segment.
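On the collector side, the window could be applied as a simple filter over candidate segments. The following is a minimal sketch, not the actual collector code: only the message name LFS_GC_INFO_WINDOW comes from the text, while struct gc_model and the three functions are hypothetical. It also assumes that a counter equal to the window belongs to the snapshot, and that the window is lifted explicitly when the snapshot goes away.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical collector state. */
struct gc_model {
    bool     window_set;  /* true while a snapshot is mounted */
    uint64_t window;      /* segment_counter received with LFS_GC_INFO_WINDOW */
};

/* Handle an LFS_GC_INFO_WINDOW message carrying the last segment_counter. */
static void gc_set_window(struct gc_model *gc, uint64_t last_segment_counter)
{
    gc->window_set = true;
    gc->window = last_segment_counter;
}

/* Lift the window again (assumed to happen when the snapshot is unmounted). */
static void gc_clear_window(struct gc_model *gc)
{
    assert(gc->window_set);
    gc->window_set = false;
}

/* Collecting a segment from inside the window is legal but pointless,
 * because the segment cannot be freed while the snapshot is mounted. */
static bool gc_worth_collecting(const struct gc_model *gc, uint64_t seg_counter)
{
    if (!gc->window_set)
        return true;
    return seg_counter > gc->window;
}
```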
All free space tracking data are part of struct free_space defined in free-space.h.
We test whether a snapshot is mounted in lfs_update_free_segs(). If so, we use the segment_counter to test whether the segment was written out before or after the snapshot was taken. If it was written after, the segment is marked as FREE as usual. Otherwise it is enqueued into the struct lfs_sb_info::snap_delayed queue and freed after the snapshot is unmounted.
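The decision described above can be sketched in user space as follows. This is an illustration under stated assumptions rather than the code of lfs_update_free_segs(): struct sb_model, release_segment() and snapshot_umount() are hypothetical, the delayed queue is modelled as a fixed-size array, and a counter equal to the remembered snapshot counter is treated as written before the snapshot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DELAYED 64  /* arbitrary capacity for this sketch */

enum seg_state { SEG_USED, SEG_FREE, SEG_DELAYED };

struct segment {
    uint64_t       segment_counter;
    enum seg_state state;
};

/* Hypothetical stand-in for the relevant parts of lfs_sb_info. */
struct sb_model {
    bool            snapshot_mounted;
    uint64_t        snap_counter;               /* remembered at snapshot mount */
    struct segment *snap_delayed[MAX_DELAYED];  /* models the snap_delayed queue */
    size_t          n_delayed;
};

/* A segment has no live data left: free it now or delay the freeing. */
static void release_segment(struct sb_model *sb, struct segment *seg)
{
    if (sb->snapshot_mounted && seg->segment_counter <= sb->snap_counter) {
        /* Written before the snapshot: the snapshot may still read it. */
        assert(sb->n_delayed < MAX_DELAYED);
        seg->state = SEG_DELAYED;
        sb->snap_delayed[sb->n_delayed++] = seg;
        return;
    }
    seg->state = SEG_FREE;
}

/* On snapshot unmount every delayed segment finally becomes FREE. */
static void snapshot_umount(struct sb_model *sb)
{
    for (size_t i = 0; i < sb->n_delayed; i++)
        sb->snap_delayed[i]->state = SEG_FREE;
    sb->n_delayed = 0;
    sb->snapshot_mounted = false;
}
```

With a snapshot counter of 10, a dead segment with counter 7 ends up in the delayed queue, while one with counter 12 is freed immediately.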
There is also a special atomic value, struct lfs_sb_info::snap_frozen, that locks the freeing process. The freeing process must stay locked from the moment just before the host file-system is snapshotted until the segment_counter is determined.
After a snapshot is mounted we need to determine new values of free_space::free and free_space::max. We take the current value of free_segments and compute the maximal free space value in the same fashion as when the host file-system is mounted. The new value of free_space::free is equal to the new value of free_space::max, because there are no live data in the area that is usable after the snapshot.
The old values of the free space accounting are saved.
Depending on the fragmentation, the resulting free space value can be even greater than the value before the snapshot. In such a case we take the value before the snapshot. If space were used according to the new value, there would be problems when unmounting the snapshot.
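The recomputation and the clamp can be summarised in a few lines. The sketch below makes several assumptions that the text does not confirm: the maximal free space is taken to be free_segments times a hypothetical blocks_per_segment, the "value before the snapshot" is taken to be the saved free_space::free, and the fields saved_free and saved_max are invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Models the fields of struct free_space used here; saved_* are hypothetical. */
struct free_space_model {
    uint64_t free;
    uint64_t max;
    uint64_t saved_free;  /* values from before the snapshot */
    uint64_t saved_max;
};

/* Recompute the accounting for the area usable after the snapshot.
 * Overflow of the multiplication is ignored in this sketch. */
static void snapshot_recompute_free(struct free_space_model *fs,
                                    uint64_t free_segments,
                                    uint64_t blocks_per_segment)
{
    uint64_t new_max = free_segments * blocks_per_segment;

    assert(fs->free <= fs->max);
    fs->saved_free = fs->free;
    fs->saved_max = fs->max;

    /* Never advertise more space than was free before the snapshot. */
    if (new_max > fs->saved_free)
        new_max = fs->saved_free;

    /* No live data exists in the new area yet, so free equals max. */
    fs->max = new_max;
    fs->free = new_max;
}
```

For example, with 10 free segments of 64 blocks each and 1000 blocks free before the snapshot, both fields become 640. With only 500 blocks free beforehand, the clamp applies and both become 500.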
When some space is returned while a snapshot is mounted, we must test carefully whether the space was obtained by freeing data created before or after the snapshot was taken. This can be determined from the address of a block or an inode. One problem remains. Once an inode or a block is dirtied we must acquire additional space for it, but the old space cannot be freed. Instead it must be added to free_space::delayed; we call this a free space transfer. Of course this can happen only the first time the data is accessed. Once the data has acquired new free space from the area after the snapshot, it must not be transferred again until the snapshot is unmounted.
Addresses cannot be used to determine whether data was transferred or not, as data is written asynchronously to the free space accounting. Another way is needed to track whether data was accounted in the free space before or after the snapshot was mounted. It differs for inodes and for direct or indirect blocks.
The segment_counter of each inode is kept in struct lfs_ifile_info. When an inode is created, its segment_counter is set to 0. When an inode is written to the disk, its segment_counter is updated with the segment_counter of the currently written segment. When an inode is read from the disk again, its segment_counter is initialized from the segment usage table in the ifile. Finally, when the inode is transferred we zero its segment_counter.
A transfer is done only if the segment_counter of an inode is non-zero. Because all data is synced before a snapshot is taken, every inode that existed at that point has an up-to-date, non-zero counter, so no data written before the snapshot can escape the transfer. Because the only operation that may change the counter to a non-zero value is one that writes inodes, we know that all data accounted from the free space after the snapshot have their segment_counter set either to 0 or to something greater than the segment_counter of the last snapshot.
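The life cycle of the inode counter and the resulting transfer rule can be modelled as a small state machine. This is a sketch only: struct inode_model, struct acct_model and the function names are hypothetical, the delayed space is counted in whole inodes, and it is assumed that segments written to disk never carry the counter value 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct inode_model {
    uint64_t segment_counter;  /* 0: created in memory or already transferred */
};

struct acct_model {
    uint64_t snap_counter;  /* counter remembered when the snapshot was mounted */
    uint64_t delayed;       /* models free_space::delayed, counted in inodes */
};

static void inode_created(struct inode_model *ino)
{
    ino->segment_counter = 0;
}

static void inode_written(struct inode_model *ino, uint64_t current_segment_counter)
{
    /* Assumption: counters of written segments start at 1. */
    assert(current_segment_counter != 0);
    ino->segment_counter = current_segment_counter;
}

static void inode_read(struct inode_model *ino, uint64_t counter_from_ifile)
{
    ino->segment_counter = counter_from_ifile;
}

/* Called when an inode is dirtied while a snapshot is mounted.
 * Returns true if its old space was transferred to the delayed pool. */
static bool inode_dirtied(struct inode_model *ino, struct acct_model *acct)
{
    if (ino->segment_counter == 0 ||
        ino->segment_counter > acct->snap_counter)
        return false;  /* already accounted in the area after the snapshot */

    acct->delayed++;
    ino->segment_counter = 0;  /* prevents a second transfer */
    return true;
}
```

An inode read from a segment older than the snapshot is transferred exactly once. Dirtying it again, or dirtying it after it has been rewritten into a newer segment, leaves free_space::delayed unchanged.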
There is no reasonable way to attach something like an inode's segment_counter to buffers, so a different approach must be taken. The preallocation bit in the address does the job. This bit is used to mark data blocks as dirty and preallocated (see Section 5.1.3), and it can be reused for deciding whether a block was subtracted from the free space after a snapshot. When a block is dirtied, we check whether it is already preallocated. If not, it is preallocated and we check whether its segment_counter is smaller than the snapshot segment_counter. If so, its space is transferred. In any other case the block was written after the snapshot and so it must be accounted in the free space after the snapshot. As with inodes, the only way to clear the preallocation flag is to write the block to the disk. This operation updates the address of the block and moves it to a segment with a counter greater than that of the snapshot. Subsequent dirtying of the block will not trigger a transfer because of the new value of the counter.
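The block case can be sketched in the same style. The address layout here is purely an assumption: the preallocation flag is placed in the top bit of a 64-bit address, a block's segment is derived by dividing the address by a hypothetical BLOCKS_PER_SEGMENT, and the segment counters live in a small array. Equality with the snapshot counter is treated as "before the snapshot", which is a boundary choice the text leaves open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PREALLOC_BIT       (UINT64_C(1) << 63)  /* assumed flag position */
#define BLOCKS_PER_SEGMENT 8u                   /* assumed segment size */
#define NSEGS              4                    /* segments in this model */

struct blk_fs_model {
    uint64_t seg_counter[NSEGS];  /* segment_counter of each segment */
    uint64_t snap_counter;        /* counter remembered at snapshot mount */
    uint64_t delayed;             /* models free_space::delayed, in blocks */
};

/* Called when a block is dirtied while a snapshot is mounted.
 * Returns true if the block's old space was transferred. */
static bool block_dirtied(struct blk_fs_model *fs, uint64_t *addr)
{
    if (*addr & PREALLOC_BIT)
        return false;  /* already dirty and accounted, nothing to decide */

    uint64_t seg = *addr / BLOCKS_PER_SEGMENT;
    assert(seg < NSEGS);

    *addr |= PREALLOC_BIT;
    if (fs->seg_counter[seg] <= fs->snap_counter) {
        fs->delayed++;  /* old copy belongs to the snapshot */
        return true;
    }
    return false;
}

/* Writing the block assigns a fresh disk address and clears the flag. */
static void block_written(uint64_t *addr, uint64_t new_disk_addr)
{
    assert((new_disk_addr & PREALLOC_BIT) == 0);
    *addr = new_disk_addr;
}
```

Once a transferred block has been written into a segment newer than the snapshot, dirtying it again sets the flag but finds a counter above the snapshot counter, so no second transfer takes place.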
Free space accounting, as described in the previous section, needs a simple way to decide whether a block was created before or after a snapshot. This is mandatory for correct operation, and it can easily be determined using a segment_counter. However, the segment counter has a coarser granularity than the sync operation, which typically finishes a partial segment. The simplest workaround for this problem is not to allow sync to create a partial segment. So when a sync operation is required by a snapshot, a whole segment is always written out. As a result, the working area lies completely behind the snapshot.
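The effect of the workaround on the log position can be shown with a one-line calculation. This is a sketch assuming that the log position is counted in blocks and that a sync arriving exactly on a segment boundary writes nothing extra; snapshot_sync_end() and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Position at which the log continues after a snapshot sync: the current
 * write position rounded up to the next segment boundary, so that no
 * segment holds data from both sides of the snapshot. */
static uint64_t snapshot_sync_end(uint64_t write_pos, uint64_t blocks_per_segment)
{
    assert(blocks_per_segment > 0);
    uint64_t rem = write_pos % blocks_per_segment;
    return rem ? write_pos + (blocks_per_segment - rem) : write_pos;
}
```

With 8 blocks per segment, a sync at block 13 finishes the segment and the log continues at block 16. Every block written afterwards therefore lands in a segment whose counter is greater than the one remembered by the snapshot.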
For more details, see the code in files snapshot.c, snapshot.h and free_space.h.
Viliam Holub 2006-12-04