On-disk structures

This chapter describes the on-disk structures used in our implementation of a log-structured file system. Please note that we assume the reader is familiar with both the traditional UNIX file system layout (explained, for example, in [3]) and the general concepts of log-structured file systems as described in [1] and [2]. Please read through at least these two papers if you have not done so already. We have based our work particularly on the latter and will not go into much detail in cases where our implementation is almost the same.

All on-disk structures described in this section are defined in the file include/lfs_fs.h. Please refer to it for exact definitions, comments, and details.

Data types

Before we proceed to the various structures present on the device, it is essential to spend a few paragraphs on some basic but fundamental design decisions and basic types. First and foremost, all multi-byte integers on the disk are stored in little-endian format. It is therefore necessary to use appropriate conversions whenever accessing any such integer that ever reaches the disk. Secondly, all structures we store on the disk are packed so that the compiler does not include any padding and thus they have the same binary representation on all platforms. This also means that the programmer is responsible for the word alignment of the individual items of these structures.

The current source code often requires that the size of a particular structure be a power of two. In this document we try to explicitly warn about these requirements whenever describing such a structure, but it is necessary to double-check before changing any on-disk structure whatsoever. Sometimes there are other requirements; for example, two structures found in the ifile must have the same size. In other words, changing the on-disk structures is a delicate task that must be done thoughtfully.

Finally, this is an overview of the basic types specific to LFS that are used in on-disk structures:

is an address on the disk from the beginning of the device in blocks (thus it must be multiplied by the block size in order to get the address in bytes). Not surprisingly, it is used to store addresses of blocks.

is an address on the disk from the beginning of the device in bytes. However, as the name suggests, it is only used to specify locations of inodes on the disk.

is the logical index of a block within its file or within the indirect block ordering (see section 4.3). It is used in finfo structures to identify blocks within a segment.
All types described above are internally 64-bit unsigned integers and thus impose no practical limit on the device or file size.
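To make these conventions concrete, the following C sketch shows how such 64-bit types and a little-endian accessor might look. The type names and helper functions are illustrative assumptions only; the authoritative definitions live in include/lfs_fs.h.

```c
#include <stdint.h>

/* Hypothetical names -- the real typedefs are in include/lfs_fs.h.
 * All three are 64-bit unsigned integers stored little-endian on disk. */
typedef uint64_t lfs_baddr_t;  /* block address: offset from device start, in blocks */
typedef uint64_t lfs_iaddr_t;  /* inode address: offset from device start, in bytes  */
typedef uint64_t lfs_blkidx_t; /* logical index of a block within its file           */

#define LFS_BLOCK_SIZE 4096u

/* Decode a little-endian 64-bit integer read from the disk; works
 * regardless of the host byte order. */
static uint64_t lfs_le64_to_cpu(const unsigned char b[8])
{
        uint64_t v = 0;
        for (int i = 7; i >= 0; i--)
                v = (v << 8) | b[i];
        return v;
}

/* A block address must be multiplied by the block size to obtain the
 * byte offset on the device. */
static uint64_t lfs_baddr_to_bytes(lfs_baddr_t baddr)
{
        return baddr * (uint64_t)LFS_BLOCK_SIZE;
}
```

In the kernel itself the standard le64_to_cpu()/cpu_to_le64() helpers would be used instead of an open-coded loop; the loop merely shows what the conversion does.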

Characteristics common with traditional designs

We assume the reader is familiar with the concepts of traditional file systems in UNIX environments (which are described, for example, in [3]) and therefore we will not describe in much detail those that LFS uses as well. However, let us at least briefly mention the most important ones.

Most file systems divide the underlying device into its smallest addressable and indivisible units, called blocks, and so do we. Even though some of our metadata can share blocks (for example, inodes can), files in particular are always internally accessed with block granularity. At this stage, our blocks are 4096 bytes long (see limitations on page [*]).

Most importantly, we also use the traditional indexing structure of UNIX file systems, namely the inode and indirect blocks. However, indirect blocks are numbered as described in the next section. LFS also has superblocks that hold data global to the file system and are the only structures that are not written in a log-like manner. The last concept shared with traditional file systems is that directories are represented as files with a given internal format and special access methods.

Figure 4.1: Block indices.
Image block_indices

Block indices

Both indirect and data blocks are assigned an index that uniquely identifies them within an inode. The index of a data block is simply its offset in bytes divided by the block size. Indirect blocks are numbered in a pre-order depth-first search manner. Figure 4.1 shows exactly how this is done.

Segments

All log-structured file systems we know of statically partition the disk into segments of the same, fixed size. [4] discusses the impact of various segment sizes and gives a recommendation relative to the size of the disk. Given today's disks, the recommended value usually varies between one and two megabytes. In order to diminish the cleaning overhead and to limit the number of possible configurations (see limitations on page [*]), the current version of LFS uses segments one megabyte long.

Figure 4.2: Disk structure overview
Image segment_structure
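With one-megabyte segments and 4096-byte blocks, translating between block addresses and segment numbers is simple arithmetic. The helper names below are assumptions for illustration, not functions from the LFS sources:

```c
#include <stdint.h>

#define LFS_BLOCK_SIZE   4096u
#define LFS_SEGMENT_SIZE (1024u * 1024u)  /* one megabyte */
#define LFS_BLOCKS_PER_SEGMENT (LFS_SEGMENT_SIZE / LFS_BLOCK_SIZE)  /* 256 */

/* Segment into which a given block address (counted in blocks) falls. */
static uint64_t lfs_segment_of(uint64_t baddr)
{
        return baddr / LFS_BLOCKS_PER_SEGMENT;
}

/* Block address of the first block of a given segment. */
static uint64_t lfs_segment_start(uint64_t segno)
{
        return segno * (uint64_t)LFS_BLOCKS_PER_SEGMENT;
}
```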

There are three main types of segments (see figure 4.2):

Reserved segment
is a segment at the beginning of the device that is never touched by the file system. Usually there is one such segment on the device; it can be used for whatever purpose the user chooses.

Superblock segment
is a segment that contains a superblock. There are four superblocks on each LFS device and thus four superblock segments. Superblock segments do not contain any other data apart from the superblock.

Data segment
is an ordinary segment that contains user data and file system metadata. We will deal with these segments in more detail in section 4.7 on page [*].

Ifile

Every log-structured file system needs a means of tracking the state of individual segments and inodes. Whereas Sprite-LFS [1] used a special kernel table to store this information, BSD-LFS [2] moved both to an immutable file called the ifile, and we have adopted the same approach. The ifile in our implementation of LFS consists of two parts and its structure is depicted in figure 4.3.

Figure 4.3: Ifile: The ifile consists of a segment usage table and an inode table. The size of the former does not change, whereas the latter grows as the number of inodes on the file system increases.

Segment usage table

The first part of the ifile is called the segment usage table. This table is created by mkfs and has a fixed size for the entire life of a particular file system instance. It contains an entry for each segment with information about how much live data and how many live inodes the segment contains, what its current logical sequential number or segment counter is (see section 4.7), and flags that distinguish between the various types and states of segments. This part of the ifile is especially important for the garbage collecting and segment allocating subsystems described by sections...

Inode table

The second part of the ifile is the inode table. This table grows as inodes are created in the file system. Every inode is described by an item in this table, which enables us to locate the current copy of the inode on the disk and to determine its current version (see section 4.6).

When an inode is deleted (its nlink and reference counter both reach zero), the address in the corresponding inode table entry is set to zero and the item is prepended to a linked list of free items. When a new inode is requested, an item is taken out of the linked list and used. If there is no unused item, a new one is created at the end of the ifile and initialized. In both cases, the version of this inode number is incremented (see section 4.6 for explanation of versions). The inode table never shrinks but so far no one has considered this a problem.
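The allocation scheme above can be sketched in user-space C as follows. The structure layout and names are illustrative assumptions (the real entry format lives in include/lfs_fs.h, and the real code is lfs_ifile_new_ino()/lfs_ifile_free_ino()); only the free-list and version-bump logic is the point here.

```c
#include <stdint.h>

/* Simplified model of an inode table entry; illustrative fields only. */
struct itab_entry {
        uint64_t addr;      /* byte address of the inode on disk, 0 = free */
        uint32_t version;   /* incremented on every (re)allocation         */
        uint32_t next_free; /* next item in the free list                  */
};

#define NO_FREE UINT32_MAX

struct itab {
        struct itab_entry *e;
        uint32_t count, capacity;
        uint32_t free_head; /* head of the linked list of free items */
};

/* Delete an inode: clear the address and prepend the item to the free list. */
static void itab_free_ino(struct itab *t, uint32_t ino)
{
        t->e[ino].addr = 0;
        t->e[ino].next_free = t->free_head;
        t->free_head = ino;
}

/* Allocate an inode number: reuse a free item if possible, otherwise
 * create a new one at the end.  Either way the version is incremented. */
static uint32_t itab_new_ino(struct itab *t)
{
        uint32_t ino;
        if (t->free_head != NO_FREE) {
                ino = t->free_head;
                t->free_head = t->e[ino].next_free;
        } else {
                ino = t->count++; /* assume capacity suffices in this sketch */
        }
        t->e[ino].version++;
        return ino;
}
```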

Size constraints

Obviously, since the number of items in both the segment usage table and the inode table is likely to be very large, it is important to keep these items small. Currently, both structures are 16 bytes long. The current code depends on the fact that both structures have the same size, which is a power of two. This requirement greatly reduces complexity and must be observed when modifying ifile internals.
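Constraints of this kind are best enforced at compile time. The layouts below are purely illustrative guesses (the real 16-byte definitions are in include/lfs_fs.h); what matters is the pattern of checking the size requirements statically.

```c
#include <stdint.h>

/* Illustrative layouts only -- the authoritative definitions are in
 * include/lfs_fs.h.  What matters here are the size constraints. */
struct lfs_seg_entry {   /* segment usage table item */
        uint32_t live_bytes;
        uint32_t live_inodes;
        uint32_t counter;
        uint32_t flags;
} __attribute__((packed));

struct lfs_inode_entry { /* inode table item */
        uint64_t addr;
        uint32_t version;
        uint32_t next_free;
} __attribute__((packed));

#define IS_POWER_OF_2(x) ((x) != 0 && ((x) & ((x) - 1)) == 0)

/* The code depends on both items having the same, power-of-two size. */
_Static_assert(sizeof(struct lfs_seg_entry) == sizeof(struct lfs_inode_entry),
               "ifile items must have the same size");
_Static_assert(IS_POWER_OF_2(sizeof(struct lfs_seg_entry)),
               "ifile item size must be a power of two");
```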

Working with ifile

When accessing various parts of the ifile from the kernel, the programmer should use the appropriate routines in ifile.c. A very brief summary of the most important ones is given in table 4.1. Please refer to the source code and relevant comments for more information. As a side note, a new inode is created by lfs_ifile_new_ino() and inodes are deleted by lfs_ifile_free_ino().

Table 4.1: Ifile access functions
Function Description
lfs_ifile_get_segusage() obtains a pointer to an entry in the segment usage table. Gets, maps and locks the relevant page.
lfs_ifile_get_inode() fetches a pointer to an entry in the inode table. Gets, maps and locks the relevant page.
lfs_ifile_put_page() unlocks, unmaps and puts the page returned by the two functions above. Has to be called when you are done with any of the items.


Of course, it is essential that concurrent access to all parts of the ifile is avoided. This is done using page locks: each item is simply protected by the page lock of the page it resides in. The functions listed in table 4.1 lock the pages properly for you.

Live and dead entities and inode versions

Since log-structured file systems store all data in a single log, data is never written to the same position from which it was read or to which it was written before. The same holds for inodes, indirect blocks and even for the ifile (see section 4.5). Naturally, only the chronologically last copy of such data is the currently valid version. Such inodes and blocks will be referred to as live. All other copies are outdated at any given moment; these are referred to as dead and are subject to garbage collection.

In order to collect garbage safely, it is important to be able to determine whether a particular block or inode is alive or not. Strictly speaking, an inode on the disk is alive if and only if the corresponding item of the inode table points to it. Similarly, a block is alive if and only if it is pointed to by either the inode or an indirect block. In order to speed up the process of finding out the live status of an entity, log-structured file systems introduce so-called inode versions. An inode version is a number associated with an inode number that is incremented every time the inode is truncated to zero or the inode number is recycled and assigned to a newly created inode. When a block or an inode structure on the disk is known to have an inode version different from the current one, it can safely be considered dead (see figure 4.4). Versions are used in the same way when collecting data and indirect blocks; there the benefit of using them can actually be much bigger because loading the inode or even a few indirect blocks can be avoided.

Figure 4.4: Inode versions. a) There is only one inode of a given number present on the disk, it has version six, is alive and pointed to by the inode table. b) An inode has changed and been written to the disk. A new copy with the same version is created and the corresponding pointer in the inode table is redirected to point at it. The previous inode is dead but to find this out, we must check where the live inode is. c) The inode has been deleted. Both inodes on the disk are dead because the inode table entry does not point anywhere. d) The inode number and the corresponding inode table entry have been recycled. Its version is incremented and the new inode has been written to the disk. The previous inodes are dead. This can be determined from their version because it is different from the version recorded in the inode table.
Image inode_versions
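The two liveness rules can be condensed into a pair of predicates. These helpers are hypothetical, written only to restate the logic: a version mismatch proves death cheaply, while matching versions still require the address comparison.

```c
#include <stdint.h>
#include <stdbool.h>

/* A version mismatch proves the entity is dead.  Equal versions prove
 * nothing by themselves: the cleaner must still check whether the live
 * entity actually resides at this disk address.  `current_version`
 * comes from the inode table; `seen_version` was recorded next to the
 * block or inode found in the segment being cleaned. */
static bool lfs_surely_dead(uint32_t current_version, uint32_t seen_version)
{
        return seen_version != current_version;
}

/* An inode copy on the disk is live iff the inode table points at it
 * (a zero table address means the inode has been deleted). */
static bool lfs_inode_is_live(uint64_t table_addr, uint64_t found_addr)
{
        return table_addr != 0 && table_addr == found_addr;
}
```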

Data segment layout

Log-structured file systems need to finish a segment when they complete their part of the sync syscall. Both Sprite-LFS [1] and BSD-LFS [2] use so-called partial segments in order not to waste space by leaving the rest of the segment unused. Our implementation closes partial segments on a few more occasions (fsync, pre-sync, flushing excessive amounts of journals and so on) in a very similar way.

Throughout this text, unless we explicitly state that we refer to a partial segment, we mean the whole, physical, one megabyte long one. For example, the segment usage table in the ifile has exactly one item for each physical segment, even if it contains multiple partial segments.

Every partial segment has a special structure called the segment summary at its end. We will discuss the particularities of partial segments and their summaries in the next chapter; at this point we only need to say that the summary contains the length of the partial segment. Moreover, the physical segment is always fully covered by partial segments, even if the last one is entirely empty. Such an empty partial segment is called a phantom segment and is required to locate the previous partial segments. Thus, there is always a valid segment summary at the end of every physical segment (if it has ever been written to). Because every partial segment has its size recorded at a known place at its end and because we know the end of the last partial segment, all segment summaries are accessible and all partial segments identifiable. Figure 4.5 contains an example.

Figure 4.5: Partial Segments. The physical segment above contains four partial segments. Each of them has a segment summary represented by a dark grey box. In this case, the final partial segment does not contain any data. In other words, it is a phantom segment. Please note that the dotted lines are not actual pointers. These can be computed from the address of the segment summary (the end of the partial segment) and the size of the partial segment stored in the summary.
Image partial_segments
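The backward walk over partial segments can be sketched as follows. This is a toy model, not the on-disk format: here each partial segment ends with just a 4-byte native-endian size, whereas the real summary is a full structure (figure 4.8) with a little-endian encoding.

```c
#include <stdint.h>
#include <string.h>

#define SEG_SIZE (1024u * 1024u) /* one physical segment */

/* Toy model: each partial segment ends with a 4-byte summary holding
 * its own size.  The real summary is larger; see figure 4.8. */
static void write_partial(unsigned char *seg, uint32_t start, uint32_t size)
{
        memcpy(seg + start + size - 4, &size, 4);
}

/* Walk the summaries backward from the end of the physical segment and
 * return the number of partial segments found.  The last summary is
 * always valid (possibly belonging to a phantom segment), so every
 * step is well defined until offset zero is reached. */
static int count_partials(const unsigned char *seg)
{
        uint32_t end = SEG_SIZE;
        int n = 0;
        while (end > 0) {
                uint32_t size;
                memcpy(&size, seg + end - 4, 4);
                end -= size;
                n++;
        }
        return n;
}
```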

Even though the segments form a log, the log does not grow contiguously from the beginning of the device towards the end. As data is overwritten and garbage collected, any free segment can become the next one in the log. Therefore, the log is actually formed by a chronological or logical ordering of the segments. During crash recovery, a roll-forward utility must be able to follow the chain of segments in the order in which they were created by the file system module. Therefore, every segment summary contains the number of the segment that will be used next. The superblock contains the number of the next segment as of the end of the last sync operation.

Needless to say, that is not enough. Segments that were written only partially cannot be recovered and must be detected. Therefore, every partial segment summary contains a CRC of itself and of the first 32 bytes of each sector that was written. Furthermore, if the next planned segment was not actually written to, an old segment could be mistaken for a new one. That is why every segment is assigned a global number that is incremented every time a new segment becomes active (i.e. becomes the one being written to). This number is called the segment counter. Again, the superblock contains the counter of the next segment at the time of the last sync and every segment summary contains the counter of the next segment too. This leads to segments chained in a way similar to figure 4.6. Moreover, the segment counter is the most important means of specifying the time of various events in LFS. For example, it is necessary to know what the counter was when the last sync took place or a snapshot was mounted.

Figure 4.6: Segment chains. The figure contains a few consecutive segments on a disk which happen to form parts of four different chains. The most important is the currently active one. It begins at the superblock and contains segments with counters 1015, 1016 and 1017. Even though segment 1017 points to segment 523, that is clearly an old segment because its counter does not immediately follow the previous one. The rest of the chains are not useful for roll-forward but demonstrate the behavior of the file system. The previously active chain starts outside of this picture, contains the segment 1014 and then joins the current chain. The old chain 1 also begins outside of the picture, contains segments 875 and 876 and leaves for some other part of the disk too. Finally, the old chain 2 contains only the segment with counter 624. Whichever segment once pointed to it has been garbage collected, reused, assigned a new counter and has become part of a new chain. We can also see that the segment 1016 once used to be 625 but has been recycled too. Once the segment 624 is reused as well, the old chain 2 will be no more.
Image segment_chain
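The chain-following rule used by roll-forward reduces to a single comparison. The structure and function below are hypothetical restatements of the rule, not the actual summary fields:

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative excerpt of what a summary promises about its successor:
 * which segment comes next and what counter that segment must carry. */
struct seg_link {
        uint64_t next_segment; /* number of the next segment in the chain */
        uint64_t next_counter; /* counter the next segment must carry     */
};

/* The chain may be followed onto the next segment only if that segment
 * still carries the promised counter; otherwise we have hit an old,
 * recycled segment (like segment 523 in figure 4.6) and the chain ends. */
static bool chain_continues(const struct seg_link *link,
                            uint64_t found_counter)
{
        return found_counter == link->next_counter;
}
```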

Partial segment layout

Partial segments contain data blocks, indirect blocks, inodes, journal lines, finfo structures and, as we already know, a segment summary. Figure 4.7 shows how these parts are ordered and located, whereas figure 4.8 presents the definition of the segment summary. Please note that journal_block, inodes_block and finfos_block are addresses in blocks.

Figure 4.7: Partial segment structure. Please note that the unused space may not be present and may actually take up less than a block.
Image partial_structure

Figure 4.8: Segment summary

Data and indirect blocks

If there are any indirect or data blocks present in a partial segment, they are placed one after another right from the beginning. As we have already stated in the previous section, the beginning of a partial segment can be determined by subtracting the partial segment size from the address of the end of the segment summary. The number of these blocks can be determined from the finfos (see below).

Inodes

Inodes, if present, immediately follow the blocks. Currently one on-disk inode has a size of 256 bytes, which means there are 16 inodes per block. The address of the first block containing the inodes can be obtained from the inodes_block field of the segment summary, which also contains the number of inodes stored in this segment in inodes_count. If this number is not divisible by 16, the rest of the last block is unused and cleared. Therefore one can traverse the inodes not by their count but until a structure filled with zeros is reached (as the segment building code does).

Again, it is important to bear in mind that the size of the on-disk inode is a power of two. The current code depends on it and not observing this principle would break segment building.
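The stop-at-zeros traversal can be sketched like this. The inode is modeled as an opaque 256-byte record because only its fixed size and the cleared-tail convention matter here; the real struct is in include/lfs_fs.h.

```c
#include <stdint.h>
#include <string.h>

#define LFS_BLOCK_SIZE  4096u
#define LFS_DINODE_SIZE 256u   /* 16 inodes per block */

/* Toy on-disk inode: only its fixed 256-byte size matters here. */
struct lfs_dinode { unsigned char raw[LFS_DINODE_SIZE]; };

static int dinode_is_zero(const struct lfs_dinode *di)
{
        static const struct lfs_dinode zero; /* all-zero reference */
        return memcmp(di, &zero, sizeof(zero)) == 0;
}

/* Count the inodes stored in one block: valid entries are contiguous
 * and the remainder of the block is cleared, so it is enough to stop
 * at the first zeroed entry. */
static int count_inodes_in_block(const unsigned char *block)
{
        const struct lfs_dinode *di = (const struct lfs_dinode *)block;
        int n = 0;
        while (n < (int)(LFS_BLOCK_SIZE / LFS_DINODE_SIZE) &&
               !dinode_is_zero(&di[n]))
                n++;
        return n;
}
```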

Journal blocks

The purpose of journal blocks is to inform the roll-forward utility about directory operations. These blocks contain so-called jline structures that can span block and even segment boundaries, but never a sync. Each jline describes an individual directory operation and its format differs according to the operation's type (see section 5.11.8 for details). The most important rule is that a jline describing a particular operation is saved no later than in the same partial segment that contains any part of the affected inodes. On the other hand, it can be stored to disk substantially earlier, because jlines are never reordered and so they reach the disk exactly in the order in which the operations were carried out by the operating system.

When there are journal blocks in the partial segment, journal_count contains the number of blocks they span across and journal_block points to the first such block. The end of jlines is determined by reaching a cleared jline structure.

Finfos

When cleaning a segment, the garbage collector must be able to determine what the individual blocks in a segment contain: to which inode they belong, whether they hold data or are indirect blocks, and what their index is (as described in section 4.3). This is accomplished using so-called finfo structures. Take a look at the definition in figure 4.9. It is a variable-length structure that consists of a header identifying the inode and a number of 64-bit unsigned integers containing indices. The field indirect_block_count tells how many indices of indirect blocks follow the header, while data_block_count determines the number of offsets of direct blocks in the finfo. At the moment, only one of these numbers can be non-zero. version contains the version and ino the number of the inode the blocks belong to.

Finfos also have size constraints. The size of struct lfs_finfo must be divisible by the size of lfs_finfo_block_t, which itself must be a power of two.

Figure 4.9: Finfo structure definition

The finfo blocks correspond to the indirect and data blocks at the beginning of the partial segment. The first index in the first finfo identifies the first block in the partial segment, the second index refers to the second block and so on. When there are no indices left in the first finfo, the second one is used, etc. See figure 4.10.

Figure 4.10: Finfos and corresponding blocks. Each grey rectangle is a logical offset of the corresponding block.
Image finfo_ordering
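The index-to-block correspondence amounts to walking the finfos and consuming their indices in order. The simplified structure below is an assumption (the real struct lfs_finfo in figure 4.9 carries separate indirect and data counts); the walk itself is what the cleaner has to do for the k-th block of a partial segment.

```c
#include <stdint.h>

/* Simplified finfo: the real struct lfs_finfo also distinguishes
 * indirect_block_count from data_block_count and carries a version. */
struct finfo {
        uint64_t ino;            /* inode the blocks belong to     */
        uint32_t block_count;    /* number of indices that follow  */
        const uint64_t *indices; /* logical block indices          */
};

/* Identify the k-th block of the partial segment: walk the finfos,
 * consuming their indices in order.  Returns the owning inode number
 * and stores the logical block index through `idx`; returns 0 when k
 * is out of range. */
static uint64_t block_owner(const struct finfo *fi, int nfinfo,
                            uint32_t k, uint64_t *idx)
{
        for (int i = 0; i < nfinfo; i++) {
                if (k < fi[i].block_count) {
                        *idx = fi[i].indices[k];
                        return fi[i].ino;
                }
                k -= fi[i].block_count;
        }
        return 0;
}
```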

Viliam Holub 2006-12-04