Utilities Implementation Overview

mkfs.lfs

The userspace utility mkfs.lfs is used to build an LFS file system on a block device, usually a disk partition. mkfs.lfs supports many more options than described here, but for the reasons mentioned in the introduction, only a file system with well-tested default settings is built. Of course, any setting supported by mkfs.lfs creates a correct file system.

The only task of mkfs.lfs is to build a new empty volume and set up all on-disk structures (for details on the structures see Sec. 4) so that the volume can be mounted by the Linux kernel. Creating a new LFS volume is a little more complicated than creating a static file system such as the popular ext2 or its descendant ext3. mkfs.lfs not only writes segment headers and superblocks to the disk, but also has to write initial information to the .ifile and the root directory. Since the .ifile contains information about segment usage and the inode table, writing this file to the disk changes the information included in the file itself. Therefore special measures must be taken. The whole process is divided into three phases. In the first phase the root directory is written; next, the .ifile is created and written. Once its inode is known, writing the superblocks and the empty (non-data) segments finalizes the whole process. A detailed description of each phase follows.

Initiation

The first step is to determine the size of the target device, because it determines the positions of the superblocks as well as the initial size of the .ifile. As already mentioned, the .ifile consists of two parts (see Sec. 4.5 and Fig. 4.3). The first part is a static table with an entry for each segment, whereas the other part is a dynamic inode table. As mkfs.lfs creates a file system with just a root directory and the .ifile, the inode table has a known size with just two inodes.

We try to keep the mkfs code as simple as possible. Because we are about to write only a limited amount of data (the .ifile and the root directory) to the beginning of the disk, we use mmap() to map the first few megabytes of the device into the address space of the mkfs.lfs process. For the sake of simplicity we support a .ifile with double indirect blocks only, which limits its size to approx. 1MiB. This implies that the largest supported volume is 64TiB. This is enough for current hard drives as well as for the near future. If support for larger devices is ever needed, it can be added.
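A minimal sketch of this mapping step, assuming a 4MiB window (the actual size mapped by mkfs.lfs may differ):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define MAP_WINDOW (4UL << 20)   /* illustrative: map the first 4MiB */

    /* Map the head of the target device read/write; returns NULL on error. */
    static void *map_device_head(const char *dev)
    {
        int fd = open(dev, O_RDWR);
        if (fd < 0)
            return NULL;
        void *base = mmap(NULL, MAP_WINDOW, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        close(fd);                   /* the mapping stays valid after close */
        return base == MAP_FAILED ? NULL : base;
    }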

We considered whether to use mmap() or seek() and write(), but given the limits mentioned above, and since mkfs.lfs supports different segment and block size configurations, it is much simpler not to deal with write buffers and to be able to access the written parts of the disk randomly. We benefit from random access especially when writing out the .ifile, as the finfo structures must be updated. On the other hand, we use seek() and write() for writing non-data segments farther from the beginning of the device. Both approaches are described in more detail in Sec. 7.1.4 and Sec. 7.1.5.

Ifile initiation

Once we know how large the .ifile should be, we allocate its in-memory representation. Throughout the whole process of file system creation the .ifile is updated and read, until it is finally written to the disk. The most important point for understanding the following text is that the segment usage table is initialized here as well. The positions of the superblocks are fully determined by the size of the device. The segment usage table carries information about which segments are reserved, which are used for superblocks, and which have been written with data. The .ifile is described in more detail in Sec. 7.1.4.

Write functions design

This section describes the general structure of the functions used to write data to the device. Currently only the root directory and the .ifile are written, but this design gives us the opportunity to extend the abilities of mkfs.lfs to include more data and structures if needed.

Each write function takes as a parameter the number of the first segment to write to. Because of the log nature of this file system, we write contiguously to successive segments, as mkfs.lfs, to a certain extent, simulates writing by the kernel module. Therefore each of the functions which write data to the device returns the number of the last written segment. This number, incremented by 1, can be passed to the next write call. It is legal to write a segment only partially, so we do not have to check whether the last segment was fully written; the new write can simply start in the next one. This simplifies the write code considerably. Leaving gaps in segments is not an issue, since the data will eventually be changed and moved to other segments, and the abandoned ones will be reclaimed by the garbage collector.

Another write issue is related to the superblocks. No data is written to segments that are reserved for superblocks (see Sec. 4.2). Therefore each function must take care of that and skip them. This is easily achieved by reading the segment usage table from the in-memory .ifile.
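Put together, the write helpers compose roughly as follows. The function names come from the text; the exact signatures are assumptions made for the sake of the sketch:

    struct ifile;   /* in-memory .ifile, see Sec. 7.1.4 */

    /* prototypes of the write helpers described above (signatures assumed) */
    unsigned write_root(struct ifile *ifl, unsigned first_seg);
    unsigned write_to_disk(struct ifile *ifl, unsigned first_seg);
    unsigned write_inode(struct ifile *ifl, unsigned first_seg);
    void     write_sb(struct ifile *ifl);

    static void build_volume_sketch(struct ifile *ifl)
    {
        unsigned seg;
        seg = write_root(ifl, 0);           /* root directory + its inode   */
        seg = write_to_disk(ifl, seg + 1);  /* .ifile data, may end partial */
        seg = write_inode(ifl, seg + 1);    /* .ifile inode, own segment    */
        write_sb(ifl);                      /* finally the superblocks      */
    }

Each helper internally consults the segment usage table and skips any segment marked as reserved for a superblock.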

Root directory

The function write_root() writes the root directory together with its inode to the first data segment (no data segments have been written yet). Actually, the first available segment must be found first, since there might be some reserved segments at the beginning of the device, perhaps followed by the first superblock.

In this operation only a single segment is written, as the root directory has only one block and only one inode is placed in this segment. As the minimal size of a segment is 256KiB, there is enough space for both.

The data written to this segment are basically static, or rather always the same regardless of the settings. First, the segment summary is filled, then a single-block root directory (with the ., .. and .ifile entries) is written to the first data block, the inode of that directory is placed in the next block (inode number LFS_INODE_ROOT as defined in include/linux/lfs_fs.h), and the only finfo record is filled. Finally, the segment usage table is updated accordingly and the root's inode address is set in the inode table.

More details can be found in the source code in utils/mfks/mkfs.c:write_root().
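As a rough illustration of the write order just described, the following sketch lays out one segment the way write_root() does. All struct definitions, sizes, and field names here are illustrative assumptions; only the order of operations comes from the text:

    #include <stdint.h>
    #include <string.h>

    /* Illustrative on-disk structures; the real definitions live in the
     * LFS headers (e.g. include/linux/lfs_fs.h). */
    struct seg_summary { uint32_t nblocks, nfinfos; };
    struct lfs_inode   { uint32_t ino; uint64_t block0; };
    struct finfo       { uint32_t ino, nblocks; };

    #define LFS_INODE_ROOT 2u    /* assumed value */

    /* base points at the mmap()ed start of the chosen segment */
    static void write_root_sketch(char *base, unsigned blk_size,
                                  unsigned seg_size)
    {
        struct seg_summary *sum = (struct seg_summary *)base;
        char *dir_block = base + blk_size;           /* ., .., .ifile     */
        struct lfs_inode *root =
            (struct lfs_inode *)(base + 2 * blk_size);
        struct finfo *fi =
            (struct finfo *)(base + seg_size) - 1;   /* finfos at the end */

        sum->nblocks = 2;                    /* directory + inode block   */
        sum->nfinfos = 1;
        memset(dir_block, 0, blk_size);      /* directory entries go here */
        root->ino = LFS_INODE_ROOT;
        fi->ino = LFS_INODE_ROOT;            /* the single finfo record   */
        fi->nblocks = 1;
    }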

Ifile

Ifile completion

The .ifile holds up-to-date information about the data written to the disk. Unfortunately, writing the .ifile to the disk changes the information inside the .ifile itself. Moreover, updating the .ifile is complicated by the fact that the affected section of the .ifile has most likely already been written to the disk and would have to be looked up and changed.

We opted for a different approach. Since we know how large the .ifile we are about to write out to the device is, we can determine which segments are going to be written and how much data is going to be used in each segment. Segments that will not be written are marked as empty in the segment usage table. All this is accomplished in the fill_segment_usage_table() function in the utils/mfks/ifile.c file. We do not have to figure out what the address of the .ifile inode will be, since this inode's address is not written to the inode table. Instead, the address of this inode must be written to the superblocks, because there must be a way to find the inode table at the moment the file system is mounted, and the inode table is stored in the .ifile itself.
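The idea behind fill_segment_usage_table() can be sketched as follows; the flag values and the exact signature are assumptions:

    #include <stdint.h>

    enum { SEG_EMPTY, SEG_USED, SEG_RESERVED };   /* illustrative flags */

    /* Mark the segments the .ifile write-out will occupy as used and all
     * remaining non-reserved segments as empty. */
    static void fill_segment_usage_sketch(uint8_t *seg_usage, unsigned nsegs,
                                          unsigned first_seg,
                                          unsigned segs_needed)
    {
        for (unsigned s = 0; s < nsegs; s++)
            if (seg_usage[s] != SEG_RESERVED)
                seg_usage[s] = SEG_EMPTY;

        for (unsigned s = first_seg, done = 0; done < segs_needed; s++) {
            if (seg_usage[s] == SEG_RESERVED)     /* skip superblocks */
                continue;
            seg_usage[s] = SEG_USED;
            done++;
        }
    }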

Not only the data part of the .ifile must be written, but, in the case of a large file, also the indirect blocks. Their addresses are not known prior to the actual write-out.

Ifile in-memory structure

Figure 7.1: In-memory .ifile structure

Fig. 7.1 presents the in-memory structure of the .ifile. There are several counters, which are explained in the code comments, but the data part is worth a more detailed description here.

When the .ifile is initialized in the ifile_init() function, its size is already known, so all memory can be allocated. ifile_new() computes the required space for the data and sets the raw_data member of the ifile structure to point to the allocated unstructured memory. Because we need to access different parts of the .ifile as differently structured memory, we remap the other pointers of struct ifile to point to various offsets within the allocated memory area. remap_pointers() sets ifile.seg_usage to point to the beginning and ifile.itab to just after the segment usage table. Extra space is allocated for indirect blocks (iblocks_raw), and similarly iblk_2nd is set to point to the second-level indirect block within this indirect block array.
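In code, the layout described above looks roughly like this. The member names come from the text; all types and the helper's signature are assumptions:

    #include <stddef.h>
    #include <stdint.h>

    struct ifile_sketch {
        void     *raw_data;     /* one unstructured allocation            */
        uint8_t  *seg_usage;    /* segment usage table, start of raw_data */
        uint64_t *itab;         /* inode table, right after seg_usage     */
        void     *iblocks_raw;  /* separately allocated indirect blocks   */
        uint64_t *iblk_2nd;     /* second-level indirect block inside it  */
    };

    static void remap_pointers_sketch(struct ifile_sketch *ifl,
                                      size_t seg_usage_bytes)
    {
        ifl->seg_usage = (uint8_t *)ifl->raw_data;
        ifl->itab = (uint64_t *)((char *)ifl->raw_data + seg_usage_bytes);
    }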

Ifile write out

Writing the .ifile to the disk is performed by two nested loops. The outer loop, in ifile.c:write_to_disk(), iterates over segments. First, the new segment is zeroed and the segment summary is filled. Second, the number of blocks which fit into the current segment must be figured out. We cannot use all the unused space in the segment because of the finfo records at the end of the segment: both the data blocks and the relevant finfos must fit into the same segment. get_finfos_block() returns the base pointer of the finfo block. Finfo records are filled as each block is written. Because the finfo base pointer points into an mmap()ed area, writing via this pointer means writing directly to the device, so unnecessary memory copying is avoided.
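The capacity computation can be sketched as follows; the sizes and the function name are illustrative, the point being that every block written also consumes finfo space at the end of the segment:

    /* How many blocks fit into one segment when the finfo records at the
     * segment end must fit too (each block costs one finfo entry). */
    static unsigned blocks_that_fit(unsigned seg_size, unsigned blk_size,
                                    unsigned summary_size,
                                    unsigned finfo_entry_size)
    {
        unsigned usable = seg_size - summary_size;
        return usable / (blk_size + finfo_entry_size);
    }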

write_blocks() is responsible for writing a number of data or indirect blocks as well as for filling the finfo records. The inner loop in the write_blocks() function writes a single block to the device per iteration. There are several different kinds of blocks (data blocks and the indirect blocks of the .ifile), and each of them needs a different way of handling; see the individual cases in write_blocks().

Ifile inode

For the sake of simplicity, write_inode() places the .ifile inode into a new segment. It might look like a waste of space, but the .ifile inode will be the most frequently updated inode, so after a sync this inode will be written to a new location and the segment will be reclaimed by the garbage collector, as already mentioned in Sec. 7.1.2. write_inode() fills the segment summary and writes the inode to the first data block of that segment. Since there are no file data blocks, no finfos are included. The address of the .ifile inode is saved in the superblock.

Non-data segments

The non-data segments are written last. This includes the superblocks and the empty segment summaries of unused segments.

Superblocks are written in the write_sb() function. First the in-memory temporary superblock (a template in CPU endianness) is converted to little endianness (LFS native) and then written to the disk. Because the disk might be very large, we prefer seeking over memory mapping. Since the device file is large, usually more than 4GiB, we have to use lseek64(), which can deal with large files.
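A minimal sketch of this seek-and-write path; the superblock layout and the helper are illustrative, only the use of lseek64() and the little-endian conversion come from the text:

    #define _LARGEFILE64_SOURCE
    #include <endian.h>
    #include <stdint.h>
    #include <unistd.h>

    struct lfs_super_sketch { uint64_t ifile_inode_addr; };  /* illustrative */

    static int write_sb_at(int fd, off64_t offset,
                           const struct lfs_super_sketch *tmpl)
    {
        struct lfs_super_sketch sb;
        /* convert the CPU-endianness template to LFS-native little endian */
        sb.ifile_inode_addr = htole64(tmpl->ifile_inode_addr);
        if (lseek64(fd, offset, SEEK_SET) == (off64_t)-1)
            return -1;
        return write(fd, &sb, sizeof(sb)) == (ssize_t)sizeof(sb) ? 0 : -1;
    }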


Garbage collector - user space part

Overview

The main function of the user space GC (we will call it UrSp_GC from now on) is a message loop. It waits for messages from the kernel, parses them, and processes the information. If need be, it sends requests back to the kernel part.

The UrSp_GC maintains its data basically in a linked list of segments, with an index table for faster access. The linked list is always kept up to date with all the information received from the kernel. It includes attributes such as the number of live blocks in the next 10 segments, the number of consecutive segments on either side of the segment in question, etc.
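A node of this list might look roughly as follows; the attribute names are paraphrased from the description above, while the real definitions live in gc_linklist.h:

    #include <stdint.h>

    struct seg_node_sketch {
        uint32_t seg;             /* segment number                       */
        uint32_t live_blocks;     /* live blocks in this segment          */
        uint32_t live_next10;     /* live blocks in the next 10 segments  */
        uint32_t consec_before;   /* consecutive segments in front        */
        uint32_t consec_after;    /* consecutive segments behind          */
        struct seg_node_sketch *prev, *next;
    };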

There are two types of cleaning requests. The first is the emergency request, issued when system resources, such as disk space, are running very low. For this type of request the UrSp_GC maintains a special table, from which it can quickly pick the best candidate to clean. The second type of request is issued when the number of free segments decreases to a certain value. The UrSp_GC then executes a function over the linked list to find the best candidates to collect.

Depending on the state of the system (used segments, number of received messages, etc.), the UrSp_GC can also run the choosing procedure and send a clean request on its own.

Source code

The source code of the UrSp_GC is located in the directory /gc (NOT /src/gc!). The whole code is thoroughly commented, so I will describe only the files and what is included in them, and advise you to go ahead and open the source code for further details.
Header files:
gc_const.h - all the global constants for the UrSp_GC
The following header files each contain declarations for their corresponding source (.c) files:
gc.h - declarations for the main gc.c file
gc_er_table.h - declarations for the emergency table
gc_idx_table.h - declarations for the index table pointing into the linked list
gc_linklist.h - declarations for the linked list with segment information
gc_netlink_comm.h - declarations for netlink communication

Source files:
gc.c - implementation of the main message loop and message processing
gc_er_table.c - implementation of the emergency table
gc_idx_table.c - implementation of the index table (pointing into the linked list)
gc_linklist.c - implementation of the linked list with segment information
gc_netlink_comm.c - netlink socket operations

Most important functions

main function. The main function incorporates the message loop (using the netlink socket operations from gc_netlink_comm.c). It calls parse and do_task when a message from the kernel is received.
parse function. This function parses the data from the socket into the UrSp_GC's variables and data structures.
do_task function. It is basically a switch over the different message types from the kernel. It uses the data from parse to update the persistent data structures (linklist, idx_table, ...). It calls functions from gc_er_table.c, gc_idx_table.c and gc_linklist.c, or choose_segs_to_clean from gc.c, which finds the most suitable candidates for collection.

There are functions for the data-modifying messages (SET, SET_PARTIAL, UPDATE, UNSET, REPAIR) in the component files (gc_er_table.c, gc_idx_table.c and gc_linklist.c) that change the underlying structures accordingly. A sketch of the dispatch follows.
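The dispatch in do_task can be sketched like this; the message type names follow the text, while the enum values and all helper signatures are assumptions:

    #include <stdint.h>

    enum gc_msg { MSG_SET, MSG_SET_PARTIAL, MSG_UPDATE,
                  MSG_UNSET, MSG_REPAIR };

    /* illustrative helpers standing in for the functions exported by
     * gc_linklist.c, gc_er_table.c and gc_idx_table.c */
    void linklist_set(uint32_t seg, uint32_t live);
    void linklist_set_partial(uint32_t seg, uint32_t live);
    void linklist_update(uint32_t seg, uint32_t live);
    void linklist_unset(uint32_t seg);
    void linklist_repair(uint32_t seg);
    void er_table_refresh(uint32_t seg);
    void idx_table_refresh(uint32_t seg);

    static void do_task_sketch(enum gc_msg type, uint32_t seg, uint32_t live)
    {
        switch (type) {
        case MSG_SET:         linklist_set(seg, live);         break;
        case MSG_SET_PARTIAL: linklist_set_partial(seg, live); break;
        case MSG_UPDATE:      linklist_update(seg, live);      break;
        case MSG_UNSET:       linklist_unset(seg);             break;
        case MSG_REPAIR:      linklist_repair(seg);            break;
        }
        er_table_refresh(seg);   /* keep the emergency table in sync   */
        idx_table_refresh(seg);  /* and the index into the linked list */
    }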

Decision process

There are two cases where the UrSp_GC makes a decision: WHETHER to clean and WHAT to clean. Both are heavily parametrized, so we will only try to give you the big picture here. For further information refer to gc.c.
We would also like to mention that this part is definitely worth deeper research; the functions for these operations tended to grow more complex over the course of the project, and they would most definitely become much more complex in a real product.

We base the process of choosing whether to clean on the following information.
First, we consider time and the number of received messages. We will only further consider cleaning if at least GC_RECALC_TIME seconds have passed since the last recalculation/cleaning, OR if at least GC_NEW_SEGS_FROM_LAST_RECALC new segments were created, OR if we received more than GC_CHANGES_NEEDED changing messages since the last collection.
If one of these three time/message-dependent constraints is fulfilled, we look into the issue of used space. There are two threshold values of interest here, both computed from the smallest possible number of usable segments on a volume and from the actual number of segments of the volume, combined with the standard min and max functions (returning the smaller/bigger of their operands); we use real-valued coefficients instead of further constants to make the function readable. If the number of used segments is below the first threshold, we do not clean anything. If it is above that but below the second threshold, we clean only lonesome segments (segments with empty segments on both sides of them). And finally, if there are more used segments than the second threshold, we run the full choosing algorithm.
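The first of these checks can be sketched as follows; the three constant names come from gc_const.h, while their values here and the helper itself are illustrative:

    #include <stdbool.h>
    #include <time.h>

    #define GC_RECALC_TIME                30  /* illustrative values; the */
    #define GC_NEW_SEGS_FROM_LAST_RECALC   8  /* real ones are defined in */
    #define GC_CHANGES_NEEDED             64  /* gc_const.h               */

    static bool should_consider_cleaning(time_t last_recalc,
                                         unsigned new_segs,
                                         unsigned changes_since_last)
    {
        return time(NULL) - last_recalc >= GC_RECALC_TIME
            || new_segs >= GC_NEW_SEGS_FROM_LAST_RECALC
            || changes_since_last > GC_CHANGES_NEEDED;
    }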

The segment choosing algorithm is, obviously, based mostly on the amount of live data in each segment and on the age of that segment. But it also uses the information about the segments around the evaluated one. The algorithm checks the sum of live blocks in the next few segments (this is also important for choosing blocks of segments to be cleaned instead of ``best'' individuals). Then it takes into account the free segments among the previous one hundred and the next one hundred segments. Yet another modifier is the number of consecutive segments in front of and behind this one; it is obvious that it is better to collect something at the edge rather than in the middle of a run of segments.
Schematic equation for the evaluation of a segment $s$ (the concrete weighting is parametrized in gc.c; the form below only names the factors listed above):

\begin{displaymath}
\mathit{value}(s) = f\bigl(\mathit{live}(s),\ \mathit{age}(s),\ \textstyle\sum_{i=1}^{k}\mathit{live}(s+i),\ \mathit{free}_{\pm 100}(s),\ \mathit{consec}(s)\bigr) \eqno{(7.1)}
\end{displaymath}

Logging

The UrSp_GC logs the important messages to the syslog, where they can be viewed. The log lines are prefixed with lfs_gc. It logs at 2 different levels, LOG_INFO and LOG_ERR. The messages of the second type are also printed on the error output.
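A sketch of this logging scheme using the standard syslog(3) interface; the lfs_gc prefix and the two levels come from the text, while the facility and the helper are assumptions:

    #include <stdarg.h>
    #include <stdio.h>
    #include <syslog.h>

    /* call once at startup: every line is prefixed with "lfs_gc" */
    static void gc_log_open(void)
    {
        openlog("lfs_gc", LOG_PID, LOG_DAEMON);
    }

    /* LOG_ERR messages additionally go to the error output */
    static void gc_log_err(const char *fmt, ...)
    {
        va_list ap;
        va_start(ap, fmt);
        vsyslog(LOG_ERR, fmt, ap);
        va_end(ap);
        va_start(ap, fmt);
        vfprintf(stderr, fmt, ap);
        fputc('\n', stderr);
        va_end(ap);
    }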
