The userspace utility mkfs.lfs is used to build an LFS file system on a block device, usually a disk partition. You can see the command-line usage below. mkfs.lfs supports many more options than described here, but for the reasons mentioned in the introduction, only a file system with well-tested default settings is built. Of course, any setting supported by mkfs.lfs creates a correct file system.
mkfs.lfs 's only task is to build a new empty volume and set up all on-disk structures (for details on the structures see Sec. 4) so that the volume can be mounted by the Linux kernel. Creating a new LFS volume is a little more complicated than creating a static file system such as the popular ext2 or its descendant ext3. mkfs.lfs not only writes segment headers and superblocks to the disk, but also has to write initial information to the .ifile and the root directory. Since the .ifile contains information about segment usage and the inode table, writing this file to the disk changes the information contained in the file itself. Therefore special measures must be taken. The whole process is divided into three phases. In the first phase the root directory is written; next, the .ifile is created and written. Once its inode is known, writing the superblocks and the empty (non-data) segments finalizes the whole process. A detailed description of all phases follows.
The first step is to determine the size of the target device, because it determines the positions of the superblocks as well as the initial size of the .ifile . As already mentioned, the .ifile consists of two parts (see Sec. 4.5 and Fig. 4.3). The first part is a static table with an entry for each segment, whereas the other part is a dynamic inode table. As mkfs.lfs creates a file system with just a root directory and the .ifile , the inode table has a known size with just two inodes.
We try to keep the mkfs code as simple as possible. Because we are about to write only a limited amount of data (the .ifile and the root directory) to the beginning of the disk, we use mmap() to map the first few megabytes of the device into the address space of the mkfs.lfs process. For the sake of simplicity we support a .ifile with double indirect blocks only, which limits its size to approx. 1 MiB. This implies that the largest supported volume is 64 TiB. This is enough for current hard drives as well as for the near future. If support for larger devices is needed, it can be added.
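A minimal sketch of this mapping follows; the mapping length and the error handling are illustrative, not taken from the mkfs.lfs sources:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define MAP_LEN (8 * 1024 * 1024)  /* "first few megabytes"; exact value is an assumption */

static void *map_device_head(const char *dev)
{
	int fd = open(dev, O_RDWR);
	if (fd < 0) {
		perror("open");
		exit(1);
	}
	/* MAP_SHARED so that stores through the returned pointer reach the device */
	void *base = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (base == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
	return base;
}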
We considered whether to use mmap() or seek() and write(), but given the above-mentioned limits, and since mkfs.lfs supports different segment and block size configurations, it is much simpler not to deal with write buffers and to be able to access the written parts of the disk randomly. We benefit from random access especially when writing out the .ifile , as the finfo structures must be updated. On the other hand, we use seek() and write() for writing non-data segments farther from the beginning of the device. Both approaches are described in more detail in Sec. 7.1.4 and Sec. 7.1.5.
Once we know how large the .ifile should be, we allocate its in-memory representation. Throughout the whole process of file system creation the .ifile is updated and read, until it is finally written to the disk. Most important for understanding the following text is that the segment usage table is also initialized at this point. The positions of the superblocks are fully determined by the size of the device. The segment usage table carries information about which segments are reserved, which are used for superblocks, and which were written with data. The .ifile is described in more detail in Sec. 7.1.4.
This section describes the general structure of the functions used to write data to the device. Currently only the root directory and the .ifile are written, but this gives us an opportunity to extend mkfs.lfs to include more data and structures if needed.
Each write function takes as a parameter the number of the first segment to write to. Because of the log nature of this file system, we write contiguously to successive segments, as mkfs.lfs , to a certain extent, simulates writing by the kernel module. Therefore each of the functions which write data to the device returns the number of the last segment written. This, incremented by 1, can be passed to the next write call. It is legal to write a segment only partially, so we do not have to check whether the last segment was fully written; the new write can simply start in the next one. This simplifies the write code considerably. Leaving gaps in segments is not an issue, since the data will eventually be changed and moved to other segments, and the abandoned ones will be reclaimed by the garbage collector.
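The convention can be sketched as follows; write_root() and write_inode() are named in the text, while first_data_segment(), write_ifile() and the exact prototypes are assumptions:

int write_root(int first_seg);   /* each writer returns the last segment it wrote */
int write_ifile(int first_seg);  /* assumed name for the .ifile writer */
int write_inode(int first_seg);

void build_volume(void)
{
	int seg = first_data_segment();  /* hypothetical helper */
	seg = write_root(seg) + 1;       /* the next write starts right after */
	seg = write_ifile(seg) + 1;      /* the last written segment */
	write_inode(seg);                /* partially written segments are fine */
}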
Another write issue is related to the superblocks. No data is written to segments that are reserved for superblocks (see Sec. 4.2). Therefore each function must take care to skip them. This is achieved simply by consulting the segment usage table in the in-memory .ifile .
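A hypothetical helper for this skip could look like the following; the flag names and the layout of the segment usage entries are assumptions, only the segment usage table itself comes from the text:

#define SEG_RESERVED    0x01  /* illustrative flag values */
#define SEG_SUPERBLOCK  0x02

static int next_writable_segment(const struct ifile *ifile, int seg)
{
	/* skip segments the usage table marks as reserved or superblock */
	while (ifile->seg_usage[seg].flags & (SEG_RESERVED | SEG_SUPERBLOCK))
		seg++;
	return seg;
}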
The function write_root() writes the root directory together with its inode to the first data segment (no data segments have been written yet). Actually, the first available segment must be found, since there might be some reserved segments at the beginning of the device, possibly followed by the first superblock.
In this operation only a single segment is written, as the root directory has only one block and only one inode is placed in this segment. As the minimal size of a segment is 256 KiB, there is enough space for both.
The data written to this segment are basically static, or better, always the same regardless of settings. First, the segment summary is filled, a single-block root directory is written to the first data block (holding the ., .. and .ifile entries), the inode of that directory is placed in the next block (inode number LFS_INODE_ROOT as defined in include/linux/lfs_fs.h), and the only finfo record is filled. Finally, the segment usage table is updated accordingly and the root's inode address is set in the inode table.
More details can be found in the source code in utils/mfks/mkfs.c:write_root.
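The sequence of steps can be summarized in the following sketch; all helper names and block offsets are illustrative, only LFS_INODE_ROOT and the order of operations come from the description above:

static int write_root_sketch(char *seg_base, struct ifile *ifile, int seg)
{
	fill_summary(seg_base);                    /* segment summary first */
	fill_root_dir(seg_base + BLK_SIZE);        /* ".", ".." and ".ifile" entries */
	fill_root_inode(seg_base + 2 * BLK_SIZE);  /* inode number LFS_INODE_ROOT */
	fill_finfo(seg_base);                      /* the single finfo record */
	mark_used(ifile, seg);                     /* update the segment usage table */
	set_inode_addr(ifile, LFS_INODE_ROOT, seg_base + 2 * BLK_SIZE);
	return seg;                                /* last (and only) segment written */
}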
The .ifile holds the up-to-date information about the data written to the disk. Unfortunately, writing the .ifile to the disk changes the information inside the .ifile itself. Moreover, updating the .ifile is complicated by the fact that the affected section of the .ifile is most likely already written to the disk and would have to be looked up and changed.
We opted for a different approach. Since we know the size of the .ifile we are about to write out to the device, we can determine which segments are going to be written and how much data is going to be placed in each segment. Segments that will not be written are marked as empty in the segment usage table. All this is accomplished in the fill_segment_usage_table() function in the utils/mfks/ifile.c file. We do not have to figure out what the address of the .ifile inode will be, since this inode's address is not written to the inode table. Instead, it must be written to the superblocks, because there must be a way to find the inode table at the moment the file system is mounted. The inode table is stored in the .ifile .
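The pre-marking could proceed roughly like this; only fill_segment_usage_table() itself is named in the text, the remaining names and the per-segment capacity constant are assumptions:

void fill_segment_usage_table(struct ifile *ifile, int first_seg, size_t ifile_bytes)
{
	size_t left = ifile_bytes;

	for (int seg = first_seg; left > 0; seg++) {
		if (is_superblock_segment(ifile, seg))   /* these are skipped on write */
			continue;
		size_t used = left < DATA_PER_SEG ? left : DATA_PER_SEG;
		ifile->seg_usage[seg].live_bytes = used; /* will hold .ifile data */
		left -= used;
	}
	/* all remaining segments stay marked as empty */
}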
Not only the data part of the .ifile must be written, but, in the case of a large file, also the indirect blocks. Their addresses are not known prior to the actual write-out.
Fig. 7.1.4 presents the in-memory structure of the .ifile . There are several counters which are explained in the code commentary, but the data part is worth a more detailed description here.
When the .ifile is initiated in the ifile_init() function, its size is already known, so all memory can be allocated at once. ifile_new() computes the space required for the data and sets the raw_data item of the ifile structure to point to the allocated unstructured memory. Because we need to access different parts of the .ifile as differently structured memory, we remap the other pointers of struct ifile to point to various offsets within the allocated memory area. remap_pointers() sets ifile.seg_usage to point to the beginning and ifile.itab to just after the segment usage table. Extra space is allocated for indirect blocks (iblocks_raw) and, similarly, iblk_2nd is set to point to the second-level indirect block within this indirect block array.
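A sketch of remap_pointers() follows; the members raw_data, seg_usage, itab, iblocks_raw and iblk_2nd are taken from the text, while the entry types and size helpers are assumptions:

static void remap_pointers(struct ifile *ifile)
{
	char *raw = ifile->raw_data;

	/* the segment usage table sits at the start of the raw area ... */
	ifile->seg_usage = (struct seg_usage *)raw;
	/* ... immediately followed by the inode table */
	ifile->itab = (struct itab_entry *)(raw + seg_usage_bytes(ifile));
	/* the second-level indirect block lives inside the indirect block array */
	ifile->iblk_2nd = ifile->iblocks_raw + iblk_2nd_offset(ifile);
}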
Writing the .ifile to the disk is performed by two nested loops. The outer loop in ifile.c:write_to_disk() iterates per segment. First, the new segment is zeroed and the segment summary is filled. Second, the number of blocks which fit into the current segment must be figured out. We cannot use the whole unused space in the segment because of the finfo records at the end of the segment; both the data blocks and the relevant finfos must fit into the same segment. get_finfos_block() returns the base pointer of the finfo block. The finfo records are filled as each block is written. Because the finfo base pointer points to an mmap()ed area, writing via this pointer means writing directly to the device, so unnecessary memory copying is avoided.
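The outer loop can be outlined as follows; segment_base(), blocks_fitting(), is_superblock_segment() and the constants are assumptions, while get_finfos_block() and write_blocks() are the functions described in the text:

for (int seg = first; blocks_left > 0; seg++) {
	if (is_superblock_segment(ifile, seg))
		continue;                   /* superblock segments are skipped */

	char *base = segment_base(seg);     /* mmap()ed, stores go to the device */
	memset(base, 0, SEGMENT_SIZE);      /* zero the new segment */
	fill_summary(base);
	/* data blocks and their finfos must share the segment */
	int n = blocks_fitting(blocks_left);
	struct finfo *fi = get_finfos_block(base);
	write_blocks(base, fi, n);          /* fills the finfos as it goes */
	blocks_left -= n;
}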
write_blocks() is responsible for writing a number of data or indirect blocks as well as for filling the finfo records. The inner loop in the write_blocks() function writes a single block to the device per iteration. There are several different kinds of blocks and each of them needs a different way of handling. The cases are as follows:
For the sake of simplicity, write_inode() places the .ifile inode into a new segment. It might look like a waste of space, but the .ifile inode will be the most frequently updated inode, so after a sync this inode will be written to a new location and the segment is going to be reclaimed by the garbage collector, as already mentioned in Sec. 7.1.2. write_inode() fills the segment summary and writes the inode to the first data block of that segment. Since there are no file data blocks, no finfos are included. The address of the .ifile inode is saved in the superblock.
The non-data segments are written last. This includes superblocks and empty segment summaries of unused segments.
Superblocks are written in the write_sb() function. First the in-memory temporary superblock (a CPU-endianness template) is converted to little endianness (LFS native) and then written to the disk. Because the disk might be very large, we prefer seeking over memory mapping here. Since the device file is large, usually more than 4 GiB, we have to use lseek64(), which can deal with large files.
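A sketch of this path, assuming an lfs_sb structure, a die() error helper and sb_to_disk_endian() as the name of the endianness conversion:

#define _LARGEFILE64_SOURCE
#include <unistd.h>

static void write_sb(int fd, off64_t offset, const struct lfs_sb *tmpl)
{
	struct lfs_sb sb = *tmpl;

	sb_to_disk_endian(&sb);              /* LFS is little-endian on disk */
	if (lseek64(fd, offset, SEEK_SET) < 0 ||
	    write(fd, &sb, sizeof(sb)) != sizeof(sb))
		die("superblock write failed");  /* assumed error helper */
}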
The main function of the user space GC (we will call it UrSp_GC from now on) is a message loop. It waits for any messages from the kernel, parses them, and processes the information. If need be, it sends requests to the kernel part.
The UrSp_GC maintains its data basically in a linked list of segments, with an index table for faster access. The linked list is always kept up to date with all the information received from the kernel. Each entry includes attributes like the number of live blocks in the next 10 segments, the number of consecutive segments on either side of the segment in question, etc.
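The per-segment record might have roughly the following shape; all field names here are assumptions based on the attributes listed above:

struct seg_info {
	unsigned int     segno;
	unsigned int     live_blocks;   /* as last reported by the kernel */
	unsigned int     live_next10;   /* live blocks in the next 10 segments */
	unsigned int     consec_before; /* consecutive segments before ... */
	unsigned int     consec_after;  /* ... and after this segment */
	struct seg_info *next;          /* the linked list, kept up to date */
};

struct seg_info *index_table[MAX_SEGMENTS]; /* segno -> node, for fast access */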
There are two types of cleaning requests. The first is an emergency request, issued when system resources, especially free disk space, are very low. For this request the UrSp_GC maintains a special table, and it quickly picks the best candidate to clean. The second type of request is issued when the number of free segments decreases to a certain value. The UrSp_GC then executes a function over the linked list to find the best candidates to collect.
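The two selection paths could be sketched like this; the emergency table and the scan over the list come from the text, while the names and the cost criterion (fewest live blocks) are assumptions:

#include <limits.h>

extern struct seg_info *seg_list;

int pick_victim(int emergency)
{
	if (emergency)
		return emergency_table_best();  /* precomputed table, fast pick */

	/* normal request: scan the whole list for the best candidate */
	int best = -1;
	unsigned int best_score = UINT_MAX;
	for (struct seg_info *s = seg_list; s; s = s->next) {
		if (s->live_blocks < best_score) {
			best_score = s->live_blocks;
			best = (int)s->segno;
		}
	}
	return best;
}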
Depending on the state of the system (used segments, number of received messages, etc.), the UrSp_GC can run the choosing procedure and send a cleaning request on its own.