Native LINUX Filesystems
Extended filesystems (Ext, Ext2, Ext3, Ext4)
The extended filesystem (ext fs), second extended filesystem (ext2fs) and third extended filesystem (ext3fs) were designed and implemented on Linux by Rémy Card, Laboratoire MASI--Institut Blaise Pascal, Theodore Ts'o, Massachusetts Institute of Technology, and Stephen Tweedie, University of Edinburgh.
Extended filesystem (Ext FS)
This is the original filesystem used in early Linux systems.
The standard filesystem for Linux, ext2, is a high-performance, non-journaled filesystem. Although ext2 lacks journaling features, many users choose it because of its high speed and reliability.
Second Extended Filesystem (Ext2 FS)
The Second Extended File System provides standard Unix file semantics and advanced features. The Ext2 filesystem format forms the basis for the later native LINUX filesystem versions. The kernel code was also written to accommodate extensions to the current filesystem: access control lists conforming to POSIX semantics, undelete, and on-the-fly file compression.
Long file names (255 characters, extensible to 1012) and variable-length directory entries.
Filesystem sizes of up to 4 TB, the limit set by the VFS layer.
Reserves 5% of the blocks for the superuser (root), so that the administrator can recover when user processes fill up a filesystem.
Synchronous writes of filesystem metadata (inodes, bitmap blocks, indirect blocks and directory blocks).
Choice of logical block size when creating the filesystem, typically 1024, 2048 or 4096 bytes. Larger blocks speed up I/O, since fewer I/O requests, and thus fewer disk head seeks, are needed to access a file.
Fast symbolic links that do not use any data block on the filesystem; the target name is stored in the inode itself rather than in a data block.
Tracks the filesystem state using a special field in the superblock. When a filesystem is mounted in read/write mode, its state is set to ``Not Clean''. When it is unmounted or remounted in read-only mode, its state is reset to ``Clean''. At boot time, the filesystem checker (fsck) uses this information to decide if a filesystem must be checked... The filesystem checker tests this to force the check of the filesystem regardless of its apparently clean state.
Filesystem checks are forced at regular intervals. A mount counter is maintained in the superblock, along with a last check time and a maximal check interval. Each time the filesystem is mounted in read/write mode, the counter and timestamps are checked. When the mount counter reaches a maximal value (also recorded in the superblock), the filesystem checker forces the check even if the filesystem is ``Clean''.
Provides an attribute that allows users to request secure deletion of files. When such a file is deleted, random data is written to the disk blocks previously allocated to the file. This prevents malicious people from gaining access to the previous content of the file by using a disk editor.
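The state and forced-check rules in the list above can be sketched as a small decision function. This is only an illustration: the class and field names (Superblock, state, mount_count, max_mount_count, last_check, check_interval) are hypothetical stand-ins for the superblock fields described, not the real on-disk layout, and treating an interval of 0 as "disabled" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Superblock:
    """Hypothetical subset of the superblock fields described above."""
    state: str             # "Clean" or "Not Clean"
    mount_count: int       # read/write mounts since the last check
    max_mount_count: int   # threshold that forces a check
    last_check: float      # time of the last check, seconds since the epoch
    check_interval: float  # maximal seconds between checks (0 = disabled, assumed)


def needs_check(sb: Superblock, now: float) -> bool:
    """Return True if the checker should examine this filesystem."""
    if sb.state != "Clean":
        return True  # not unmounted cleanly
    if sb.mount_count >= sb.max_mount_count:
        return True  # mounted too many times since the last check
    if sb.check_interval > 0 and now - sb.last_check >= sb.check_interval:
        return True  # too long since the last check
    return False
```

A filesystem marked ``Clean'' is therefore still checked once either the mount counter or the elapsed time crosses its limit.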
Ext2 Physical Structure
Unlike FFS, which uses cylinder groups, the ext2 filesystem is made up of block groups. Block groups are not tied to the physical layout of the blocks on the disk, since modern drives tend to be optimized for sequential access ("smart" drives such as SAN, SCSI and SATA) and hide their physical geometry from the operating system.
Ext2 filesystem layout
| Boot | Block | Block | ... | Block |
| sector | group 1 | group 2 | | group n |
Each block group contains a redundant copy of crucial filesystem control information (the superblock and the filesystem descriptors) and also contains a part of the filesystem (a block bitmap, an inode bitmap, a piece of the inode table, and data blocks). The structure of a block group is represented in this table:
| Super | FS | Block | Inode | Inode | Data |
| block | desc. | bitmap | bitmap | table | blocks |
Using block groups improves reliability: since the control structures are replicated in each block group, it is easy to recover a filesystem whose superblock has been corrupted. This structure also helps performance: by reducing the distance between the inode table and the data blocks, it is possible to reduce the disk head seeks during I/O on files.
In Ext2fs, directories are managed as linked lists of variable length entries containing the inode number, the entry length, the file name and its length. Variable length entries permit long file names without wasting disk space in directories.
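The linked-list layout described above can be illustrated with a short sketch. It assumes a simplified entry of a 32-bit inode number, a 16-bit entry length and a 16-bit name length followed by the name, with each entry padded to a 4-byte boundary; the helper names (pack_entry, walk) are hypothetical, and real directory blocks have further details this sketch ignores.

```python
import struct

# Assumed header layout: inode number, entry length, name length (little-endian).
HEADER = struct.Struct("<IHH")


def pack_entry(inode: int, name: bytes) -> bytes:
    """Build one variable-length directory entry, padded to 4 bytes."""
    rec_len = (HEADER.size + len(name) + 3) & ~3
    entry = HEADER.pack(inode, rec_len, len(name)) + name
    return entry.ljust(rec_len, b"\0")


def walk(block: bytes):
    """Yield (inode, name) pairs by hopping from entry to entry via the entry length."""
    offset = 0
    while offset < len(block):
        inode, rec_len, name_len = HEADER.unpack_from(block, offset)
        start = offset + HEADER.size
        yield inode, block[start:start + name_len]
        offset += rec_len  # the entry length acts as the "next" pointer


block = (
    pack_entry(2, b".")
    + pack_entry(2, b"..")
    + pack_entry(12, b"a_rather_long_file_name.txt")
)
entries = list(walk(block))
```

Short names occupy short entries and long names occupy long ones, which is how variable-length entries avoid wasting directory space.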
Ext2fs buffer cache management performs readaheads: when a block is read, the contiguous data blocks that follow it are read as well. This way, it tries to ensure that the next block to read will already be loaded into the buffer cache. Readaheads are also extended to directory reads.
Ext2fs performs allocation optimizations. Block groups are used to cluster together related inodes and data to reduce the disk head seeks made when the kernel reads an inode and its data blocks.
Ext2fs preallocates up to 8 adjacent blocks when allocating a new block for writing data. Preallocation hit rates are around 75% even on very full filesystems, which gives good write performance under heavy load. Preallocation also allows contiguous blocks to be allocated to files, which speeds up future sequential reads.
Journaled filesystems include additional record keeping that increases the ability of the filesystem to recover from a crash.
· ext3 - the ext2 filesystem with journaling extensions.
· jfs - Journaled File System - a filesystem contributed to Linux by IBM.
· xfs - A filesystem contributed to open source by SGI.
· reiserfs, developed by Namesys, is the default filesystem for SUSE Linux, DARPA.
Third Extended Filesystem (Ext3 FS)
Ext3 supports the same features as Ext2, but also includes Journaling.
Fourth Extended Filesystem (Ext4 FS)
Any existing Ext3 filesystem can be migrated to Ext4 with an easy procedure, which consists of running a couple of commands in read-only mode.
Migrate existing Ext3 filesystems to Ext4
You need to use the tune2fs and fsck tools on the filesystem, and that filesystem needs to be unmounted. Run:
tune2fs -O extents,uninit_bg,dir_index /dev/yourfilesystem
After running this command you MUST run fsck. If you don't, Ext4 WILL NOT MOUNT your filesystem. This fsck run is needed to return the filesystem to a consistent state. It WILL tell you that it finds checksum errors in the group descriptors. This is expected, and rebuilding them is exactly what is needed to mount the filesystem as Ext4, so don't be surprised by them. Each time it finds one of those errors it asks you what to do; always say YES. If you don't want to be asked, add the "-p" parameter to the fsck command, which means "automatic repair":
e2fsck -pfDC0 /dev/yourfilesystem
Bigger filesystem/file sizes
Currently, Ext3 supports a maximum filesystem size of 16 TB and a maximum file size of 2 TB. Ext4 adds 48-bit block addressing, so it will have a maximum filesystem size of 1 EB and a maximum file size of 16 TB. 1 EB = 1,048,576 TB (1 EB = 1024 PB, 1 PB = 1024 TB, 1 TB = 1024 GB).
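The filesystem size limits follow from multiplying the number of addressable blocks by the block size. The sketch below works through that arithmetic, assuming 4 KB blocks and 32-bit block numbers for Ext3; both are assumptions made for illustration, and the function name max_fs_bytes is hypothetical.

```python
BLOCK_SIZE = 4096  # bytes; assumed 4 KB blocks


def max_fs_bytes(address_bits: int, block_size: int = BLOCK_SIZE) -> int:
    """Largest filesystem addressable with block numbers of the given width."""
    return (2 ** address_bits) * block_size


TB = 1024 ** 4
EB = 1024 ** 6

ext3_limit = max_fs_bytes(32)  # assumed 32-bit block numbers
ext4_limit = max_fs_bytes(48)  # 48-bit block numbers, as stated above

print(ext3_limit // TB, "TB")  # 16 TB
print(ext4_limit // EB, "EB")  # 1 EB
print(EB // TB, "TB per EB")   # 1048576 TB per EB
```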
Sub directory scalability
Right now the maximum possible number of sub directories contained in a single directory in Ext3 is 32000. Ext4 breaks that limit and allows an unlimited number of sub directories.
Extents
Traditional Unix-derived filesystems like Ext3 use an indirect block mapping scheme to keep track of each block that holds the data of a file. This is inefficient for large files, especially on large file delete and truncate operations, because the mapping keeps an entry for every single block. Big files have many blocks, so their mappings are huge and slow to handle. Modern filesystems use a different approach called "extents". An extent is basically a run of contiguous physical blocks.
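The difference between a per-block mapping and extents can be shown with a small sketch that collapses a list of physical block numbers into (start, length) pairs. The function name to_extents and the example file (100 MB in 4 KB blocks, stored contiguously from block 5000) are hypothetical, and the sketch ignores any limit on extent length that a real filesystem may impose.

```python
def to_extents(blocks):
    """Collapse physical block numbers (in file order) into (start, length) extents."""
    extents = []
    for b in blocks:
        if extents and extents[-1][0] + extents[-1][1] == b:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)  # block continues the current run
        else:
            extents.append((b, 1))             # block starts a new run
    return extents


# Hypothetical file: 100 MB in 4 KB blocks, stored contiguously from block 5000.
n_blocks = 100 * 1024 * 1024 // 4096
per_block_map = list(range(5000, 5000 + n_blocks))
file_extents = to_extents(per_block_map)
```

Here a mapping with 25600 entries shrinks to a single extent; a fragmented file would need one extent per contiguous run.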
Multiblock allocation
When Ext3 needs to write new data to the disk, a block allocator decides which free blocks will be used to write the data. But the Ext3 block allocator only allocates one block (4 KB) at a time. That means that if the system needs to write 100 MB of data, it will need to call the block allocator 25600 times (and that is just 100 MB!).
Ext4 uses a "multiblock allocator" (mballoc) which allocates many blocks in a single call, instead of a single block per call, avoiding a lot of overhead. This improves the performance, and it's particularly useful with delayed allocation and extents. This feature doesn't affect the disk format. Also, note that the Ext4 block/inode allocator has other improvements, described in detail in this paper.
Delayed allocation
Delayed allocation is a performance feature (it doesn't change the disk format) found in a few modern filesystems such as XFS, ZFS, btrfs or Reiser 4. It consists of delaying the allocation of blocks as much as possible, contrary to what traditional filesystems (such as Ext3, reiser3, etc.) do: allocate the blocks as soon as possible.
With delayed allocation, Ext4 does not allocate the blocks immediately when the process calls write(). Instead, it delays the allocation of the blocks while the file is kept in cache, until the data is really going to be written to the disk. This gives the block allocator the opportunity to optimize the allocation in situations where the old system couldn't.
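The effect of delaying allocation can be illustrated with a toy model. Everything here is hypothetical (ToyAllocator, write_immediately, write_delayed): the allocator just hands out increasing block numbers, and the "cache" is a per-file count of pending blocks. It is a sketch of the idea, not of Ext4's actual code.

```python
class ToyAllocator:
    """Hands out block numbers from a simple increasing counter."""

    def __init__(self):
        self.next_free = 0
        self.calls = 0

    def allocate(self, count):
        self.calls += 1
        start = self.next_free
        self.next_free += count
        return list(range(start, start + count))


def write_immediately(alloc, files, writes):
    """Allocate one block for every write() as soon as it arrives."""
    for name in writes:
        files[name].extend(alloc.allocate(1))


def write_delayed(alloc, files, writes):
    """Count pending blocks per file while cached, then allocate each file in one call."""
    pending = {}
    for name in writes:
        pending[name] = pending.get(name, 0) + 1
    for name, count in pending.items():
        files[name].extend(alloc.allocate(count))


writes = ["a", "b"] * 4  # two files written in interleaved order

eager, eager_files = ToyAllocator(), {"a": [], "b": []}
write_immediately(eager, eager_files, writes)

lazy, lazy_files = ToyAllocator(), {"a": [], "b": []}
write_delayed(lazy, lazy_files, writes)
```

In this model the immediate strategy interleaves the two files' blocks and calls the allocator eight times, while the delayed strategy gives each file one contiguous run using two calls.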
Fast fsck
Fsck is a very slow operation, especially the first step: checking all the inodes in the file system. In Ext4, a list of unused inodes (with a checksum, for safety) will be stored at the end of each group's inode table, so fsck will not check those inodes. The result is that total fsck time improves by a factor of 2 to 20, depending on the number of used inodes.
Journal checksumming
The journal is the most used part of the disk, making the blocks that form part of it more prone to hardware failure. Recovering from a corrupted journal can lead to massive corruption. Ext4 checksums the journal data to know if the journal blocks are failing or corrupted. Journal checksumming has a bonus: it allows the two-phase commit system of Ext3's journaling to be converted to a single phase, speeding up filesystem operation by up to 20% in some cases, so reliability and performance are improved at the same time.
"No Journaling" mode
Journaling ensures the integrity of the filesystem by keeping a log of the ongoing disk changes. However, it is known to have a small overhead. Some people with special requirements and workloads can run without a journal and its integrity advantages. In Ext4 the journaling feature can be disabled, which provides a small performance improvement.
(This feature is being developed and will be included in future releases).
Larger inodes, nanosecond timestamps, fast extended attributes, inodes reservation...
Larger inodes: Ext3 supports configurable inode sizes (via the -I mkfs parameter), but the default inode size is 128 bytes. Ext4 will default to 256 bytes. This is needed to accommodate some extra fields (like nanosecond timestamps or inode versioning), and the remaining space of the inode will be used to store extended attributes that are small enough to fit in that space. This will make access to those attributes much faster, and improves the performance of applications that use extended attributes by a factor of 3 to 7.
Inode reservation consists in reserving several inodes when a directory is created, expecting that they will be used in the future. This improves the performance, because when new files are created in that directory they'll be able to use the reserved inodes. File creation and deletion is hence more efficient.
Nanosecond timestamps mean that inode fields like "modified time" will be able to use nanosecond resolution instead of the second resolution of Ext3.
Persistent preallocation
This feature, available in Ext3 in the latest kernel versions and emulated by glibc on filesystems that don't support it, allows applications to preallocate disk space. Applications tell the filesystem to preallocate the space, and the filesystem preallocates the necessary blocks and data structures, but there is no data in them until the application actually writes it.
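From an application's point of view, preallocation is a single call. The sketch below uses Python's os.posix_fallocate, which is available on Unix systems that provide the underlying call; the helper name preallocate and the file name reserved.bin are hypothetical, and whether the space is reserved natively or emulated depends on the filesystem and C library in use.

```python
import os
import tempfile


def preallocate(path: str, size: int) -> None:
    """Reserve `size` bytes for the file at `path` without writing any data."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)  # offset 0, length `size`
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "reserved.bin")  # hypothetical file name
    preallocate(target, 1024 * 1024)
    reserved_size = os.path.getsize(target)
```

After the call the file reports its full size even though the application has not written anything to it yet.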
Barriers on by default
This is an option that improves the integrity of the filesystem at the cost of some performance (you can disable it with "mount -o barrier=0", which is worth trying if you are benchmarking). The filesystem code must, before writing the [journaling] commit record, be absolutely sure that all of the transaction's information has made it to the journal. Just doing the writes in the proper order is insufficient; contemporary drives maintain large internal caches and will reorder operations for better performance. So the filesystem must explicitly instruct the disk to get all of the journal data onto the media before writing the commit record; if the commit record gets written first, the journal may be corrupted. The kernel's block I/O subsystem makes this capability available through the use of barriers; in essence, a barrier forbids the writing of any blocks after the barrier until all blocks written before the barrier are committed to the media. By using barriers, filesystems can make sure that their on-disk structures remain consistent at all times.
XFS
XFS was originally referred to as the 'X' File System, and the name has been used ever since. The original XFS implementation was developed under Irix at SGI (Silicon Graphics) in 1993. SGI explicitly set out to build a new filesystem from scratch for its high-end Origin servers, with a completely new design for a journaled filesystem. The XFS port to Linux was started in 1999 and took considerable work to "prove" that there was no contaminated code: every single header file and code file was examined line by line. The effort was further complicated by the use of Vnodes and the System V and BSD 4.X code and filesystem framework under Irix. Hence the Linux filesystem interfaces required some rethinking and rework to adapt the stock XFS code. The work that Dave Chinner of Red Hat has done on XFS over the years has greatly improved its performance in numerous areas.
For added file growth, XFS allows a large number of inodes and directori
The file name size limit is 255 characters. To support large files and a larger partition (more addressing values), the file system is 64-bit. The file and space limitations are as follows:
32-bit system / 64-bit system
File size: 16 Terabytes / 16 Exabytes
File system: 16 Terabytes / 18 Exabytes
The file system consistency is guaranteed by the use of Journaling. The Journal size is calculated by the partition size used to make the XFS file system. If a crash occurs, the Journal can be used to ‘redo’ the transactions. When the file system is mounted and a previous crash is detected, the recovery of the items in the Journal is automatic.
For support of faster throughput, Allocation Groups are used. Allocation Groups provide for simultaneous I/O by multiple application threads at once. The ability allows systems with multiple processors or multi-core processors to provide better throughput with the file system. These benefits are better when the XFS file system spans multiple devices.
For multiple devices to be used within the same XFS file system, RAID 0 can be implemented. The more devices used in the striped file system, the higher the throughput that can be achieved. To provide still higher throughput, XFS uses Direct I/O to allow data retrieved from the XFS file system to go directly to the application memory space. Since the data bypasses the cache and the processor, reading and writing data is faster.
Another feature that provides faster throughput is Guaranteed Rate I/O, a method to reserve bandwidth to and from the file system. It provides faster access for real-time applications. XFS also increases performance and reduces fragmentation by using Delayed Allocation. This method works well for files of unknown size that are being written while they are being modified. Another method that XFS uses to prevent fragmentation is sparse files. Here the file logically contains large sections of zeroes. The zeroes are not written to disk; instead, metadata is used to represent them, saving space. When the file is read, the zeroed sections are returned as if they had been stored.
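A sparse file can be created from user space by seeking past the end of the file before writing. The sketch below is a minimal illustration; the helper name make_sparse and the file name sparse.bin are hypothetical, st_blocks is Unix-only and is assumed here to be counted in 512-byte units, and how much space is actually saved depends on the filesystem holding the file.

```python
import os
import tempfile


def make_sparse(path: str, size: int) -> None:
    """Create a file of `size` bytes in which only the final byte is written."""
    with open(path, "wb") as f:
        f.seek(size - 1)  # skip over the hole without writing it
        f.write(b"\0")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sparse.bin")  # hypothetical file name
    make_sparse(path, 16 * 1024 * 1024)
    info = os.stat(path)
    apparent = info.st_size          # logical size seen by applications
    on_disk = info.st_blocks * 512   # space actually allocated (assumed 512-byte units)
    with open(path, "rb") as f:
        head = f.read(4096)          # reading a hole returns zeroes
```

On a filesystem that supports holes, on_disk is typically far smaller than apparent, while reads still see a full-size file of zeroes.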
Even with the file system having methods to reduce fragmentation, it can still occur when it becomes low on free space. To help alleviate this issue, Online Defragmentation is a process which can move files into contiguous blocks to reduce fragmentation. The process can occur while the XFS volume is mounted and being used.
Space on the file system is allocated using Extents. To manage the free space on the file system, B+ Trees are used, whereas other file systems use a bitmap to track free and used space. Two B+ Trees track free space on an XFS file system: one is indexed by the starting block of the free extents, while the second is indexed by the size of the free extents. To write a file, the file system can look for a free extent with enough contiguous blocks, and then use its starting block to begin writing.
NOTE: Bitmaps are used to track used and unused space. These bitmaps are not images, but a file where each bit represents an addressable block. Each bit is either on (1) or off (0) to represent if it is used or free.
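The idea of keeping two indexes over the same free extents can be sketched with sorted lists standing in for the B+ Trees. This assumes that one index is ordered by starting block and the other by extent length, and it uses a simple smallest-fit policy; the class name FreeSpaceIndex is hypothetical and the real XFS allocator is considerably more involved.

```python
import bisect


class FreeSpaceIndex:
    """Two sorted lists standing in for the two free-space B+ Trees."""

    def __init__(self, extents):
        self.by_start = sorted(extents)  # (start, length)
        self.by_size = sorted((length, start) for start, length in extents)

    def allocate(self, length):
        """Take `length` blocks from the smallest free extent that fits."""
        i = bisect.bisect_left(self.by_size, (length, 0))
        if i == len(self.by_size):
            return None  # no free extent is large enough
        size, start = self.by_size.pop(i)
        self.by_start.remove((start, size))
        if size > length:  # return the unused tail to both indexes
            tail_start, tail_len = start + length, size - length
            bisect.insort(self.by_start, (tail_start, tail_len))
            bisect.insort(self.by_size, (tail_len, tail_start))
        return start


index = FreeSpaceIndex([(100, 8), (300, 64), (900, 16)])
first = index.allocate(10)  # smallest extent with at least 10 blocks starts at 900
```

The size-ordered index answers "is there a run of N free blocks?" without scanning, which is the kind of query a bitmap cannot answer directly.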
To save space, the block size is configurable and can be set from 512 bytes to 64 KB. If a system will have many small files, then a smaller block size should be used. If larger files are to be used, then use a larger block size. The block size is set at the time of file system creation. Wasted space by block size is discussed in the article on Extents.
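The trade-off between block size and wasted space can be estimated with simple arithmetic: each file occupies a whole number of blocks, so the unused part of its last block is lost. The sketch below assumes a hypothetical workload of 1000 files of 700 bytes each; the function names are illustrative.

```python
def allocated_bytes(file_size: int, block_size: int) -> int:
    """Bytes consumed on disk when a file is rounded up to whole blocks."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size


def wasted_bytes(file_sizes, block_size: int) -> int:
    """Total space lost to partially filled last blocks."""
    return sum(allocated_bytes(s, block_size) - s for s in file_sizes)


small_files = [700] * 1000  # hypothetical workload: 1000 files of 700 bytes
for bs in (512, 4096, 65536):
    print(bs, wasted_bytes(small_files, bs))
```

For this workload the waste grows from 324,000 bytes with 512-byte blocks to 3,396,000 bytes with 4 KB blocks and 64,836,000 bytes with 64 KB blocks.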
To provide more storage space than is physically available, Data Management API (DMAPI) can be utilized to support offline storage for unused files. Files that are rarely accessed can be moved to another storage device allowing the space to be used by other files. When the file is requested, DMAPI moves the offline file back to the hard drive for access by the application.
When disk space is running low on an XFS volume, it is possible to perform an Online Resizing to increase the free space. Extra space is created by using unused partitions on other hard disks. The space is added to the existing file system increasing its size.
If multiple user accounts and groups are set up, it is possible to monitor the disk usage of each user and/or group. Atomic Disk Quotas allow disk usage management not only to monitor usage but also to place limits on it.
To provide better backups, Snapshots can be used to create a read-only ‘image’ of the file system so it can be backed up even though the ‘real’ file system is still being used and modified.
XFS also provides Native Backup/Restore capabilities. Utilities such as xfsdump and xfsrestore allow a user to back up and restore files while the file system is in use. The process can be done without creating a snapshot and can even be performed in multiple streams to various devices. The backup and restore process can be interrupted and resumed without causing issues.