Friday, March 23, 2012

Inside the ReFS file system

Microsoft has recently released a public beta version of Windows 8 Server with support for the widely advertised ReFS (Resilient File System), previously known under the code name "Protogon". The new file system is offered as an alternative to NTFS, proven over the years of its existence, initially in the segment of Microsoft-based data storage systems, with subsequent migration to client systems.

This article gives an overview of the ReFS file system structures with their advantages and disadvantages, and analyzes the architecture from the point of view of data consistency maintenance and the chances of data recovery after corruption or deletion by the user. It also presents research into the architectural properties of the file system and its performance characteristics.

Windows Server 8 Beta

The file system variant available in this operating system version supports data clusters of 64KB and metadata clusters of 16KB. It presently remains unclear whether ReFS will support other cluster sizes: the "Cluster size" option is currently ignored, and the default is applied when a ReFS volume is created. At formatting time, 64KB is the only available cluster size, and it is also the only size mentioned in the developers' blogs.

This cluster size is more than sufficient for organizing file systems of any practically achievable size, but at the same time it causes notable redundancy in data storage.


File system architecture

Although ReFS is often described as similar to NTFS at the top level, this similarity concerns only the compatibility of some metadata structures, such as "standard information" and "file name", and the values of some attribute flags. The on-disk implementation of ReFS structures is completely different from that of other Microsoft file systems.

The major structural elements of the new file system are B+-trees. All elements of the file system structure are represented as single-level lists or multi-level B+-trees, which allows almost all file system elements to scale significantly. This structure, together with true 64-bit numbering of all system elements, excludes the emergence of "bottlenecks" as the file system grows.

Except for the B+-tree root record, every node occupies a whole metadata block (16KB in this case), while the entries of intermediate (address) nodes are small (about 60 bytes each). For this reason, only a few tree levels are usually required to describe even huge structures, which has quite a positive effect on the overall performance of the system.
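
To see why so few levels suffice, here is a back-of-the-envelope estimate in Python based on the figures above (16KB nodes, ~60-byte address entries); the numbers, and the assumption that nodes are filled completely, are illustrative only:

    NODE_SIZE = 16 * 1024    # metadata block size
    ENTRY_SIZE = 60          # approximate size of one address entry

    FANOUT = NODE_SIZE // ENTRY_SIZE    # ~273 children per address node

    def address_levels(records: int, fanout: int = FANOUT) -> int:
        """Intermediate levels needed to address this many leaf records."""
        levels, capacity = 1, fanout
        while capacity < records:
            levels += 1
            capacity *= fanout
        return levels

    print(address_levels(70_000))         # 2 levels cover 70 thousand records
    print(address_levels(1_000_000_000))  # even a billion need only 4

Two intermediate levels on top of the root and the leaves agree with the 4-level "Directory" tree observed later in this article for more than 70 thousand files.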

The principal structural element of the file system is the “Directory”, represented as a B+-tree whose key is the folder object number. Unlike in other similar file systems, a file in ReFS is not a separate key element of the “Directory”; it exists only as a record in its parent folder. It is probably due to this architectural property that hard links are not supported in ReFS.

“Directory” leaves are typed records. There are three major record types for a folder object: the directory descriptor, index records and sub-object descriptors. All such records are packed into a separate B+-tree keyed by the folder identifier; the root of this tree is a leaf of the “Directory” B+-tree, which allows packing almost any number of records into a folder. The lowest level of a folder's B+-tree opens with the directory descriptor record, which contains basic data about the folder (such as its name, “standard information” and the file name attribute). These data structures have much in common with those of NTFS, though they also show a number of structural differences, the major one being the absence of a typed list of named attributes.

The directory continues with so-called index records: short structures containing data about folder elements. These records are much shorter than their NTFS counterparts, so they burden the volume with metadata to a lesser extent. Records of the directory elements come last. For folders, such a record contains the folder name, the folder identifier in the “Directory” and the “standard information” structure. For files, the identifier is absent; instead, the record contains all basic data about the file, including the root of the B+-tree of file fragments. Accordingly, a file may consist of almost any number of fragments.
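
For clarity, the typed records described above can be sketched schematically; the field names and their composition below are our assumptions for illustration and do not reproduce the actual on-disk layout of ReFS:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class DirectoryDescriptor:          # basic data about the folder itself
        name: str
        standard_information: bytes     # timestamps, attribute flags, etc.

    @dataclass
    class IndexRecord:                  # short record about one folder element
        element_name: str
        element_key: int

    @dataclass
    class DirectoryElementRecord:       # full record of a child folder or file
        name: str
        standard_information: bytes
        directory_id: Optional[int]     # folders: identifier in the "Directory"
        fragments_root: Optional[int]   # files: root of the fragments B+-tree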

Files are allocated on disk in blocks of 64KB, though they are addressed in the same way as metadata blocks (in clusters of 16KB). Resident file data is not supported by ReFS, so a file of 1 byte in size takes a whole 64KB block on disk. This results in substantial storage redundancy for small files; on the other hand, it simplifies free space management, and allocation of a new file takes much less time.
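
The resulting on-disk size of a file is then simple to model; the sketch below, with 64KB allocation blocks and 16KB addressing clusters as described above, is illustrative:

    DATA_BLOCK = 64 * 1024    # file data allocation unit
    CLUSTER = 16 * 1024       # addressing granularity, same as for metadata

    def allocated_size(file_size: int) -> int:
        """On-disk size of a file: whole 64KB blocks, no resident data."""
        blocks = (file_size + DATA_BLOCK - 1) // DATA_BLOCK
        return blocks * DATA_BLOCK

    print(allocated_size(1))            # 65536: a 1-byte file takes a full block
    print(allocated_size(100_000))      # 131072: ~98KB of data takes two blocks
    print(allocated_size(1) // CLUSTER) # the same block is addressed as 4 clusters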

The metadata of an empty file system occupies about 0.1% of the volume (i.e. about 2GB for a 2TB volume). Some basic metadata is duplicated to increase resilience to failures.

Judging by the architecture, booting from ReFS partitions is possible, but this is not implemented in this Windows Server edition.


Resilience to failures

The research did not focus on the stability of the existing ReFS implementation. However, judging by the architecture, the file system has all the necessary tools for safe recovery of files even after serious hardware failures. Parts of metadata structures contain their own identifiers, which makes it possible to verify the origin of a structure; links to metadata contain 64-bit checksums of the referenced blocks, which in most cases makes it possible to assess the consistency of the contents read via a block link.
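
A read through such a checksummed link can be modeled as follows; ReFS stores 64-bit checksums in the links, but the FNV-1a hash used here is merely a stand-in for whatever function the file system actually employs:

    def fnv1a_64(data: bytes) -> int:
        """64-bit FNV-1a hash, standing in for the real checksum function."""
        h = 0xcbf29ce484222325
        for byte in data:
            h = ((h ^ byte) * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
        return h

    def read_checked_block(device, offset: int, expected: int,
                           block_size: int = 16 * 1024) -> bytes:
        """Read a metadata block and verify it against the link's checksum."""
        device.seek(offset)
        block = device.read(block_size)
        if fnv1a_64(block) != expected:
            raise IOError("metadata block at %#x failed verification" % offset)
        return block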

At the same time, it is worth mentioning that checksums are not computed for user data (file contents). On the one hand, this removes the consistency check mechanism from the data area; on the other hand, it speeds up system operation, since content modifications require only minimal changes in the metadata area.

Any modification of a metadata structure is performed in two stages: first, a new (modified) copy of the metadata is written to free disk space; then, if this succeeds, an atomic update operation switches the link from the old (unmodified) to the new (modified) metadata area. This Copy-on-Write (CoW) strategy makes journaling unnecessary, as data consistency is maintained automatically. Committing such modifications to disk may be deferred for a long time, which allows several file system state modifications to be combined into one.
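
A minimal sketch of this two-stage update, using an in-memory model (the block store and link table here are our own abstractions for illustration, not ReFS interfaces):

    class CowStore:
        def __init__(self):
            self.blocks = {}      # block address -> contents
            self.links = {}       # structure name -> current block address
            self.next_free = 0    # naive free space allocator

        def update(self, name: str, new_contents: bytes) -> None:
            # Stage 1: write the modified copy into free space.
            address = self.next_free
            self.next_free += 1
            self.blocks[address] = new_contents
            # Stage 2: atomically switch the link to the new copy. The old
            # block stays on disk untouched until its space is reused, which
            # is what makes the recovery described below possible.
            self.links[name] = address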

This scheme is not applied to user data, so any modification of file contents is written directly to the file. File deletion is performed by reorganizing the metadata structure (using CoW), which preserves the previous version of the metadata block on disk. This makes recovery of deleted files possible until they are overwritten with new user data.


Storage overhead

This section examines how the described data storage scheme uses disk space. For testing purposes, an installed Windows Server was copied to a ReFS partition 580GB in size. The metadata size on the empty file system was 0.73GB.

When the installed Windows Server was copied to the ReFS partition, the file data overhead increased from 0.1% on NTFS to nearly 30% on ReFS, and metadata added about 10% more. As a result, 11GB of user data (more than 70 thousand files) together with metadata occupied 11.3GB on NTFS, while the same data occupied 16.2GB on ReFS, showing that the overhead on ReFS is nearly 50% for this kind of data. For a small number of large files, this effect is, of course, absent.
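
A rough sanity check of these figures (the averaging below assumes file sizes are spread evenly within blocks, which understates the waste for predominantly small files):

    files = 70_000
    data = 11 * 1024**3          # bytes of user data copied
    avg_tail = 32 * 1024         # average unused tail of the last 64KB block

    slack = files * avg_tail     # ~2.1GB lost to block rounding alone
    print("rounding overhead: %.0f%%" % (100 * slack / data))   # prints ~19%

Since an operating system installation is dominated by small files, many of which waste almost an entire 64KB block, a measured value above this uniform estimate is to be expected.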


Operational speed

Since this is a beta version, file system performance was not benchmarked. However, the architecture allows some conclusions to be drawn. Copying more than 70 thousand files to ReFS produced a 4-level “Directory” B+-tree: the root, intermediate level 1, intermediate level 2 and the leaves.

As a result, looking up the attributes of a folder (provided the tree root is cached) requires 3 reads of 16KB blocks. For comparison, the same operation on NTFS requires 1 read of a 1-4KB block (provided the $MFT location map is cached).

Looking up file attributes by folder and file name (in a small folder with several records) requires the same 3 reads on ReFS, whereas on NTFS it requires 2 reads of 1KB each, or 3-4 reads if the file record is referenced through a non-resident “index” attribute. In large folders, however, the number of reads on NTFS grows much faster than on ReFS.

The same applies to file contents: where on NTFS the growth of the number of file fragments leads to sorting through long lists located in different $MFT fragments, on ReFS this is handled by an efficient B+-tree search.
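
The logarithmic growth of lookup cost can be illustrated with the same fanout estimate as before; with the root cached, a lookup costs one read per remaining tree level (a simplified model that ignores how records are distributed between leaves):

    import math

    def refs_lookup_reads(records: int, fanout: int = 270) -> int:
        """Block reads for one key lookup, tree root assumed cached."""
        levels = max(1, math.ceil(math.log(records, fanout)))
        return levels + 1    # intermediate levels plus the leaf block

    for n in (100, 70_000, 10_000_000):
        print(n, refs_lookup_reads(n))   # 2, 3 and 4 reads respectively

Three reads for about 70 thousand records is consistent with the 4-level tree observed above.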


Summary

It is too early to draw final conclusions, but judging by the current implementation, the file system is indeed designed for the server segment, first of all for virtualization systems, DBMS and backup servers, where operation speed and reliability are of principal importance. The major disadvantages of the file system, such as inefficient packing of data on disk, come to nothing on systems that operate on large files.

SysDev Laboratories will monitor the development of this file system and plans to include it in the list of file systems supported for data recovery. Experimental support of ReFS from the beta version of Microsoft Windows 8 Server has already been successfully implemented in the UFS Explorer software and is available for closed beta testing by our partners. The official release of utilities for recovery of deleted files from ReFS, as well as for data recovery after file system damage caused by hardware failures, is planned before or simultaneously with the release of Microsoft Windows 8 Server with ReFS support.