Microsoft has recently released a public beta version of Windows 8
Server with support for the advertised ReFS (Resilient File System),
earlier known under the code name "Protogon". The file
system is offered as an alternative to NTFS, proven over the years of
its existence in the segment of Microsoft-based data storage systems,
with subsequent migration into the area of client systems expected.
The file
system variant available in this operating system version supports
data clusters of 64 KB and metadata clusters of 16 KB.
It presently remains unclear whether ReFS will support other cluster
sizes: the "Cluster size" option is currently ignored and set to the
default when a ReFS volume is created, 64 KB is the only cluster size
offered when formatting, and this size is the only one mentioned in
the developers' blogs.
This article
gives an overview of ReFS structures, the advantages and
disadvantages of the file system, and an analysis of its architecture
from the standpoint of data consistency maintenance and the chances
of data recovery after corruption or deletion by the user. The
article also presents the results of research into the architectural
properties of the file system and its performance capabilities.
(Screenshot: Windows Server 8 Beta)
This cluster
size is more than sufficient for organizing file systems of any
practically implemented size, but at the same time it causes notable
redundancy in data storage.
File system architecture
Although ReFS is often described as a file system similar to NTFS at
the top level, this similarity only concerns compatibility of some
metadata structures, such as "standard information" and "file name",
and compatibility in the values of some attribute flags. The on-disk
implementation of ReFS structures is completely different from that
of other Microsoft file systems.
The major
structural elements of the new file system are B+-trees. All elements
of the file system structure are represented as single-level lists or
multi-level B+-trees, which allows almost all file system elements to
scale significantly. Together with true 64-bit numbering of all
system elements, this structure excludes the emergence of
"bottlenecks" during further scaling.
Except for the B+-tree root record, all records have the size of an
integral metadata block (in this case 16 KB), while the entries of
intermediate (address) nodes are small (about 60 bytes each). For
this reason only a small number of tree levels is usually required to
describe even huge structures, which has quite a positive effect on
the overall performance of the system.
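The logarithmic depth claim can be illustrated with a rough calculation. The Python sketch below assumes completely full 16 KB nodes and ~60-byte address entries (both figures taken from the article); the fan-out of a real, partially filled tree would be lower, so this is a best-case estimate rather than a description of the actual ReFS layout.

```python
NODE_SIZE = 16 * 1024              # metadata block size, per the article
ENTRY_SIZE = 60                    # approximate size of one address entry
FANOUT = NODE_SIZE // ENTRY_SIZE   # ~273 children per full internal node

def tree_levels(n_records: int) -> int:
    """Internal levels needed above the leaves to index n_records
    leaf blocks, assuming every node is completely full."""
    levels = 1
    capacity = FANOUT
    while capacity < n_records:
        capacity *= FANOUT
        levels += 1
    return levels

for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"{n:>13,} records -> {tree_levels(n)} internal level(s)")
```

Under these assumptions even a billion records need only four internal levels, which agrees with the observation that a handful of tree levels suffices for huge structures.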
The major
structural element of the file system is the "Directory", represented
as a B+-tree whose key is the folder object number. Unlike in other
similar file systems, a file in ReFS is not a separate key element of
the "Directory"; it exists only as a record in its parent folder.
Probably hard links are not supported on ReFS due to this
architectural property.
"Directory"
leaves are typed records. Three major record types exist for a folder
object: the directory descriptor, index records and sub-object
descriptors. All such records are packed into a separate B+-tree
keyed by the folder identifier; the root of this tree is a leaf of
the "Directory" B+-tree, which allows packing almost any number of
records into a folder. The lowest level of the folder B+-tree
contains, first of all, a directory descriptor record holding basic
data about the folder (such as its name, "standard information" and
file name attribute). These data structures have much in common with
those of NTFS, though they also show a number of structural
differences, the major one being the absence of a typed list of named
attributes.
The directory continues with so-called index records: short
structures holding data about folder elements. These records are much
shorter than their NTFS counterparts, so they burden the volume with
metadata to a lesser extent. Records of directory elements come last.
For folders these elements contain the folder name, the folder
identifier in the "Directory" and the "standard information"
structure. For files the identifier is absent; instead the structure
contains all basic data about the file, including the root of the
file fragments B+-tree. Accordingly, a file may consist of nearly any
number of fragments.
File data
are allocated on disk in blocks of 64 KB, although they are addressed
in the same way as metadata blocks (in clusters of 16 KB). Resident
file data are not supported by ReFS, so a file of 1 byte in size
still occupies a whole 64 KB block on disk, which results in
substantial storage redundancy for small files; on the other hand, it
makes free space management easier, and allocation of a new file
takes much less time.
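The space cost of the 64 KB allocation unit is easy to quantify. The sketch below simply rounds a file size up to the allocation unit, modelling the behaviour described above (it is not code taken from ReFS itself):

```python
ALLOC_UNIT = 64 * 1024   # ReFS allocates file data in 64 KB blocks

def allocated_size(file_size: int) -> int:
    """On-disk space consumed by the data of one file, rounded up
    to whole 64 KB allocation units."""
    if file_size <= 0:
        return 0
    blocks = -(-file_size // ALLOC_UNIT)   # ceiling division
    return blocks * ALLOC_UNIT

print(allocated_size(1))        # 65536: a 1-byte file takes a full 64 KB
print(allocated_size(65_536))   # 65536
print(allocated_size(65_537))   # 131072
```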
The metadata of an empty
file system occupy about 0.1% of the file system size (i.e. about
2 GB for a 2 TB volume). Some basic metadata are duplicated to
increase resilience to failures.
Judging by
the architecture, booting from ReFS partitions is possible, but it is
not implemented in this Windows Server edition.
Resilience to failures
The research
was not focused on the stability of the existing ReFS implementation.
Judging by the architecture, however, the file system has all the
necessary tools for safe file recovery even after severe hardware
failures. Parts of metadata structures contain their own identifiers,
which makes it possible to verify the origin of a structure; links to
metadata contain 64-bit checksums of the referenced blocks, which
very often makes it possible to assess the consistency of the
contents read via a block link.
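The idea of storing a checksum in the link rather than in the block itself can be sketched as follows. This is an illustrative model only: the actual on-disk format and 64-bit checksum algorithm of ReFS are not public, so CRC-32 stands in here purely for demonstration.

```python
import zlib

def make_link(addr: int, block_data: bytes) -> dict:
    """A 'link' records where a block lives and what its contents must hash to."""
    return {"addr": addr, "checksum": zlib.crc32(block_data)}

def read_via_link(link: dict, storage: dict) -> bytes:
    """Read a block through its link, rejecting stale or corrupted contents."""
    data = storage[link["addr"]]
    if zlib.crc32(data) != link["checksum"]:
        raise IOError("block failed checksum: stale, torn or misdirected write")
    return data

storage = {100: b"directory node payload"}
link = make_link(100, storage[100])
assert read_via_link(link, storage) == b"directory node payload"

storage[100] = b"garbage"        # simulate corruption in place
try:
    read_via_link(link, storage)
except IOError as exc:
    print("detected:", exc)
```

Because the checksum travels with the reference, a reader can tell that the block it fetched is not the block the link was written against, without trusting the block's own contents.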
At the same
time it is worth mentioning that checksums of user data (file
contents) are not computed. On the one hand this disables the
consistency-check mechanism in the data area; on the other hand it
speeds up system operation thanks to minimal modifications in the
metadata area.
Any
modification of a metadata structure is made in two stages: first, a
new (modified) copy of the metadata is written to free disk space;
then, on success, an atomic update operation switches the link from
the old (unmodified) to the new (modified) metadata area. This
copy-on-write (CoW) strategy avoids journaling while automatically
maintaining data consistency. Commitment of such modifications to
disk may be deferred for a long time, allowing several modifications
of the file system state to be combined into one.
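The two-stage update can be modelled in a few lines. The sketch below is a deliberately simplified, hypothetical model of the CoW principle described above, not ReFS code: a modification writes a new block first and only then flips a single link.

```python
class CowStore:
    """Toy model of copy-on-write metadata updates (illustrative only)."""

    def __init__(self):
        self.blocks = {}       # 'disk': address -> block contents
        self.next_addr = 0
        self.root = None       # the single link that is switched atomically

    def _write_new(self, data: bytes) -> int:
        """Stage 1: write the modified copy into free space."""
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = data
        return addr

    def update_root(self, data: bytes) -> None:
        """Stage 2: on success, atomically redirect the link."""
        self.root = self._write_new(data)

store = CowStore()
store.update_root(b"metadata v1")
old = store.root
store.update_root(b"metadata v2")

# The old version is never modified in place; it survives on disk
# until its space is reused -- which is what makes recovery possible.
assert store.blocks[old] == b"metadata v1"
assert store.blocks[store.root] == b"metadata v2"
```

A crash before the link flip simply leaves the old, consistent version in place, which is why no journal is needed.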
This scheme
is not applied to user data, so any modification of file content is
written directly to the file. File deletion is performed by
reorganizing the metadata structure (using CoW), which preserves the
previous version of the metadata block on disk. This makes recovery
of deleted files possible until they are overwritten with new user
data.
Storage overhead
This section examines how efficiently the data storage scheme uses
disk space. For testing purposes, an installed Windows Server was
copied to a ReFS partition 580 GB in size. The metadata size on the
empty file system was 0.73 GB.
After copying
the installed Windows Server to the ReFS partition, the overhead on
file data grew from 0.1% on NTFS to nearly 30% on ReFS, and metadata
added about another 10%. As a result, 11 GB of user data (more than
70 thousand files) together with metadata took 11.3 GB on NTFS, while
the same data on ReFS took 16.2 GB, showing that the overhead on ReFS
is nearly 50% for this type of data. For a small number of large
files this effect is certainly absent.
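The overhead figures quoted above can be re-derived from the raw numbers of the test:

```python
user_data_gb  = 11.0   # user data copied in the test (70+ thousand files)
ntfs_total_gb = 11.3   # same data plus metadata on NTFS
refs_total_gb = 16.2   # same data plus metadata on ReFS

ntfs_overhead = (ntfs_total_gb - user_data_gb) / user_data_gb
refs_overhead = (refs_total_gb - user_data_gb) / user_data_gb
print(f"NTFS overhead: {ntfs_overhead:.0%}")   # 3%
print(f"ReFS overhead: {refs_overhead:.0%}")   # 47%, i.e. nearly 50%
```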
Operational speed
Since we are speaking about a beta version, the performance of the
file system was not benchmarked, but its architecture allows some
conclusions to be drawn. Copying more than 70 thousand files to ReFS
produced a 4-level "Directory" B+-tree: the root, intermediate
level 1, intermediate level 2 and the leaves.
As a result,
a search for folder attributes (provided the tree root is cached)
requires 3 reads of 16 KB blocks. For comparison, the same operation
on NTFS requires 1 read of a 1-4 KB block (provided the $MFT location
map is cached).
A search for
file attributes by folder and by file name within the folder (a small
folder with several records) requires the same 3 reads on ReFS. On
NTFS it takes 2 reads of 1 KB each, or 3-4 reads if the file record
is in a non-resident "index" attribute. In large folders, however,
the number of reads on NTFS grows much faster than the number of
reads required on ReFS.
The same applies
to file contents: where growth in the number of file fragments on
NTFS results in searching large lists located in different $MFT
fragments, on ReFS it is handled by an efficient B+-tree search.
Summary
It is too early
to draw final conclusions, but judging by the current implementation
the file system is indeed designed for the server segment and, first
of all, for virtualization systems, DBMS and backup servers, where
operational speed and reliability are of principal importance. The
major disadvantages of the file system, such as inefficient packing
of small files on disk, come to nothing on systems that operate on
large files.
SysDev
Laboratories will monitor the development of this file system and
plans to include it in the list of file systems supported for data
recovery. Experimental support of the ReFS version found in the
Microsoft Windows 8 Server beta has already been successfully
implemented in the UFS Explorer software and is available for closed
beta testing by our partners. The official release of utilities for
recovery of deleted files from ReFS, as well as for data recovery
after file system damage caused by hardware failures, is planned
before or simultaneously with the release of Microsoft Windows 8
Server with ReFS support.