Archive for the ‘storage’ Category
Today I encountered an obscure file attribute/equivalent mount option (if you are fine with these semantics mount-wide). It is more likely that one would know about this option should he/she be familiar with MTA software and presumably other software with strict data durability guarantees made by a POSIX file system, especially with regard to metadata.
The crux of the problem is that the call to
rename(2) does not guarantee durability of the changes when
rename returns. Using
dirsync promises that metadata alterations in a directory are synchronous rather than asynchronous. One may want to read this post in more detail if he/she isn’t already aware of
dirsync and maintains programs that heavily rely on the atomicity of
rename and other metadata operations. This includes all renames, creations, and deletions.
rename makes atomicity guarantees, which are not to be confused with durability guarantees. Guarantees include:
- One will never have two persistent links to the same file, even if one should suffer a crash during or after a
renameoperation. (A transient double-existence while the system is still on is deemed acceptable)
- Even if another link is being destroyed by the
rename(i.e. a file exists with the destination name), there will exist no time where the destination file name does not exist (as either as the old or new content)
I wrote this post because I did not know a-priori what to be looking for when encountering some self-doubt about the robustness of a two disparate systems utilizing two phase commit during crash recovery, of which one half was a file system. Keywords that came to my mind did not yield useful search results, so I ended walking around the Linux source instead when I came upon
dirsync. This use of the search term is sufficiently obscure (it is much more often used as a shorthand for ‘directory synchronization’, e.g.
rsync-ish tools) that one must disambiguate it by adding fairly specific keywords, such as ‘inode’. Hopefully this post will raise awareness about the possible danger faced by most program assuming the atomicity and durability of metadata changes and serve as good search-engine fodder to that effect.
Edit: I need to do some more investigation on how what the tradeoffs are vs. fsync(). I think there’s mostly a speed benefit to avoiding a heavy fsync() call. To the best of my knowledge, there is no fsync_metadata_only library function, and dirsync will give you those semantics, albeit using fairly blunt tools.
Forgive the writing, I’ll fix it up later if I get complaints.
Gluster (and its filesystem, GlusterFS) is the only distributed computing + distributed file system project that gives me warm fuzzies inside, and if you check my del.icio.us tags, you’ll see that I have visited and reviewed quite a few options in this space (many of which I didn’t bookmark as well). Also reviewed were OCFS(1|2) , GFS(1|2), GFarmFS, Ceph, and CODA.
Why warm fuzzies for Gluster? Because it doesn’t rebuild the world from scratch and it is relatively simple in configuration and implementation. GlusterFS is implemented as a FUSE file system for GNU/Linux (which incurs some overhead, but greatly speeds up development for the obvious reasons) and relies on the underlying file systems that already have received a lot of attention to detail. It also means that you can mix, match, compose, and migrate easily: since it sits above any normal POSIX block device, you can have your exorbitantly expensive fibre channel next to your cheap software RAID6 SATA array in combination with your medium-priced ATA over Ethernet and rely on GlusterFS to distribute data between them using an underlying file system you already know and love. Some of your block devices may be formatted ext3, others JFS or XFS. It doesn’t really matter as long as you have basic POSIX capabilities. GlusterFS also supports optional striping and replication, and I have heard a report of easily saturating a full-duplex 10GBit line in both directions from about five machines (granted, each was probably running RAID) while using GlusterFS.
As it is said: complexity is the enemy of dependability. Gluster is the only solution I’ve seen so far that I as a lone administrator would trust in part because it appeals to my brand of engineering sensibilities. Paramount among them is (to some people counter-intuitively) appreciation for many things that GlusterFS unabashedly doesn’t do, simplifying the design. An example of this is authentication. If you want to use Gluster with authentication, expose (on a trusted machine that’s a cluster client) a SMB/NFS server that takes care of user permissions and hooks up to your LDAP server et al. Gluster doesn’t include any baggage to not trust clients or have fancy quasi-centralized metadata servers, and this I see as a benefit. If someone invents such baggage later on, it will likely be fulfilled as a module (just like the replication module) that I can choose or not choose at-will. Is it as deeply integrated or slick as some of the clustered file systems that require a RDBMS to coordinate? Not really, but the deep and slick scare the bejeezus out of me because they tend to become unwelcome to combination with other techniques, and not in the least because you’re pulling your hair out trying to keep the trains running on time when you are exercising more of the features. (The industrial-strength clustered file systems are not known for their ease of maintenance)
It is also my belief that this dedication to simplicity will result in a more robust substrate to build more advanced fuzzy features on. I prefer my base functionality in a tool to be more predictable than clever. Clever translates to me as “often right, but at hard-to-predict times sometimes very, very funkily wrong.”
Besides the usual supercomputing cluster type applications, I believe Gluster+GlusterFS+(XenSource | LKVM | jails | etc) would provide an excellent way to protect the underlying infrastructure (by using VM-style abstraction as a heavy handed form of capability-based security) and build a service much like Amazon’s EC2. Perhaps an experiment for one day…