Today I encountered an obscure file attribute, which also has an equivalent mount option if you are fine with these semantics mount-wide. You are more likely to know about it if you are familiar with MTA software, and presumably with other software that depends on strict durability guarantees from a POSIX file system, especially with regard to metadata.
The crux of the problem is that a call to rename(2) does not guarantee that the change is durable by the time rename returns. Using dirsync promises that metadata alterations in a directory are synchronous rather than asynchronous. This includes all renames, creations, and deletions. You may want to read this post in more detail if you aren’t already aware of dirsync and you maintain programs that rely heavily on the atomicity of rename and other metadata operations.
rename makes atomicity guarantees, which are not to be confused with durability guarantees. The guarantees include:
- One will never have two persistent links to the same file, even if one should suffer a crash during or after a rename operation. (A transient double existence while the system is still running is deemed acceptable.)
- Even if another link is being destroyed by the rename (i.e. a file already exists with the destination name), there is no point in time at which the destination file name does not exist; it always refers to either the old or the new content.
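The second guarantee is what makes the familiar "write a temporary file, then rename over the target" pattern work. Below is a minimal sketch of that pattern in Python, assuming a POSIX system; the helper name atomic_replace and the ".tmp-" prefix are made up for illustration. Note that it demonstrates atomicity only: the rename itself is not made durable here.

```python
import os
import tempfile


def atomic_replace(path: str, data: bytes) -> None:
    """Hypothetical helper: replace `path` so readers see old or new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory as the target because
    # rename(2) cannot move a file across file systems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # flush the file *contents* to disk first
        # Atomic switch of the name. The directory entry change is not
        # necessarily on disk when this call returns.
        os.rename(tmp_path, path)
    except BaseException:
        # Remove the temporary file if we failed before the rename happened.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A reader that opens the target at any moment gets either the previous contents or the new ones, never a missing or half-written file.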
I wrote this post because I did not know a priori what to look for when I began to doubt the robustness, during crash recovery, of two disparate systems using two-phase commit, one half of which was a file system. The keywords that came to mind did not yield useful search results, so I ended up walking around the Linux source instead, where I came upon dirsync. This use of the term is sufficiently obscure (it is much more often used as shorthand for ‘directory synchronization’, e.g. in rsync-ish tools) that one must disambiguate it by adding fairly specific keywords, such as ‘inode’. Hopefully this post will raise awareness of the possible danger faced by most programs that assume both the atomicity and the durability of metadata changes, and serve as good search-engine fodder to that effect.
Edit: I need to do some more investigation into what the tradeoffs are vs. fsync(). I think the benefit is mostly speed, from avoiding a heavy fsync() call. To the best of my knowledge, there is no fsync_metadata_only library function, and dirsync will give you those semantics, albeit using fairly blunt tools.
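For comparison, the commonly used per-call alternative to dirsync is to call fsync() on the directory itself after the rename. Here is a rough sketch in Python, assuming Linux (where a directory can be opened read-only and passed to fsync); the helper name durable_rename is hypothetical, and I have not measured how its cost compares with dirsync.

```python
import os


def durable_rename(src: str, dst: str) -> None:
    """Hypothetical helper: rename src to dst, then fsync the directories."""
    os.rename(src, dst)
    # Sync the directory entries themselves. If src and dst live in
    # different directories, both parents are synced.
    parents = {os.path.dirname(os.path.abspath(p)) for p in (src, dst)}
    for parent in parents:
        # Assumption: Linux-style behavior, where fsync on a directory
        # file descriptor flushes that directory's entries.
        dir_fd = os.open(parent, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

The difference from dirsync is scope: this pays the synchronization cost only at the call sites that ask for it, whereas the attribute or mount option makes every metadata change in the directory synchronous.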