This article looks at how to write programs that cope with the I/O errors that cause lost writes on Linux. It should be a useful reference for anyone hitting the same problem; read on for the details.

Problem description


TL;DR: If the Linux kernel loses a buffered I/O write, is there any way for the application to find out?

I know you have to fsync() the file (and its parent directory) for durability. The question is if the kernel loses dirty buffers that are pending write due to an I/O error, how can the application detect this and recover or abort?
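For context, here is a minimal sketch (not from the original question) of that durability dance in C: write the data, fsync() the file, then fsync() the parent directory so the directory entry is durable too. The helper name and the blunt exit-on-error handling are just illustrative:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Create/overwrite `path`, write `len` bytes from `buf`, and fsync()
     * both the file and its containing directory `dir`. */
    static void write_durably(const char *dir, const char *path,
                              const void *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
            perror("file write/fsync");
            exit(1);
        }
        close(fd);

        int dfd = open(dir, O_RDONLY | O_DIRECTORY);
        if (dfd < 0 || fsync(dfd) != 0) {
            perror("directory fsync");
            exit(1);
        }
        close(dfd);
    }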

Think database applications, etc, where order of writes and write durability can be crucial.

Lost writes? How?

The Linux kernel's block layer can under some circumstances lose buffered I/O requests that have been submitted successfully by write(), pwrite() etc, with an error like:

Buffer I/O error on device dm-0, logical block 12345
lost page write due to I/O error on dm-0

(See end_buffer_write_sync(...) and end_buffer_async_write(...) in fs/buffer.c).

On newer kernels the error will instead contain "lost async page write", like:

Buffer I/O error on dev dm-0, logical block 12345, lost async page write

Since the application's write() will have already returned without error, there seems to be no way to report an error back to the application.

Detecting them?

I'm not that familiar with the kernel sources, but I think that it sets AS_EIO on the buffer that failed to be written-out if it's doing an async write:

    set_bit(AS_EIO, &page->mapping->flags);
    set_buffer_write_io_error(bh);
    clear_buffer_uptodate(bh);
    SetPageError(page);

but it's unclear to me if or how the application can find out about this when it later fsync()s the file to confirm it's on disk.

It looks like wait_on_page_writeback_range(...) in mm/filemap.c might be called by do_sync_mapping_range(...) in fs/sync.c, which is in turn called by sys_sync_file_range(...). It returns -EIO if one or more buffers could not be written.

If, as I'm guessing, this propagates to fsync()'s result, then an app that panics and bails out when it gets an I/O error from fsync(), and that knows how to re-do its work when restarted, should have a sufficient safeguard?

There's presumably no way for the app to know which byte offsets in a file correspond to the lost pages so it can rewrite them if it knows how, but if the app repeats all its pending work since the last successful fsync() of the file, and that rewrites any dirty kernel buffers corresponding to lost writes against the file, that should clear any I/O error flags on the lost pages and allow the next fsync() to complete - right?
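A rough sketch of that proposed recovery, assuming the application keeps (or can regenerate) a record of everything it has written since the last successful fsync(); the pending_write structure and helper name are made up for illustration:

    #define _XOPEN_SOURCE 700
    #include <sys/types.h>
    #include <unistd.h>

    /* One write the application has issued since its last known-good fsync().
     * The application itself must hold or be able to regenerate this data. */
    struct pending_write {
        off_t       off;
        const void *buf;
        size_t      len;
    };

    /* Rewrite the pending data, re-dirtying the affected pages, then fsync()
     * again; returns 0 only if the rewritten pages reach disk. */
    static int redo_since_last_fsync(int fd, const struct pending_write *w,
                                     size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (pwrite(fd, w[i].buf, w[i].len, w[i].off) != (ssize_t)w[i].len)
                return -1;
        return fsync(fd);
    }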

Are there then any other, harmless, circumstances where fsync() may return -EIO where bailing out and redoing work would be too drastic?

Why?

Of course such errors should not happen. In this case the error arose from an unfortunate interaction between the dm-multipath driver's defaults and the sense code used by the SAN to report failure to allocate thin-provisioned storage. But this isn't the only circumstance where they can happen - I've also seen reports of it from thin provisioned LVM for example, as used by libvirt, Docker, and more. A critical application like a database should try to cope with such errors, rather than blindly carrying on as if all is well.

If the kernel thinks it's OK to lose writes without dying with a kernel panic, applications have to find a way to cope.

The practical impact is that I found a case where a multipath problem with a SAN caused lost writes that landed up causing database corruption because the DBMS didn't know its writes had failed. Not fun.

Solution

fsync() returns -EIO if the kernel lost a write

(Note: early part references older kernels; updated below to reflect modern kernels)

It looks like failed async buffer write-outs in end_buffer_async_write(...) set an -EIO flag on the failed dirty buffer's page mapping for the file:

set_bit(AS_EIO, &page->mapping->flags);
set_buffer_write_io_error(bh);
clear_buffer_uptodate(bh);
SetPageError(page);

which is then detected by wait_on_page_writeback_range(...) as called by do_sync_mapping_range(...) as called by sys_sync_file_range(...) as called by sys_sync_file_range2(...) to implement the C library call fsync().

But only once!

This comment on sys_sync_file_range

 * SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any
 * I/O errors or ENOSPC conditions and will return those to the caller, after
 * clearing the EIO and ENOSPC flags in the address_space.

suggests that when fsync() returns -EIO or (undocumented in the manpage) -ENOSPC, it will clear the error state so a subsequent fsync() will report success even though the pages never got written.

Sure enough wait_on_page_writeback_range(...) clears the error bits when it tests them:

        /* Check for outstanding write errors */
        if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
                ret = -ENOSPC;
        if (test_and_clear_bit(AS_EIO, &mapping->flags))
                ret = -EIO;

So if the application expects it can re-try fsync() until it succeeds and trust that the data is on-disk, it is terribly wrong.
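In code, the broken pattern looks something like the sketch below (illustrative, not taken from any particular DBMS): the first failing fsync() consumes the error flag, so the retry can return 0 even though the data never reached disk:

    #include <errno.h>
    #include <unistd.h>

    /* BROKEN on the kernels discussed here: retry fsync() until it returns 0
     * and then trust that the data is durable. */
    static int flush_until_success(int fd)
    {
        for (;;) {
            if (fsync(fd) == 0)
                return 0;      /* can "succeed" after the pages were dropped */
            if (errno != EINTR && errno != EIO)
                return -1;     /* some unrelated failure: give up */
            /* On EIO the error flag has just been cleared, so looping here
             * is exactly the mistake that hides the lost write. */
        }
    }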

I'm pretty sure this is the source of the data corruption I found in the DBMS. It retries fsync() and thinks all will be well when it succeeds.

Is this allowed?

The POSIX/SuS docs on fsync() don't really specify this either way, and Linux's man-page for fsync() just doesn't say anything about what happens on failure.

So it seems that the meaning of fsync() errors is "dunno what happened to your writes, might've worked or not, better try again to be sure".

Newer kernels

On 4.9, end_buffer_async_write sets -EIO on the page's mapping, just via mapping_set_error:

    buffer_io_error(bh, ", lost async page write");
    mapping_set_error(page->mapping, -EIO);
    set_buffer_write_io_error(bh);
    clear_buffer_uptodate(bh);
    SetPageError(page);

On the sync side I think it's similar, though the structure is now pretty complex to follow. filemap_check_errors in mm/filemap.c now does:

    if (test_bit(AS_EIO, &mapping->flags) &&
        test_and_clear_bit(AS_EIO, &mapping->flags))
            ret = -EIO;

which has much the same effect. Error checks seem to all go through filemap_check_errors which does a test-and-clear:

    if (test_bit(AS_EIO, &mapping->flags) &&
        test_and_clear_bit(AS_EIO, &mapping->flags))
            ret = -EIO;
    return ret;

I'm using btrfs on my laptop, but when I create an ext4 loopback for testing on /mnt/tmp and set up a perf probe on it:

sudo dd if=/dev/zero of=/tmp/ext bs=1M count=100
sudo mke2fs -j -T ext4 /tmp/ext
sudo mount -o loop /tmp/ext /mnt/tmp

sudo perf probe filemap_check_errors

sudo perf record -g -e probe:end_buffer_async_write -e probe:filemap_check_errors dd if=/dev/zero of=/mnt/tmp/test bs=4k count=1 conv=fsync

I find the following call stack in perf report -T:

        ---__GI___libc_fsync
           entry_SYSCALL_64_fastpath
           sys_fsync
           do_fsync
           vfs_fsync_range
           ext4_sync_file
           filemap_write_and_wait_range
           filemap_check_errors

A read-through suggests that yeah, modern kernels behave the same.

This seems to mean that if fsync() (or presumably write() or close()) returns -EIO, the file is in some undefined state between when you last successfully fsync()d or close()d it and its most recently write()ten state.

Test

I've implemented a test case to demonstrate this behaviour.
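The test case itself isn't reproduced here, but a standalone sketch of the same experiment looks roughly like this: queue a dirty page, pause so the underlying device can be made to fail writes (for example by swapping in a device-mapper error target), then call fsync() twice and compare the two results. The prompt text and usage are illustrative:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <file-on-test-device>\n", argv[0]);
            return 1;
        }
        int fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        char buf[4096];
        memset(buf, 'x', sizeof buf);
        if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf) {
            perror("write");
            return 1;
        }

        puts("dirty page queued; make the device fail writes, then press Enter");
        getchar();

        /* On affected kernels the first fsync() reports the lost write... */
        printf("first  fsync(): %s\n", fsync(fd) ? strerror(errno) : "ok");
        /* ...and the second succeeds because the error flag was cleared. */
        printf("second fsync(): %s\n", fsync(fd) ? strerror(errno) : "ok");

        close(fd);
        return 0;
    }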

Implications

A DBMS can cope with this by entering crash recovery. How on earth is a normal user application supposed to cope with this? The fsync() man page gives no warning that it means "fsync-if-you-feel-like-it" and I expect a lot of apps won't cope well with this behaviour.
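For completeness, the pattern that does seem safe given all of the above, sketched under the assumption that the application has some independent copy of its recent work (a WAL, journal, or redo queue) to fall back on; enter_recovery() is a stand-in for whatever that looks like in a real application:

    #include <errno.h>
    #include <unistd.h>

    /* Hypothetical hook: replay recent work from an application-level journal
     * (or simply abort() and recover on restart). */
    extern void enter_recovery(void);

    /* Treat any fsync() failure as "the file's contents are now undefined":
     * never just retry fsync() and trust a later success. */
    static int durable_flush(int fd)
    {
        while (fsync(fd) != 0) {
            if (errno == EINTR)
                continue;
            enter_recovery();
            return -1;
        }
        return 0;
    }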

Bug reports

Further reading

lwn.net touched on this in the article "Improved block-layer error handling".

postgresql.org mailing list thread.

That concludes this article on writing programs to cope with I/O errors that cause lost writes on Linux. We hope the answer above is helpful, and thank you for your support!
