Second NAS (Part 4: ZIL/L2ARC went REMOVED)

The zpool has gone DEGRADED.

# zpool status -v
  pool: hoge
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 09:48:45 with 0 errors on Wed Jun  8 12:50:31 2022
config:

        NAME        STATE     READ WRITE CKSUM
        hoge        DEGRADED     0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0p2  ONLINE       0     0     0
            ada1p2  ONLINE       0     0     0
            ada2p2  ONLINE       0     0     0
            ada3p2  ONLINE       0     0     0
            ada4p2  ONLINE       0     0     0
        logs
          nvd0p1    REMOVED      0     0     0
        cache
          nvd0p2    REMOVED      0     0     0

errors: Permanent errors have been detected in the following files:

        hoge/media/tmp:<0x0>
        hoge/var/mail:<0x0>
        hoge/var:<0x0>
        hoge/var/log:<0x0>
        hoge/var/db:<0x0>
        hoge/var/db/pkg:<0x0>
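
With a longer vdev list, it can help to pull out only the entries that are not ONLINE. The following is a minimal sketch, not something the original setup used: the function name list_unhealthy_vdevs is hypothetical, and it assumes the column layout of the zpool status output shown above (name in the first column, state in the second, between the "config:" and "errors:" lines).

```shell
#!/bin/sh
# Hypothetical helper: read `zpool status` output on stdin and print
# "NAME STATE" for every entry in the config section whose state is
# one of the non-ONLINE states listed below.
list_unhealthy_vdevs() {
    awk '
        $1 == "config:" { in_config = 1; next }   # config table starts
        $1 == "errors:" { in_config = 0 }         # config table ends
        in_config && $2 ~ /^(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ {
            print $1, $2
        }
    '
}

# Usage (assumes a pool named hoge, as in this post):
#   zpool status hoge | list_unhealthy_vdevs
```

Fed the status output above, this would list the pool itself as DEGRADED together with nvd0p1 and nvd0p2 as REMOVED, while skipping the "state:" header line because it sits outside the config section.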

/var/log/messages shows the following:

Jun 22 15:19:29 potato kernel: nvme0: RECOVERY_START TTTTTTTTTTTTTTTTT vs UUUUUUUUUUUUUUUUU
Jun 22 15:19:29 potato kernel: nvme0: Controller in fatal status, resetting
Jun 22 15:19:29 potato kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
Jun 22 15:19:29 potato kernel: nvme0: RECOVERY_WAITING
Jun 22 15:19:29 potato kernel: nvme0: resetting controller
Jun 22 15:19:29 potato kernel: nvme0: waiting
Jun 22 15:19:29 potato kernel: nvme0: failing outstanding i/o
Jun 22 15:19:29 potato kernel: nvme0: READ sqid:1 cid:121 nsid:1 lba:VVVVVVV len:4
Jun 22 15:19:29 potato kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:121 cdw0:0
Jun 22 15:19:29 potato kernel: nvme0: failing outstanding i/o
Jun 22 15:19:29 potato kernel: nvme0: WRITE sqid:2 cid:117 nsid:1 lba:WWWWWWW len:1
Jun 22 15:19:29 potato kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:117 cdw0:0
Jun 22 15:19:29 potato kernel: nvd0: detached
Jun 22 15:19:29 potato kernel: nvme0: READ sqid:2 cid:0 nsid:1 lba:8389176 len:16
Jun 22 15:19:29 potato kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:0 cdw0:0
Jun 22 15:19:30 potato kernel: nvme0: waiting
Jun 22 15:19:30 potato syslogd: last message repeated 1 times
Jun 22 15:19:30 potato ZFS[37628]: pool I/O failure, zpool=potato error=6
Jun 22 15:19:30 potato ZFS[37632]: vdev state changed, pool_guid=XXXXXXXXXXXXXXXXXXX vdev_guid=YYYYYYYYYYYYYYYYYYY
Jun 22 15:19:30 potato ZFS[37636]: vdev is removed, pool_guid=XXXXXXXXXXXXXXXXXXX vdev_guid=YYYYYYYYYYYYYYYYYYY
Jun 22 15:19:30 potato ZFS[37640]: vdev state changed, pool_guid=XXXXXXXXXXXXXXXXXXX vdev_guid=ZZZZZZZZZZZZZZZZZZZ
Jun 22 15:19:30 potato ZFS[37644]: vdev is removed, pool_guid=XXXXXXXXXXXXXXXXXXX vdev_guid=ZZZZZZZZZZZZZZZZZZZ
Jun 22 15:19:30 potato ZFS[37648]: pool I/O failure, zpool=hoge error=6
Jun 22 15:22:00 potato ZFS[37661]: pool I/O failure, zpool=hoge error=6
Jun 22 16:02:44 potato ZFS[37724]: pool I/O failure, zpool=hoge error=6
Jun 22 16:04:34 potato ZFS[37785]: pool I/O failure, zpool=hoge error=6
Jun 22 16:04:36 potato ZFS[37792]: pool I/O failure, zpool=hoge error=6
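
To find lines like these in a busy log, a simple filter is enough. This is a rough sketch under assumptions: the function name nvme_zfs_events is hypothetical, and the pattern only matches the "kernel: nvmeN:", "kernel: nvdN:" and "ZFS[pid]:" prefixes seen in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: keep only syslog lines from the nvme/nvd kernel
# drivers and from the ZFS event reporter, as they appear in the
# excerpt above. Reads the log on stdin.
nvme_zfs_events() {
    grep -E 'kernel: (nvme|nvd)[0-9]+:|ZFS\[[0-9]+\]:'
}

# Usage:
#   nvme_zfs_events < /var/log/messages
```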

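The nvme0 controller timed out and was reset, and nvd0 then detached. For some reason, the NVMe SSD apparently dropped out of the OS's view. After a reboot: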

# zpool status -v
  pool: hoge
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 09:48:45 with 0 errors on Wed Jun  8 12:50:31 2022
config:

        NAME        STATE     READ WRITE CKSUM
        hoge        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0p2  ONLINE       0     0     0
            ada1p2  ONLINE       0     0     0
            ada2p2  ONLINE       0     0     0
            ada3p2  ONLINE       0     0     0
            ada4p2  ONLINE       0     0     0
        logs
          nvd0p1    ONLINE       0     0     0
        cache
          nvd0p2    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        hoge/media/tmp:<0x0>
        hoge/var/mail:<0x0>
        hoge/var:<0x0>
        hoge/var/log:<0x0>
        hoge/var/db:<0x0>
        hoge/var/db/pkg:<0x0>
        hoge/home/potato:<0x0>

The pool is back ONLINE, but the leftover errors are annoying. Just to be safe, I ran zpool scrub.

# zpool scrub hoge

To get rid of the errors, I ran zpool clear hoge.

# zpool clear hoge

With that, the errors finally went away.

# zpool status -v
  pool: hoge
 state: ONLINE
  scan: scrub repaired 0B in 09:43:39 with 0 errors on Fri Jun 24 19:27:30 2022
config:

        NAME        STATE     READ WRITE CKSUM
        hoge        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0p2  ONLINE       0     0     0
            ada1p2  ONLINE       0     0     0
            ada2p2  ONLINE       0     0     0
            ada3p2  ONLINE       0     0     0
            ada4p2  ONLINE       0     0     0
        logs
          nvd0p1    ONLINE       0     0     0
        cache
          nvd0p2    ONLINE       0     0     0

errors: No known data errors
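
The recovery steps above (scrub, then clear, then check the status) were run by hand. They could be chained roughly as follows. This is a sketch under assumptions, not the procedure actually used here: the function name scrub_then_clear and the ZPOOL override variable are hypothetical, and zpool wait -t scrub assumes OpenZFS 2.0 or later.

```shell
#!/bin/sh
# Hypothetical wrapper: scrub a pool, block until the scrub finishes,
# clear the error counters, then print the resulting status.
# ZPOOL is an illustrative override so the sequence can be dry-run
# (for example ZPOOL=echo); it defaults to the real zpool command.
scrub_then_clear() {
    pool=$1
    zpool_cmd=${ZPOOL:-zpool}
    "$zpool_cmd" scrub "$pool" &&
    "$zpool_cmd" wait -t scrub "$pool" &&   # assumes OpenZFS 2.0+
    "$zpool_cmd" clear "$pool" &&
    "$zpool_cmd" status -v "$pool"
}

# Usage (a scrub of this pool took close to ten hours):
#   scrub_then_clear hoge
```

Each step runs only if the previous one succeeded, so a failed scrub does not lead to the errors being cleared unseen.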

Back to Part 3 / Continue to Part 5