
I'll start off by saying I do have a backup of this pool, but this kind of thing, if it's not recoverable, seems pretty bad, so I want to figure out whether anything can be done. I was adding a drive to the zpool when an error caused a system crash.

This is the current output:

zpool import music
cannot import 'music': I/O error
    Destroy and re-create the pool from
    a backup source.

All the partitions look fine (I think) from zdb; I tried specifying them all with -d, and I tried excluding the one that was just added. A -X import has now been running for a few hours, which I'm not optimistic about; it's averaging 70 MB/s according to iotop, and it's not a large zpool.

Is there any other way to learn what it thinks went wrong and how to fix it? Also happy to post any more useful info.

zpand is a new contributor to this site.

1 Answer


First off, a couple of reality checks...

  1. An I/O error on import usually isn’t ZFS “losing its mind”, it’s the kernel saying it can’t read something it really needs. That’s most often a flaky disk, a bad cable, a sketchy HBA, a power hiccup, or one unlucky sector right where ZFS stores critical metadata.

  2. zpool import -X is the nuclear option. It can take a very long time, and the 70 MB/s you see in iotop doesn’t mean it’s making good progress, but it also doesn’t mean it’s dead. It’s just slow and painful, by design.

Before poking it any more...

  1. Stop doing anything that writes to those disks: no labelclear, no repartitioning, no “let’s try one more random thing”. Every write narrows your recovery options.

  2. If possible, plug the drives straight into a real HBA or the motherboard’s SATA ports. USB docks and cheap adapters are a great way to make recovery harder; in a nutshell, avoid the USB+ZFS combination entirely.

  3. Take notes and grab output while the system is stable.

Step 1: Make sure this isn’t just device path nonsense.

ZFS really likes stable device names. Use /dev/disk/by-id paths, not sdX roulette.

ls -l /dev/disk/by-id
zpool import -d /dev/disk/by-id
zpool import -d /dev/disk/by-id -n music

If the pool shows up cleanly there, that's good! If it only half shows up, or devices are missing, that’s already a big clue.
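To spot a vanished or renamed disk at a glance, it helps to print each by-id name next to the kernel device it resolves to. The demo below runs against a throwaway directory with fake links so it is self-contained; on the real machine you would point DIR at /dev/disk/by-id (the FAKE_DISK names are obviously placeholders):

```shell
# Point DIR at /dev/disk/by-id on the real system; the temp dir and
# FAKE_DISK links below only exist to make the demo self-contained.
DIR=$(mktemp -d)
ln -s /dev/sda "$DIR/ata-FAKE_DISK_1"
ln -s /dev/sdb "$DIR/ata-FAKE_DISK_2"

# One line per disk: stable name -> current kernel device.
mapping=$(for link in "$DIR"/*; do
  printf '%s -> %s\n' "${link##*/}" "$(readlink "$link")"
done)
printf '%s\n' "$mapping"
```

A disk that shows in `zpool status` history but has no by-id entry here, or one whose kernel name changed after the crash, is exactly the “device path nonsense” this step is trying to rule out.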

Step 2: Ask ZFS what it would do, without committing.

Start with a normal rewind preview:

zpool import -n -F -d /dev/disk/by-id music

Sidenotes...

a) -F is the “roll back a bit” option, and very often it's enough after a crash mid-add.

b) readonly=on is your seatbelt for any real import attempt: nothing gets written until you decide to trust the pool. I mean it!

c) cachefile=none keeps the experiment out of your system's pool cache while you poke around.
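Putting those sidenotes together: if the -n dry run reports the pool can be recovered, the actual (still cautious) attempt would look like this. The exact flag combination is my suggestion, built from the options above, not something the original output dictates:

```shell
# Same rewind, for real this time, but read-only and without touching
# the system's pool cache.
zpool import -F -o readonly=on -o cachefile=none -d /dev/disk/by-id music
```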

If that fails, then yeah, try the big hammer:

zpool import -X -o readonly=on -o cachefile=none -d /dev/disk/by-id music

Still slow, still ugly, but at least safe!

Step 3: Look at what’s actually screaming underneath.

While import is failing or crawling, check logs. They usually rat out the bad actor.

dmesg -T | grep -Ei 'zfs|zio|I/O error|timeout|reset|ata|nvme|scsi'

If you see link resets, timeouts, or one disk spamming errors, that’s your problem! Then pull S.M.A.R.T. on every disk, especially the one you were adding:

smartctl -a /dev/sdX

If S.M.A.R.T. looks ugly or the kernel log is full of retries, this is hardware first, ZFS second!
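To turn a noisy log into a suspect list, tally which device the errors cluster around. The excerpt below is a made-up example of the general shape such lines can take (real output will differ in detail); on the real box you would feed the same pipeline from dmesg -T:

```shell
# Hypothetical kernel-log excerpt; real output will differ.
log='[Tue Jan  1 10:00:01 2025] ata3.00: failed command: READ FPDMA QUEUED
[Tue Jan  1 10:00:02 2025] sd 2:0:0:0: [sdc] tag#5 timeout
[Tue Jan  1 10:00:03 2025] ata3.00: exception Emask 0x10 SAct 0x0 action 0xe frozen
[Tue Jan  1 10:00:09 2025] sd 1:0:0:0: [sdb] Attached SCSI disk'

# Keep only error-ish lines, extract the device token, count per device.
# The device that dominates this tally is your prime suspect.
tally=$(printf '%s\n' "$log" \
  | grep -Ei 'timeout|reset|failed|exception|I/O error' \
  | grep -oE 'ata[0-9]+\.[0-9]+|\[sd[a-z]+\]' \
  | sort | uniq -c | sort -rn)
printf '%s\n' "$tally"
```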

Step 4: Use zdb for actual signal, not vibes.

The useful commands are:

zdb -l /dev/disk/by-id/thatdisk
zdb -ul /dev/disk/by-id/thatdisk
zdb -e -p /dev/disk/by-id music

You’re checking:

a) Do all disks agree on the pool GUID?

b) Are labels readable everywhere?

c) Are there valid uberblocks or is one disk lying to you?
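As a concrete sketch of check (a), compare the pool_guid each label reports. The two label dumps below are fabricated and trimmed to the interesting fields (real zdb -l output carries far more); on the real system you would capture them per disk with `zdb -l /dev/disk/by-id/<disk>`:

```shell
# Hypothetical, trimmed label dumps; capture real ones with
#   zdb -l /dev/disk/by-id/<disk>
label_a="name: 'music'
txg: 4711
pool_guid: 1111222233334444"
label_b="name: 'music'
txg: 4709
pool_guid: 1111222233334444"

guid_a=$(printf '%s\n' "$label_a" | awk -F': ' '/pool_guid:/ {print $2}')
guid_b=$(printf '%s\n' "$label_b" | awk -F': ' '/pool_guid:/ {print $2}')

# Same pool_guid everywhere = the disks at least agree on membership.
# A txg far behind the others points at a stale, half-written label.
if [ "$guid_a" = "$guid_b" ]; then
  verdict="pool_guid matches"
else
  verdict="pool_guid MISMATCH"
fi
echo "$verdict"
```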

Step 5: If one disk is flaky, clone it before doing anything heroic.

If a drive is throwing read errors, stop touching it directly! Clone it with GNU ddrescue to a same-size or larger disk, then retry the imports on the clone. That's MUCH safer!

That’s the standard data-recovery playbook:

a) Image first.

b) Experiment later.

c) Originals go back on the shelf; they're your last resort.
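A minimal GNU ddrescue sketch of “image first”. The device names and mapfile path are placeholders: /dev/SOURCE is the flaky original, /dev/TARGET the same-size-or-larger clone, and the mapfile is what lets you stop and resume safely:

```shell
# Pass 1: grab everything that reads easily, skip the slow areas (-n).
ddrescue -f -n /dev/SOURCE /dev/TARGET rescue.map
# Pass 2: go back and retry only the hard sectors a few times (-r3).
ddrescue -f -r3 /dev/SOURCE /dev/TARGET rescue.map
# Then point all further import experiments at the clone, never the original.
```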

Step 6: If you get it imported read-only.

Don’t scrub right away if hardware still smells funny!

First:

a) Copy off anything important.

b) Stabilize hardware.

c) Scrub and replace bad disks.
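Once the hardware is trustworthy again, step (c) is the usual command pair (pool name from the question; the disk names are placeholders you'd fill in from zpool status):

```shell
zpool scrub music
zpool status -v music                      # watch progress and per-device error counters
zpool replace music <old-disk> <new-disk>  # swap out anything the scrub condemns
```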

What would actually help next... If you want real guidance instead of guesswork, post these:

  1. zpool import -d /dev/disk/by-id

  2. zpool import -n -F -d /dev/disk/by-id music

  3. zdb -e -p /dev/disk/by-id music

  4. The relevant dmesg -T errors.

  5. smartctl -a output for each disk in the pool.

Most likely outcomes for this exact situation:

a) Crash during add left the pool needing a rewind and -F fixes it.

b) The new disk or its cable is garbage and is poisoning the import.

c) Controller reset during the crash and one disk now shows up “different”.

d) One bad sector in the worst possible spot, where cloning saves the day.

Long story short... This usually ends up being hardware or a simple rewind, not “ZFS ate my pool”, so go get the logs and S.M.A.R.T. data and it’ll stop being a mystery pretty fast.

Good luck!

  • +1, and the general philosophy of your answer applies perfectly also to similar issues at least with MD, LVM, Btrfs, and GPFS, in my experience (only the actual commands will vary, of course). Most of the time, it's hardware that needs fixing first, and then it's typically quite easy to get the volume/array going, unless you make it much worse by randomly hitting things with hammers you find somewhere on the internet before actually understanding the root cause/fixing the flaky HW. In short, any kind of --force, or "read-write bruteforce repair" are last resort, not first thing to try. Commented 15 hours ago
  • Thank you for this well thought-out guide! I'll post more details in a reply below. Commented 15 hours ago
  • It dawned on me that something similar had happened before, so I tried something: I tried importing the pool omitting each device in turn. Eventually the pool imported perfectly (no -X needed), so something was up with that device (which by all my observations still worked; once I properly added another device in its place, I formatted it, and that worked too). The disk in question was the one I was intending to replace anyway, as I know it to be failing (slowly). Anyway, the issue is fixed, but if you have any more questions about what happened, I'm happy to answer them. Commented 15 hours ago
