Migrating a ZFS Pool from RAIDZ1 to RAIDZ2

19 points by mtlynch · 7/23/2025, 2:33:43 PM · mtlynch.io ↗

Comments (19)

commandersaki · 2h ago
Hi, can someone tell me the difference between RAIDZ1 and RAIDZ2 in a practical sense, as if you were in the field, rather than in theory?

I know RAIDZ1 is the ZFS variant of RAID5 and RAIDZ2 is the ZFS variant of RAID6; I'll use these terms interchangeably because I'm not too interested in the ZFS special sauce here.

In the early 2000s, a lot of people were pushing RAID5. Having worked in a hosting / colocation data centre for many years, I witnessed many RAID5 failures. What would happen is an array would degrade, and more often than not a second drive would fail under the extra load placed on the array while degraded. Failures also happened frequently during the rebuild process, partly because a lot of the HW implementations were flaky, but again also because of the undue stress a rebuild puts on all drives. This is why I would suggest a RAID10 setup at the time: with luck it could survive a double failure, and more importantly you could trivially use a software implementation, which is much safer. A lot of the motherboards at the time were also offering RAID, but this was really just a binary blob in the kernel doing software RAID behind a facade of hardware, which fooled a lot of people.

Well, we've finally done away with hardware / proprietary RAID and we have ZFS, mdadm, etc. I've normally dismissed RAID6/RAIDZ2 because of the parity/rebuild process and concerns about putting undue stress on the drives. But maybe that's premature, and I didn't really understand the consequences of a single drive failure versus a double drive failure. So this is what I want to know:

1. When a single drive fails, is there any undue stress on the array? Or, because the array can operate pretty much unaffected, is there actually no performance degradation until you rebuild the missing drive, with software RAID taking only a negligible CPU hit for the hashing/erasure coding? I guess the rebuild is really just the cost of a zfs scrub at that point, but at least it runs on an otherwise healthy array.

2. The good news with RAID6 over RAID10 is that you can always survive two drive failures. But I think this is where things get concerning: rebuilding across two drives places a lot of undue stress on the remaining disks, and if any of those disks dies, you're shit outta luck. That scenario looks a lot like a single drive failure in a RAID5 array. But again, I think the rebuild cost is that of a zfs scrub, just with the minimal set of disks. So RAIDZ2 would be a much more solid choice over RAID10, right? At least you always know you can survive a two-drive failure.
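
For concreteness, the failure-and-rebuild cycle at the command level looks roughly like this (pool name "tank" and disk names are placeholders):

    zpool status -x tank                   # pool reports DEGRADED after a drive dies
    # reads keep working; missing data is reconstructed from parity on the fly
    zpool replace tank OLD-DISK NEW-DISK   # kicks off the resilver
    zpool status tank                      # shows resilver progress and an ETA
    zpool scrub tank                       # for comparison: a scrub reads the same data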

SirMaster · 6h ago
Just buy 1 extra disk and use that, and then keep it as a spare for your next failure IMO.
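
If you'd rather let ZFS manage that spare instead of keeping it on a shelf, a hot spare is one option (pool and disk names are placeholders):

    zpool add tank spare SPARE-DISK             # register the disk as a hot spare
    # on a fault, ZED can pull the spare in automatically, or do it by hand:
    zpool replace tank FAILED-DISK SPARE-DISK
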
burnt-resistor · 44m ago
RAID10, whether mdraid+lvm+xfs (never use lvm RAID) or btrfs, is way more convenient in terms of rebuild speed, simplicity, and performance, and it also supports online growing and online shrinking (shrinking is btrfs only). If there are any failures in a batch of drives, that signals the need for possible proactive replacement. The biggest failure predictor Google found that SMART doesn't catch is slightly elevated temperature. ZoL (as opposed to SmartOS/Solaris ZFS) bit me before (an array permanently unmountable on good drives), there was absolutely zero (0) support, and they were shameless about it.
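
For reference, the mdraid+lvm+xfs stack described above might be assembled roughly like this (device names are placeholders):

    mdadm --create /dev/md0 --level=10 --raid-devices=4 \
        /dev/sda /dev/sdb /dev/sdc /dev/sdd    # 4-disk RAID10
    pvcreate /dev/md0                          # LVM on top for flexible volumes
    vgcreate vg0 /dev/md0
    lvcreate -n data -l 100%FREE vg0
    mkfs.xfs /dev/vg0/data                     # xfs grows online but cannot shrink
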
toast0 · 3h ago
I found this (and the subsequent discussion) super confusing, so let me restate your plan, and see if this is what you meant?

OP described going from 4x disks, adding 3x, and ending up with a 7x raidz2 pool. The steps, written as (disks present / pool width), are

4/4 (steal disk, set up new pool) -> 3/4 + 4/5 (copy and destroy old pool) -> 4/5 (add missing disk) -> 5/5 -> 7/7.

Your suggestion is to add 4x disks, so you can do

4/4 (set up new pool) -> 4/4 + 4/5 (copy and destroy old pool) -> 4/5 (add missing disk from old pool) -> 5/5 -> 7/7 + spare (hot or cold?).
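
In command form, the OP's route might look roughly like this, assuming the sparse-file trick for the missing fifth disk and OpenZFS 2.3+ raidz expansion (pool names "tank"/"tank2" and disk names are placeholders):

    zpool offline tank DISK4                    # 3/4: old raidz1 now degraded
    truncate -s 8T /tmp/fake                    # sparse file stands in for the 5th disk
    zpool create tank2 raidz2 NEW1 NEW2 NEW3 DISK4 /tmp/fake
    zpool offline tank2 /tmp/fake               # 4/5: the file is never actually written
    zfs snapshot -r tank@move
    zfs send -R tank@move | zfs recv -F tank2   # copy everything over
    zpool destroy tank                          # frees the remaining 3 old disks
    zpool replace tank2 /tmp/fake DISK1         # 5/5
    zpool attach tank2 raidz2-0 DISK2           # 6/6 via raidz expansion
    zpool attach tank2 raidz2-0 DISK3           # 7/7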

mtlynch · 6h ago
Why is that better? Don't you end up in the same state as my solution except you bought an extraneous 18 TB disk?
burnt-resistor · 37m ago
RAIDZ2 (survives 2 lost drives) is slower and more complicated than RAIDZ1 (survives 1 lost drive). The advantage of keeping a powered-off spare (which is good practice) is that it accrues zero wear and uses (near) zero power. The odds of 2 drives failing in rapid succession without a temperature rise or SMART issues are about the same as a jet turbine breaking down.

RAIDZ3 (survives 3 lost drives) is pure insanity in terms of computational demands. You're better off splitting up pools, using multiple systems, or replicating further up the stack, e.g. Ceph on multiple simple RAID1 or RAID10 storage nodes.

SirMaster · 6h ago
It would be better because then you wouldn't have to degrade your RAIDZ1 and run with zero redundancy, where one more failure could kill the pool at any moment.
mtlynch · 6h ago
The extra drive would also have zero redundancy. I know it's the risk of 1 drive failing vs. any of 3 drives failing, but disk failures follow a bathtub curve, so I'm more worried about a brand new drive failing than about 3 healthy drives that have been running successfully for months.

With your extra drive solution, I still have to recreate all my datasets and shares, whereas with my solution, they migrated intact, and I still had backups in case of pool failure. I could zfs send into a giant file on the 18 TB drive, but I'd be reluctant to do that because it's just an opaque file that I can't verify will restore successfully. With my solution, I had the two pools running side by side and could verify everything restored successfully onto the new pool before blowing away the old one.
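
The side-by-side migration described here is, roughly, a replication stream between two live pools (pool names are placeholders):

    zfs snapshot -r tank@migrate                     # snapshot every dataset recursively
    zfs send -R tank@migrate | zfs recv -F newtank   # datasets, properties, snapshots all come across
    zfs list -r newtank                              # spot-check that everything arrived
    zpool scrub newtank                              # verify checksums before trusting it
    zpool destroy tank                               # only after the new pool checks out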

SirMaster · 5h ago
You seem to be fundamentally misunderstanding me, because nothing you're writing here matches what I'm proposing.

Why did you pull a drive from your RAIDZ1 to purposefully degrade it?

I am not sure why you keep saying 18TB. Your drives are 8TB. I am suggesting that you should have simply bought another 8TB disk so you wouldn't have to degrade your RAIDZ1.

mtlynch · 5h ago
Oh, I see. I thought you meant I should buy an 18 TB drive, move my data there, blow up my pool, create a RAIDZ2 pool, then move my data back.

Yes, I agree I could have reduced the risk of pool failure if I'd bought an extra 8 TB disk and not degraded the pool.

So it came down to: do I definitely spend an extra $120 on an extra drive, or do I take the <1% chance that one of my three other disks fails in the 6-hour window while I'm migrating data? (Even at a pessimistic ~5% annualized failure rate per drive, three drives over six hours works out to roughly 3 × 0.05 × 6/8766 ≈ 0.01%.) I took the chance, but I also had my data backed up in cloud storage at the file level in case there was a pool failure.

In other words, it wasn't worth $120 to me to avoid a <1% risk of 8-ish hours of hassle recovering from cloud backup after a pool failure.

fn-mote · 2h ago
Really enjoyed the write-up.

I have to point out that the surprise risk of the $300 bill from Wasabi dwarfs the cost of the extra 8 TB disk.

In retrospect, I would have paid the money for the lowered risk, but everybody has a different tolerance for that.

Again, great work and very detailed.

mtlynch · 8h ago
Author here. Happy to take any questions or feedback about this post.
xmodem · 7h ago
> The neat part is that I did it with only three additional 8 TB disks and never transferred my data to external storage.

That's neat! I didn't know there was a way to do this while maintaining data redundancy.

> Step 1: Borrow one disk to create a RAIDZ2 pool

> To begin, I remove one disk from my original RAIDZ1 pool, leaving it in a degraded state.

Oh, there isn't. :facepalm:

mtlynch · 7h ago
What's the issue?

I have backups, so I was prepared for data loss if the pool failed.

Is this any riskier than recovering from a single disk failure on a RAIDZ1 pool?

benlivengood · 1h ago
Administrative risk, mostly. If you accidentally wipefs the wrong disk when moving it to the RAIDZ2 pool, or offline a real disk instead of a tmp disk, or make some other simple mistake, the zpool might be unrecoverable.
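
One way to reduce that class of mistake is to address disks only by immutable IDs and dry-run anything destructive (pool and path names are placeholders):

    ls -l /dev/disk/by-id/                    # map drive serials to sdX names
    zpool status -v tank                      # confirm which IDs belong to which pool
    wipefs --no-act /dev/disk/by-id/ata-XYZ   # dry run: prints what would be erased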

An alternative would have been to build a RAIDZ2 out of three new disks and two tmp files, copy the data over, and finally offline a RAIDZ1 disk and online it in the RAIDZ2 zpool. With a copy of the data in each pool, it would take an actual disk failure in both zpools to lose data during the resilvering (even though both were degraded), and when breaking the original zpool to add a replacement disk for the 5th device in the RAIDZ2 pool, you'd have to lose two existing disks in that pool to lose data.
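
A minimal sketch of that alternative, assuming 8 TB drives and placeholder pool/disk names:

    truncate -s 8T /tmp/fake1 /tmp/fake2        # sparse files stand in for 2 of the 5 vdevs
    zpool create tank2 raidz2 NEW1 NEW2 NEW3 /tmp/fake1 /tmp/fake2
    zpool offline tank2 /tmp/fake1              # raidz2 tolerates 2 missing members,
    zpool offline tank2 /tmp/fake2              # so the files are never actually written
    zfs send -R tank@move | zfs recv -F tank2   # old raidz1 stays intact at 4/4 meanwhile
    zpool offline tank OLD1                     # now break the old pool...
    zpool replace tank2 /tmp/fake1 OLD1         # ...4/5: back to one disk of parity slack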

No criticism; it's cool to be able to do moves with minimal resources, and I've also thought about potential ways to upgrade zpools under weird constraints, especially with no free SATA ports, for example. I had thought about using the four SATA ports to copy data from half the source disks in a RAIDZ2 to half the destination disks of a new RAIDZ2, while the removed disks would still form a functional zpool on their own as a backup. But ultimately I found a cheap extra box and went for a complete upgrade to larger disks, with the benefit of keeping the old box as a local mirror.

My upgrade was from a single RAIDZ2 of 4x 4 TB disks (8 TB usable) to a RAIDZ1 of 4x 14 TB disks (42 TB usable), because once both pools existed I was comfortable running the old disks in a 12 TB RAIDZ1 to receive snapshots from the primary zpool while I had <12 TB used. It still requires 2 disks to fail to lose data, and now I do some offsite backups as well (rsync and s3backer) and have a test system to rehearse upgrades on before touching the main system.
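
The receive-snapshots arrangement amounts to periodic incremental replication (pool and snapshot names are placeholders):

    zfs snapshot -r main@weekly-30
    # -I includes every intermediate snapshot since the last one the backup pool has
    zfs send -R -I main@weekly-29 main@weekly-30 | zfs recv -F backup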

SirMaster · 6h ago
Then why go through all this trouble at all?

1. Build a new RAIDZ2 pool with all the disks you plan to use.

2. Restore your backup to the new pool.

I keep a backup too, and this is how I move to a new, larger zpool with a new layout.

Either you have to do this because you don't have a backup, in which case it's risky. Or you don't need to do this because you have a backup and can just build your new pool and restore your data from it.
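
In command form, that simpler route is just (disk names are placeholders; assumes the backup is verified restorable, e.g. in a cloud bucket reachable with rclone):

    zpool destroy tank                               # retire the old raidz1
    zpool create tank raidz2 D1 D2 D3 D4 D5 D6 D7    # full redundancy from day one
    rclone sync remote:backup-bucket /tank           # restore the file-level backup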

xmodem · 4h ago
No issue. Just that the introduction to your post got me excited that I was going to learn how to do something I didn't previously think was possible.