Posted on: 2024-12-06
Some lessons I learned along the way while creating and expanding various storage arrays
Controllers
You need to put more disks in the system, but the internal SATA ports are all occupied? If it's a non-critical device you might think about typing "SATA Controller" into a shopping portal of your choice and finding some very cheap devices which cram up to 16 SATA ports onto an x4 or sometimes even x1 PCIe link. This does not spell "performance" ...
Those controllers are built to a price, but in general they work just fine, aside from some minor issues. Typically they have SATA controllers with two or four actual SATA ports and then route some, or all of them, through a SATA port multiplier. This means you are not just limited by the PCIe bandwidth, but also by multiple disks sharing the same SATA port. This still isn't a huge deal if you're building a backup server and don't need stellar performance. But there's one hidden issue: while the controller is occupied with one disk on the multiplier, it can't talk to the other disks on the same multiplier, as the port is busy. For RAID this is not great. If one harddrive responds slowly because it's dying, it drags down the whole array, since the other disks can't be accessed while the controller waits.
So what's the solution? Server-grade SAS (Serial Attached SCSI) HBAs (Host Bus Adapters). SAS controllers also speak SATA, so with some adapter cables they are pretty much interchangeable. They do come at a hefty price, but not if you're looking at the used components market. Companies regularly go belly-up or refresh their hardware just to stay on some support contract, which leaves a huge market of used components with lots of life left in them. They are usually wider (PCIe x8) than SATA controllers and can thus accommodate the full bandwidth of all attached devices.
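If you want to check which link width a controller actually negotiated, lspci will tell you (the PCI address 01:00.0 below is just a placeholder):

```sh
# Find the PCI address of your SATA/SAS controller
lspci | grep -i -E 'sata|sas|raid'

# Compare the maximum link (LnkCap) with what was actually
# negotiated (LnkSta); replace 01:00.0 with your controller's address
sudo lspci -vv -s 01:00.0 | grep -E 'LnkCap:|LnkSta:'
```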
The new cheap SATA controllers and port multipliers cost about 80€. For roughly the same amount we later upgraded to a used SAS HBA. Running a full ZFS scrub of a largish pool took about a week before; it went down to a little over a day after the upgrade! And that is with all disks working fine. Once harddisks eventually start dying, I'll be really glad to have made the upgrade.
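For reference, kicking off and monitoring a scrub looks like this; `tank` is a placeholder pool name:

```sh
# Start a scrub of the pool
zpool scrub tank

# Check progress, scan rate and estimated time remaining
zpool status tank
```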
Lesson: Don't buy new cheap SATA controllers, buy used SAS HBAs for the same price.
But what if I want even more disks?
SATA port multipliers don't work with SAS, but SAS has its own implementation, called port expanders. Unlike their SATA counterparts they were actually designed properly and not bolted on as an afterthought in the specification. Since the controller still "speaks" SAS to the expander, even with SATA disks attached, there are no issues with accessing multiple drives at the same time, even though the drives themselves are SATA.
If you're looking for something ready to buy, look for JBOD arrays on the used market. They allow you to attach 20 or more disks to a single controller. Beware of the fan noise, though: those are made for datacenters, not living rooms.
If you want to build a JBOD yourself, there are expander cards available which look like a PCIe card but don't actually slot in there. Or they do, but just for power. Most of them have a Molex power connector as well. You could take a regular PC tower case, put in a power supply, one of those cards and a lot of disks. But no mainboard! It will connect to your main server via a SAS cable (SFF-8088 or similar). Take the ATX connector from the power supply which would normally go into the mainboard and connect the green wire to one of the black wires. This can be done cheaply with a paperclip, and you're good to go. You don't even have to insulate the paperclip as it's at the same voltage as the metal case itself.
Harddisks
While there is no technical limit in the previous setup that prevents using SSDs, I was mostly focused on harddisks. A single SSD can usually saturate a SATA or SAS link, so if you want performance, make sure each disk has a dedicated connection to the host, or better yet, switch to NVMe. For bulk storage, harddisks are way cheaper.
There was a lot of hate for SMR disks (shingled magnetic recording) but I've used many of them in storage arrays for years and had no issues whatsoever. Performance wasn't great - but then again - for performance just use SSDs. Harddisks are for bulk storage.
When buying harddisks, primarily look for density. The higher, the better. The fewer platters you have to keep spinning, the lower the power cost, which will be your main cost factor.
I'm not sure about used harddisks. Most of the time it's not worth it, as smaller disks use too much power anyway and larger disks don't have good availability on the used market. You also don't know whether they have been treated well - if not, a harddisk might suddenly die on you, negating all the savings.
SSDs
There are so many on the market that it's really difficult to choose. Do I get the budget "Evo" or the Pro series? TLC or QLC? If you're not into benchmarking the hell out of it and just want it for regular desktop or fileserver use, any disk will do fine.
The primary difference between NVMe and SATA or SAS is interface speed, so I won't go into detail about it. "Pro" models usually have some more spare blocks than cheaper ones, so your disk might last a bit longer or be a bit faster, but the difference is negligible. We should, however, look into the various types of flash memory. The difference is how many bits they cram into each NAND flash cell:
Technology | Bits/Cell | Remarks |
---|---|---|
SLC | 1 | Fast write performance, very reliable, expensive |
MLC | 2 | A little bit slower than SLC, but still quite good |
TLC | 3 | OK-ish write performance and reliability |
QLC | 4 | Slow write performance, reduced reliability |
So with QLC, a disk can pack four times the amount of data into a flash chip compared to SLC! To mitigate write performance issues, a QLC drive typically has a small percentage of its total space formatted as SLC and writes all data to it first. When disk load is low, data is shuffled over from SLC to the slower QLC part of the disk. The QLC part is also verified from time to time, and data in failing blocks is shifted to new ones or simply refreshed. All this happens entirely inside the SSD and is totally invisible to the operating system or user.
That's why I said for typical loads it doesn't matter. Your QLC drive will work just as fast as an SLC drive. With one exception: if you copy huge amounts of data onto the disk at once (like when ingesting video material), the SLC cache will overflow and now you have to wait until the data is actually written to the QLC. You'll notice this as a sharp drop in write speed while copying files. So don't worry, the disk is fine, it just can't keep up.
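If you want to see this effect for yourself, a sequential write larger than the drive's SLC cache will usually show it; the mount point and size here are just examples:

```sh
# Write ~100 GB sequentially, bypassing the page cache, and watch
# the throughput; once the SLC cache is full, the speed drops sharply
dd if=/dev/zero of=/mnt/ssd/testfile bs=1M count=100000 oflag=direct status=progress

# Clean up afterwards
rm /mnt/ssd/testfile
```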
Another specialized use case is CEPH or complex databases. For this you actually need datacenter SSDs. As someone who ignored the consumer / enterprise market segmentation in the harddisk market, I chose to ignore it for SSDs too, and it bit me very hard. Let's say it right here and now: Never run CEPH on consumer SSDs! No "but my load is very light", no "I'm only storing a few objects". No! Just don't! Get datacenter SSDs right away!
Why datacenter SSDs?
So what's so special about them? First, most of them aren't packed as densely, so you'll typically find MLC or sometimes TLC disks. This grants you full continuous write performance without slowing down. But even for light CEPH and database loads, datacenter SSDs are beneficial. Why?
With any kind of storage media, when issuing a `write()` syscall on a file, the data doesn't arrive on the actual disk immediately. The syscall returns quickly, the application continues working and the data stays in a buffer in the OS kernel. At some point it's sent to the disk, which will then place it somewhere, thus making sure saved data is actually saved. In most cases this behaviour is fine. But sometimes you really need to make sure the data has been safely stored on the disk before continuing. For example, all the SQL queries of a transaction need to be stored completely before the next transaction starts. This preserves integrity even when the power fails or the system crashes. Or CEPH needs to ensure storage is consistent across all nodes. To do this, applications call `fsync()`. Now the OS has to empty out those buffers and actually write everything to the disk. The disk itself, even if it has a DRAM cache, needs to make sure all the data has been written at least to the SLC portion of the drive. DRAM contents are lost when there is a power failure, so data staying only there is no good. When all this writing is done, the application may finally continue. On consumer SSDs this causes a lot of slowdown and wears out the drive.
Datacenter disks have a trick up their sleeves called power loss protection: they have some capacitors which store a tiny amount of electrical energy. This is just enough so that when the power fails, they can dump their DRAM cache onto the SLC flash. This suddenly turns the DRAM cache mentioned above into a "safe space" for newly written data. The disk can report write completion a lot faster, even if the data hasn't actually hit the flash yet, thus speeding up the whole operation.
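A quick way to see the difference for yourself is a small synchronous write benchmark with fio, which issues an `fsync()` after every write; the target path is just a placeholder:

```sh
# 4k synchronous writes with an fsync() after each one.
# Consumer SSDs often end up at a few hundred IOPS here, while
# drives with power loss protection can be orders of magnitude faster.
fio --name=synctest --filename=/mnt/ssd/fio-test \
    --rw=write --bs=4k --size=1G \
    --ioengine=psync --fsync=1 --numjobs=1
```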
But let's not overdo it. Even if you run a small MySQL or Postgres database on your machine for development, there is no real difference between a consumer and a datacenter SSD. The OS of course has to `fsync()` some stuff internally too, and all SSDs are designed to handle that. It only really matters for server systems accessed by multiple people. Or CEPH. Don't use consumer SSDs with CEPH! Did I mention this already?
Looking at the used hardware market is also a good solution for SSDs. Unlike harddisks where you don't really know how they were treated, SSDs can report their health reasonably well using SMART. Used disks won't go down to consumer price levels though, but you can still save a lot of money by buying used.
About wearout ... and HP
NVMe disks, at least, report two different kinds of wearout. Flash cells die from being overwritten too often, and their data needs to be placed elsewhere. Manufacturers specify an amount of TBW (total bytes written) at which the device is considered "worn out". In the SMART log this is called "Percentage Used". One might think that when it gets to 100%, the disk is dead. But it's not. It only means the manufacturer will no longer replace it under warranty, or other service contracts will run out. A more important value to consider is the "Available Spare". It tells you how many spare blocks, which can replace broken ones, the drive still has available. This number will hopefully stay at or around 100% for a long time. When it starts falling, you should really look into replacing the disk. When it hits a critical threshold a good SSD will turn itself read-only, thus protecting the data stored on it. But don't rely on that behavior, it might also just suddenly die. We have some SSDs (for non-critical caching systems) sitting at "150% used" and working just fine, as they still have spare blocks.
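Both values are easy to read from the command line; `/dev/nvme0` is just an example device:

```sh
# Full SMART report, including "Percentage Used" and "Available Spare"
smartctl -a /dev/nvme0

# Or, with nvme-cli, just the health log
nvme smart-log /dev/nvme0
```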
All disks will report those values via SMART. The "Percentage used" value is also nicely displayed in Proxmox Virtual Environment, if you use that.
Unless you have HP disks! Yes, I violated my own rule about not buying HP equipment and I got bitten again. I should have really known better. But they looked like Intel SSDs with a tiny "HPE" sticker on them. Honestly, I couldn't imagine how HP could have screwed this up, but they found a way. Even though the SSDs are made by some other manufacturer, if they are HP branded, they run HP firmware. They will work just fine in any computer, but they do not report SMART values. Why? Because Fuck You! That's why. HP disks report their health stats only to (some) HP SmartArray controllers, not through standard means. Yes, HP put in lots of effort making their own firmware modifications, just to vendor-lock you in. If you actually have one of those HP servers, some of them will act up if used with non-HP disks. They accept them, but all system fans will constantly spin at maximum RPM, wasting power and wearing them out faster. There is no other reason for this except being pure evil.
More Speed! Combining SSDs and HDDs
At the moment everything is shifting to SSDs. I would not recommend using harddisks as main drives for anything anymore, but they do still serve a need for bulk storage of data. Also, ripping out all harddisks and replacing them with SSDs might be too costly. Luckily there are several methods we can use to improve the speed of harddisks, making them nearly as fast as SSDs in many practical applications. This isn't limited to harddisks, though: smaller, faster, but pricier SSDs can be used to speed up slower SSDs.
A few years ago one could buy so-called SSHDs, which do this combination of SSD and HDD directly in hardware, transparently to the operating system. I've never had one of those and I don't really like the concept. You can't easily scale it up or down, or control what goes on the SSD part and what goes on the HDD part. And when either part dies, you have to throw away perfectly good flash memory or a working harddrive assembly.
This post will focus on ZFS, as this is what I'm most experienced with and have used for well over a decade for all important systems, big and small. All the methods below can be combined to get a bigger impact.
Get more RAM
Up to a point, this will bring you the biggest improvement in performance on any system. RAM which is not used by applications is instead utilized as disk cache, so after some "warm up" time, most requests can be served from blazing fast RAM. This should be your first upgrade. And don't worry about dual-channel and those things: except under certain workloads, more RAM will always benefit overall performance. At some point, however, a limit is reached. This limit may be set by the chipset or by the amount of money you can spend. And at some size you hit diminishing returns, though where exactly depends entirely on the workload.
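On a ZFS system you can check how much RAM the ARC is currently using and how well it performs with the tools shipped with OpenZFS:

```sh
# Summary of ARC size, hit ratio and other statistics
arc_summary | less

# Live view of ARC hits, misses and size, updated every second
arcstat 1
```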
Use an L2ARC (Level 2 Adaptive Replacement Cache)
ZFS can use SSDs as an additional read cache. While this seems pretty straightforward, it often doesn't bring that many gains in performance. Worse, if set too large it might actually slow things down, as the L2ARC also needs some RAM for management, decreasing the amount of RAM available for disk caching. Depending on the workload, the cache hit rate may be very low.
Testing it is easy though, as a cache device can be added or removed at any time. Even when the L2ARC SSD dies suddenly, the zpool will just continue to function, but perhaps slower.
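Adding and removing a cache device is a single command each; `tank` and the device path are placeholders:

```sh
# Add an SSD as L2ARC to the pool
zpool add tank cache /dev/disk/by-id/ata-SOME-SSD

# Check utilization; the cache device shows up in its own section
zpool iostat -v tank

# Remove it again if it doesn't help
zpool remove tank /dev/disk/by-id/ata-SOME-SSD
```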
Use a ZIL (ZFS Intent Log, also called SLOG)
While the previous two methods speed up reading, this one speeds up writing. But not all writing! Remember the `fsync()` problem mentioned further above? With harddisks it's even worse, as the disk head has to physically move to the correct track, wait until the right spot has arrived under it, and only then write the data. Luckily there's a solution similar to what datacenter SSDs are doing - we just place the data somewhere else first and report completion back to the application. This is where the ZIL comes in. It's only ever used for those synchronous writes. If an SSD is used as ZIL, `fsync()`ed data is written to it first and later, when some I/O capacity is available, it's written out to the actual storage disks. Under normal circumstances the ZIL is never read from, only written to, as the data is also held in RAM. It is only read when the system crashes and the pool is reimported. Luckily the ZIL can be tiny; a few GB is typically enough for TBs of pool size. But the SSD needs to survive a lot of write cycles. Battery-backed DRAM SSDs are ideal for this, but since they are somewhat of a niche, they are very expensive and hard to find. Regular power-loss-protected, write-intensive SSDs will work well.
When a ZIL device dies during normal operation, it is simply removed from the pool and everything continues functioning, albeit slower. With previous ZFS versions it used to be a big problem if the ZIL died after a crash; basically the pool would become damaged. With recent ZFS versions it's still something that should really be avoided, since it causes data loss, but it will not completely destroy the pool anymore.
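Adding a SLOG works the same way as a cache device; given its role after a crash, mirror it (pool and device names are placeholders):

```sh
# Add a mirrored pair of power-loss-protected SSDs as SLOG
zpool add tank log mirror \
    /dev/disk/by-id/nvme-DC-SSD-1 /dev/disk/by-id/nvme-DC-SSD-2

# Verify: the log devices appear under a separate "logs" section
zpool status tank
```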
Use a special device
This method also differs from the previous ones. All previous methods only provide caching; they do not provide extra storage. This is where a special metadata device comes in. Attaching a special device will cause ZFS to write all metadata to it instead of the main storage, making it faster to access. Since it becomes an integral part of the pool, it should have the same redundancy level as the main storage. Losing it will cause the entire pool to fail! In regular configurations it should be at least mirrored across two devices. ZFS can also be configured to store small files on this special device. This is done based on block size, though, not file size. The typical record size is 128KB, so setting `special_small_blocks` to that value or higher will not have the intended effect, as now everything will go to the special device instead of main storage. Care must also be taken when using ZFS volumes, as they have their own block size, which is typically 8 or 16KB. So setting it to that value will cause all data from zvols to go onto the special device.
Luckily an overflowing special device will not be the end of the pool, as when it runs out, data is simply stored on the main, slower, disks again. But there is no easy way to get the wrongly placed data out after fixing a misconfiguration except a `zfs send` / `zfs receive` roundtrip of the data.
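A minimal sketch of setting this up; pool, dataset, and device names are placeholders:

```sh
# Add a mirrored special device for metadata (and optionally small blocks)
zpool add tank special mirror \
    /dev/disk/by-id/nvme-SSD-1 /dev/disk/by-id/nvme-SSD-2

# Store blocks up to 32K on the special device for this dataset.
# Keep this well below the recordsize (default 128K), otherwise
# everything ends up on the special device.
zfs set special_small_blocks=32K tank/data

# Check how full the special vdev is
zpool list -v tank
```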
Side note: Don't use dedup
It seems like an amazing feature for backups or VM storage, but it gobbles up RAM like there's no tomorrow and tanks I/O performance. It's way cheaper to just buy more disk space instead of all the RAM required to support it correctly. It's tempting, but don't fall for it. If you want deduplication for backups, try Proxmox Backup Server which does dedup internally, not relying on that ZFS feature.
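If you're curious whether dedup is enabled anywhere and what it actually achieves, both are easy to check:

```sh
# The DEDUP column shows the dedup ratio (1.00x means effectively unused)
zpool list tank

# Is the property enabled on any dataset?
zfs get -r dedup tank

# Histogram of the deduplication table (only meaningful if dedup was on)
zpool status -D tank
```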
Phew ... that's a lot of material. I hope you learned something from this blog post; it contains many lessons I've learned, some of them the hard way. Mostly: used server gear is better than the crap you find on Amazon. I also have a deep hatred of HP, but this is very subjective; your experience with that company may vary.
If there's one single thing to take away: Don't use CEPH on consumer SSDs.