Much like my fun with Firewire, I was fooled into thinking I had solved my disk lockup problem. A random fluctuation in the time-to-failure convinced me that swapping drives around (or maybe exercising the contacts more) had fixed a flaky connection. In fact, it had not, and several hours later my disks were freezing with a vengeance, sometimes after only 20 minutes of heavy I/O. The symptom was a disk access light stuck on, and a lot of messages like this in the kernel log:
Apr 9 18:04:25 lurch ata5: command 0x25 timeout, stat 0xd1 host_stat 0x61
Apr 9 18:04:25 lurch ata5: translated ATA stat/err 0xd1/00 to SCSI SK/ASC/ASCQ 0xb/47/00
Apr 9 18:04:25 lurch ata5: status=0xd1 { Busy }
Apr 9 18:04:25 lurch sd 6:0:0:0: SCSI error: return code = 0x8000002
Apr 9 18:04:25 lurch sdd: Current: sense key=0xb
Apr 9 18:04:25 lurch ASC=0x47 ASCQ=0x0
Apr 9 18:04:25 lurch end_request: I/O error, dev sdd, sector 63890855
Apr 9 18:04:25 lurch ATA: abnormal status 0xD1 on port 0xFFFFC2000033A487
Apr 9 18:04:25 lurch ATA: abnormal status 0xD1 on port 0xFFFFC2000033A487
Apr 9 18:04:25 lurch ATA: abnormal status 0xD1 on port 0xFFFFC2000033A487
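The 0xd1 status byte in those messages can be unpacked by hand. Here is a minimal sketch, assuming the classic ATA status register bit layout; the bit names are my own labels for illustration and are not taken from the kernel source, and bit 4 in particular has meant different things across ATA revisions.

```python
# Classic ATA status register bits, most significant first.
# Names are illustrative labels; bit 4 ("DSC") is redefined in later ATA revisions.
ATA_STATUS_BITS = [
    (0x80, "BSY"),   # device busy
    (0x40, "DRDY"),  # device ready
    (0x20, "DF"),    # device fault
    (0x10, "DSC"),   # seek complete (older drives)
    (0x08, "DRQ"),   # data request
    (0x04, "CORR"),  # corrected data (obsolete)
    (0x02, "IDX"),   # index (obsolete)
    (0x01, "ERR"),   # error
]


def decode_ata_status(status):
    """Return the names of all bits set in an ATA status byte."""
    return [name for mask, name in ATA_STATUS_BITS if status & mask]


print(decode_ata_status(0xD1))  # ['BSY', 'DRDY', 'DSC', 'ERR']
```

The ATA spec says the other bits are not meaningful while BSY is set, which would be consistent with the kernel reporting only "Busy" for this value.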
The trigger seemed to be an actual read/write error, which then degenerated into an endless sequence of “Busy” errors as the system retried the I/O operation over and over. The drive simply would not talk to the controller after the initial problem. It sounded very similar to my Firewire problems, where a bridgeboard would just fall off the bus under heavy load.
The new piece of data (and the difference from the Firewire case) was that the bug could be cleared by resetting the computer. I did not have to power cycle the drive (which is in a separate tower and has an independent power supply). This was very curious, and pointed to a problem with my SATA host controller, or the kernel error handling procedure, rather than just a crappy IDE-to-SATA converter board. If I could only reset the SATA bus when this error happened, the problem would go away…
I spent some quality time with Google, the linux-ide archives, and the kernel changelogs. I learned that in general, the libata code has extremely simple error handling, which is easy to understand and also ineffective at dealing with anything but the simplest of problems. This means it won’t do anything crazy, but it also won’t reset the bus when things are very broken. I also discovered that the SiI 3114 (sata_sil module) has a bit of a propensity for getting itself into bad states, which the error handling can’t recover from.
This is when I discovered the “sledgehammer,” as Jeff Garzik called it in the changelog. There is a known erratum for the SiI 3112/3114 which causes them to interact badly with certain Seagate drives. The fix is to detect when such a drive is connected, and then set a flag (“mod15write”) to clamp all ATA commands to no more than 15 sectors at a time. After the fix was put in, people started trying it out on other drives that were having problems, and when the mod15write flag fixed their problems, they assumed their drive was also afflicted by the SiI 3114 erratum. In fact, they had some other issue, and the mod15write flag just covered it up.
Rather than clutter up the mod15write detection table with bogus model numbers, Jeff Garzik added an explicit parameter, “slow_down”, which, when set to 1, enables mod15write on all drives connected to the controller. I don’t have one of the problematic Seagate drives, but I do have similar symptoms. So, I enabled the bug fix, and it did reduce the failure rate quite a bit. The performance hit was huge, though. Reads are now 25 MB/sec, vs. the 50+ MB/sec I was able to get before. (Note this parameter is only available as of kernel 2.6.16.)
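To get a feel for what a 15-sector clamp does to a large transfer, here is a rough sketch of the arithmetic. It is only an illustration and not how the kernel implements it: the function name, the 512-byte sector size, and the 128 KiB example request are all assumptions of mine.

```python
SECTOR_BYTES = 512   # assumed sector size
MAX_SECTORS = 15     # the mod15write clamp described above


def split_request(start_sector, sector_count, max_sectors=MAX_SECTORS):
    """Split one request into (start, length) chunks of at most max_sectors.

    Hypothetical helper for illustration only.
    """
    chunks = []
    sector = start_sector
    remaining = sector_count
    while remaining > 0:
        n = min(remaining, max_sectors)
        chunks.append((sector, n))
        sector += n
        remaining -= n
    return chunks


# A 128 KiB read is 256 sectors of 512 bytes.
chunks = split_request(0, (128 * 1024) // SECTOR_BYTES)
print(len(chunks), chunks[-1])  # 18 (255, 1)
```

So a request that could have gone out as a single command turns into 18 commands of at most 7680 bytes each, and every one of them pays its own per-command overhead.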
And the fix was not perfect. It was still possible, though much harder, to lock a drive when doing copies between two disks on the SiI 3114 controller. However, I made another discovery: a patch which fixes the port enumeration order for my much nicer Promise SATAII150 TX4 card (that runs the main internal server HD). This is a moderately annoying bug which means that the BIOS and Linux enumerate the 4 ports on the card in different orders, so the first drive for the BIOS (which gets booted) is actually /dev/sdc once the kernel loads. The patch makes it much less annoying to put multiple drives on the Promise TX4 card, so I moved the cables for 2 of the external drives over to the TX4, lightening the load on the SiI 3114 card. Now I’ve gone almost 12 hours with no I/O jams, and will continue to test things.
For now, things are stable. I think the long-term solution here is to get another TX4 card and not use the Silicon Image 3114 card anymore. The card was $20, so I guess I should not be too surprised. Silicon Image, however, has been very good about getting documentation on their hardware to Jeff Garzik, so perhaps support will improve in the future. I also saw that patches for more sophisticated error handling (needed to support NCQ, among other things) are in the pipeline. It wasn’t clear if they would squeak through the deadline for 2.6.17, but by 2.6.18 libata might be better at dealing with non-fatal errors.