September 29, 2010

Ubuntu SATA drive reset VT6421 bug

Upgrade time for an older storage server (really a workstation) running Ubuntu 10.04. The old workstation did not have any SATA controllers in it, so I had to install a new MASSCOOL XWT-RC018, a run-of-the-mill PCI SATA controller. I popped in two brand-new Samsung Spinpoint 1 TB drives (SAMSUNG Spinpoint F3 HD103SJ 1TB 7200 RPM) and started to set up software RAID-1 in Ubuntu.

Once the array was initiated, I watched the progress of the sync:
watch -n5 cat /proc/mdstat
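
The md driver reports its estimate in minutes (a `finish=...min` field on the progress line), which is awkward to read when the number gets large. As a rough sketch, a small helper can convert it to hours. The name `mdstat_eta_hours` is made up for illustration and is not a standard tool:

```shell
# Hypothetical helper: read mdstat-formatted text on stdin and print
# the estimated time remaining in hours for each syncing array.
mdstat_eta_hours() {
    awk 'match($0, /finish=[0-9.]+min/) {
        # Drop the "finish=" prefix (7 chars) and the "min" suffix (3 chars).
        mins = substr($0, RSTART + 7, RLENGTH - 10)
        printf "%.1f hours remaining\n", mins / 60
    }'
}

# Usage: mdstat_eta_hours < /proc/mdstat
```

For example, a progress line containing `finish=90.0min` would be printed as `1.5 hours remaining`. If no array is syncing, the helper prints nothing.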

It was going to take 6300 hours to complete! Wait a minute. So I checked /var/log/syslog and saw a continuous stream of errors similar to the following:


Sep 29 08:21:32 stargate kernel: [344767.032579] res 51/84:bf:00:00:00/84:81:00:00:00/e0 Emask 0x12 (ATA bus error)
Sep 29 08:21:32 stargate kernel: [344767.034658] ata5.00: status: { DRDY ERR }
Sep 29 08:21:32 stargate kernel: [344767.035702] ata5.00: error: { ICRC ABRT }
Sep 29 08:21:32 stargate kernel: [344767.036798] ata5: hard resetting link
Sep 29 08:21:32 stargate kernel: [344767.356060] ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Sep 29 08:21:32 stargate kernel: [344767.372388] ata5.00: configured for UDMA/33
Sep 29 08:21:32 stargate kernel: [344767.372409] ata5: EH complete
Sep 29 08:21:32 stargate kernel: [344767.396706] ata5.00: exception Emask 0x12 SAct 0x0 SErr 0x1000500 action 0x6
Sep 29 08:21:32 stargate kernel: [344767.397766] ata5.00: BMDMA stat 0x5
Sep 29 08:21:32 stargate kernel: [344767.398806] ata5: SError: { UnrecovData Proto TrStaTrns }
Sep 29 08:21:32 stargate kernel: [344767.399881] ata5.00: failed command: READ DMA EXT
Sep 29 08:21:32 stargate kernel: [344767.400944] ata5.00: cmd 25/00:00:bf:81:c3/00:04:41:00:00/e0 tag 0 dma 524288 in
Sep 29 08:21:32 stargate kernel: [344767.400946] res 51/84:bf:00:00:00/84:85:00:00:00/e0 Emask 0x12
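
With a log this noisy, it helps to know how many resets each port has taken. The sketch below counts the `hard resetting link` messages per ATA port. The function name `count_link_resets` is hypothetical, and it assumes the syslog line layout shown above:

```shell
# Hypothetical helper: read kernel log lines on stdin and print
# "<port> <count>" for every port that logged a hard link reset.
count_link_resets() {
    awk '/hard resetting link/ {
        # Find the "ataN:" field wherever it lands on the line.
        for (i = 1; i <= NF; i++)
            if ($i ~ /^ata[0-9]+:$/) {
                port = $i
                sub(/:$/, "", port)
                n[port]++
            }
    }
    END { for (p in n) print p, n[p] }' | sort
}

# Usage: count_link_resets < /var/log/syslog
```

If one port accounts for nearly all of the resets, that points at that drive, cable, or controller channel rather than at the array as a whole.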

Dang! Bad drive? I ran a bunch of tests using smartd and one drive had a few “blips”, but nothing that said it was bad. I RMA’d it just in case and received a replacement. Same issue!
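
In hindsight, the `ICRC` flag in the error lines was a hint: it marks an interface CRC error, meaning the data was corrupted in transit between drive and controller, not on the platters. Drives usually count these in SMART attribute 199, commonly named `UDMA_CRC_Error_Count`, though naming varies by vendor. A minimal sketch for pulling that raw value out of `smartctl -A` output follows. The helper name `crc_error_count` is invented here, and `/dev/sdX` is a placeholder:

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print the
# raw value (last column) of SMART attribute 199, the interface CRC counter.
crc_error_count() {
    awk '$1 == 199 || $2 == "UDMA_CRC_Error_Count" { print $NF }'
}

# Usage (as root; replace /dev/sdX with the real device):
#   smartctl -A /dev/sdX | crc_error_count
```

A count that keeps climbing while the drive's other attributes stay healthy suggests a cabling or controller problem, which is consistent with a replacement drive showing the same errors.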

Apparently there was a known bug with that controller (VIA VT6421 chip) and newer terabyte drives, and a patch to fix it already existed upstream. While this was an inconvenience, it was not urgent enough for me to compile my own kernel.

A few days ago the “2.6.32-25-generic” kernel was released for Lucid. I did a quick “apt-get dist-upgrade” and rebooted. All is well again! My RAID-1 sync is only going to take an hour now!