smart_alert.py: No such file or directory: '/tmp/.smartalert' when reporting Current_Pending_Sector

knormoyle · Apr 21, 2015

New system. Yeah, I should get better cooling.

Latest FreeNAS:
FreeNAS-9.3-STABLE-201504152200
Intel(R) Core(TM)2 CPU 6420 @ 2.13GHz
8165MB

I was doing a big backup and noticed these messages about a pending sector error (I guess).

the interesting thing is the smart_alert.py error
it seems to want /tmp/.smartalert

but can't find it.

It's nothing I did, so I'm assuming this happens on every SMART alert.

freenas install was standard, nothing special.

I can do
smartctl -a /dev/ada1
and
smartctl -a /dev/ada2
fine. ada2 has the pending sector:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 177 171 021 Pre-fail Always - 4116
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 62
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 095 095 000 Old_age Always - 3893
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 52
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 40
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 21
194 Temperature_Celsius 0x0022 111 106 000 Old_age Always - 36
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 3

this is from /var/log/messages

Apr 20 23:48:46 mr-0xs2 smartd[2543]: Warning via /usr/local/www/freenasUI/tools/smart_alert.py to kevin@0xdata.com: failed (32-bit/8-bit exit status: 256/1)
Apr 20 23:48:46 mr-0xs2 smartd[2543]: Device: /dev/ada2, previous self-test completed with error (read test element)
Apr 20 23:48:46 mr-0xs2 smartd[2543]: Device: /dev/ada2, Self-Test Log error count increased from 0 to 1
Apr 20 23:48:51 mr-0xs2 smartd[2543]: Warning via /usr/local/www/freenasUI/tools/smart_alert.py to kevin@0xdata.com produced unexpected output (299 bytes) to STDOUT/STDERR:
Apr 20 23:48:51 mr-0xs2 smartd[2543]: Traceback (most recent call last):
Apr 20 23:48:51 mr-0xs2 smartd[2543]: File "/usr/local/www/freenasUI/tools/smart_alert.py", line 66, in <module>
Apr 20 23:48:51 mr-0xs2 smartd[2543]: main()
Apr 20 23:48:51 mr-0xs2 smartd[2543]: File "/usr/local/www/freenasUI/tools/smart_alert.py", line 45, in main
Apr 20 23:48:51 mr-0xs2 smartd[2543]: with open(SMART_FILE, 'rb') as f:
Apr 20 23:48:51 mr-0xs2 smartd[2543]: IOError: [Errno 2] No such file or directory: '/tmp/.smartalert'
Apr 20 23:48:51 mr-0xs2 smartd[2543]: Warning via /usr/local/www/freenasUI/tools/smart_alert.py to kevin@0xdata.com: failed (32-bit/8-bit exit status: 256/1)
Apr 21 00:00:00 mr-0xs2 syslog-ng[1811]: Configuration reload request received, reloading configuration;
Apr 21 00:18:46 mr-0xs2 smartd[2543]: Device: /dev/ada1, Temperature 40 Celsius reached critical limit of 40 Celsius (Min/Max 37/40)
Apr 21 00:18:47 mr-0xs2 smartd[2543]: Device: /dev/ada2, 1 Currently unreadable (pending) sectors
Apr 21 00:18:52 mr-0xs2 smartd[2543]: Warning via /usr/local/www/freenasUI/tools/smart_alert.py to kevin@0xdata.com produced unexpected output (299 bytes) to STDOUT/STDERR:
Apr 21 00:18:52 mr-0xs2 smartd[2543]: Traceback (most recent call last):
Apr 21 00:18:52 mr-0xs2 smartd[2543]: File "/usr/local/www/freenasUI/tools/smart_alert.py", line 66, in <module>
Apr 21 00:18:52 mr-0xs2 smartd[2543]: main()
Apr 21 00:18:52 mr-0xs2 smartd[2543]: File "/usr/local/www/freenasUI/tools/smart_alert.py", line 45, in main
Apr 21 00:18:52 mr-0xs2 smartd[2543]: with open(SMART_FILE, 'rb') as f:
Apr 21 00:18:52 mr-0xs2 smartd[2543]: IOError: [Errno 2] No such file or directory: '/tmp/.smartalert'

thanks
kevin
[UPDATE] added current /tmp
[root@mr-0xs2] /tmp# ls -ltra
total 36
drwxr-xr-x 29 root wheel 1160 Apr 20 17:18 ../
drwx------ 2 www wheel 0 Apr 20 17:18 firmware/
-rw-r--r-- 1 root wheel 1666 Apr 20 17:18 ixdiagnose_boot.log
-rw-r--r-- 1 root wheel 1077 Apr 20 17:20 rc.conf.freenas
-rw-r--r-- 1 root wheel 33 Apr 20 17:20 freenas_config.md5
-rw------- 1 root wheel 168 Apr 20 17:20 sessionidpvk8hfy24fri4mia3w8difzmi8mgimrc
-rw-r--r-- 1 root wheel 0 Apr 20 23:48 mr-0xs2.local-1807400.97265348862369687321562
-rw-r--r-- 1 root wheel 0 Apr 20 23:48 mr-0xs2.local-1807400.97315348862369687321562
-rw-r--r-- 2 root wheel 0 Apr 21 00:18 mr-0xs2.local-1807400.100685348862369687321562
-rw-r--r-- 2 root wheel 0 Apr 21 00:18 .smartalert.lock
-rw------- 1 root wheel 168 Apr 21 02:53 sessionidp9x351bndeg2woqchbcjuj17mjbzm0z9
drwxr-xr-x 6 www www 160 Apr 21 03:00 nginx/
-rw-rw-r-- 1 root wheel 928 Apr 21 03:11 alert
drwxrwxrwt 5 root wheel 520 Apr 21 03:12 ./
drwxrwxrwt 2 root wheel 0 Apr 21 03:12 vi.recover/

dlavigne · Apr 21, 2015

knormoyle said:
the interesting thing is the smart_alert.py error
it seems to want /tmp/.smartalert
but can't find it.

Sounds like a bug. Please report it at bugs.freenas.org and post the issue number here.
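
For the bug report, it may help to show what a more tolerant read could look like. The traceback shows smart_alert.py opening SMART_FILE unconditionally at line 45. Below is a minimal sketch, assuming the file simply has not been created yet when the first alert fires. `read_alert_state` is a hypothetical helper, not the actual FreeNAS code, and how the real script parses the file is not known from this thread.

```python
import errno

SMART_FILE = '/tmp/.smartalert'  # path taken from the traceback above


def read_alert_state(path=SMART_FILE):
    """Return the raw contents of the alert state file, or b'' if it is missing.

    Hypothetical helper: the real smart_alert.py may store and parse
    this file differently.
    """
    try:
        with open(path, 'rb') as f:
            return f.read()
    except IOError as e:
        # Only swallow "No such file or directory"; re-raise anything else.
        if e.errno != errno.ENOENT:
            raise
        return b''
```

Treating a missing file as "no previous alerts" keeps smartd's warning script from exiting with status 1, which is what the "failed (32-bit/8-bit exit status: 256/1)" lines in the log reflect.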

knormoyle · Apr 21, 2015

Thanks dlavigne.
I created /tmp/.smartalert to make the message go away.
A secondary issue is forcing the sector remap.
I'm assuming a zpool scrub should touch the entire drive and trigger the disk to remap the bad sector (I suppose ZFS isn't involved in forcing the sector remap, just touching the sector should trigger the drive to remap?)

It's odd to me that the sector remap is delayed so much (it's still there). I shouldn't have to rewrite the whole disk.

-kevin

Bidule0hm · Apr 21, 2015

Only a write can trigger a remap (the sector may actually be OK and the error just a hiccup, so the drive waits for a write to see whether the sector is really bad or not). You can use the dd command, with extreme caution, to write only this sector and force the remap ;)

knormoyle · Apr 21, 2015

ok thanks..yeah wikipedia seems to say the same thing:

"Typically, automatic remapping of sectors only happens when a sector is written to. The logic behind this is presumably that even if a sector cannot be read normally, it may still be readable with data recovery methods. However, if a drive knows that a sector is bad and the drive's controller receives a command to write over it, it will not reuse that sector and will instead remap it to one of its spare-sector regions." [Wikipedia: Bad Sector]

The end of the story is:
was:
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 1

now is:
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
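
To compare before and after without eyeballing the table, here is a small sketch that pulls the raw values out of smartctl attribute text. `parse_smart_attributes` is a hypothetical helper, and it assumes the ten-column layout shown in this thread.

```python
def parse_smart_attributes(text):
    """Map ATTRIBUTE_NAME -> RAW_VALUE for each attribute row in the text.

    Assumes the columns are:
    ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    attrs = {}
    for line in text.splitlines():
        # Split into at most 10 fields so a raw value with spaces stays whole.
        fields = line.split(None, 9)
        if len(fields) == 10 and fields[0].isdigit():
            attrs[fields[1]] = fields[9].strip()
    return attrs


before = parse_smart_attributes(
    "197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 1")
after = parse_smart_attributes(
    "197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0")
print(before['Current_Pending_Sector'], '->', after['Current_Pending_Sector'])
# 1 -> 0
```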

I rebooted to single-user mode (boot, interrupt by hitting any key to get to grub menu, pick TrueOS, then Single User)
then just did the dd write of the correct 4kbytes. (see below for details of what that means. tricky I guess because of the 4k/512 byte disk mapping?)

When I first got to the single user prompt and did
smartctl -a /dev/ada2

I still saw the Current_Pending_Sector == 1
And the long test hit the error again, with the same LBA
smartctl -t long /dev/ada2

After the dd write, Current_Pending_sector went to 0, so I'm assuming my action caused that, not just some randomly delayed event finally happening.
The Current_Pending_Sector had stayed 1 thru all the below

So I think I'm good. I think the bad sector was not on the data I cared about or maybe some weird place (unused area?). In any case it's a backup nas, where I'm going to overwrite the (important) data again

I have a smartctl -t long /dev/ada2 in progress
--------------------------------------------------------------------------------------------------------------------------------------------------------------

The blow-by-blow of what I tried

(I'll wordily document here for my own possible future reuse, or maybe someone can comment if I did something wrong)

Yeah, I was reading about people saying to use badblocks with a fixed block size to figure out the correct sector, and then doing the dd write like you say.
But then I got worried about running badblocks on a mounted filesystem, and couldn't immediately figure out what services I would need to shut down so I could umount the zpool without it saying the mnt was busy ...and I was remote so didn't want to boot single user or something.

The vast majority of the disk was either empty, or two backups.
So I used a program stressdisk which fills the filesystem with 1GB test files..I figured that was a quick way to overwrite all the unused space (it does reads too of the test files).
Then I reran the backups to overwrite (wget) the other data, hoping that all of that would happen to hit the sector. But I don't seem to have cleared it.

Now that I look closer at the smartctl output, it looks like a long test I had kicked off a while back hit the problem. So maybe that LBA number is what I would need to figure out where I would need to dd write to force the remap

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 80% 3890 373354968
# 2 Extended offline Completed without error 00% 3725 -

I read that "The LBA counts sectors in units of 512 bytes starting at zero."

I guess I don't care about the partitions, in terms of figuring out where to write

fdisk -u /dev/ada2
******* Working on device /dev/ada2 *******
parameters extracted from in-core disklabel are:
cylinders=1938021 heads=16 sectors/track=63 (1008 blks/cyl)

Information from DOS bootblock is:
1: sysid 238 (0xee),(EFI GPT)
start 1, size 1953525167 (953869 Meg), flag 0
beg: cyl 0/ head 0/ sector 2;
end: cyl 1023/ head 255/ sector 63
2: <UNUSED>
3: <UNUSED>
4: <UNUSED>

surprised there are 3 unused partitions (freeNAS must have created them)

/etc/fstab has
/dev/ada1p1.eli none swap sw 0 0
/dev/ada2p1.eli none swap sw 0 0

reading http://ma.juii.net/blog/repair-bad-sector-like-seatools

I suppose I can just use the bad LBA from above, like this...but no, I don't seem to have hdparm with FreeNAS
hdparm --write-sector 373354968 /dev/ada2

or maybe this
using seek to skip over the right number of sectors (512 byte sectors, mapped to 4kbyte, so divide by 8?)
I guess I want to write to a whole 4k byte mapped sector, not just a 512 byte sector

but seems dangerous...the issue of off-by-1 seems to lurk. If it really starts at 0, then skipping == the LBA address is correct (the seek skips?)
switch to bash
$ /bin/bash
$ z=$((373354968/8))
$ echo $z
46669371
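
The divide-by-8 step and the off-by-one worry can be checked with a short sketch. `lba_to_dd_block` is a hypothetical helper; it assumes the LBA counts 512-byte sectors from zero, as quoted above, so a skip or seek of N blocks lands exactly on block index N.

```python
def lba_to_dd_block(lba, sector_size=512, block_size=4096):
    """Convert a 512-byte LBA into a dd skip/seek count for a given bs.

    Returns (block_index, sector_offset_within_block).
    """
    if block_size % sector_size != 0:
        raise ValueError("block_size must be a multiple of sector_size")
    sectors_per_block = block_size // sector_size
    return divmod(lba, sectors_per_block)


block, offset = lba_to_dd_block(373354968)
print(block, offset)  # 46669371 0
```

An offset of 0 means the reported LBA is the first 512-byte sector of that 4k block, so writing the whole block with bs=4096 covers it.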

let's try reading it first (use -skip instead of -seek)
$ dd bs=4096 skip=46669371 count=1 if=/dev/ada2 of=/junk

Noticeable delay (1.9secs) , but then
1+0 records in
1+0 records out
4096 bytes transferred in 1.954302 secs (2096 bytes/sec)

Try again and now it reads fast

$ dd bs=4096 skip=46669371 count=1 if=/dev/ada2 of=/junk
1+0 records in
1+0 records out
4096 bytes transferred in 0.000318 secs (12888124 bytes/sec)

Maybe the read error is intermittent

Let's write it
$ dd bs=4096 seek=46669371 count=1 if=/dev/zero of=/dev/ada2
dd: /dev/ada2: Operation not permitted

Nope..didn't want to let me write it. (see at top, I rebooted to single user and successfully did the write)

tried these just to see (since other forum posts mentioned)

sg_verify and sg_reassign apparently are no good for my sata disk
http://www.masterzen.fr/2009/02/01/the-curse-of-bad-blocks-is-no-more/
so let's try: (Reading more, I see these are scsi commands and the ata drive isn't responding to the scsi commands??)

# sg_verify -v --lba=373354968 /dev/ada2
verify (10): transport: (pass2:ahcich2:0:0:0): VERIFY(10). CDB: 2f 00 16 40 f1 d8 00 00 01 00
(pass2:ahcich2:0:0:0): CAM status: CCB request was invalid

VERIFY(10): Sense category: -1, try '-v' option for more information
failed near lba=373354968 [0x1640f1d8]

So that didn't work

maybe this response is right

# sg_reassign --grown /dev/ada2
READ DEFECT DATA(10): Sense category: -1

Let's just try the reassign. No it didn't like that either

# sg_reassign -v --address=373354968 /dev/ada2
reassign blocks cdb: 07 00 00 00 00 00
reassign blocks: transport: (pass2:ahcich2:0:0:0): REASSIGN BLOCKS. CDB: 07 00 00 00 00 00
(pass2:ahcich2:0:0:0): CAM status: CCB request was invalid

-kevin

knormoyle · Apr 21, 2015

in hindsight, since dd was able to read the bad sector (intermittent error?) although it was very slow one time..
I should have used dd to copy the sector to itself..that would have triggered the remap, and preserved whatever data was there?

i.e. (since I'm writing 4k bytes, skip and seek are both the LBA/8 as above; without seek the write would land at the start of the disk)

dd bs=4096 skip=46669371 seek=46669371 count=1 if=/dev/ada2 of=/dev/ada2

I can't try this right now, since I have to boot to single user to do it, but it seems like that would be the better approach, if doing the read-only seems to work.

-kevin

Bidule0hm · Apr 21, 2015

What's the reallocated sectors value for the same drive? A pending sector that gets remapped evolves into a reallocated sector (yeah, like the pokemons... :p).

Also, when you write a sector in an attempt to remap it, you should write back its own value (if=/dev/adaX of=/dev/adaX), because if it's not the right sector you don't do any harm that way, but if you zero it you create corruption. ZFS should take care of this, but it's still not a good idea.

knormoyle · Apr 21, 2015

Yeah, most of the forum posts out there seem to talk about zeroing.. I suppose if the dd read fails 100% you have to zero. But yeah, the next time I'll remember to copy, since it's pretty likely the read is intermittent. I thought it was interesting how one dd read I did above took 1.954302 secs, but completed. Then it completed faster after that.

here's the current smartctl. I had started up a long test, but aborted it (you can see below). I guess the error before was after 20% of the long test was done, so I should let another long test run.

$ smartctl -a /dev/ada2

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 177 171 021 Pre-fail Always - 4108
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 64
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 095 095 000 Old_age Always - 3906
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 54
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 41
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 22
194 Temperature_Celsius 0x0022 110 106 000 Old_age Always - 37
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 3

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Aborted by host 90% 3905 -
# 2 Extended offline Completed: read failure 90% 3905 373354968
# 3 Extended offline Completed: read failure 80% 3890 373354968
# 4 Extended offline Completed without error 00% 3725 -

Bidule0hm · Apr 21, 2015

As there's no reallocated sector then it was just a hiccup or this sector is failing but not enough to be really bad.

knormoyle said:
Yeah, most of the forum posts out there seem to talk about zeroing.. I suppose if the dd read fails 100% you have to zero. But yeah, the next time I'll remember to copy, since it's pretty likely the read is intermittent. I thought it was interesting how one dd read I did above took 1.954302 secs, but completed. Then it completed faster after that.

Yep, start with a copy and if it fails then use /dev/zero ;)
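
The "copy first, zero only if the read fails" order can be sketched as below. This is only an illustration under assumptions: `rewrite_block` is a hypothetical helper, it is destructive if pointed at the wrong block, and on a real device it needs the same write permission that dd needed (single-user mode in this thread). Using one offset for both the read and the write is the equivalent of giving dd the same skip and seek.

```python
import os


def rewrite_block(path, block_index, block_size=4096):
    """Read one block and write it back in place; zero it if the read fails.

    Returns 'copied' or 'zeroed'.
    """
    offset = block_index * block_size
    fd = os.open(path, os.O_RDWR)
    try:
        try:
            data = os.pread(fd, block_size, offset)
            result = 'copied'
        except OSError:
            # Unreadable sector: the old contents are lost either way.
            data = b'\0' * block_size
            result = 'zeroed'
        if len(data) != block_size:
            raise ValueError("short read: block is past the end of the device")
        os.pwrite(fd, data, offset)
        os.fsync(fd)
        return result
    finally:
        os.close(fd)
```

Trying it on a scratch file first, rather than on /dev/adaX, is a cheap way to confirm the offset arithmetic before touching a disk.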

knormoyle said:
I thought it was interesting how one dd read I did above took 1.954302 secs, but completed. Then it completed faster after that.

Yeah, it's an excellent example of why TLER is useful :)

knormoyle · Apr 21, 2015

re: TLER
yes, it has that. Which is why I was surprised.

the drive is
Model Family: Western Digital RE4
Device Model: WDC WD1003FBYX-01Y7B0

It has TLER. smartctl says
SCT capabilities: (0x303f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

http://www.wdc.com/wdproducts/library/SpecSheet/ENG/2879-701338.pdf
says the 1TB is spec'ed for 1e-15 URE
so conceptually the drive has the "right stuff"
(and WD says it has TLER)

-kevin

Bidule0hm · Apr 21, 2015

TLER is generally set to 7 seconds, that's why you didn't see it in action. But if the read had taken longer then you would have seen it :)

knormoyle · Apr 22, 2015

zpool scrub completed fine on that raid. no errors
so I figured I'd look around at some other non-raid drives. Found one with 3 pending like this one. Figured I'd try the copy method.

HEADS UP to my future self reading this post.
I apparently fubared it at first. Unlike the write from /dev/zero....if you copy you have to specify BOTH -skip and -seek ?...i.e. the read/write sectors separately.

hopefully I didn't kill anything important on my first attempt with just seek, and no skip

switching to the other system, and just writing the 512 byte sector and using the LBA reported exactly (no divide by 8) plus some flags to force direct.
Using this for the read/write on the LBA reported there by the smartctl long test

LBA=3445267336
echo $LBA
sudo dd bs=512 count=1 conv=noerror ibs=512 obs=512 if=/dev/sdb of=/dev/sdb iflag=direct,sync oflag=direct,sync skip=$LBA seek=$LBA

(hmm, it has GPT partitions (sdb1/sdb2) with ext4 filesystems. Maybe I'm not supposed to use /dev/sdb but a partition.)

(this is a non-raid disk)

hmm..can't seem to make this one go away (this is an ubuntu system)

198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 3
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 5

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 6013 3445267336
# 2 Extended offline Completed: read failure 90% 6013 3445267336
# 3 Extended offline Completed: read failure 90% 6012 3445267336

knormoyle · Apr 22, 2015

ok, I managed to clear those 3 Current_Pending_Sector
evidence:

196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 3
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 5

last night I kicked off a dd which read the entire disk to /dev/null. 4TB disk, took about 7 hours.
it got read errors logged in /var/log/kern.log (ubuntu)
so that gave me a block and sector address. Oddly it wasn't the same as the LBA reported by smartctl
I dd read that block, got the I/O error. couldn't read bytes
so I used dd write to write that block (4k) with /dev/zero, and the dd read then succeeded, and the smartctl Current_Pending_Sector decremented.
So now I knew I was on the right track.
I reran smartctl -t long /dev/sdb and it failed with an LBA that was +8 of the first LBA error.
So I figured that was in the next 4k block. Incremented the number I had from the os kern.log by 1 4k block, read that, got error, so then used dd write to write zero to that.
Current_Pending_Sector decremented.
did smartctl -t long again.
LBA reported on failure, of +8 of the last one.
so repeated the increment of my block address from kern.log, read it, got error, wrote zero, read it, no error, smartctl says Current_Pending_Sector is now zero

I should be able to rerun the full smartctl -t long now and not have errors.
yup..it's running now...instead of failing right away

Self-test execution status: ( 249) Self-test routine in progress...
90% of test remaining.

So my apparent mistake is that I couldn't use the LBA reported by smartctl directly to decide what block to write. It's different from what I needed for dd,
so I should figure out how to get the right mapping from the smartctl LBA report to the block I need for dd, to avoid having to read the entire disk with dd just to get an error in the OS log with the right block address.

cyberjock · Apr 22, 2015

I'm just gonna let you know that once you start getting CUPS counts, you're on the fast track to the disk failing. So you might think this is the right way to go, but it's not. The only appropriate solution is to replace the disk.

knormoyle · Apr 22, 2015

Hi cyberjock, yes you may be right.

But if we have all this software around error recovery, and all we do on any error is replace hardware, do we really test the error recovery software?

If not, then we might as well not have the error software..other than having it report "replace disk now, it seems bad" :)

I like learning the details of all this stuff though. A lot of the forum posts I read seem to have fuzzy interpretations of what really happens. I'm not sure if people are saying what they think should happen, or what they've seen happen. An example of fuzziness is "when does the disk actually remap a sector? what triggers it?"

I found this description in my ubuntu disk utility, under the current pending sector count:

"If the sector waiting to be remapped is subsequently written or read successfully, this value is decreased and the sector is not remapped. Read errors on the sector will not remap the sector, it will only be remapped on a failed write attempt"

I wonder if that generalization is correct. Since the disk firmware is doing the remapping, the behavior would be vendor dependent. Maybe by now all vendors do the same thing.

I still can't figure out why the smartctl LBA, wasn't the sector address I could use in dd read and write.

My disk was formatted with GPT not DOS partition tables (because it's 4TB). So I used gdisk and that gave me the correct info for the partition sector offsets.
But I didn't dd read/write using a /dev partition, so I didn't need to adjust for the start of a partition (unless I wanted to see what file was there).

root@mr-0xd3:~# gdisk -l /dev/sdb
GPT fdisk (gdisk) version 0.8.1

Partition table scan:
MBR: protective
BSD: not present
APM: not present
GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sdb: 7814037168 sectors, 3.6 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): B54D8749-DE24-4203-AA89-6C0C8D1A15DB
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 7814037134
Partitions will be aligned on 2048-sector boundaries
Total free space is 3693 sectors (1.8 MiB)

Number Start (sector) End (sector) Size Code Name
1 2048 4096002047 1.9 TiB 8300
2 4096002048 7814035455 1.7 TiB 8300

These are the sectors the os reported when I dd read the whole disk, and the address I used to clear the errors.
7740234632
7740234640
7740234648

So those were correct, because smartctl was good after that

I just can't see how to get those addresses from the smartctl LBA report. I don't think I need to use any partition information..even if I did, there's no arithmetic that seems to get the right addresses

The 3 smartctl-reported LBA's were (I got a report of one at a time, after clearing each one and redoing the smartctl -t long)
3445267336
3445267344
3445267352

i.e. from this evidence:

# 1 Extended offline Completed: read failure 90% 6022 3445267352
# 2 Extended offline Completed: read failure 90% 6022 3445267344
# 3 Short offline Completed: read failure 90% 6013 3445267336

I'm wondering if smartctl isn't reporting the LBA correctly for this situation. If that's possible, it's just a heads-up for me that I should always use the OS to try to find the right sector that needs fixing. Or I'm not understanding something.
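
One observation on the two lists above: each OS-reported sector is exactly 4294967296 (2^32) higher than the matching smartctl LBA. That would fit the self-test log holding the LBA in a 32-bit field, so that on a disk with more than 2^32 sectors (this one has 7814037168) the high bits are dropped. Only the arithmetic is certain; the 32-bit explanation is an assumption. `candidate_lbas` is a hypothetical helper that lists every on-disk sector that would truncate to a reported value.

```python
WRAP = 2 ** 32  # a 32-bit LBA field wraps at this many sectors

smartctl_lbas = [3445267336, 3445267344, 3445267352]
kernel_sectors = [7740234632, 7740234640, 7740234648]

for short, full in zip(smartctl_lbas, kernel_sectors):
    # Each pair differs by exactly 2**32.
    print(full - short == WRAP, full % WRAP == short)


def candidate_lbas(reported_lba, total_sectors, field_bits=32):
    """List every on-disk LBA that would truncate to reported_lba."""
    step = 2 ** field_bits
    return list(range(reported_lba, total_sectors, step))


print(candidate_lbas(3445267336, 7814037168))
# [3445267336, 7740234632]
```

With only two candidates on a 4TB disk, a dd read of each would show which one returns the I/O error, without reading the whole disk. smartctl also has an extended self-test log (-l xselftest) that may be worth comparing on large drives.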

-kevin

knormoyle · Apr 22, 2015

I checked whether those 4k blocks were part of a file. It looks like they were unused, so it was fine to zero them.
But smartctl finding errors on unused blocks: ...it makes you wonder if you really want to consider that "imminent drive failure"

# see if I can find an offset into sdb2 to check for file
>>> 7740234632 - 4096002048
3644232584

# 4096/512 = 8, so divide by 8 for a block address
>>> 3644232584/8
455529073

debugfs
open /dev/sdb2
icheck 455529073
455529073 <block not found>

debugfs: icheck 455529074
Block Inode number
455529074 <block not found>

debugfs: icheck 455529075
Block Inode number
455529075 <block not found>

# here's an example of a found inode
debugfs: icheck 45552905
Block Inode number
45552905 90178545
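
The offset arithmetic used for the icheck lookups can be written as one function. This is a sketch under assumptions: `disk_sector_to_fs_block` is a hypothetical helper, the partition start comes from the gdisk listing for sdb2, and the 4096-byte filesystem block size is assumed rather than read from the filesystem.

```python
def disk_sector_to_fs_block(disk_sector, part_start_sector,
                            sector_size=512, fs_block_size=4096):
    """Map a whole-disk sector number to a filesystem block inside a partition."""
    if disk_sector < part_start_sector:
        raise ValueError("sector lies before the start of the partition")
    byte_offset = (disk_sector - part_start_sector) * sector_size
    # Integer division: several sectors share one filesystem block.
    return byte_offset // fs_block_size


for s in (7740234632, 7740234640, 7740234648):
    print(s, disk_sector_to_fs_block(s, part_start_sector=4096002048))
# 7740234632 455529073
# 7740234640 455529074
# 7740234648 455529075
```

The explicit `//` matters: in Python 3 a plain `/` would return a float instead of the integer block number that icheck expects.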

cyberjock · Apr 22, 2015

knormoyle said:
But if we have all this software around error recovery, and all we do on any error is replace hardware, do we really test the error recovery software?

If not, then we might as well not have the error software..other than having it report "replace disk now, it seems bad" :)

Yes, but there's a difference between relying on it for disaster mitigation and relying on it for day-to-day operations. If your disaster mitigation *is* your day-to-day operations, you've failed do you a good job as the admin of your server. ;)

knormoyle · Apr 22, 2015

heh.
"you've failed do you a good job as the admin of your server"
well maybe I know that, and you know that, but posting here means it's just between you and me right?
oh wait a second, this "internet" thing......

:)
just for background.
I have a qnap box that was giving me grief. I realized that once it started going haywire I didn't understand linux md raid, and so had to learn more on that.
Then I added a freenas box for backup, and I really like freenas and zfs
and so now I'm on a little obsession of looking more closely at both raid setups and the non-raid setups.
(I have maybe 20 4TB drives that are non-raid. They don't need to be.)
heh, I also have some systems with drives I bought used.
So yeah, you could say I'm dancing with the devil, and it's just a matter of time, before the devil comes calling... :)

-kevin

knormoyle · Apr 22, 2015

on to my laptop. I find it has some too. (I've not looked at smartctl stuff as often as I should)

it's a dual boot laptop, windows and ubuntu. ubuntu is on sda5 partition. one disk: sda

it has 8 in smartctl

197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 8
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 8

so I run smartctl -t long /dev/sda to find the first one

it says LBA 1023164952.

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 30% 6605 1023164952

When I use that in dd, the dd read doesn't return data, and I get the same sector
reported in the /var/log/kern.log.
So in this case, the LBA in the smartctl, is exactly what you expect..the sector to use for dd

Interesting the log says "Unrecovered read error - auto reallocate failed" ...Which is odd, implies a read can trigger a reallocate in some cases (ubuntu?).
No wonder I'm confused about what causes reallocates on the disk...:)

no data read when I tried dd read with that lba ..it failed into /var/log/kern.log

dd bs=512 count=1 if=/dev/sda of=/dev/null skip=1023164952
dd: reading `/dev/sda': Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 14.235 s, 0.0 kB/s

/var/log/kern.log

Apr 22 17:08:08 Kevin-Ubuntu4 kernel: [101460.795885] res 41/40:08:18:42:fc/00:00:3c:00:00/00 Emask 0x409 (media error) <F>
Apr 22 17:08:08 Kevin-Ubuntu4 kernel: [101460.795896] ata1.00: error: { UNC }
Apr 22 17:08:08 Kevin-Ubuntu4 kernel: [101460.798042] Add. Sense: Unrecovered read error - auto reallocate failed
Apr 22 17:08:08 Kevin-Ubuntu4 kernel: [101460.798068] end_request: I/O error, dev sda, sector 1023164952
Apr 22 17:08:08 Kevin-Ubuntu4 kernel: [101460.798077] Buffer I/O error on device sda, logical block 127895619
Apr 22 17:08:16 Kevin-Ubuntu4 kernel: [101468.837275] res 40/00:c0:18:42:fc/00:00:3c:00:00/40 Emask 0x1 (device error)
Apr 22 17:08:16 Kevin-Ubuntu4 kernel: [101468.837298] res 40/00:c0:18:42:fc/00:00:3c:00:00/40 Emask 0x1 (device error)
Apr 22 17:08:19 Kevin-Ubuntu4 kernel: [101472.138592] res 41/40:08:18:42:fc/00:00:3c:00:00/00 Emask 0x409 (media error) <F>
Apr 22 17:08:19 Kevin-Ubuntu4 kernel: [101472.138596] ata1.00: error: { UNC }
Apr 22 17:08:19 Kevin-Ubuntu4 kernel: [101472.140431] Add. Sense: Unrecovered read error - auto reallocate failed
Apr 22 17:08:19 Kevin-Ubuntu4 kernel: [101472.140438] end_request: I/O error, dev sda, sector 1023164952
Apr 22 17:08:19 Kevin-Ubuntu4 kernel: [101472.140441] Buffer I/O error on device sda, logical block 127895619
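
The "res" line in the kernel log appears to carry the same address in hex. Below is a sketch that decodes it, assuming the libata field order status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device. That layout is an assumption here, and `lba_from_ata_result` is a hypothetical helper, but the decoded value does match the sector in the end_request line.

```python
def lba_from_ata_result(res):
    """Extract the 48-bit LBA from a libata 'res' taskfile string.

    Assumed layout:
    status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
    """
    parts = res.split('/')
    low = [int(x, 16) for x in parts[1].split(':')]
    high = [int(x, 16) for x in parts[2].split(':')]
    lbal, lbam, lbah = low[2:5]
    hob_lbal, hob_lbam, hob_lbah = high[2:5]
    return (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24 |
            lbah << 16 | lbam << 8 | lbal)


lba = lba_from_ata_result("41/40:08:18:42:fc/00:00:3c:00:00/00")
print(lba)       # 1023164952
print(lba // 8)  # 127895619
```

The second number is the "logical block" from the Buffer I/O error line, which is the same sector counted in 4k units.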
