Detect Hard Drive Failure in Linux using S.M.A.R.T.

In this post I describe some techniques for checking hard disk health on a Linux server using S.M.A.R.T., along with some tips for extending the life of the disk.

S.M.A.R.T. stands for Self-Monitoring, Analysis and Reporting Technology. It is a monitoring system built into hard disks that tracks low-level indicators of the general health of the device; its purpose is to warn of imminent failure.

According to several studies, only about 60% of the hard drives that later failed had reported S.M.A.R.T. values indicating an imminent failure; the other 40% failed without any “strange” S.M.A.R.T. value or related alarm. Still, I believe that in every case it is better than nothing.

How does it work?
Each drive manufacturer defines a set of attributes, and sets threshold values beyond which attributes should not pass under normal operation.

In older revisions of the SMART spec, disks had to keep an internal list of up to 30 Attributes corresponding to different measures of performance and reliability, such as read error rates.

For each attribute we can get the following values.

RAW VALUE
The meaning of raw value is entirely up to the drive manufacturer but often corresponds to counts or a physical unit, such as degrees Celsius or seconds.

NORMALIZED (ATTRIBUTE) VALUE
A value ranging from 1 to 253, derived from the raw value by an algorithm that is entirely up to the drive manufacturer.

WORST VALUE
This value represents the worst normalized value recorded so far.

THRESHOLD VALUE
Each attribute has a corresponding threshold value. If one or more of the normalized attribute values is less than or equal to its corresponding threshold, then either the disk is expected to fail in less than 24 hours or it has exceeded its design or usage lifetime.

Some attribute values are updated as the disk operates; others are updated only through off-line tests that temporarily slow down disk reads and writes and therefore must be started with a special command. All these values are stored by the controller of the hard disk itself; we will see how to read them with smartmontools later in this post.

Normalized values are almost always mapped so that higher values are better, and higher raw attribute values may be better or worse depending on the attribute and manufacturer.

More recent revisions of the SMART spec dropped the requirement that disks maintain an internal attribute table: instead, the disk simply returns an OK or NOT OK response to an inquiry about its health. A negative response indicates that the disk firmware has determined that the disk is likely to fail.

It is still possible, however, to read the disk’s Attributes, because most disks are backward-compatible with the older revisions of the SMART spec and most manufacturers still support them.

The SMART spec also added disk self-tests to the SMART command set, which can be used to check the health of the hard disk more thoroughly.

How to enable S.M.A.R.T.?
To get S.M.A.R.T. working we need a hard disk and a BIOS that support it: nowadays almost all hard disks and motherboards do, and we only have to check in the BIOS menu that the function is enabled (it is by default). We also have to install and configure software that can read these values and analyze them: on Linux this software is smartmontools. With it, it is possible to monitor S.M.A.R.T. attributes and run hard drive self-tests.

Note: I have found that many SSDs do not support S.M.A.R.T. I also know that almost all hardware RAID controllers do not support S.M.A.R.T. on their logical disks, while with software RAID it is possible to use S.M.A.R.T. on the underlying physical disks.

The smartmontools package comes with two utility programs: smartctl and smartd. smartctl is for interactive use, while smartd is a daemon that continuously monitors S.M.A.R.T. values.

Let’s start by taking a look at smartctl.

# smartctl -i /dev/sda

The -i option prints general information about the device.

Note: in this case sda is the only hard disk in the system; if your device has a different name, change the command accordingly.

Command response

smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: GB0500EAFYL
Serial Number: WCASYE699463
Firmware Version: HPG2
User Capacity: 500,107,862,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 4a
Local Time is: Sun Oct 7 17:12:19 2012 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Everything works fine: the motherboard, the hard disk and smartctl all talk to each other correctly. But the case below, where not everything is ideal, is also possible.

smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: ST31000528AS
Serial Number: 9VP4N6RZ
Firmware Version: CC38
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Mon Oct 22 18:39:24 2012 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

The smartctl program has an internal database of the disks it knows about. According to the output above, the version of smartctl is 5.38, and at the time of that release the model and disk family of this disk were not in its database.

This does not really matter, as most of the SMART parameters are the same across manufacturers and models. The only consequence is that some raw values may look “strange” (meaningless), but we do not really care, since the manufacturers do not tell us how to interpret them anyway. In other words, the SMART system still works for any unidentified parameters: the VALUE and THRESH columns are shown and failure is indicated correctly.

Note: if the version of smartctl is newer than 5.40, it is possible to update the drive database with the following command.

/usr/sbin/update-smart-drivedb
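
Before moving on, the two "SMART support is" lines printed by the -i option can also be checked from a script. The following is only a minimal sketch: the function name smart_support_state is hypothetical (it is not part of smartmontools), and it assumes the wording of the output shown earlier, plus the "Disabled" and "Unavailable" variants.

```shell
#!/bin/sh
# smart_support_state (hypothetical helper): read `smartctl -i` output on
# stdin and print "enabled", "disabled", "unavailable" or "unknown".
# It relies only on the wording of the "SMART support is:" lines.
smart_support_state() {
    awk '
        /^SMART support is:/ {
            if      ($0 ~ /Enabled/)     state = "enabled"
            else if ($0 ~ /Disabled/)    state = "disabled"
            else if ($0 ~ /Unavailable/) state = "unavailable"
        }
        END { print (state == "" ? "unknown" : state) }
    '
}

# Typical use (needs root and a real device, so it is left commented out):
#   smartctl -i /dev/sda | smart_support_state
```

If the result is "disabled", S.M.A.R.T. can usually be switched on with smartctl -s on /dev/sda.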

Now we can ask the device to report its S.M.A.R.T. health status using the -H option.

# smartctl -H /dev/sda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

The S.M.A.R.T. health status is based on information that the disk has gathered from online and offline tests, which it uses to determine and update its vendor-specific SMART attribute values. If the device reports a failing health status, this means either that the device has already failed or that it is predicting its own failure within the next 24 hours.
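
For scripts it is usually easier to look at the exit status of smartctl than to parse the PASSED text. The exit status is a bitmask described in the smartctl man page; the sketch below decodes only the three health-related bits (values 8, 16 and 32). The function name describe_smartctl_status and the labels it prints are my own, chosen for illustration.

```shell
#!/bin/sh
# describe_smartctl_status (hypothetical helper): decode the health-related
# bits of the exit status returned by smartctl.
#   bit 3 (8)  : the SMART status check returned "DISK FAILING"
#   bit 4 (16) : a pre-fail Attribute is at or below its threshold
#   bit 5 (32) : some Attribute was at or below its threshold in the past
describe_smartctl_status() {
    status=$1
    result=""
    [ $(( status & 8 ))  -ne 0 ] && result="$result disk-failing"
    [ $(( status & 16 )) -ne 0 ] && result="$result prefail-at-or-below-threshold"
    [ $(( status & 32 )) -ne 0 ] && result="$result below-threshold-in-the-past"
    # Drop the leading space; print "ok" when none of the three bits is set.
    result=${result# }
    echo "${result:-ok}"
}

# Typical use (needs root and a real device, so it is left commented out):
#   smartctl -H /dev/sda >/dev/null
#   describe_smartctl_status $?
```

The other bits of the exit status report command-line, device-open and log errors and are ignored here.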

The next command uses the -c option, which prints the disk’s capabilities and the estimated time needed to perform the short and long disk self-tests (described later).

# smartctl -c /dev/sda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled.

Self-test execution status: (0) The previous self-test routine completed without error or no self-test has ever been run.

Total time to complete Offline data collection: ( 600) seconds.

Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.

SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.

Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.

Short self-test routine recommended polling time: ( 1) minutes.

Extended self-test routine recommended polling time: ( 180) minutes.

Conveyance self-test routine recommended polling time: ( 2) minutes.

SCT capabilities: (0x103f) SCT Status supported. SCT Feature Control supported. SCT Data Table supported. 
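
The number in parentheses on the "Self-test execution status" line is a status byte: as far as I know, its high four bits say what happened to the last self-test (15 means a test is running) and, while a test is running, its low four bits give the remaining work in tenths. The sketch below decodes it under that assumption; the function name selftest_progress and its messages are hypothetical.

```shell
#!/bin/sh
# selftest_progress (hypothetical helper): read `smartctl -c` output on stdin,
# take the number in parentheses on the "Self-test execution status" line and
# interpret it as described above.
selftest_progress() {
    code=$(sed -n 's/^Self-test execution status:[^(]*( *\([0-9][0-9]*\)).*/\1/p' | head -n 1)
    if [ -z "$code" ]; then
        echo "unknown"
    elif [ $(( code / 16 )) -eq 15 ]; then
        # High nibble 15: a test is running; low nibble = tenths remaining.
        echo "running, $(( (code % 16) * 10 ))% remaining"
    elif [ "$code" -eq 0 ]; then
        echo "idle or last test completed without error"
    else
        echo "last test did not complete cleanly (status $code)"
    fi
}

# Typical use (needs root and a real device, so it is left commented out):
#   smartctl -c /dev/sda | selftest_progress
```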

Now we can print the disk’s attribute table. Please remember that Attributes are no longer part of the S.M.A.R.T. spec, although most manufacturers still support them. The spec also does not define the meaning or interpretation of Attributes; however, many have a de facto standard interpretation (e.g. the attribute that tracks the internal temperature).

# smartctl -a /dev/sda
.....
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always   -           134855599
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always   -           0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always   -           53
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x000f   076   060   030    Pre-fail  Always   -           45869735
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always   -           18847
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always   -           0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always   -           33
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           0
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always   -           0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always   -           0
188 Unknown_Attribute       0x0032   100   098   000    Old_age   Always   -           43
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always   -           0
190 Airflow_Temperature_Cel 0x0022   059   049   045    Old_age   Always   -           41 (Lifetime Min/Max 30/45)
194 Temperature_Celsius     0x0022   041   051   000    Old_age   Always   -           41 (0 23 0 0)
195 Hardware_ECC_Recovered  0x001a   031   018   000    Old_age   Always   -           134855599
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always   -           0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline  -           213309550773112
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline  -           872098754
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline  -           929841815

SMART Error Log Version: 1
.......

The names and meanings of Attributes and the interpretation of their raw values are not specified by any standard, and different manufacturers sometimes use the same Attribute ID for different purposes. For this reason the interpretation of specific Attributes can occasionally be wrong, especially if your disk model is not in the smartmontools database (e.g. Attribute ID 241).
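
Whatever the names, the comparison that matters is the same for every row of the table: the normalized VALUE (and WORST) against THRESH. As a rough sketch, the function below scans an attribute table and reports the rows where that comparison fails. The name failing_attributes and the labels it prints are hypothetical, and treating a threshold of 0 as "never fails" is my assumption.

```shell
#!/bin/sh
# failing_attributes (hypothetical helper): read `smartctl -A` (or -a) output
# on stdin and print the ID and name of every Attribute whose normalized
# VALUE or WORST is at or below its THRESH.
failing_attributes() {
    awk '
        # Attribute rows start with a numeric ID, a name and a hex flag.
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x[0-9a-f]+$/ {
            value = $4 + 0; worst = $5 + 0; thresh = $6 + 0
            # Assumption: a threshold of 0 means the Attribute never fails.
            if (thresh == 0) next
            if (value <= thresh)      print $1, $2, "failing_now"
            else if (worst <= thresh) print $1, $2, "failed_in_the_past"
        }
    '
}

# Typical use (needs root and a real device, so it is left commented out):
#   smartctl -A /dev/sda | failing_attributes
```

For the table shown above the function prints nothing, because every VALUE and WORST is above its THRESH.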

The most important part of the smartctl output is the report of the self-tests run on the disk. It is possible to run two types of self-test, short and long, with the following commands.

smartctl -t short /dev/sda
smartctl -t long /dev/sda

Typically, short tests take only a few minutes to complete, while long tests can take several hours.

These self-tests do not corrupt data on the disk and do not interfere with the normal functioning of the disk, so the commands may be used on a running system.

After launching a short or long test, you can check its progress with the -c option (see the Self-test execution status value), and once it has finished you can read the result with the -a option.

# smartctl -a /dev/sda

........
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 175986284970872
241 Unknown_Attribute 0x0000 100 253 000 Old_age Offline - 872104954
242 Unknown_Attribute 0x0000 100 253 000 Old_age Offline - 929841815

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1

Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 18847 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
.....

From the output above we can see that the short self-test completed without errors. The LifeTime column shows the power-on age of the disk, in hours, when the self-test was run.
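
The self-test log can be checked automatically in the same spirit. The sketch below prints every log entry whose status is anything other than "Completed without error", so aborted or still-running tests are listed too. The function name failed_selftests is hypothetical; the only thing it relies on is that log entries start with "#" followed by the entry number, as in the output above.

```shell
#!/bin/sh
# failed_selftests (hypothetical helper): read `smartctl -l selftest` (or -a)
# output on stdin and print the self-test log entries that do not say
# "Completed without error".
failed_selftests() {
    awk '/^# *[0-9]+ / && !/Completed without error/ { print }'
}

# Typical use (needs root and a real device, so it is left commented out):
#   smartctl -l selftest /dev/sda | failed_selftests
```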

To use smartmontools properly you have to run self-tests on the disk on a regular basis: the tool asks the disk’s own electronics whether all the Attributes are “correct”. We will see shortly how to do this with the smartd daemon.

The attribute table shows that each value is updated either Always or Offline (see the UPDATED column): the Offline Attribute values can be updated only by a special kind of self-test, the off-line test. Some disks support automatic off-line testing, which is enabled by the following command.

smartctl -o on /dev/<device>

This command enables an automatic off-line test every few hours. To check whether the disk supports it, run smartctl with the -c option and look at the Offline data collection capabilities value. The off-line test does not corrupt data on the disk and does not interfere with the normal functioning of the disk, so you can enable it on a running system.

We will now see how to use the smartd daemon to run long, short and off-line tests on a regular basis.

By default, when smartd is started, it checks the system’s disks at regular intervals for failing attributes, a failing health status, or an increased number of ATA errors or failed self-tests, and logs this information through syslog, by default to /var/log/messages.

You can control and fine-tune the behavior of smartd using the configuration file /etc/smartd.conf. Each line contains Directives pertaining to a different disk.

Example

# /etc/smartd.conf

/dev/hda -S on -o on -a -I 194 -m admin@3tsistemi.it
/dev/hdc -S on -o on -a -I 194 -m admin@3tsistemi.it

The first column indicates the device to be monitored. The -o on Directive enables the automatic off-line testing, and the -S on Directive enables automatic Attribute autosave. The -m Directive is followed by an e-mail address to which warning messages are sent, and the -a Directive instructs smartd to monitor all S.M.A.R.T. features of the disk. In this configuration, smartd logs changes in all normalized attribute values. The -I 194 Directive means ignore changes in Attribute #194, because disk temperatures change often, and it’s annoying to have such changes logged on a regular basis.

Other Directives provide additional flexibility, such as monitoring changes in raw Attribute values. An interesting one is -M test, which, used together with -m, sends a test e-mail to confirm that warning e-mail messages are delivered correctly.

/dev/hda -S on -o on -a -I 194 -m admin@3tsistemi.it -M test

Every time the smartd daemon starts, a test e-mail will be sent to the specified address.

Another example.

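/dev/sda -S on -o on -a -I 194 -s (S/../.././01|L/../01/./03) -m admin@3tsistemi.it

In this case we have configured smartd to launch self-tests on a regular basis.

-s (S/../.././01|L/../01/./03)
This schedules the short and long self-tests. The pattern is matched against a string of the form T/MM/DD/d/HH (test type, month, day of the month, day of the week, hour), in which the day of the month is always written with two digits. In this example, the short self-test will run daily at 1:00 A.M. and the long test will run on the first day of every month at 3:00 A.M.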
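
According to the smartd.conf man page, a scheduled test starts when all 12 characters of the string T/MM/DD/d/HH match the extended regular expression given to -s. This makes a schedule easy to try out with grep before putting it in the configuration file. The sketch below does that; the function name schedule_matches is hypothetical. Because DD is always two digits, the first day of the month is written 01 here.

```shell
#!/bin/sh
# schedule_matches REGEX STAMP (hypothetical helper): succeed if the whole
# STAMP (T/MM/DD/d/HH) matches the extended regular expression REGEX.
# grep -x requires the match to cover the entire line.
schedule_matches() {
    printf '%s\n' "$2" | grep -E -x -q -- "$1"
}

regex='(S/../.././01|L/../01/./03)'

# Stamp for a long test right now: month, day of month, weekday (1 = Monday), hour.
now="L/$(date +%m/%d/%u/%H)"
if schedule_matches "$regex" "$now"; then
    echo "a long test would start now ($now)"
else
    echo "no long test is due now ($now)"
fi
```

With this pattern, S/10/22/1/01 and L/10/01/1/03 match, while L/10/22/1/03 does not. A pattern written as L/../1/./03 can never match, because it is one character too short for a 12-character stamp.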

For more information, see the smartd.conf man page: it is very clear and concise.

Very important.
If you use smartmontools you will often see that the hard disk is not in the smartmontools database: this does not break anything. When the disk is not in the database, the only consequence is that the names of the Attributes (displayed in the ATTRIBUTE_NAME column of smartctl -a /dev/<device>) and the format of the raw Attribute values shown in the RAW_VALUE column may be incorrect.

This is mostly cosmetic: the essential drive health monitoring and testing functionality of smartmontools does not depend on the database!

How can the life of a hard disk be extended?
Here are some tips on how to keep hard drives from failing.

Reducing the I/O on the hard disk means that it will last longer! Turn off any unnecessary logging or decrease its verbosity. If the system has enough RAM, it is also possible to lower the swappiness value to 0 (the swappiness kernel parameter controls the tendency of the kernel to move processes out of physical memory and onto the swap disk).
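
As a small illustration, the current swappiness value can be read from /proc/sys/vm/swappiness. The function below is only a sketch: the name show_swappiness is hypothetical, and the optional file argument exists only so that the function can be tried against an ordinary file. The commands for changing the value are left as comments because they need root; the file name under /etc/sysctl.d is just an example.

```shell
#!/bin/sh
# show_swappiness [FILE] (hypothetical helper): print the swappiness value
# stored in FILE, which defaults to the kernel's procfs entry.
show_swappiness() {
    file=${1:-/proc/sys/vm/swappiness}
    if [ -r "$file" ]; then
        echo "vm.swappiness = $(cat "$file")"
    else
        echo "cannot read $file" >&2
        return 1
    fi
}

# To change the value (as root):
#   sysctl -w vm.swappiness=0                                     # until the next reboot
#   echo 'vm.swappiness = 0' > /etc/sysctl.d/99-swappiness.conf   # persistent (example file name)
```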

Some studies have shown that lowering disk temperatures by even a few °C significantly reduces failure rates. You can add a cooling fan that blows air directly onto or past the system’s disks.
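
To see whether extra cooling makes a difference, the temperature can be read from the attribute table. The sketch below takes the first number in the RAW_VALUE column of Attribute 194. That this number is the temperature in degrees Celsius holds for the disk shown earlier in this post but is vendor-specific, so treat it as an assumption; the function name disk_temperature is hypothetical.

```shell
#!/bin/sh
# disk_temperature (hypothetical helper): read `smartctl -A` output on stdin
# and print the first number of the RAW_VALUE column of Attribute 194.
disk_temperature() {
    # RAW_VALUE is the tenth whitespace-separated field of an Attribute row.
    awk '$1 == 194 { print $10 + 0; exit }'
}

# Typical use (needs root and a real device, so it is left commented out):
#   smartctl -A /dev/sda | disk_temperature
```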

Consider the regular replacement of 3 year old disks.

Remember, smartmontools is no substitute for backing up your data. S.M.A.R.T. does not predict all disk failures!

Links
Failure Trends in a Large Disk Drive Population
smartmontools