How to rebuild /dev/md0?

dominix.pf · January 6, 2021

Hello everyone,

I have a strange problem on a DS218+. Following a disk error, the RAID1 volume that holds the root FS is in an error state.

I manage several dozen of these units, yet I have never run into this before. The first thing to note is that the data is not affected: volume1 is healthy and its data is backed up. The problem is limited to /dev/md0, the RAID1 array that holds the DSM system. This array refuses to rebuild.

Does this ring a bell for anyone?

root@NAS-WTF:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sda5[0] sdb5[2]
1948683456 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
2097088 blocks [2/2] [UU]

md0 : active raid1 sdb1[2](S) sda1[0](E)
2490176 blocks [2/1] [E_]

unused devices: <none>
root@NAS-WTF:~# mdadm --detail /dev/md0
/dev/md0:
Version : 0.90
Creation Time : Mon Jul 1 09:07:39 2019
Raid Level : raid1
Array Size : 2490176 (2.37 GiB 2.55 GB)
Used Dev Size : 2490176 (2.37 GiB 2.55 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Mon Jan 4 16:55:06 2021
State : clean, FAILED
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1

UUID : d45ed910:b3eefc88:3017a5a8:c86610be
Events : 0.1137371

Number Major Minor RaidDevice State
0 8 1 0 faulty active sync /dev/sda1
- 0 0 1 removed

2 8 17 - spare /dev/sdb1

root@NAS-WTF:~# mdadm --assemble --auto=yes --force /dev/md0 /dev/sda1 /dev/sdb1
mdadm: /dev/sda1 is busy - skipping
mdadm: /dev/sdb1 is busy - skipping

root@NAS-WTF:~# dd if=/dev/sda1 of=/dev/null
4980480+0 records in
4980480+0 records out
2550005760 bytes (2.6 GB) copied, 50.1761 s, 50.8 MB/s

root@NAS-WTF:~# dd if=/dev/sdb1 of=/dev/null
4980480+0 records in
4980480+0 records out
2550005760 bytes (2.6 GB) copied, 50.9145 s, 50.1 MB/s

smartctl -a /dev/sdb
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 200 199 051 - 598
3 Spin_Up_Time POS--K 178 178 021 - 4100
4 Start_Stop_Count -O--CK 100 100 000 - 7
5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
7 Seek_Error_Rate -OSR-K 200 200 000 - 0
9 Power_On_Hours -O--CK 090 090 000 - 7693
10 Spin_Retry_Count -O--CK 100 253 000 - 0
11 Calibration_Retry_Count -O--CK 100 253 000 - 0
12 Power_Cycle_Count -O--CK 100 100 000 - 7
192 Power-Off_Retract_Count -O--CK 200 200 000 - 1
193 Load_Cycle_Count -O--CK 200 200 000 - 86
194 Temperature_Celsius -O---K 118 107 000 - 29
196 Reallocated_Event_Count -O--CK 200 200 000 - 0
197 Current_Pending_Sector -O--CK 200 200 000 - 14
198 Offline_Uncorrectable ----CK 100 253 000 - 0
199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 8
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 2 Extended offline Completed without error 00% 7661 -
# 9 Extended offline Completed without error 00% 7494 -
#16 Extended offline Completed without error 00% 7326 -

smartctl -a /dev/sda
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 200 200 051 - 2976
3 Spin_Up_Time POS--K 176 175 021 - 4175
4 Start_Stop_Count -O--CK 100 100 000 - 30
5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
7 Seek_Error_Rate -OSR-K 200 200 000 - 0
9 Power_On_Hours -O--CK 082 082 000 - 13278
10 Spin_Retry_Count -O--CK 100 253 000 - 0
11 Calibration_Retry_Count -O--CK 100 253 000 - 0
12 Power_Cycle_Count -O--CK 100 100 000 - 30
192 Power-Off_Retract_Count -O--CK 200 200 000 - 22
193 Load_Cycle_Count -O--CK 200 200 000 - 462
194 Temperature_Celsius -O---K 119 103 000 - 28
196 Reallocated_Event_Count -O--CK 200 200 000 - 0
197 Current_Pending_Sector -O--CK 200 200 000 - 2
198 Offline_Uncorrectable ----CK 100 253 000 - 0
199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 11
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning

SMART Error Log Version: 1
ATA Error Count: 3
   CR = Command Register [HEX]
   FR = Features Register [HEX]
   SC = Sector Count Register [HEX]
   SN = Sector Number Register [HEX]
   CL = Cylinder Low Register [HEX]
   CH = Cylinder High Register [HEX]
   DH = Device/Head Register [HEX]
   DC = Device Command Register [HEX]
   ER = Error register [HEX]
   ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 occurred at disk power-on lifetime: 5130 hours (213 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 38 b2 1b e6 Error: UNC 8 sectors at LBA = 0x061bb238 = 102478392

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 38 b2 1b e6 08 11d+15:10:09.653 READ DMA
ca 00 08 a0 6c 20 e0 08 11d+15:10:09.653 WRITE DMA
c8 00 08 28 e2 bb e7 08 11d+15:10:09.640 READ DMA
ca 00 38 68 6c 20 e0 08 11d+15:10:09.640 WRITE DMA

Error 2 occurred at disk power-on lifetime: 5130 hours (213 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 40 ca 1b e6 Error: UNC 8 sectors at LBA = 0x061bca40 = 102484544

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 40 ca 1b e6 08 11d+15:09:48.318 READ DMA
ca 00 88 28 6a 20 e0 08 11d+15:09:48.318 WRITE DMA
ca 00 08 f0 f7 27 e0 08 11d+15:09:48.310 WRITE DMA

Error 1 occurred at disk power-on lifetime: 5130 hours (213 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 e8 81 bb e7 Error: UNC 8 sectors at LBA = 0x07bb81e8 = 129729000

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 e8 81 bb e7 08 11d+15:09:40.095 READ DMA
ca 00 01 e8 3e 90 e0 08 11d+15:09:40.095 WRITE DMA
ca 00 08 f8 69 20 e0 08 11d+15:09:40.083 WRITE DMA

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 3 Extended offline Completed without error 00% 13234 -
# 4 Short offline Completed without error 00% 13203 -
# 5 Short offline Completed without error 00% 13179 -
# 6 Short offline Completed without error 00% 13155 -
# 7 Short offline Completed without error 00% 13131 -
# 8 Short offline Completed without error 00% 13106 -
# 9 Short offline Completed without error 00% 13083 -
#10 Extended offline Completed without error 00% 13067 -
#17 Extended offline Completed without error 00% 12900 -

Every LBA reported in the error log above (± 10 sectors) has been re-tested and now reads correctly:
102478392 102484544 129729000
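
The re-read around each reported LBA can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the commands actually run in this thread: the helper name `lba_window_cmd` is hypothetical, the 512-byte logical sector size is assumed (it is consistent with the dd record counts above), and `iflag=direct` assumes GNU dd. The LBAs in the SMART error log are relative to the whole device, so the target here is the disk rather than a partition. The helper only prints the dd command so that it can be reviewed before being run as root.

```shell
# Hypothetical helper: build (but do not run) a dd command that reads
# `radius` sectors on each side of `lba` from device `dev`.
lba_window_cmd() {
  dev=$1
  lba=$2
  radius=${3:-10}
  start=$((lba - radius))
  # Clamp the window so it never starts before sector 0.
  if [ "$start" -lt 0 ]; then
    start=0
  fi
  count=$((lba + radius - start + 1))
  # bs=512 is an assumed logical sector size; iflag=direct (GNU dd)
  # bypasses the page cache so the read really reaches the disk.
  echo "dd if=$dev of=/dev/null bs=512 skip=$start count=$count iflag=direct"
}

# Cross-check the hex LBA from the SMART log against its decimal form.
printf '%d\n' 0x061bb238

# Print the command for the first LBA listed above.
lba_window_cmd /dev/sda 102478392
```

Printing the command instead of executing it keeps the sketch safe to try on any machine; the output can be pasted into a root shell once the device name and sector size have been checked.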

dominix.pf · January 6, 2021

To be clear, there is no active "Repair" menu in the web interface.

Einsteinium · January 6, 2021

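SMART attributes 197 (Current_Pending_Sector) and 200 (Multi_Zone_Error_Rate) have non-zero raw values on both disks... that is most likely the cause of the problem.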
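
A quick way to spot these two attributes in smartctl output might look like the sketch below. The function name `smart_flagged` is hypothetical, and the filter assumes that the raw value is the last column of each attribute row, as it is in the listings above. It only reports non-zero raw values for IDs 197 and 200; it does not judge whether a given value is acceptable.

```shell
# Hypothetical helper: read smartctl attribute rows on stdin and print
# the name and raw value of attributes 197 and 200 when the raw value
# (assumed to be the last column) is greater than zero.
smart_flagged() {
  awk '($1 == 197 || $1 == 200) && $NF + 0 > 0 { print $2, $NF }'
}

# Possible usage on a live system (requires root and smartctl):
#   smartctl -A /dev/sda | smart_flagged
```

Fed with the /dev/sdb listing above, the filter would keep the Current_Pending_Sector and Multi_Zone_Error_Rate rows and drop everything else.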

It will be hard to rebuild the md0 array. [E_] means, if you prefer, that one disk is in error and the other is no longer in the array. The missing one most likely went into error first, and its automatic repair then pushed the second disk into error because of its SMART values.
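
To check which arrays are in that state without reading /proc/mdstat by eye, something like the following sketch could be used. The function name `degraded_arrays` is hypothetical. It relies only on the `[expected/active]` member counter shown in the mdstat output earlier in the thread and does not interpret the `E` markers themselves.

```shell
# Hypothetical helper: list md arrays whose active member count is lower
# than the expected member count, based on the "[n/m]" field in mdstat.
# Reads /proc/mdstat by default, or a file given as the first argument.
degraded_arrays() {
  awk '
    # Remember the array name from lines such as "md0 : active raid1 ...".
    /^md[0-9]+ :/ { name = $1 }
    # On the following "blocks" line, extract the "[n/m]" counter.
    / blocks / && match($0, /\[[0-9]+\/[0-9]+\]/) {
      split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
      if (n[2] + 0 < n[1] + 0) print name, n[2] "/" n[1]
    }
  ' "${1:-/proc/mdstat}"
}

# Possible usage on the NAS:
#   degraded_arrays
```

For the mdstat output posted at the top of the thread, only md0 has fewer active members (1) than expected (2), so it is the only array this check would report.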

You have a backup of the data... I recommend formatting the disks with the manufacturer's utility before reusing them and, if needed, sending them in for a warranty replacement if they are still covered.

Next time, run a badblocks test on your disks (there is a tutorial on the forum); it avoids this kind of trouble.

dominix.pf · January 6, 2021

And yet I do run tests (smartctl -t long) every week. ...
