mdadm RAID drive needs a rebuild after each reboot

So I have been posting a lot about a RAID build with mdadm and Ubuntu (sorry about that). I'm just not understanding what's going on with my setup.

I have a RAID-5 setup, and followed this guide exactly to do so:

Anyway, I'm running into this consistent problem. I turn off the RAID machine at times when I don't need it up, and when I turn it back on I get stuck at a purple screen. I can get out of this screen by simply typing exit, and it then tells me my RAID is degraded and asks whether I wish to boot it or not. After that, I get the normal login screen. Then, this is what I see via SSH (typing it out):

mdadm -D /dev/md127

Then it returns to me that one of the drives is "removed"
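For anyone who wants to script that check, here is a rough sketch. It assumes the `mdadm -D` output contains a "State :" line that includes the word "degraded" when a member is missing, which is what I would expect to see but is not copied from the output above. The function name md_is_degraded is made up for this example.

```shell
# Hypothetical helper: succeeds (exit 0) if `mdadm -D` style text on stdin
# has a "State :" line that mentions "degraded".
# Assumption: the State line looks like "State : clean, degraded".
md_is_degraded() {
    grep -Eq '^[[:space:]]*State :.*degraded'
}

# Usage (needs root; not run here):
#   mdadm -D /dev/md127 | md_is_degraded && echo "array is degraded"
```

Because it only reads text from stdin, it can be tried on a saved copy of the output before pointing it at a live array.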

Then I will:

mdadm --manage /dev/md127 --add /dev/sd** (** standing for the drive)

This is what I always get back

mdadm: /dev/sd** reports being an active member for /dev/md127, but a --re-add fails.
mdadm: not performing --add as that would convert /dev/sdb1 in to a spare.
mdadm: To make this a spare, use "mdadm --zero-superblock /dev/sd**" first.

So I follow those instructions (zero the superblock), add the drive again, and then the RAID rebuilds fine.
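Written out, the two steps look roughly like the sketch below. It is a dry run: it only prints the commands so they can be checked first. The function name readd_as_spare is hypothetical, and /dev/sdb1 in the usage line is just the member named in the error message above. Note that zeroing the superblock wipes the md metadata on that partition, so the device name has to be right.

```shell
# Hypothetical dry-run helper: prints the recovery commands, does not run them.
# $1 = array device, $2 = member partition that was reported as "removed".
readd_as_spare() {
    array="$1"
    member="$2"
    # Step 1: erase the md superblock so mdadm treats the partition as new.
    echo "mdadm --zero-superblock $member"
    # Step 2: add it back to the array, which then rebuilds onto it.
    echo "mdadm --manage $array --add $member"
}

# Usage (prints two lines; run them by hand as root once they look right):
#   readd_as_spare /dev/md127 /dev/sdb1
```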

What frustrates me is that this happens so often, and I have no idea why. Sometimes it works fine, and sometimes it doesn't. All I am doing is turning the machine off and on. Any help is greatly appreciated. I don't understand why my RAID won't just work normally or what I am doing wrong.

1 Answer

I just had a similar problem when I rebooted my home file server and came looking for a similar error.

When using "smartctl --all /dev/sda" (for example), it's useful to check the raw value of Reallocated_Sector_Ct. If it is non-zero and starts climbing dramatically, your disk could be failing and it's time to take a backup.
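If you want to pull out just that one number, something like the following should work. It is a minimal sketch that assumes the attribute table has the ten-column layout smartctl printed for my drives, with RAW_VALUE in the last column. The function name smart_raw_value is made up for this example.

```shell
# Hypothetical helper: print the RAW_VALUE (10th column) of one SMART
# attribute, reading smartctl attribute-table text from stdin.
# $1 = attribute name, e.g. Reallocated_Sector_Ct
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Usage (needs root and smartmontools; not run here):
#   smartctl -A /dev/sda | smart_raw_value Reallocated_Sector_Ct
```

For raw values that contain spaces, this prints only the first token.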

Here are the values from my pair of drives (be sure to scroll to the right to see the field values):

sda:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 162 161 021 Pre-fail Always - 6875
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 50
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 065 065 000 Old_age Always - 25675
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 48
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 9
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 50
194 Temperature_Celsius 0x0022 105 099 000 Old_age Always - 45
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 5
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

and sdb:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 164 164 021 Pre-fail Always - 6775
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 38
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 066 066 000 Old_age Always - 25548
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 36
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 6
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 38
194 Temperature_Celsius 0x0022 110 099 000 Old_age Always - 40
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

The good news is that I'm getting a raw value of 0 for the reallocated sector count on both drives.

Hope this helps.
