Finding MTBF of One Disk to Fail Amongst an Estate of Many using PowerShell

Here I present a little script I cooked up to work out, from MTBF figures, how long you can expect to wait for one disk to fail amongst your entire Clustered ONTAP storage estate.

The theory behind the script is slightly shaky. Pretty much the only figures we have available for working out how regularly we might expect a disk to fail in our estate are the published MTBF figures. For the calculation, I’ve taken the view (probably wrong) that if one disk has an MTBF of 2 million hours, then 2 disks together will see a failure every 1 million hours, and so on …
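
To make that assumption concrete, here is a minimal sketch of the per-type arithmetic. The function name Get-FirstFailureHours and its parameters are hypothetical and not part of mtbf.ps1; it simply divides the published MTBF by the number of disks, which is the simplification described above rather than a rigorous reliability model.

```powershell
# Hypothetical helper, not part of mtbf.ps1: expected hours until the first
# failure among $DiskCount identical disks, under the simple MTBF / N assumption.
function Get-FirstFailureHours {
    param(
        [double]$MtbfMillionHours,  # published MTBF, in millions of hours
        [int]$DiskCount             # number of disks of that type
    )
    return (1000000 * $MtbfMillionHours) / $DiskCount
}

# One disk at 2 million hours, two disks at 1 million hours, and so on.
Get-FirstFailureHours -MtbfMillionHours 2.0 -DiskCount 1   # 2,000,000 hours
Get-FirstFailureHours -MtbfMillionHours 2.0 -DiskCount 2   # 1,000,000 hours
```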

Firstly, you’ll want to set up PowerShell to connect to all your clusters (you might consider using my CDOT PowerShell connections manager from this post to do that). Then copy the script below into Notepad/Notepad++, save it as, say, mtbf.ps1, and dot-source the script to load the function into PowerShell (remember the space between the two dots):

. .\mtbf.ps1

Finally, run the function using:

mtbf

The script scans all your clusters for the various types of disk and works out, per type, the MTBF-based time for one disk of that type to fail in the estate. Then, la pièce de résistance, it combines all the disk types into a single MTBF-based figure for one disk to fail in the entire estate.
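
As a rough illustration of the combined figure, the sketch below sums the per-type failure rates and inverts the total. This is algebraically what the script's multiplier approach reduces to, just written more directly. The disk counts are made up for the example; the MTBF values are the ones used in the script, and the variable names are my own.

```powershell
# Illustrative disk counts only; MTBF values are in millions of hours.
$estate = @(
    @{ Type = "SAS";  Count = 1000; Mtbf = 1.6 },
    @{ Type = "SATA"; Count = 600;  Mtbf = 1.2 }
)

# Sum the per-type failure rates (expected failures per million hours).
$failuresPerMillionHours = 0
foreach ($d in $estate) {
    $failuresPerMillionHours += $d.Count / $d.Mtbf
}

# Invert the combined rate to get hours until one disk fails anywhere in the estate.
$combinedHours = 1000000 / $failuresPerMillionHours
$combinedDays  = $combinedHours / 24
"{0:N0} hours ({1:N1} days)" -f $combinedHours, $combinedDays
```

With these made-up counts the result comes out at roughly 889 hours, or about 37 days, which is shorter than either disk type would give on its own.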

The Script

### START OF SCRIPT - mtbf V1.3b ###

FUNCTION mtbf {

<# The following two hashed lines contain all disk types. So as not to waste cluster CPU cycles searching for stuff that isn't there, RECOMMEND reducing the list by removing the disk types you know you definitely don't have. #>
# $diskMTBFs = @("ATA",1.2,"BSAS",1.2,"FCAL",1.6,"FSAS",1.6,"LUN",0,"MSATA",1.2,"SAS",1.6,"SATA",1.2,"SSD",2.0) # Figures in millions of hours!
# $diskRecord = @(0,0,0,0,0,0,0,0,0) # Need a 0 for each active type of disk!
$diskMTBFs = @("ATA",1.2,"BSAS",1.2,"FCAL",1.6,"FSAS",1.6,"LUN",0,"MSATA",1.2,"SAS",1.6,"SATA",1.2,"SSD",2.0)
$diskRecord = @(0,0,0,0,0,0,0,0,0)
$diskTypesCount = ($diskMTBFs.count)/2
$diskAttributes = Get-NcDisk -template
$diskAttributes.name = ""
$diskQuery = get-ncdisk -template
Initialize-NcObjectProperty -object $diskQuery -name DiskInventoryInfo
$count = 0

do {
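$disks = $null
# Query one disk type per pass; the type names sit at the even indexes of $diskMTBFs
$diskQuery.DiskInventoryInfo.DiskType = $diskMTBFs[$count*2]
$disks = Get-NcDisk -Query $diskQuery -attributes $diskAttributes
# @() forces an array so a single returned disk is still counted as 1
if ($disks){$diskRecord[$count] = @($disks).count}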
$count++
} until ($count -eq $diskTypesCount)

$mtbfOutput = @()
$mtbfOutput += " "  
$mtbfOutput += "Using MTBF: For a NetApp CDOT estate roughly how much time for 1 disk to fail!"
$mtbfOutput += "##############################################################################"
$mtbfOutput += " "  
$mtbfOutput += "The following types of disks were detected in your estate:"
$mtbfOutput += " "
$count = 0
$totalDisks = 0
$diskMTBFMultiplier = 1

do {
$diskRecordCount = $diskRecord[$count]
$diskRecordType = $diskMTBFs[$count*2]
$diskRecordMTBF = $diskMTBFs[$count*2+1]
if (($diskRecordCount -ne 0) -and ($diskRecordType -eq "LUN")){
$mtbfOutput += "$diskRecordCount disks of type $diskRecordType with an unknown MTBF."}
if (($diskRecordCount -ne 0) -and ($diskRecordType -ne "LUN")){
[int]$diskRecordHours =(1000000*($diskMTBFs[$count*2+1])/$diskRecord[$count])
[int]$diskRecordDays = $diskRecordHours / 24
$mtbfOutput += "$diskRecordCount disks of type $diskRecordType with an MTBF of $diskRecordMTBF million hours each."
$mtbfOutput += "MTBF based time for one disk to fail amongst all these disks is $diskRecordHours hours ($diskRecordDays days)."
$mtbfOutput += " "
$diskMTBFMultiplier = $diskMTBFMultiplier * $diskRecordMTBF
$totalDisks = $totalDisks + $diskRecord[$count]}      
$count++
} until ($count -eq $diskTypesCount)

$count = 0
$diskRecordCountSpecial = 0
$mtbfOutput += "Combined"
$mtbfOutput += "########"
$mtbfOutput += " "  

do {
$diskRecordCount = $diskRecord[$count]
$diskRecordType = $diskMTBFs[$count*2]
$diskRecordMTBF = $diskMTBFs[$count*2+1]
if (($diskRecordCount -ne 0) -and ($diskRecordType -ne "LUN")){
$diskRecordCountSpecial = $diskRecordCountSpecial + ($diskRecordCount * $diskMTBFMultiplier / $diskRecordMTBF)}
$count++
} until ($count -eq $diskTypesCount)

[int]$diskRecordTotalHours = (1000000*$diskMTBFMultiplier/$diskRecordCountSpecial)
[int]$diskRecordTotalDays = $diskRecordTotalHours / 24
$mtbfOutput += "Considering all $totalDisks disks:"
$mtbfOutput += " "
$mtbfOutput += "MTBF based time for one disk to fail in the entire disk estate is $diskRecordTotalHours hours ($diskRecordTotalDays days)."
$mtbfOutput += " "
return $mtbfOutput}

### END OF SCRIPT ###

An Example Output

As an example of the script in action, the output below gives a not unreasonable figure for an estate of more than 2000 disks! In reality we’d expect a figure a fair bit lower than the one given (I did say the theory behind the script was a bit shaky), but as a curiosity it serves its purpose.

Image: An MTBF based calculation of 1 disk to fail amongst an estate of many!
