An interesting problem with a P4000 SAN last week!
Due to electrical maintenance work, a 2-node P4000 SAN cluster with a Failover Manager needed to be completely powered down. This was done in the correct order: the hosts accessing the SAN were powered down first, then the SAN was shut down via the “Shut Down Management Group...” option in the HP StorageWorks P4000 Centralized Management Console (CMC).
Something went wrong when the SAN was powered back up, and the result was:
i: The Management Group was operating in Maintenance Mode.
ii: All the volumes (including the Network RAID tolerant volumes) had a status of 'Not Available'.
iii: One of the Storage Systems had a status of Failed.
iv: Running Diagnostic tests on the "Failed" storage system indicated that the Cache Status of Cache 1 was Corrupt.
The resolution is in two parts:
Part 1: Restoring the Management Group to Normal Mode
1.1 Right-click the Management Group and select 'Edit Management Group'
1.2 In the Edit Management Group dialog box, to the right of Group Mode: Maintenance Mode, click on the button 'Set To Normal'.
Part 2: Restoring the Cache Status of Cache 1 from Corrupt back to PASS
Completing Part 1 and setting the Management Group back to Normal will restore access to volumes and allow services on those volumes (at least the Network RAID tolerant volumes) to resume. Part 2 requires 4th-line (or manufacturer) support, and the steps below are how HP Support resolved the issue.
2.1 Use PuTTY or similar to SSH to the affected storage system and log in with the root password and the correct challenge s/key.
Note: Access to the underlying CentOS Linux CLI is only available to HP Techs – only HP Tech Support have access to the root password and the tool to find the correct challenge s/key password.
2.2 At the # prompt type the hpasmcli command and press return, to enter the HP management CLI for Linux
2.3 At the hpasmcli> prompt type clear iml and press return
2.4 At the hpasmcli> prompt type exit and press return
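2.5 At the # prompt, enter each line in turn, pressing return after each
/etc/lefthand/system/./servicectl --stop-all
rm -f /etc/configs/Controller.cache.discarded
/etc/lefthand/system/./servicectl --start-all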
2.6 Close and reopen the CMC and log back into the Management Group, and check the system is now healthy!
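For reference, the command sequence from step 2.5 can be wrapped in a small shell function. This is only a sketch: the two paths are the ones that appear in the PuTTY session output, while the function name and the SERVICECTL and FLAG_FILE variables are made up for illustration, and it assumes servicectl returns a non-zero exit status on failure, which the session output does not confirm.

```shell
#!/bin/sh
# Sketch of step 2.5. The default paths come from the session output;
# the variable names and the function name are illustrative only.
SERVICECTL="${SERVICECTL:-/etc/lefthand/system/servicectl}"
FLAG_FILE="${FLAG_FILE:-/etc/configs/Controller.cache.discarded}"

clear_cache_discarded_flag() {
    # Stop the services before touching the flag file; give up if that fails
    # (assumes servicectl signals failure through its exit status).
    "$SERVICECTL" --stop-all || return 1
    # -f keeps rm quiet and successful if the file is already gone.
    rm -f "$FLAG_FILE"
    # Start the services again.
    "$SERVICECTL" --start-all
}
```

Nothing runs until clear_cache_discarded_flag is called, and both variables can be pointed at a scratch directory to rehearse the sequence away from a real storage system.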
Below is the output from the PuTTY session:
login as: root
Support Key: ??:??:??:??:??:??
Using keyboard-interactive authentication.
Using keyboard-interactive authentication.
challenge s/key 99 none28381
[root@SAN02 ~]# hpasmcli
HP management CLI for Linux (v2.0)
Copyright 2008 Hewlett-Packard Development Group, L.P.
hpasmcli> clear iml
IML Log successfully cleared.
[root@SAN02 ~]# /etc/lefthand/system/./servicectl --stop-all
response: hydra ok: mgmt-gw ok: dbd_agent ok: hpclimon ok: gcagent ok: eman ok: dbd_store ok: hplogmon ok: dbd_manager ok:
[root@SAN02 ~]# rm -f /etc/configs/Controller.cache.discarded
[root@SAN02 ~]# /etc/lefthand/system/./servicectl --start-all
response: hydra ok:started mgmt-gw ok:started dbd_agent ok:started hpclimon ok:started gcagent ok:started eman ok:started dbd_store ok:started hplogmon ok:started dbd_manager ok:started
My colleagues Ekim Vopall and Veets Sejon.
HP Storage Division Tech Support.
On a similar note, I wish to move my CMC installation to another server. Will removing management groups disconnect my current iSCSI/gateway sessions to my ESXi 5.0 servers?
How does one move a CMC installation?
Hello Anonymous, the CMC is pretty much just a management tool, and can be installed as many times and wherever you like. Download from http://www.hp.com/go/P4000downloads. Not sure I have understood what you mean by removing management groups: if you remove a management group, this would destroy the data on any NSMs that are inside the management group. Cheers!
I don't think Back-end Commands should be in public domain.
Thank you for the comment.
All the commands do is clear the IML, stop the services, use a Linux command to delete a file, and then start the services again; these are troubleshooting abilities that are commonly available to sysadmins on other systems.
It seems a shame when things are locked down in such a way that it forces a sysadmin to use manufacturer support, but I understand that this is done to prevent customers from doing serious damage to their systems.
Even with the information in this post, a customer must still call HP as this is the only way past the root and challenge key passwords.
Thank you and have a good day!
Or do what I do and mount the Virtual SAN Appliance into another Linux host and disable their SSH password changer.
Hi, and exactly how can I contact HP Tech Support? I need the root password.
HP Support will not give you the root password. Best to check out HP.com for their support numbers, and there's a chat option too. Cheers!
What about this: boot the SAN node from a Linux live CD, mount the SAN filesystem, chroot, and then "rm -f /etc/configs/Controller.cache.discarded"?
Hi Anonymous, that's an absolute genius idea - thanks for sharing. I hadn't thought of that. Cheers!
Do you think HP's people are so naive as to allow that?
If HP did allow this backdoor method of entry to the SAN's underlying Linux O/S filesystem, then I'd be very surprised. Would be interesting to know either way if it is possible. I was kind of thinking the original comment came from experience of having already done so.
No, it was only a shot in the dark; many problems have been solved this way.
There are many ways to solve problems like that, and you can certainly go that way, but at some point you should consider that there are good reasons why this error is raised.
A corrupt cache means a serious risk that data has been corrupted, and you risk replicating corrupt data across the volumes.
Once the data is replicated you will not have any chance to correct this, not even through HP Support.
Remove the node from the management group by setting it into repair mode, reconfigure the hardware RAID on it via the CMC, then add it back in and exchange it for the ghost IP; this will ensure data consistency on your volumes.
Hi, I have this problem and I need your help. A node in my P4300 SAN is returning the following message: 'process "dbd_manager" is using obsolete setsockopt SO_BSDCOMPAT'. The main problem is that the node shows this message in the CMC: 'storage system offline, san/iq disconnected'. Man, I've done everything, but obviously I'm missing something. Can you help me?
Open a call with HP, since the messages you see don't have anything to do with your issues.
Most likely hydra or the management gateway needs to be restarted and the logs checked.
Hi, I have a different situation here, and I hope someone has already experienced our problem. Suddenly the clustered volume was no longer recognized by the host as NTFS. It became RAW and Windows asks me to format the drive before I can use it, but according to the CMC the volume is still there. Is there any solution for this issue? I will appreciate anyone's ideas about this.
Thank you in advance.
I have a P4000 with SAN/iQ version 10 installed. I can't remember the CMC admin login. How do I reset it to the default?
Call the HP Support Center; they will fix this for you.
I have an existing cluster with 3 VSAs in it. I have added additional disks to all 3 VSAs so that I can expand my SAN size. On all 3 VSAs in the CMC I can see the new disk as uninitialized. How do I expand my SAN? Thanks.
Sorry Suhail, I can't help you with this one, best to speak to HP Support. Cheers!
Was successfully able to delete the required file using a live CD to overcome this issue (also cleared the IML via iLO rather than using the commands).
Hi, can you give me the exact steps, please?
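The exact steps were never posted in this thread, so the following is only a guess at the file-removal part of the live-CD approach. It assumes the storage system's root filesystem has already been mounted from the live environment; the mount point is whatever you chose, and the function name is hypothetical. Only the path etc/configs/Controller.cache.discarded comes from the post itself. Bear in mind the warning above: a corrupt cache can mean corrupt data, and clearing the flag does not repair it.

```shell
#!/bin/sh
# Remove the cache-discarded flag from an already mounted root filesystem.
# $1 is the mount point chosen on the live system (hypothetical, e.g. /mnt/san).
remove_flag_under_root() {
    root="$1"
    # Refuse to continue if this does not look like the expected filesystem.
    if [ ! -d "$root/etc/configs" ]; then
        echo "no etc/configs under $root; wrong partition?" >&2
        return 1
    fi
    # -f means a missing flag file is not treated as an error.
    rm -f "$root/etc/configs/Controller.cache.discarded"
}
```

After mounting the partition that holds the node's root filesystem (the device name depends on the hardware and is not given anywhere in this thread), the call would be remove_flag_under_root followed by the mount point.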