Last year I presented at the local Cisco DCUG to a warm and receptive audience about Cisco UCS Director being deployed on a global scale. At the time I was working for a global pharmaceutical company and following some organisational changes the requirements of the business and in turn IT changed to match. A key part of the changes focused on global standardisation of IT infrastructure to ensure 24 x 7 operational support. The best way to achieve that goal was to look at automation and orchestration. Cisco UCS Director was the tool chosen at the time. UCS Director is an absolute beast of a product and it reflects badly on Cisco as to how they have marketed and managed the product. It has potential to be the one stop shop for infrastructure management.
Create a global platform to enable physical and virtual automation based on standardised templates and processes.
- Drive standardisation across 14 global sites, reduce management overheads and complexities
- Put the company in a position to leverage follow the sun support for infrastructure to minimise out of hours support at each local site
- Provide a secure platform that could easily meet strict auditing guidelines
- Deliver a mechanism to allow end-users to quickly and easily request new virtual machines
- Streamline the request for infrastructure processes and remove existing bottlenecks
- Drive the business towards a Private Cloud architecture rather than individual silos
- Reduce licensing costs across the business for multiple existing automation and orchestration platforms.
- The ability to provide a cost model and service catalog and quickly inform projects on the estimated potential costs of their projects.
- Integration into the existing service management tool
- Integration into HP Quality Control for auditing and quality control purposes. This allowed for installation verification scripts to be completed.
During a recent upgrade of Cisco B200 M4 blades I got the following error:
I really wasn’t sure what was causing the issue but it turned out to be a known bug for M4 blades. More details can be found over on Cisco BugSearch Note: You’ll need a Cisco Login to access the site. Basically the issue affects B200 M4 blades upgraded to 2.2(4) or higher.
The workaround is actually quite easy and just needs to have the FlexFlash Controller reset. This can be done using the below steps:
Step 1: Select Equipment -> Chassis # -> Server # -> Inventory -> Storage -> Reset FlexFlash Controller
Step 2: Click Yes to reset the FlexFlash controller
Step 3: Click Ok on reset notification
During a recent upgrade I ran into a problem with activation of B200 M4 blade. This was following the infrastructure firmware upgrade and the next step was to upgrade the server firmware. However, before upgrading the server firmware I got the error from the B200 M4 blades showing the following error:
Activation failed and Activate Status Set to Failed
This turned out to be due to the B200 M4 blades shipping with version 7.0 of the board controller firmware. On investigation with Cisco I found that it’s a known bug – CSCuu78484
You can follow the commands to change the base board. You can find more information on that from the Cisco forums but the commands you need are below:
#scope server X/Y (chassis X blade Y)
#activate firmware version.0 force
>Select a lower version than current one
What I found was that since I was going to be upgrading the blade firmware version anyway there was no point in dropping the server firmware back and instead proceed with the upgrade which fixed the issue.
I spoke with TAC and they advised that the error could be ignored and I could proceed with the UCS upgrade. The full details of the upgrade can be found in another post.
Recently I had to upgrade our ESXi hosts from Update 2 to Update 3 due to security patch requirements. This requirement stretches across two separate physical environments, one running IBM blades and the other running on Cisco UCS blade chassis in a Flexpod configuration. The upgrade paths for both are slightly different, and they also run on different vCenter platforms. Both of these also have different upgrade paths as one is running VMware SRM and is in linked mode. I’m not going to discuss the IBM upgrades but I did need to upgrade the firmware of the Infrastructure and Servers for Cisco UCSM.
Before you being any upgrade process I highly recommend reading the release notes to make sure that a) an upgrade path exists from your current version, b) you become aware of any known issues in the new version and c) the features you want exist in the new version
UCS Upgrade Prep Work
Check the UCS Release Guides
Check the release notes to make sure all the components and modules are supported. The release notes for UCS Manager can be found on their site. The link is listed further below in the documents section.
Some of the things to check within the release notes are:
* Resolved Caveats
- UCS Version Upgrade patch
- UCS Infrastructure Hardware compatibility
- Minimum software version for UCS Blade servers
Open a Pre-Emptive Support Call
I opened a call with Cisco TAC to investigate the discrepancy in the firmware versions. The advice was to downgrade the B200 M4 server firmware down to 4.0 (1). However, as I was planning on upgrading anyway I’ve now confirmed that the best option is to upgrade to the planned 3.1 version. As part of this upgrade I will also upgrade all the ESXi hosts on that site the same day. There is a second UCS domain on another site that will be upgraded on another date.
After the recent upgrade to 5.4 I decided to bite the bullet and upgrade to 5.4.1. Go to the download software portal for Cisco. Download the 5.4.1.zip patch file. I had a number of issue with the download as the checksum didn’t match. I had to take a number of attempts to get the file in-tact. I believe the issue was the ISA that acts as our internet proxy. Death to the ISA!!!!
Once the file has been downloaded copy it to your FTP server. Now it’s time to apply the patch. log onto UCS Director via either the console or SSH using the shelladmin account. Select option 3 to stop all the services.
Cisco announced their release of UCS Director 5.4 back in November. As I’m currently running 5.3 and ran into an issue with a workflow Cisco support recommended upgrading to 5.4. I had a look over the Cisco UCS Director 5.4 Release Notes and there’s a new version of Java and the CentOS operating system are newer in the latest version. Due to this the upgrade procedure for 5.4 is different from previous version. In earlier versions it was possible to upload a patch via shelladmin and it would upgrade the software and database schema in place. 5.4 however requires new appliances to be deployed and a migration of database files etc. to be done between the 5.3 and 5.4 versions.
I really think that Cisco needs to look at using a HTML 5 console in the future as this upgrade path is overly complicated. Considering a lot of companies want you to be on the latest version when opening support calls, including Cisco, it would make sense for them to make it easier to perform the required upgrades.
The primary changes that have caused the modification to the upgrade path are:
- CentOS version 5.4 to version 6.6
- Java version 1.6 to version 1.8
Another thing to note is that version 5.54 requires 12GB RAM.
Cisco recommend standing up the new appliances beside your current UCS Director and Bare-Metal Appliances and performing a migration. In my case there’s a few firewall rule etc already been created for the existing environment so I wanted to keep the same IP addresses and machine names. I changed the IP addresses of the current appliances to be something else within the same subnet and gave the new appliances temporary names but the existing IP addresses. Once everything had been migrated and the changes confirmed I was able to rename the appliances to be the existing ones and removed the older appliances from the infrastructure. Before commencing the upgrade I also had a sold read over the UCS Director Upgrade 5.4 Guide and the UCS Director Bare-Metal Agent 5.4 Upgrade Guide
I was recently deploying new blades within the UCS chassis but found that I was unable to. In one UCS domain there were no issues but in the second UCS domain if failed with the error [FSM:Failed] Configuring service profile xxx (FSM:sam:dme:LsServerConfigure) and there were a number of other minor warnings as well. The service profile would appear in the list but it was highlighted as having an issue and it could not be assigned to any blades. After a bit of searching around I found an answer on the Cisco Communities forum.
I also created a tech support file and downloaded it to my desktop, extracted the compressed files and opened sam_techsupportinfo in notepad. I did a search for errors and found that there was an issue resolving default identified from UCS Central.
The solution was to unregister the server from UCS Central and then to deploy the Service Profile again. To unregister from UCS Central go to Admin Tab -> Communication Management -> UCS Central and select Unregister from UCS Central. Before unregistering make sure that the Policy resolution controls are how you want them to be. In my case they were all set to local so unregistering from UCS Central had to real impact. Many users will have UCS Central integration configured to work as it was designed and will use Global policies. Unregistering from UCS Central can have a knock on impact on how those policies are managed.
Once the unregister had completed I ran the service profile deployment from template and it worked this time. I believe the issue is down to a time sync issue between UCS and UCS Central. I’m currently working on a permanent work around
During a recent Cisco UCS upgrade I noticed an error for ethlanflowmon which was a critical alert. I hadn’t seen the problem before and it occurred right after I had upgraded UCS Manager firmware as per the steps listed in a previous post I wrote about UCS Firmware Upgrade. Before proceeding to upgrade the Fabric Interconnects I wanted to clear all alerts where possible. The alert for “FSM:FAILED: Ethernet traffic flow monitoring configuration error on” both switches was a cause for concern.
On further investigation I found that this is a known bug when upgrading to versions 2.2(2) and above. I was upgrading from version 2.2(1d) to 2.2(3d). Despite being a critical alert the issue does not impact any services. The new UCSM software is looking for new features on the FI that do not exist yet as it has not been upgraded. As soon as you upgrade the FIs this critical alert will go. More information about the bug can be found Cisco’s support page for the bug CSCul11595
During a recent UCS firmware upgrade I had quite a few blades show up with the error “CIMC did not detect storage”. Within UCSM I could see that the blade had a critical alert. It initially started after I upgraded UCS Manager firmware as documents in a previous post I wrote about UCS Firmware Upgrades. I did some searching around to find what may be causing the issue and the best answer I could find was to from the Cisco community forums to disassociate the blade, decommission and reseat within the chassis. I later spoke to a Cisco engineer and he advised of the same steps but that it was also possible to do without reseating the blade. This also looks like its a problem when upgrading from 2.2(1d) to other versions of UCSM but I haven’t been able to validate if it’s only that version or if it also affects others.
The full error I saw was for code F1004 and for Controller 1 on server 2/1 is inoperable. Reason: CIMC did not detect storage
Within UCSM I could see there was an issue with the Blade
Before proceeding with the upgrade of the FIs, IOMs and Blades themselves I wanted to clear any alerts within UCSM, particularly critical alerts. The steps I followed to bring the blade back online were to go to the blade and select Server Maintenance
I had a problem a while ago where UCS Director crashed during a Metrocluster failover test. It was caused by the delay in the transfer of writable disks on the storage which in turn caused the VM kernel to panic and set the disk to read only. After that problem, and due to other restore issues within the infrastructure as well as not having a backup prior to the failover test I was left with a dead UCS Director appliance. It was essentially completely buggered as the Postgres database had become corrupt. Cisco support were unable to resolve the problem and it took a lot of playing around with NetApp snapshots to pull back a somewhat clean copy of the appliance from before the failover test. Really messy and I wouldn’t recommend it.
Since then I’ve been capturing weekly backups of the UCS Director database to a FTP server so I have a copy of the DB to restore should there be any problems with the appliance again. This script is not supported by Cisco so please be aware of that before implementing it. To set up the backup create a DB_BACKUP file in /usr/local/etc with the following:
# server login password localfile remote-dir
echo "open $1"
echo "user $2 $3"
upload_script $1 $2 $3 put $4 $5 | /usr/bin/ftp -i -n -p
if [ ! -f $BKFILE ]
echo "Backup failed. "
export NEWFILE="cuic_backup_`date '+%m-%d-%Y-%H-%M-%S'`.tar.gz"
export FTPLOGIN=< ftp user name >
export FTPPASS=<ftp password>
doftpput $FTPSERVER $FTPLOGIN $FTPPASS $BKFILE $NEWFILE
nohup /opt/infra/startInfraAll.sh &
Next you’ll need to edit your cron jobs on the appliance. You can use the crontab -e command to edit the schedule settings and enter:
1 2 * * 0 /usr/local/etc/DB_BACKUP > /dev/null 2>&1
And there you go, you now have a weekly scheduled backup of your UCS Director database.