The Importance of Keeping Your Entire Solution Current

Recently I started working on a new case for a customer. I'm trying to diagnose repeated error messages logged by an IBM SVC cluster that indicate problems communicating with the back-end storage the SVC is virtualizing. These messages generally indicate SAN congestion problems. The customer has Cisco MDS 9513 switches installed. They're older switches, but not all that uncommon. What is uncommon is finding the switches at NX-OS version 5.X.X. I do see down-level firmware, but this one is particularly egregious: this revision is several years out of date. Later versions of code contain numerous bug fixes, both from Cisco and from the upstream Linux security updates that get incorporated into NX-OS. Also, while NX-OS versions don't officially go out of support, any new bugs identified won't be fixed, as this version is no longer being actively developed.

This level of firmware merits further investigation.  Looking deeper on the switches I find this partial switch module list:

Mod  Ports  Module-Type                         Model              Status
---  -----  ----------------------------------- ------------------ ----------
6    48     1/2/4 Gbps FC Module                DS-X9148           ok
7    48     1/2/4 Gbps FC Module                DS-X9148           ok
8    48     1/2/4 Gbps FC Module                DS-X9148           ok
9    48     1/2/4 Gbps FC Module                DS-X9148           ok

These modules are older than the firmware on the switches, and support for them ended 3 years ago. If this customer has problems with them (or the switches they are installed in) and the problem is traced back to the modules, there is not much that IBM Support can do. If a problem is traced to a bug in the firmware, the customer can't upgrade the firmware to something more current because these old, unsupported modules are still in the switches. This limits IBM's ability to provide support: the hardware is no longer supported, and much of the data we can look at in the firmware was not introduced until the next major revision level of NX-OS, v6.2(13). Later code also added some options and improvements that lower thresholds and timeout values, which increases the frequency of some logging for performance issues.
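A check like this can be scripted against saved switch output. The following is a minimal Python sketch, assuming the module list has been captured as text in the layout shown above. The `UNSUPPORTED_MODELS` set and the function name are placeholders of my own, and the set should be filled in from the vendor's end-of-support notices rather than taken from this example.

```python
import re

# Models treated as out of support. This set is an assumption for
# illustration; populate it from the vendor's end-of-support notices.
UNSUPPORTED_MODELS = {"DS-X9148"}

# Sample text in the same column layout as the module list above.
MODULE_LIST = """\
Mod  Ports  Module-Type                         Model              Status
---  -----  ----------------------------------- ------------------ ----------
6    48     1/2/4 Gbps FC Module                DS-X9148           ok
7    48     1/2/4 Gbps FC Module                DS-X9148           ok
"""

# Slot, port count, module type (may contain single spaces), model, status.
ROW = re.compile(r"^(\d+)\s+(\d+)\s+(.+?)\s{2,}(\S+)\s+(\S+)\s*$")


def unsupported_modules(text, unsupported=UNSUPPORTED_MODELS):
    """Return (slot, model) for every module row whose model is in `unsupported`."""
    found = []
    for line in text.splitlines():
        match = ROW.match(line)  # header and separator rows do not match
        if match and match.group(4) in unsupported:
            found.append((int(match.group(1)), match.group(4)))
    return found


print(unsupported_modules(MODULE_LIST))
```

Running the same check across every switch in a fabric gives a quick list of the slots that block a firmware upgrade.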

I could see several 2Gb devices attached to these modules, which is probably why they are still installed. I could also see some of these slow devices zoned to the SVC, which is connected to the SAN at 8Gbps. This violates the best practice of not zoning devices together when their port speeds differ by more than a factor of 2. So a 2Gb device should not be zoned to an 8Gb device, a 4Gb device should not be zoned to a 16Gb device, and so on. The slow device will turn into a slow-drain device sooner rather than later. I suspect this is the customer's problem, but I can't confirm it because the age of the hardware and firmware limits the data available.
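The 2x rule of thumb is easy to check mechanically once you have the zone membership and negotiated port speeds. Here is a minimal Python sketch. The zone names, device names, and data layout are hypothetical, and in practice the inputs would have to be built from the switch zoning and interface data.

```python
from itertools import combinations

MAX_SPEED_RATIO = 2  # rule of thumb from the text: no more than a 2x difference


def speed_mismatches(zones, port_speed_gbps, max_ratio=MAX_SPEED_RATIO):
    """Return (zone, slow_device, fast_device, ratio) for each zoned pair over max_ratio."""
    problems = []
    for zone, members in zones.items():
        for a, b in combinations(members, 2):
            # Order the pair so the slower device comes first.
            slow, fast = sorted((a, b), key=lambda dev: port_speed_gbps[dev])
            ratio = port_speed_gbps[fast] / port_speed_gbps[slow]
            if ratio > max_ratio:
                problems.append((zone, slow, fast, ratio))
    return problems


# Hypothetical example data: one legacy 2Gb host and one 4Gb host zoned to an 8Gb SVC node.
zones = {
    "legacy_host_to_svc": ["legacy_host", "svc_node1"],
    "new_host_to_svc": ["new_host", "svc_node1"],
}
port_speed_gbps = {"legacy_host": 2, "new_host": 4, "svc_node1": 8}

print(speed_mismatches(zones, port_speed_gbps))
# [('legacy_host_to_svc', 'legacy_host', 'svc_node1', 4.0)]
```

The 4Gb host passes because 8/4 is exactly 2, while the 2Gb host is flagged at a 4x difference.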

Another reason it is critical to keep your solution updated is that if you go too long between updates, there are often interim upgrades that need to be completed. This is common when moving between major revisions of SAN switch firmware. So if a switch is running firmware that is 1 or 2 revisions below the current code level, there will be at least 1, and possibly more, interim levels required. This greatly raises the complexity of performing upgrades. It also raises the risk, because customers will want to condense what is normally at least a two-week process per upgrade version into as little time as possible.
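To make the interim-level problem concrete, here is a small Python sketch that finds the shortest chain of upgrades through a table of supported upgrade steps. The version labels and the table itself are hypothetical placeholders, not real NX-OS releases; the real supported paths come from the vendor's release notes and upgrade guides.

```python
from collections import deque

# Hypothetical map of supported direct upgrades: from each level, the levels
# it may be upgraded to in one step. Labels are placeholders, not real releases.
SUPPORTED_UPGRADES = {
    "A": ["B"],
    "B": ["C", "D"],
    "C": ["D"],
    "D": [],
}


def upgrade_path(current, target, supported=SUPPORTED_UPGRADES):
    """Breadth-first search for the shortest chain of supported upgrades.

    Returns the list of levels from current to target, or None if no chain exists.
    """
    queue = deque([[current]])
    seen = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in supported.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


path = upgrade_path("A", "D")
print(path)            # ['A', 'B', 'D']
print(len(path) - 2)   # 1 interim level between the current and target code
```

Each entry between the first and last element of the path is an interim level, and each one is its own upgrade with its own planning, risk, and soak time.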

Lastly, when a customer has hardware or firmware that is out of support, they usually have to do emergency upgrades (sometimes several levels) to get to a supported solution. It is always preferable to do regular upgrades as part of a planned maintenance cycle. Getting current on firmware is difficult enough on a small fabric with only 1 or 2 switches. On a large enterprise-class fabric, it would be extremely complex to upgrade all the switches across multiple interim versions while maintaining proper interoperability. Each upgrade also carries a risk of impacting the production environment.


The recommendations I gave this customer:

  1. Move the applications on those slow servers to servers with a 4Gb or (ideally) 8Gb connection to the SAN on the other, newer modules in the switches. This will allow the old modules to be decommissioned and moves the fabric toward a best-practice solution.
  2. Decommission those old modules, and replace them with newer modules if the port density is needed. This will allow for firmware upgrades, which are beneficial for all the reasons noted above.
  3. Start planning for a refresh of the switches themselves. While the switch chassis will be supported for some time yet, they have already been end of life for a few years.
