Posts

How Physical Link Errors Can Cause Performance Problems

I am frequently involved in troubleshooting performance problems on storage networks. Performance issues have many possible causes, and physical layer (link) problems are one of them. Physical link issues cannot be completely (or at least permanently) prevented, because the hardware involved (cabling and transceivers) degrades over time. Loose or damaged cables, degraded transceivers or, more rarely, failing switch ports can all cause link layer problems. The most common indicator of a physical layer problem is CRC errors occurring on a link. Read this blog post for more information on CRC errors and how to troubleshoot them. The fabric easily handles a few CRC errors in a day or an hour. If they begin incrementing rapidly, however, the physical link conditions that produce them will affect performance. Fibre-channel uses a buffer credit scheme for flow control. Under normal conditio...

Using Performance Data To See Network Problems

I frequently work cases where the problem is a performance problem. Either an entire system or an application is slow enough that users are affected. Another frequent performance problem is with storage-side replication. In these cases replication is not able to keep up with the production workload and RPOs are not being met. Replication is most commonly done between sites, though I have worked a few cases with same-site (or campus) replication. Whether you are using IBM DS8000 PPRC/Global Mirror, IBM SVC/FlashSystem Global Mirror (GM) or Global Mirror with Change Volumes (GMCV), you expect that the replicated data will be current up to a certain point in time behind the production data. This is your Recovery Point Objective (RPO): how current the replicated data needs to be. For data that doesn't change often, an RPO of 30 minutes or an hour might be enough. For data that frequently change...

Troubleshooting Slow Drain Devices on Broadcom Switches

Slow drain devices are one of the more common problems on storage networks. They can occur for a variety of reasons. For a refresher on how they can affect your storage network you should watch this video. In this blog post I will go through the basic steps to troubleshoot a slow drain device on a Broadcom fabric. I will be using command line output from switches. The CLI format lends itself to a blog post more readily than screenshots from a GUI, and the commands are consistent across different versions of FOS. SANnav is a huge change from Brocade or IBM Network Advisor and the screens would look quite different between the two. The first command we will be using is porterrshow. The above output has been truncated to the ports we are interested in. The counters of interest are in the c3timeout column. You can see that there are 2 sub-columns, 'tx' and 'rx'. 'tx' means the switc...

Spectrum Virtualize NPIV and Host Connectivity

A while ago I wrote this post as an introduction to the Spectrum Virtualize NPIV feature. In this follow-up post I thought I would focus more on host connectivity and the effects of NPIV. You can watch a quick review of the NPIV feature in this IBM Systems Rockstar video:

NPIV has 3 modes:
1. Disabled - this mode means that hosts cannot connect to the virtual World Wide Port Names (WWPNs) on the Spectrum Virtualize cluster, regardless of the fabric zoning.
2. Transitional - this mode means hosts can connect to either the physical or virtual WWPNs on the cluster. If a host is zoned to both, it will connect to both. Transitional mode is meant to only be used while you are migrating to NPIV mode and rezoning your hosts to the virtual WWPNs. It is not meant to be used permanently or even long-term.
3. Enabled - this means hosts can only connect to the virtual WWPNs. ...

IBM Spectrum Virtualize Safeguarded Copy

Several months ago I was asked by a local organization if I could recover files from a system that had been encrypted by a ransomware attack. After looking at the hard drive in the system and doing some research, I told the organization that I could not. The organization did not have a backup of the files, at least not a recent one. Its most critical data loss was financial records. It took a few months and a lot of work to recover most of the missing records. Had the organization done something as simple as periodically plugging in a USB drive, running a backup and then removing the drive, it would have saved a lot of work. The USB drive is somewhat of an immutable copy of the data, at least as long as it is not plugged into the computer while the computer is still infected. However, a USB-attached drive doesn't really scale well at the enterprise level, and it is not a true immu...

Brocade Fabric Performance Impact Notification

Brocade Fabric Performance Impact Notification (FPIN) was released in Broadcom FOS v9.0. It is available on Brocade Gen6 and Gen7 switches. This feature enables the switch to detect issues on a fabric such as congestion or physical link issues and then notify the affected devices that have registered for these notifications. FPIN works through a mechanism similar to RSCN. RSCN enables the fabric to send notifications to devices when a device they are zoned to is going offline. The devices that receive these notifications can then proactively take steps such as path failover rather than have to react to a path being down. FPIN provides a means to notify devices of link or other issues with a connection to a fabric or a path through it. For both RSCN and FPIN, a device must register with fabric services to receive these notifications. The new Brocade Gen7 har...

Long Distance Fibre Channel Link Tuning

In this video I talk about some of the variables involved in tuning long-distance fibre-channel links. In this blog post I'll detail some of the tools that are available. I will also provide an example of estimating the number of buffer credits you will need. Note that this tuning is only for fibre-channel links. This does not apply to FCIP tunnels or circuits. One critical piece of information that you will need to calculate buffer credits is the frame size. Smaller frames mean more of them can fit in the link, so you would need more buffer credits. Of the variables that go into the formula, this is the only unknown. Everything else is either known or is a constant. Brocade has the 'portbuffershow' command that can tell you the average frame size for a link. You would look at the Framesize columns for TX and RX in the portbuffershow output to get the frame size. The portbuffershow outp...
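
To illustrate why frame size drives the buffer credit count, here is a minimal Python sketch of a first-principles estimate. It is not the formula from the full post, which is cut off in this excerpt. It assumes light takes roughly 5 microseconds to travel 1 km of fibre and that a link carries about 100 MB/s of data per 1 Gbps of nominal fibre-channel speed. The function name and its parameters are hypothetical, and the result is a rough starting point rather than a vendor-recommended value.

```python
import math

# Assumptions (not taken from the post): light travels through fibre at
# roughly 5 microseconds per km, and a link moves about 100 MB/s of data
# per 1 Gbps of nominal fibre-channel speed.
FIBRE_DELAY_S_PER_KM = 5e-6
BYTES_PER_S_PER_GBPS = 100e6


def estimate_buffer_credits(distance_km, speed_gbps, avg_frame_bytes):
    """Estimate the credits needed to keep a link full (hypothetical helper)."""
    if distance_km <= 0 or speed_gbps <= 0 or avg_frame_bytes <= 0:
        raise ValueError("all inputs must be positive")
    # A credit is only returned after the frame arrives and the
    # acknowledgement travels back, so use the round-trip time.
    round_trip_s = 2 * distance_km * FIBRE_DELAY_S_PER_KM
    # Time needed to put one average-sized frame on the wire.
    frame_time_s = avg_frame_bytes / (speed_gbps * BYTES_PER_S_PER_GBPS)
    # Number of frames that can be in flight during one round trip.
    return math.ceil(round_trip_s / frame_time_s)


if __name__ == "__main__":
    # Same 50 km, 8 Gbps link with three different average frame sizes.
    for frame_bytes in (2148, 1024, 512):
        print(frame_bytes, estimate_buffer_credits(50, 8, frame_bytes))
```

Under these assumptions, a 50 km link at 8 Gbps needs about 187 credits for full-size 2148-byte frames and about 391 credits when the average frame is 1024 bytes. The average frame size you would plug in is the value reported in the Framesize columns of portbuffershow.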