How Storage Insights Can Enhance Your Support Experience

An Introduction

In the first week of May, IBM announced IBM Storage Insights. As of 11 June, Storage Insights includes these key items:
  • IBM Blue Diamond support
  • Worldwide support for opening tickets
  • Custom dashboards
  • New dashboard table view
  • Clients can now specify whether IBM Storage Support can collect support logs.  This is done on a per-device basis.
You can get a complete list of the new features here:  Storage Insights New Features
There are some other new features, such as new capacity views on the Storage Insights Dashboard. With these new features, especially support for IBM Blue Diamond customers, Storage Insights is an increasingly important and valuable troubleshooting tool. My team is seeing more and more customers who are using Storage Insights, so I thought I would discuss its potential benefits as a troubleshooting tool.

Some Background

The problems my team fixes can be categorized as either:
  1. Root-cause analysis (RCA) - meaning a problem happened at some point in the past, and the customer wants to know why it happened
  2. Ongoing issue - the problem is happening now (or happens repeatedly, also called intermittent)
Either type of problem can be further broken down into partially working or completely broken. Of the two, a partially working problem can be much more difficult to troubleshoot, especially if it is intermittent rather than constant. As an example, some years ago my van had a misfire on one of its cylinders, but we didn't know which one. Of course, it never occurred when my mechanic was driving it. It finally took several hours at the dealer, with the van hooked up to a test rig, to record the failure and identify the misfire. Had the problem been a completely broken spark plug wire instead of a partial failure, it would have been much easier to identify.
You can imagine the difficulty of attempting to root-cause a problem that happened hours or days ago on a large and busy SAN if the problem was not severe enough to cause the switches to record any time-stamped errors or other indicators of problems. As an example, I'm confident you've been in slow-moving traffic where the cause of the problem isn't readily apparent. The analogy isn't perfect, but suppose the traffic cameras in your city were configured to start recording, and/or send an alert back to the traffic center, only when traffic has been moving at less than 30 mph for 2 minutes. They do record the number of cars passing by and the number of cars exiting and entering the freeway at each ramp, but they don't timestamp these numbers. They only timestamp the video and/or alerts. Now further suppose you were stuck in traffic last week that was moving at 32 mph. Since it didn't meet the threshold, the cameras never recorded anything and no alerts were sent. You could collect the statistics on the number of cars counted by the cameras, but without anything recorded from last week it would be extremely difficult to explain why traffic was slow, since you can't reconcile the count of cars to any specific point in time. If traffic had been completely stopped, the cameras would have started recording, and you'd be able to see the car fire, or accident, or whatever the cause of the problem was last week. The same limitations exist for ongoing issues. If traffic is moving slowly but not completely stopped, then identifying the cause of the slow traffic can be difficult.

How Storage Insights Can Help Provide Root-Cause

Storage Insights has the potential to provide an explanation for these partially working root-cause investigations by regularly sampling the performance statistics and timestamping that data. If we had something like Storage Insights regularly sampling statistics from our traffic cameras, we could go back and analyze the samples for the time period when you were sitting in traffic. We might find a certain exit ramp from the freeway was congested at the time of the problem. We could take this information and correlate it with other data to try to determine why the ramp was backed up. We might find a concert was going on at a venue near the ramp, or some other event that caused an increase in traffic to that ramp.
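
To make the idea of timestamped sampling concrete, here is a minimal sketch in Python that stays with the traffic analogy. It is not the Storage Insights API. The sample data, the ramp names, and the interval_rates function are all hypothetical and exist only for illustration. The point is that once a running counter is sampled at known times, the difference between consecutive samples gives a rate that can be tied to a specific window in the past.

```python
from datetime import datetime

# Hypothetical samples of a running car counter at each ramp.
# Each tuple is (sample time, ramp name, cumulative cars counted).
samples = [
    (datetime(2018, 6, 4, 8, 0), "ramp_12", 1000),
    (datetime(2018, 6, 4, 8, 5), "ramp_12", 1100),
    (datetime(2018, 6, 4, 8, 10), "ramp_12", 1120),
    (datetime(2018, 6, 4, 8, 15), "ramp_12", 1130),
    (datetime(2018, 6, 4, 8, 0), "ramp_13", 500),
    (datetime(2018, 6, 4, 8, 5), "ramp_13", 600),
    (datetime(2018, 6, 4, 8, 10), "ramp_13", 700),
    (datetime(2018, 6, 4, 8, 15), "ramp_13", 800),
]


def interval_rates(samples, start, end):
    """Return {ramp: [(interval_end, cars_per_minute), ...]} for every
    sampling interval that ends inside the window [start, end]."""
    # Group the samples by ramp, in time order.
    by_ramp = {}
    for ts, ramp, count in sorted(samples):
        by_ramp.setdefault(ramp, []).append((ts, count))

    rates = {}
    for ramp, points in by_ramp.items():
        # Pair each sample with the next one and take the counter delta.
        for (t0, c0), (t1, c1) in zip(points, points[1:]):
            if start <= t1 <= end:
                minutes = (t1 - t0).total_seconds() / 60
                rates.setdefault(ramp, []).append((t1, (c1 - c0) / minutes))
    return rates


window = interval_rates(
    samples,
    start=datetime(2018, 6, 4, 8, 5),
    end=datetime(2018, 6, 4, 8, 15),
)
for ramp, points in sorted(window.items()):
    print(ramp, [round(rate, 1) for _, rate in points])
```

In this made-up data, ramp_12 drops from 20 cars per minute to 2 while ramp_13 holds steady at 20. Untimestamped totals alone could not show that.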

How Storage Insights Can Decrease Problem Resolution Time

For a problem that is happening now, Storage Insights can help you reach a resolution more quickly than you could without it. Going back to our traffic example, suppose there is an accident or some other problem on a surface street that an exit ramp connects to. Traffic will eventually back up onto the exit ramp and then onto the freeway. Without Storage Insights, you'd have to look at each of your traffic cameras in turn and try to figure out where the congestion starts. With Storage Insights, since it's collecting the statistics, you can filter them to find out which of your exit ramps is the congested one.
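
The filtering step can be sketched the same way. This is again a hypothetical Python illustration of the traffic analogy, not how Storage Insights itself is implemented. The congested_ramps function, the ramp names, the numbers, and the 50 percent ratio are all assumptions chosen for the example. It compares each ramp's current throughput with its normal baseline and lists the ramps that have fallen well below it, worst first.

```python
def congested_ramps(current, baseline, ratio=0.5):
    """Return (ramp, fraction_of_baseline) pairs for ramps whose current
    throughput is below ratio * baseline, sorted with the worst ramp first."""
    flagged = [
        (ramp, current[ramp] / baseline[ramp])
        for ramp in current
        # Skip ramps with no usable baseline to avoid dividing by zero.
        if baseline.get(ramp, 0) > 0 and current[ramp] < ratio * baseline[ramp]
    ]
    return sorted(flagged, key=lambda item: item[1])


# Hypothetical cars-per-minute figures for three exit ramps.
current = {"ramp_12": 2.0, "ramp_13": 18.0, "ramp_14": 9.0}
baseline = {"ramp_12": 20.0, "ramp_13": 20.0, "ramp_14": 20.0}

for ramp, fraction in congested_ramps(current, baseline):
    print(f"{ramp}: {fraction:.0%} of normal throughput")
```

Instead of checking every camera in turn, one pass over the collected statistics points to ramp_12 as the place to start looking.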

