Monday, September 14, 2020

Help IBM Storage Support Help You


A client recently asked me what his company could do to get me the data that would be most helpful in troubleshooting problems in his solution. This was after we were unable to provide a definitive root cause for a problem that occurred intermittently. He had a fairly simple fabric consisting of two 96-port switches, a few IBM Storage Systems and 30 or so hosts. His problem was poor performance on the hosts. At the time, the best I could tell him was that the data indicated a slight correlation between host read activity and the performance problem, but I could not confirm anything with certainty.

My answer was simple: configure better event detection and system logging. This is something I teach as a best practice at IBM Technical University. I also suggested that his company install at least the free version of IBM Storage Insights. Without a performance monitoring tool, troubleshooting performance problems is like trying to figure out why a traffic jam is happening using still pictures from traffic cameras. Now imagine trying to root-cause a traffic jam that happened yesterday or last week using pictures taken today, where the only other data you have is statistics such as how many cars the camera has counted since its counters were last reset. Solving problems with that kind of data is what the Support teams at IBM are asked to do, and it is effectively what this customer was asking.

That said, here are the recommended actions you can take to ensure the best chance of being able to provide the data that we need to solve your problems:

  1. Configure Call Home on your products. You can search the Knowledge Center for instructions for your specific IBM hosts, storage and switches. Your product can monitor itself and open tickets for hardware issues that you might not otherwise be aware of.
  2. Configure a syslog server for your products. While this won't directly help provide data, it does preserve events if a host, storage system or switch has a system failure. Without offloading syslog data, critical event data for these kinds of failures is lost. Logs also wrap, and sending events to a syslog server prevents them from being lost when that happens. You can search the Knowledge Center for instructions on how to do this for your specific IBM hosts and storage. For SAN switches, refer to the instructions from Cisco and Brocade.
  3. Configure monitoring and alerting on your SAN switches. This may require additional licensing, but an effective monitoring policy often gives us critical timestamped data. As an example, a recent case I worked on had several hosts losing paths to storage. Looking at the switch data, the switch ports for these hosts and a few others were seeing CRC errors. You can read more about them and how to troubleshoot them here. These errors are the easiest to detect and the resolution is straightforward. Because this customer had implemented a good monitoring policy, I was able to easily see the timestamps and let the customer know these errors were ongoing and needed to be resolved.
  4. Install a performance monitoring tool, at least the free version of IBM Storage Insights. My client did not have Storage Insights set up. If he had, we would most likely have been able to use the performance data to confirm the theory. A guided tour of Storage Insights is here. If you already have Spectrum Control, Storage Insights is included for the systems you have licensed in Spectrum Control. You get all the same monitoring and alerting features that are included in Storage Insights Pro. Check out this post to learn how Storage Insights can enhance your IBM Storage Support experience.
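
To make point 2 more concrete, here is a minimal Python sketch of what a syslog receiver does: it accepts UDP syslog messages and appends each one to a file on a separate machine, stamped with the time it arrived and the address it came from. This is only an illustration, not a replacement for a real syslog daemon such as rsyslog or syslog-ng. The file name `offloaded-syslog.log`, the port `5514`, and the names `format_record`, `SyslogHandler` and `serve` are assumptions chosen for the example. Standard syslog over UDP uses port 514, which normally requires elevated privileges to bind.

```python
import socketserver
from datetime import datetime, timezone

# Hypothetical output file on the machine acting as the syslog server.
LOG_PATH = "offloaded-syslog.log"


def format_record(source_ip, payload, received_at):
    """Build one log line: receive time, sending device, raw syslog text."""
    text = payload.decode("utf-8", errors="replace").strip()
    return f"{received_at.isoformat()} {source_ip} {text}"


class SyslogHandler(socketserver.BaseRequestHandler):
    """Appends every received UDP datagram to LOG_PATH."""

    def handle(self):
        # For UDP servers, self.request is a (data, socket) pair.
        data = self.request[0]
        line = format_record(
            self.client_address[0], data, datetime.now(timezone.utc)
        )
        with open(LOG_PATH, "a", encoding="utf-8") as log_file:
            log_file.write(line + "\n")


def serve(host="0.0.0.0", port=5514):
    """Listen until interrupted. The port here is an example value."""
    with socketserver.UDPServer((host, port), SyslogHandler) as server:
        server.serve_forever()
```

Calling `serve()` starts the listener. Because the events are written somewhere other than the device that generated them, they survive a failure of that device and are not lost when its local log wraps.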

For point 3, Cisco has the port-monitor feature. You can find a complete overview here. I strongly recommend that you disable the slow-drain policy that is active on a newly deployed switch and activate at least the default port-monitor policy. The default policy alerts on many more counters (19) than the slow-drain policy does, and it includes the two counters that the slow-drain policy alerts on. Enabling the default policy can help by providing timestamped data for troubleshooting problems.

Brocade has the Monitoring and Alerting Policy Suite (MAPS). MAPS can also provide the timestamped data that is often critical to determining why a problem occurred. You can find the FOS v8.2 MAPS user guide here, and you can find a blog post on integrating Brocade Flow Vision rules into MAPS here. Integrating Flow Vision allows you to alert on specific kinds of frames.
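
To illustrate why timestamped data matters more than a running total, here is a small Python sketch. It takes periodic samples of a cumulative error counter (for example, a CRC error count on one port) and reports the intervals in which the counter increased. The function name `counter_deltas` and the sample values are hypothetical. A real monitoring policy on the switch does this work for you and raises alerts.

```python
def counter_deltas(samples):
    """Return (start, end, increase) for each interval where errors grew.

    samples: list of (timestamp, cumulative_count) pairs sorted by time.
    """
    deltas = []
    for (t_prev, c_prev), (t_cur, c_cur) in zip(samples, samples[1:]):
        if c_cur < c_prev:
            # The counter was reset between samples, so everything
            # counted since the reset is treated as the increase.
            increase = c_cur
        else:
            increase = c_cur - c_prev
        if increase > 0:
            deltas.append((t_prev, t_cur, increase))
    return deltas


# Hypothetical samples taken every 15 minutes from a single port.
samples = [
    ("09:00", 100),
    ("09:15", 100),
    ("09:30", 140),
    ("09:45", 140),
]
print(counter_deltas(samples))
```

With only the final total of 140 you know errors occurred at some point since the counters were last cleared. With the samples you can say the 40 new errors arrived between 09:30's sample and the one before it, which is what lets you line them up with a host-side performance problem.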

Tuesday, September 8, 2020

IBM Announces IBM SANnav

IBM announced IBM SANnav today. You can register for a webinar to learn more about SANnav here.

SANnav is a next-generation SAN management application. It was built from the ground up with a simple, browser-based user interface. It can streamline common workflows such as configuration, zoning, deployment, troubleshooting, and reporting. The modernized GUI can improve operational efficiency by enabling enhanced monitoring capabilities, faster troubleshooting, and advanced analytics.

Key features and capabilities include:

  1. Configuration management: You can use policy-based management to apply consistent configurations across the switches in your fabrics.  SANnav also makes zoning devices easier by providing a more intuitive interface than previous management products.  
  2. Dashboards: You can see at-a-glance views and summary health scores for fabrics, switches, hosts, and targets that may be contributing to performance issues within the network. You can instantly navigate to any hot spots for investigation and take corrective action.
  3. Filter management: You can sort through large amounts of data by selecting only attributes of importance. For example, you can search for all 32 Gbps ports that are offline. This filter reduces the displayed content to only the points of interest, allowing faster identification and troubleshooting.
  4. Investigation mode: Provides intuitive views that you can navigate for key details to help you understand complex behaviors. SANnav Management Portal periodically collects metrics and stores them in a historical time-series database for further analysis. In addition, it can collect metrics more frequently (at 10-second intervals) for select ports. This performance data is invaluable when trying to troubleshoot a problem that occurs intermittently or is severe enough to impact production but not severe enough to cause a complete outage.
  5. Reporting: Generates customized reports that provide graphical summaries of performance and health information, including all data captured using IBM b-type Fabric Vision technology. Reports can be configured and scheduled directly from SANnav Management Portal to show only the most relevant data, enabling administrators to more efficiently prioritize their actions and optimize network performance.
  6. Autonomous SAN: This is the feature I am most looking forward to learning more about. As I am in the business of troubleshooting fabrics to find problems, I would like to see how effective this is and how quickly the switches can detect problems and notify administrators. Perhaps some day we'll have switches that can detect problems and automatically route traffic onto faster links (where possible). This would be very similar to a recent drive I took where my phone's GPS program routed me around a major traffic jam. The detour was slower than the main roads would have been with no traffic, but it was many minutes faster than driving through the congestion.
As a reminder, you can register for the free webinar at the link above.  I hope to see you there.