Wednesday, March 6, 2019

Inside SAN Central - Some SAN Switch Data Collection Recommendations

My colleague Sebastian Thaele prepared two excellent writeups on collecting Cisco and Brocade switch logs. His writeup for Cisco logs is here. His writeup for Brocade logs is here. I won't duplicate his work with this blog post. At a recent client conference, I heard a common complaint about the log upload process to IBM and the number of times customers are asked by IBM Support to upload logs for a PMR. This sentiment applies to all of IBM Support, not just SAN Central. However, I decided I'd give everyone some background on SAN Central's problem resolution process.
Why SAN Central Requests The Logs That It Does
SAN Central (and actually Support in general) works on problems that fall into one of three categories:
1.  Provide Root-Cause: 

These are issues where an event happened - such as poor performance or hosts losing paths to disk - but the problem is no longer occurring and the customer wants to know why the event happened. The biggest obstacle for SAN Central when working this kind of problem is that the switch logs are a snapshot of what is occurring at the time the log was taken. So if the problem occurred hours or days ago, it can be like trying to figure out why a traffic jam happened when you are working from a picture taken 3 hours after the congestion cleared. While there is some historical data available in the logs, it is only logged or triggered if certain thresholds on the SAN are reached. It's a bit like a camera that can also record video, but only starts recording when traffic drops below, say, 30 miles per hour. At 35 mph you're still in congestion, but the cameras aren't recording.

2.  Intermittent Problem That Can't Be Recreated Easily:
This is the most difficult type of problem to troubleshoot. It happens at random, and a recurrence can't be predicted. At one time or another everyone has had a problem with their car that occurs at random. You take it to your mechanic, who test drives it, pulls OBD-II data, etc., and then says he can't find the problem. Trust that when this happens with your storage solution, we are as frustrated as you are. Because of the intermittent nature of the problem, it is hard to collect logs close enough to the time of the problem for them to be useful. Also, because the switch logs are a snapshot, things like error counters that haven't been cleared in several days can't always be correlated to the specific time of an event. It is like our traffic camera recording still shots at intervals but not time-stamping them, so we don't know whether a picture of a particular traffic jam is related to your problem or not. While it's not always possible, if we can run a script to detect when the error condition occurs, we can then have a script take steps to ensure the correct data is collected. You can also follow the steps in the Switch Log Maintenance section below. This may give us some clues if we can track changes over time.
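To illustrate what "tracking changes over time" can look like, here is a minimal sketch. It assumes the per-port error counters have already been pulled out of two log collections into plain dictionaries; the port names, counter names, and the `counter_deltas` helper are hypothetical and are not part of any switch vendor's tooling.

```python
def counter_deltas(previous, current):
    """Return the counters that increased between two snapshots.

    Both arguments map a port name to a dict of counter name -> value.
    If a counter is lower than before, it is assumed to have been
    cleared in between, so its current value is used as the increase.
    """
    changes = {}
    for port, counters in current.items():
        old = previous.get(port, {})
        for name, value in counters.items():
            before = old.get(name, 0)
            delta = value - before if value >= before else value
            if delta > 0:
                changes.setdefault(port, {})[name] = delta
    return changes


# Hypothetical counters parsed from two log collections.
earlier = {"port1": {"crc_err": 10, "enc_out": 5}, "port2": {"crc_err": 0}}
later = {"port1": {"crc_err": 42, "enc_out": 5}, "port2": {"crc_err": 0}}

print(counter_deltas(earlier, later))
```

Only the ports and counters that moved are reported, which narrows down where to look. This only tells you that a counter increased somewhere between the two collections, so the closer together the collections are, the more useful the result.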

3.  On-going Problem or One That Can Be Recreated:
This last type of problem includes both on-going issues (you are sitting in a traffic jam right now) and issues that can be predicted or recreated. For example, you know there will most likely be a traffic jam every weeknight at 10:00 PM, even though the low volume of traffic at that hour says there shouldn't be. Because this type of problem is predictable, we have more options when it comes to collecting logs. For an ongoing (current) traffic jam, the traffic reporter overhead in a helicopter can see the cause of the problem, or you could look through traffic cameras until you find the slowdown. For something we know will recur, or something we can force to happen, we can take steps to ensure we collect the correct data. Continuing the traffic cam example, we can clear the buffers on the cameras so that they can record lots of video, and we can make sure they are running starting at 9:55 PM so that we can watch traffic building.

Best-Practice Switch Log Maintenance Recommendation
As a Best-Practice for working around the data collection issues with root-cause and intermittent problems, SAN Central recommends that switch logs be collected at least weekly. Maintaining 3-4 weeks of logs is sufficient. Port statistics on the switches should be cleared -after- the new set of logs is collected and saved, so that each new log collected has the previous week of run time for the port statistics. If a problem does occur, logs should be collected as soon as possible after the problem ends, or, if it is a long-running performance problem, during the problem. For the root-cause and intermittent issues, this at least gives us some history to work from and gives some data that we can use to establish a baseline.
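The retention side of this recommendation can be sketched in a few lines. This is only an illustration under stated assumptions: the file names and the `logs_to_prune` helper are hypothetical, and the 4-week default simply reflects the 3-4 weeks of history suggested above. It decides which saved collections to keep and does not collect logs or clear statistics on any switch.

```python
from datetime import date, timedelta


def logs_to_prune(collections, today, keep_weeks=4):
    """Split saved log collections into (keep, prune) lists by age.

    collections: list of (filename, collection_date) tuples.
    Anything collected more than keep_weeks before today is pruned.
    """
    cutoff = today - timedelta(weeks=keep_weeks)
    keep = [c for c in collections if c[1] >= cutoff]
    prune = [c for c in collections if c[1] < cutoff]
    return keep, prune


# Hypothetical weekly collections for one switch.
saved = [
    ("switch1_2019-01-23.tgz", date(2019, 1, 23)),
    ("switch1_2019-02-06.tgz", date(2019, 2, 6)),
    ("switch1_2019-02-27.tgz", date(2019, 2, 27)),
]

keep, prune = logs_to_prune(saved, today=date(2019, 3, 6))
print("keep:", [name for name, _ in keep])
print("prune:", [name for name, _ in prune])
```

Whatever tooling you use, the ordering matters more than the pruning: collect and save the new logs first, and only then clear the port statistics.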

Help Us Help You

SAN Central's goal is to resolve your issue as quickly and accurately as we can.  Here are some steps you can take to help ensure this happens. 
  1. When SAN Central is engaged on a PMR, be prepared to upload (or preemptively upload before we ask) logs for, at minimum, the switches involved in the issue. For instance, if it's a host performance problem, we need switch logs from the switch where the host is connected, the switch where the storage is connected, and any switches in between.
  2. Your problem may not require logs from all the switches in the fabric. Something in the data may lead us to request data from additional switches, but initially all data that is sent in has to be checked, so any extraneous data increases the time it takes to resolve your problem.
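
If you keep a map of how your switches are cabled together, working out "the host switch, the storage switch, and anything in between" can be sketched as below. This is a rough illustration, not a SAN Central tool: the switch names, the `fabric` map, and both helper functions are hypothetical, and it assumes traffic takes the fewest-hop routes, which may not match the actual routing in your fabric.

```python
from collections import deque


def hop_counts(fabric, start):
    """Breadth-first search: hops from start to every reachable switch."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        switch = queue.popleft()
        for neighbor in fabric.get(switch, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[switch] + 1
                queue.append(neighbor)
    return dist


def switches_to_collect(fabric, host_switch, storage_switch):
    """Return every switch on any fewest-hop path between the two ends."""
    from_host = hop_counts(fabric, host_switch)
    from_storage = hop_counts(fabric, storage_switch)
    if storage_switch not in from_host:
        # No known path in the map: fall back to the two end switches.
        return {host_switch, storage_switch}
    shortest = from_host[storage_switch]
    return {
        s for s in from_host
        if s in from_storage and from_host[s] + from_storage[s] == shortest
    }


# Hypothetical fabric: each switch lists the switches it links to.
fabric = {
    "edge1": ["core1", "core2"],
    "edge2": ["core1", "core2"],
    "edge3": ["core1"],
    "core1": ["edge1", "edge2", "edge3"],
    "core2": ["edge1", "edge2"],
}

print(sorted(switches_to_collect(fabric, "edge1", "edge2")))
```

In this example edge3 is left out because it does not sit between the host and the storage, which is exactly the kind of extraneous data that slows down the initial review.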
