Help IBM Storage Support Help You
A client recently asked me what the most effective thing his company could do was to get me the data that would be most helpful in troubleshooting problems in his solution. This was after we were unable to provide a definitive root cause for a problem that occurred intermittently in his solution. He had a fairly simple fabric that consisted of two 96-port switches, a few IBM Storage Systems and 30 or so hosts. His problem was poor performance on the hosts. At the time, the best I could tell him was that the data indicated a slight correlation between host read activity and the performance problem, but I was not able to confirm anything with certainty.
My answer was simple: configure better event detection and system logging. This is something I teach as a best practice at IBM Technical University. I also suggested that his company install at least the free version of IBM Storage Insights. Without a performance monitoring tool, troubleshooting performance problems is very similar to trying to figure out why a traffic jam is happening using still pictures from traffic cameras. Now imagine trying to root-cause a traffic jam that happened yesterday or last week with pictures taken today, when the only other data you have is statistics such as how many cars the camera has counted since the last time its counters were reset. Solving problems with data like that is what the Support teams at IBM are asked to do, and it is effectively what this customer was asking of us.
That said, here are the recommended actions you can take to ensure the best chance of being able to provide the data that we need to solve your problems:
- Configure callhome on your products. You can search the Knowledge Center for instructions for your specific IBM hosts, storage and switches. Your product can monitor itself and open tickets for hardware issues that you might not necessarily be aware of.
- Configure a syslog server on your products. While this won't directly help provide data, it does preserve events if a host, storage system or switch has a system failure. Without offloading syslog data, critical event data for these kinds of failures is lost. Logs also wrap. Having a syslog server configured prevents losing system events due to logs wrapping. You can search the Knowledge Center for instructions for your specific IBM hosts and storage on how to do this. For SAN Switches refer to the instructions from Cisco and Brocade.
- Configure monitoring and alerting on your SAN Switches. This may require additional licensing, but an effective monitoring policy often gives us critical timestamped data. As an example, a recent case I worked on had several hosts losing paths to storage. Looking at the switch data, the switch ports for these hosts and a few others were seeing CRC errors. You can read more about them and how to troubleshoot them here. These errors are among the easiest to detect, and the resolution is straightforward. Because this customer had implemented a good monitoring policy, I could easily see the timestamps and let the customer know that these errors were ongoing and needed to be resolved.
- Install a performance monitoring tool, at least the free version of IBM Storage Insights. My client did not have Storage Insights set up. If he'd had it set up then most likely we would have been able to use the performance data to confirm the theory. A guided tour of Storage Insights is here. If you have Spectrum Control already, Storage Insights is included for the systems you have licensed in Spectrum Control. You get all the same monitoring and alerting features that are included in Storage Insights Pro. Check out this post to learn how Storage Insights can enhance your IBM Storage Support experience.
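To make the point about logs wrapping concrete, here is a minimal Python sketch. It is only an illustration, not how any IBM, Cisco or Brocade product implements its logs: the buffer size, the event names and the `simulate_logging` helper are all hypothetical. It models the on-device log as a fixed-size buffer and the syslog server as a list that keeps everything it receives.

```python
from collections import deque

# Hypothetical capacity; real devices have much larger, product-specific limits.
LOCAL_LOG_CAPACITY = 5


def simulate_logging(events, capacity=LOCAL_LOG_CAPACITY):
    """Write each event to a wrapping local log and to an offloaded copy.

    Returns (local_log, remote_log) as lists so they can be compared.
    """
    local_log = deque(maxlen=capacity)  # oldest entries are dropped when full
    remote_log = []                     # stands in for a syslog server

    for event in events:
        local_log.append(event)
        remote_log.append(event)

    return list(local_log), remote_log


# Eight illustrative events; the first is the one we later need for root cause.
events = [f"event {n}" for n in range(1, 9)]
local_log, remote_log = simulate_logging(events)

print("Still on the device:", local_log)
print("On the syslog server:", remote_log)
```

With eight events and room for five, the device only holds events 4 through 8, so the event that started the problem is gone by the time data is collected. The offloaded copy still has it.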
For the third recommendation (monitoring and alerting on your SAN switches), Cisco has the port-monitor feature. You can find a complete overview here. I strongly recommend that you disable the slow-drain policy that is active on a newly deployed switch and at least activate the default port-monitor policy. The default policy alerts on many more counters (19) than the slow-drain policy does, and it includes the two counters that the slow-drain policy alerts on. Enabling the default policy can help by providing timestamped data for troubleshooting problems. Brocade has the Monitoring and Alerting Policy Suite (MAPS). MAPS can also provide the timestamped data that is often critical to determining why a problem occurred. You can find the FOS v8.2 MAPS user guide here and you can find a blog post on integrating Brocade Flow Vision rules into MAPS here. Integrating Flow Vision allows you to alert on specific kinds of frames.
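As a rough illustration of why timestamped alerts matter, the Python sketch below lines up switch alerts against the window in which hosts reported a problem. Everything in it is assumed for the example: the alert tuples, the port names, the counter names and the `alerts_near_incident` helper are hypothetical and do not reflect the actual output format of port-monitor or MAPS.

```python
from datetime import datetime, timedelta


def alerts_near_incident(alerts, incident_start, incident_end,
                         margin=timedelta(minutes=5)):
    """Return the alerts whose timestamp falls inside the incident window.

    The window is widened by `margin` on both sides, since a switch alert
    may fire shortly before or after hosts notice the problem.
    Each alert is a (timestamp, port, counter) tuple.
    """
    window_start = incident_start - margin
    window_end = incident_end + margin
    return [a for a in alerts if window_start <= a[0] <= window_end]


# Hypothetical alerts, as they might be parsed from an offloaded syslog.
alerts = [
    (datetime(2024, 3, 4, 2, 15), "port-A", "crc-errors"),
    (datetime(2024, 3, 4, 9, 58), "port-B", "crc-errors"),
    (datetime(2024, 3, 4, 10, 7), "port-B", "link-loss"),
    (datetime(2024, 3, 4, 16, 40), "port-C", "crc-errors"),
]

# Hosts reported lost paths between 10:00 and 10:30.
matches = alerts_near_incident(
    alerts,
    incident_start=datetime(2024, 3, 4, 10, 0),
    incident_end=datetime(2024, 3, 4, 10, 30),
)

for timestamp, port, counter in matches:
    print(timestamp.isoformat(), port, counter)
```

Without timestamps, all four alerts would just be counter totals with no way to tell which ones coincided with the problem. With them, the two `port-B` alerts stand out as the place to start looking.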