Friday, March 8, 2019

Troubleshooting IBM Storage Insights Pro Alerts

IBM Storage Insights Pro recently received several enhancements, including a new Alert Policy feature. You can find out what's new here. The Alert Policy feature lets you group a set of alerts into a policy and apply all of them across multiple storage systems. This keeps alerts consistent, and you do not have to define the same alert on each individual storage system. Once you define the alerts, IBM Storage Support representatives can see the alerts generated on a storage system.


For the IBM FS9100, a number of alerts are already defined. When one of those alerts is triggered, a proactive ticket is opened and the experts at IBM Storage Support investigate the alert, then take whatever action is necessary. In this post we'll look at how the IBM Storage Support team investigates alerts, using an alert for the Port Send Delay I/O Percentage as the example.

You can see in this picture that the storage system had several alerts on the Port Send Delay I/O Percentage. This statistic measures the ratio of delayed send operations to the total number of send operations for the port. The ports listed in the alerts are the 16Gb ports on an IBM Storwize system. This counter indicates conditions similar to those shown by the transmit buffer credit 0 counter for the 8Gb ports. In this case, more than 20% of I/O was delayed for the ports listed in the 'Internal Resource' column over a 20-minute interval.
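As a rough illustration of the ratio described above, the following Python sketch computes a send delay percentage from two counters and lists the ports that exceed the 20% threshold. The function names and the shape of the input data are assumptions made for this example; they are not part of IBM Storage Insights Pro or its API.

```python
def port_send_delay_pct(delayed_sends: int, total_sends: int) -> float:
    """Percentage of send operations that were delayed for one port.

    Hypothetical helper: the counter names are illustrative only.
    """
    if total_sends == 0:
        # No sends in the interval, so nothing could have been delayed.
        return 0.0
    return 100.0 * delayed_sends / total_sends


def ports_over_threshold(samples: dict, threshold_pct: float = 20.0) -> list:
    """Return the sorted names of ports whose delay percentage exceeds the threshold.

    `samples` maps a port name to a (delayed_sends, total_sends) tuple
    collected over one interval (an assumed layout for this sketch).
    """
    return sorted(
        port
        for port, (delayed, total) in samples.items()
        if port_send_delay_pct(delayed, total) > threshold_pct
    )


if __name__ == "__main__":
    interval = {
        "node1_port1": (2500, 10000),  # 25% delayed
        "node1_port2": (1000, 10000),  # 10% delayed
        "node2_port1": (0, 0),         # idle port
    }
    print(ports_over_threshold(interval))
```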



The next counter that Storage Support would check is the duration of each delay. A lot of send operations could be getting delayed for a very short time, in which case applications probably won't notice an impact. However, if a few long delays are triggering the alert, applications will most likely be affected. It's similar to heavy traffic that slows things to 65 mph (about 105 km/h) but keeps moving, versus some condition that slows things much more but is more intermittent. You will notice the impact much more in the second scenario.

Comparing the delay time to the Port Send Delay I/O Percentage, you can see that the delay time is not very high and does not last very long. If the delay time were higher or lasted longer, Storage Support would investigate further. There was also no impact to the customer's applications.
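The reasoning above (a high delay percentage matters only when the delays are also long and sustained) can be sketched in Python. The thresholds for delay time and for the number of consecutive intervals are hypothetical values chosen for this example, not figures used by IBM Storage Support; only the 20% figure comes from the alert discussed in this post.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    delay_pct: float     # Port Send Delay I/O Percentage for the interval
    avg_delay_ms: float  # average delay per delayed send, in milliseconds


def needs_investigation(
    intervals,
    pct_threshold: float = 20.0,      # from the alert definition in this post
    delay_ms_threshold: float = 1.0,  # hypothetical value
    min_consecutive: int = 3,         # hypothetical value
) -> bool:
    """True if delays are both long and sustained across consecutive intervals."""
    run = 0
    for iv in intervals:
        if iv.delay_pct > pct_threshold and iv.avg_delay_ms > delay_ms_threshold:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            # The condition cleared, so the streak starts over.
            run = 0
    return False


if __name__ == "__main__":
    # Many delayed sends, but each delay is short and the spike is brief.
    brief = [Interval(25.0, 0.2), Interval(30.0, 0.3), Interval(5.0, 0.1)]
    # Long delays that persist across several intervals.
    sustained = [Interval(25.0, 4.0), Interval(28.0, 5.0), Interval(31.0, 6.0)]
    print(needs_investigation(brief), needs_investigation(sustained))
```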





If further investigation were needed, the next step would be to determine what kind of traffic the delays were coming from. Storwize and SVC nodes can send data to other nodes in the cluster (inter-node), to other clusters for replication (partner), to hosts, or to back-end storage. This particular storage system is an FS9100 and does not have back-end storage. You need to find out where most of the data is being sent, because that destination is the most likely cause of the alert. The picture shows that the send data rate to the hosts almost exactly matches the data rate for the ports. Since most of the data being sent is going to the hosts, you would start looking at host and volume statistics to narrow it down further. You would be looking for a host or set of hosts that is consuming the majority of the data.
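The comparison of send data rates by destination can be sketched as follows. The destination labels follow the traffic types named above; the function name, the input layout, and the sample rates are assumptions made for this illustration and are not measurements from the system in this post.

```python
def dominant_destination(send_rates: dict):
    """Return (destination, share) for the destination receiving the most data.

    `send_rates` maps a destination type to its send data rate, with all
    rates in the same unit. Returns (None, 0.0) if nothing is being sent.
    """
    total = sum(send_rates.values())
    if total <= 0:
        return None, 0.0
    name = max(send_rates, key=send_rates.get)
    return name, send_rates[name] / total


if __name__ == "__main__":
    # Made-up sample rates in MiB/s for one port.
    rates = {"host": 950.0, "inter-node": 30.0, "partner": 20.0, "back-end": 0.0}
    name, share = dominant_destination(rates)
    print(f"{name}: {share:.0%} of send traffic")
```

If one destination accounts for nearly all of the send traffic, as the hosts do in the example in this post, that is where the closer look at per-host and per-volume statistics would start.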


This performance analysis also helps the IBM SAN Central team if an assist from them is necessary. Since you now know that most of the data is being sent to or from the hosts, and you may have isolated the problem further to a set of hosts, SAN Central can focus on a few specific hosts rather than having to examine an entire SAN.
