Tuesday, March 26, 2019

Cisco Automatic Zoning

Cisco released a feature in NX-OS v8.3.1 called Automatic Zoning.  The feature does exactly what the name suggests:  it automatically configures zoning for the devices on your SAN.  You can see a video on the feature here:



What Is Zoning?

SAN (Storage Area Network) zoning specifies which devices on the SAN can communicate with which other devices.  Devices are added to a zone, and a zone is a group of devices that are allowed to communicate.  Zones are then added to a zoneset, and the zoneset is activated - the active zoneset is the configuration that is in effect.  There can be multiple zonesets, but only one can be active at any given time.  By default, any device not zoned (not a member of a zone) cannot communicate with any other device; a device that is in at least one zone is considered zoned.  Effective zoning prevents unauthorized devices from talking to each other and minimizes disruptions on the SAN if a device misbehaves.
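
To make the zone and zoneset relationship concrete, here is a minimal Cisco MDS sketch entered in configuration mode (the zone, zoneset, and VSAN names and the WWPNs are placeholders, not taken from any real fabric).  The first member is a host (initiator) port and the second is a storage (target) port:

zone name HOST1_TO_STORAGE1 vsan 100
  member pwwn 10:00:00:90:fa:xx:xx:xx
  member pwwn 50:05:07:68:xx:xx:xx:xx
zoneset name FABRIC_A_ZS vsan 100
  member HOST1_TO_STORAGE1
zoneset activate name FABRIC_A_ZS vsan 100

Until the zoneset is activated, the zone exists only in the configured zone database and has no effect on traffic.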

How Cisco Automatic Zoning Works

When a SAN is first configured, adding devices to zones and creating zonesets can be a long process.  Cisco Automatic Zoning configures this for you so that you do not have to configure zoning manually.  It works by examining which devices are logged into the fabric as initiators and which are logged in as targets, then creating zones that pair those initiators and targets.  An 'initiator' is a device such as a host; a 'target' is a device such as storage.  Some systems (such as SVC) log in as both, and other storage systems will log in as both types as well, especially if they have added services such as replication.  A storage system that is reading from or writing to another storage system is an initiator from the perspective of the remote storage system.
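
The initiator or target role comes from what each device registers with the fabric name server when it logs in.  As a purely illustrative example (the FCIDs and WWPNs below are placeholders), the FC4-TYPE:FEATURE column of show fcns database is where this registration is visible, and a port such as an SVC node port can register as both:

switch# show fcns database
VSAN 100:
--------------------------------------------------------------------------
FCID        TYPE  PWWN                     (VENDOR)     FC4-TYPE:FEATURE
--------------------------------------------------------------------------
0x010001    N     10:00:00:90:fa:xx:xx:xx  (Emulex)     scsi-fcp:init
0x010100    N     50:05:07:68:xx:xx:xx:xx  (IBM)        scsi-fcp:both
0x010200    N     20:00:00:20:c2:xx:xx:xx               scsi-fcp:target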

Automatic Zoning is currently implemented so that, when enabled, it only runs on single-switch fabrics.  If it detects any inter-switch links (ISLs), it will not perform automatic zoning.  Future versions of Automatic Zoning will run on multi-switch fabrics.  If you make changes to zoning after Automatic Zoning has run, it will not undo those changes.

Potential Problems and Best Practice Recommendations

You can see how Automatic Zoning might cause problems for systems that log in as both initiator and target: devices would end up zoned to, and communicating with, devices they are not supposed to talk to, which risks disruption on the SAN.  If you have IBM SVC or other Storwize systems on your SAN, do not enable Automatic Zoning.

Remember that automatic zoning will zone all initiators to all targets.  So in the scenario where you have 50 initiators and two targets but you want to split the initiators evenly between the targets, Automatic Zoning would zone all 50 initiators to both targets.  You would have to go in and manually rezone initiators away from each target.

In summary, Automatic Zoning can relieve some of the burdens on SAN Administrators during initial setup, but it should only be used (with great care) on smaller, single-switch environments with a single target (or multiple targets if all initiators will communicate with all targets).   Be wary if your storage has any replication features enabled as this means it will likely log in as both initiator and target.  

Monday, March 11, 2019

Why Increasing Buffer Credits Is Not A Cure For A Slow Drain Device


When I am working on performance problems, a frequent question I get is why increasing the buffer credits for a particular port is not a fix for a slow drain device attached to that port.  In this video, I explain the concept of congestion and illustrate why increasing the number of buffer credits is not a fix unless the underlying cause of the congestion is addressed.  There are some exceptions to this rule.  The most common is long-distance links, but that will be addressed in a future blog post (and perhaps a future video).




As always, you can leave feedback in the comments, or find me on LinkedIn or Twitter.

Friday, March 8, 2019

Troubleshooting IBM Storage Insights Pro Alerts

Recently, enhancements were made to several features of IBM Storage Insights Pro, including a new Alert Policy feature. You can find out more about the new features here.  The Alert Policy feature lets you configure a set of alerts into a policy and apply all of them across multiple storage systems. In this way you can ensure consistency with alerts and not have to define the same alert on each individual storage system. Once you define the alerts, IBM Storage Support representatives can see the generated alerts on a storage system.


For the IBM FS9100, there are a number of alerts that are already defined. When one of those alerts is triggered, a proactive ticket is opened and the experts at IBM Storage Support investigate the alert, then take whatever action is necessary.  In this post we'll take a look at how the IBM Storage Support team investigates alerts.  For this example we are using an alert on the Port Send Delay I/O Percentage.

You can see in this picture that the storage system had several alerts on the Port Send Delay I/O Percentage.  This statistic measures the ratio of send operations that were delayed to the total number of send operations for the port.  The ports listed in the alerts are the 16Gb ports of an IBM Storwize system.  For those ports, this counter indicates conditions similar to those indicated by the transmit buffer credit 0 counter on the 8Gb ports.  In this case, more than 20% of I/O was delayed for the ports listed in the 'Internal Resource' column over a 20-minute interval.



The next counter that Storage Support would check is the duration of each delay.  There could be a lot of send operations getting delayed for a very short time, in which case it is probable that applications won't notice an impact.  However, if a few long delays are triggering the alert, that would most likely impact applications.  It's similar to heavy traffic that slows things to 65 mph (100 km/h) but keeps moving, versus some condition that slows things much more but is more intermittent.  You will notice the impact much more in the second scenario.

Looking at the comparison of the delay time to the I/O Delay percentage, you can see that the delay time is not that high and it does not last for very long.  If the delay time were higher or lasted longer, Storage Support would investigate further.   There was also no impact to the customer's applications.





If further investigation were needed, the next step would be to determine what kind of traffic the delays were coming from.  Storwize and SVC nodes can send data to other nodes in the cluster (inter-node), to other clusters for replication (partner), to hosts, or to back-end storage.  This particular storage system is an FS9100 and does not have back-end storage.  So you need to find out where most of the data is being sent - that is the most likely source of the alert. The picture shows that the send data rate to the hosts almost exactly matches the total data rate for the ports.  Since most of the data being sent is going to the hosts, you would start looking at host and volume statistics to narrow it down further, looking for a host or set of hosts that is consuming the majority of the data.


This performance analysis also helps the IBM SAN Central team if an assist from them is necessary.  Since you now know that most of the data is being sent to or from the hosts, and you have potentially done further isolation to determine a set of hosts, SAN Central can focus on a few specific hosts rather than having to examine an entire SAN.

Wednesday, March 6, 2019

I Will Be At Technical University in Atlanta

IBM Tech U Atlanta 2019


This is just a quick post to say that I will be at IBM Technical University in Atlanta. I will be there from April 29 through May 3.

My sessions for this event are:
  • s106417 The Path of an FC Frame Through a Cisco MDS Director
  • s106420 Proactive Monitoring of a Cisco Fabric
  • s106421 Troubleshooting Cisco SAN Performance Issues - Part 1
  • s106422 Troubleshooting Cisco SAN Performance Issues - Part 2
Find these sessions and many more at Technical University. Click the banner at the top of the page to register or for more information.

As always, if you have any questions, leave them in the comments at the end of this blog or find me on LinkedIn or Twitter.

Why Low I/O Rates Can Result In High Response Times

Why Low I/O Rates Can Result in High Response Times for Reads and Writes

As IBM Storage Insights and Storage Insights Pro become more widely adopted, many companies who weren't doing performance monitoring previously are now able to see the performance of their managed storage systems. With the Alerting features on Storage Insights Pro, companies are much more aware of performance problems within their storage networks. One common question that comes up is why a volume with low I/O rates can have very high response times. Often these high response times are present even with no obvious performance impact at the application layer.
These response time spikes are generally measured in the tens or hundreds of milliseconds, but can be a second or greater. At the same time, the I/O rates are low - perhaps 10 I/Os per second or less. This can occur on either read or write I/Os. As an example, this picture shows a typical pattern of generally low I/O rates with a high response time. The volume in question is used for backups, so it is generally only written to during backups. The blue line is the I/O rate - in this case the write I/O rate, but the same situation can happen with reads from idle volumes. The orange line is the response time. You can see a pattern of generally low I/O rates to the volume, and that the write response time spikes up when the I/O rate goes up. It is easy to see why, if you were using Storage Insights Pro to alert on response time, you might be concerned about a response time greater than 35 ms.

 
This situation happens because of the way storage systems manage internal cache, and it is generally true for all (IBM or non-IBM) storage subsystems and storage virtualization engines (VEs). If a volume has low I/O rates, then the volume is idle or nearly idle for extended periods of time - a minute or more. The storage device or VE will then flush that volume's cache to free up cache space for other volumes which are actively reading or writing. The first I/O that arrives after an idle period for the volume requires re-initialization of the cache. For storage systems and VEs with redundant controllers or nodes, this also requires that the cache is synchronized across the nodes or controllers of the storage subsystem. All of this takes time, the processes are often expensive in performance terms, and the first I/O after an idle period can see significant delays. Additionally, for write I/O, the volume may operate in write-through mode until the cache has been fully synchronized. In write-through mode the data is written to cache and disk at the same time, which can cause further slowdowns because each write is reported as complete only after the update has been written to the back-end disk(s). After the cache is synchronized, each write is reported as complete as soon as the update has been written to cache, which is a much faster process. You can see how, depending on the caching scheme of the storage subsystem, a pattern emerges of idle or almost idle volumes having extremely high response times. The effect is amplified by the low I/O rate itself: if a nearly idle volume does five writes in an interval and the first write takes 150 ms while the other four take 1 ms each, the average response time reported for that interval is over 30 ms even though only a single I/O was actually slow. Unless you are seeing applications be impacted, this is generally not a concern.
Response time spikes can also occur with large transfer sizes. This picture shows response time for the same volume as in the previous picture, except as it relates to transfer size - in this case, the size of the writes. As in the picture above, the orange line is the response time and the blue line is the transfer size. You can see that the transfer size is large - almost 500 KB per I/O. The volume in this performance data is not compressed; if it were compressed, there could be additional delays depending on the compression engine used in the storage. Barry Whyte gives an excellent writeup of Data Reduction Pools here that details how DRP gives better performance than IBM RACE or other compression technologies.


If you have any questions, leave them in the comments at the end of this blog or find me on   LinkedIn or on Twitter.

Working With Groups in IBM Storage Insights Pro


IBM Storage Insights Pro Groups Feature


In addition to the Reporting and Alerting features that are not available in the free version of IBM Storage Insights, the subscription-based offering, IBM Storage Insights Pro, has a very useful feature called Groups. Groups allow you to group or bundle together related storage resources for ease of management. For example, you might group together a set of volumes that are all related to a specific application or server cluster, or group the hosts that make up a cluster. You can even group ports together - you could define a group of ports for an SVC cluster that includes all the ports used for inter-node communication.  Such a group would be very handy for your IBM Storage Support person: it would potentially save support from having to collect a support log and dig through it, and it would certainly make analysis go faster when troubleshooting an issue.
So to get started with Groups, the first thing to do is to see if you have any Groups defined. They are listed, not surprisingly, under Groups -> General Groups; General Groups is the last entry on the Groups menu.

This is the type of group that will be used in this blog post.
If you open the Groups menu, you can see any groups that have been defined.  For this IBM Storage Insights Pro instance, there are currently no groups defined.  


You can add any storage resources you want to a group. The process for creating a group is to add a resource to a group and select the option to create a new group.
Adding A Resource To A Group
To add a resource to a group, in IBM Storage Insights Pro, browse to a resource, right-click the resource then select Add to General Group.



In this example we are adding a volume to a group. This can be done from either the listing of Block Storage Systems where you can add a storage system to a group, or it can be done from any of the internal resources in storage when you are looking at the resources in the Storage Properties view. You can select multiple resources and add them all to a group at one time. So you might list your volumes, filter on a volume name and then select all those that are filtered and add them to a group. To add the hosts from a cluster you might do the same in the host list.  
You can mix resource types in a group.  So for example, you might have all the storage systems, volumes and hosts for an application cluster in a group.  This makes it easier to see the resources associated with that particular application - for both you and IBM Storage Support. This way, when you call IBM for a performance problem on an application, they can just look at the appropriate group to identify all the resources that are of interest for that application.
The next page that appears after you click Add to General Group is the Add to Group page.


You can either create a new group or add the selected resources to existing groups.  It is possible for a resource to be in more than one group.  
Adding Resources To A New Group
In this example, you create a group for storage resources related to the payroll database. You can add a volume to it, but you could (and should) add hosts and other storage systems to the group.  In this way, all the resources used for the payroll application are in a single group and easily identifiable.


Adding A Resource To An Existing Group
     
The following image shows the Group selection, if you opted to add a resource to an existing group. In this example, you only have one group defined, but if you had multiple groups, you could add the selected resources to multiple groups just by selecting all the groups that you want to add the resources to.




Viewing The Resources In A Group
The group listing looks much like the internal resources view of a storage system.


Sub-groups
The last feature for Groups is Sub-groups. Sub-groups are exactly what they sound like - a way to further define relationships between resources in a group.  In the following example, you can see the sub-groups for the Payroll group.



You might have hosts, volumes, or other resources dedicated to different aspects of payroll processing. In this example, there is a sub-group dedicated to Pensions and another dedicated to Taxes. These sub-groups will appear in the listing of groups if you want to assign resources to them.  Like all other resources, sub-groups can belong to multiple groups.


If you have any questions, leave them in the comments at the end of this blog or find me on   LinkedIn or on Twitter.

IBM Storage Insights: Tips and Answers to Questions Frequently Asked

Answers To Some Frequently Asked Questions about IBM Storage Insights



Over the last several months I have seen some common questions that are asked about IBM Storage Insights. I started collecting them and will answer them here.  These questions are all about Storage Insights itself. Questions relating to managing specific types of storage with Storage Insights will be answered in future Blog posts. So, on to the questions.....

Q: Can I install a data collector on the same system as my IBM Spectrum Control server?

A: Yes.  However, you need to pay attention to the memory and CPU usage on that server, since the data collector adds to both.

Data Collector Authentication

If you use username/password authentication, configure a dedicated user ID for the Data Collector on your storage systems; do not use the default or another admin account.  This allows for effective auditing and reduces security risks.
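
As an illustration, on an IBM Spectrum Virtualize (SVC/Storwize) system a dedicated, read-only user for the Data Collector could be created as in this hedged sketch - the user name and password are placeholders, a monitoring role is typically sufficient for data collection, and other storage platforms have their own equivalent user-management commands (check the Storage Insights documentation for the role each storage type actually requires):

mkuser -name si_datacollector -usergrp Monitor -password 'Passw0rdExample'
lsuser si_datacollector

The second command simply confirms the new user and its role before you enter the credentials into the Storage Insights device configuration.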

Q: What Are The Recommended System Specifications for the Data Collector?

A:  The hard drive space requirement has risen from the original 1 GB minimum.  The Data Collector now caches the performance data it collects in case it loses its connection to the cloud; when it regains the connection, it uploads the cached data.  This helps avoid gaps in performance data due to a loss of connection.  The minimum specifications are 4 GB of HDD space and 1 GB of RAM available on the system you install it on.  For a data collector in a virtual machine, add these specifications to whatever the operating system requires.

Q: Does Storage Insights Support Multi-Tenancy?

A: There is currently no support for multi-tenancy.  This means that if you are managing storage from multiple datacenters, everyone with access to your Storage Insights instance will be able to see all storage. A suggestion is to edit the properties of the storage and fill out the location. You can then create a custom dashboard for each location. Setting the location property also helps IBM Storage Support know where storage is located. This assists with troubleshooting.

Q: Does The Data Collector Need To Be Backed Up? What about Redundancy?

A: Install at least two data collectors per instance for redundancy, and install at least two in each location if you are managing storage across multiple data centers. You do not need to back up the data collector. It does not store any collected data locally - all data collected is streamed to the cloud - and the data collector is always available for download if it needs to be re-installed. Downloading it also ensures that you always get the latest version. If you are using a virtual machine, you may want to back up the VM image, but only to make it easier to re-deploy if there is an issue with the VM.



Q: I Installed Multiple Redundant Data Collectors.  Which one will collect data from the Storage?

A: You have two options.  Option 1 is that you can assign data collectors to collect data from specific storage systems.  If you choose this option, only the data collectors assigned to collect data from a given storage system will do so.  Option 2 is to leave the assignment feature turned off.  If this is done, each data collector will test the speed of its connection to the storage systems, and the data collector with the fastest connection speed will win.  If you have a situation where you have multiple data collectors and one of them is located behind an internal firewall to manage storage behind that firewall, then that data collector will always be used to collect data from that storage.

Q: What About Firewalls?

A: You need to open port 443 (the default HTTPS port) on your firewall so that the Data Collector can communicate with the cloud service. This only needs to be open for outbound traffic - IBM will never send anything down to the Data Collector. If there is a firewall between the Data Collector and the storage it is managing, that firewall should be configured to pass SNMP traffic. Lastly, ensure that the data collector is in the VLAN used for SAN switch and storage management, or that VLAN routing is configured to allow the data collector to reach the devices across VLANs.

Q: You Just Said IBM Never Sends Anything To the Data Collector? How Does It Know What To Do?

A: The Data Collector constantly checks a queue on the cloud for jobs to do, such as a support log collection. This ensures that communication is only one-way (the data collector pushes data up to the cloud).

Q: I Have A Proxy Server. How Do I Configure The Data Collector for a Proxy Server?

A: During the installation of the Data Collector, it will ask for your proxy server configuration. The proxy server itself should not need any additional configuration.

Q: Can I Control Whether IBM Storage Support Can Collect Support Logs?

A: Yes. Instructions are here.
Some considerations when setting permissions:
  • If this is turned off, IBM Storage Support will not be able to collect logs as they need them, potentially delaying problem resolution
  • If this is allowed, you are granting IBM Storage Support permission to collect support logs as needed for troubleshooting, without requesting permission each time.
  • This is a simple toggle that can be turned on and off as often as you wish
  • When you are doing maintenance on a storage system it is recommended that you turn this off for the duration of the maintenance

Q: I Want To Configure a Performance Alert. What Are Some Suggested Values for Thresholds?

A: Performance monitoring thresholds are different for every environment. Use historical performance data to guide alerting decisions for response time and other thresholds. For new Storage Insights instances, it is recommended to wait until you have two weeks of performance data before configuring any alerts.

Q: I'm a Partner and my client has given me permission to monitor his Storage Insights free dashboard. Can I get SI Pro capabilities while he stays on the free version?

A: No. You cannot see the Pro capabilities. You see exactly what your customer sees.

If you have any questions, leave them in the comments or find me on LinkedIn or on Twitter.

The Importance of Keeping Your Entire Solution Current

Recently I started working on a new case for a customer.  I'm trying to diagnose repeated error messages logged by an IBM SVC cluster that indicate problems communicating with the back-end storage being virtualized by the SVC.  These messages generally indicate SAN congestion problems.  The customer has Cisco MDS 9513 switches installed.  They're older switches, but not all that uncommon.  What is uncommon is finding the switches at NX-OS version 5.X.X.  I see down-level firmware often, but this one is particularly egregious - this revision is several years out of date. Later versions of code contain numerous bug fixes, both from Cisco and from the associated upstream Linux security updates that get incorporated into NX-OS.  Also, while NX-OS versions don't officially go out of support, any new bugs identified won't be fixed because this version is no longer being actively developed.

This level of firmware merits further investigation.  Looking deeper on the switches I find this partial switch module list:

Mod  Ports  Module-Type                         Model              Status
---  -----  ----------------------------------- ------------------ ----------
6    48     1/2/4 Gbps FC Module                DS-X9148           ok
7    48     1/2/4 Gbps FC Module                DS-X9148           ok
8    48     1/2/4 Gbps FC Module                DS-X9148           ok
9    48     1/2/4 Gbps FC Module                DS-X9148           ok

These modules are older than the firmware on the switches, and support for them ended 3 years ago.  If this customer has problems with them (or the switches they are installed in) and the problem is traced back to the modules, there is not much that IBM Support can do.  If a problem is traced to a bug in the firmware, the customer can't upgrade the firmware to something more current because of these old, unsupported modules still in the switches.  This limits IBM's ability to provide support.  The hardware is no longer supported, and much of the data we can look at in the firmware was not introduced until the next major revision level of NX-OS, v6.2(13).  Later versions also added options to lower thresholds and timeout values, which increases the frequency of some logging for performance issues.

I could see several 2Gb devices attached to these modules, which is probably why they are still installed.  I could also see some of these slow devices zoned to the SVC, which is connected to the SAN at 8Gbps.  This violates the best practice of not zoning devices together when their port speeds differ by more than 2x: a 2Gb device should not be zoned to 8Gb, a 4Gb device should not be zoned to 16Gb, and so on.  The slow device will turn into a slow-drain device sooner rather than later.  I suspect this is the customer's problem, but I can't confirm it because of the lack of data due to the age of the hardware and firmware.

The last reason it is critical to keep your solution updated is that if you go too long between updates, there are often interim upgrades that need to be completed first.  This is common when moving between major revisions of SAN switch firmware.  So if a switch is running firmware that is one or two major revisions behind the current code level, there will be at least one, and possibly more, interim levels required.  This greatly raises the complexity of performing upgrades and also raises the risk, because customers will want to try to condense what is normally at least a two-week process per upgrade version into as little time as possible.


The recommendations I gave this customer:

  1. Move the applications on those slow servers to servers with a 4Gb or (ideally) 8Gb connection to the SAN on the other, newer modules in the switches. This will allow for decommissioning of those modules and a move to a best-practice solution.
  2. Decommission those old modules, and replace them if the port density is needed.  This will allow for firmware upgrades, which are beneficial for all the reasons noted above.
  3. Start planning for a refresh of the switches themselves.  While the switch chassis will be supported for some time yet, they have already been end of life for a few years.

How Storage Insights Can Enhance Your Support Experience


An Introduction

The first week in May IBM announced IBM Storage Insights.    As of 11 June, Storage Insights has these key items:
  • IBM Blue Diamond support
  • Worldwide support for opening tickets
  • Custom dashboards
  • New dashboard table view
  • Clients can now specify whether IBM Storage Support can collect support logs.  This is done on a per-device basis.
You can get a complete list of the new features here:  Storage Insights New Features
There are some other new features such as new capacity views on the Storage Insights Dashboard.    With these new features, especially support for IBM Blue Diamond customers, Storage Insights is an increasingly important and valuable troubleshooting tool.  My team here is seeing more and more customers that are using Storage Insights.   I thought I would discuss the potential benefits of Storage Insights as a troubleshooting tool.  

Some Background

The problems my team fixes can be categorized as either:
  1. Root-cause analysis (RCA) - meaning a problem happened at some point in the past, and the customer wants to know why it happened
  2. Ongoing issue - the problem is happening now (or happens repeatedly, also called intermittent)
The above two types of problems can further be broken down into partially working or completely broken.  Of the two, partially working can be much more difficult to troubleshoot, especially if it's an intermittent issue and not constant.  As an example, some years ago my van had a misfire on one of its cylinders, but we didn't know which one.  Of course, it never occurred when my mechanic was driving it.  It finally took several hours at the dealer, with the car hooked up to a test rig, to record the failure and identify the misfire.  Had the problem been a completely broken spark plug wire instead of a partial one, it would have been much easier to identify.
You can imagine the difficulty of attempting to root-cause a problem that happened hours or days ago on a large and busy SAN if the problem was not severe enough to cause the switches to record any time-stamped errors or other indicators of problems.  As an example, I'm confident you've been in slow-moving traffic where the cause of the problem isn't readily apparent.  The analogy isn't perfect, but suppose the traffic cameras in your city were configured to only start recording when traffic is moving at less than 30 mph for 2 minutes, and/or to generate an alert back to the traffic center.  They do record the number of cars passing by and the number of cars exiting and entering the freeway at each ramp, but they don't timestamp these numbers; they only timestamp the video and/or alerts.  Now further suppose you were stuck in traffic last week that was moving at 32 mph.  Since it didn't meet the threshold, the cameras never recorded anything and no alerts were sent.  You could collect the statistics on the number of cars counted by the cameras, but without anything recorded from last week it would be extremely difficult to provide an explanation as to why traffic was slow, since you can't reconcile the count of cars to any specific point in time.  If traffic had been completely stopped, the cameras would have started recording, and then you'd be able to see the car fire, or accident, or whatever the cause of the problem was last week.  The same limitations exist for ongoing issues: if traffic is moving slowly but not completely stopped, then identifying the cause of the slow traffic can be difficult.

How Storage Insights Can Help Provide Root-Cause

Storage Insights has the potential to provide an explanation for these partially working root-cause investigations by regularly sampling the performance statistics and providing timestamps on this data.  If we had something like Storage Insights regularly sampling statistics from our traffic cameras, we could go back and analyze these for the time period where you were sitting in traffic.  We might find a certain exit ramp from the freeway was congested at the time of the problem.    We could take this information and correlate that with other data to try and determine why the ramp was backed up.  We might find a concert was going on at a venue near the ramp, or some other event that caused an increase in traffic to that ramp.  

How Storage Insights Can Decrease Problem Resolution Time

For a problem that is happening now, Storage Insights can help provide resolution more quickly than would otherwise be possible.  Going back to our traffic example, suppose there is an accident or some other problem on a surface street that an exit ramp connects to.  Traffic will eventually back up onto the exit ramp and then onto the freeway.  Without Storage Insights, you'd have to look at each of your traffic cameras in turn and try to figure out where the congestion starts.  With Storage Insights collecting the statistics, you can filter them to find out which of your exit ramps is the congested one.


Troubleshooting SVC/Storwize NPIV Connectivity

Some Background:

A few years ago IBM introduced the virtual WWPN (NPIV) feature to the Spectrum Virtualize (SVC) and Storwize products.  This feature allows you to zone your hosts to a virtual WWPN (vWWPN) on the SVC/Storwize cluster.  If the cluster node has a problem, or is taken offline for maintenance, the vWWPN can float to the other node in the I/O group.  This provides increased fault tolerance, as the hosts no longer have to do path failover to start I/O on the other node in the I/O group.
All of what I've read so far on this feature is from the perspective of someone who is going to be configuring it.  My perspective is different, as I troubleshoot issues on the SAN connectivity side.  This post covers some of the procedures and data you can use to troubleshoot connectivity to the SVC/Storwize when the NPIV feature is enabled, as well as some best practices to hopefully avoid problems.

If you are unfamiliar with this feature, there is an excellent IBM RedPaper that covers both this feature and the Hot Spare Node feature:

An NPIV Feature Summary

1:  SVC/Storwize has three modes for NPIV - "Enabled", "Transitional" or "Off".   
2:  Enabled means the feature is enabled.  Hosts attempting to log in to the physical WWPN (pWWPN) of the Storwize port will be rejected.  Transitional means it is enabled, but the SVC/Storwize will accept logins to either the vWWPN or the pWWPN. Off means the feature is not enabled.  (A CLI sketch for viewing and changing the mode follows this list.)
3:  Transitional mode is not intended to be enabled permanently.  You would use it while you are in the process of re-zoning hosts to use the vWWPNs instead of the pWWPNs.
4:  For the NPIV failover to work, each of the SVC/Storwize nodes has to have the same ports connected to each fabric.  For example, assuming this connection scheme for an 8-port node with redundant fabrics:
Node Port    Fabric A    Fabric B
    1           x
    2                        x
    3           x
    4                        x
    5           x
    6                        x
    7           x
    8                        x

All the nodes must follow the same connection scheme.   Hot-spare node failover will also fail if the nodes are mis-cabled.     To be clear I am not advocating the above scheme per se, just that all the nodes must match as to which ports are connected to which fabrics.
5:  I was asked at IBM Tech U in Orlando if the SVC/Storwize Port Masking feature is affected by the NPIV feature.  The answer is no.  Any existing port masking configuration is still in effect. 
6: pWWPNs are used for inter-node and inter-cluster (replication) as well as controller/back-end.   vWWPNs are only used for hosts.
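
For reference, the NPIV target port mode is set per I/O group on the cluster.  The sketch below shows how it is typically viewed and changed on recent Spectrum Virtualize code levels, using I/O group 0 as an example - treat the parameter and value names as assumptions and verify them against the command reference for your code level:

lsiogrp 0
chiogrp -fctargetportmode transitional 0
chiogrp -fctargetportmode enabled 0

The detailed lsiogrp output includes the current target port mode, and the two chiogrp commands move the I/O group to Transitional and then to Enabled once host re-zoning to the vWWPNs is complete.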

A Suggestion:  A recommendation I heard at IBM Technical University in May is, if you are using FC aliases in your zoning, to add the vWWPN to the existing alias for each SVC/Storwize cluster port so that you don't have to rezone each individual host.  While that is an easy way to transition, it creates a potential problem.  After you move the cluster from Transitional to Enabled, the cluster starts rejecting the fabric logins (FLOGIs) to the pWWPNs.  At best, all this does is fill up a log with rejected logins, at which point you call for support because you notice a host logging FLOGI rejects.  At worst, it causes actual problems when the adapter, and possibly the multipath driver, attempt to deal with the FLOGI rejects.  Prior to moving the NPIV mode to Enabled, you need to remove the pWWPN from the FC alias, but you must first ensure you are not using the same aliases for zoning your back-end storage. If you are, and you remove the pWWPN from the alias, you will lose controller connectivity.  If you are using a Storwize product with internal storage and no controllers, then this will not be an issue and the pWWPN can be removed from the alias.  If you do have back-end storage and are currently using the same aliases for both host and controller zoning, it might be easier to establish new aliases for the pWWPNs and rezone the controllers to them, or just rezone the controllers to the pWWPNs directly, before modifying the existing aliases to use the vWWPNs.  There will be fewer zones to modify for the controllers than for the hosts.
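
As an illustration of the alias approach, here is a minimal Cisco MDS sketch (the alias name, VSAN, and WWPNs are placeholders; Brocade aliases follow the same idea).  While the cluster is in Transitional mode, the host-facing vWWPN (the second member below) is added alongside the existing pWWPN (the first member):

fcalias name SVC_IOGRP0_N1P1 vsan 100
  member pwwn 50:05:07:68:0b:21:xx:xx
  member pwwn 50:05:07:68:0b:25:xx:xx

Later, after any controllers have been rezoned away from the alias and before the NPIV mode is changed to Enabled, the pWWPN is removed so that hosts stop logging in to it:

fcalias name SVC_IOGRP0_N1P1 vsan 100
  no member pwwn 50:05:07:68:0b:21:xx:xx

Remember that alias changes only take effect in the active zoning when the zoneset is re-activated (or the session is committed, if enhanced zoning is in use).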

Troubleshooting Connectivity to the SVC/Storwize Cluster

One of the most common problems that I see with connectivity issues, or with path failover not working as it should, is incorrect zoning.  To that end, you first need to verify the vWWPNs that you should be using.  The easiest way is to run lstargetportfc on the cluster CLI to get a listing of these vWWPNs; lsportfc will list the pWWPNs.  This command output is included by default in the svcout file starting at version 8.1.1; for versions prior to that, it has to be collected separately.  Once you have that list, you can use the Fibre Channel Connectivity listing in the SVC/Storwize GUI and the filtering capabilities there to filter on the vWWPNs and/or the pWWPNs to determine if you have any hosts connected to the pWWPNs.  You can also capture the output of lsfabric -delim , and import that CSV into Excel or similar to get better sorting and filtering than the system GUI offers.  If a host is missing, or is connected to the pWWPNs, you will need to check and verify the zoning.  This is also a good time to verify that controllers are connecting to the pWWPNs, and, if you are using a hot-spare node, that the controllers are zoned to the ports on the hot-spare.  I had a case recently where, while it wasn't the reason the customer opened the ticket, I noticed they had not zoned one of their controllers to the hot-spare node.  In the event of a node failure, the failover would not have worked as expected.
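
For illustration, lstargetportfc output looks roughly like the trimmed sketch below - the WWPNs are placeholders and the exact columns vary by code level.  The rows with host_io_permitted and virtualized set to yes are the vWWPNs that hosts should be zoned to; the rows showing no are the physical ports used for inter-node, replication, and back-end controller traffic:

IBM_2145:cluster:superuser>lstargetportfc
id  WWPN              port_id  owning_node_id  host_io_permitted  virtualized
1   500507680140XXXX  1        1               no                 no
2   500507680142XXXX  1        1               yes                yes
3   500507680130XXXX  2        1               no                 no
4   500507680132XXXX  2        1               yes                yes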




An Easy Way To Turn Your Flash Storage Supercar Into a Yugo

Introduction

This past summer I was brought into a SAN performance problem for a customer.  When I was initially engaged, it was a host performance problem.  A day or two later, the customer had an outage on a Spectrum Scale cluster.  That outage was root-caused to a misconfiguration on the Spectrum Scale cluster, where it did not re-drive some I/O commands that timed out.  The next logical question was why the I/O timed out.  Both the impacted hosts and the Spectrum Scale cluster used an SVC cluster for storage.  I already suspected the problem was due to an extremely flawed SAN design; more specifically, the customer had deviated from best-practice connectivity and zoning for their SVC cluster and controllers.  A 'Controller' in Storwize/SVC-speak is any storage enclosure - Flash, DS8000, another Storwize product such as a V7000, or perhaps non-IBM branded storage.  In this case, the customer had three controllers.  Two were IBM Flash arrays; for the purposes of this blog post we will focus on those and on how the customer's SAN design negatively impacted their IBM Flash systems.

Best-Practice SVC/Storwize SAN Connectivity and Zoning 

The figure below depicts best-practice port connectivity and zoning for the SVC and controllers in a dual-core fabric design.  (This assumes you have two redundant fabrics, each of which is configured like the figure below.)  As we can see, ideally our SVC cluster and controllers are connected to both of our core switches.  A single-core design obviously does not have the potential for this mis-configuration, since all SVC and controller ports on a given fabric are connected to the same physical switch.  In a mesh design, we would want to use the same basic principle of connecting SVC and controller ports to the same physical switch(es).  Zoning must be configured such that the SVC ports on each switch are zoned only to the controller ports attached to the same switch.  The goal is to avoid unnecessary traffic flowing across the ISL between the switches.  In the example below, we have two zones: Zone 1 includes the SVC and controller ports attached to the left-most switch, and Zone 2 includes the SVC and controller ports attached to the right-most switch.
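
To make the two-zone layout concrete, a simplified sketch in Cisco MDS syntax is below (the switches in this particular case were Brocade, but the grouping principle is identical there; the zone names, VSAN, and WWPNs are placeholders, and a real configuration would include all of the SVC and controller ports on each core switch).  In each zone, the first member represents an SVC port and the second a controller port, both attached to the same core switch:

zone name SVC_TO_FLASH_CORE1 vsan 100
  member pwwn 50:05:07:68:0c:11:xx:xx
  member pwwn 20:00:00:20:c2:aa:xx:xx
zone name SVC_TO_FLASH_CORE2 vsan 100
  member pwwn 50:05:07:68:0c:12:xx:xx
  member pwwn 20:00:00:20:c2:ab:xx:xx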

Customer Deviations from Best-Practice on a Dual-Core Fabric

The next figure shows the design the customer had.  The switches in question are Brocade-branded, but the design would be flawed regardless of the switch vendor.  The problem should be obvious: with the design below, all traffic moving from the SVC to the back-end controllers has to cross the ISL - in this case a 32 Gbps trunk.  The switch data showed multiple ports in the trunk were congested: there were transmit discards and timeouts on frames moving in both directions, and both switches were logging bottleneck messages on the ports in the trunk.  The SVC was logging repeated instances of command timeouts and errors indicating it was having problems talking to ports on the controllers.  Lastly, the SVC was showing elevated response times to the Flash storage.  All of this was due to the congested ISL.  With this design, the client was not getting the ROI or the response times it should have been getting from the Flash storage.  Of course, all of the error correction and recovery caused an increased load on the fabric and re-transmission of frames, which made an already untenable situation worse.  The immediate fix to provide some relief was to double the bandwidth of the ISL on both fabrics.  The long-term fix was to re-connect ports and zone appropriately to get to best practice.

Customer Host Connectivity and a Visual of the Effect on the Fabric

The last figure shows the customer's host connectivity and the effect of this flawed design on the fabric.  We can see from the figure that the client had both the underperforming hosts and the GPFS/Spectrum Scale cluster connected to DCX 2, where the controllers were connected.  With this design, data must traverse the ISL 4 times.  Traffic on the ISLs could be immediately reduced by half by moving half of the SVC ports to DCX 2 and half of the controller ports to DCX 1, and then zoning to best practice as in the first figure in this blog post.  In addition to the unnecessary traffic on the congested ISL, redundancy is reduced, since this design is vulnerable to a failure of either DCX 1 or DCX 2.  While the client did have a redundant fabric, a failure of either of those switches means a total loss of connectivity from the SVC to the controllers on one of the fabrics.  That is significant.  ISL traffic could be further reduced (and reliability increased at the host level) by moving half of the GPFS cluster (and other critical host ports) to DCX 1 and zoning appropriately.  In this way, the only traffic crossing the ISLs would be from hosts or other devices that don't have enough ports to be connected to both cores, plus whatever traffic is necessary to maintain the fabric.  Both the SVC-to-controller and the host-to-SVC traffic would then be much less vulnerable to any delays on the ISLs or congestion in either fabric.


Inside SAN Central - Some SAN Switch Data Collection Recommendations

My colleague Sebastian Thaele prepared two excellent writeups on collecting Cisco and Brocade switch logs.  His writeup for Cisco logs is here; his writeup for Brocade logs is here.  I won't duplicate his work with this blog post.  At a recent client conference, I heard a common complaint about the log upload process to IBM and the number of times that customers are asked by IBM Support to upload logs for a PMR.  This sentiment covers all of IBM Support, not just SAN Central.  So I decided I'd give everyone some background on SAN Central's problem resolution process.
Why SAN Central Requests The Logs That It Does
SAN Central (and actually Support in general) works problems that are classified in 1 of 3 ways:
1.  Provide Root-Cause: 

These are issues where an event happened - such as poor performance or hosts losing paths to disk - but the problem is no longer occurring and the customer wants to know why the event happened.  The biggest obstacle for SAN Central when working this kind of problem is that the switch logs are a snapshot of what is occurring at the time the log was taken.  So if the problem occurred hours or days ago, it can be like trying to figure out why a traffic jam happened while working from a picture taken 3 hours after the congestion is gone.  While there is some historical data available in the logs, it is only logged or triggered if certain thresholds on the SAN are reached.  It's a bit like the camera that snapped the picture also being able to record video, but the video only starts recording when traffic gets below, say, 30 miles per hour.  You'll still be in traffic congestion at 35 mph, but the cameras aren't recording.







2.  Intermittent Problem That Can't Be Recreated Easily:
This is the most difficult type of problem to troubleshoot.  It happens at random, and a re-occurrence of the problem can't be predicted.  At one time or another, everyone has had a problem with their car that occurs at random.  You take it to your mechanic, who test drives it, pulls OBD-II data, etc., and then says he can't find the problem.  Trust that when this happens with your storage solution, we are as frustrated as you are.  Because of the intermittent nature of the problem, there is an issue with collecting logs close enough to the time of the problem for them to be useful.  Also, because the switch logs are a snapshot, things like error counters that haven't been cleared in several days can't always be correlated to the specific time of an event.  It is like our traffic camera recording still shots at intervals but not time-stamping them, so we don't know whether a picture of a particular traffic jam is related to your problem or not.  While it's not always possible, if we can run a script to detect when the error condition occurs, we can then have that script take steps to ensure the correct data is collected.  You can also follow the steps in the section below on switch log maintenance; this may give us some clues if we can track changes over time.


3.  On-going Problem or One That Can Be Recreated:
This last type of problem includes both ongoing issues (you are sitting in a traffic jam right now) and issues that can be predicted or recreated - for example, you know there will most likely be a traffic jam every weeknight at 10:00 PM, even though there should not be based on the low volume of traffic.  Because this type of problem is predictable, we have more options when it comes to collecting logs.  For an ongoing (current) traffic jam, the traffic reporter overhead in a helicopter can see the cause of the problem, or you could look through traffic cameras until you find the slowdown.  For something we know will re-occur, or something we can force to happen, we can take steps to ensure we collect the correct data.  Continuing the traffic-cam example, we can clear the buffers on the cameras so that they can record lots of video, and we can make sure they are running starting at 9:55 PM so that we can watch the traffic building.





Best-Practice Switch Log Maintenance Recommendation
As a best practice for working around the data collection issues for root-cause and intermittent problems, SAN Central recommends that switch logs be collected at least weekly.  Maintaining 3-4 weeks of logs is sufficient.  Port statistics on the switches should be cleared -after- the new set of logs is collected and saved, so that each new log collected has the previous week of run time for the port statistics.  If a problem does occur, logs should be collected as soon as possible after the problem ends - or, if it is a long-running performance problem, during the problem.  For the root-cause and intermittent issues, this at least gives us some history to work from and some data that we can use to establish a baseline.
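
As a rough illustration of that weekly rotation on a Cisco MDS switch (the file name and copy destination are placeholders, and the writeups linked at the top of this post cover the full collection procedure for both vendors - Brocade has equivalents such as supportsave and portstatsclear):

show tech-support details > bootflash:techsupport-week.txt
copy bootflash:techsupport-week.txt scp://user@logserver/san-logs/
clear counters interface all

The counters are cleared only after the collection has been saved off-switch, so each weekly file contains roughly one week of accumulated port statistics.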

Help Us Help You


SAN Central's goal is to resolve your issue as quickly and accurately as we can.  Here are some steps you can take to help ensure this happens. 
  1. When SAN Central is engaged on a PMR, be prepared to upload (or preemptively upload before we ask) logs for, at minimum, the switches involved in the issue.  For instance, if it's a host performance problem, we need switch logs from the switch where the host is connected, the switch where the storage is connected, and any switches that might be in between.
  2. Your problem may not require logs from all the switches you have in the fabric.  There may be something in the data we see that leads to a request for data from additional switches, but initially all data that is sent in has to be checked, and any potentially extraneous data increases the time it takes to resolve your problem.

Blatant Thievery

Note:  This post was migrated from my old blog site after that hosting site decided to sunset personal blogs.

Is it still plagiarism if you are crediting the  original author?  Anyway,  this should have been my first post, or at least included in my first post. I stole  the idea from  Sebastian Thaele.

Yay!  Yet another storage-related blog!  As if the world needed another one.   The difference with this one  is it is written by me, which by itself should be enough to make everyone want to read it.   Seriously, while I think I'm pretty good at what I do, I'm not that hubristic.   I know that I don't have all the answers.  Nobody does.   That being said, here's why I created this blog.  

First, I wanted to give a bit of an insider's view into IBM Storage Support.   This previous post of mine is an example of that.

Second, I work as a worldwide Product Field Engineer for IBM SAN Central.  My team provides Level 2/Level 3 (depending on who's asking) support on problems related to storage networking.  If the other IBM product teams can't solve a problem, or suspect it may be the SAN, the case is escalated to my team.  Many other professions exchange knowledge with peers from other companies, but this rarely happens for members of support organizations. By necessity there is a lot of knowledge sharing among the members of my team and across the support teams inside IBM, but it's too often limited to just IBM.

Third, most of the blogs I read focus on marketing or high-level overviews of a new product or feature; there isn't much technical content, and what technical content there is does not have a support perspective.  For instance, I've stumbled across more than one blog post regarding IBM announcing Spectrum Virtualize code v8.1.  Among other things, that code supports a new feature for hot-spare SVC nodes.  That's a great feature, but there are some requirements to implement a hot-spare node, along with some best-practice recommendations for implementing it.  I will have a future post detailing both the requirements and how to best implement the hot-spare node.  The best way to fix a problem is to prevent it in the first place, and I'm hoping posts like that one will help prevent future problems occurring for IBM's customers.



A New Blog and Musings on IBM Technical University in New Orleans

NOTE:  This post was migrated from my old blog site after that hosting site decided to sunset personal blogs.


Welcome To My Blog!
Thanks for stopping by, and I hope you will follow my blog.  I will use this space to give you updates on IBM Storage and Storage Networking products, as well as tips and tricks you can use for troubleshooting and verifying best-practice configurations on your storage network and storage networking products.  
For a bit of background, I work with the IBM  SAN Central support team in IBM Storage Support.  We work on most storage networks, but specialize in fibre-channel and FCoE storage networking.  My team provides root-cause analysis for events that have happened on the SAN as well as isolating issues that are ongoing.  We work with all of the IBM product support teams - if a product touches a storage network, we have most likely worked with it.   We are also the source at IBM for the XGig fibre-channel trace tool, in the event that a tool is needed to troubleshoot a problem.

Some Future Posts
Check back on this space over the next week or two.  Planned topics include the quickest way to turn your racecar Flash storage into a Yugo with bad SAN design, some best practices for Storwize/SVC Port Masking and how you can use the device connectivity listing for troubleshooting, and lastly some best practices for collecting Cisco MDS switch logs.  That last post will include what SAN Central will ask for, so that you can have the required logs uploaded before a problem gets sent to us.  A good writeup on collecting Brocade logs is here at Sebastian Thaele's blog.
IBM Technical University in New Orleans






Last week I presented at Technical University (Tech U)  in New Orleans.  The Tech U events are intended to replace the technical sessions that were previously available at IBM Edge.  Attendance was about 800 IBM clients.  My sessions were on the path an FC Frame takes through a Cisco MDS Director, and the aforementioned Storwize Port Masking Best Practice.  I also created a poster for the poster sessions on Storwize Fabric Connectivity Best Practice.  The poster is here, but as mentioned above I'll have a blog post with more detail soon. 
 Some thoughts from New Orleans:
  • It's worth attending Glenn Anderson's public speaking sessions, if you are at a conference where he is leading them. If nothing else, you'll be entertained.  
  • I think NVMe is the future, although the earliest we will see it in enterprise is 2018.  I'll have more thoughts on this in a future post. In the meantime,  This IBM RedPaper is worth reading. 
  • Spectrum Virtualize v8.1 has support for hot-spare cluster nodes.  You can read more about it here at @TonyPearson's blog.  This will be another future blog post with some technical requirements and  best-practice recommendations​ for connectivity and zoning of the spare node.
  • Since it was announced today, I can mention Spectrum Control Foundation.  The announcement is here.