Wednesday, September 25, 2019

Fabric Zoning for the IBM Spectrum Virtualize and FlashSystem NPIV Feature

Zoning Basics

Before I talk about zoning best practices, I should explain the two types of zoning and how they work: WWPN zoning and switch-port zoning.

World-wide Port Name (WWPN) Zoning

WWPN zoning is also called "soft" zoning and is based on the WWPN assigned to a specific port on a fibre-channel adapter.  The WWPN serves a similar function to a MAC address on an ethernet adapter.  WWPN-based zoning uses the WWPNs of devices logged into the fabric to determine which devices can connect to which other devices.  Most fabrics are zoned using WWPN zoning.  It is more flexible than switch-port zoning - a device can be plugged in anywhere on the SAN (with some caveats beyond the scope of this blog post) and it can still connect to the other devices it is zoned to.  It also has one distinct advantage over switch-port zoning: zoning can always be specified down to the level of a single WWPN.

Switch-Port Zoning

Switch-port zoning is also called "hard" zoning.  Zones are defined in terms of switch ports rather than WWPNs.  When switch-port zoning is used, all WWPNs logged into the fabric through switch ports in the same zone are allowed to communicate.

Problems with Switch Port Zoning and the Spectrum Virtualize/FlashSystem NPIV Feature

If you are unfamiliar with the NPIV feature in Spectrum Virtualize or the  FlashSystems storage products you can read about it here.

Switch-port zoning works without issue until you have multiple devices logging into the same switch port, such as an NPV device (VMware, VIO, access gateway, etc.), or you enable the NPIV feature on Spectrum Virtualize.  To illustrate, consider this simple diagram:


In the above diagram, we have an NPV device on the left with 3 virtual WWPNs (vWWPNs) logged into the fabric on switch Port 1.  The NPV device could be a hypervisor, or it could be an access gateway.  We have an SVC node with NPIV enabled on the right.  This blog post uses SVC as an example, but it applies to all of the Storwize and FlashSystem products that support NPIV.  The SVC node port has a physical WWPN and a vWWPN logged into the fabric on switch Port 2.  If we were to use switch-port zoning, the solid line in the next diagram is the zone as created by the SAN administrator.  The dashed line is the effective zone.  All the devices in the effective zone can communicate with each other.  For our example, the host with vWWPN2 is not defined as a host on the SVC cluster and has no vdisks assigned to it.


This zoning can work, but it has the potential to cause problems.  The first problem is that vWWPN2 should not be zoned to the SVC, yet there is no way to prevent it from connecting to both the SVC physical and virtual WWPNs, since all host WWPNs logged into the fabric on Port 1 can communicate with all node WWPNs logged into the fabric on Port 2.

Another problem is that as long as the I/O group the SVC node is in is configured for transitional mode, the host WWPNs can connect to both the physical and virtual WWPNs on the node, which increases the path count of devices attached to the cluster.

For most of the Storwize and FlashSystem products, the maximum is 512 connections per port per node.  This includes nodes, controllers, and hosts, but you should check the configuration limits documentation for your specific product.  Using switch-port zoning, you are at increased risk of hitting the maximum number of fibre-channel connections on that port.  If you hit the limit, the cluster will not allow new host, controller, or cluster node connections on that port.  This connection count problem is made worse by vWWPN2 connecting to the SVC node port even though it is not defined as a host.  You can see how having a large number of vWWPNs logged in to switch Port 1 causes the connection count to the SVC node port to rise rapidly.  Running out of available connections to the cluster is the most common problem we see with switch-port zoning and NPIV.

The third potential problem is that you have far less control over balancing fibre-channel connections across node ports.  In the above diagram, switch Port 1 has 3 NPV host logins, so there are 6 logins to the SVC node port.  Another switch port might have a higher number.  I have seen as many as 48 devices connected via NPV to a single switch port.  In that situation, I would have 96 connections to each cluster node port with switch-port zoning and NPIV in transitional mode on the cluster.  That is nearly a fifth of the available connections to each node port that switch port is zoned to.  With switch-port zoning I do not have the ability to balance the host WWPNs across the cluster ports.
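To make the connection-count arithmetic concrete, here is a rough back-of-the-envelope sketch in Python.  The numbers come from the examples above, and the 512-login figure is the per-port limit cited earlier; treat this as an illustration and check the configuration limits documentation for your product and code level.

# Rough login-count estimate for one cluster node port zoned to one switch port
# with switch-port zoning. Illustrative only; verify the real limit for your product.

def logins_per_node_port(npv_logins_on_switch_port: int, transitional: bool) -> int:
    """Each host WWPN logs in to the node vWWPN; in transitional mode it can
    also log in to the physical WWPN, doubling the count."""
    return npv_logins_on_switch_port * (2 if transitional else 1)

LIMIT = 512  # per-port login limit cited above

for hosts in (3, 48):
    logins = logins_per_node_port(hosts, transitional=True)
    print(f"{hosts} NPV logins on the switch port -> {logins} logins on the node port "
          f"({logins / LIMIT:.0%} of the {LIMIT}-login limit)")

Running this prints 6 logins (about 1% of the limit) for the 3-login example and 96 logins (about 19%) for the 48-login example, which is why a handful of densely populated NPV switch ports can exhaust a node port surprisingly quickly.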

Lastly, vWWPN1, vWWPN2, and vWWPN3 can all communicate with each other.  This violates the general best-practice recommendation of isolating hosts to their own zones where possible.  You could implement smart zoning, but that complicates your environment and is an additional thing to administer and troubleshoot.

WWPN-Based Zoning

Taking our first diagram and implementing WWPN-based zoning gives us the diagram below for both the implemented and the effective zoning.



You can see that vWWPN1 and vWWPN3 are each zoned separately to only the vWWPN on the SVC node.  They are not zoned to the physical WWPN, and vWWPN2 is not zoned at all.  This has a few advantages over switch-port zoning.  First, it reduces our connection count to the node port by four: we would have 6 connections with switch-port zoning (two for each host vWWPN), while with WWPN zoning we have two host vWWPNs connecting only to the node vWWPN.  Second, you can balance the host connections across the SVC node ports.  Third, if the I/O group is in transitional mode, zoning the hosts exclusively to the node vWWPN minimizes disruption when the NPIV mode is changed to enabled.

In Summary

While it is supported, it is recommended that you do not use the NPIV feature with switch-port zoning.  If you choose to do so, be aware of the potential problems that can impact the health of your SAN and potentially prevent new hosts from connecting to your Spectrum Virtualize, SVC, or FlashSystem cluster.




Monday, June 10, 2019

New Advisor Features in IBM Storage Insights

IBM Storage Insights was updated recently.  Two new dashboards were added: the Advisor Dashboard and the Notifications Dashboard.  IBM Storage Support can see the events in these dashboards, which in some cases allows Support to identify problems or other issues that need to be addressed more quickly.

Like all tables in Storage Insights, both the Advisor and Notifications dashboards can be filtered and sorted.  You can also export to PDF, CSV, or HTML so that you can do additional sorting and filtering, or share items from each of the tables.


The Advisor Dashboard

The Advisor Dashboard is found under the Insights menu.  This dashboard lists recommendations for changes that you can make to your managed IBM Storage systems, including configuration changes, firmware upgrades, and other changes to enhance their stability and performance.

Below you see a capture of an Advisor Dashboard and some examples of the items that are listed in the table.



Double-clicking on the value in the "Event" or "Recommendation" columns for a row in the table will open a more detailed explanation of the recommendation for that event.  The detailed view also includes a link to get more information on why the recommendation is being made.   You can acknowledge an item from the details view, or you can right-click the row in the table and click "Acknowledge".   


The Notifications Dashboard

The Notifications dashboard is found under the "Dashboards" menu.  It includes the events listed in the Advisor panel as well as other events such as when storage systems call home.  As an example, if call home events result in a ticket being opened, the ticket number is listed in the Notifications dashboard.    

As with the Advisor Dashboard, you can sort the Notifications Dashboard or export the table to CSV or HTML. 


Working With Filters

Both panels include filters that act as toggles: you can click them to control whether events of that type are displayed.  For example, you can hide informational events, or you can choose to hide acknowledged events.  If you choose to include acknowledged events, it is recommended that you add the "Acknowledged" column to the table so that you can sort or filter on acknowledgement status.  By default, acknowledged events are hidden.  If you acknowledge an event and later wish to clear that, toggle acknowledged events on, locate the event, and then click the Unacknowledge button in the event details.



Monday, May 6, 2019

Using IBM Storage Insights Pro and Alert Policies To Monitor Host Path Count

I was at a recent TechU event and had a discussion with a customer about using IBM Storage Insights to monitor host path count.  More specifically, the customer had a recent outage on a few hosts after doing maintenance work on some storage, and the affected hosts were not connected on all the expected paths.  There are a few options for using the storage data collections to review host connections.

For instance, you could write a script that compares the SVC/Storwize host WWPN definitions to the connected devices to see if any WWPNs are not connected to the SVC (a rough sketch of such a script appears at the end of this post).  However, it is much more straightforward to configure an alert in IBM Storage Insights Pro.  Alert Policies were previously covered in these two videos:





You can create an Alert Policy for an agentless host and then configure an alert to notify you when the path count changes.  You need to create separate policies for each operating system, but after you create the first policy and define the alert, you can copy that policy and modify it for each operating system.  In the following screenshot, I am creating a policy for Windows servers and adding all of my Windows servers to it.





In the next screenshot, I am adding an alert to the new policy. The basics of adding alerts are covered in the videos above.  You can see that I have selected a resource type of "Disks" and then "General".  Under General, I check the box next to "Paths" and then, in the Condition drop-down, I select Changes, because I am interested in path count changes.  Alternatively, you could alert when the path count falls below an expected value, but different servers may have different expected path counts.



Next, I configure the email address to send the alerts to.  This may be a group email address for your team, or perhaps one of your host administrators.  I also set the notification frequency to once for every violation, although you could set it to notify you only every so many minutes, hours, or days.  I have also set the alert level to Warning rather than Critical, but your requirements may be different.


After you set those options, save the new alert.  Alerts will now be generated when the path count changes for the hosts included in your new Alert Policy.  You can remove hosts from the policy if you do not want alerts generated for them.  Perhaps you only want to monitor your more critical hosts. 
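If you do prefer the scripted approach mentioned at the start of this post, a minimal sketch might look like the following.  It assumes you have already exported the host WWPN definitions and the fabric login list (for example, from lshost and from comma-delimited lsfabric output) to CSV files; the file names and column headings here are placeholders, so adjust them to match what your code level actually produces.

# Minimal sketch: report host WWPNs that are defined on the cluster but not
# currently logged in. File names and column names below are assumptions.
import csv

def load_column(path: str, column: str) -> set:
    with open(path, newline="") as f:
        return {row[column].upper() for row in csv.DictReader(f) if row.get(column)}

defined_wwpns = load_column("host_definitions.csv", "WWPN")     # from the host definitions
logged_in_wwpns = load_column("lsfabric.csv", "remote_wwpn")    # from the fabric login list

for wwpn in sorted(defined_wwpns - logged_in_wwpns):
    print(f"Host WWPN {wwpn} is defined but not logged in to the cluster")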

Friday, April 26, 2019

Troubleshooting CRC Errors On Fibre-channel Fabrics




There is no "Easy Button" for troubleshooting CRC errors. It is an iterative process. You make a change, you monitor your fabric, and if necessary you make more changes until the issues are resolved. I frequently have customers who want it to be a one step process. It can be, but usually takes multiple steps. Having said that, before we can fix them, we need to know what CRC errors are and why they occur.

What Are CRC Errors? When Do they Occur?

The simple answer is that CRC errors are damaged frames. The more complicated answer is that before a fibre-channel frame is sent, some math is done over its contents and the answer is added to the end of the frame. When the receiver gets the frame, the receiver repeats the math. If the receiver gets a different answer than what is recorded in the frame, then the frame was changed in flight. This is a CRC error. The only time these happen is if the physical plant - cabling, SFPs, patch panels - is somehow defective. It is much less common, but still possible, to have a bad component in the switch; troubleshooting that will be a separate blog post, someday. What the receiver does with the damaged frame depends on whether it is a switch or an end device, and if it is a switch, what brand of switch.
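To make the "math" a little more concrete, here is a small illustration in Python using the built-in CRC-32 routine. Fibre channel also uses a 32-bit CRC, but treat this strictly as an analogy for the check itself, not as the actual frame format.

# Illustration of the CRC check described above (an analogy, not the FC frame layout).
import zlib

payload = b"data headed for the storage port"
crc_from_sender = zlib.crc32(payload)        # the math done before the frame is sent

damaged = bytearray(payload)
damaged[5] ^= 0x01                           # a marginal link flips one bit in flight

crc_at_receiver = zlib.crc32(bytes(damaged)) # the receiver repeats the math
if crc_at_receiver != crc_from_sender:
    print("CRC mismatch: the frame was changed in flight (a CRC error)")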

Why Does Fixing These Matter?

At best, the effect of faulty links is a few dropped frames. Left unchecked, the problem will get worse and eventually cause performance problems, and you will go from 1 or 2 bad links to many. A customer I have been working with for the last several months was in this situation and is finally finishing a very long process of cleaning up many faulty links. Years ago I had a customer that was experiencing extremely long delays on their Brocade fabric. They had over-redundancy (there is such a thing) on the switches and links between the hosts and the storage. Many of the links were questionable and producing CRC errors. When the storage received a bad frame, it simply dropped it and did not send an ABTS. They also had an adapter in the host with a bug in it, and it would simply sit and wait for the storage to respond. 90 or so seconds later, the application would time out and initiate recovery for a problem that should never have happened.

Why Does It Matter What Brand Of Switch It Is?

First, the different brands of switches use different commands to obtain the data you need to troubleshoot these problems. Second, the way that they check and forward frames is different, which requires a different technique depending on the brand of switch. Cisco switches are what is called store-and-forward: they wait for the entire frame to be received, then they check it, and if the frame is valid it gets forwarded; if not, it is dropped. Brocade switches are cut-through: as soon as they receive enough of the frame to know where it is going, they start forwarding it. If the frame ends up being bad, they try to correct it using Forward Error Correction. If that doesn't work, the frame is tagged as bad. For the most part, end devices that receive frames already tagged as bad simply drop the frame and initiate recovery via ABTS. Troubleshooting commands and techniques therefore vary for Brocade vs Cisco fabrics.

Identifying CRCs on Cisco Fabrics

Since Cisco fabrics are store-and-forward, you know that frames with CRC errors will be dropped as soon as they are detected. This can be either at the switch port they arrive on, or more rarely inside the switch. This post will focus on the CRC errors detected at the switch ports. If you suspect that you have questionable links, you can use these commands to check switch ports for CRC errors:
  • 'show interface'
  • 'show interface counters'
  • 'show logging log'
For the above, the 'show interface' and 'show interface counters' commands can be run specifying a switch port that you are interested in. This is done in the format fcS/P, where S is the slot and P is the port. For 'show logging log', you are looking for messages that a port was disabled because the bit error rate was too high; this is often an indicator of a faulty link. Once you find the ports that are detecting the CRC errors, you can proceed to the repair phase.

Identifying CRCs on Brocade Fabrics

Brocade fabrics use cut-through routing. As such, the link for the port that is detecting the CRC errors may not be the faulty link. Brocade has two statistics for CRCs: CRC and CRC_Good_EOF. If the CRC_Good_EOF counter is increasing, the link it is increasing on is the source of the problem. If the CRC counter is increasing, the frame has already been marked as bad, and the problem is occurring elsewhere on the SAN. CRC_Good_EOF should be the only counter that increases on a device port. If the CRC_Good_EOF counter is increasing on an ISL port, the link between the sending and receiving switch is bad. If the CRC counter is increasing on the ISL, the problem is occurring somewhere on the sending switch, so move to the sending switch and look for ports where CRC_Good_EOF is increasing. It is possible that both counters will increase on a link. If it is a device port, then the link is bad. If it is an ISL, then the link itself is a problem, and the sending switch has other bad links attached to it. As you can see, there are a few more steps to identify the source of CRC errors on Brocade before you can proceed to the repair phase. The porterrshow output may also show ports that do not have CRC_Good_EOF increasing but do show a counter called PCS increasing. If so, this is also an indication of a bad link, and troubleshooting PCS errors is the same as troubleshooting CRC_Good_EOF errors.
  • 'porterrshow'
  • 'portstatsshow N'
The 'porterrshow' command displays error statistics for all ports. The 'portstatsshow N' command, where N is a port index number, displays more detailed statistics for the specified port. If you see PCS errors increasing for a port in the porterrshow output, the link on that port is bad, regardless of what the CRC or CRC_Good_EOF counters show.

Correcting the Problem

Once you have identified the port(s) that have questionable links, you need to correct the problem. As I mentioned earlier, this is an iterative process. You replace a part, then clear the switch statistics, then monitor for anywhere from several hours to a day, depending on the rate of increase. Repeat the process until the errors are no longer increasing. You can replace multiple parts at once - such as replacing a cable and an SFP at the same time. Another option is to isolate further by just swapping a cable, or moving the device to a new port on the switch. Just remember that it is critical to reset the statistics immediately after any change you make.  REMEMBER THAT PATCH PANELS ARE PART OF CABLING.  I emphasize that because customers will often replace the cable between the switch/device and the panel and forget that there is cabling between patch panels which is also suspect.

Some years ago I went onsite to troubleshoot connectivity between two storage systems located at different campuses in the same city. The replication paths would not stay up. When I got there, the client had the systems directly connected through several patch panels with no switching. I assisted them in putting the cabling through switches at each campus and immediately saw CRCs showing up on the links. They had 8 hops across patch panels between the storage systems. We found CRCs at the second hop on each side; I stopped checking after that. Their eventual permanent fix was to run a new direct run of cable between the two locations.
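Because the fix is a clear-wait-compare loop, it can help to script the comparison step. The sketch below simply diffs two snapshots of per-port CRC counters and reports the ports that are still climbing; how you collect the snapshots (CLI scraping, SNMP polling, a switch API) is up to you, and the port names and values shown are made up.

# Sketch of the "clear statistics, wait, compare" step: diff two counter snapshots
# and list ports whose CRC counters are still increasing. Snapshot data is made up.

def still_increasing(before: dict, after: dict) -> dict:
    return {port: after[port] - before.get(port, 0)
            for port in after
            if after[port] > before.get(port, 0)}

snapshot_before = {"port 12": 0, "port 13": 0, "port 47": 0}   # taken right after clearing stats
snapshot_after  = {"port 12": 0, "port 13": 42, "port 47": 3}  # taken hours or a day later

for port, delta in still_increasing(snapshot_before, snapshot_after).items():
    print(f"{port}: CRC counter increased by {delta} -- keep working on this link")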
If you have any questions, leave them in the comments or find me on   LinkedIn or on Twitter.

Friday, April 5, 2019

Advanced Alert Policies in IBM Storage Insights Pro

IBM released a new Alert Policies feature for IBM Storage Insights Pro the first week of March 2019.    John Langlois does an excellent job introducing the new feature here:


There are a few more advanced aspects of alert policies that John did not cover.   

First, you can add managed storage to an Alert Policy, or remove it, from the Alert Definitions view of the storage system that you want to modify.  The following video shows how to do this.

Second, remember that if you add storage that has never been in an Alert Policy to an Alert Policy, any existing alerts defined on that storage are lost and cannot be retrieved.  The following video shows an example of managing this, along with a workaround to preserve those alerts if you want to re-apply them at some point in the future.







Tuesday, March 26, 2019

Cisco Automatic Zoning

Cisco released a feature in NX-OS v8.3.1 called Automatic Zoning.  The feature does exactly what the name suggests:  it automatically configures zoning for the devices on your SAN.  You can see a video on the feature here:



What Is Zoning?

SAN (Storage Area Network) zoning is the practice of specifying which devices on the SAN can communicate with which other devices.  Devices are added to a zone - a group of devices that are allowed to communicate.  Zones are then added to a zoneset, and the zoneset is activated; the active zoneset is the configuration that is in effect.  There can be multiple zonesets but only one active zoneset at any given time.  By default, any device that is not a member of at least one zone is considered unzoned and cannot communicate with any other device.  Effective zoning prevents unauthorized devices from talking to each other and minimizes disruptions on the SAN if a device misbehaves.
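As a mental model (this is not switch syntax), you can think of the active zoneset as a collection of named sets of WWPNs, where two devices can talk only if they share at least one zone. The WWPNs below are made up.

# Toy model of zoning: zones are sets of WWPNs, the active zoneset is a collection
# of zones, and two devices can communicate only if they share at least one zone.
# WWPNs are made-up examples.

active_zoneset = {
    "HOST_A__STORAGE_1": {"10:00:00:00:c9:aa:aa:aa", "50:05:07:68:01:10:00:01"},
    "HOST_B__STORAGE_1": {"10:00:00:00:c9:bb:bb:bb", "50:05:07:68:01:10:00:01"},
}

def can_communicate(wwpn1: str, wwpn2: str) -> bool:
    return any(wwpn1 in zone and wwpn2 in zone for zone in active_zoneset.values())

print(can_communicate("10:00:00:00:c9:aa:aa:aa", "50:05:07:68:01:10:00:01"))  # True: host A to storage
print(can_communicate("10:00:00:00:c9:aa:aa:aa", "10:00:00:00:c9:bb:bb:bb"))  # False: the hosts share no zone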

How Cisco Automatic Zoning Works

When a SAN is first configured, adding devices to zones and creating zonesets can be a long process.  Cisco Automatic Zoning configures this for you so that you do not have to configure zoning manually.  It works by examining which devices are logged into the fabric as initiators and which are logged in as targets, and then adding zones to the configuration that pair the initiators with the targets.  An 'initiator' is a device such as a host; a 'target' is a device such as storage.  Some systems (such as SVC) log in as both, and other storage systems will also log in as both types, especially if they use added services such as replication: a storage system that is reading from or writing to another storage system is an initiator from the perspective of the remote storage system.
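Conceptually, the result looks like every initiator being paired with every target it shares the fabric with. The sketch below is a simplified model of that outcome, not Cisco's actual implementation, and the device names are invented.

# Simplified model of what automatic zoning produces: one pairing per
# initiator/target combination. Device names are invented.

initiators = ["host01", "host02", "host03"]
targets = ["flash_array", "v7000"]

auto_zones = {f"AZ__{i}__{t}": {i, t} for i in initiators for t in targets}

for name, members in sorted(auto_zones.items()):
    print(name, sorted(members))

# With 50 initiators and 2 targets you would get 100 such pairings, even if you
# intended to split the initiators evenly between the two targets.

The 50-initiator scenario discussed below falls directly out of this behavior.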

Automatic Zoning is currently implemented so that, even when enabled, it only runs on single-switch fabrics.  If it detects any inter-switch links (ISLs), it will not perform automatic zoning.  Future versions of Automatic Zoning will run on multi-switch fabrics.  If you make changes to zoning after automatic zoning has run, it will not undo those changes.

Potential Problems and Best Practice Recommendations

You can see how automatic zoning might cause problems for systems that log in as both initiators and targets: you would have devices communicating with each other that are not supposed to, which risks disruption on the SAN.  If you have IBM SVC or other Storwize systems on your SAN, do not enable Automatic Zoning.

Remember that automatic zoning will zone all initiators to all targets.  So in the scenario where you have 50 initiators and two targets but you want to split the initiators evenly between the targets, Automatic Zoning would zone all 50 initiators to both targets.  You would have to go in and manually rezone initiators away from each target.

In summary, Automatic Zoning can relieve some of the burdens on SAN Administrators during initial setup, but it should only be used (with great care) on smaller, single-switch environments with a single target (or multiple targets if all initiators will communicate with all targets).   Be wary if your storage has any replication features enabled as this means it will likely log in as both initiator and target.  

Monday, March 11, 2019

Why Increasing Buffer Credits Is Not A Cure For A Slow Drain Device


When I am working on performance problems, a frequent question I get is why increasing buffer credits for a particular port is not a fix for a slow-drain device attached to that port.  In this video, I explain the concept of congestion and illustrate why increasing the number of buffer credits is not a fix unless the underlying cause of the congestion is addressed.  There are some exceptions to this rule; the most common is when dealing with long-distance links, but that will be addressed in a future blog post (and perhaps a future video).




As always,  you can leave feedback in the comments, or find me on LinkedIn or Twitter

Friday, March 8, 2019

Troubleshooting IBM Storage Insights Pro Alerts

Recently, there were enhancements made to several features, including a new Alert Policy feature in IBM Storage Insights Pro. You can read about the new features here.  The Alert Policy feature lets you configure a set of alerts into a policy and apply all of them across multiple storage systems. In this way you can ensure consistency of alerts without having to define the same alert on each individual storage system. Once you define the alerts, IBM Storage Support representatives can see the alerts generated on a storage system.


For the IBM FS9100, there are a number of alerts that are already defined. When one of those alerts is triggered,  a proactive ticket is opened and the experts at IBM Storage Support investigate the alert, then take whatever action is necessary.   With this post we'll take a look at how the IBM Storage Support Team investigates  alerts.  For this example we are using an alert for the Port Send Delay I/O Percentage.  

You can see in this picture that the storage system had several alerts on the Port Send Delay I/O Percentage.  This statistic measures the ratio of send operations that were delayed to the total number of send operations for the port.  The ports listed in the alerts are the 16Gb ports of an IBM Storwize system; for those ports this counter indicates conditions similar to the transmit buffer credit 0 counter on 8Gb ports.  In this case, more than 20% of I/O was delayed for the ports listed in the 'Internal Resource' column over a 20-minute interval.
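The alert condition itself boils down to a simple ratio check over the collection interval. The numbers in the sketch below are invented, not taken from the screenshots.

# The Port Send Delay I/O Percentage is just delayed sends over total sends.
# Sample numbers are invented for illustration.
delayed_sends = 2_600
total_sends = 12_000
threshold_pct = 20.0    # the alert threshold used in this example

delay_pct = 100.0 * delayed_sends / total_sends
if delay_pct > threshold_pct:
    print(f"Port Send Delay I/O Percentage = {delay_pct:.1f}% -> the alert fires")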



The next counter that Storage Support would check is the duration of each delay.  There could be a lot of send operations being delayed for a very short time, in which case it is probable that applications won't notice an impact.  However, if a few long delays are triggering the alert, that would most likely impact applications.  It's similar to heavy traffic that slows things to 65 mph (100 km/hr) but keeps moving, versus some condition that slows things much more but is more intermittent; you will notice the impact much more in the second scenario.

Looking at the comparison of the delay time to the I/O Delay percentage, you can see that the delay time is not that high and it does not last for very long.  If the delay time were higher or lasted longer, Storage Support would investigate further.   There was also no impact to the customer's applications.





If further investigation were needed, the next step would be to determine what kind of traffic the delays were coming from.  Storwize and SVC nodes can send data to other nodes in the cluster (inter-node), to other clusters for replication (partner), to hosts, or to back-end storage.  This particular storage system is an FS9100 and does not have back-end storage.  So you need to find out where most of the data is being sent, as that is the most likely source of the alert.  The picture shows that the send data rate to the hosts almost exactly matches the total data rate for the ports, so most of the data being sent is going to the hosts.  You would then start looking at host and volume statistics to narrow it down further, looking for a host or set of hosts that is consuming the majority of the data.


This performance analysis also helps the IBM SAN Central team if an assist from them is necessary.  Since you know now that most of the data is being sent to or from the hosts and you potentially have done further isolation to determine a set of hosts,  this  would allow SAN Central to focus on a few specific hosts rather than having to examine an entire SAN.  

Wednesday, March 6, 2019

I Will Be At Technical University in Atlanta

IBM Tech U Atlanta 2019


This is just a quick post to say that I will be at IBM Technical University in Atlanta. I will be there from April 29 through May 3.

My sessions for this event are:
  • s106417 The Path of an FC Frame Through a Cisco MDS Director
  • s106420 Proactive Monitoring of a Cisco Fabric
  • s106421 Troubleshooting Cisco SAN Performance Issues - Part 1
  • s106422 Troubleshooting Cisco SAN Performance Issues - Part 2
Find these sessions and many more at Technical University. Click the banner at the top of the page to register or for more information.

As always, if you have any questions, leave them in the comments at the end of this blog or find me on LinkedIn or Twitter.

Why Low I/O Rates Can Result In High Response Times


As IBM Storage Insights and Storage Insights Pro become more widely adopted, many companies who weren't doing performance monitoring previously are now able to see the performance of their managed storage systems. With the Alerting features on Storage Insights Pro, companies are much more aware of performance problems within their storage networks. One common question that comes up is why a volume with low I/O rates can have very high response times. Often these high response times are present even with no obvious performance impact at the application layer.
These response time spikes are generally measured in the tens or hundreds of milliseconds, but they can be a second or greater. At the same time, the I/O rates are low - perhaps 10 I/Os per second or less. This can occur on either read or write I/Os. As an example, this picture shows a typical pattern of generally low I/O rates with a high response time. The volume in question is used for backups, so it is generally only written to during backups. The blue line is the I/O rate - in this case the write I/O rate, but the same situation can happen with reads from idle volumes. The orange line is the response time. You can see a pattern of generally low I/O rates to the volume, with the write response time spiking when the I/O rate goes up. It is easy to see why, if you were using Storage Insights Pro to alert on response time, you might be concerned about a response time greater than 35 ms.

 
This situation happens because of the way storage systems manage internal cache. This is generally true for all (IBM or non-IBM) storage subsystems and storage virtualization engines (VEs). If a volume has low I/O rates, it is idle or nearly idle for extended periods of time - sometimes a minute or more. The storage device or VE will then flush the cache for that volume to free up cache space for other volumes that are actively reading or writing. The first I/O that arrives after an idle period requires re-initialization of the cache, and for storage systems and VEs with redundant controllers or nodes it also requires that the cache is synchronized across the nodes or controllers. All of this takes time; these processes are often expensive in performance terms, and the first I/O after an idle period can see significant delays.

Additionally, for write I/O the volume may operate in write-through mode until the cache has been fully synchronized. In write-through mode the data is written to cache and disk at the same time, which can cause further slowdowns because each write is reported as complete only after the update has been written to the back-end disk(s). After the cache is synchronized, each write is reported as complete as soon as the update has been written to cache, which is much faster. You can see how, depending on the caching scheme of the storage subsystem, idle or almost idle volumes can show extremely high response times. Unless applications are being impacted, this is generally not a concern.
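A toy calculation shows why one "cache warm-up" I/O dominates the average when the I/O rate is low. All of the numbers below are invented purely for illustration.

# Why a nearly idle volume shows a high average response time: the first write
# after an idle period pays the cache re-initialization / write-through penalty,
# and with so few I/Os in the interval it dominates the average. Numbers invented.

first_write_after_idle_ms = 40.0   # cache re-init plus write-through to disk
cached_write_ms = 0.5              # a normal write acknowledged from cache
writes_in_interval = 8             # a low I/O rate for the sample interval

total_ms = first_write_after_idle_ms + (writes_in_interval - 1) * cached_write_ms
print(f"average response time: {total_ms / writes_in_interval:.1f} ms")  # about 5.4 ms

The same eight writes against an already-warm cache would average 0.5 ms, so the sparse workload makes the volume look far slower than it really is.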
Response time spikes can also occur with large transfer sizes. This picture shows response time for the same volume as in the previous picture, except as it relates to transfer size - in this case the size of the writes. As in the picture above, the orange line is the response time and the blue line is the transfer size. You can see that the transfer size is large - almost 500 KB per I/O. The volume for this performance data is not compressed; if it were, there could be additional delays depending on the compression engine used in the storage. Barry Whyte gives an excellent writeup of Data Reduction Pools here that details how DRP gives better performance than IBM RACE or other compression technologies.


If you have any questions, leave them in the comments at the end of this blog or find me on   LinkedIn or on Twitter.

Working With Groups in IBM Storage Insights Pro


IBM Storage Insights Pro Groups Feature


In addition to the Reporting and Alerting features that are not available in the free version of IBM Storage Insights, the subscription-based offering, IBM Storage Insights Pro, has a very useful feature called Groups. Groups let you bundle related storage resources together for ease of management. For example, you might group a set of volumes that all belong to a specific application or server cluster, or group the hosts that make up a cluster. You can even group ports together - you could define a group of ports for an SVC cluster that includes all the ports used for inter-node communication.  Such a group would be very handy for your IBM Storage Support person: it would potentially save Support from having to collect a support log and dig through it, and it would certainly make analysis go faster when troubleshooting an issue.
To get started with Groups, the first thing to do is to see if you have any groups defined. They are listed, not surprisingly, under Groups -> General Groups; General Groups is the last entry on the Groups menu.

This is the type of group that will be used in this blog post.
If you open the Groups menu, you can see any groups that have been defined.  For this IBM Storage Insights Pro instance, there are currently no groups defined.  


You can add any storage resources you want to a group. The process for creating a group is to add a resource to a group and select the option to create a new group.
Adding A Resource To A Group
To add a resource to a group, in IBM Storage Insights Pro, browse to a resource, right-click the resource then select Add to General Group.



In this example we are adding a volume to a group. This can be done from the listing of Block Storage Systems, where you can add a storage system to a group, or from any of the internal resources of a storage system when you are looking at them in the Storage Properties view. You can select multiple resources and add them all to a group at one time. So you might list your volumes, filter on a volume name, select all of the filtered volumes, and add them to a group. To add the hosts from a cluster you might do the same in the host list.
You can mix resource types in a group.  So for example, you might have all the storage systems, volumes and hosts for an application cluster in a group.  This makes it easier to see the resources associated with that particular application - for both you and IBM Storage Support. This way, when you call IBM for a performance problem on an application, they can just look at the appropriate group to identify all the resources that are of interest for that application.
The next page that appears after you click Add to General Group is the Add to Group page.


You can either create a new group or add the selected resources to existing groups.  It is possible for a resource to be in more than one group.  
Adding Resources To A New Group
In this example, you create a group for storage resources related to the payroll database. You add a volume to it here, but you could (and should) add hosts and other storage systems to the group.  In this way, all the resources used for our payroll application are in a single group and easily identifiable.


Adding A Resource To An Existing Group
     
The following image shows the Group selection, if you opted to add a resource to an existing group. In this example, you only have one group defined, but if you had multiple groups, you could add the selected resources to multiple groups just by selecting all the groups that you want to add the resources to.




Viewing The Resources In A Group
The group listing looks much like the internal resources view of a storage system.


Sub-groups
The last feature for Groups is Sub-groups. Sub-groups are exactly what they sound like - a way to further define relationships between resources in a group.  In the following example, you can see the sub-groups for our Payroll group.



You might have hosts, volumes, or other resources dedicated to different aspects of payroll processing. In this example, there is a sub-group dedicated to Pensions and another dedicated to Taxes. These sub-groups will appear in the listing of groups if you want to assign resources to them.  Like all other resources, sub-groups can belong to multiple groups.


If you have any questions, leave them in the comments at the end of this blog or find me on   LinkedIn or on Twitter.

IBM Storage Insights: Tips and Answers to Questions Frequently Asked

Answers To Some Frequently Asked Questions about IBM Storage Insights



Over the last several months I have seen some common questions asked about IBM Storage Insights. I started collecting them and will answer them here.  These questions are all about Storage Insights itself; questions relating to managing specific types of storage with Storage Insights will be answered in future blog posts. So, on to the questions...

Q: Can I install a data collector on the same system as my IBM Spectrum Control server?

A: Yes.  However, you need to pay attention to the memory and CPU usage of your Spectrum Control server.

Data Collector Authentication

If you use username/password authentication, configure a dedicated user ID for the Data Collector on your storage systems; do not use the default or another admin account.  This allows for effective auditing and reduces security risks.

Q: What Are The Recommended System Specifications for the Data Collector?

A: The hard drive space requirement has risen from the original 1GB minimum.  The Data Collector now caches the performance data it collects in case it loses its connection to the cloud; when it regains the connection, it uploads the cached data.  This helps avoid gaps in performance data due to a loss of connection.  The minimum specifications are 4 GB of HDD space and 1 GB of RAM available on the system you install it on.  For a data collector in a virtual machine, add these specifications to whatever the operating system requires.

Q: Does Storage Insights Support Multi-Tenancy?

A: There is currently no support for multi-tenancy.  This means that if you are managing storage from multiple datacenters, everyone with access to your Storage Insights instance will be able to see all storage. A suggestion is to edit the properties of the storage and fill out the location. You can then create a custom dashboard for each location. Setting the location property also helps IBM Storage Support know where storage is located. This assists with troubleshooting.

Q: Does The Data Collector Need To Be Backed Up? What about Redundancy?

A: Install at least two data collectors per instance for redundancy, and at least two in each location if you are managing storage across multiple data centers. You do not need to back up the data collector. It does not store any collected data locally - all data collected is streamed to the cloud - and the data collector is always available for download if it needs to be re-installed. Downloading it also ensures that you always get the latest version. If you are using a virtual machine, you may want to back up the VM image, but only to make it easier to re-deploy if there is an issue with the VM.



Q: I Installed Multiple Redundant Data Collectors.  Which one will collect data from the Storage?

A: You have two options.  Option 1 is  that you can assign data collectors to collect data from specific storage systems.  If you choose this option, only the data collectors assigned to collect data from a given storage system will do so.  Option 2 is to leave the assignment feature turned off.  If this is done, each data collector will  test the speed of the connection to the storage systems they manage.  The Data Collector with the fastest connection speed will win.  If you have a situation where you have multiple Data Collectors and one of them is located behind an internal firewall to manage storage behind that firewall, then that Data Collector will always be used to collect data from that storage.

Q: What About Firewalls?

A: You need to open port 443, the default HTTPS port, on your firewall to allow the Data Collector to communicate with the cloud service. This only needs to be for outbound traffic; IBM will never send anything down to the Data Collector. If there is a firewall between the Data Collector and the storage it is managing, the firewall should be configured to pass SNMP traffic. Lastly, ensure that the data collector is in the VLAN used for SAN switch and storage management, or that VLAN routing is configured to allow the data collector to reach those devices across VLANs.

Q: You Just Said IBM Never Sends Anything To the Data Collector? How Does It Know What To Do?

A: The Data Collector is constantly checking a queue on the cloud for jobs to do, such as a support log collection. This ensures that communication is only one-way (the data collector pushes data up to the cloud).

Q: I Have A Proxy Server. How Do I Configure The Data Collector for a Proxy Server?

A: During the installation of the Data Collector, it will ask for your proxy server configuration. The proxy server itself should not need any additional configuration.

Q: Can I Control Whether IBM Storage Support Can Collect Support Logs?

A: Yes. Instructions are here.
Some considerations when setting permissions:
  • If this is turned off, IBM Storage Support will not be able to collect logs as they need them, potentially delaying problem resolution
  • If this is allowed, you are granting IBM Storage Support permission to collect support logs as needed for troubleshooting, without requesting permission each time
  • This is a simple toggle that can be turned on and off as often as you wish
  • When you are doing maintenance on a storage system it is recommended that you turn this off for the duration of the maintenance

Q: I Want To Configure a Performance Alert. What Are Some Suggested Values for Thresholds?

A: Performance monitoring thresholds are different for every environment. Use historical performance data to guide alerting decisions for response time and other thresholds. For new Storage Insights instances, it is recommended that you wait until you have two weeks of performance data before configuring any alerts.

Q: I'm a Partner and my client has given me permission to monitor his Storage Insights free dashboard. Can I get SI Pro capabilities while he stays on the free version?

A: No. You cannot see the Pro capabilities. You see exactly what your customer sees.

If you have any questions, leave them in the comments or find me on LinkedIn or on Twitter.

The Importance of Keeping Your Entire Solution Current

Recently I started working on a new case for a customer.  I'm trying to diagnose repeated error messages logged by an IBM SVC cluster that indicate problems communicating with the back-end storage being virtualized by the SVC.  These messages generally indicate SAN congestion problems.  The customer has Cisco MDS 9513 switches installed.  They're older switches, but not all that uncommon.  What is uncommon is finding the switches at NX-OS version 5.X.X.  I see down-level firmware regularly, but this one is particularly egregious: this revision is several years out of date. Later versions of code contain numerous bug fixes, both from Cisco and from the associated upstream Linux security updates that get incorporated into NX-OS.  Also, while NX-OS versions don't officially go out of support, any new bugs identified won't be fixed because this version is no longer being actively developed.

This level of firmware merits further investigation.  Looking deeper on the switches I find this partial switch module list:

Mod  Ports  Module-Type                         Model              Status
---  -----  ----------------------------------- ------------------ ----------
6    48     1/2/4 Gbps FC Module                DS-X9148           ok
7    48     1/2/4 Gbps FC Module                DS-X9148           ok
8    48     1/2/4 Gbps FC Module                DS-X9148           ok
9    48     1/2/4 Gbps FC Module                DS-X9148           ok

These modules are older than the firmware on the switches, and support for them ended 3 years ago.  If this customer has problems with them (or the switches they are installed in) and the problem is traced back to the modules, there is not much that IBM Support can do.  If a problem is traced to a bug in the firmware, the customer can't upgrade the firmware to something more current because these old, unsupported modules are still in the switches.  This limits IBM's ability to provide support: the hardware is no longer supported, and much of the data we can look at in the firmware was not introduced until the next major revision level of NX-OS, v6.2(13).  Later releases also added options and improvements to lower thresholds and timeout values, which increases the frequency of some logging for performance issues.

I could see several 2Gb devices attached to these modules, which is probably why they are still installed.  I could also see some of these slow devices zoned to the SVC, which is connected to the SAN at 8Gbps.  This violates the best practice of not zoning devices together where the port speeds differ by more than 2x: a 2Gb device should not be zoned to an 8Gb device, a 4Gb device should not be zoned to a 16Gb device, and so on.  The slow device will turn into a slow-drain device sooner rather than later.  I suspect this is the customer's problem, but I can't confirm it because of the lack of data due to the age of the hardware and firmware.
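That rule of thumb is easy to express as a quick check, which can be handy when reviewing zoning against an inventory of port speeds:

# The guideline from above as a check: don't zone two devices together if their
# link speeds differ by more than 2x.
def speed_mismatch(speed_a_gbps: float, speed_b_gbps: float) -> bool:
    return max(speed_a_gbps, speed_b_gbps) > 2 * min(speed_a_gbps, speed_b_gbps)

print(speed_mismatch(2, 8))    # True  -- a 2Gb host zoned to an 8Gb SVC port violates the guideline
print(speed_mismatch(4, 8))    # False -- within the 2x guideline
print(speed_mismatch(4, 16))   # True  -- also a violation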

The last reason it is critical to keep your solution updated is that if you go too long between updates, there are often interim upgrades that need to be completed first.  This is common when moving between major revisions of SAN switch firmware.  So if a switch is at a firmware version that is one or two major revisions behind the current code level, there will be at least one and possibly more interim levels required.  This greatly raises the complexity of performing upgrades and also raises the risk, because customers will want to try to condense what is normally at least a two-week process per upgrade version into as little time as possible.


The recommendations I gave this customer:

  1. Move the applications on those slow servers to servers with a 4 or (ideally) 8Gb connection to the SAN on the newer modules in the switches. This allows those old modules to be decommissioned and moves toward a best-practice solution.
  2. Decommission those old modules, and replace them if the port density is needed.  This will allow for firmware upgrades, which are beneficial for all the reasons noted above.
  3. Start planning for a refresh of the switches themselves.  While the switch chassis will be supported for some time yet, their end-of-life process was announced a few years ago.

How Storage Insights Can Enhance Your Support Experience


An Introduction

The first week in May IBM announced IBM Storage Insights.    As of 11 June, Storage Insights has these key items:
  • IBM Blue Diamond support
  • Worldwide support for opening tickets
  • Custom dashboards
  • New dashboard table view
  • Clients can now specify whether IBM Storage Support can collect support logs.  This is done on a per-device basis.
You can get a complete list of the new features here:  Storage Insights New Features
There are some other new features such as new capacity views on the Storage Insights Dashboard.    With these new features, especially support for IBM Blue Diamond customers, Storage Insights is an increasingly important and valuable troubleshooting tool.  My team here is seeing more and more customers that are using Storage Insights.   I thought I would discuss the potential benefits of Storage Insights as a troubleshooting tool.  

Some Background

The problems my team fixes can be categorized as either:
  1. Root-cause analysis (RCA) - meaning a problem happened at some point in the past, and the customer wants to know why it happened
  2. Ongoing issue - the problem is happening now (or happens repeatedly, also called intermittent)
The above two types of problems can be further broken down into partially working or completely broken.  Of the two, partially working can be much more difficult to troubleshoot, especially if it is an intermittent issue rather than a constant one.  As an example, some years ago my van had a misfire on one of its cylinders, but we didn't know which one.  Of course it never occurred when my mechanic was driving it.  It finally took several hours at the dealer, with the car hooked up to a test rig, to record the failure and identify the misfire.  Had the problem been a completely broken spark plug wire instead of a partial one, it would have been much easier to identify.
You can imagine the difficulty of attempting to root-cause a problem that happened hours or days ago on a large and busy SAN if the problem is/was not severe enough to cause the switches to record any time-stamped errors or other indicators of problems.  As an example, I'm confident you've been in slow-moving traffic where the cause of the problem isn't readily apparent.  The analogy isn't perfect, but suppose the traffic cameras in your city were configured to only start recording, and/or generate an alert back to the traffic center, when traffic is moving less than 30 mph for 2 minutes.  They do record the number of cars passing by and the number of cars exiting and entering the freeway at each ramp, but they don't timestamp these numbers; they only timestamp the video and/or alerts.

Now further suppose you were stuck in traffic last week that was moving at 32 mph.  Since it didn't meet the threshold, the cameras never recorded anything and no alerts were sent.  You could collect the statistics on the number of cars counted by the cameras, but without anything recorded from last week it would be extremely difficult to provide an explanation as to why traffic was slow, since you can't reconcile the count of cars to any specific point in time.  If traffic had been completely stopped, the cameras would have started recording and then you'd be able to see the car fire, or accident, or whatever the cause of the problem was last week.  The same limitations exist for ongoing issues: if traffic is moving slowly but not completely stopped, identifying the cause of the slow traffic can be difficult.

How Storage Insights Can Help Provide Root-Cause

Storage Insights has the potential to provide an explanation for these partially working root-cause investigations by regularly sampling the performance statistics and providing timestamps on this data.  If we had something like Storage Insights regularly sampling statistics from our traffic cameras, we could go back and analyze these for the time period where you were sitting in traffic.  We might find a certain exit ramp from the freeway was congested at the time of the problem.    We could take this information and correlate that with other data to try and determine why the ramp was backed up.  We might find a concert was going on at a venue near the ramp, or some other event that caused an increase in traffic to that ramp.  

How Storage Insights Can Decrease Problem Resolution Time

For a problem that is happening now, Storage Insights can help provide resolution more quickly than would otherwise be possible.  Going back to our traffic example, suppose there is an accident or some other problem on a surface street that an exit ramp connects to.  Traffic eventually backs up onto the exit ramp and then onto the freeway.  Without Storage Insights, you'd have to look at each of your traffic cameras in turn and try to figure out where the congestion starts.  With Storage Insights, since it is collecting the statistics, you can filter them to find out which of your exit ramps is congested.


Troubleshooting SVC/Storwize NPIV Connectivity

Some Background:

A few years ago IBM introduced the virtual WWPN (NPIV) feature to the Spectrum Virtualize (SVC) and Storwize products.  This feature allows you to zone your hosts to a virtual WWPN (vWWPN) on the SVC/Storwize cluster.  If the cluster node has a problem, or is taken offline for maintenance, the vWWPN can float to the other node in the I/O group.  This provides increased fault tolerance, as the hosts no longer have to do path failover to start I/O on the other node in the I/O group.
All of what I've read so far on this feature is from the perspective of someone who is going to be configuring it.  My perspective is different, as I troubleshoot issues on the SAN connectivity side.  This post covers some of the procedures and data you can use to troubleshoot connectivity to the SVC/Storwize when the NPIV feature is enabled, as well as some best practices to hopefully avoid problems.

If you are unfamiliar with this feature, there is an excellent IBM RedPaper that covers both this feature and the Hot Spare Node feature:

An NPIV Feature Summary

1:  SVC/Storwize has three modes for NPIV - "Enabled", "Transitional" or "Off".   
2:  Enabled means the feature is active: hosts attempting to log in to the physical WWPN (pWWPN) of the Storwize port will be rejected.  Transitional means it is enabled, but the SVC/Storwize will accept logins to either the vWWPN or the pWWPN.  Off means the feature is not enabled.
3:  Transitional mode is not intended to be enabled permanently.  You would use it while you are in the process of re-zoning hosts to use the vWWPNs instead of the pWWPNs.
4:  For the NPIV failover to work, each of the SVC/Storwize nodes has to have the same ports connected to each fabric.  For example, assuming this connection scheme for an 8-port node with redundant fabrics:
Node Port    Fabric A    Fabric B
    1            x
    2                         x
    3            x
    4                         x
    5            x
    6                         x
    7            x
    8                         x

All the nodes must follow the same connection scheme.  Hot-spare node failover will also fail if the nodes are mis-cabled.  To be clear, I am not advocating the above scheme per se, just that all the nodes must match as to which ports are connected to which fabrics.
5:  I was asked at IBM Tech U in Orlando if the SVC/Storwize Port Masking feature is affected by the NPIV feature.  The answer is no.  Any existing port masking configuration is still in effect. 
6: pWWPNs are used for inter-node and inter-cluster (replication) traffic, as well as controller/back-end connectivity.  vWWPNs are used only for hosts.

A Suggestion:  A recommendation I heard at IBM Technical University in May is that, if you are using FC aliases in your zoning, you can add the vWWPN to the existing alias for each SVC/Storwize cluster port so that you don't have to rezone each individual host.  While that is an easy way to transition, it creates a potential problem.  After you move the cluster from Transitional to Enabled, the cluster starts rejecting the fabric logins (FLOGIs) to the pWWPNs.  At best, all this does is fill up a log with rejected logins, at which point you call for support because you notice a host logging FLOGI rejects.  At worst, it causes actual problems when the adapter, and possibly the multipath driver, attempt to deal with the FLOGI rejects.  Prior to moving the NPIV mode to Enabled, you need to remove the pWWPN from the FC alias, but you must first ensure you are not using the same aliases for zoning your back-end storage; if you are and you remove the pWWPN from the alias, you will lose controller connectivity.  If you are using a Storwize product with internal storage and no controllers, this will not be an issue and the pWWPN can be removed from the alias.  If you do have back-end storage and are currently using the same aliases for both host and controller zoning, it might be easier to establish new aliases for the pWWPNs and rezone the controllers to them, or simply rezone the controllers directly to the pWWPNs before modifying the existing aliases to use the vWWPNs.  There will be fewer zones to modify for the controllers than for the hosts.

Troubleshooting Connectivity to the SVC/Storwize Cluster

One of the most common problems that I see with connectivity issues, or with path failover not working as it should, is incorrect zoning.  To that end, you first need to verify the vWWPNs that you should be using.  The easiest way is to run lstargetportfc on the cluster CLI to get a listing of these vWWPNs; lsportfc will list the pWWPNs.  This command output is included by default in the svcout file starting at version 8.1.1; in versions prior to that it is a separate command.  Once you have that list, you can use the Fibre Channel Connectivity listing in the SVC/Storwize GUI and its filtering capabilities to filter on the vWWPNs and/or the pWWPNs to determine whether you have any hosts connected to the pWWPNs.  You can also capture the output of lsfabric -delim , and import that CSV into Excel or similar to get better sorting and filtering than the system GUI provides.  If a host is missing, or is connected to the pWWPNs, you will need to check and verify zoning.  This is also a good time to verify that controllers are connecting to the pWWPNs and, if you are using a hot-spare node, that the controllers are zoned to the ports on the hot-spare.  I had a case recently where, while it wasn't the reason the customer opened the ticket, I noticed they had not zoned one of their controllers to the hot-spare node.  In the event of a node failure, the failover would not have worked as expected.
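If Excel is not your preferred tool, the same filtering can be scripted.  The sketch below flags host logins that landed on the physical WWPNs instead of the virtual ones; the file names are placeholders, and the lsfabric column names used here (local_wwpn, remote_wwpn, type) are assumptions, so check the header row produced by your code level before relying on it.

# Sketch: flag host logins that are connected to the physical WWPNs.
# File names and lsfabric column names are assumptions; verify them first.
import csv

with open("physical_wwpns.txt") as f:                 # the pWWPNs, e.g. from lsportfc
    physical_wwpns = {line.strip().upper() for line in f if line.strip()}

with open("lsfabric.csv", newline="") as f:           # saved comma-delimited lsfabric output
    for row in csv.DictReader(f):
        if row.get("type") == "host" and row.get("local_wwpn", "").upper() in physical_wwpns:
            print(f"Host WWPN {row['remote_wwpn']} is logged in to physical port "
                  f"{row['local_wwpn']} -- it should be rezoned to the virtual WWPN")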




An Easy Way To Turn Your Flash Storage Supercar Into a Yugo

Introduction

This past summer I was brought into a SAN performance problem for a customer.  When I was initially engaged, it was a host performance problem.  A day or two later, the customer had an outage on a Spectrum Scale cluster.  That outage was root-caused to a misconfiguration on the Spectrum Scale cluster, where it did not re-drive some I/O commands that timed out.  The next logical question was why the I/O timed out.  Both the impacted hosts and the Spectrum Scale cluster used an SVC cluster for storage.  I already suspected the problem was due to an extremely flawed SAN design; more specifically, the customer had deviated from best-practice connectivity and zoning for their SVC cluster and controllers.  A 'Controller' in Storwize/SVC-speak is any storage enclosure - Flash, DS8000, another Storwize product such as V7000, or perhaps non-IBM branded storage.  In this case, the customer had 3 controllers.  Two were IBM Flash arrays; for the purposes of this blog post we will focus on those and on how the customer's SAN design negatively impacted their IBM Flash systems.

Best-Practice SVC/Storwize SAN Connectivity and Zoning 

The figure below depicts best-practice port connectivity and zoning for SVC and controllers on a dual-core fabric design.  (This assumes you have two redundant fabrics, each of which is configured like the figure below.)  As we can see, ideally our SVC cluster and controllers are connected to both of our core switches.  A single-core design obviously does not have the potential for this mis-configuration, since all SVC and controller ports on a given fabric are connected to the same physical switch.  In a mesh design, we would want to use the same basic principle of connecting SVC and controller ports to the same physical switch(es).  Zoning must be configured such that the SVC ports on each switch are zoned only to the controller ports attached to the same switch.  The goal is to avoid unnecessary traffic flowing across the ISL between the switches.  In the example below, we have two zones.  Zone 1 includes the SVC and controller ports attached to the left-most switch.  Zone 2 includes the SVC and controller ports attached to the right-most switch.
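One way to sanity-check an existing configuration against this principle is to flag any SVC-to-controller zone whose members sit on different core switches.  The port-to-switch map and zone definitions below are made-up placeholders that you would populate from your own fabric and zoning exports.

# Sketch: flag zones whose members span both core switches, which forces
# SVC-to-controller traffic across the ISL. All data below is made up.

port_switch = {
    "svc_port_1": "core1", "svc_port_2": "core2",
    "flash_port_1": "core1", "flash_port_2": "core2",
}

zones = {
    "SVC_FLASH_CORE1": {"svc_port_1", "flash_port_1"},      # stays on one switch
    "SVC_FLASH_CROSSED": {"svc_port_1", "flash_port_2"},    # traffic must cross the ISL
}

for name, members in zones.items():
    switches = {port_switch[member] for member in members}
    if len(switches) > 1:
        print(f"{name}: members span {sorted(switches)} -- this traffic will cross the ISL")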

Customer Deviations from Best-Practice on a Dual-Core Fabric

The next figure is the design the customer had.  The switches in question are Brocade-branded, but the design would be flawed regardless of the switch vendor.  The problem should be obvious: with the below design, all traffic moving from the SVC to the back-end controllers has to cross the ISL, in this case a 32 Gbps trunk.  The switch data showed multiple ports in the trunk were congested - there were transmit discards and timeouts on frames moving in both directions, and both switches were logging bottleneck messages on the ports in the trunk.  The SVC was logging repeated command timeouts and errors indicating it was having problems talking to ports on the controllers.  Lastly, the SVC was showing elevated response times to the Flash storage.  All of this was due to the congested ISL.  With this design, the client was not getting the ROI or the response times it should have been getting from the Flash storage.  Of course, all of the error correction and recovery caused an increased load on the fabric and re-transmission of frames, which made an already untenable situation worse.  The immediate fix to provide some relief was to double the bandwidth of the ISL on both fabrics.  The long-term fix was to re-connect ports and zone appropriately to get to best practice.

Customer Host Connectivity and a Visual of the Effect on the Fabric

The last figure shows the customer host connectivity and the effect of this flawed design on the fabric.  We can see from the figure that the client had both the underperforming hosts and the GPFS/Spectrum Scale cluster connected to DCX 2, where the controllers were connected.  With this design, data must traverse the ISL 4 times.  Traffic on the ISLs could be immediately reduced by half by moving half of the SVC ports to DCX 2 and half the controller ports to DCX 1, and then zoning to best practice as in the first figure in this blog post.  In addition to the unnecessary traffic on the congested ISL, redundancy is reduced, since this design is vulnerable to a failure of either DCX 1 or DCX 2.  While the client did have a redundant fabric, a failure of either of those switches means a total loss of connectivity from SVC to controllers on one of the fabrics.  That is significant.  ISL traffic could be further reduced (and reliability increased at the host level) by moving half of the GPFS cluster (and other critical host ports) to DCX 1 and zoning appropriately.  In this way, the only traffic crossing the ISLs would be from hosts or other devices that don't have enough ports to be connected to both cores, plus whatever traffic is necessary to maintain the fabric.  Both the SVC-to-controller and the host-to-SVC traffic would then be much less vulnerable to any delays on the ISLs or congestion in either fabric.