Wednesday, October 14, 2020

Long Distance Fibre Channel Link Tuning

In this video I talk about some of the variables involved in tuning long distance fibre-channel links.  In this blog post I'll detail some of the tools that are available and provide an example of estimating the number of buffer credits you will need.  Note that this tuning applies only to fibre-channel links, not to FCIP tunnels or circuits.  One critical piece of information that you will need to calculate buffer credits is the frame size.  Smaller frames mean more of them can fit in the link, so you would need more buffer credits.  Of the variables that go into the formula, the frame size is the only unknown; everything else is either known or is a constant.
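As a sketch of the estimate (the numbers here are common rules of thumb, not from any vendor manual): light propagates through fibre at roughly 5 microseconds per km, and a link moves roughly 100 MB/s of frame payload per Gbps of speed.  The transmitter needs enough buffer credits to keep transmitting for one full round trip before the first R_RDY returns:

```python
import math

def estimate_buffer_credits(distance_km: float, speed_gbps: float,
                            avg_frame_size_bytes: float) -> int:
    """Rule-of-thumb buffer credit estimate for a long-distance FC link.

    Assumes ~5 us/km propagation each way and ~100 MB/s of payload per
    Gbps of link speed; real planning should add headroom on top of this.
    """
    round_trip_s = 2 * distance_km * 5e-6                     # time for first R_RDY to return
    frame_tx_s = avg_frame_size_bytes / (speed_gbps * 1e8)    # time to serialize one frame
    return math.ceil(round_trip_s / frame_tx_s)

# 100 km at 2 Gbps with full-size (2148-byte) frames: roughly 1 credit per km
print(estimate_buffer_credits(100, 2, 2148))   # -> 94
# Smaller frames on the same link need more than double the credits
print(estimate_buffer_credits(100, 2, 1024))   # -> 196
```

Note how the result for full-size frames matches the old rule of thumb of roughly one credit per km at 2 Gbps, and how halving the frame size roughly doubles the requirement.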

Brocade has the 'portbuffershow' command that can tell you the average frame size for a link.  Look at the Framesize columns for TX and RX in the output, which is organized by logical switch and then by port.

On a Cisco fabric, you can calculate the average frame size from the byte and frame counters in the 'show interface' output.
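The calculation, sketched below, is simply total bytes divided by total frames for each direction.  The exact counter names in the 'show interface' output vary by NX-OS release, so treat the values here as illustrative:

```python
def avg_frame_size(total_bytes: int, total_frames: int) -> float:
    """Average frame size in bytes from a port's cumulative counters."""
    if total_frames == 0:
        return 0.0
    return total_bytes / total_frames

# Example using byte/frame counters as reported in a
# 'show interface' style display (numbers are made up)
rx = avg_frame_size(total_bytes=4_296_000_000, total_frames=2_000_000)
print(rx)  # -> 2148.0
```

Remember that these counters are cumulative, so for a current average you would take the difference between two samples taken a known interval apart.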

Monday, October 12, 2020

Using the IBM Storage Insights Pro Grouping Features


I recently wrote this post on how you can help IBM Storage Support help you by ensuring you are utilizing the full monitoring features available on your storage systems and switches.  You should also have at least the free version of IBM Storage Insights installed.  If you have Storage Insights Pro or Storage Insights for Spectrum Control, there are some additional steps you can take that will help both you and the IBM Support team resolve your problems as quickly as possible.

IBM Storage Insights Pro and Storage Insights for Spectrum Control come with some powerful features for grouping and  organizing storage resources.  These features are found under the Groups menu.   You can organize your storage resources into Applications, Departments and General Groups.   

There is a hierarchy to the organization of resources.   Departments can contain sub-departments, Applications or General Groups.  Applications can contain hosts or other applications.  General Groups can contain volumes or storage systems.  

Applications and Departments let you model the applications that are critical to your business and assign them to the same departments they are part of in your actual business.  You can define an application (such as a database) and then add the hosts that run that database to the application.  When you do this, Storage Insights automatically pulls the storage systems and volumes associated with those hosts into the application that you created.

General Groups let you group volumes and storage systems together.  One use case: after grouping volumes in a General Group, you can define alerts for the members of that group.  The Alert Policies feature provides a similar function for storage systems and other types of resources, but you cannot currently add volumes to an Alert Policy directly, so I recommend that you continue to use Alert Policies to manage alerts on storage systems and use General Groups for volume-level alerts.  This can be important because different types of physical storage (FlashCore vs. nearline) have different performance expectations, and a response time that is valid for FlashCore is not achievable on nearline drives.  A volume backed by nearline storage will see constant alerts if you configure an alert that expects flash performance.  Also, volumes with different I/O patterns will have different response time expectations; see this post for an example of one such I/O pattern.  General Groups enable you to group volumes, storage systems, hosts or other resources however you wish.

How Using the Grouping Features Helps IBM Storage Support

Storage systems that are organized into at least applications help IBM Storage Support more quickly identify potentially affected resources when you have a problem.  A typical performance problem statement from a client is "our XXX database performance is very slow".  Before IBM Support can begin to work on a problem, we need to know what hosts are affected, what storage systems are providing the storage to the hosts and what volumes those hosts are using.  If a customer has organized the resources using the Storage Insights features, identifying the affected components is much easier.  Here are some real-world examples that illustrate how using the grouping features is beneficial:

One customer had a number of volumes being replicated via Global Mirror with Change Volumes (GMCV) on a pair of SVC clusters.  The issue was that a subset of 20 or so volumes out of a few hundred were nearly always behind on their recovery point.  Within that subset, some would catch up, then fall behind, then a different set would catch up, and so on: while the group of 20 was nearly constant, the volumes that were actually behind would frequently change.  We had the customer create a General Group of the 20 volumes, so on any given day the customer could tell us which volumes were behind.  It was much easier to look at those 20 and identify particular volumes than to repeatedly filter them out of the few hundred that the customer had.  Over time we were able to determine that the volumes falling behind had an I/O pattern of spikes of very high write activity followed by longer periods of very low activity.  The volumes would fall behind during the intense writes, then not catch up because the low I/O activity meant they were at a lower priority for replication.  Having both the Storage Insights performance data and the ability to put these volumes in a group made it much easier to diagnose the issue.

Another customer had an application that would intermittently have performance problems, and users were complaining about the slowness of the application.  The customer had several virtual hosts, running the application, spread across 10 VMware servers in a cluster.  The virtual hosts could be running on any of the VMware servers at any given time, and any of several dozen volumes could also be affected.  We had the customer create an application and add the VMware hosts to it.  This automatically pulled in the storage systems, volumes and backing storage for the application.  Without having to repeatedly filter on hosts or volumes, we were able to determine much more quickly that the root cause of the problem was the backing storage being overdriven.  The problem would have been resolved eventually, but the pattern was much clearer when we were able to start with a much smaller set of volumes and hosts.

You can see how utilizing the Groups features of Storage Insights Pro can benefit both you and the IBM Support teams.  If you want to find out more about the features, visit the Storage Insights YouTube channel and check out the videos there that cover Departments, Applications and Groups.

Monday, September 14, 2020

Help IBM Storage Support Help You


I had a client recently ask me what was the most effective thing his company could do to get me the data that would be most helpful in troubleshooting problems in his solution.  This was after we were unable to provide a definitive root cause for a problem that occurred intermittently in his solution.  He had a fairly simple fabric that consisted of two 96-port switches, a few IBM storage systems and 30 or so hosts.  His problem was an issue with performance on the hosts.  At the time, the best I was able to tell him was that the data indicated a slight correlation between host read activity and the performance problem, but I was not able to confirm anything with certainty.

My answer was simple: configure better event detection and system logging.  This is something I teach as a best practice at IBM Technical University.  I also suggested that his company install at least the free version of IBM Storage Insights.  Without a performance monitoring tool, troubleshooting performance problems is very similar to trying to figure out why a traffic jam is happening using still pictures from traffic cameras.  Now imagine trying to root-cause a traffic jam that happened yesterday or last week with pictures taken today, where the only other data you have is statistics such as how many cars the camera has counted since the last time you reset those counters.  Solving problems with data like that is effectively what this customer was asking the Support teams at IBM to do.

That said, here are the recommended actions you can take to ensure the best chance of being able to provide the data that we need to solve your problems:

  1.   Configure callhome on your products.  You can search the Knowledge Center for instructions for your specific IBM hosts, storage and switches.  Your product can monitor itself and open tickets for hardware issues that you might not necessarily be aware of.
  2.   Configure a syslog server on your products.  While this won't directly help provide data, it does preserve events if a host, storage system or switch has a system failure.  Without offloading syslog data, critical event data for these kinds of failures is lost.  Logs also wrap, and having a syslog server configured prevents losing system events when they do.  You can search the Knowledge Center for instructions for your specific IBM hosts and storage on how to do this.  For SAN switches, refer to the instructions from Cisco and Brocade.
  3. Configure monitoring and alerting on your SAN switches.  This may require additional licensing, but an effective monitoring policy often gives us critical timestamped data.  As an example, a recent case I worked on had several hosts losing paths to storage.  Looking at the switch data, the switch ports for these hosts and a few others were seeing CRC errors.  You can read more about these errors and how to troubleshoot them here.  They are the easiest errors to detect, and resolution is straightforward.  Because this customer had implemented a good monitoring policy, I was able to easily see the timestamps and let the customer know these errors were ongoing and needed to be resolved.
  4. Install a performance monitoring tool, at least the free version of IBM Storage Insights.  My client did not have Storage Insights set up.  If he'd had it set up then most likely we would have been able to use the performance data to confirm the theory.   A guided tour of Storage Insights is here. If you have Spectrum Control already, Storage Insights is included for the systems you have licensed in Spectrum Control.  You get all the same monitoring and alerting features that are included in Storage Insights Pro.  Check out this post to learn how Storage Insights can enhance your IBM Storage Support experience.

For point 3, Cisco has the port-monitor feature.  You can find a complete overview here.  I strongly recommend that you disable the slow-drain policy that is active on a newly deployed switch and at least activate the default port-monitor policy.  The default policy alerts on many more counters (19) than the slow-drain policy does, and the two counters that the slow-drain policy alerts on are included in the default policy.  Enabling the default policy helps by providing time-stamped data for troubleshooting problems.  Brocade has the Monitoring and Alerting Policy Suite (MAPS).  MAPS can also provide the time-stamped data that is often critical to determining why a problem occurred.  You can find the FOS v8.2 MAPS user guide here, and you can find a blog post on integrating Brocade Flow Vision rules into MAPS here.  Integrating Flow Vision allows you to alert on specific kinds of frames.

Tuesday, September 8, 2020

IBM Announces IBM SANnav

IBM Announced IBM SANnav today.  You can register for a webinar to learn more about SANnav here.    

SANnav is a next-generation SAN management application.  It was built from the ground up with a simple, browser-based user interface.  It can streamline common workflows, such as configuration, zoning, deployment, troubleshooting, and reporting. The modernized GUI can improve operational efficiency by enabling enhanced monitoring capabilities, faster troubleshooting, and advanced analytics.

Key features and capabilities include:

  1. Configuration management: You can use policy-based management to apply consistent configurations across the switches in your fabrics.  SANnav also makes zoning devices easier by providing a more intuitive interface than previous management products.  
  2. Dashboards:  You can see  at-a-glance views and summary health scores for fabrics, switches, hosts, and targets that may be contributing to performance issues within the network. You can instantly navigate to any hot spots for investigation and take corrective action. 
  3. Filter management: You can sort through large amounts of data by selecting only attributes of importance. For example, users can search for all 32 Gbps ports that are offline. This filter reduces the displayed content to only the points of interest, allowing faster identification and troubleshooting.
  4. Investigation mode: Provides intuitive views that you can navigate for key details to help you understand complex behaviors. SANnav Management Portal periodically collects metrics and stores them in a historical time-series database for further analysis. In addition, it can collect metrics more frequently (at 10-second intervals) for select ports.  This performance data is invaluable when trying to troubleshoot a problem that occurs intermittently and/or is severe enough to impact production but not severe enough to cause a complete outage.
  5. Reporting: Generates customized reports that provide graphical summaries of performance and health information, including all data captured using IBM b-type Fabric Vision technology. Reports can be configured and scheduled directly from SANnav Management Portal to show only the most relevant data, enabling administrators to more efficiently prioritize their actions and optimize network performance.
  6. Autonomous SAN: This is the feature I am most looking forward to learning more about.  As I am in the business of troubleshooting fabrics to find problems, I would like to see how effective this is and how quickly the switches can detect problems and notify administrators.  Perhaps some day we'll have switches that can detect problems and automatically route traffic onto faster links (where possible).  This would be very similar to a recent drive I took where my phone's GPS program routed me around a major traffic jam.  The detour was slower than the main roads would have been with no traffic, but it was many minutes faster than driving through the congestion.
As a reminder, you can register for the free webinar at the link above.  I hope to see you there. 

Tuesday, August 25, 2020

Integrating Broadcom Flow Vision Rules with MAPS

Sound monitoring and syslogging practices are the first and sometimes most important step in troubleshooting.  They are also the most overlooked, as they must be configured before a problem happens.  If system logging is not configured before a problem happens, valuable information is lost.

Broadcom has two important features that you can use to monitor the health of your Broadcom fabrics and alert you when problems are detected: Flow Vision, which does the monitoring, and the Monitoring and Alerting Policy Suite (MAPS), which can both monitor and alert if it detects error conditions.  In this post I'll provide a brief overview of each feature, and then we'll see how we can integrate Flow Vision into MAPS to provide a comprehensive monitoring and alerting solution.

Flow Vision

Flow Vision provides a detailed view of the traffic between devices on your fabrics.  It captures traffic for analysis to find bottlenecks, spot excessive bandwidth utilization, and examine other flow-based fabric behavior.  Flow Vision can inspect the contents of a frame to gather statistics on each frame.  Flow Vision has three main features: Flow Monitor, Flow Generator and Flow Mirror.  In this blog post we'll take a look at Flow Monitor, as that is what we will integrate into MAPS.  Flow Monitor provides the ability to monitor flows that you define, and it gathers statistics on frames and I/Os.  Some example use cases for flows:

  • Monitoring flows through the fabric for virtual machines or standalone hosts connected via NPIV, starting from a single N_Port ID (PID) to destination targets.
  • Monitoring flows inside logical fabrics and inter-fabric (routed) traffic passing through them.
  • Gaining insights into application performance through the capture of statistics for specified flows.
  • Monitoring various frame types at a switch port to provide deeper insights into storage I/O access, such as the various SCSI commands.


MAPS

MAPS is a policy-based health monitor that allows a switch to constantly monitor itself for fault detection and performance problems (link timeouts, excessive link resets, physical link errors) and, if it detects a problem, alert you via the alert options on the policy or, if they are defined, on the individual rule.  Unlike Flow Vision, MAPS does not inspect the contents of the data portion of frames.  Options for alerting include email, SNMP or raslog (the system log).  You should always have the raslog option set, as this gives IBM Support critical timestamped data if the switch detects a problem.

Integrating Flow Vision with MAPS

Combining these two capabilities gives you a fully integrated and very powerful monitoring configuration.  You can have Flow Vision monitor for certain types of frames, or for frames between a specific source/destination pair, and then feed that into MAPS to take advantage of the alerting capabilities of MAPS.

In this example we're going to take advantage of the ability of Flow Vision to inspect the contents of a frame, and then we'll add that flow to MAPS to utilize the alerting capabilities in MAPS.  Suppose we want to know when a certain host sends an abort sequence (ABTS) to a storage device.  For this example, our host name is Host1.  It is connected via NPV, so we can't just monitor the ingress port, as it is possible another host will send an ABTS; instead, we filter on a specific source N_Port ID.  We also want to ensure we collect all ABTS that are sent, so we are not filtering on a destination ID.

Step 1:  Create the flow:

switch:admin> flow --create Host1_ABTS -feature monitor -ingrport 1/10 -srcdev 0xa1b2c3 -frametype abts

The above command tells Flow Vision to watch ingress port 1/10 for frames from source N_Port ID 0xa1b2c3 with a frame type of ABTS.  Optionally, we could specify a -dstdev of "*" and Flow Vision would learn which destinations the source device is sending to.

Step 2: Import the flow into MAPS

switch:admin> mapsconfig --import Host1_ABTS

Step 3: Verify the Flow has been imported

switch:admin> logicalgroup --show 

Group Name  |Predefined|Type |Member Count|Members
ALL_PORTS   |Yes       |Port |8           |2/6,1/6-18
ALL_F_Ports |Yes       |Port |5           |1/4,3/7-11
ALL_2K_QSFP |Yes       |Sfp  |0           |
Host1_ABTS  |No        |Port |3           |Monitored Flow

Step 4: Create a Rule and add the rule to a Policy

switch:admin> mapsrule --create myRule_Host1_ABTS -group Host1_ABTS -monitor TX_FCNT -timebase min -op g -value 5 -action RASLOG -policy myPolicy

Here "-timebase" is the time period over which to monitor changes, "-op g" means greater than, "-value" is the value to trigger at, and "-action" is the action to take.  So this rule says to log to the raslog if the switch detects more than 5 ABTS per minute from the source N_Port ID that was specified in the flow.
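Conceptually, the rule is just a comparison of a per-timebase counter against the threshold.  A minimal sketch of that evaluation (an illustration of the logic only, not Brocade's implementation):

```python
import operator

# Comparison operators keyed by the names MAPS uses for -op
OPS = {"g": operator.gt, "ge": operator.ge,
       "l": operator.lt, "le": operator.le, "eq": operator.eq}

def rule_triggers(count_in_timebase: int, op: str, value: int) -> bool:
    """True when the counter observed over one timebase crosses the threshold."""
    return OPS[op](count_in_timebase, value)

# Our rule: -op g -value 5, with a timebase of one minute
print(rule_triggers(6, "g", 5))  # -> True  (6 ABTS in a minute: raslog entry)
print(rule_triggers(5, "g", 5))  # -> False (exactly 5 does not exceed the threshold)
```

Note that with "-op g" a count exactly equal to the threshold does not fire; use "ge" semantics if you want the boundary value to alert.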

Next we activate the new policy:

switch:admin> mapspolicy --enable myPolicy

Hopefully this example shows the utility of being able to monitor and alert on both the contents of frames and the errors or changes detected on your switches.  This example can also serve as a blueprint for enabling additional logging capability when troubleshooting a problem.  Perhaps you have an intermittent issue that disappears before you can collect the necessary data.  With Flow Vision you can monitor for a condition and then trigger MAPS to alert you via email or raslog.  For more information, you can review the Brocade MAPS and Flow Vision user guides.

Friday, August 21, 2020

Cisco SAN Analytics and Telemetry Streaming - Why Should I Use Them?

Are you sometimes overwhelmed by performance problems on your Storage Network?  Do you wish you had better data on how your network is performing?  If you answered yes to either of these questions, read on to find out about Cisco SAN Analytics and Telemetry Streaming.    

The Cisco SAN Analytics engine is available on Cisco 32 Gbps and faster MDS 9700 series port modules and the 32 Gbps standalone switches.  This engine constantly samples the traffic running through the switches.  It provides a wealth of statistics that can be used to analyze your Cisco or IBM c-type fabric.  Telemetry Streaming allows you to use an external application such as Cisco Data Center Network Manager to sample and visualize the data that the analytics engine generates, find patterns in your performance data, and identify problems or predict the likelihood of a problem occurring.

You can find an overview of both SAN Analytics and Telemetry Streaming here.  That link also includes a complete list of the hardware that SAN Analytics is supported on.

In this blog post we'll take a quick look at the more important reasons to use SAN Analytics and the Telemetry Streaming features.

Find The Slow Storage and Host Ports on the SAN

This is probably the most common use case for any performance monitoring software.  We want to identify the outlier storage or host ports in the path of slow I/O transactions.  In this case, slowness is defined as a long I/O completion time or exchange completion time, both of which are measures of how long it takes to complete a write or read operation.  While there are several potential reasons for slow I/O, one of the more common ones is slow or stuck device ports.  A host or storage port continually running out of buffer credits is a common cause of performance issues.  SAN Analytics makes it much easier to identify these slow ports.
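To make the outlier idea concrete, here is a hedged sketch of how you might flag slow ports from exported completion-time data (the port names and the 3x-median threshold are illustrative assumptions; SAN Analytics does this kind of ranking for you):

```python
from statistics import median

def slow_ports(avg_ect_ms: dict, factor: float = 3.0) -> list:
    """Flag ports whose average exchange completion time is far above the
    fabric-wide median.  The median is robust to a few extreme outliers,
    unlike the mean, which the outliers themselves would drag upward."""
    typical = median(avg_ect_ms.values())
    return [port for port, ect in avg_ect_ms.items() if ect > factor * typical]

# Hypothetical per-port average ECT values in milliseconds
sample = {"fc1/1": 1.2, "fc1/2": 1.1, "fc1/3": 1.3, "fc1/4": 9.5}
print(slow_ports(sample))  # -> ['fc1/4']
```

Using the median rather than the mean matters here: with one badly slow port, the mean rises with the outlier and can hide it.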


Find The Busiest Host and Storage Ports

You can use SAN Analytics to identify the busy ports on your SAN. This enables you to monitor the busy devices and proactively plan capacity expansion to address the high utilization before it impacts application performance.  If you have a host that has very high port utilization, you need to know this before adding more workload to the host.  If you have storage ports with very high utilization, perhaps you can load balance your hosts differently to spread the load across the storage ports so that a few ports aren't busier than the rest.


It is important to note that busy ports are not automatically slow ports.  Your busy ports may be keeping up with the current load that is placed on them.   However, if the load increases, or a storage system where all ports are busy has a link fail, the remaining ports may not have enough available capacity to meet that demand.  SAN Analytics can help identify these ports.  

Related to this is verifying that multi-pathing (MPIO) is working properly on your hosts.  SAN Analytics can help you determine if all host paths to storage are active, and if they are, whether the utilization is uniform across all of the paths.


Discover if Application Problems are Related to Storage Access

SAN Analytics enables you to monitor the Exchange Completion Time (ECT) for an exchange.  This is a measure of how long a command takes to complete.  An overly long ECT can be caused by a few different problems, including slow device ports.  However, if SAN Analytics is reporting a long ECT on write commands when there are no issues indicated on the SAN, this often means that the problem is inside the storage.    

Identify the Specific Problematic Hosts Connected to an NPV Port

Hypervisors such as VIOS, VMware or Hyper-V all use N-Port Virtualization to have virtual machines log into the same set of physical switch ports on a fabric.  Customers frequently also have physical hosts connecting through an NPV device.  An example of this is a Cisco UCS chassis with several hosts connecting through a fabric extender to a set of switch ports.  In these situations, getting a traffic breakdown per server or virtual HBA from just data available in the switch data collection is challenging.   It is even more so when you are trying to troubleshoot a slow drain situation.  Switch statistics can point to a physical port, but if multiple virtual hosts are connected to that port it is often difficult to determine which host is at fault.  There are a few Cisco commands that can be run to try to determine this, but they need to be run in real-time, and on a large and busy switch you can often miss when the problem is happening as the data from these commands can wrap every few seconds. 

The SAN Analytics engine collects this data on each of the separate hosts.  This gives you the ability to drill down to specific hosts in minute detail.  Once you identify a specific switch port that is slow drain, you can then use the data available in SAN Analytics to determine which of the hosts attached to that port is the culprit.  

If you want to learn more:

The Cisco and IBM C-Type Family

How IBM and Cisco are working together

Thursday, August 20, 2020

Implementing a Cisco Fabric for Spectrum Virtualize Hyperswap Clusters

 I wrote this previous post on the general requirements for SAN Design for Spectrum Virtualize Hyperswap and Stretched clusters.  In this  follow-on post, we'll look at a sample implementation on a Cisco or IBM C-type fabric.  While there are several variations on implementation (FCIP vs Fibre-Channel ISL is one example) the basics shown here can be readily adapted to any specific design.  This implementation will also show you how to avoid one of the most common errors that IBM SAN Central sees on Hyperswap clusters - where the ISLs on a Cisco private VSAN are allowed to carry traffic for multiple VSANs.

We will implement the below design, where the public fabric is VSAN 6, and the private fabric is VSAN 5. The below diagram is a picture of one of two redundant fabrics.  The quorum that is depicted can be either an IP quorum or a third-site quorum.   For the purposes of this blog post, VSAN 6 has already been created and has devices in it.  We'll be creating VSAN 5, adding the internode ports to it and ensuring that the Port-Channels are configured correctly.  We'll also verify that Port-Channel3 on the public side is configured correctly to ensure VSAN 5 stays dedicated as a private fabric.   For the examples below, Switch1 is at Failure Domain 1.  Switch 2 is at Failure Domain 2.  

Hyperswap SAN Design

Before we get started, the Spectrum Virtualize ports should have the local port mask set such that there is at least one port per node per fabric dedicated to internode traffic.  Below is the recommended port masking configuration for Spectrum Virtualize clusters.  This blog post assumes that has already been completed.

Recommended Port Masking

Now let's get started by creating the private VSAN:

switch1# conf t
switch1(config)# vsan database
switch1(config-vsan-db)# vsan 5 name private

switch2# conf t
switch2(config)# vsan database
switch2(config-vsan-db)# vsan 5 name private

Next, we'll add the internode ports for our cluster.  For simplicity in this example, we're working with a 4 node cluster, and the ports we want to use are connected to the first two ports of Modules 1 and 2 on each switch.   We're only adding 1 port per node here.  Remember that there is a redundant private fabric to configure which will have the remaining internode ports attached to it.  

switch1# conf t
switch1(config)# vsan database
switch1(config-vsan-db)# vsan 5 interface fc1/1, fc2/1

switch2# conf t
switch2(config)# vsan database
switch2(config-vsan-db)# vsan 5 interface fc1/1, fc2/1

Next we need to build the Port-Channel for VSAN 5.  Setting the trunk mode to 'off' ensures that the port-channel will only carry traffic from the single VSAN we specify.  In Cisco terms, 'trunking' means carrying traffic from multiple VSANs; by turning trunking off, no other VSANs can traverse the port-channel on the private VSAN.  Having multiple VSANs traversing the ISLs on the private fabric is one of the most common issues that SAN Central finds on Cisco fabrics, because trunking is allowed by default and adding all VSANs to all ISLs is also a default when ISLs are created.  We will also set the allowed VSANs parameter to only allow traffic for VSAN 5.  Lastly, to keep things tidy, we'll add the port-channel to VSAN 5 on each switch.

switch1# conf t
switch1(config)# int port-channel4
switch1(config-if)# switchport mode E
switch1(config-if)# switchport trunk mode off
switch1(config-if)# switchport trunk allowed vsan 5
switch1(config-if)# int fc1/14
switch1(config-if)# channel-group 4
switch1(config-if)# int fc2/14
switch1(config-if)# channel-group 4
switch1(config-if)# vsan database
switch1(config-vsan-db)# vsan 5 interface port-channel4

switch2# conf t
switch2(config)# int port-channel4
switch2(config-if)# switchport mode E
switch2(config-if)# switchport trunk mode off
switch2(config-if)# switchport trunk allowed vsan 5
switch2(config-if)# int fc1/14
switch2(config-if)# channel-group 4
switch2(config-if)# int fc2/14
switch2(config-if)# channel-group 4
switch2(config-if)# vsan database
switch2(config-vsan-db)# vsan 5 interface port-channel4

The next steps would be to bring up Port-Channel 4 and the underlying interfaces on Switch 1 and Switch 2, ensure the VSANs have merged correctly, and lastly zone the Spectrum Virtualize node ports together.

We also need to examine Port-channel 3 on the public fabric to ensure it is not carrying traffic for the private VSAN.  To do this:  

switch1# show interface port-channel3

Admin port mode is auto, trunk mode is auto
Port vsan is 1
Trunk vsans (admin allowed and active) (1,3,5,6)

Unlike the private VSAN, the trunk mode can be auto or on, because this is the public side and there may be multiple VSANs using this Port-Channel.  The problem is in the trunk vsans line: private VSAN 5 is allowed to traverse this Port-Channel.  This must be corrected using the commands above that set the trunk allowed parameter; your allowed vsan statement would include all of the current VSANs except VSAN 5.  On a side note, it is a good idea to review which VSANs are allowed to be trunked across port-channels or ISLs. Allowed VSANs that are not defined on the remote switch for a given ISL will show up as isolated on the switch where you run the above command.  The only VSANs that should be allowed are the ones that should be running across the ISL. You would need to perform the same check on Port-Channel 3 on Switch 2.

Lastly, the above commands can be used for FCIP interfaces or standalone FC ISLs.  You would just substitute the interface name for port-channel4 in the above example.  A note for standalone ISLs is that it is recommended that they be configured as port-channels.   You can read more about that here.  

 I hope this answers your questions and helps you with implementing your next Spectrum Virtualize Hyperswap cluster.  If you have any questions find me on LinkedIn or Twitter or post in the comments.  

Wednesday, July 1, 2020

Physical Switch SAN Implementation for an SVC Hyperswap Cluster

In February 2020 I wrote this post on the supported SAN design for SVC and Spectrum Virtualize Hyperswap clusters.  In that post I covered some of the problems that arise with improper SAN design for SVC clusters in a Hyperswap configuration.  At its most basic, the requirement when using Hyperswap is to have completely separate fabrics for private traffic, where the private fabrics carry only the inter-node communication within the cluster, and one or more public fabrics for everything else.  There are various ways that SANs can be implemented to meet that requirement.  This is one in a series of blog posts that will discuss some of the options for fabric design within that framework and provide some implementation details on Cisco and Brocade fabrics.  I will also show you some of the common mistakes that are made in the SAN implementation.  As with that post, while I may only reference SVC in this series (for example, the diagram below depicts an SVC cluster), any recommendations made apply equally to SVC, Spectrum Virtualize and IBM FlashSystem products.  Also remember that this applies to Stretched clusters as well.

Using Physical Switches

The first design we will look at is the simplest in both concept and execution.  You can see in the figure below that there are 8 physical switches in this design. 

The overall SAN consists of 4 two-switch fabrics.  There are two redundant private fabrics for the inter-node ports and two redundant public fabrics for everything else.  This diagram includes hosts and storage related to the cluster, but the public fabrics could include other devices that are not connected to this cluster.

Note the third site for a quorum.  You can also implement the quorum as an IP quorum and not need storage at a third site for a disk-based quorum; however, it is recommended that the quorum be at a third site, even if it is an IP quorum.

Also note that there are two fabrics on one provider and two fabrics on a second provider.  The links between the two sites will be Fibre-Channel over IP (FCIP), or they will be a technology such as DWDM, ONS or simply dark fibre connections carrying the native Fibre-Channel traffic.  Splitting the fabrics across the providers in this way gives full redundancy in the event one of the providers has a problem.  The next best option would be to use a single provider but ensure that the links between the sites take different routes through the provider's infrastructure.

The above design could use Cisco or Brocade switches.   This design has these advantages:
  1.  Simplicity of configuration, as each of the 4 fabrics is a basic design
  2.  Since it uses physical switches, it eliminates the possibility of sharing the ISLs between the sites between the public and private fabrics.  That is a common mistake, and future blog posts will have more detail.

Using Virtual Switches

The above design is more expensive than the other implementation options, given that it requires 8 switches.   Because of the expense, this is an atypical design.   An alternative design that uses only 4 switches is shown below.  This design looks similar to the one above; note the continued presence of the dedicated ISLs between the sites and, as before, the redundant fabrics on two different providers.  However, in this new design each SAN switch is configured into logical virtual switches.  Cisco calls these VSANs and Brocade calls them virtual fabrics, but the idea is the same: each of the virtual switches is a switch unto itself, and when linked with other virtual switches it creates a virtual fabric (or VSAN) within the physical fabric.  These virtual fabrics are distinct entities and have separate zoning, name server databases and nearly everything else that a physical fabric has.  The below design is the most common that IBM SAN Central sees.  In future blog posts I will show you the most common problem we see and how to avoid it.

Variations on both of the above implementations include having separate public fabrics for hosts and controllers,  or having a third fabric that is dedicated to replication, if the SVC or Spectrum Virtualize storage system is replicating data to another cluster.  However, the key requirement is the presence of the dedicated fabric for the inter-node communication.  If that requirement is not met, then the design is invalid.

For information on sizing the inter-site links you can read Jordan Fincher's blog here.

One final note:  it is strongly recommended that you do not use Cisco IVR, Brocade LSAN zoning or other fibre-channel routing features on the private fabrics.   As the SVC node ports will be the only things on the private fabrics, letting the fabrics merge between the sites will have minimal effect, even in lower bandwidth environments.  Once zoning is configured, it is unlikely to change, and there should not be any other fabric changes occurring frequently enough to justify adding IVR or LSAN zoning.  Both features increase the complexity of the solution, both for configuration and for troubleshooting, and they introduce a potential failure point.

Tuesday, June 23, 2020

To Trunk or Not To Trunk, That Is the Question

Separate ISLs

Trunked ISLs

I have had several conversations recently with customers who have asked whether, when they have multiple inter-switch links (ISLs) between switches, those links should be aggregated into a single logical link.  Above we have the two possible configurations for links between switches.

On a Brocade switch these are called trunks; on a Cisco switch these are called port-channels.   The word 'trunk' has a different meaning on a Cisco switch: for Cisco, an inter-switch link (ISL) is trunking when it is carrying traffic for multiple VSANs.  This applies both to single-link ISLs and to port-channels.  If it is only carrying traffic for a single VSAN, it is not trunking.  This blog post uses 'trunk' to mean link aggregation.
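If you want to see how each vendor reports this (the switch names here are hypothetical), on a Cisco MDS 'show port-channel database' lists each port-channel and its member links, while on a Brocade switch 'trunkshow' lists the trunk groups and their member ports:

switch1# show port-channel database
brocade1:admin> trunkshow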

The first image above depicts three ISLs configured as separate, standalone ISLs.  The second image depicts the three links aggregated as a single logical link.  When link aggregation is configured, the switches treat the separate links as a single logical ISL for the purposes of load balancing, fabric rebuilds and routing frames,  among other fabric-related services.

Why Use Trunked Links

Load Balancing

Load balancing is performed by the switches when traffic crosses the ISLs between them.  Traffic is balanced across the links in a trunk or across the separate ISLs, and it is more effective when done across the links in a trunk.  For the diagrams above, load balancing would be better on average for the trunked links than for the separate ISLs, as there is less of a tendency to stack traffic on the same ISL until it is full when using a trunk.   Load balancing also depends on the type of routing configured on the switch or the Cisco VSAN; different VSANs on the same switch can have different routing policies.

There are two types of routing used by most SAN switches: source-destination based and exchange-based routing.  Exchange-based routing is the default for both Brocade and Cisco.  All the frames in a given exchange between two device ports will follow the same route through the fabric; the next exchange between the same two device ports may traverse a different physical link in a trunk.
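You can check which routing policy is in effect (VSAN 1 and the switch names are just examples).  On Cisco, 'show vsan 1' includes the load-balancing scheme for the VSAN (src-id/dst-id versus src-id/dst-id/oxid); on Brocade, 'aptpolicy' reports the current routing policy, where 3 is exchange-based (the default) and 1 is port-based:

switch1# show vsan 1
brocade1:admin> aptpolicy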

Fibre-channel breaks transmissions up into frames (the smallest unit), sequences and exchanges.  Sequences contain frames.  Exchanges contain sequences.  A close approximation  of the relationship is if we consider frames to be spoken words, sequences are sentences and exchanges are a conversation.  

When source-destination routing is used, all of the frames between the same two device ports will follow the same path through the fabric; new exchanges will all follow the same path.  Using this type of routing is not recommended except for a specific set of use cases.   If the same physical link is used for data traveling between different source/destination pairs, the effect is that frames stack up on the same one or two links in a trunk.   This leaves the other links in the trunk underutilized and can cause congestion and delays on the ISLs.
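If you do have a use case that requires source-destination routing, on a Cisco switch the policy is set per VSAN (VSAN 1 here is just an example):

switch1(config)# vsan database
switch1(config-vsan-db)# vsan 1 loadbalancing src-dst-id

The exchange-based default can be restored with 'vsan 1 loadbalancing src-dst-ox-id'.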

It is important to note that source-destination and exchange-based are routing policies, not load-balancing policies.  While the routing policy can affect the effectiveness of load balancing, it is not load balancing per se.

Fabric Changes

Fabrics undergo changes: devices leaving the fabric, zoning changes, and so on.   Periodically switches will join a fabric, or leave it when they are decommissioned.  Links between switches will go offline due to issues on the link, such as a reset for buffer credit recovery, or an administrator may take some action during planned maintenance on the fabric.    When an ISL between switches goes down or comes back up, the fabric will recompute all of the routing tables and rebroadcast them to all the switches in the fabric.    On a fabric configured with standalone ISLs, this happens each time an ISL goes offline or comes back online.   If a link occasionally goes offline, this won't cause much of an impact.  However, if the link goes into a flapping state for buffer credit recovery, the repeated fabric rebuilds can cause an impact to production.

When trunks are used, this fabric rebuild does not occur unless the entire trunk goes offline.  For the purposes of routing, the trunk is considered the route from one switch to its neighbor.  So if a single link in the trunk starts flapping, the fabric rebuilds do not occur and the link can simply be taken offline until the problem is resolved.

Flapping Links

An ISL, like any other link on a fabric, can have problems: bad cabling, faulty optics, or perhaps it becomes a congestion point due to a slow-drain device on the fabric.  When this happens, the link can start flapping, going up and down repeatedly.  If this is a standalone ISL, the flapping can cause congestion or other problems due to the repeated fabric changes that occur each time the link comes up or goes down.  Any frames in flight on that link will need to be re-sent, which causes additional error recovery on hosts and other end devices.     If the flapping link is a member of a trunk, the effects on the fabric are usually much less severe: the fabric can route around the failing link and the fabric changes are mostly attenuated.

For the reasons explained in this post,  unless there is a specific requirement in the solution to have separate ISLs, trunking is the preferred option. 

Thursday, February 27, 2020

SAN Design Best-Practices for IBM SVC and FlashSystem Stretched and Hyperswap Clusters

I recently worked with a customer who had their SAN implemented as depicted in the diagram for a Hyperswap V7000 cluster.   An SVC or FlashSystem cluster that is configured for Hyperswap has half of the nodes at one site and half of the nodes at the other site.  The I/O groups are configured so that the nodes at each site are in the same I/O group.  In the example from the diagram, the nodes at Site 1 were in one I/O group and the nodes at Site 2 were in another I/O group.  A stretched cluster also has the nodes in a cluster at two sites; however, each I/O group is made up of nodes from each site.  So for our diagram below, in a Stretched configuration, a node from Site 1 and a node from Site 2 would be in an I/O group.

 From the diagram we see that each site had two V7000 nodes.  Each node had connections to two switches at each site.  The switches were connected as pictured in the diagram to create two redundant fabrics that spanned the sites.   The customer had hosts and third-party storage also connected to the switches, although the V7000 was not virtualizing the storage.  The storage was zoned directly to hosts.   Several of these hosts had the same host ports zoned to both the third-party storage and the V7000. 

The issue the customer had was that one or more of the V7000 nodes would periodically restart due to lease expiration with other nodes in the cluster.   This means that the node that restarted could no longer communicate with one or more of the other nodes in the cluster, so it restarts itself, both to protect the cluster and in an attempt to re-establish communication.   Lease expiration is almost always due to congestion somewhere on the fabric.    The SAN design in the above diagram has several problems and does not follow the supported design specifications for Hyperswap or Stretched clusters.

Hyperswap requires the SVC or FlashSystem node ports used for internode communication to be isolated on their own dedicated fabric.  This is called the 'private' SAN.  The 'public' SAN is used for host and controller connections to the Spectrum Virtualize cluster and for all other SAN devices.   This private SAN is in addition to isolating these ports by configuring the port-masking feature.   The private fabric can be either virtual (Cisco VSANs or Brocade virtual fabrics) on the same physical switches as the public SAN, or it can be separate physical switches, but it must also include separate ISLs between the sites.    The above diagram does not have a private SAN, either physical or virtual.   As such, the internode traffic is sharing the same ISLs as the general traffic between the sites.   A common related mistake I see is a customer implementing a private SAN using virtual SANs but then allowing them to traverse the same physical links as the public SAN.   This is not correct: the private SAN must be completely isolated.
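For the port-masking piece, on the Spectrum Virtualize CLI the local port mask is a binary string read right to left, one bit per FC port, where a 1 means the port may carry node-to-node traffic.  A hedged example (the cluster name in the prompt and the mask value are hypothetical; this mask would restrict internode traffic to ports 3 and 4).  Run 'lssystem' first and check the local_fc_port_mask field for the current value:

IBM_2145:cluster:superuser> lssystem
IBM_2145:cluster:superuser> chsystem -localfcportmask 0000000000000000000000000000000000000000000000000000000000001100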

This isolation is a requirement for two reasons:
  1.  In a standard topology cluster, the keep-alives and other cluster-related traffic transmitted between the nodes do not cross ISLs.  Best practice for a standard cluster is to co-locate all node ports on the same physical switches.   In a Hyperswap configuration, because the cluster is multi-site, internode traffic must cross the ISLs to reach the nodes at each site.
  2.  All writes by a host to volumes configured for Hyperswap are mirrored to the remote site.  This is done over the internode ports on the private SAN.  If these writes cannot get through, performance for the host suffers, as good status for a write will not be returned to the host until the write completes at the remote site.

If the shared ISLs get congested, then there is a risk that the volume mirroring and cluster communications will be adversely impacted.    For this customer, that is precisely what happened.  The customer had hosts at Site 2 that were zoned to both third-party storage and V7000 node ports at Site 1.  The third-party storage was having severe physical link issues and became a slow-drain device.  The host communication was impacted enough to eventually congest the ISLs on one of the fabrics.  When this happened, nodes began to assert because the internode ports became congested, as the Hyperswap writes had to cross a severely congested ISL.