Tuesday, August 25, 2020

Integrating Broadcom Flow Vision Rules with MAPS



Sound monitoring and syslogging practices are the first and sometimes most important step in troubleshooting.  They also the most overlooked as they must be  configured before a problem happens.  If system logging is not configured before a problem happens, valuable information is lost. 

Broadcom has two important features that you can use to monitor the health of your Broadcom fabrics and alert you when problems are detected:  Flow Vision (the monitoring) and Monitoring And Alerting Policy Suite (MAPS), which can both monitor and alert if it detects error conditions.  In this post I'll provide a brief overview of each feature and then we'll see how we can integrate Flow Vision into MAPs to provide a comprehensive monitoring and alerting solution. 

Flow Vision

Flow Vision provides a detailed  view of the traffic between devices on your fabrics.  It captures traffic for analysis to find bottlenecks, see excessive bandwidth utilization, and look at other similar flow-based fabric connectivity.  Flow Vision can inspect the contents of a frame to gather statistics  on each frame.  Flow Vision has three main features:  Flow Monitor, Flow Generator and Flow Mirror.  In this blog post we'll take a look at Flow Monitor as that is what we will  integrate into MAPs.    Flow Monitor  provides the ability to monitor flows that you define and it gathers statistics on frames and I/Os.  Some example use cases for flows:

  •  Flows ­ through the fabric for virtual machines or standalone hosts connected via NPIV that start from from a single N_Port ID (PID) to destination targets. 
  • Flows monitoring inside logical fabrics and inter-fabric (routed) traffic passing through
  • Gaining insights into application performance through the capture of statistics for ­specified flows. 
  • Monitoring various frame types at a switch port to provide deeper insights into storage I/O access such as the various SCSI commands

MAPS

 MAPS is a policy-based health monitor that allows a switch to constantly monitor itself for fault detection and performance problems (link timeouts, excessive link resets, physical link errors) and if it detects a problem, alert you via the alert options on the policy, or if they are defined, on the individual rule.   However, MAPS does not inspect the contents of the data portion of frames.  Options for alerting include email, SNMP or raslog (the system log).  You should -always- have the raslog option set as this will give IBM Support critical timestamped data if the switch detects a problem. 


Integrating Flow Vision with MAPs


Combining these two capabilities gives you a fully integrated and very powerful monitoring configuration.  You can have Flow Vision monitor for certain types of frames, or frames between a specific source/destination pair and then feed that into MAPs to take advantage of the alerting capabilities of MAPs.

In this example we're going to take advantage of the ability of Flow Vision to inspect the contents of a frame, and then we'll  add that to MAPS to utilize the alerting capabilities in MAPS.  Suppose we  want to know when a certain host sends an abort sequence (ABTS) to a storage device.  For this example, our host name is Host1.  It is connected via NPV so we can't just monitor the ingress port, as it is possible another host will send an ABTS.  We are filtering on a specific source N_Port Id.  We also want to ensure we collect all ABTS that are sent so we are not filtering on a destination ID.

Step 1:  Create the flow:


switch:admin> flow --create Host1_ABTS  -ingrport 1/10 -srcdev 0xa1b2c3 -frametype abts 

The above rule says to filter ingress port 1/10 for the source N_PORT ID A1B2C3 and filter for frametype of ABTS.  Optionally we could specify a -dstdev of "*" and Flow vision would learn which destinations the source dev is sending to.   

Step 2: Import the flow into MAPs

switch:admin> mapsconfig --import Host1_ABTS

Step 3: Verify the Flow has been imported

switch:admin> logicalgroup --show 


------------------------------------------------------------------ 
Group Name |Predefined|Type |Member Count|Members 
------------------------------------------------------------------
ALL_PORTS|Yes       |Port |8           |2/6,1/6-18
ALL_F_Ports|Yes |Port |5 |1/4,3/7-11
ALL_2K_QSFP |Yes |Sfp  |0 |
Host1_ABTS |No |Port |3 |Monitored Flow

Step 4: Create a Rule and add the rule to a Policy

switch:admin> mapsrule --create myRule_Host1_ABTS -group myflow_22 -monitor TX_FCNT -timebase min -op g -value 5 -action RASLOG -policy myPolicy 

Where "-timebase" is the time period to monitor the changes, "-op g" is greater than,  "-value" is the value to trigger at, and "-action" is the action to take. So this rule says to log to the raslog if the switch detects greater than 5 ABTS per minute from the source N_Port  ID that was specified in the flow.  

Next we activate the new policy:

switch:admin> mapspolicy --enable policy myPolicy

Hopefully from this example you can see the utility of being able to monitor and alert on  both the contents of frames, as well as errors or changes detected on your switches.  This example can also server as a blueprint for enabling additional logging capability when troubleshooting a problem.  Perhaps you have an intermittent issue that disappears before you can collect the necessary data.  With Flow Vision you can monitor for a condition and then trigger MAPS to alert you via email or raslog.    For more information you can review the Brocade MAPS and Flow Vision guides here:


Friday, August 21, 2020

Cisco SAN Analytics and Telemetry Streaming - Why Should I Use Them?


Are you sometimes overwhelmed by performance problems on your Storage Network?  Do you wish you had better data on how your network is performing?  If you answered yes to either of these questions, read on to find out about Cisco SAN Analytics and Telemetry Streaming.    


The Cisco SAN Analytics engine is available on Cisco 32Gbps and faster MDS 9700 series port port modules and the 32 Gbps standalone switches.   This engine is constantly sampling the traffic that is running through the switches.  It provides a wealth of statistics that can be used to analyze your Cisco or IBM c-Type fabric.  Telemetry Streaming allows you to use an external application such as Cisco DataCenter Network manager to sample and visualize the data that the analytics engine generates to find patterns in your performance data and identify problems or predict the likelihood of a problem occurring.


You can find an overview of both SAN Analytics and Telemetry Streaming here.  That link also includes a complete list of the hardware that SAN Analytics is supported on.


In this blog post we'll take a quick look at the more important reasons to use SAN Analytics and the Telemetry Streaming features.


Find The Slow Storage and Host Ports on the SAN

This is probably the most common use case for any performance monitoring software.  We want to identify the outlier storage or host ports in the path of slow IO transactions. In this case, slowness is defined as longer IO or the exchange completion time.  Both of these are measures of how long it takes to complete a write or read operation.  While there are several potential reasons for slow I/O, one of the more common ones is slow or stuck device ports.  A host or storage port continually running out of buffer credits is a common cause of performance issues.  SAN Analytics makes it much easier to identify these slow ports. 

 

Find The Busiest Host and Storage Ports

You can use SAN Analytics to identify the busy ports on your SAN. This enables you to monitor the busy devices and proactively plan capacity expansion to address the high utilization before impact application performance.  If you have a host that has very high port utilization, you need to know this before adding more workload to the host.  If you have storage ports that have very high port utilization, perhaps you can load balance your hosts differently to spread the load across the storage ports so that a few ports aren't busier than the rest of them.  

 

It is important to note that busy ports are not automatically slow ports.  Your busy ports may be keeping up with the current load that is placed on them.   However, if the load increases, or a storage system where all ports are busy has a link fail, the remaining ports may not have enough available capacity to meet that demand.  SAN Analytics can help identify these ports.  


Related to this is verifying that multi-pathing (MPIO) is working properly on your hosts.  SAN Analytics can help you determine if all host paths to storage are active, and if they are, whether the utilization is uniform across all of the paths.

 

Discover if Application Problems are Related to Storage Access


SAN Analytics enables you to monitor the Exchange Completion Time (ECT) for an exchange.  This is a measure of how long a command takes to complete.  An overly long ECT can be caused by a few different problems, including slow device ports.  However, if SAN Analytics is reporting a long ECT on write commands when there are no issues indicated on the SAN, this often means that the problem is inside the storage.    


Identify the Specific Problematic Hosts Connected to an NPV Port


Hypervisors such as VIOS, VMware or Hyper-V all use N-Port Virtualization to have virtual machines log into the same set of physical switch ports on a fabric.  Customers frequently also have physical hosts connecting through an NPV device.  An example of this is a Cisco UCS chassis with several hosts connecting through a fabric extender to a set of switch ports.  In these situations, getting a traffic breakdown per server or virtual HBA from just data available in the switch data collection is challenging.   It is even more so when you are trying to troubleshoot a slow drain situation.  Switch statistics can point to a physical port, but if multiple virtual hosts are connected to that port it is often difficult to determine which host is at fault.  There are a few Cisco commands that can be run to try to determine this, but they need to be run in real-time, and on a large and busy switch you can often miss when the problem is happening as the data from these commands can wrap every few seconds. 


The SAN Analytics engine collects this data on each of the separate hosts.  This gives you the ability to drill down to specific hosts in minute detail.  Once you identify a specific switch port that is slow drain, you can then use the data available in SAN Analytics to determine which of the hosts attached to that port is the culprit.  


If you want to learn more:


The Cisco and IBM C-Type Family


How IBM and Cisco are working together

Thursday, August 20, 2020

Implementing a Cisco Fabric for Spectrum Virtualize Hyperswap Clusters



 I wrote this previous post on the general requirements for SAN Design for Spectrum Virtualize Hyperswap and Stretched clusters.  In this  follow-on post, we'll look at a sample implementation on a Cisco or IBM C-type fabric.  While there are several variations on implementation (FCIP vs Fibre-Channel ISL is one example) the basics shown here can be readily adapted to any specific design.  This implementation will also show you how to avoid one of the most common errors that IBM SAN Central sees on Hyperswap clusters - where the ISLs on a Cisco private VSAN are allowed to carry traffic for multiple VSANs.

We will implement the below design, where the public fabric is VSAN 6, and the private fabric is VSAN 5. The below diagram is a picture of one of two redundant fabrics.  The quorum that is depicted can be either an IP quorum or a third-site quorum.   For the purposes of this blog post, VSAN 6 has already been created and has devices in it.  We'll be creating VSAN 5, adding the internode ports to it and ensuring that the Port-Channels are configured correctly.  We'll also verify that Port-Channel3 on the public side is configured correctly to ensure VSAN 5 stays dedicated as a private fabric.   For the examples below, Switch1 is at Failure Domain 1.  Switch 2 is at Failure Domain 2.  

Hyperswap SAN Design

Before we get started, the  Spectrum virtualize ports should have the local port mask set such that there is at least 1 port per node per fabric dedicated to internode.   Below is the recommended port masking configuration for Spectrum Virtualize clusters.   This blog post assumes that has already been completed.  

Recommended Port Masking


Now let's get started by creating the private VSAN:

switch1(config)# conf t
switch1(config)# vsan database
switch1(config-vsan-db)# vsan 5 name private

switch2(config)# conf t
switch2(config)# vsan database
switch2(config-vsan-db)# vsan 5 name private

Next, we'll add the internode ports for our cluster.  For simplicity in this example, we're working with a 4 node cluster, and the ports we want to use are connected to the first two ports of Modules 1 and 2 on each switch.   We're only adding 1 port per node here.  Remember that there is a redundant private fabric to configure which will have the remaining internode ports attached to it.  

switch1(config)# conf t
switch1(config)# vsan database
switch1(config-vsan-db)# vsan 5 interface fc1/1, fc2/1

switch2(config)# conf t
switch2(config)# vsan database
switch2(config-vsan-db)# vsan 5 interface fc1/1, fc2/1

Next we need to build the Port-channel for VSAN 5.  Setting the trunk mode to 'off' ensures that the port-channel will only carry traffic from the  single VSAN we specify.   For Cisco 'trunking' means carrying traffic from multiple VSANs.  By turning trunking off, no other VSANs can traverse the port-channel on the private VSAN.   Having multiple VSANs traversing the ISLs on the private fabric is one of the most common issues that SAN Central finds on Cisco fabrics.  This is because trunking is allowed by default, and adding all VSANs to all ISLs is also a default when ISLs are created  We also will set the allowed VSANs parameter to only allow traffic for VSAN 5.  Lastly to keep things tidy we'll add the port-channel to VSAN 5 on each switch

switch1(config)# conf t
switch1(config)# int port-channel4
switch1(config-vsan-db)# vsan 5 interface port-channel4
switch1(config)# conf t
switch1(config)# int port-channel4
switch1((config-if)# switchport mode E
switch1((config-if)# switchport trunk mode off
switch1((config-if)switchport trunk allowed vsan 5
switch1((config-if)int fc1/14
switch1((config-if)# channel-group 4
switch1((config-if)int fc2/14
switch1((config-if)# channel-group 4
switch1((config-if)vsan database


switch2(config)# conf t
switch2(config)# int port-channel4
switch2(config-vsan-db)# vsan 5 interface port-channel4
switch2(config)# conf t
switch2(config)# int port-channel4
switch2((config-if)# switchport mode E
switch2((config-if)# switchport trunk mode off
switch2((config-if)switchport trunk allowed vsan 5
switch2((config-if)int fc1/14
switch2((config-if)# channel-group 4
switch2((config-if)int fc2/14
switch2((config-if)# channel-group 4
switch2((config-if)vsan database


The next steps would be bring up Port-Channel 4 and the underlying interfaces on Switch 1 and Switch 2, ensure the VSANs have merged correctly and lastly zone the  Spectrum Virtualize node ports together. 

We also need to examine Port-channel 3 on the public fabric to ensure it is not carrying traffic for the private VSAN.  To do this:  

switch1# show interface port-channel3

......

......

Admin port mode is auto, trunk mode is auto

Port vsan is 1 

Trunk vsans (admin allowed and active) (1,3,5,6)


Unlike the private VSAN, the trunk mode can be in auto or on. This is the public VSAN so there may be multiple VSANs using this Port-Channel.   The problem is highlighted in red.  We are allowing private VSAN 5 to traverse this Port-Channel.  This must be corrected, using the commands above that set the trunk allowed parameters.  Your vsans allowed statement would include all of the current VSANs except 
VSAN 5.  On a side note,  it is a good idea to review which VSANs are allowed to be trunked across port-channels or ISLs. Allowed VSANs that are not defined on the remote switch for a given ISL will show up as Isolated on the switch you run the above command on.  The only VSANs that should be allowed are the ones that should be running across the ISL. You would need to perform the same check on Port-Channel 3 on Switch 2. 

Lastly, the above commands can be used for FCIP interfaces or standalone FC ISLs.  You would just substitute the interface name for port-channel4 in the above example.  A note for standalone ISLs is that it is recommended that they be configured as port-channels.   You can read more about that here.  


 I hope this answers your questions and helps you with implementing your next Spectrum Virtualize Hyperswap cluster.  If you have any questions find me on LinkedIn or Twitter or post in the comments.