Tuesday, June 23, 2020

To Trunk or Not To Trunk, That Is the Question



Separate ISLs



Trunked ISLs

I have had several conversations recently with customers who asked whether multiple inter-switch links (ISLs) between two switches should be aggregated into a single logical link. The two possible configurations for links between switches are shown above.

On a Broadcom (Brocade) switch, aggregated links are called trunks. On a Cisco switch they are called port-channels. The word 'trunk' has a different meaning on a Cisco switch: an inter-switch link (ISL) is trunking when it carries traffic for multiple VSANs. This applies to both single-link ISLs and port-channels. An ISL that carries traffic for only a single VSAN is not trunking. This blog post uses 'trunk' to mean link aggregation.

The first image above depicts three links configured as separate, standalone ISLs. The second image depicts the same three links aggregated into a single logical link. When link aggregation is configured, the switches treat the separate links as a single logical ISL for the purposes of load balancing, fabric rebuilds and routing frames, among other fabric-related services.

Why Use Trunked Links

Load Balancing

Switches load-balance traffic as it crosses the ISLs between them, either across the links in a trunk or across separate ISLs. Load balancing is more effective across the links in a trunk. For the diagrams above, the trunked links would on average be more evenly loaded than the separate ISLs, because a trunk has less of a tendency to stack traffic on the same ISL until it is full. Load balancing also depends on the type of routing that is configured on the switch or, on Cisco, on the VSAN. Different VSANs on the same switch can have different routing policies.

Most SAN switches use one of two types of routing: source-destination based or exchange-based. Exchange-based routing is the default for both Brocade and Cisco. With exchange-based routing, all the frames in a given exchange between two device ports follow the same route through the fabric, but the next exchange between the same two device ports may traverse a different physical link in a trunk.

Fibre Channel breaks transmissions up into frames (the smallest unit), sequences and exchanges. Sequences contain frames, and exchanges contain sequences. As a close approximation, if we consider frames to be spoken words, then sequences are sentences and exchanges are conversations.
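The containment relationship can be sketched as a small data model. This is only an illustration, not a frame parser: the class names are made up for this post, and the only fields kept are simplified stand-ins for the OX_ID, SEQ_ID and SEQ_CNT values carried in a Fibre Channel frame header.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Frame:
    seq_cnt: int      # position of the frame within its sequence
    payload: bytes


@dataclass
class Sequence:
    seq_id: int
    frames: List[Frame] = field(default_factory=list)


@dataclass
class Exchange:
    ox_id: int        # originator exchange ID
    sequences: List[Sequence] = field(default_factory=list)

    def frame_count(self) -> int:
        # An exchange contains sequences, and each sequence contains frames.
        return sum(len(s.frames) for s in self.sequences)


# One exchange (a "conversation") made of two sequences ("sentences").
command = Sequence(seq_id=0, frames=[Frame(0, b"READ")])
data = Sequence(seq_id=1, frames=[Frame(i, b"x" * 2048) for i in range(3)])
exchange = Exchange(ox_id=0x1234, sequences=[command, data])

print(exchange.frame_count())
```

Here a single command frame and three data frames belong to one exchange, which is the unit that exchange-based routing keeps together on one path.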

When source-destination routing is used, all of the frames between the same two device ports follow the same path through the fabric, and every new exchange between them follows that path too. This type of routing is not recommended except for a specific set of use cases. When the same physical link is chosen for data traveling between several different source/destination pairs, frames stack up on the same one or two links in a trunk. The other links in the trunk are then underutilized, which can cause congestion and delays on the ISLs.
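The difference between the two policies can be illustrated with a toy link selector. This is a sketch under loud assumptions: the XOR-and-modulo hash below is invented for illustration and is not the algorithm used by Brocade or Cisco hardware, and `pick_link` and `links_used` are hypothetical helper names.

```python
def pick_link(num_links, s_id, d_id, ox_id=None):
    """Choose a link index from the routing key.

    Toy hash (an assumption, not a vendor algorithm): XOR the key
    fields together and take the result modulo the number of links.
    """
    key = s_id ^ d_id
    if ox_id is not None:
        # Exchange-based routing adds the exchange ID to the key.
        key ^= ox_id
    return key % num_links


def links_used(num_links, s_id, d_id, ox_ids, exchange_based):
    """Return the set of links hit by a series of exchanges for one pair."""
    return {
        pick_link(num_links, s_id, d_id, ox if exchange_based else None)
        for ox in ox_ids
    }


host, storage = 0x010200, 0x020300   # example 24-bit port addresses
exchanges = range(64)                # 64 consecutive exchange IDs

src_dst = links_used(3, host, storage, exchanges, exchange_based=False)
per_exchange = links_used(3, host, storage, exchanges, exchange_based=True)

print(len(src_dst), len(per_exchange))
```

In this model the source-destination key never changes for a given pair of ports, so every exchange lands on one link, while the exchange-based key changes with each exchange and the traffic spreads across all three links.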

It is important to note that source-destination and exchange-based are routing policies, not load-balancing policies. While routing policies can affect the effectiveness of load balancing, they are not load balancing per se.

Fabric Changes

Fabrics undergo changes - devices leaving the fabric, zoning changes, etc. Periodically switches join a fabric, or leave it when they are decommissioned. Links between switches go offline because of issues on the link, such as a reset due to buffer credit recovery, or because an administrator takes some action during planned maintenance on the fabric. When an ISL between switches goes down or comes back up, the fabric recomputes all of the routing tables and rebroadcasts them to all the switches in the fabric. On a fabric configured with standalone ISLs, this happens each time an ISL goes offline or comes back online. If a link only occasionally goes offline, this won't cause much of an impact. However, if the link goes into a flapping state for buffer credit recovery, the repeated fabric rebuilds can cause an impact to production.

When trunks are used, this fabric rebuild does not occur unless the entire trunk goes offline. For the purposes of routing, the trunk as a whole is considered the route from one switch to its neighbor. So if a single link in the trunk starts flapping, the fabric rebuilds do not occur, and the link can simply be placed offline until the problem is resolved.
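The effect of a flapping link on rebuild counts can be sketched with a very small model. It assumes exactly what is described above and nothing more: a standalone ISL triggers a rebuild on every state change, and a trunk triggers one only when the whole trunk goes offline or comes back. `count_rebuilds` is a hypothetical helper, and real fabric behavior is more involved than this.

```python
def count_rebuilds(events, num_links, trunked):
    """Count fabric rebuilds caused by a list of (link, is_up) events.

    Simplified model: all links start online. Standalone ISLs rebuild
    on every state change; a trunk rebuilds only when it changes
    between "at least one member online" and "no members online".
    """
    online = [True] * num_links
    rebuilds = 0
    for link, is_up in events:
        if online[link] == is_up:
            continue  # not a state change, nothing to do
        trunk_was_up = any(online)
        online[link] = is_up
        if trunked:
            if any(online) != trunk_was_up:
                rebuilds += 1
        else:
            rebuilds += 1
    return rebuilds


# Link 0 flaps five times (down, then up) while links 1 and 2 stay online.
flaps = [(0, False), (0, True)] * 5

standalone = count_rebuilds(flaps, num_links=3, trunked=False)
trunk = count_rebuilds(flaps, num_links=3, trunked=True)

print(standalone, trunk)
```

With three standalone ISLs the five flaps cause ten rebuilds in this model, while the same flaps on a trunk member cause none because the trunk itself never goes offline.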

Flapping Links

An ISL, like any other link on a fabric, can have problems - bad cabling, faulty optics, or perhaps it becomes a congestion point due to a slow drain device on the fabric. When this happens, the link can start flapping, that is, going up and down repeatedly. If this is a standalone ISL, the flapping can cause congestion or other problems due to the repeated fabric changes that occur each time the link comes up or goes down. Any frames in flight on that link will need to be re-sent, which causes additional error recovery on hosts and other end devices. If the flapping link is a member of a trunk, the effects on the fabric are usually much less severe. The fabric can route around the failing link, and the fabric changes are mostly attenuated.


For the reasons explained in this post,  unless there is a specific requirement in the solution to have separate ISLs, trunking is the preferred option. 

