Brocade Fabric Performance Impact Notification (FPIN) was released in Broadcom FOS v9.0. It is available on Brocade Gen6 and Gen7 switches. This feature enables the switch to detect issues on a fabric, such as congestion or physical link problems, and then notify the affected devices that have registered for these notifications. FPIN works in a similar way to RSCN. RSCN enables the fabric to notify devices when a device they are zoned to goes offline. The devices that receive these notifications can then proactively take steps such as path failover rather than having to react to a path being down.
FPIN provides a means to notify devices of link or other issues with a connection to a fabric or a path through it. For both RSCN and FPIN, a device must register with fabric services to receive these notifications. The new Brocade Gen7 hardware can send hardware or software signal notifications. Gen6 can only send software notifications. Both the hardware and software notifications require FOS v9.0 on the switches.
Hardware signals can be sent from the switch to the adapter in the device. The adapter can then decide what to do about the notification. Software signals are sent higher up in the Fibre-Channel stack, and the adapter driver would then decide how to handle the notification. One advantage to notifications in hardware is reaction time - the adapter can process the notifications and react more quickly than the driver can. Another is that the hardware-based notification is a fibre-channel primitive. This means that even if buffer credits are depleted the signal can still be delivered to the device on the other end of the link.
Primitives are not frames, so they do not need buffer credits to be sent. The software-layer signal is an ELS frame, so it can be affected by buffer credit depletion and other link congestion. Whether the signal sent is hardware or software, how the devices handle the notifications is up to the vendor of the adapter. Some may log the notification, and some may take action. The action that an adapter takes is also vendor specific.
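The difference between the two delivery mechanisms can be illustrated with a toy model. The sketch below is not how any switch or adapter is implemented. The `Link` class, its methods, and the credit count are hypothetical names chosen for illustration, and the return of credits is deliberately left out. It only shows that a frame needs a buffer credit while a primitive does not.

```python
class Link:
    """Toy model of one direction of a link with buffer credits (hypothetical)."""

    def __init__(self, credits):
        self.credits = credits      # credits currently available to the sender
        self.delivered = []         # what reached the far end, in order

    def send_frame(self, name):
        """Frames (including an ELS frame) consume one credit each."""
        if self.credits == 0:
            return False            # no credit left: the frame has to wait
        self.credits -= 1
        self.delivered.append(name)
        return True

    def send_primitive(self, name):
        """Primitives are not frames, so no credit is needed."""
        self.delivered.append(name)
        return True


link = Link(credits=2)
link.send_frame("data frame 1")
link.send_frame("data frame 2")     # credits are now depleted

els_sent = link.send_frame("software notification (ELS)")
primitive_sent = link.send_primitive("hardware signal (primitive)")
print(els_sent, primitive_sent)     # False True
```

In this model the software notification is stuck behind the depleted credits, while the hardware signal still reaches the other end of the link.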
FPIN can alert devices about these events:
End Device Congestion
Device Link Integrity (CRC)
If FPIN is enabled, these events are still monitored via MAPS. Enabling FPIN won't change your existing MAPS configuration for the above events. With FPIN, notifications are sent to the affected devices that register for them. How the devices handle the notification is vendor specific. They may just log the event or they may take other steps such as starting link recovery or slowing traffic on a congested link and re-routing out an un-congested port. As a last resort, the device may shut down a troublesome link.
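Because the reaction is vendor specific, any example can only be hypothetical. As a minimal sketch, assume a device maps each event type to one of the responses described above and shuts down a link only after repeated link integrity notifications. The event names, action names, threshold, and the `choose_action` function are invented for illustration and do not correspond to any vendor's driver or API.

```python
# Hypothetical policy table: event type -> first response to try.
POLICY = {
    "congestion": "throttle",
    "link_integrity": "start_link_recovery",
}


def choose_action(event, policy, repeat_count=1, shutdown_after=5):
    """Pick a response for a notification (illustrative only).

    Events with no policy entry are only logged. A link that keeps
    reporting integrity problems is shut down as a last resort.
    """
    if event not in policy:
        return "log"
    if event == "link_integrity" and repeat_count >= shutdown_after:
        return "shut_down_link"
    return policy[event]


first = choose_action("link_integrity", POLICY, repeat_count=1)
last_resort = choose_action("link_integrity", POLICY, repeat_count=5)
congested = choose_action("congestion", POLICY)
unknown = choose_action("something_else", POLICY)
```

Here `first` is `"start_link_recovery"`, `last_resort` is `"shut_down_link"`, `congested` is `"throttle"`, and `unknown` falls back to `"log"`.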
Some vendors that support FPIN today are:
Linux Multi-Path in RHEL 8.2
Emulex - supports Congestion and Link Integrity notifications on Linux
Marvell - will register for FPIN and log the notifications, which could be used as a source of log data for troubleshooting
AIX - will register for Link Integrity and Congestion notifications
We expect that more HBA and storage controller vendors will add support for FPIN in the future.
One use case for FPIN is congestion on an ISL or a path between devices. If a switch detects it, the switch could notify the device sending data so that the device could try sending data down another path without waiting for timeouts and path failover to happen.
A common cause of congestion is a speed mismatch between two devices that are zoned together. In these cases, the faster device can throttle back and send data to the slower device at a slower rate. Some caveats apply: this behavior would be vendor specific for storage systems or host adapters, and throttling data rates would only work on the host side unless a storage system could selectively throttle depending on the destination address.
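As a rough worked example of a speed mismatch, assume a 32GFC host port zoned to an 8GFC storage port, using nominal per-direction throughput figures of about 3200 MB/s and 800 MB/s. The speeds are assumptions chosen for illustration, and the calculation ignores everything except the two line rates.

```python
# Nominal per-direction throughput in MB/s (assumed example speeds).
host_rate = 3200      # 32GFC host port
storage_rate = 800    # 8GFC storage port

# Traffic the slower port cannot drain if the host sends at full rate.
excess = max(0, host_rate - storage_rate)

# Fraction of its own line rate the host would need to throttle to.
throttle = min(1.0, storage_rate / host_rate)

print(f"Excess: {excess} MB/s")                       # Excess: 2400 MB/s
print(f"Throttle to {throttle:.0%} of line rate")     # Throttle to 25% of line rate
```

With these assumed speeds, a host sending at full rate offers four times what the storage port can accept, so it would have to drop to a quarter of its line rate to stop the backlog.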
Another use case is link integrity issues. If a link is accumulating CRC errors or Invalid Transmission Words (ITWs), the physical link has a faulty component. A fibre-channel cable can be bad in only one direction, so it is possible that the device at one end of a link is not aware of any issues. The Link Integrity FPIN will notify the host adapter if a path is compromised. The adapter can then determine whether it should try another path by having the multi-path driver fail over. This would happen at the hardware level, long before the problem bubbled up to the software layer.
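The failover decision can be sketched in a few lines. This is a simplified software illustration, not an actual adapter or multi-path driver. The path names and the `pick_path` helper are hypothetical.

```python
def pick_path(paths, degraded):
    """Return the first path not flagged by a Link Integrity notification.

    Falls back to the first path if every path has been flagged, since
    a compromised path may still be better than no path at all.
    """
    for path in paths:
        if path not in degraded:
            return path
    return paths[0]


paths = ["hba0-portA", "hba1-portB"]     # hypothetical path names
degraded = set()

active_before = pick_path(paths, degraded)   # no notifications yet
degraded.add("hba0-portA")                   # notification received for this path
active_after = pick_path(paths, degraded)    # traffic moves to the other path
```

Before the notification the first path is used. After it, the helper returns `hba1-portB` without waiting for I/O on the compromised path to time out.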
One final note: remember that an FPIN can be sent from any device that supports it. Potentially the storage, the host, or the switch can share this information. If they all had the capability to re-route data based on these notifications, the SAN would be that much closer to an autonomous, self-healing network that routes data around blockages as best it can.