Friday, April 26, 2019

Troubleshooting CRC Errors On Fibre-channel Fabrics




There is no "Easy Button" for troubleshooting CRC errors. It is an iterative process: you make a change, you monitor your fabric, and if necessary you make more changes until the issues are resolved. I frequently have customers who want it to be a one-step process. It can be, but it usually takes multiple steps. Having said that, before we can fix CRC errors, we need to know what they are and why they occur.

What Are CRC Errors? When Do They Occur?

The simple answer is that CRC errors are damaged frames. The more complicated answer is that before a fibre-channel frame is sent, some math (a cyclic redundancy check, or CRC) is done on its contents. The answer is added to the frame footer. When the receiver gets the frame, the receiver repeats the math. If the receiver gets a different answer than what's recorded in the frame, then the frame was changed in flight. This is a CRC error. These almost always happen because something in the physical plant (cabling, SFPs) is defective. It is much less common, but still possible, to have a bad component in the switch. Troubleshooting that will be a separate blog post, someday. What the receiver does with the damaged frame depends on whether it's a switch or an end device and, if it is a switch, what brand of switch.
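The check described above can be sketched in a few lines of Python. This is only an illustration: it uses zlib.crc32 as a stand-in for the CRC math, and the helper names and the "payload plus 4-byte footer" layout are assumptions for the example, not the actual fibre-channel frame format.

```python
import zlib

def add_crc(payload: bytes) -> bytes:
    """Sender side: do the math over the payload and append the answer."""
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(4, "big")

def crc_ok(frame: bytes) -> bool:
    """Receiver side: repeat the math and compare with the recorded answer."""
    payload, recorded = frame[:-4], int.from_bytes(frame[-4:], "big")
    return zlib.crc32(payload) == recorded

frame = add_crc(b"example payload")

# Simulate damage in flight by flipping a single bit of the first byte.
damaged = bytes([frame[0] ^ 0x01]) + frame[1:]

print(crc_ok(frame))    # True: frame arrived unchanged
print(crc_ok(damaged))  # False: this is what a CRC error looks like
```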

Why Does Fixing These Matter?

At best, the effect of faulty links is a few dropped frames. Left unchecked, the problem will get worse and eventually cause performance problems. Also, you will go from 1 or 2 bad links to many. A customer I have been working with for the last several months was in this situation and is finally finishing a very long process of cleaning up many faulty links. Years ago I had a customer that was experiencing extremely long delays on their Brocade fabric. They had over-redundancy (there is such a thing) on the switches and links between the hosts and the storage. Many of the links were questionable and producing CRC errors. When the storage received a bad frame, it simply dropped it and did not send an ABTS (Abort Sequence). They also had an adapter in the host with a bug in it, and it would simply sit and wait for the storage to respond. 90 or so seconds later, the application would time out and initiate recovery for a problem that should never have happened.

Why Does It Matter What Brand Of Switch It Is?

First, the different brands of switches use different commands to obtain the data you need to troubleshoot these problems. Second, the way that they check and forward frames is different. This requires a different technique depending on the brand of switch. Cisco switches are what is called store-and-forward. This means that they wait for the entire frame to be received, then they check it, and if the frame is valid it gets forwarded. If not, it is dropped. Brocade switches are cut-through. As soon as they receive enough of the frame to know where it's going, they start forwarding it. If the frame ends up being bad, they try to correct it using Forward Error Correction. If that doesn't work, the frame is tagged as bad. For the most part, end devices that receive frames that are already tagged as bad simply drop the frame and initiate recovery via ABTS. Troubleshooting commands and techniques vary for Brocade vs Cisco fabrics.
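A toy model may help make the difference concrete. The sketch below is not how either vendor implements forwarding; the function names are made up for the example, zlib.crc32 stands in for the real CRC, and the Forward Error Correction step is left out. It only shows the consequence that matters for troubleshooting: a store-and-forward switch drops the bad frame where it is detected, while a cut-through switch has already started sending it and can only tag it.

```python
import zlib

def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC to the payload (illustrative layout only)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def crc_ok(frame: bytes) -> bool:
    return zlib.crc32(frame[:-4]) == int.from_bytes(frame[-4:], "big")

def store_and_forward(frame: bytes):
    """Wait for the whole frame, check it, forward only if valid.
    Returns the forwarded frame, or None if it was dropped here."""
    return frame if crc_ok(frame) else None

def cut_through(frame: bytes):
    """Forwarding starts before the end of the frame arrives, so a bad
    frame still goes out; it is only tagged as bad at the end.
    Returns (forwarded_frame, tagged_bad)."""
    return frame, not crc_ok(frame)

good = make_frame(b"data")
bad = bytes([good[0] ^ 0x01]) + good[1:]   # one bit damaged on the link

print(store_and_forward(bad))   # None: dropped at the detecting switch
print(cut_through(bad)[1])      # True: forwarded downstream, tagged as bad
```

In the cut-through case the bad frame keeps travelling, which is why the port that reports the error on a Brocade fabric is not always on the faulty link.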

Identifying CRCs on Cisco Fabrics

Since Cisco fabrics are store-and-forward, you know that frames with CRC errors will be dropped as soon as they are detected. This can be either at the switch port they arrive on, or more rarely inside the switch. This post will focus on the CRC errors detected at the switch ports. If you suspect that you have questionable links, you can use these commands to check switch ports for CRC errors:
  • 'show interface'
  • 'show interface counters'
  • 'show logging log'
The 'show interface' and 'show interface counters' commands can be run for a specific switch port that you are interested in. The port is given in the format fcS/P, where S is the slot and P is the port. For 'show logging log', you are looking for messages that a port was disabled because the bit error rate was too high. This is often an indicator of a faulty link. Once you find the ports that are detecting the CRC errors, you can then proceed to the repair phase.
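If you are checking many ports, it can help to generate the command strings rather than type them. The helper below is a hypothetical convenience function, not part of any Cisco tooling; it only validates the fcS/P format described above and builds the commands listed in this post for one port.

```python
import re

def cisco_crc_commands(interface: str) -> list:
    """Build the commands from this post for one port given as fcS/P.
    Hypothetical helper: it only formats strings and runs nothing."""
    if re.fullmatch(r"fc\d+/\d+", interface) is None:
        raise ValueError(f"expected fcS/P (slot/port), got {interface!r}")
    return [
        f"show interface {interface}",
        f"show interface {interface} counters",
        "show logging log",
    ]

for command in cisco_crc_commands("fc1/5"):
    print(command)
```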

Identifying CRCs on Brocade Fabrics

Brocade fabrics use cut-through routing. As such, the link for the port that is detecting the CRC errors may not be the faulty link. Brocade has two statistics for CRCs: CRC and CRC_Good_EOF. If the CRC_Good_EOF counter is increasing, the link it is increasing on is the source of the problem. If the CRC counter is increasing, the frame had already been marked as bad when it arrived, and the problem is occurring elsewhere on the SAN. CRC_Good_EOF should be the only counter that increases on a device port.

If the CRC_Good_EOF counter is increasing on an ISL port, the link between the sending and receiving switch is bad. If the CRC counter is increasing on the ISL, the problem is occurring somewhere on the sending switch, so move to the sending switch and look for ports where CRC_Good_EOF is increasing. It is possible that both counters will increase on a link. If it is a device port, then the link is bad. If it is an ISL, then the link itself is a problem and the sending switch also has other bad links attached to it. As you can see, there are a few more steps to identify the source of the CRC errors on Brocade before you can proceed to the repair phase.

The porterrshow output may also show ports that do not have CRC_Good_EOF increasing but do show a counter called PCS increasing. If so, this is also an indication of a bad link. Troubleshooting PCS errors is the same as troubleshooting CRC_Good_EOF errors. The commands to use are:
  • 'porterrshow'
  • 'portstatsshow N'
The porterrshow command displays error stats for all ports. The portstatsshow N command, where N is a port index number, displays more detailed stats for the specified port. If you see PCS errors increasing for a port in the porterrshow output, the link on that port is bad, regardless of what the CRC or CRC_Good_EOF counters are doing.
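The Brocade rules above boil down to a small decision table. Here is a minimal sketch of it, assuming you have already worked out by hand how much each counter increased on a port since the stats were last cleared. The function name, parameters, and message strings are made up for the example, and nothing here parses real porterrshow output.

```python
def diagnose_brocade_port(is_isl: bool, crc: int, crc_good_eof: int, pcs: int = 0) -> list:
    """Apply the rules from this post to one port.
    Arguments are counter increases since the last stats clear."""
    findings = []
    # CRC_Good_EOF or PCS increasing: the link on this port is the problem.
    if crc_good_eof > 0 or pcs > 0:
        findings.append("link on this port is bad")
    # CRC increasing: the frame was already marked bad before it got here.
    if crc > 0:
        if is_isl:
            findings.append("check sending switch for ports with CRC_Good_EOF increasing")
        elif not findings:
            findings.append("frame already marked bad; source is elsewhere on the SAN")
    return findings

# ISL port where both counters are increasing: bad link and a bad upstream link.
print(diagnose_brocade_port(is_isl=True, crc=40, crc_good_eof=7))
# Device port with only CRC_Good_EOF increasing: replace parts on this link.
print(diagnose_brocade_port(is_isl=False, crc=0, crc_good_eof=3))
```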

Correcting the Problem

Once you have identified the port(s) that have questionable links, you need to correct the problem. As I mentioned earlier, this is an iterative process. You replace a part, then clear the switch statistics, then monitor for anywhere from several hours to a day, depending on the rate of increase. Repeat the process until the errors are no longer increasing. You can replace multiple parts at once, such as replacing a cable and an SFP at the same time. Another option is to isolate further by just swapping a cable, or moving the device to a new port on the switch. Just remember that it is critical to reset the statistics immediately after any change you make.

REMEMBER THAT PATCH PANELS ARE PART OF CABLING. I emphasize that because customers will often replace the cable between the switch/device and the panel and forget that there is cabling between patch panels, which is also suspect.

Some years ago I went on site to troubleshoot connectivity between two storage systems. The storage systems were located at different campuses in the same city. The replication paths would not stay up. When I got there, the client had them directly connected through several patch panels with no switching. I assisted them in putting the cabling through switches at each campus and immediately saw CRCs showing up on the links. They had 8 hops across patch panels between the storage systems. We found CRCs at the second hop at each side. I stopped checking after that. Their eventual permanent fix was to run a new direct run of cable between the two locations.
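The "replace, clear, monitor, repeat" loop comes down to comparing two sets of counter values taken some hours apart. A minimal sketch of that comparison follows. The counter names and numbers are invented for the example, and the snapshots are assumed to have been collected by hand after clearing the statistics.

```python
def increase_per_hour(before: dict, after: dict, hours: float) -> dict:
    """Return only the counters that grew between two snapshots,
    expressed as an hourly rate. Empty result means nothing is increasing."""
    rates = {}
    for name, value in after.items():
        delta = value - before.get(name, 0)
        if delta > 0:
            rates[name] = delta / hours
    return rates

# Hypothetical snapshots: right after clearing stats, and six hours later.
just_cleared = {"crc": 0, "crc_good_eof": 0, "pcs": 0}
six_hours_on = {"crc": 0, "crc_good_eof": 12, "pcs": 0}

print(increase_per_hour(just_cleared, six_hours_on, hours=6))
# {'crc_good_eof': 2.0}
```

An empty result after a full monitoring window is the signal that the last change fixed the link. Anything else means another round of replacing parts.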
If you have any questions, leave them in the comments or find me on LinkedIn or on Twitter.
