How Physical Link Errors Can Cause Performance Problems
I am frequently involved in troubleshooting performance problems on storage networks. Performance issues have many possible causes, and physical layer (link) problems are one of them. Physical link issues cannot be completely (or at least permanently) prevented, because the hardware involved (cabling and transceivers) degrades over time. Loose or damaged cables, degraded transceivers or, more rarely, failing switch ports can all cause link layer problems. The most common indicator of a physical layer problem is CRC errors occurring on a link. Read this blog post for more information on CRC errors and how to troubleshoot them.
While a few CRC errors in a day or an hour are easily handled by the fabric, if they begin incrementing rapidly enough, the physical link conditions that are present when CRC errors appear will affect performance. Fibre Channel uses a buffer credit scheme for flow control. Under normal conditions, after the sender transmits a frame, it decrements its buffer count, which controls how many frames it can send. The receiver receives the frame and processes it: a host sends the frame up the stack to the application, while a switch forwards the frame toward its destination. Once the frame has been processed and offloaded from the adapter, the receiver sends a special primitive (R_RDY) to the sender. The sender then increments its buffer count. If the sender's buffer count reaches 0, it cannot send any more frames until it receives an R_RDY. This is buffer credit exhaustion. For more information on buffer credits and how they can affect performance, watch this video.
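The credit accounting described above can be sketched in a few lines of Python. This is a minimal model for illustration only: the `CreditedSender` class, its method names, and the credit count of 4 are assumptions, not part of any switch or HBA API.

```python
class CreditedSender:
    """Tracks transmit credits for one link (hypothetical model, not a real API)."""

    def __init__(self, credits):
        self.max_credits = credits  # credits granted at login (assumed value)
        self.credits = credits
        self.frames_sent = 0

    def send_frame(self):
        """Send one frame if a credit is available; return True on success."""
        if self.credits == 0:
            return False  # buffer credit exhaustion: must wait for an R_RDY
        self.credits -= 1
        self.frames_sent += 1
        return True

    def receive_r_rdy(self):
        """An R_RDY returns one credit, capped at the granted maximum."""
        self.credits = min(self.credits + 1, self.max_credits)


sender = CreditedSender(credits=4)

# Four frames go out; the fifth is blocked because no R_RDY has come back yet.
sent = [sender.send_frame() for _ in range(5)]

# One R_RDY arrives, so exactly one more frame can be sent.
sender.receive_r_rdy()
resumed = sender.send_frame()
```

Here `sent` records four successful sends followed by one blocked attempt, and `resumed` shows that a single R_RDY is enough to let the sender transmit again.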
While the R_RDY is a primitive and does not have a CRC to check, the same conditions that generate CRC errors can also corrupt the R_RDY. If this happens, the sender may not recognize the R_RDY and so will not increment its buffer count. Each R_RDY lost this way permanently costs the sender one credit, so if it keeps happening the sender's buffer count eventually drops to 0. The receiver won't send another R_RDY because it has already sent one for every frame it received. The sender can't send frames because its buffer count is 0. If the buffer credit count remains at 0 for long enough, the sender will reset the link to recover the credits.
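To make the credit-loss cycle concrete, here is a rough simulation, assuming a deliberately simplified link in which every Nth R_RDY is corrupted and each R_RDY otherwise returns immediately. The function name and all of its parameters are hypothetical knobs chosen for illustration; real timeout values and loss patterns will differ.

```python
def simulate_link(credits, frames_to_send, lost_r_rdy_every, reset_after_ticks):
    """Return (frames_sent, link_resets) for a link that loses every Nth R_RDY.

    Simplified model: one frame per loop iteration, R_RDY returned at once.
    """
    max_credits = credits
    frames_sent = 0
    r_rdy_count = 0
    link_resets = 0
    ticks_at_zero = 0

    while frames_sent < frames_to_send:
        if credits == 0:
            # No credits left: nothing can be sent and no R_RDY is outstanding.
            ticks_at_zero += 1
            if ticks_at_zero >= reset_after_ticks:
                credits = max_credits  # the link reset restores all credits
                link_resets += 1
                ticks_at_zero = 0
            continue

        # Send one frame; the receiver processes it and returns an R_RDY.
        credits -= 1
        frames_sent += 1
        r_rdy_count += 1
        if r_rdy_count % lost_r_rdy_every != 0:
            credits += 1  # R_RDY arrived intact, credit returned
        # Otherwise the R_RDY was corrupted and that credit is lost.

    return frames_sent, link_resets


healthy = simulate_link(credits=4, frames_to_send=100,
                        lost_r_rdy_every=1000, reset_after_ticks=3)
degraded = simulate_link(credits=4, frames_to_send=100,
                         lost_r_rdy_every=10, reset_after_ticks=3)
```

In this model the healthy link finishes with no link resets, while the degraded link needs two resets to move the same 100 frames, and it spends idle ticks at zero credits before each one.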
Both the time spent with zero buffer credits and the repeated link resets contribute to the performance problem. However, physical layer issues can be easily detected and resolved before they become severe enough to affect performance. Ways you can proactively monitor your fabrics include:
- Broadcom MAPS feature (Broadcom switches only)
- Cisco Port-Monitoring Policies (Cisco switches only)
- IBM Storage Insights (storage, host and SAN switch)
- IBM Storage Control on-prem (storage, host and SAN switch)
Check out this blog post with further information on all of the above options.
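Whichever tool you use, the underlying idea is to watch how fast the CRC counter is incrementing rather than its absolute value. The sketch below shows that calculation, assuming you already have timestamped samples of a port's cumulative CRC counter. The function names, the sample format, and the threshold of 5 errors per minute are all hypothetical and are not defaults from any of the products listed above.

```python
def crc_errors_per_minute(samples):
    """Average CRC error rate across the sampled interval.

    samples: list of (timestamp_seconds, cumulative_crc_count), oldest first.
    """
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must span a positive time interval")
    return (c1 - c0) * 60.0 / elapsed


def needs_attention(samples, threshold_per_minute):
    """True when the CRC rate meets or exceeds the chosen threshold."""
    return crc_errors_per_minute(samples) >= threshold_per_minute


# Two CRC errors in five minutes: the fabric handles this easily.
quiet_port = [(0, 100), (300, 102)]
# 150 CRC errors in five minutes: worth checking the cable and transceivers.
noisy_port = [(0, 100), (300, 250)]

THRESHOLD = 5  # errors per minute, an illustrative value only
quiet_flagged = needs_attention(quiet_port, THRESHOLD)
noisy_flagged = needs_attention(noisy_port, THRESHOLD)
```

A real monitor would also need to handle counters that are cleared or wrap between samples, which this sketch does not attempt.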