Using Performance Data To See Network Problems



I frequently work cases where the underlying issue is performance: either an entire system or an application is slow enough that users are affected. Another frequent performance problem involves storage-side replication. In these cases replication cannot keep up with the production workload and RPOs are not being met. Replication most commonly runs between sites, though I have worked a few cases with same-site (or campus) replication.

Whether you are using IBM DS8000 PPRC/Global Mirror, IBM SVC/FlashSystem Global Mirror (GM), or Global Mirror with Change Volumes (GMCV), you expect the replicated data to be current up to a certain point in time behind the production data. This is your Recovery Point Objective (RPO): how current the replicated data needs to be. For data that doesn't change often, an RPO of 30 minutes or an hour might be enough. For data that changes frequently, an RPO of a few minutes might be required. For weekly reporting, your RPO could be a few days or a week.

On IBM SVC or IBM FlashSystem, both GM and GMCV are asynchronous replication, meaning the production data is replicated at some point after the host's write has completed. A technology like Metro Mirror (MM) is synchronous. This means the data is replicated as it is written, and good status on the write is not returned to the host until the data has been successfully replicated. While MM provides an always-current RPO, it does so at the risk of a network performance problem affecting production. Since most replication is site-to-site and across links that have at least some distance, this risk is higher with Metro Mirror.

For fibre-channel networks, the distance links can be either fibre-channel native protocol running on DWDM, ONS, or some other underlying physical topology that is transparent to the switches, or fibre-channel over IP (FCIP). FCIP uses TCP/IP networks to transmit the fibre-channel frames. The scenarios we will talk about in this blog post can happen on native FC networks but are much more common on FCIP. This is because fibre-channel is a lossless protocol that expects a lossless network. This is a fancy way of saying the protocol assumes the network transmission medium will not lose data during transmission. The data error checking and retransmission is done by the end devices. TCP/IP is the opposite. It assumes a lossy network, so there is a lot more overhead built into the protocol and the network itself.

Frequently, when a performance problem manifests on FCIP infrastructure, it is not obvious on the fibre-channel routers that are performing the FCIP function. You can look for clues such as the switches logging tunnel drops or other error messages. You can also do things like look at the IP statistics. However, those might not point to anything that clearly shows where the problem is occurring.

In the diagrams below, we will look at performance data collected from some storage systems to illustrate how we can show the problem is in the network. All of the charts below were captured from IBM Spectrum Virtualize (SVC or FlashSystem) storage systems. They all show two metrics: the Port to Remote Node Send Response Time, which is a measure of how long it is taking the remote cluster to respond to replication commands, and the Port to Remote Node Send Data Rate. They show three slightly different manifestations of the same scenario. Also, all of the data rates shown are well below the capacity of the networks, so none of these is a case of an overworked network.

An important note: this blog post assumes you have already looked at the response times for the partner cluster and ruled it out as the source of the problem. Before assuming the network is the issue, you have to look at the partner cluster to see what its Port to Remote Node Receive Response Times are. If they are elevated, then you would need to look at that cluster first.
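
That ordering of checks can be written down as a small sketch. The function name, the sample values, and especially the `elevated_ms` threshold are hypothetical. What counts as "elevated" depends on your link distance and normal baseline, so treat this as an illustration of the order of the checks, not a rule.

```python
def first_place_to_look(local_send_ms, partner_receive_ms, elevated_ms):
    """Suggest where to start, following the order described above.

    local_send_ms      -- Port to Remote Node Send Response Time on the local cluster
    partner_receive_ms -- Port to Remote Node Receive Response Time on the partner
    elevated_ms        -- hypothetical threshold for "elevated"; site specific
    """
    # Rule out the partner cluster before blaming the network.
    if partner_receive_ms >= elevated_ms:
        return "partner cluster"
    # Partner looks healthy but local sends are slow: suspect the links in between.
    if local_send_ms >= elevated_ms:
        return "network"
    return "no elevated response time"


# Hypothetical readings, in milliseconds
print(first_place_to_look(local_send_ms=12.0, partner_receive_ms=1.5, elevated_ms=10.0))
```

With these made-up readings the partner looks healthy while the local send response time is high, so the sketch points at the network.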

In this first example, the solid lines are the response time metric. The dashed lines are the data rates. Note that for the first third of the chart, the data rates are low. However, you can see that the response time is variable. You can also see that the peaks in response time correspond to peaks in the data rates. The response time should ideally be a flat, or nearly flat, line most of the time. The peaks in response time indicate variable latency on a regular basis. Since the peaks correspond to increased data rate, we can see this is workload related. We also know that this network has more capacity than the workload requires. The conclusion, then, is that there are problems on the underlying LAN/WAN that the FCIP tunnels are running on.
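
If you export the two metrics, you can put a number on "the peaks line up" rather than judging it by eye. The following is a minimal sketch that computes a Pearson correlation between data rate and response time. The sample values are made up for illustration and are not taken from the chart.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sample lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("one of the series is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical samples taken at the same intervals
data_rate_mb_s = [20, 25, 90, 30, 22, 110, 28, 95]
response_ms = [4.0, 4.2, 8.5, 4.5, 4.1, 9.8, 4.3, 8.9]

r = pearson(data_rate_mb_s, response_ms)
print(f"correlation between data rate and response time: {r:.2f}")
```

A value close to 1 means response time rises and falls with the data rate. On a link that is nowhere near capacity, that is the pattern described above.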



In example number two, the scenario is a bit different and less obvious. The dashed lines are the response times, and the solid lines with the higher peaks are the data rates. We can see the data rates mostly under 100 MB/sec, with peaks over 400 MB/sec. This translates to peaks of about 3 Gbps. The underlying network was rated at 10 Gbps, so this workload is well within the specifications. The response time looks a little better, since it is closer to flat; however, it should not vary this much with workload. Response times are increasing by 3-5 ms each time the workload goes up.
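
The unit conversion behind that comparison is simple enough to sketch. This assumes decimal units (1 MB = 1,000,000 bytes) and ignores FCIP and TCP/IP protocol overhead, so the real figure on the wire would be somewhat higher. The function names are illustrative only.

```python
def mb_per_sec_to_gbps(mb_per_sec):
    """Convert megabytes per second to gigabits per second (decimal units)."""
    return mb_per_sec * 8 / 1000


def link_utilization(mb_per_sec, link_gbps):
    """Fraction of the rated link speed consumed, ignoring protocol overhead."""
    return mb_per_sec_to_gbps(mb_per_sec) / link_gbps


peak_gbps = mb_per_sec_to_gbps(400)    # peak data rate from the example
peak_util = link_utilization(400, 10)  # against the 10 Gbps rated network

print(f"peak: {peak_gbps:.1f} Gbps, {peak_util:.0%} of the rated link")
```

Even at its peaks, the replication traffic uses roughly a third of the rated bandwidth, which is why a saturated link does not explain the response time changes.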


The last example is the clearest example of this scenario. The dashed lines represent the response times. The solid line with the much larger peaks is the workload. You can see the highly variable response time. This is a solid indication of a problem somewhere on the underlying network. A 5 ms variance in latency doesn't sound like much, but the latency should not be this variable on a regular basis.
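
One rough way to quantify "this variable" is to take the median response time as a baseline and count the samples that sit well above it. This is a sketch only: the sample values and the 3 ms threshold are hypothetical, and the right threshold depends on your link and its normal latency.

```python
from statistics import median


def latency_excursions(response_ms, threshold_ms):
    """Return the median baseline and the samples exceeding it by more than threshold_ms."""
    baseline = median(response_ms)
    excursions = [s for s in response_ms if s - baseline > threshold_ms]
    return baseline, excursions


# Hypothetical response time samples, in milliseconds
samples_ms = [2.0, 2.1, 7.2, 2.0, 2.2, 6.9, 2.1, 7.5, 2.0, 2.1]

baseline, excursions = latency_excursions(samples_ms, threshold_ms=3.0)
print(f"baseline {baseline:.1f} ms, {len(excursions)} of {len(samples_ms)} samples "
      f"more than 3 ms above it")
```

A healthy link should produce few or no excursions. Repeated jumps of several milliseconds above the baseline are the kind of evidence worth taking to the network team.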


Hopefully this shows how you can use performance data to help identify the source of some replication problems. While you would still need to troubleshoot the network itself, this should at least help you determine where the problem is, and give you something to show your network team to confirm the source of the problem. All of the above charts came from IBM Storage Insights. I strongly recommend using Storage Insights to help manage your storage and fabric. You can find out more about Storage Insights here:

Getting Started with IBM Storage Insights
