Posts

Showing posts from 2019

Fabric Zoning for the IBM Spectrum Virtualize and FlashSystem NPIV Feature

Zoning Basics
Before I talk about some zoning best practices, I should explain the two types of zoning and how they work: WWPN zoning and switch-port zoning.
World-wide Port Name (WWPN) Zoning
WWPN zoning is also called "soft" zoning and is based on the WWPN that is assigned to a specific port on a fibre-channel adapter. The WWPN serves a similar function to a MAC address on an Ethernet adapter. WWPN-based zoning uses the WWPNs of devices logged into the fabric to determine which devices can connect to which other devices. Most fabrics are zoned using WWPN zoning. It is more flexible than switch-port zoning: a device can be plugged in anywhere on the SAN (with some caveats beyond the scope of this blog post) and still connect to the other devices it is zoned to. It also has one distinct advantage over switch-port zoning, which is that zoning can always be specified at the level of a single WWPN.
Switch-Port Z

New Advisor Features in IBM Storage Insights

IBM Storage Insights was updated recently. Two new dashboards were added: the Advisor Dashboard and the Notifications Dashboard. IBM Storage Support can see the events in these dashboards, which in some cases allows Support to more quickly identify problems or other issues that need to be addressed. Like all tables in Storage Insights, both the Advisor and Notifications Dashboards can be filtered and sorted. You can also export to PDF, CSV or HTML so that you can do additional sorting and filtering, or share items from each of the tables.
The Advisor Dashboard
The Advisor Dashboard is found under the Insights menu. This dashboard lists recommendations for changes that you can make to improve the stability of your managed IBM Storage systems. The recommendations include configuration changes, firmware upgrades, and other changes to enhance the stability and performance of your IBM Storage. Below you see a capture of an Advisor Dashboard and some examples of the items

Using IBM Storage Insights Pro and Alert Policies To Monitor Host Path Count

I was at a recent TechU event and had a discussion with a customer about using IBM Storage Insights to monitor host path count. More specifically, the customer recently had an outage on a few hosts after doing some maintenance work on some storage, and the affected hosts were not connected on all the expected paths.
There are options for using the storage data collections to review host connections. For instance, you could write a script that compares the SVC/Storwize host WWPN definitions to the connected devices to see if any WWPNs are not connected to the SVC. However, it is much more straightforward to configure an alert in IBM Storage Insights Pro. Alert Policies were previously covered in these two videos:
You can create an Alert Policy for an agentless host and then configure an alert to notify you when the path count changes. You would need to create a separate policy for each operating system, but after you create the first policy and define the alert, you can
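The script approach mentioned above could look something like the following minimal sketch. It assumes you have already extracted two things from your data collection: the WWPNs defined for each host object, and the WWPNs currently logged in to the SVC. The function names and the sample host names and WWPNs are hypothetical, and how you extract the two lists from your own collection is left out.

```python
def normalize_wwpn(wwpn: str) -> str:
    """Return a WWPN as 16 lowercase hex digits with no separators."""
    cleaned = wwpn.replace(":", "").replace("-", "").strip().lower()
    if len(cleaned) != 16 or any(c not in "0123456789abcdef" for c in cleaned):
        raise ValueError(f"not a valid WWPN: {wwpn!r}")
    return cleaned


def missing_wwpns(defined, logged_in):
    """Map each host name to its defined WWPNs that are not logged in."""
    active = {normalize_wwpn(w) for w in logged_in}
    report = {}
    for host, wwpns in defined.items():
        absent = [normalize_wwpn(w) for w in wwpns
                  if normalize_wwpn(w) not in active]
        if absent:
            report[host] = absent
    return report


# Hypothetical sample data, not taken from a real system.
defined_hosts = {
    "esx01": ["10:00:00:00:C9:AA:BB:01", "10:00:00:00:C9:AA:BB:02"],
    "esx02": ["10:00:00:00:C9:AA:CC:01", "10:00:00:00:C9:AA:CC:02"],
}
logged_in_wwpns = [
    "10:00:00:00:c9:aa:bb:01",
    "10:00:00:00:C9:AA:BB:02",
    "10:00:00:00:c9:aa:cc:01",
]

result = missing_wwpns(defined_hosts, logged_in_wwpns)
for host, wwpns in result.items():
    print(f"{host}: not logged in: {', '.join(wwpns)}")
```

Normalizing the WWPNs first matters because different tools print them with or without colons and in different case.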

Troubleshooting CRC Errors On Fibre-channel Fabrics

There is no "Easy Button" for troubleshooting CRC errors. It is an iterative process. You make a change, you monitor your fabric, and if necessary you make more changes until the issues are resolved. I frequently have customers who want it to be a one-step process. It can be, but it usually takes multiple steps. Having said that, before we can fix them, we need to know what CRC errors are and why they occur.
What Are CRC Errors? When Do They Occur?
The simple answer is that CRC errors are damaged frames. The more complicated answer is that before a fibre-channel frame is sent, some math is done. The answer is added to the frame footer. When the receiver gets the frame, the receiver repeats the math. If the receiver gets a different answer than what's recorded in the frame, then the frame was changed in flight. This is a CRC error. The only time these happen is if the physical plant - cabling, SFPs is somehow
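To make the "math" concrete, here is a small sketch of the send-side and receive-side check. It is only an illustration: Python's `zlib.crc32` stands in for the frame CRC, and the frame layout (payload followed by a 4-byte checksum) is a simplification of real fibre-channel framing. The helper names are made up for this example.

```python
import zlib


def add_crc(payload: bytes) -> bytes:
    """Sender side: compute the checksum and append it as a 4-byte footer."""
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(4, "big")


def frame_is_intact(frame: bytes) -> bool:
    """Receiver side: repeat the math and compare with the recorded answer."""
    payload, recorded = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == recorded


frame = add_crc(b"example payload")

# Simulate damage in flight by flipping a single bit in the payload.
damaged = bytearray(frame)
damaged[3] ^= 0x01
damaged = bytes(damaged)

sent_ok = frame_is_intact(frame)
damaged_ok = frame_is_intact(damaged)
print("undamaged frame passes:", sent_ok)
print("damaged frame passes:", damaged_ok)
```

Note that the check only tells the receiver that the frame changed, not where on the path the damage happened.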

Advanced Alert Policies in IBM Storage Insights Pro

IBM released a new Alert Policies feature for IBM Storage Insights Pro the first week of March 2019. John Langlois does an excellent job introducing the new feature here:
There are a few more advanced aspects of alert policies that John did not cover. First, you can add and remove managed storage from Alert Policies from the Alerts Definitions of the storage system that you want to modify. The following video shows how to do this.
Second, you must remember that if you add storage that has never been in an Alert Policy to an Alert Policy, any existing alerts defined on that storage are lost and cannot be retrieved. The following video shows an example of this and shows a workaround to preserve those alerts if you want to re-apply them at some point in the future.

Cisco Automatic Zoning

Cisco released a feature in NX-OS v8.3.1 called Automatic Zoning. The feature does exactly what the name suggests: it automatically configures zoning for the devices on your SAN. You can see a video on the feature here:
What Is Zoning?
SAN (Storage Area Network) zoning specifies which devices on the SAN can communicate with which other devices. Devices are added to a zone. A zone is a group of devices that can communicate. Zones are then added to a zoneset. The zoneset is then activated, and this is the configuration that is in effect. There can be multiple zonesets, but only one can be active at any given time. By default, any device that is not zoned (not a member of a zone) cannot communicate with any other device. A device that is in at least one zone is considered zoned. Effective zoning prevents unauthorized devices from talking to each other and minimizes disruptions on the SAN if a device misbehaves.
How Cisco Automatic Zoning Works
When a SAN is first con
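The zoning rules described above can be modeled in a few lines. This is a sketch of the concept only, not of how a switch stores or enforces zoning; the zoneset, zone, and device names are hypothetical.

```python
def is_zoned(active_zoneset, device):
    """A device is zoned if it is a member of at least one active zone."""
    return any(device in members for members in active_zoneset.values())


def can_communicate(active_zoneset, a, b):
    """Two devices can talk only if some active zone contains both."""
    return any(a in members and b in members
               for members in active_zoneset.values())


# Several zonesets can be defined, but only one is active at a time.
zonesets = {
    "zoneset_prod": {
        "zone_host1": {"host1", "storage1"},
        "zone_host2": {"host2", "storage1"},
    },
    "zoneset_test": {
        "zone_lab": {"host3", "storage1"},
    },
}
active = zonesets["zoneset_prod"]

print(can_communicate(active, "host1", "storage1"))  # share zone_host1
print(can_communicate(active, "host1", "host2"))     # no common zone
print(can_communicate(active, "host3", "storage1"))  # only in inactive zoneset
```

In this model `host3` is defined in a zoneset, but because that zoneset is not the active one, `host3` is treated as not zoned and cannot reach anything.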

Why Increasing Buffer Credits Is Not A Cure For A Slow Drain Device

When I am working on performance problems, a question I frequently get is why increasing buffer credits for a particular port is not a fix for a slow drain device attached to that port. In this video, I explain the concept of congestion and illustrate why increasing the number of buffer credits is not a fix unless you also address the underlying cause of the congestion. There are some exceptions to this rule. The most common is when dealing with long-distance links, but that will be addressed in a future blog post (and perhaps a future video). As always, you can leave feedback in the comments, or find me on LinkedIn or Twitter.
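For readers who prefer code to video, the point about buffer credits can be illustrated with a toy simulation. It is a deliberately simplified model with made-up numbers, not a description of real switch behavior: the sender can transmit one frame per tick whenever it holds a credit, and the slow device frees one buffer (returning one credit) only every `drain_interval` ticks.

```python
def frames_sent(credits: int, drain_interval: int, ticks: int) -> int:
    """Count frames a sender gets out to a slow-draining receiver."""
    available = credits   # credits the sender currently holds
    in_buffer = 0         # frames waiting in the slow device's buffers
    sent = 0
    for tick in range(1, ticks + 1):
        # The slow device frees one buffer every drain_interval ticks.
        if in_buffer > 0 and tick % drain_interval == 0:
            in_buffer -= 1
            available += 1
        # The sender transmits one frame per tick if it holds a credit.
        if available > 0:
            available -= 1
            in_buffer += 1
            sent += 1
    return sent


few = frames_sent(credits=8, drain_interval=10, ticks=10_000)
many = frames_sent(credits=64, drain_interval=10, ticks=10_000)
print("frames sent with 8 credits: ", few)
print("frames sent with 64 credits:", many)
print("difference:", many - few)
```

In this model the extra credits only buy a one-time burst equal to the number of credits added. After that, the sender is paced by how fast the device drains, so the sustained rate is the same in both runs.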

Troubleshooting IBM Storage Insights Pro Alerts

Recently, enhancements were made to several features in IBM Storage Insights Pro, including a new Alert Policy feature. You can find out what's new here. The Alert Policy feature lets you configure a set of alerts into a policy and apply all of them across multiple storage systems. In this way you can ensure that alerts are consistent without having to define the same alert on each individual storage system. Once you define the alerts, IBM Storage Support representatives can see the generated alerts on a storage system. For the IBM FS9100, there are a number of alerts that are already defined. When one of those alerts is triggered, a proactive ticket is opened and the experts at IBM Storage Support investigate the alert, then take whatever action is necessary. In this post we'll take a look at how the IBM Storage Support Team investigates alerts. For this example we are using an alert for the Port Send Delay I/O Percentage.

I Will Be At Technical University in Atlanta

IBM Tech U Atlanta 2019
This is just a quick post to say that I will be at IBM Technical University in Atlanta. I will be there from April 29 through May 3. My sessions for this event are:
s106417 The Path of an FC Frame Through a Cisco MDS Director
s106420 Proactive Monitoring of a Cisco Fabric
s106421 Troubleshooting Cisco SAN Performance Issues - Part 1
s106422 Troubleshooting Cisco SAN Performance Issues - Part 2
Find these sessions and many more at Technical University. Click the banner at the top of the page to register or for more information. As always, if you have any questions, leave them in the comments at the end of this blog or find me on LinkedIn or Twitter.

Why Low I/O Rates Can Result In High Response Times

Why Low I/O Rates Can Result in High Response Times for Reads and Writes
As IBM Storage Insights and Storage Insights Pro become more widely adopted, many companies that weren't doing performance monitoring previously are now able to see the performance of their managed storage systems. With the Alerting features in Storage Insights Pro, companies are much more aware of performance problems within their storage networks. One common question that comes up is why a volume with low I/O rates can have very high response times. Often these high response times are present even with no obvious performance impact at the application layer. These response time spikes are generally measured in the 10s or 100s of milliseconds, but can be a second or greater. At the same time, the I/O rates are low - perhaps 10 I/Os per second or less. This can occur on either read or write I/Os. As an example, this picture shows a typical pattern of generally low I/O rates with a high response time. The volume

Working With Groups in IBM Storage Insights Pro

IBM Storage Insights Pro Groups Feature
In addition to the Reporting and Alerting features that are not available in the free version of IBM Storage Insights, the subscription-based offering, IBM Storage Insights Pro, has a very useful feature called Groups. Groups allow you to bundle together related storage resources for ease of management. For example, you might group together a set of volumes that are all related to a specific application or server cluster. You might group the hosts that make up a cluster. You can even group ports together: you could define a group of ports for an SVC Cluster that includes all the ports used for inter-node communication. Such a group would be very handy for your IBM Storage Support person. It would potentially save Support from having to collect a support log and dig through it. It would certainly make analysis go faster when troubleshooting an issue. So to get started with Groups, the first thing to do is to s

IBM Storage Insights: Tips and Answers to Questions Frequently Asked

Answers To Some Frequently Asked Questions about IBM Storage Insights
Over the last several months I have seen some common questions that are asked about IBM Storage Insights. I started collecting them and will answer them here. These questions are all about Storage Insights itself. Questions relating to managing specific types of storage with Storage Insights will be answered in future blog posts. So, on to the questions.
Q: Can I install a data collector on the same system as my IBM Spectrum Control server?
A: Yes. However, you need to pay attention to the memory and CPU usage of your Data Collector.
Authentication
If you use username/password authentication, configure a dedicated user ID for the Data Collector on your storage systems. Do not use the default or another Admin account. This allows for effective auditing and reduces security risks.
Q: What Are The Recommended System Specifications for the Data Collector?
A: The hard drive space requirem

The Importance of Keeping Your Entire Solution Current

Recently I started working on a new case for a customer. I'm trying to diagnose repeated error messages being logged by an IBM SVC Cluster that indicate problems communicating with the back-end storage that is being virtualized by the SVC. These messages generally indicate SAN congestion problems. The customer has Cisco MDS 9513 switches installed. They're older switches, but not all that uncommon. What is uncommon is finding the switches at NX-OS version 5.X.X. I do see down-level firmware, but this one is particularly egregious. This revision is several years out of date. Later versions of code contain numerous bug fixes, both from Cisco and from the associated upstream Linux security updates that get incorporated into NX-OS. Also, while NX-OS versions don't officially go out of support, any new bugs identified won't be fixed, as this version is no longer being actively developed. This level of firmware merits further investigation. Looking deeper on the switche

How Storage Insights Can Enhance Your Support Experience

An Introduction
The first week in May, IBM announced IBM Storage Insights. As of 11 June, Storage Insights has these key items:
IBM Blue Diamond support
Worldwide support for opening tickets
Custom dashboards
New dashboard table view
Clients can now specify whether IBM Storage Support can collect support logs. This is done on a per-device basis.
You can get a complete list of the new features here: Storage Insights New Features
There are some other new features, such as new capacity views on the Storage Insights Dashboard. With these new features, especially support for IBM Blue Diamond customers, Storage Insights is an increasingly important and valuable troubleshooting tool. My team here is seeing more and more customers that are using Storage Insights. I thought I would discuss the potential benefits of Storage Insights as a troubleshooting tool.
Some Background
The problems my team fixes can be categorized as either: Root-cause analysis (RCA)

Troubleshooting SVC/Storwize NPIV Connectivity

Some Background:
A few years ago IBM introduced the virtual WWPN (NPIV) feature to the Spectrum Virtualize (SVC) and Storwize products. This feature allows you to zone your hosts to a virtual WWPN (vWWPN) on the SVC/Storwize cluster. If the cluster node has a problem, or is taken offline for maintenance, the vWWPN can float to the other node in the I/O group. This provides increased fault tolerance, as the hosts no longer have to do path failover to start I/O on the other node in the I/O group. Everything I've read so far on this feature is written from the perspective of someone who is going to be configuring it. My perspective is different, as I troubleshoot issues on the SAN connectivity side. This post is going to talk about some of the procedures and data you can use to troubleshoot connectivity to the SVC/Storwize when the NPIV feature is enabled, as well as some best practices to hopefully avoid problems. If you are unfamiliar with this fea