Posts

Showing posts from March, 2019

Cisco Automatic Zoning

Cisco released a feature in NX-OS v8.3.1 called Automatic Zoning. The feature does exactly what the name suggests: it automatically configures zoning for the devices on your SAN. You can see a video on the feature here. What Is Zoning? SAN (Storage Area Network) zoning is specifying which devices on the SAN can communicate with which other devices. Devices are added to a zone; a zone is a group of devices that are allowed to communicate with each other. Zones are then added to a zoneset, and the zoneset is then activated - the active zoneset is the configuration that is in effect. There can be multiple zonesets, but only one can be active at any given time. By default, any device that is not zoned (not a member of a zone) cannot communicate with any other device; a device that is in at least one zone is considered zoned. Effective zoning prevents unauthorized devices from talking to each other and minimizes disruption on the SAN if a device misbehaves. How Cisco Automatic Zoning Works: When a SAN is first con…
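For readers who have not configured zoning before, here is a brief sketch of what typical basic zoning looks like in MDS NX-OS configuration mode. The zone name, zoneset name, and WWPNs are placeholders, and you should verify the exact syntax against the documentation for your release:

    zone name HOST1_TO_ARRAY1 vsan 10
      member pwwn 10:00:00:00:c9:aa:bb:cc
      member pwwn 50:05:07:68:01:40:aa:bb
    zoneset name ZS_PROD vsan 10
      member HOST1_TO_ARRAY1
    zoneset activate name ZS_PROD vsan 10

The first block defines a zone containing a host port and a storage port, the second adds that zone to a zoneset, and the final command activates the zoneset so the configuration takes effect on the fabric.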

Why Increasing Buffer Credits Is Not A Cure For A Slow Drain Device

When I am working on performance problems, a frequent question I get is why increasing the buffer credits for a particular port is not a fix for a slow drain device attached to that port. In this video, I explain the concept of congestion and illustrate why increasing the number of buffer credits is not a fix unless the underlying cause of the congestion is addressed. There are some exceptions to this rule. The most common is when dealing with long-distance links, but that will be addressed in a future blog post (and perhaps a future video). As always, you can leave feedback in the comments, or find me on LinkedIn or Twitter.
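To make the congestion argument concrete, here is a small toy model in Python (not from the video; the rates and credit counts are invented for illustration). It models a slow drain device that returns a credit only when it actually absorbs a frame:

    # Toy per-second model of buffer-to-buffer flow control with a slow-drain device.
    # Numbers are hypothetical; the point is the trend, not the absolute values.
    def sustained(credits_granted, offered_load=100, drain_rate=40, seconds=60):
        credits = credits_granted   # credits the sender currently holds
        queued = 0                  # frames waiting at the slow device
        delivered = 0
        for _ in range(seconds):
            tx = min(offered_load, credits)     # can only transmit while holding credits
            credits -= tx
            queued += tx
            absorbed = min(drain_rate, queued)  # the device drains only drain_rate frames/s
            queued -= absorbed
            credits += absorbed                 # one credit returned per absorbed frame
            delivered += absorbed
        return delivered / seconds, queued

    for c in (64, 256, 1024):
        rate, backlog = sustained(c)
        print(f"{c:5d} credits -> ~{rate:.0f} frames/s delivered, {backlog} frames backed up")

Every credit count delivers the same ~40 frames per second, which is the device's drain rate; the only thing that changes is how many frames pile up in buffers, and that standing backlog is exactly the congestion that then affects other traffic sharing the fabric.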

Troubleshooting IBM Storage Insights Pro Alerts

Recently, enhancements were made to several features, including a new Alert Policy feature in IBM Storage Insights Pro. You can read about the new features here. The Alert Policy feature lets you configure a set of alerts into a policy and apply all of them across multiple storage systems. In this way you can ensure consistent alerting without having to define the same alert on each individual storage system. Once you define the alerts, IBM Storage Support representatives can see the alerts generated on a storage system. For the IBM FS9100, there are a number of alerts that are already defined. When one of those alerts is triggered, a proactive ticket is opened and the experts at IBM Storage Support investigate the alert, then take whatever action is necessary. In this post we'll take a look at how the IBM Storage Support team investigates alerts. For this example we are using an alert for the Port Send Delay I/O Percentage.

I Will Be At Technical University in Atlanta

IBM Tech U Atlanta 2019. This is just a quick post to say that I will be at IBM Technical University in Atlanta from April 29 through May 3. My sessions for this event are:
s106417 The Path of an FC Frame Through a Cisco MDS Director
s106420 Proactive Monitoring of a Cisco Fabric
s106421 Troubleshooting Cisco SAN Performance Issues - Part 1
s106422 Troubleshooting Cisco SAN Performance Issues - Part 2
Find these sessions and many more at Technical University. Click the banner at the top of the page to register or for more information. As always, if you have any questions, leave them in the comments at the end of this blog or find me on LinkedIn or Twitter.

Why Low I/O Rates Can Result In High Response Times

Why Low I/O Rates Can Result in High Response Times for Reads and Writes. As IBM Storage Insights and Storage Insights Pro become more widely adopted, many companies that weren't doing performance monitoring previously are now able to see the performance of their managed storage systems. With the alerting features in Storage Insights Pro, companies are much more aware of performance problems within their storage networks. One common question that comes up is why a volume with low I/O rates can have very high response times. Often these high response times are present even with no obvious performance impact at the application layer. These response time spikes are generally measured in the tens or hundreds of milliseconds, but can be a second or greater. At the same time, the I/O rates are low - perhaps 10 I/Os per second or less. This can occur on either read or write I/Os. As an example, this picture shows a typical pattern of generally low I/O rates with a high response time. The volume…
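As a purely hypothetical illustration of the arithmetic (the numbers below are invented, and the full post's example may differ), a single slow I/O dominates an interval average when the I/O rate is low:

    # At 10 I/Os per second, one 500 ms outlier drags the reported average
    # response time to ~51 ms even though 9 of the 10 I/Os completed in 1 ms.
    response_times_ms = [1.0] * 9 + [500.0]   # one sample interval: 10 I/Os
    avg = sum(response_times_ms) / len(response_times_ms)
    print(f"average response time: {avg:.1f} ms")   # 50.9 ms

    # The same outlier among 1,000 fast I/Os barely moves the average.
    busy_interval = [1.0] * 999 + [500.0]
    print(f"busy-volume average:   {sum(busy_interval) / len(busy_interval):.1f} ms")  # ~1.5 ms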

Working With Groups in IBM Storage Insights Pro

IBM Storage Insights Pro Groups Feature. In addition to the Reporting and Alerting features that are not available in the free version of IBM Storage Insights, the subscription-based offering, IBM Storage Insights Pro, has a very useful feature called Groups. Groups allow you to bundle related storage resources together for ease of management. For example, you might group together a set of volumes that are all related to a specific application or server cluster, or group the hosts that make up a cluster. You can even group ports together - you could define a group of ports for an SVC cluster that includes all the ports used for inter-node communication. Such a group would be very handy for your IBM Storage Support person: it would potentially save support from having to collect a support log and dig through it, and it would certainly make analysis go faster when troubleshooting an issue. So to get started with Groups, the first thing to do is to s…

IBM Storage Insights: Tips and Answers to Frequently Asked Questions

Answers To Some Frequently Asked Questions about IBM Storage Insights. Over the last several months I have seen some common questions asked about IBM Storage Insights. I started collecting them and will answer them here. These questions are all about Storage Insights itself; questions relating to managing specific types of storage with Storage Insights will be answered in future blog posts. So, on to the questions...
Q: Can I install a data collector on the same system as my IBM Spectrum Control server?
A: Yes. However, you need to pay attention to the memory and CPU usage of your Data Collector.
Authentication: If you use username/password authentication, configure a dedicated user ID for the Data Collector on your storage systems. Do not use the default or another Admin account. This allows for effective auditing and reduces security risks.
Q: What Are The Recommended System Specifications for the Data Collector?
A: The hard drive space requirem…

The Importance of Keeping Your Entire Solution Current

Recently I started working on a new case for a customer. I'm trying to diagnose repeated error messages being logged by an IBM SVC Cluster that indicate problems communicating with the back-end storage being virtualized by the SVC. These messages generally indicate SAN congestion problems. The customer has Cisco MDS 9513 switches installed. They're older switches, but not all that uncommon. What is uncommon is finding the switches at NX-OS version 5.X.X. I see down-level firmware regularly, but this one is particularly egregious: this revision is several years out of date. Later versions of code contain numerous bug fixes from Cisco, as well as the upstream Linux security updates that get incorporated into NX-OS. Also, while NX-OS versions don't officially go out of support, any new bugs identified won't be fixed because this version is no longer being actively developed. This level of firmware merits further investigation. Looking deeper on the switche…

How Storage Insights Can Enhance Your Support Experience

An Introduction. The first week in May, IBM announced IBM Storage Insights. As of 11 June, Storage Insights has these key new items:
- IBM Blue Diamond support
- Worldwide support for opening tickets
- Custom dashboards
- A new dashboard table view
- Clients can now specify, on a per-device basis, whether IBM Storage Support can collect support logs
You can get a complete list of the new features here: Storage Insights New Features. There are some other new features as well, such as new capacity views on the Storage Insights Dashboard. With these new features, especially support for IBM Blue Diamond customers, Storage Insights is an increasingly important and valuable troubleshooting tool. My team is seeing more and more customers using Storage Insights, so I thought I would discuss its potential benefits as a troubleshooting tool. Some Background: The problems my team fixes can be categorized as either: Root-cause analysis (RCA)…

Troubleshooting SVC/Storwize NPIV Connectivity

Some Background: A few years ago IBM introduced the virtual WWPN (NPIV) feature to the IBM Spectrum Virtualize (SVC) and Storwize products. This feature allows you to zone your hosts to a virtual WWPN (vWWPN) on the SVC/Storwize cluster. If a cluster node has a problem, or is taken offline for maintenance, the vWWPN can float to the other node in the I/O group. This provides increased fault tolerance, as the hosts no longer have to perform path failover to start I/O on the other node in the I/O group. All of what I've read so far on this feature is written from the perspective of someone who is going to be configuring it. My perspective is different, as I troubleshoot issues on the SAN connectivity side. This post is going to talk about some of the procedures and data you can use to troubleshoot connectivity to the SVC/Storwize when the NPIV feature is enabled, as well as some best practices to hopefully avoid problems. If you are unfamiliar with this fea…
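As a hedged pointer (to the best of my recollection of the Spectrum Virtualize CLI; verify the command names and output fields against the documentation for your code level), two commands that are useful when checking NPIV connectivity from the cluster side are:

    lsiogrp          # shows each I/O group, including its NPIV (fctargetportmode) setting
    lstargetportfc   # lists the cluster's target ports, both physical and virtual (NPIV) WWPNs

Comparing the virtual WWPNs reported there against what is actually zoned and logged in on the fabric is a quick way to spot hosts that are still zoned to the physical ports.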

An Easy Way To Turn Your Flash Storage Supercar Into a Yugo

Introduction. This past summer I was brought into a SAN performance problem for a customer. When I was initially engaged on the problem, it was a host performance problem. A day or two after I was engaged, the customer had an outage on a Spectrum Scale cluster. That outage was root-caused to a misconfiguration on the Spectrum Scale cluster, which caused it not to re-drive some I/O commands that timed out. The next logical question was why the I/O timed out. Both the impacted hosts and the Spectrum Scale cluster used an SVC cluster for storage. I already suspected the problem was due to an extremely flawed SAN design. More specifically, the customer had deviated from best-practice connectivity and zoning of his SVC Cluster and Controllers. A 'Controller' in Storwize/SVC-speak is any storage enclosure - Flash, DS8000, another Storwize product such as V7000, or perhaps non-IBM branded storage. In this case, the customer had three Controllers. Two were IBM Flash arrays, for the p…

Inside SAN Central - Some SAN Switch Data Collection Recommendations

My colleague Sebastian Thaele prepared two excellent writeups on collecting Cisco and Brocade switch logs. His writeup for Cisco logs is here. His writeup for Brocade logs is here. I won't duplicate his work with this blog post. I was at a recent client conference and heard a common complaint about the log upload process to IBM and the number of times that customers are asked by IBM Support to upload logs for a PMR. This sentiment applies to all of IBM Support, not just SAN Central. However, I decided I'd give everyone some background on SAN Central's problem resolution process. Why SAN Central Requests The Logs That It Does: SAN Central (and Support in general) works problems that are classified in one of three ways:
1. Provide Root-Cause: These are issues where an event happened - such as poor performance or hosts losing paths to disk - but the problem is no longer occurring and the customer wants to know why the event happened. The biggest obs…

Blatant Thievery

Note: This post was migrated from my old blog site after that hosting site decided to sunset personal blogs. Is it still plagiarism if you are crediting the original author? Anyway, this should have been my first post, or at least included in my first post. I stole the idea from Sebastian Thaele. Yay! Yet another storage-related blog! As if the world needed another one. The difference with this one is that it is written by me, which by itself should be enough to make everyone want to read it. Seriously, while I think I'm pretty good at what I do, I'm not that hubristic. I know that I don't have all the answers. Nobody does. That being said, here's why I created this blog. First, I wanted to give a bit of an insider's view into IBM Storage Support. This previous post of mine is an example of that. Second, I work as a worldwide Product Field Engineer for IBM SAN Central. My team is Level 2/Level 3 (depending on who's asking…

A New Blog and Musings on IBM Technical University in New Orleans

NOTE: This post was migrated from my old blog site after that hosting site decided to sunset personal blogs. Welcome To My Blog! Thanks for stopping by, and I hope you will follow my blog. I will use this space to give you updates on IBM Storage and Storage Networking products, as well as tips and tricks you can use for troubleshooting and verifying best-practice configurations on your storage network and storage networking products. For a bit of background, I work with the IBM SAN Central support team in IBM Storage Support. We work on most storage networks, but specialize in fibre-channel and FCoE storage networking. My team provides root-cause analysis for events that have happened on the SAN, as well as isolating issues that are ongoing. We work with all of the IBM product support teams - if a product touches a storage network, we have most likely worked with it. We are also the source at IBM for the XGig fibre-channel trace tool, in the event that a tool is nee…