Setting the queue depth or Execution Throttle parameters – CLARiiON / VNX

Each FC storage array port has a maximum queue depth of 2048. For performance reasons, the calculations below use a practical limit of 1600. When a large number of HBAs (initiators) generate I/Os, a specific port queue can fill up to that maximum. The host's HBA will notice this by receiving queue full (QFULL) messages and seeing very poor response times. How this is handled depends on the operating system. Older OSes could lose access to their drives, or even freeze or blue-screen. Modern OSes throttle I/Os down to a minimum to work around the problem. VMware ESX, for example, decreases its LUN queue depth down to 1. Once the QFULL messages stop, ESX increases the queue depth bit by bit until it is back at the configured value. This can take up to around a minute.
During QFULL events the hosts may experience timeouts, even if the overall performance of the CLARiiON is fine. The response to a QFULL is HBA dependent, but it typically results in a suspension of activity for more than one second. Though rare, repeated QFULLs can have serious consequences for throughput.

Some operating systems and HBA drivers can set a ceiling on the queue depth per LUN. This is commonly referred to as the target queue depth. VMware ESX limits the queue depth on a per-LUN basis for each path.

An EMC CLARiiON (as well as many other storage arrays) will return a QFULL flow control command under the following conditions:

1. The total number of concurrent I/O requests on the front-end FC port is greater than 1600.
2. The total number of requests for a LUN is greater than its maximum queue depth, which is 32 + (14 × the LUN's data drive count). For example, for a 5-drive RAID 5 (4+1), the maximum queue depth = 32 + 14 × 4 = 88.
3. The HBA execution throttle thresholds on the hosts are set to too high a value (such as 256).
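As a quick sketch, the per-LUN ceiling from condition 2 can be computed in shell (the 32 + 14-per-data-drive formula is the one given above):

```shell
#!/bin/sh
# Per-LUN maximum queue depth on a CLARiiON: 32 + 14 per data drive.
# A 5-drive RAID 5 (4+1) set has 4 data drives.
data_drives=4
max_qd=$(( 32 + 14 * data_drives ))
echo "$max_qd"   # prints 88
```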

On a Windows machine with QLogic HBAs, use the SANsurfer utility to change the Execution Throttle for each HBA. This can be done online. In newer versions of SANsurfer, the Execution Throttle is found under "Advanced HBA Settings": select an HBA port, then Parameters, then pick it from the "Select Settings" drop-down. The default Execution Throttle in an EMC environment is 256; if the value is higher than 256, change it to 256, and if it is already 256, try lowering it to 32.
The same target queue length restrictions apply to all other HBA makes and models. With Emulex, these settings can be changed using HBAnyware.

To avoid hammering the storage processor front-end FC ports, you can calculate the maximum queue depth from the number of initiators per storage port and the number of LUNs ESX uses. Other initiators are likely to be sharing the same SP ports, so their queue depths will also need to be limited. The formula for the maximum queue depth is:

QD = 1600 / (Initiators * LUNs)

QD = the required queue depth or Execution Throttle, i.e. the maximum number of simultaneous I/Os for each LUN on any particular path to the SP.
Initiators = the number of initiators (HBAs) per storage port, which is normally the number of ESX hosts plus all other hosts sharing the same SP ports.
LUNs = the number of LUNs for ESX sharing the same paths, which is the number of LUNs in the ESX storage group.
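The formula can be sketched as a small shell function; the example numbers below (8 initiators, 4 LUNs) are purely illustrative:

```shell
#!/bin/sh
# Required queue depth per initiator/LUN pair, derived from the 1600
# practical limit per front-end port: QD = 1600 / (Initiators * LUNs).
qd() {
    initiators=$1   # HBAs sharing the storage port
    luns=$2         # LUNs sharing the same paths
    echo $(( 1600 / (initiators * luns) ))
}

qd 8 4   # prints 50
```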

Two ESX parameters should be set to this QD value: the queue depth of the storage adapter and "Disk.SchedNumReqOutstanding". Often, "Disk.SchedNumReqOutstanding" is set to a lower value than the HBA queue depth to prevent any single virtual machine from completely filling the HBA queue and starving the other virtual machines of I/O. If that is the case in your ESX environment, both settings should be decreased proportionally. For example, if the HBA queue depth is 64 and "Disk.SchedNumReqOutstanding" is 32 (the default), then to reduce QFULLs the HBA queue depth could be set to 32 and "Disk.SchedNumReqOutstanding" to 16.
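On classic ESX, both settings can be changed from the service console. The sketch below assumes a QLogic HBA; the module name and option (`qla2xxx`, `ql2xmaxqdepth`) vary per driver and ESX version, so verify the exact names against the VMware KB articles referenced further down before running anything:

```shell
#!/bin/sh
# Sketch only -- module/option names differ per driver and ESX release.
# Set the QLogic HBA LUN queue depth to 32 (takes effect after reboot):
esxcfg-module -s ql2xmaxqdepth=32 qla2xxx

# Set Disk.SchedNumReqOutstanding to 16 (per-VM outstanding requests):
esxcfg-advcfg -s 16 /Disk/SchedNumReqOutstanding

# Verify the advanced setting:
esxcfg-advcfg -g /Disk/SchedNumReqOutstanding
```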

For example, a farm of 16 ESX servers has four paths to the CLARiiON (two HBAs each), and these FC ports are dedicated to ESX (which makes keeping queue depths under control easier). There are multiple storage groups in this example to keep each ESX server's boot LUN private, but each storage group has 5 LUNs.

This leads to the following queue depth:

QD = 1600 / (16 * 5) = 20

In practice a certain amount of oversubscription is fine, because all LUNs on all servers are unlikely to be busy at the same time, especially when load balancing is used. So in the example above, a queue depth of 32 should still not cause QFULL events under normal circumstances.
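To see how much oversubscription a queue depth of 32 would actually mean here, multiply it out (numbers taken from the example above):

```shell
#!/bin/sh
# Worst case: every initiator drives every LUN at its full queue depth
# at the same moment. With QD=32 instead of the calculated 20:
initiators=16
luns=5
qd=32
worst_case=$(( initiators * luns * qd ))
echo "$worst_case"   # prints 2560 -- 1.6x the 1600-per-port target
```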


Also see the following VMware knowledge base articles:

http://kb.vmware.com/kb/1267

http://kb.vmware.com/kb/1268

  1. Thank you very much for the post. I have been reading about this for a while and this is the site where these concepts are best described.
    On the other hand, I still have a doubt regarding your example: in the queue depth calculation, I see that the number of paths to the storage does not need to be taken into account, but what about the multipath policy? Does it matter whether we have Fixed, MRU or Round Robin?

    Kind regards

  2. That's where it gets fuzzy. I don't think it really matters if you have RR on all paths, since the load will be equal on all ports. If you have Fixed, it's your responsibility to manually load-balance the datastores over the existing storage ports.
    With MRU I have a trick: set it to Fixed briefly, pin it to a certain port and set it back to MRU again. The downside is that MRU doesn't guarantee that a datastore will be using that same path a day later, so IMHO it's best to switch to Fixed or RR as fast as possible.
    For the rest: there's a nice Primus case describing this very topic, and if you really want to know all the ins and outs, the performance workshop is THE place to ask questions like this 🙂

    I tried to locate this Primus case but can't find it right now; be aware that there is such a case about "queue depth" and "execution throttle", and it also mentions the queue depth of 88.

  3. On the EMC website you can find a somewhat newer article (000053727) on the subject here

  4. Hi,
    I have a question:
    Let's say we have 1 ESX server with 2 HBA cards, connected to 4 target ports (2 paths for each HBA) on a VMAX controller, for example, with 1 LUN.
    The QD for each HBA is set to 32 and Disk.SchedNumReqOutstanding is set to 32 as well.
    Will I be able to get 64 queue slots in total (32 for each HBA), or only 32 because Disk.SchedNumReqOutstanding is limiting us?

    If so, wouldn't it be smart to limit each HBA to 16 QD while Disk.SchedNumReqOutstanding stays at 32?

    Thanks in advance,
    Ben Raz.

  5. And I forgot to mention – we are working in a round-robin fashion in this example
