Re: ESXi + LIO + Ceph RBD problem
Martin Svec <martin.svec <at> zoner.cz>
2015-08-19 11:02:02 GMT
Thank you for sharing all the interesting tips and ideas. I agree with Steve that there are two
different issues. It makes sense to reduce default_cmdsn_depth if the backend storage is overloaded
and cannot handle more outstanding I/O in a timely manner. However, this doesn't help in the case of
temporary backend outages, like a RAID disk or Ceph node failure, where we know we will definitely
exceed the 5-second timeout and want to reset the sessions. ESXi does quite well when recovering from
APD (All Paths Down) conditions, but this does not seem to be such a situation.
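
For reference, a minimal sketch of how default_cmdsn_depth could be lowered on a running LIO
target by writing the TPG attribute in configfs. The helper name set_default_cmdsn_depth, the
example IQN and the tpgt=1 default are assumptions for illustration, and the configfs mount
point may differ on your system. As far as I know, the value applies to newly created ACLs and
sessions, so existing initiators may need to log in again.

```python
from pathlib import Path

# Usual configfs location of the LIO iSCSI fabric (assumption: configfs is
# mounted at /sys/kernel/config and the iscsi_target_mod module is loaded).
CONFIGFS_ISCSI = Path("/sys/kernel/config/target/iscsi")


def set_default_cmdsn_depth(iqn, depth, tpgt=1, base=CONFIGFS_ISCSI):
    """Write the default_cmdsn_depth attribute of one target portal group.

    Hypothetical helper: iqn is the target IQN, tpgt the portal group tag.
    Returns the value read back from the attribute file.
    """
    if depth < 1:
        raise ValueError("depth must be a positive integer")

    attr = Path(base) / iqn / f"tpgt_{tpgt}" / "attrib" / "default_cmdsn_depth"
    if not attr.exists():
        raise FileNotFoundError(f"no such TPG attribute: {attr}")

    # configfs attributes are plain text files; write the new value as text.
    attr.write_text(f"{depth}\n")

    # Read the value back to confirm that it was accepted.
    return int(attr.read_text().strip())
```

Running it as root, e.g. set_default_cmdsn_depth("iqn.2015-08.com.example:target0", 16), would
set the depth to the value Nicolas recommended. Again, this only helps with the overload case,
not with backend outages.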
Steve, I was testing the same Pacemaker+DRBD setup as you in 2011 and decided to rewrite the target
resource agent from scratch. The original one was too unreliable and slow. (Sorry, I cannot make
it public.) Note that I never saw ABORT_TASKs when running this setup on our Dell hardware.
On 19.8.2015 at 10:22, Steve Beaudry wrote:
> Thanks Nicolas,
> I'll modify the resource agent script so that the default_cmdsn_depth can be set, and reduce
> the value to 16, based on your recommendation, and see what impact it has.
> I do still believe that we're talking about two different problems, one being performance and
> outstanding IOs timing out, while the other being a seeming incompatibility between LIO and ESX
> with regard to handling sessions when VMware decides to restart a session, which it does for a
> number of reasons (really, in response to any number of SCSI errors that pop up).