Kenneth Kalmer | 17 Feb 2011 21:11

High performance AoE

Dear list

We currently have an AoE SAN running in production that needs several
refinements, which I'll tackle in individual mails over the coming
days.

Basic layout is as follows:

Infortrend storage array with SAS & SATA drives, connected via FC to
two storage controllers which export logical volumes with AoE to Xen
hosts. Each AoE target is used as a virtual block device, either
containing the OS (stored on SATA) or additional storage for working
data (SAS for databases & mail, SATA for websites, etc). The AoE target is
vblade; secondary storage is in "cold standby" mode (toggle FC port
states on the switch and start vblade instances to take over). Switches are
Extreme Networks' Summit 7i.

At present we have 30 AoE targets running; when our full migration is
done we'll have well over 100 and have plans to scale up way past
that. We run a huge private cloud (we're in the wholesale ISP
business) as well as managed private clouds for clients. We're ramping
up for a full public cloud offering.

Assuming the only optimizations I have done are the following:

* AoE in private tag-based VLAN.
* Bumped the MTUs to 9000 for all VLAN interfaces and switch ports.
* Gigabit ethernet.
* Leveraging decent switches with full non-blocking architectures.

[...]

Yacine Kheddache | 17 Feb 2011 22:28

Re: High performance AoE

Hi,

Just a question based on what you have listed: why don't you use Coraid gear instead of vblade on top of
Infortrend? If the goal is to improve AoE performance then IMHO just test the SR and/or SRX line of
products and you will be able to exceed your performance objectives and it will make your life
easier...

But that's just the point of view of someone who loves the KISS principle ;-)

Yacine kheddache / www.alyseo.com


Kenneth Kalmer | 18 Feb 2011 08:04

Re: High performance AoE

On Thu, Feb 17, 2011 at 11:28 PM, Yacine Kheddache <yacine@...> wrote:
> Hi,
>
> Just a question based on what you have listed: why don't you use Coraid gear instead of vblade on top of
> Infortrend? If the goal is to improve AoE performance then IMHO just test the SR and/or SRX line of
> products and you will be able to exceed your performance objectives and it will make your life easier...

Well Yacine, the procurement process for the hardware was out of my
control and now I have to use it. The gear is pretty performant: we
have a couple of volumes that bypass the AoE stack and access the
LUNs directly, and they are super fast. The thing is, I'll need to
bring them back into the AoE stack to maintain a level of elasticity
in our cloud.

> But that's just the point of view of someone who loves the KISS principle ;-)

+1, that is why I'm running AoE in the first place, and doing this
research to get additional clarity on things that are currently just
mentioned in passing on other sites. If I didn't believe in KISS, I
would have bowed to all the iSCSI pressure I'm under already...

Best

-- 
Kenneth Kalmer
kenneth.kalmer@...
http://opensourcery.co.za
@kennethkalmer


Adi Kriegisch | 18 Feb 2011 15:27

Re: High performance AoE

Hi!

> We currently have an AoE SAN running in production that needs several
> refinements, which I'll tackle in individual mails over the coming
> days.
[SNIP]
> My next question is on leveraging multiple gigabit connections, which
> leads me to the following questions:
> 
> Since vblade uses a specified device, should I use channel bonding to
> aggregate multiple links together for more performance ? If yes, is
> 802.3ad the best bonding method since the switch is involved in
> deciding down which link the ethernet frames are sent, or am I missing
> the plot on this one. I currently have 4 GBE ports per storage
> controller that I can leverage, and am considering jumping to dual 10
> GBE interfaces to the switch.
Hmm... I think the main issue is caused by using vblade. I'd consider
vblade kind of a reference implementation. Starting multiple vblade
processes will not help either because it will just eat up available IOPS
by introducing unnecessary, uncoordinated reads and writes. When talking
about performance you have two options:
Buy Coraid hardware, as already suggested, or
choose a different implementation like ggaoed[1]. There you may specify
multiple interfaces so you will get automatic load balancing and so on.

Using link aggregation (802.3ad or the Linux bonding driver) will not help in
any way to improve performance between a storage server and a single
frontend: only one lane will be used, based on source and destination MAC
(though it might work when using 2 different NICs on the frontend -- but
this will lead to endless issues when the switch decides to choose a
[...]

Tracy Reed | 18 Feb 2011 20:14

Re: High performance AoE

On Fri, Feb 18, 2011 at 03:27:35PM +0100, Adi Kriegisch spake thusly:
> vblade kind of a reference implementation. Starting multiple vblade
> processes will not help either because it will just eat up available iops
> by introducing unnecessary, uncoordinated reads and writes. When talking

Wouldn't the block layer aggregate and coordinate these reads and writes?
vblade is just reading/writing from a file like any other process, no?  I find
vblade performance to be pretty decent. Especially since the "thundering herd"
problem of all vblade processes being awakened by every packet was solved. I am
never limited by CPU and very rarely even network, only by disk performance
itself.

-- 
Tracy Reed           Digital signature attached for your safety.
Copilotco            Professionally Managed PCI Compliant Secure Hosting
866-MY-COPILOT x101  http://copilotco.com
_______________________________________________
Aoetools-discuss mailing list
Aoetools-discuss@...
https://lists.sourceforge.net/lists/listinfo/aoetools-discuss

Tracy Reed | 17 Feb 2011 22:49

Re: High performance AoE

On Thu, Feb 17, 2011 at 10:11:18PM +0200, Kenneth Kalmer spake thusly:
> Since vblade uses a specified device, should I use channel bonding to
> aggregate multiple links together for more performance ? If yes, is
> 802.3ad the best bonding method since the switch is involved in
> deciding down which link the ethernet frames are sent, or am I missing
> the plot on this one. 

I say use channel bonding, but understand that the connection between a
particular pair of machines will only use one of the links, due to the
MAC hashing used by 802.3ad to choose which link carries traffic to a
particular host. So no one connection will get more than 1Gb, but the
aggregate throughput from the AoE target to multiple initiators will be
greater. And of course make sure you have enough disk performance to
actually use the bandwidth.
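
To make the per-pair behaviour concrete, here is a minimal Python sketch of a
layer-2 style hash. It assumes a simplified policy (XOR of the last byte of
the source and destination MAC, modulo the number of links); the real choice
depends on the bonding driver's transmit hash policy and on the switch, and
the MAC addresses and the layer2_link helper are invented for illustration.

```python
def mac_bytes(mac):
    """Parse 'aa:bb:cc:dd:ee:ff' into six raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def layer2_link(src_mac, dst_mac, n_links):
    """Simplified layer-2 hash: XOR the last MAC bytes, modulo link count.

    This is an assumed model for illustration, not the exact algorithm of
    any particular bonding driver or switch.
    """
    return (mac_bytes(src_mac)[5] ^ mac_bytes(dst_mac)[5]) % n_links


# Hypothetical addresses: one AoE target with a 4-link bond, four initiators.
target = "00:11:22:33:44:10"
initiators = [
    "00:11:22:33:44:21",
    "00:11:22:33:44:22",
    "00:11:22:33:44:23",
    "00:11:22:33:44:24",
]

for mac in initiators:
    # The same pair of MACs always maps to the same link.
    print(mac, "-> link", layer2_link(target, mac, 4))
```

Each initiator lands on one fixed link, so a single target/initiator pair
never gets more than one link's worth of bandwidth, while several initiators
can spread across the bond.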

> I currently have 4 GBE ports per storage controller that I can
> leverage, and am considering jumping to dual 10 GBE interfaces to the
> switch.

10GBE interfaces would certainly get you faster individual connections
than 1Gb. But worry about disk throughput first. Measure the bandwidth
on your 1Gb links and make sure you can actually hit that before
investing in 10Gb links and assuming the bandwidth is the issue. It
takes a lot of disks to fill even a 1Gb pipe on anything but pure
streaming workloads.
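
A back-of-the-envelope version of that estimate, as a Python sketch. The
per-disk figures (150 random IOPS at 4 KiB, 100 MB/s sequential) are
assumptions picked for illustration, not measurements from this setup, and
protocol overhead is ignored.

```python
import math

LINK_BITS_PER_SEC = 1_000_000_000             # 1Gb link
LINK_BYTES_PER_SEC = LINK_BITS_PER_SEC // 8   # 125,000,000 bytes/s


def disks_to_fill_link(bytes_per_sec_per_disk, link=LINK_BYTES_PER_SEC):
    """Number of disks needed before the link, not the disks, is the limit."""
    return math.ceil(link / bytes_per_sec_per_disk)


# ASSUMED per-disk throughput figures (illustrative only):
random_io = 150 * 4096      # 150 random IOPS, 4 KiB per request
streaming = 100_000_000     # 100 MB/s sequential

print("random I/O:", disks_to_fill_link(random_io), "disks")
print("streaming: ", disks_to_fill_link(streaming), "disks")
```

Under these assumptions a random workload needs a couple of hundred spindles
to saturate the link, while a streaming workload needs only a handful, which
is why measuring the existing 1Gb links comes first.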

> Then, on the initiator side my understanding is that "aggregation"
> comes for free. So in this case all I need to do is ensure I have a
> vlan interface per physical interface on the server, and use
> `aoe-interfaces` to restrict the scope to the multiple vlan
[...]

Kenneth Kalmer | 20 Feb 2011 20:37

Re: High performance AoE

On Fri, Feb 18, 2011 at 9:14 PM, Tracy Reed <treed@...> wrote:
> On Fri, Feb 18, 2011 at 03:27:35PM +0100, Adi Kriegisch spake thusly:
>> vblade kind of a reference implementation. Starting multiple vblade
>> processes will not help either because it will just eat up available iops
>> by introducing unnecessary, uncoordinated reads and writes. When talking
>
> Wouldn't the block layer aggregate and coordinate these reads and writes?
> vblade is just reading/writing from a file like any other process, no?  I find
> vblade performance to be pretty decent. Especially since the "thundering herd"
> problem of all vblade processes being awakened by every packet was solved. I am
> never limited by CPU and very rarely even network, only by disk performance
> itself.

I agree with Tracy here, although some tweaking of the kernel might
help for optimal access to the underlying block devices. However, this
is shooting from the hip and I think tests are in order to put this one
to rest.

-- 
Kenneth Kalmer
kenneth.kalmer@...
http://opensourcery.co.za
@kennethkalmer


Kenneth Kalmer | 20 Feb 2011 20:45

Re: High performance AoE

On Thu, Feb 17, 2011 at 11:49 PM, Tracy Reed <treed@...> wrote:
> On Thu, Feb 17, 2011 at 10:11:18PM +0200, Kenneth Kalmer spake thusly:
>> Since vblade uses a specified device, should I use channel bonding to
>> aggregate multiple links together for more performance ? If yes, is
>> 802.3ad the best bonding method since the switch is involved in
>> deciding down which link the ethernet frames are sent, or am I missing
>> the plot on this one.
>
> I say use channel bonding but understand that the connection between a
> particular pair of machines will only use one of the links due to the
> MAC hashing used by 802.3ad to choose a link to transmit to a particular
> host over. So no one connection will be getting more than 1Gb but the
> aggregate throughput from the AoE target to multiple initiators will be
> greater. And of course make sure you have enough disk performance to
> actually use the bandwidth.

Good points. Since I'm using 4Gb FC to the storage array, my thinking
with 4x GbE to the initiators was to eliminate everything but the physical
disk bottleneck. I'll check out the other bonding mechanisms as well and
see if one could possibly give us higher throughput than a single GbE
link.

>> I currently have 4 GBE ports per storage controller that I can
>> leverage, and am considering jumping to dual 10 GBE interfaces to the
>> switch.
>
> 10GBE interfaces would certainly get you faster individual connections
> than 1Gb. But worry about disk throughput first. Measure the bandwidth
> on your 1Gb links and make sure you can actually hit that before
> investing in 10Gb links and assuming the bandwidth is the issue. It
[...]

Lachlan Evans | 3 Mar 2011 06:50

Down,closewait under load

Hi list,

I've encountered a recurring issue where a single AoE device goes into the closewait,down state.  I'm hoping someone here might be able to point me in the right direction of where to look to find the underlying cause.

A little about the setup: two hosts, one acting as a SAN, the other as a Xen host.  Both run Debian Squeeze using the Debian-distributed AoE packages.  A 5-disk RAID-6 array is configured using md and LVM on the SAN.  LVM volumes are then exported via AoE using vblade.  There are 5 volumes exported from the SAN to the Xen host:

      e0.0       171.798GB  bond0 up
      e0.1       268.435GB  bond0 up
      e0.2        53.687GB  bond0 up
      e0.3       128.849GB  bond0 up
      e0.4        53.687GB  bond0 up

which are then used by the Windows 2003 Server Xen DomU as its disk devices.

The issue first occurred on February 17th 19:13 where this was recorded:

Feb 17 19:13:30 vmsrv kernel: [456093.648028] VBD Resize: new size 0

I believe this log entry originates from Xen's VBD driver reporting the change.

And aoe-stat on the Xen host displayed:

      e0.0       171.798GB  bond0 up
      e0.1       268.435GB  bond0 up
      e0.2        53.687GB  bond0 up
      e0.3       128.849GB  bond0 closewait,down
      e0.4        53.687GB  bond0 up

Overnight last night:

Mar  2 20:28:23 vmsrv kernel: [900000.336023] VBD Resize: new size 0

and aoe-stat displayed:

      e0.0       171.798GB  bond0 up
      e0.1       268.435GB  bond0 closewait,down
      e0.2        53.687GB  bond0 up
      e0.3       128.849GB  bond0 up
      e0.4        53.687GB  bond0 up

An aoe-revalidate instantly resolves the issue, but in the meantime the disks are unavailable.

What leads me to believe that this is load-related is that both occurrences fell within our backup schedule, which generates a large amount of load, particularly on the SAN.  Up until about a month ago we were running a combination of IET+open-iscsi, and the backup schedule (which has not changed since) didn't seem to affect that combination.

Any pointers would be greatly appreciated.

Cheers,

Lachlan



Jeff Sturm | 3 Mar 2011 18:50

Re: Down,closewait under load

We were plagued by this problem a while ago.  "closewait" status means the driver sees the device but is waiting for the block device to close before automatically revalidating it.

What version of the initiator (aoe.ko) are you using?

After some inspection of the aoe driver source, I now understand that the RTT calculations do not take into account packets that are permanently lost.  So it is possible for the driver to get into a state in which the network is flooded with request packets, resent after each TTL expiration, while the TTL is not adjusted.  After aoe_deadsecs seconds elapse (180 by default, IIRC) the device is marked down.

The aoe_maxout defaults are flawed.  With a single target (say, e1.1) and a single initiator, the device will be queried for its "buffer count" (in response to a "query config" command) and return, for example, 64.  The aoe initiator then uses this number as the default value of aoe_maxout, and will send up to 64 requests to the target before receiving a response.  Now suppose there are 2 Ethernet links from the initiator to the target (multipath).  The aoe initiator will send up to 2 * 64, or 128, requests before receiving a response, which can overwhelm the target.

It gets worse than that.  If the shelf has 3 different slots (e.g. e1.1, e1.2, e1.3), the Linux aoe initiator will queue up to 64 requests per slot per interface (3 * 2 * 64 = 384).  And if there are 4 different hosts all connecting to the same target, multiply this by 4 (1536).  That's far more outstanding requests than the target can safely handle, and intermediate switch buffers are likely to get flooded as well.
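
The multiplication above can be written out as a small Python sketch. The
function name is hypothetical; the numbers are the ones from the example
(buffer count 64, 2 interfaces, 3 slots, 4 hosts), and the result is a
worst-case bound, not a measurement.

```python
def worst_case_outstanding(buffer_count, interfaces, slots, hosts):
    """Upper bound on AoE requests in flight toward one target.

    Models the behaviour described above: each initiator queues up to
    buffer_count requests per slot per interface.
    """
    return buffer_count * interfaces * slots * hosts


# One host, one slot, two links (multipath).
print(worst_case_outstanding(64, interfaces=2, slots=1, hosts=1))
# One host, three slots, two links.
print(worst_case_outstanding(64, interfaces=2, slots=3, hosts=1))
# Four hosts, three slots, two links each.
print(worst_case_outstanding(64, interfaces=2, slots=3, hosts=4))
```

Plugging in a lower per-initiator limit (for example 8 instead of 64) shows
how quickly the bound shrinks.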

 

Here's how we handled it:

- Enable hardware flow control on all Ethernet devices carrying AoE traffic.  The usual wisdom with hardware flow control is to leave it off, since TCP has pretty good congestion control.  However, AoE is not TCP, and there is ample evidence that AoE performs better with it enabled.

- Ensure network buffers are large enough to store outstanding packets.  This is particularly important if you are running jumbo frames.  In our sysctl.conf I have:

net.core.rmem_default = 262144
net.core.rmem_max = 16777216
net.core.wmem_default = 262144
net.core.wmem_max = 16777216

- Lower the aoe_maxout parameter of the aoe module as much as necessary to preserve stability of the storage network.  As mentioned above, the default aoe_maxout is obtained by querying the device.  Cut this in half, or less, and run some performance tests.  We've lowered it all the way to 8 without much sacrifice in performance.

- Buy good network switches, if you haven't done so already.  The network is only as good as its weakest component.  Switches are not a good place to save money, I've found, and not all are made the same.  Try a few different models if you have the luxury.
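
One way to confirm that the buffer settings above are actually in effect is
to read them back from /proc/sys. This is a minimal Python sketch under that
assumption; the helper names are hypothetical, and it only checks the four
sysctl keys listed above.

```python
from pathlib import Path

# Values taken from the sysctl.conf fragment above.
WANTED = {
    "net.core.rmem_default": 262144,
    "net.core.rmem_max": 16777216,
    "net.core.wmem_default": 262144,
    "net.core.wmem_max": 16777216,
}


def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file under /proc/sys."""
    return Path(root).joinpath(*key.split("."))


def check_sysctls(wanted=WANTED, root="/proc/sys"):
    """Return {key: (current, wanted)} for keys that are missing or too small."""
    problems = {}
    for key, want in wanted.items():
        try:
            current = int(sysctl_path(key, root).read_text().split()[0])
        except (OSError, ValueError, IndexError):
            current = None  # unreadable or not a plain integer
        if current is None or current < want:
            problems[key] = (current, want)
    return problems


if __name__ == "__main__":
    for key, (current, want) in check_sysctls().items():
        print(f"{key}: current={current}, wanted>={want}")
```

An empty result means every listed value is at least as large as requested;
the root argument exists only so the check can be pointed at a test
directory.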

 

Good luck,

-Jeff

 

