Stephen Hemminger | 1 Dec 2010 01:55

BUG - routes not correctly deleted when address is deleted

If multiple addresses are assigned to an interface and a route is
created that uses one of those addresses, the route is not deleted
when that address is deleted.  Linux does clean up properly
when the last address is deleted; it seems the FIB lacks the callback
to clean up routes referencing an address.

Simple example:

# modprobe dummy
# ip li set dev dummy0 up
# ip addr add 192.168.74.160/24 dev dummy0
# ip addr add 192.168.18.11/24 dev dummy0
# ip ro add 74.11.49.0/24 via 192.168.74.160

# ip addr del 192.168.74.160/24 dev dummy0
# ip ro show dev dummy0
74.11.49.0/24 via 192.168.74.160 
192.168.18.0/24  proto kernel  scope link  src 192.168.18.11 

Before I go off and brute-force it (add another callback
into fib_hash and fib_trie), is there a better way?
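
The condition described above can be modelled in user space. This is only a sketch, not kernel code: the function name stale_routes and the rule it applies (a route is stale when its gateway is not covered by any prefix still configured on the interface) are assumptions made for illustration, and the kernel's real criteria may differ.

```python
import ipaddress

def stale_routes(remaining_addrs, routes):
    """Return routes whose gateway is not inside any remaining on-link prefix.

    remaining_addrs: list of "address/prefixlen" strings still on the interface.
    routes: list of (destination_prefix, gateway) string pairs.
    """
    networks = [ipaddress.ip_interface(a).network for a in remaining_addrs]
    stale = []
    for prefix, gateway in routes:
        gw = ipaddress.ip_address(gateway)
        if not any(gw in net for net in networks):
            stale.append((prefix, gateway))
    return stale

# State from the example after "ip addr del 192.168.74.160/24 dev dummy0".
remaining = ["192.168.18.11/24"]
routes = [("74.11.49.0/24", "192.168.74.160")]
print(stale_routes(remaining, routes))
```

With both addresses still present the same call returns an empty list, which is the state before the deletion.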

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo <at> vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Rui | 1 Dec 2010 04:48

Re: multi bpf filter will impact performance?

On Tue, Nov 30, 2010 at 10:34 PM, Eric Dumazet <eric.dumazet <at> gmail.com> wrote:
> On Tuesday 30 November 2010 at 22:21 +0800, Rui wrote:
>
>> does it mean each incoming packet will issue a softirq?
>
> Hmm, if your machine is loaded, one cpu is probably looping inside
> ksoftirqd. You can probably check with "top" command.
>
>> possible to know the corresponding sock,then to run only one filter?
>>
>
> We could add a SKF_AD_CPU filter, that's a couple of lines ;)
>
> tcpdump   .... "cpu 0 and udp ..."
>
>> >
>> >> Q2. performance is bad? any idea to improve it?
>> >>
>> >
>> > multiqueue card : split each IRQ on a separate cpu.
>> >
>> > If not multiqueue card : use RPS on a recent kernel to split the load on
>> > several cpus.
>> >
>> > Use a filter against the queue, instead of doing a hash code yourself in
>> > the bpf (code added in commit d19742fb, linux-2.6.33).
>> >
>> > (you need to tweak libpcap to use SKF_AD_QUEUE instruction)
>> >
>> > commit d19742fb1c68e6db83b76e06dea5a374c99e104f
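
A filter against the receive queue, as suggested in the quoted text, could look roughly like the classic BPF program built below. This is a hedged sketch: the numeric constants are the values as I recall them from <linux/filter.h> and should be checked against your headers, and the helper names insn and queue_filter as well as the default snaplen are made up for illustration. Attaching the program to a socket (SO_ATTACH_FILTER) is left out.

```python
import struct

# Values as recalled from <linux/filter.h>; verify before relying on them.
SKF_AD_OFF = -0x1000
SKF_AD_QUEUE = 24
BPF_LD_W_ABS = 0x20  # BPF_LD | BPF_W | BPF_ABS
BPF_JEQ_K = 0x15     # BPF_JMP | BPF_JEQ | BPF_K
BPF_RET_K = 0x06     # BPF_RET | BPF_K

def insn(code, jt, jf, k):
    """Pack one struct sock_filter: u16 code, u8 jt, u8 jf, u32 k."""
    return struct.pack("=HBBI", code, jt, jf, k & 0xFFFFFFFF)

def queue_filter(queue, snaplen=0xFFFF):
    """Accept packets received on `queue`, drop everything else."""
    return b"".join([
        insn(BPF_LD_W_ABS, 0, 0, SKF_AD_OFF + SKF_AD_QUEUE),  # A = skb queue mapping
        insn(BPF_JEQ_K, 0, 1, queue),                         # A == queue ? next : skip one
        insn(BPF_RET_K, 0, 0, snaplen),                       # accept up to snaplen bytes
        insn(BPF_RET_K, 0, 0, 0),                             # drop
    ])

program = queue_filter(2)
print(len(program) // 8, "instructions")
```

One such program per queue, each attached to its own socket, avoids computing a hash inside the filter.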

Eric Dumazet | 1 Dec 2010 05:03

Re: multi bpf filter will impact performance?

On Wednesday 01 December 2010 at 11:48 +0800, Rui wrote:

> if  RPS can spread the load into 4 separate cpus, how about the
> "packet_rcv(or tpacket_rcv)" ? will they run in parallel?
> 
> as I use 'tcpdump(PACKET_MMAP)' to copy the packet to user space, I
> expect there are simultaneous packet_rcv running in each CPU to put
> the packet into ringbuffer.

Yes, the filter code can run in parallel with no particular slowdown,
since the code and bpf data are shared by all cpus (no writes).

But it's rather important for performance that each cpu stores packets
into its own packet socket or ring buffer, to avoid false sharing
slowdowns.

With such a setup (split packets across four cpus, then make sure each cpu
delivers packets to one particular PACKET socket/ring buffer), it should
really be fast enough.


Simon Horman | 1 Dec 2010 05:30

Re: Bonding, GRO and tcp_reordering

On Tue, Nov 30, 2010 at 09:56:02AM -0800, Rick Jones wrote:
> Simon Horman wrote:
> >Hi,
> >
> >I just wanted to share what is a rather pleasing,
> >though to me somewhat surprising result.
> >
> >I am testing bonding using balance-rr mode with three physical links to try
> >to get > gigabit speed for a single stream. Why?  Because I'd like to run
> >various tests at > gigabit speed and I don't have any 10G hardware at my
> >disposal.
> >
> >The result I have is that with a 1500 byte MTU, tcp_reordering=3 and both
> >LSO and GSO disabled on both the sender and receiver I see:
> >
> ># netperf -c -4 -t TCP_STREAM -H 172.17.60.216 -- -m 1472
> 
> Why 1472 bytes per send?  If you wanted a 1-1 between the send size
> and the MSS, I would guess that 1448 would have been in order.  1472
> would be the maximum data payload for a UDP/IPv4 datagram.  TCP will
> have more header than UDP.

Only to be consistent with UDP testing that I was doing at the same time.
I'll re-test with 1448.
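
For reference, the arithmetic behind 1472 and 1448 can be sketched as follows. The 12 bytes of TCP options assume that TCP timestamps are enabled (a 10-byte option padded to 12); without them the TCP payload per 1500-byte packet would be 1460. The function names are illustrative only.

```python
IPV4_HEADER = 20
UDP_HEADER = 8
TCP_HEADER = 20
TCP_TIMESTAMP_OPTION = 12  # assumption: timestamps on, 10 bytes padded to 12

def udp_payload(mtu):
    """Largest UDP/IPv4 payload that fits in one packet of size `mtu`."""
    return mtu - IPV4_HEADER - UDP_HEADER

def tcp_payload(mtu, timestamps=True):
    """TCP/IPv4 payload per full-sized segment for the given MTU."""
    options = TCP_TIMESTAMP_OPTION if timestamps else 0
    return mtu - IPV4_HEADER - TCP_HEADER - options

print(udp_payload(1500), tcp_payload(1500))
```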

> 
> >TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216
> >(172.17.60.216) port 0 AF_INET
> >Recv   Send    Send                          Utilization       Service Demand
> >Socket Socket  Message  Elapsed              Send     Recv     Send    Recv

Simon Horman | 1 Dec 2010 05:31

Re: Bonding, GRO and tcp_reordering

On Tue, Nov 30, 2010 at 03:42:56PM +0000, Ben Hutchings wrote:
> On Tue, 2010-11-30 at 22:55 +0900, Simon Horman wrote:
> > Hi,
> > 
> > I just wanted to share what is a rather pleasing,
> > though to me somewhat surprising result.
> >
> > I am testing bonding using balance-rr mode with three physical links to try
> > to get > gigabit speed for a single stream. Why?  Because I'd like to run
> > various tests at > gigabit speed and I don't have any 10G hardware at my
> > disposal.
> > 
> > The result I have is that with a 1500 byte MTU, tcp_reordering=3 and both
> > LSO and GSO disabled on both the sender and receiver I see:
> > 
> > # netperf -c -4 -t TCP_STREAM -H 172.17.60.216 -- -m 1472
> > TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216
> > (172.17.60.216) port 0 AF_INET
> > Recv   Send    Send                          Utilization       Service Demand
> > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> > Size   Size    Size     Time     Throughput  local    remote   local   remote
> > bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
> > 
> >   87380  16384   1472    10.01      1646.13   40.01    -1.00    3.982  -1.000
> > 
> > But with GRO enabled on the receiver I see.
> > 
> > # netperf -c -4 -t TCP_STREAM -H 172.17.60.216 -- -m 1472
> > TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216
> > (172.17.60.216) port 0 AF_INET

Simon Horman | 1 Dec 2010 05:34

Re: Bonding, GRO and tcp_reordering

On Tue, Nov 30, 2010 at 05:04:33PM +0100, Eric Dumazet wrote:
> On Tuesday 30 November 2010 at 15:42 +0000, Ben Hutchings wrote:
> > On Tue, 2010-11-30 at 22:55 +0900, Simon Horman wrote:
> 
> > > The only other parameter that seemed to have significant effect was to
> > > increase the mtu.  In the case of MTU=9000, GRO seemed to have a negative
> > > impact on throughput, though a significant positive effect on CPU
> > > utilisation.
> > [...]
> > 
> > Increasing MTU also increases the interval between packets on a TCP flow
> > using maximum segment size so that it is more likely to exceed the
> > difference in delay.
> > 
> 
> GRO really is operational _if_ we receive in same NAPI run several
> packets for the same flow.
> 
> As soon as we exit NAPI mode, GRO packets are flushed.
> 
> Big MTU --> bigger delays between packets, so big chance that GRO cannot
> trigger at all, since NAPI runs for one packet only.
> 
> One possibility with big MTU is to tweak "ethtool -c eth0" params
> rx-usecs: 20
> rx-frames: 5
> rx-usecs-irq: 0
> rx-frames-irq: 5
> so that "rx-usecs" is bigger than the delay between two MTU full sized
> packets.
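
To put a number on "the delay between two MTU full sized packets", here is a rough calculation. It assumes a single 1 Gbit/s link carrying back-to-back frames and standard Ethernet overhead of 38 bytes per frame (14 header, 4 FCS, 8 preamble, 12 inter-frame gap); with balance-rr over several links, the spacing of one flow's packets on a given link would be larger. The names are illustrative.

```python
ETH_OVERHEAD = 14 + 4 + 8 + 12  # header + FCS + preamble + inter-frame gap

def frame_time_us(mtu, link_bps=1_000_000_000):
    """Microseconds one full-sized frame occupies the wire."""
    return (mtu + ETH_OVERHEAD) * 8 * 1e6 / link_bps

def coalescing_covers_gap(rx_usecs, mtu):
    """True if rx-usecs exceeds the back-to-back spacing of full frames."""
    return rx_usecs > frame_time_us(mtu)

for mtu in (1500, 9000):
    print(mtu, round(frame_time_us(mtu), 3), coalescing_covers_gap(20, mtu))
```

Under these assumptions rx-usecs: 20 is longer than the spacing of 1500-byte frames but shorter than the spacing of 9000-byte frames.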

Eric Dumazet | 1 Dec 2010 05:47

Re: Bonding, GRO and tcp_reordering

On Wednesday 01 December 2010 at 13:34 +0900, Simon Horman wrote:

> I was tweaking those values recently for some latency tuning
> but I didn't think of them in relation to last night's tests.
> 
> In terms of my measurements, it's just benchmarking at this stage.
> So a trade-off between throughput and latency is acceptable, so long
> as I remember to measure what it is.
> 

I was thinking again this morning about GRO and bonding, and I don't know
if it actually works...

Is GRO on for the individual eth0/eth1/eth2 devices you use, or on the
bonding device itself?


Taku Izumi | 1 Dec 2010 06:03

Re: [PATCH] bonding: add the sysfs interface to see RLB hash table


(2010/11/30 19:10), Eric Dumazet wrote:
> On Tuesday 30 November 2010 at 19:01 +0900, Taku Izumi wrote:

>> +	hash_index = bond_info->rx_hashtbl_head;
>> +	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
>> +		client_info = &(bond_info->rx_hashtbl[hash_index]);
>> +
>> +		count += sprintf(buf + count,
>> +			"%3d.%3d.%3d.%3d %3d.%3d.%3d.%3d "
>> +			"%02x:%02x:%02x:%02x:%02x:%02x %s\n",
> 
> 
> Oh well, I guess you don't read Joe's patches on netdev ;)
> 
> Please take a look at %pI4 and %pM
> 
> sprintf(buf + count, "%pI4 %pI4 %pM %s\n", ...)

 Thank you for your advice. I've become a little wiser..

Taku Izumi <izumi.taku <at> jp.fujitsu.com>
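
As a user-space analogue of what %pI4 and %pM print (a dotted-quad IPv4 address from 4 raw bytes, and a colon-separated lowercase MAC from 6 raw bytes), consider this sketch. It is not the kernel implementation; fmt_ipv4 and fmt_mac are hypothetical helper names.

```python
import socket

def fmt_ipv4(raw4):
    """Format 4 network-order bytes as dotted decimal, like %pI4."""
    return socket.inet_ntoa(raw4)

def fmt_mac(raw6):
    """Format 6 bytes as colon-separated lowercase hex, like %pM."""
    return ":".join("%02x" % b for b in raw6)

print(fmt_ipv4(bytes([10, 124, 196, 205])), fmt_mac(bytes.fromhex("ffffffffffff")))
```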


Eric Dumazet | 1 Dec 2010 06:04

[PATCH net-next-2.6] net: optimize INET input path further

Followup of commit b178bb3dfc30 (net: reorder struct sock fields)

Optimize INET input path a bit further, by :

1) moving sk_refcnt close to sk_lock.

This reduces the number of dirtied cache lines by one on 64-bit arches
(with a 64-byte cache line size).

2) moving inet_daddr & inet_rcv_saddr to the beginning of sk

(same cache line as hash / family / bound_dev_if / nulls_node)

This reduces the number of cache lines accessed in lookups by one, and
doesn't increase the size of inet and timewait socks.
inet and tw sockets now share the same placeholder for these fields.

Before patch :

offsetof(struct sock, sk_refcnt) = 0x10
offsetof(struct sock, sk_lock) = 0x40
offsetof(struct sock, sk_receive_queue) = 0x60
offsetof(struct inet_sock, inet_daddr) = 0x270
offsetof(struct inet_sock, inet_rcv_saddr) = 0x274

After patch :

offsetof(struct sock, sk_refcnt) = 0x44
offsetof(struct sock, sk_lock) = 0x48
offsetof(struct sock, sk_receive_queue) = 0x68
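
The effect of the new offsets on dirtied cache lines can be checked with a little arithmetic. The sketch below uses only the offsets listed above, assumes 64-byte cache lines, and looks only at where each field starts (it ignores field sizes, so a field straddling a line boundary would be undercounted).

```python
CACHE_LINE = 64  # bytes, as assumed in the patch description

before = {"sk_refcnt": 0x10, "sk_lock": 0x40, "sk_receive_queue": 0x60}
after = {"sk_refcnt": 0x44, "sk_lock": 0x48, "sk_receive_queue": 0x68}

def lines_touched(offsets):
    """Sorted cache-line indexes in which the given fields start."""
    return sorted({off // CACHE_LINE for off in offsets.values()})

print("before:", lines_touched(before))
print("after: ", lines_touched(after))
```

Before the patch the three fields start in two different lines; after it they all start in the same one.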

Taku Izumi | 1 Dec 2010 06:09

Re: [PATCH] bonding: add the sysfs interface to see RLB hash table


Dear Jay Vosburgh and Stephen Hemminger:

(2010/12/01 3:37), Jay Vosburgh wrote:
> Eric Dumazet <eric.dumazet <at> gmail.com> wrote:
> 
>> On Tuesday 30 November 2010 at 19:01 +0900, Taku Izumi wrote:
>>> This patch provides the sysfs interface to see RLB hash table
>>> like the following:
>>>
>>> # cat /sys/class/net/bond0/bonding/rlb_hash_table
>>>
>>> SourceIP        DestinationIP   Destination MAC   DEV
>>>   10.124.196.205  10.124.196. 81 00:19:99:XX:XX:XX eth3
>>>   10.124.196.205  10.124.196.222 00:0a:79:XX:XX:XX eth0
>>>   10.124.196.205  10.124.196. 75 00:15:17:XX:XX:XX eth4
>>>   10.124.196.205  10.124.196.  1 00:21:d8:XX:XX:XX eth3
>>>   10.124.196.205  10.124.196.205 ff:ff:ff:ff:ff:ff eth0
> 
> 	I'm reasonably sure something like this isn't going to be
> acceptable in sysfs (it's much too large).
> 
> 	In the proc file that bonding already uses, this type of
> information isn't unreasonable, but I don't think that is the best place
> for this, for two reasons.
> 
> 	First, the table may have up to 256 entries.  Therefore, a
> sufficiently populated table will easily overrun the one page of space
> available to a sysfs show function or a proc seq_printf (per iteration),
> so it will have to handle that.  The current code in bonding to do its
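
A quick estimate of the overrun mentioned above, as a sketch. It reuses the format string from the posted patch, assumes a 4096-byte page and a four-character device name such as eth3, and fills the masked MAC bytes with zero placeholders; sample_line is a hypothetical helper.

```python
PAGE_SIZE = 4096    # assumption: 4 KiB pages
MAX_ENTRIES = 256   # table size quoted in the thread

LINE_FORMAT = ("%3d.%3d.%3d.%3d %3d.%3d.%3d.%3d "
               "%02x:%02x:%02x:%02x:%02x:%02x %s\n")

def sample_line(dev="eth3"):
    """Render one table row with the patch's format string."""
    src = (10, 124, 196, 205)
    dst = (10, 124, 196, 81)
    mac = (0x00, 0x19, 0x99, 0x00, 0x00, 0x00)  # last three bytes are placeholders
    return LINE_FORMAT % (src + dst + mac + (dev,))

line_len = len(sample_line())
print(line_len, "bytes per row")
print(MAX_ENTRIES * line_len, "bytes for a full table")
print(PAGE_SIZE // line_len, "rows fit in one page")
```

Under these assumptions a full table needs well over three pages, so a single-page show function cannot hold it.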

