George Harvey | 26 Mar 2010 00:21

FTP stalls on tl0 on Challenge S

Hi,

I'm seeing a rather weird problem on a ThunderLAN 100Mb card (tl0) in my
R4400 Challenge S running 5.0.2. I'm in the process of updating various
packages from pkgsrc and noticed that FTP was stalling when fetching
the Vim 7.2 patch file 7.2.304. It didn't matter which server it tried,
it always stalled after 1448 bytes (the file is 1831 bytes). I got
round the problem by diverting the connection through the on-board 10Mb
interface (sq0) but I was curious as to why it failed.

I copied the file on to another local NetBSD box (an Ultra 5), did some
experimenting, and it appears that it's the length that is the
significant factor. If I add a byte, FTP works, if I remove a byte, FTP
works, but at exactly 1831 bytes FTP stalls. Apart from that, the
ThunderLAN card is working fine, I'm running xterm sessions and NFS
mounts through it no problem. Any ideas what's special about a length
of 1831?

George

Jochen Kunz | 26 Mar 2010 08:12

Re: FTP stalls on tl0 on Challenge S

On Thu, 25 Mar 2010 23:21:57 +0000
George Harvey <fr30 <at> dial.pipex.com> wrote:

> It didn't matter which server it tried,
> it always stalled after 1448 bytes (the file is 1831 bytes).
Did you try another FTP client? (wget, curl, ncftp, ...)
1448 bytes + TCP/IP overhead seems to be just a single Ethernet packet.
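
A rough check of that arithmetic, as a minimal sketch. It assumes plain 20-byte IP and TCP headers plus the 12-byte TCP timestamp option (10 bytes of option padded with two NOPs); none of these header sizes are stated in the thread, and the constant names are made up for illustration.

```python
# Assumed header sizes (not stated in the thread).
IP_HEADER = 20             # IPv4 header without options
TCP_HEADER = 20            # TCP header without options
TCP_TIMESTAMP_OPTION = 12  # timestamp option, padded to a 4-byte multiple
ETHERNET_MTU = 1500        # standard Ethernet payload limit


def ip_datagram_size(tcp_payload):
    """Size of the IP datagram carrying tcp_payload bytes of data."""
    return tcp_payload + IP_HEADER + TCP_HEADER + TCP_TIMESTAMP_OPTION


# Under these assumptions 1448 data bytes exactly fill one frame.
print(ip_datagram_size(1448), ETHERNET_MTU)
```

Under those assumptions the first 1448 bytes occupy exactly one full-sized frame, and whatever remains of the file travels in a second, shorter one.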
-- 

tschüß,
       Jochen

Homepage: http://www.unixag-kl.fh-kl.de/~jkunz/

Stephen M. Rumble | 26 Mar 2010 07:43

Re: FTP stalls on tl0 on Challenge S

On Mar 25, 2010, at 4:21 PM, George Harvey wrote:

> I copied the file on to another local NetBSD box (an Ultra 5), did some
> experimenting, and it appears that it's the length that is the
> significant factor. If I add a byte, FTP works, if I remove a byte, FTP
> works, but at exactly 1831 bytes FTP stalls. Apart from that, the
> ThunderLAN card is working fine, I'm running xterm sessions and NFS
> mounts through it no problem. Any ideas what's special about a length
> of 1831?

I can't say I have any good idea of what could be going on, but here are a few things to consider:

1) Have you tried other odd-sized files (e.g. add/subtract 2 bytes from the affected file)? I can't imagine
all odd-length packets are being screwed up, but who knows?

2) Have you looked at tcpdump output on the sgimips and server machines? I'm wondering whether the second
(and final) data packet gets through for the remaining bytes, or if something else is amiss.

If you're not sure what to look for, you could run the following during an attempted transfer and send me the
'out.cap' dumps:
    "tcpdump -s 0 -w out.cap src sgi.box.ip.address" on the server, and
    "tcpdump -s 0 -w out.cap src sparc.server.ip.address" on the tl0 client machine

3) Have you looked at netstat -s? It might be interesting to do 'netstat -s > before', run the failing
transfer for a few seconds, do 'netstat -s > after' and diff the two files. If packets are getting corrupted
somehow, bad checksum counters should increment.

It might be worth trying this on both the client and server, just in case we're screwing up transmitted
packets on the tl0 interface.
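
The before/after comparison in point 3 could be scripted roughly as follows. This is only a sketch: it assumes `netstat -s` prints unindented section headers ending in a colon (such as "tcp:") followed by indented "<count> <description>" lines, and the function names are hypothetical. A counter whose description changes between singular and plural will show up as a new entry rather than as an increment.

```python
import re
import subprocess

# Indented line of the form "<count> <description>".
COUNTER = re.compile(r"^\s+(\d+)\s+(.*)$")


def parse_counters(text):
    """Map (section, description) -> count for one netstat -s snapshot."""
    counters = {}
    section = ""
    for line in text.splitlines():
        # Unindented line ending in ':' is treated as a protocol section header.
        if line and not line[0].isspace() and line.rstrip().endswith(":"):
            section = line.rstrip().rstrip(":")
            continue
        match = COUNTER.match(line)
        if match:
            counters[(section, match.group(2).strip())] = int(match.group(1))
    return counters


def changed_counters(before, after):
    """Return {key: (old, new)} for every counter whose value differs."""
    old, new = parse_counters(before), parse_counters(after)
    return {k: (old.get(k, 0), v) for k, v in new.items() if v != old.get(k, 0)}


def snapshot():
    """Capture the current output of netstat -s as text."""
    result = subprocess.run(["netstat", "-s"], capture_output=True,
                            text=True, check=True)
    return result.stdout
```

Calling snapshot() before the transfer and again a few seconds into the stall, then passing both strings to changed_counters(), should leave only the counters that moved; any entry whose description mentions checksums is the one to look at.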


George Harvey | 26 Mar 2010 23:21

Re: FTP stalls on tl0 on Challenge S

On Thu, 25 Mar 2010 23:43:46 -0700
"Stephen M. Rumble" <rumble <at> cs.stanford.edu> wrote:

> On Mar 25, 2010, at 4:21 PM, George Harvey wrote:
> 
> > I copied the file on to another local NetBSD box (an Ultra 5), did some
> > experimenting, and it appears that it's the length that is the
> > significant factor. If I add a byte, FTP works, if I remove a byte, FTP
> > works, but at exactly 1831 bytes FTP stalls. 

> 1) Have you tried other odd-sized files (e.g. add/subtract 2 bytes from the affected file)? I can't
> imagine all odd-length packets are being screwed up, but who knows?

Lengths of 1828, 1829, 1830, 1832 and 1833 all work; 1831 and 3279 both
stall, so it's looking a bit like the bug strikes with lengths of
(n * 1448) + 383.
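
To make the pattern concrete, here is a small sketch that splits each tested length into TCP segments. It assumes every segment except the last carries a full 1448 bytes, which a sender is not obliged to do; the function name is made up for illustration.

```python
def split_into_segments(length, mss=1448):
    """Payload size of each segment for a transfer of `length` bytes."""
    full, rest = divmod(length, mss)
    sizes = [mss] * full
    if rest:
        sizes.append(rest)
    return sizes


# Lengths reported in this thread; 1831 and 3279 are the ones that stall.
for length in (1828, 1829, 1830, 1831, 1832, 1833, 3279):
    print(length, split_into_segments(length))
```

On that assumption the two failing lengths are the only ones in the list whose final segment carries exactly 383 bytes; the working lengths end in 380, 381, 382, 384 or 385 bytes.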

I also tried with wget and that stalls in the same place.

> 2) Have you looked at tcpdump output on the sgimips and server machines? I'm wondering whether the second
> (and final) data packet gets through for the remaining bytes, or if something else is amiss.

I've got a Wireshark dump which I'll send separately. The final 383
bytes are sent but seem to get lost somewhere at the SGI end.

> 3) Have you looked at netstat -s? It might be interesting to do 'netstat -s > before', run the failing
> transfer for a few seconds, do 'netstat -s > after' and diff the two files. If packets are getting corrupted
> somehow, bad checksum counters should increment.

Good call, the bad checksum count does increment. Wireshark says the IP
and TCP checksums are good so that tends to confirm the problem's in
the SGI.

Stephen M. Rumble | 27 Mar 2010 03:43

Re: FTP stalls on tl0 on Challenge S

On Mar 26, 2010, at 3:21 PM, George Harvey wrote:

> Good call, the bad checksum count does increment. Wireshark says the IP
> and TCP checksums are good so that tends to confirm the problem's in
> the SGI.

Hi George,

Sounds like we may be messing up the receive DMA. Could you try the attached patch (against
src/sys/dev/pci/if_tl.c)? If I recall correctly, I eventually discovered that this NIC's pci<->gio bus
bridge can't DMA across page boundaries. I fixed the TX side, but for some reason neglected RX. I don't
recall if I just forgot to handle the RX case and it worked well enough, or if I believed it wasn't necessary
for some reason.

Anyway, if that doesn't work, I'd suspect that if_tl's bus_dma usage isn't quite correct, but I think it has
been used by others in SGI O2s without trouble.

Let me know if you'd like me to build a kernel. I don't have the toolchain set up, but that'd be easy to do.

Steve
Attachment (if_tl.c.patch): application/octet-stream, 497 bytes

Stephen M. Rumble | 27 Mar 2010 04:30

Re: FTP stalls on tl0 on Challenge S

On Mar 26, 2010, at 7:43 PM, Stephen M. Rumble wrote:

> On Mar 26, 2010, at 3:21 PM, George Harvey wrote:
> 
>> Good call, the bad checksum count does increment. Wireshark says the IP
>> and TCP checksums are good so that tends to confirm the problem's in
>> the SGI.
> 
> Hi George,
> 
> Sounds like we may be messing up the receive DMA. Could you try the attached patch (against
> src/sys/dev/pci/if_tl.c)? If I recall correctly, I eventually discovered that this NIC's pci<->gio bus
> bridge can't DMA across page boundaries. I fixed the TX side, but for some reason neglected RX. I don't
> recall if I just forgot to handle the RX case and it worked well enough, or if I believed it wasn't necessary
> for some reason.

Hrm. Actually, just the RX descriptor ring boundary restriction was missed, not the RX buffers
themselves. It might still be worth trying the patch if it's quick to do, but it doesn't look like the
glaring omission I thought it was.
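
For anyone following along, the page-boundary restriction can be illustrated with a short sketch. This is not the if_tl.c code or the patch; the 4096-byte page size and the function name are assumptions made for the example.

```python
PAGE_SIZE = 4096  # assumed page size, not taken from the driver


def crosses_page_boundary(addr, length, page_size=PAGE_SIZE):
    """True if the region [addr, addr + length) spans more than one page."""
    if length <= 0:
        return False
    first_page = addr // page_size
    last_page = (addr + length - 1) // page_size
    return first_page != last_page


# A 200-byte region starting 96 bytes before a page end straddles two pages.
print(crosses_page_boundary(4000, 200))
# A page-aligned region of exactly one page does not.
print(crosses_page_boundary(4096, 4096))
```

A descriptor ring or buffer for which this check is true would be one that a bridge with such a restriction could not transfer in a single DMA operation.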

Steve

George Harvey | 27 Mar 2010 22:51

Re: FTP stalls on tl0 on Challenge S

On Fri, 26 Mar 2010 20:30:22 -0700
"Stephen M. Rumble" <rumble <at> cs.stanford.edu> wrote:

> On Mar 26, 2010, at 7:43 PM, Stephen M. Rumble wrote:
> 
> > On Mar 26, 2010, at 3:21 PM, George Harvey wrote:
> > 
> >> Good call, the bad checksum count does increment. Wireshark says the IP
> >> and TCP checksums are good so that tends to confirm the problem's in
> >> the SGI.
> > 
> > Hi George,
> > 
> > Sounds like we may be messing up the receive DMA. Could you try the attached patch (against
> > src/sys/dev/pci/if_tl.c)? If I recall correctly, I eventually discovered that this NIC's pci<->gio bus
> > bridge can't DMA across page boundaries. I fixed the TX side, but for some reason neglected RX. I don't
> > recall if I just forgot to handle the RX case and it worked well enough, or if I believed it wasn't necessary
> > for some reason.
> 
> Hrm. Actually, just the RX descriptor ring boundary restriction was missed, not the RX buffers
> themselves. It might still be worth trying the patch if it's quick to do, but it doesn't look like the
> glaring omission I thought it was.

It will probably be Monday before I get a chance to build a kernel but
I'll certainly give it a try and let you know how it goes.

Thanks,
George


George Harvey | 29 Mar 2010 19:18

Re: FTP stalls on tl0 on Challenge S

On Fri, 26 Mar 2010 20:30:22 -0700
"Stephen M. Rumble" <rumble <at> cs.stanford.edu> wrote:

> On Mar 26, 2010, at 7:43 PM, Stephen M. Rumble wrote:
> 
> > On Mar 26, 2010, at 3:21 PM, George Harvey wrote:
> > 
> >> Good call, the bad checksum count does increment. Wireshark says the IP
> >> and TCP checksums are good so that tends to confirm the problem's in
> >> the SGI.

> Hrm. Actually, just the RX descriptor ring boundary restriction was missed, not the RX buffers
> themselves. It might still be worth trying the patch if it's quick to do, but it doesn't look like the
> glaring omission I thought it was.

Tried the patch today and there's no change: FTP still stalls in the
same place.

George

