Andrew Morton | 1 Feb 07:30 2004

Re: [PATCH 2/4] 2.6.1-mm2: sched-domain

Nick Piggin <piggin <at> cyberone.com.au> wrote:
>
> This is the core sched domains patch.

I'm having a ton of trouble with this on the 4-way ppc64 box.  Symptoms are
similar to memory corruption: gcc falls over with sig11, filenames
corrupted, etc.  Often it fails to get through the initscripts without
userspace processes failing randomly.

One example:

 cc -Wall  -Wall -I../include    -c -o search_path.o search_path.c
make[1]: B<lots of random binary garbage>: Command not found
make[1]: *** [search_path.o] Error 127
make[1]: Leaving directory `/mnt/sdb5/ltp-full-20040108/lib'
make: *** [libltp.a] Error 2

and

cc -O -Wall  -w -o  test ./test.c
cc -c -Wall  -w -o  test_arch.o ./test.c
cc -Wall  -w -o  test_D ./test.c
make[4]: *** [test] Segmentation fault
make[4]: *** Deleting file `test'

I'm surprised that nobody else has noticed it.  The results of a binary
search through my current patch queue:

local_bh_enable-warning-fix.patch			OK
pnp-8250_pnp-fix.patch
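
For readers unfamiliar with the method, a binary search over an ordered patch
queue can be sketched as below. This is only an illustration, not the tooling
actually used here: the function name, the is_good callback and the third
patch name are hypothetical. It assumes the unpatched tree is good, the full
queue is bad, and every prefix containing the culprit fails.

```python
from typing import Callable, Sequence


def first_bad_patch(patches: Sequence[str],
                    is_good: Callable[[Sequence[str]], bool]) -> str:
    """Return the first patch whose addition makes the tree fail.

    Assumes is_good([]) is True, is_good(patches) is False, and that
    every prefix containing the bad patch also fails.
    """
    if not patches:
        raise ValueError("empty patch queue")
    lo, hi = 0, len(patches)  # prefix of length lo is good, length hi is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(patches[:mid]):
            lo = mid          # culprit is later in the queue
        else:
            hi = mid          # culprit is within the first mid patches
    return patches[hi - 1]


# Hypothetical queue: the first two names appear in this message,
# the last one is a stand-in for whichever patch is at fault.
queue = [
    "local_bh_enable-warning-fix.patch",
    "pnp-8250_pnp-fix.patch",
    "hypothetical-bad.patch",
]
culprit = first_bad_patch(queue, lambda applied: "hypothetical-bad.patch" not in applied)
```

In practice is_good would apply the prefix, build, boot and run a workload,
so each probe is expensive; the search needs about log2(N) probes for N patches.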

Martin J. Bligh | 1 Feb 07:52 2004

Re: Re: [PATCH 2/4] 2.6.1-mm2: sched-domain

> But Anton says that he's using your scheduler patches without problems.
> 
> mnm:/usr/src/25-power4> /usr/local/ppc64/bin/powerpc-linux-gcc --version
> powerpc-linux-gcc (GCC) 3.2.3 20030211 (prerelease)
> 
> I tried a kernel compiled with -O1 and it failed in the same way, which
> somewhat rules out a compiler bug.
> 
> Anyone have any suggestions as to a next step?

Where'd your PPC toolchain come from? Anton tends to roll his own, IIRC.
What happens if Anton gives you a binary kernel, and you run that?

M.

-------------------------------------------------------
The SF.Net email is sponsored by EclipseCon 2004
Premiere Conference on Open Tools Development and Integration
See the breadth of Eclipse activity. February 3-5 in Anaheim, CA.
http://www.eclipsecon.org/osdn
Andrew Morton | 1 Feb 07:56 2004

Re: Re: [PATCH 2/4] 2.6.1-mm2: sched-domain

"Martin J. Bligh" <mbligh <at> aracnet.com> wrote:
>
> > But Anton says that he's using your scheduler patches without problems.
> > 
> > mnm:/usr/src/25-power4> /usr/local/ppc64/bin/powerpc-linux-gcc --version
> > powerpc-linux-gcc (GCC) 3.2.3 20030211 (prerelease)
> > 
> > I tried a kernel compiled with -O1 and it failed in the same way, which
> > somewhat rules out a compiler bug.
> > 
> > Anyone have any suggestions as to a next step?
> 
> Where'd your PPC toolchain come from? Anton tends to roll his own, IIRC.

I begat it ages ago and I hoard it jealously, because its birth was so
painful.

> What happens if Anton gives you a binary kernel, and you run that?

Dunno, I'll send him a .config.

Anton Blanchard | 1 Feb 07:57 2004

Re: Re: [PATCH 2/4] 2.6.1-mm2: sched-domain


> I'm having a ton of trouble with this on the 4-way ppc64 box.  Symptoms are
> similar to memory corruption: gcc falls over with sig11, filenames
> corrupted, etc.  Often it fails to get through the initscripts without
> userspace processes failing randomly.

...

> points the finger at the core sched-domains patch.
> 
> But Anton says that he's using your scheduler patches without problems.

Do you have the SMT scheduler enabled? Maybe it's a bug with SMT off or
maybe my box wasn't stable and I didn't notice :)

Anton

Rick Lindsley | 1 Feb 10:02 2004

Re: Re: [PATCH 2/4] 2.6.1-mm2: sched-domain

    Do you have the SMT scheduler enabled? Maybe it's a bug with SMT off or
    maybe my box wasn't stable and I didn't notice :)

Good point --  I've been running standard regression tests against it
on x86 and seen nothing like this.  But I haven't enabled SMT.  I just
finished going through the code with Martin and another engineer last
week and am summarizing some thoughts, but there were no major fallovers
like this.

Rick

Nick Piggin | 1 Feb 11:24 2004

Re: Re: [PATCH 2/4] 2.6.1-mm2: sched-domain


Rick Lindsley wrote:

>    Do you have the SMT scheduler enabled? Maybe it's a bug with SMT off or
>    maybe my box wasn't stable and I didn't notice :)
>
>Good point --  I've been running standard regression tests against it
>on x86 and seen nothing like this.  But I haven't enabled SMT.  I just
>finished going through the code with Martin and another engineer last
>week and am summarizing some thoughts, but there were no major fallovers
>like this.
>
>

SMT should be i386 only at this stage so it rules that out.

x86 is pretty stable, that I am sure of. I have run it on quite
a lot of boxes, SMP P4, NUMAQ, SMP P3s, UP, etc plus countless
people testing -mm.

It wouldn't surprise me if there is a problem with another
architecture... but then again as you say Anton is running it
OK so I dunno. Maybe it is a compiler bug?


Andrew Morton | 1 Feb 11:34 2004

Re: Re: [PATCH 2/4] 2.6.1-mm2: sched-domain

Nick Piggin <piggin <at> cyberone.com.au> wrote:
>
>  Andrew, Anton, are you using CONFIG_PREEMPT?

nope.

And we're not sure that Anton has tested the patch much yet.  It could be
that the bug is happening for him too.

Nick Piggin | 1 Feb 11:26 2004

Re: Re: [PATCH 2/4] 2.6.1-mm2: sched-domain


Nick Piggin wrote:

>
>
> Rick Lindsley wrote:
>
>>    Do you have the SMT scheduler enabled? Maybe it's a bug with SMT off or
>>    maybe my box wasn't stable and I didn't notice :)
>>
>> Good point --  I've been running standard regression tests against it
>> on x86 and seen nothing like this.  But I haven't enabled SMT.  I just
>> finished going through the code with Martin and another engineer last
>> week and am summarizing some thoughts, but there were no major fallovers
>> like this.
>>
>>
>
> SMT should be i386 only at this stage so it rules that out.
>
> x86 is pretty stable, that I am sure of. I have run it on quite
> a lot of boxes, SMP P4, NUMAQ, SMP P3s, UP, etc plus countless
> people testing -mm.
>
> It wouldn't surprise me if there is a problem with another
> architecture... but then again as you say Anton is running it
> OK so I dunno. Maybe it is a compiler bug?
>
>

Anton Blanchard | 1 Feb 11:51 2004

Re: Re: [PATCH 2/4] 2.6.1-mm2: sched-domain


> SMT should be i386 only at this stage so it rules that out.

It turns out my testing was done with SMT scheduler on but with a busted
topology. I just fixed it (cpumask_snprintf is broken on big endian,
the output had me confused) and it's blowing up pretty soon in the slab
cache.

Give me a bit, I'll narrow it down. I'll also do a boot with SMT
scheduler off and verify it's OK.

Anton
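
The message does not say what exactly was wrong in cpumask_snprintf. As a
generic illustration only (an assumption, not a description of the actual bug),
here is one way bitmap printing can mislead on a big-endian 64-bit machine:
code that walks an unsigned long bitmap in 32-bit chunks sees the chunks in a
different order depending on byte order. The helper name words32 is hypothetical.

```python
import struct


def words32(bitmap_word: int, byteorder: str) -> tuple:
    """Store a 64-bit bitmap word in memory with the given byte order,
    then read it back as two 32-bit words in increasing address order."""
    prefix = "<" if byteorder == "little" else ">"
    raw = struct.pack(prefix + "Q", bitmap_word)   # 8 bytes as laid out in memory
    return struct.unpack(prefix + "2I", raw)       # two 32-bit chunks, lowest address first


cpu0_only = 1  # bitmap with only CPU 0 set

little = words32(cpu0_only, "little")  # low half comes first in memory
big = words32(cpu0_only, "big")        # high half comes first in memory
```

With only CPU 0 set, the little-endian layout yields (1, 0) and the big-endian
layout yields (0, 1). A formatter that assumes the first chunk holds CPUs 0-31
would therefore show the bit in the wrong half on big endian.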

Anton Blanchard | 2 Feb 00:50 2004

Re: Re: [PATCH 2/4] 2.6.1-mm2: sched-domain


> Give me a bit, I'll narrow it down. I'll also do a boot with SMT
> scheduler off and verify it's OK.

I'm seeing very early slab corruption. Back out sched-* and it boots OK.

sym.0014:02:01.0:0:0: tagged command queuing enabled, command queue depth 16.
sym.0014:02:01.0:0: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 31)
scsi: On host 12 channel 0 id 0 only 128 (max_scsi_report_luns) of 189483851 luns reported, try increasing max_scsi_report_luns.
scsi: host 12 channel 0 id 0 lun 0x5a5a5a5a5a5a5a5a has a LUN larger than currently supported.
scsi: host 12 channel 0 id 0 lun 0x5a5a5a5a5a5a5a5a has a LUN larger than currently supported.

Anton
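
The values in that log are what point at the slab allocator. A LUN of
0x5a5a5a5a5a5a5a5a is a single byte repeated, and to my knowledge 0x5a is the
pattern the slab debug code of this era writes into objects that have been
allocated but not yet initialised (0x6b marks freed objects, 0xa5 the end of
an object). The sketch below only checks for such patterns; the function and
table names are hypothetical, and the reading of the LUN count is an
assumption.

```python
# Poison bytes as I understand the slab debug code to use them (assumption).
SLAB_POISON_BYTES = {
    0x5A: "allocated but not yet initialised",
    0x6B: "freed (use after free)",
    0xA5: "end-of-object marker",
}


def repeated_byte(value: int, width: int = 8):
    """Return the byte if value is that one byte repeated width times, else None."""
    raw = value.to_bytes(width, "big")
    return raw[0] if len(set(raw)) == 1 else None


lun = 0x5A5A5A5A5A5A5A5A                    # LUN value from the log above
poison = repeated_byte(lun)                  # 0x5A
meaning = SLAB_POISON_BYTES.get(poison)      # "allocated but not yet initialised"

# The reported LUN count fits the same picture: a 32-bit length of 0x5a5a5a5a
# divided by the 8-byte size of a REPORT LUNS entry gives exactly that number.
lun_count = 0x5A5A5A5A // 8                  # 189483851
```

Both numbers are consistent with the SCSI code reading a buffer that was
never filled in, rather than with the disk genuinely reporting that many LUNs.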
