Re: Critical crashes on Windows under high load
Martin Aspeli <optilude+lists <at> gmail.com>
2009-11-02 15:11:45 GMT
Stefan Behnel wrote:
>>>> The Deliverance (0.3/trunk) rules use fairly complex xpath expressions. We're
>>>> trying to simplify these, but there's nothing obviously wrong, and in any case
>>>> it shouldn't crash.
>>> XPath shouldn't crash by itself, so I'd rather focus the debugging on the
>>> other things you are doing. Are you running the XPath queries against trees
>>> that are being modified concurrently?
>> It's possible that Deliverance is doing something evil here, but I kind
>> of doubt it. As far as I can tell, this is a Windows-specific problem,
>> or at least no-one seems to have reported it on Unix.
>
> So I assume you ran similar load tests under Unix systems?
No, I wish we could. :(
I'm basing this on the fact that (a) Unix deployments seem more common,
(b) no-one has reported this on Unix that I can see, and (c) I've found
at least one other person with Windows crashes.
But who knows, I could be completely wrong. What I can say for certain
is that the crashes do occur from time to time under relatively normal
usage patterns.
>>> Did you check for memory problems?
>> How would I do that?
>
> I mean, does the process' memory usage grow uncontrolled? If it's running
> out of memory, it's quite possible that it crashes. Not all memory errors
> can be handled safely.
We normally discover the error only after the process has crashed.
There's no pre-warning.
It looks like memory usage is relatively stable when the system is
running normally. I'll try to take a closer look, though.
>>> Could you try to come up with a stripped down set of operations that your
>>> code does using lxml? And which of them happen concurrently?
>> I'm not sure. It'd be difficult.
>
> Who said debugging would come for free?
Heh, true. A *lot* of time has gone into this already. We're talking
about a fairly big stack here, though. What I think we'll try is to
reproduce the problem with a load test suite and a static back end
instead of having Plone in the mix. That should produce a relatively
small WSGI pipeline and a manageable amount of code. If it still
crashes, of course.
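To make that concrete, here is a rough sketch of what the static back end
and an in-process load driver could look like. It only uses the standard
library; the names static_backend and hammer are made up for illustration,
and the real test would wrap the pipeline under test around static_backend
(that wiring is not shown, since it depends on the actual configuration).

```python
import threading
from wsgiref.util import setup_testing_defaults

# Fixed document served for every request (hypothetical content).
STATIC_PAGE = b"<html><body><div id='content'>static content</div></body></html>"


def static_backend(environ, start_response):
    """Stand-in for Plone: always returns the same HTML document."""
    headers = [
        ("Content-Type", "text/html"),
        ("Content-Length", str(len(STATIC_PAGE))),
    ]
    start_response("200 OK", headers)
    return [STATIC_PAGE]


def hammer(app, requests_per_thread=100, threads=4):
    """Call `app` in-process from several threads.

    Returns the number of requests that produced a non-empty body.
    """
    counts = []

    def worker():
        done = 0
        for _ in range(requests_per_thread):
            environ = {}
            setup_testing_defaults(environ)  # fill in a minimal WSGI environ
            chunks = app(environ, lambda status, headers, exc_info=None: None)
            if b"".join(chunks):
                done += 1
        counts.append(done)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return sum(counts)
```

Calling hammer(static_backend) exercises only the back end; passing the
wrapped pipeline instead would put the theming code under the same load.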
>> The crash dialogue doesn't tell me
>> where in lxml the problem is (since there's no stack trace). Deliverance
>> is doing a fair amount of work with lxml (evaluating xpath expressions,
>> parsing the two input trees (theme + content), modifying the output
>> tree).
>
> Is that one tree per thread or are trees being handled by multiple threads?
> If threads don't share data, it can't be a threading issue (at least not
> from the POV of lxml).
One per thread almost certainly. They're read on each request as far as
I can tell.
I'd have to defer to the Deliverance developers, though.
>>>> We've tried to run both multi-threaded and single-threaded 'paster' processes:
>>>> the problem happens with both.
>>> Does that mean that this happens even if you run everything single-threaded?
>> We put the paster processes under which the WSGI pipeline runs into
>> single threaded mode (or at least, we set the threadpool size of each
>> process to 1), so in theory, there shouldn't be any concurrency. I don't
>> know if that's actually the case, though.
>
> It would be helpful if you could find out. In the worst case, you can
> inject a WSGI layer that simply acquires a lock while it forwards the
> request. Then you're sure it's single threaded.
Does anyone know? We're using Paste#httpserver and set threadpool_count
= 1. I assume that means single threaded?
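For what it's worth, the locking layer Stefan describes could look roughly
like this. It's a minimal sketch assuming a plain WSGI callable; the name
SerializingMiddleware is made up for illustration and is not part of Paste
or Deliverance. The response is consumed while the lock is still held, so
an application that generates its body lazily cannot run outside the lock.

```python
import threading


class SerializingMiddleware(object):
    """WSGI wrapper that lets only one request at a time reach `app`."""

    def __init__(self, app):
        self.app = app
        self.lock = threading.Lock()

    def __call__(self, environ, start_response):
        with self.lock:
            result = self.app(environ, start_response)
            try:
                # Drain the iterable inside the lock so that lazily
                # generated output is also produced single-threaded.
                body = list(result)
            finally:
                close = getattr(result, "close", None)
                if close is not None:
                    close()
        return body
```

Wrapping the outermost application, e.g. app = SerializingMiddleware(app),
would guarantee that requests are processed one at a time regardless of
the server's thread pool settings. If the crash still happens with this
in place, concurrency inside the pipeline can be ruled out.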
>> I guess the most constructive thing would be if I could find some better
>> way of debugging this. People closer to the project (and server) where
>> this is happening are working on a load test suite that can reproduce
>> this reliably, though it's pretty much trial and error. The problem is
>> that as of right now, I don't know what I'd do next even if they did
>> make it occur reliably.
>
> Well, at least, if it can be reproduced, it can be tracked down and fixed.
Yeah. That's basically what we're working towards now. But it's not
straightforward, at least not in a way that we can give to other people
to look at.
>> I don't understand how lxml is built, how Cython works, how to write C
>> extensions, or how to do C development on Windows. It's a loooong time
>> since I wrote C/C++ and that was on Linux.
>
> Luckily, you don't have to. lxml is written in Cython, not in C.
But libxml2 and libxslt are. I suppose it's conceivable the problem is
there, or in the way they're statically linked perhaps? Not that I
understand Cython either.
Thanks for your help!
Martin
--
Author of `Professional Plone Development`, a book for developers who
want to work with Plone. See http://martinaspeli.net/plone-book