I have a bit of an interesting scenario, and I thought I'd reach out to everyone here to see if anyone's seen something similar or has any ideas. We've used Celery at my company for 4-5 years now in conjunction with Django. When this codebase started off, we were using Django 1.0, so it's grown "organically", let's call it :). We're currently on Django 1.6 and Celery 3.1.7, but I'm in the process of an upgrade to Django 1.8 and Celery 3.1.18.
On the old versions, everything works fine and has for well over a year without any issue. After upgrading, I've run into some snags with some tasks just dropping off the face of the Earth, and I don't have many ideas on how to track the problem down. Our setup is a bit funky, and I feel like it almost certainly plays into the issue, but I can't quite sort out where it's going wrong.
We've built a wrapper of sorts around the Task calling, so that we can fire tasks a little bit more easily after a database transaction is committed. Effectively, if we're in a transaction, we keep a list of tasks that need to be fired off, and when we commit, we iterate over the list, firing off the requested tasks with their given arguments. We have a number of places in our codebase where we need to do this, calling the same tasks, so making a wrapper made sense, I think. A Django signal is used to trigger the tasks to be executed.
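For context, the wrapper is roughly this shape — a minimal sketch of my own, not our actual code, with all names hypothetical (on Django 1.9+ you'd get this from transaction.on_commit, but that didn't exist on 1.6):

```python
from functools import partial


class TransactionTaskQueue(object):
    """Collect task invocations during a transaction; fire them on commit.

    Hypothetical stand-in for the wrapper described above. In the real
    setup, fire_pending() would be connected to a Django signal emitted
    after the transaction commits.
    """

    def __init__(self):
        self._pending = []
        self.in_transaction = False

    def delay(self, task, *args, **kwargs):
        if self.in_transaction:
            # Hold the call until the transaction commits.
            self._pending.append(partial(task.delay, *args, **kwargs))
        else:
            # No transaction in progress: pass straight through.
            task.delay(*args, **kwargs)

    def fire_pending(self, sender=None, **kwargs):
        # Signal handler: drain the held list and fire everything.
        pending, self._pending = self._pending, []
        for fire in pending:
            fire()
```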
Our setup sits behind a load balancer, in exactly the same configuration as pre-upgrade. RabbitMQ sits behind a simple load balancer, and Celery on the 2 separate servers talks to Rabbit via that load-balanced IP. The 2 Celery nodes also run Apache/mod_wsgi serving our Django stack. I have a Selenium test set up, and I run it through about 80 "checkouts" through our app; roughly 20 of those fail after the upgrade, 0 before. Tracking in the logs, I can see our wrapper calling the delay method of a particular task, but that's the last I see of it. A logging statement right before the call dumps out all the info I expect to see, the next line executes, RabbitMQ never sees the message, and the log statement at the beginning of the task itself is never executed. I also verified that there isn't any form of exception happening on the call, and there's nothing of note in the Celery logs themselves.
Finally, the task this always seems to happen on is one that's fired by a different task. Task A acts as a router: it executes and uses our wrapper to call Task B's delay method (we're not in a transaction at that point, so it's passed straight through, no hold or anything special).
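To make the shape concrete, here's a self-contained sketch with stub objects standing in for the real Celery tasks (all names hypothetical — the stub's delay() just records the call and runs inline, where a real task would publish to the broker):

```python
class StubTask(object):
    """Minimal stand-in for a Celery task, for illustration only."""

    def __init__(self, run):
        self.run = run
        self.delayed_with = []

    def delay(self, *args, **kwargs):
        # A real task would publish a message to RabbitMQ here; the stub
        # records the call and executes inline instead.
        self.delayed_with.append((args, kwargs))
        return self.run(*args, **kwargs)


def fire_task(task, *args, **kwargs):
    # The wrapper: with no transaction open, it's a plain pass-through.
    task.delay(*args, **kwargs)


def _handle(order_id):
    # "Task B": the actual work.
    return "handled %s" % order_id


task_b = StubTask(_handle)


def _route(order_id):
    # "Task A": routes to Task B via the wrapper, not task_b.delay()
    # directly -- this is the hop where our messages vanish.
    fire_task(task_b, order_id)


task_a = StubTask(_route)
```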
Any thoughts on what I could have missed in the upgrade that could have caused something like this? Best I can tell, the task is never called in any way, and I don't see anything useful in tcpdumps, so it doesn't look like a communication problem with Rabbit. I'm hesitant to blame a communication issue anyway, simply because I can revert these upgrades and get back to 100% execution on my tests. Rabbit's working just fine, so I feel like it has to be something else. That said, I've been proven wrong before, and I'm willing to be proven wrong again :)
You received this message because you are subscribed to the Google Groups "celery-users" group.