Mathieu Longtin | 17 Oct 17:37 2014

Tracking and retrying failed tasks

Hi,

I'm investigating Celery to replace Qless. I'm processing a few millions tasks a day, so scale is important. Celery peaked my interest because it seems more mature and maintained.

I don't care about a task result unless it fails. If it fails, I want to keep the tasks around, see it in a dashboard and re-submit it for retry once I fix the problem. I can do that easily with qless.

One thing that I can't seem to figure out is how to get a list of failed tasks. I set up the tasks like this:

<at> app.task(store_errors_even_if_ignored=True, ignore_result=True)

Once a task failed, I can see in it in the result redis, but I don't see any ways to force a retry, nor are there any part of the celery command to explore that.

I looked at flower, but flower only sees task that failed on its watch, nothing that failed before it started. I am guessing it wouldn't be able to keep 200K failed tasks in memory either.

Any pointers?

Thanks

--
You received this message because you are subscribed to the Google Groups "celery-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to celery-users+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
To post to this group, send email to celery-users-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
Visit this group at http://groups.google.com/group/celery-users.
For more options, visit https://groups.google.com/d/optout.
phill | 16 Oct 14:59 2014
Picon

Confirming - It's okay to migrate from 3.0 to 3.1 on the existing backend with pending work in the queue?

I'm migrating from celery 3.0 to 3.1, running on the redis backend with pickle task serialization. I've done some testing where I just bring down the 3.0 queue (with jobs pending) and bring up 3.1 and things seem to work fine. It's really hard to get a large diversity of cases in my testing though so I also wanted to get confirmation that this is a reasonable way to migrate. The "What's New" guide is thorough and doesn't mention this, so I'm presuming so but I was looking for a positive confirmation that this is advisable.

If it's not, I can try to go through a process of draining the 3.0 queue entirely before upgrading or attempting to migrate over tasks but that definitely complicates things, so hoping this simple path is an acceptable one.

--
You received this message because you are subscribed to the Google Groups "celery-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to celery-users+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
To post to this group, send email to celery-users-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
Visit this group at http://groups.google.com/group/celery-users.
For more options, visit https://groups.google.com/d/optout.
Gregory Taylor | 14 Oct 18:42 2014
Picon

Celery fails to reconnect to Redis after service disruptions

We've had service disruptions for our Redis broker the last two nights. One was network maintenance, the other was a software upgrade that caused Redis to refuse connections for about a minute. 

The thing that troubled us is that celery seems to be very inconsistent about recovering from network/service issues when using Redis as a broker. This is very likely something configuration related, we're just not sure where we've gone wrong. If we restart the celery processes, everything goes back to normal. That puts us in a tough situation in that celery (as configured) requires manual intervention to recover.

Here are some version numbers:

django==1.7.1
celery==3.1.15
redis==2.10.3
hiredis==0.1.4
billiard==3.3.0.18
kombu==3.0.23

Here are my Python-land settings: https://gist.github.com/gtaylor/c61e9b4802b094d3aeb4

Here is my supervisor unit (with ansible Jinja2 template variables included): https://gist.github.com/gtaylor/798c370894998377741d

Our gunicorn app servers seem to recover automatically when Redis comes back up, but it seems 50/50 with celery. Sometimes we recover from minor disruptions just fine, other times we have to restart the celery workers. 

Any ideas would be appreciated!

--
You received this message because you are subscribed to the Google Groups "celery-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to celery-users+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
To post to this group, send email to celery-users-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
Visit this group at http://groups.google.com/group/celery-users.
For more options, visit https://groups.google.com/d/optout.
phill | 13 Oct 21:17 2014
Picon

Do database connections really get closed after each task, or before?

I'm running Celery 3.0.X (working on a 3.1 upgrade though!). I'm seeing a number of sleeping database connections which caused me to look into how connections are closed in djcelery. I found a bit of conflicting evidence. Ask says they're closed after task execution ( https://groups.google.com/forum/#!topic/celery-users/_oxdeICeU58 ), but it looks to me like it might be more accurate to say they're closed _before_ task execution : https://github.com/celery/django-celery/blob/master/djcelery/loaders.py#L113

I'm wondering if I'm reading that correctly? We recently added more worker capacity so is it plausible that the sleeping connections I'm seeing are from idle workers who won't close their connections until the next task runs?

Thanks in advance - Phill

--
You received this message because you are subscribed to the Google Groups "celery-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to celery-users+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
To post to this group, send email to celery-users-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
Visit this group at http://groups.google.com/group/celery-users.
For more options, visit https://groups.google.com/d/optout.
Jacky Wang | 11 Oct 05:27 2014
Picon

Celery signal does not work well when specifying a sender

I tried to add a task_success signal to task scan. But the handler < monitor > did not called after scan executed succeed.  I do not know why?

The code is as following shown
<at> app.task()
def scan(text):
    pattern = r'.*\.mp3'
    logger.info('scan text %s', text)
    return re.search(pattern, text).group()
 
<at> signals.task_success.connect(sender=scan)
def monitor(sender, **kwargs):
    logger.info('task scan completed - %s', kwargs['result'])

I have inspected the internal of signal. And found that the signal module maintains a dict of <look up key, receiver>. The look up key is consisting of id(receiver), id(sender). 

Here are related log, the first block is the signal module records the receiver [ Here, it is monitor]. Notice that the id of sender is 62589472

Worker-1 3304   Append receivers - look up key: <(62608664L, 62589472L)>, 
sender: << <at> <at> task: proj.text.tasks.scan of WAETask:0x3d10828>task: proj.text.tasks.scan of WAETask:0x3bb0828>>,
 receiver: <<weakref at 0000000003BB7548; to 'function' at 0000000003BB5518 (monitor)>>

The second block record the sender id during the execution of celery.utils.dispatch.signal.Signal.send()
Worker-1 3304 - Sender is << <at> task: proj.text.tasks.scan of WAETask:0x3b8ad68>>
Worker-1 3304 -  Sender id is <62655008>
 
The ids of sender are different, 62589472 and 62655008 It looks that there are two instance of the sender, 'proj.text.tasks.scan'. 
Why does this happened?

Environment:
Version: v3.1.15
OS: Windows-7-6.1.7601-SP1
Python: 2.7

Configuration
transport:   amqp://guest:** <at> localhost:5679//
results:     mongodb://localhost:27017/
concurrency: 4 (prefork)

--
You received this message because you are subscribed to the Google Groups "celery-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to celery-users+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
To post to this group, send email to celery-users-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
Visit this group at http://groups.google.com/group/celery-users.
For more options, visit https://groups.google.com/d/optout.
phill | 10 Oct 16:35 2014
Picon

Best Pattern for "One at a time" without running into big chain recursion error?

I've read a number of the threads on the large-chain-recursion issue ( https://github.com/celery/celery/issues/1078 )....

The pattern I run into sometimes is that I want to get a bunch of things done (hundreds or small-thousands) and I either don't want to saturate a worker pool, or don't want to clobber some external resource the tasks will content with. For instance I may want to send a few thousand updated records to some external API which will rate-limit me if I get chatty, and which might be slow.

We started creating chains, but ran into the pickle recursion issue (which I understand is a problem even with json because the worker pools will pickle to eachother). I _don't_ need the results of these tasks to be published forward (and if I do I can store their results somewhere to be picked up later), in case that helps.

Curious what pattern people use for this? I can create worker pools with a single worker and use groups, but then I end up with a proliferation of rarely-used pools which is no fun. I can chain the tasks manually (I used to do this before Canvas) and have one call the next as it's finishing but that's fragile (any task fails and the chain breaks).

Are there good workarounds for the recursion error that might get me there? Can I create a chain of smaller chains and avoid this somehow?


Curious how others solve these kinds of problems.

Thanks in advance!

Phill

--
You received this message because you are subscribed to the Google Groups "celery-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to celery-users+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
To post to this group, send email to celery-users-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
Visit this group at http://groups.google.com/group/celery-users.
For more options, visit https://groups.google.com/d/optout.
GianMario Mereu | 10 Oct 15:22 2014
Picon

strange behaviour in chain

Hello everybody,

I facing with a strange behavior with chain. I have defined two tasks:

<at> app.task
def download_province(volte=[1,2], provincia='Lodi'): 
    
    ....
    do stuff ... 
   .....

    return True


<at> app.task
def create_snapshots(provincia='Lodi'):
    
    ....
    do stuff ... 
   .....

    return True


then I create the chain in this way:

task_result = chain( 
    tasks.download_province.subtask(([1], 'Lodi'), immutable=True),
    tasks.create_snapshots.subtask(('Lodi'), immutable=True)
    )()

the first subtask (download_province) is correctly executed while the second (create_snapshots) one is not. Celery is raising the following error:

TypeError: create_snapshots() takes at most 1 argument (4 given)

as the signature of subtask was not immutable. Do I make some mistake that I don't see at all? 

thanks al lot
gmario

--
You received this message because you are subscribed to the Google Groups "celery-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to celery-users+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
To post to this group, send email to celery-users-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
Visit this group at http://groups.google.com/group/celery-users.
For more options, visit https://groups.google.com/d/optout.
Tore Olsen | 10 Oct 14:09 2014
Picon

Keep getting "missed heartbeat"

Hi all,

My celery setup is functioning correctly (AFAICT) but the workers keep logging:
2014-10-10 11:37:40,065: INFO/MainProcess] missed heartbeat from celery <at> apns.<my-ip>

This happens very sporadically; sometimes every few seconds, sometimes after a couple of minutes.

I'd like to understand what's actually causing this, and whether it's something I can ignore or if I should do something about it.

I have three workers and a beat scheduler configured via supervisord running like this:

celery -A cargame beat -n beat.%%h -l INFO

celery worker -A cargame -Q gcm -n gcm.%%h -l INFO

celery worker -A cargame -Q celery -n default.%%h -l INFO

celery worker -A cargame -Q apns -n apns.%%h -l INFO


The workers have very low load in the staging environment where I still get missed heartbeats. (Two of the workers are for push notifications.)


I'm using redis as a broker.

Any input appreciated.

Regards,
Tore

--
You received this message because you are subscribed to the Google Groups "celery-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to celery-users+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
To post to this group, send email to celery-users-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
Visit this group at http://groups.google.com/group/celery-users.
For more options, visit https://groups.google.com/d/optout.
erikcw | 9 Oct 00:10 2014
Picon

What happens to reserved/scheduled tasks if celery is killed?

I've noticed from monitoring celery with flower that tasks that are pending due to being run with a countdown from being retried or being explicitly invoked that way seem to wait with a specific worker being "scheduled".  The list of scheduled tasks for each worker can grow quite long.

What happens to these "scheduled" tasks if that worker is forcefully killed (ie kill -9) or a power failure.  Are these tasks lost, or will they be picked up by another worker?  What about "reserved" tasks?

My broker for this installation is redis in case that make a difference.

Thanks!
Erik

--
You received this message because you are subscribed to the Google Groups "celery-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to celery-users+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
To post to this group, send email to celery-users-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
Visit this group at http://groups.google.com/group/celery-users.
For more options, visit https://groups.google.com/d/optout.
William Kral | 3 Oct 23:51 2014
Picon

ETA/Countdown limitations or resource requirements

I seem to recall reading a while back about some setting that affected the maximum ETA time for a job, but I can't remember what it was.

I was wondering if there is any limits on how far in the future you can schedule jobs or how many future dated jobs you could have and what kind of resources they might consume. Reading about the implementation on an old thread here, it seemed like workers would just keep pulling messages off the queue and internally scheduling them. It wasn't totally clear to me if there are limits to that or how it would scale. So I guess I have a list of questions.

Am I correct that workers would get all the future dated messages off the queue and then schedule the jobs internally for execution?
Would the same jobs end up on multiple workers listening to that queue?
If that was the case, would this mean worker memory consumption would grow with the number of future dated messages?
Would the messages be ACKed at the time of execution?
If a worker with possibly thousands of future dated jobs were to be killed/died I'm assuming those messages will still be on the queue and a new worker would receive them.

I'm planning on a RabbitMQ backend in this implementation but maybe other backends scale better with this approach.

I suppose alternatively jobs could just be saved in a database with an execution date and have a cron add them to the queue if they're going to occur in the next hour if it doesn't scale well to just dump them all on the queue.

--
You received this message because you are subscribed to the Google Groups "celery-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to celery-users+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
To post to this group, send email to celery-users-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
Visit this group at http://groups.google.com/group/celery-users.
For more options, visit https://groups.google.com/d/optout.
Mark Grey | 2 Oct 21:35 2014

Cannot import name int_types

I'm seeing an import error periodically in production from Kombu.

  File "/usr/share/python/datawarehouse/lib/python2.6/site-packages/celery/utils/imports.py", line 53, in instantiate
    return symbol_by_name(name)(*args, **kwargs)
  File "/usr/share/python/datawarehouse/lib/python2.6/site-packages/kombu/utils/__init__.py", line 92, in symbol_by_name
    except ValueError as exc:
  File "/usr/share/python/datawarehouse/lib/python2.6/site-packages/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/usr/share/python/datawarehouse/lib/python2.6/site-packages/celery/app/amqp.py", line 16, in <module>
    from kombu import Connection, Consumer, Exchange, Producer, Queue
  File "/usr/share/python/datawarehouse/lib/python2.6/site-packages/kombu/__init__.py", line 67, in __getattr__
    module = __import__(object_origins[name], None, None, [name])
  File "/usr/share/python/datawarehouse/lib/python2.6/site-packages/kombu/messaging.py", line 16, in <module>
    from .five import int_types, text_t, values
ImportError: cannot import name int_types

I noticed that this int_types value is removed in newer versions on github.  I'm installing celery 3.1.15 via pip, and the versions of the dependencies it's fetching are as follows:

kombu = 3.0.8
billiard = 3.3.0.18
amqp = 1.4.6

Any reason I'd be seeing this error with these version?  What's especially strange is that it's highly intermittent and seems to only manifest every so often.

Any help appreciated as always!

MGII

--
You received this message because you are subscribed to the Google Groups "celery-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to celery-users+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
To post to this group, send email to celery-users-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
Visit this group at http://groups.google.com/group/celery-users.
For more options, visit https://groups.google.com/d/optout.

Gmane