Sage Weil | 1 Feb 2012 01:33

efficient removal of old objects

Currently rgw logs objects it wants to delete after some period of time, 
and an radosgw-admin command comes back later to process the log.  It 
works, but is currently slow (one sync op at a time).

A better approach would be to mark objects for later removal, and have the 
OSD do it in some more efficient way.  wip-objs-expire has a client side 
(librados) interface for this.

I think there are a couple questions:

Should this be generalized to saying "do these osd ops at time X" instead 
of "delete at time X"?  Then it could setxattr, remove, call into a class, 
whatever.

How would the OSD implement this?  A kludgey way would be to do it during 
scrub.  The current scrub implementation may make that problematic because 
it does a whole PG at a time, and we probably don't want to issue a whole 
PG's worth of deletes at a time.  Is there a way to make that less 
painful?  

Not using scrub means we need some sort of index to keep track of objects 
with delayed events.  Using a collection for this might work, but loading 
all this state into memory would be slow if there were too many events 
registered.
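
To make the index idea concrete, here is a minimal sketch of a time-ordered index that hands back expired objects in bounded batches, so a caller never issues a whole PG's worth of deletes at once. Everything in it (ExpiryIndex, mark, take_expired, max_batch) is a hypothetical name for illustration, not the wip-objs-expire interface, and it keeps all state in memory, so it does not address the loading cost raised above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory index of delayed deletions, ordered by expiry time.
class ExpiryIndex {
  std::multimap<uint64_t, std::string> by_time;  // expiry (seconds) -> object id

public:
  // Record that `oid` should be removed once `expires_at` has passed.
  void mark(const std::string& oid, uint64_t expires_at) {
    by_time.emplace(expires_at, oid);
  }

  // Remove and return at most `max_batch` objects whose expiry is <= now,
  // oldest first. Anything left over is picked up by a later call.
  std::vector<std::string> take_expired(uint64_t now, std::size_t max_batch) {
    assert(max_batch > 0);
    std::vector<std::string> out;
    auto it = by_time.begin();
    while (it != by_time.end() && it->first <= now && out.size() < max_batch) {
      out.push_back(it->second);
      it = by_time.erase(it);
    }
    return out;
  }

  std::size_t pending() const { return by_time.size(); }
};
```

A caller would invoke take_expired periodically with a small max_batch and issue one delete per returned id; the batch limit is what keeps the delete rate bounded.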

Given all that, and that we need a solution to the expiration soon 
(weeks), do we
 - do a complete solution now,
 - parallelize radosgw-admin log processing,
 - or hack it into scrub?

Josh Durgin | 1 Feb 2012 01:52

Re: efficient removal of old objects

On 01/31/2012 04:33 PM, Sage Weil wrote:
> Currently rgw logs objects it wants to delete after some period of time,
> and an radosgw-admin command comes back later to process the log.  It
> works, but is currently slow (one sync op at a time).
>
> A better approach would be to mark objects for later removal, and have the
> OSD do it in some more efficient way.  wip-objs-expire has a client side
> (librados) interface for this.
>
> I think there are a couple questions:
>
> Should this be generalized to saying "do these osd ops at time X" instead
> of "delete at time X".  Then it could setxattr, remove, call into a class,
> whatever.
>
> How would the OSD implement this?  A kludgey way would be to do it during
> scrub.  The current scrub implementation may make that problematic because
> it does a whole PG at a time, and we probably don't want to issue a whole
> PG's worth of deletes at a time.  Is there a way to make that less
> painful?
>
> Not using scrub means we need some sort of index to keep track of objects
> with delayed events.  Using a collection for this might work, but loading
> all this state into memory would be slow if there were too many events
> registered.
>
> Given all that, and that we need a solution to the expiration soon
> (weeks), do we
>   - do a complete solution now,
>   - parallelize radosgw-admin log processing,

Tommi Virtanen | 1 Feb 2012 02:02

Re: efficient removal of old objects

On Tue, Jan 31, 2012 at 16:33, Sage Weil <sage <at> newdream.net> wrote:
> Currently rgw logs objects it wants to delete after some period of time,
> and an radosgw-admin command comes back later to process the log.  It
> works, but is currently slow (one sync op at a time).
>
> A better approach would be to mark objects for later removal, and have the
> OSD do it in some more efficient way.  wip-objs-expire has a client side
> (librados) interface for this.

Is there some reason why this would be significantly more performant
when done by the OSD itself? It seems like the deletion times can be
bucketed by time nicely, then each bucket just contains a set of ids
-- a good fit for the map data type -- and the client for running this
deletion just streams the bucket contents over and issues delete
messages for everything. What makes that inherently slow?
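
The bucketing scheme described above could look roughly like the following sketch. It is an in-memory model only: DeletionBuckets, schedule and drain_due are hypothetical names, and a real client would keep the buckets in RADOS objects and issue a delete for each id that drain_due returns.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical client-side deletion log: object ids grouped into
// fixed-width time buckets, keyed by the start time of each bucket.
class DeletionBuckets {
  uint64_t width;                                     // bucket width in seconds
  std::map<uint64_t, std::set<std::string>> buckets;  // bucket start -> ids

public:
  explicit DeletionBuckets(uint64_t width_secs) : width(width_secs) {
    assert(width_secs > 0);
  }

  // File `oid` under the bucket that contains its deletion time.
  void schedule(const std::string& oid, uint64_t delete_at) {
    buckets[delete_at - delete_at % width].insert(oid);
  }

  // Return the ids of every bucket that has fully elapsed by `now`
  // and drop those buckets. The caller issues the actual deletes.
  std::vector<std::string> drain_due(uint64_t now) {
    std::vector<std::string> due;
    auto it = buckets.begin();
    while (it != buckets.end() && it->first + width <= now) {
      due.insert(due.end(), it->second.begin(), it->second.end());
      it = buckets.erase(it);
    }
    return due;
  }
};
```

Because each bucket is just a set of ids, the deleting client can stream one bucket at a time and keep as many deletes in flight as it likes.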

> Should this be generalized to saying "do these osd ops at time X" instead
> of "delete at time X".  Then it could setxattr, remove, call into a class,
> whatever.

That sounds like a really complex API, for quite marginal gain.

To make my point even clearer: point me to another data store that has
that idiom.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo <at> vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Josh Durgin | 1 Feb 2012 02:19

Re: efficient removal of old objects

(sorry for the extra email)

On 01/31/2012 04:33 PM, Sage Weil wrote:
> Currently rgw logs objects it wants to delete after some period of time,
> and an radosgw-admin command comes back later to process the log.  It
> works, but is currently slow (one sync op at a time).
>
> A better approach would be to mark objects for later removal, and have the
> OSD do it in some more efficient way.  wip-objs-expire has a client side
> (librados) interface for this.
>
> I think there are a couple questions:
>
> Should this be generalized to saying "do these osd ops at time X" instead
> of "delete at time X".  Then it could setxattr, remove, call into a class,
> whatever.

What are some other use cases for this? It may be useful in the future,
but if the only immediate use is speeding up radosgw-admin, I don't think
it's worth further complicating the osd and all the layers above it.

> How would the OSD implement this?  A kludgey way would be to do it during
> scrub.  The current scrub implementation may make that problematic because
> it does a whole PG at a time, and we probably don't want to issue a whole
> PG's worth of deletes at a time.  Is there a way to make that less
> painful?

This would also tie it to scrub actually happening. This means osds
with high load would never process the operations, unless you disable
the load check, in which case you slow down heavily loaded osds with
scrubbing.

胡瀚森 | 1 Feb 2012 05:13

Problem while reading the paper about CRUSH

Hi everyone.
I'm new here and interested in the CRUSH algorithm, which is a part of
Ceph, so I'm reading the paper at
http://ssrc.ucsc.edu/Papers/weil-sc06.pdf
Unfortunately I'm not a native English speaker, so I have some
problems understanding the content. I hope somebody here can help.

1. I encountered the word "decluster", which cannot be found in a
dictionary. I think it has something to do with "cluster", but
I'm not able to tell its actual meaning.
Appearance: page 1
Further, existing randomized distribution schemes that decluster
replication by spreading each disk’s replicas across many other
devices suffer from a high probability of data loss from coincident
device failures.
Appearance: page 2
We say that CRUSH generates a declustered distribution of replicas in
that the set of devices sharing replicas for one item also appears to
be independent of all other items.

2. I don't understand "per-device weight value"
Appearance: page 2
The CRUSH algorithm distributes data objects among storage devices
according to a per-device weight value, approximating a uniform
probability distribution.

3. What does the word "metric" mean?
Appearance: page 3
This results in a one-dimensional placement metric, weight, which
should be derived from the device’s capabilities.

Mark Kampe | 1 Feb 2012 05:34

Re: Problem while reading the paper about CRUSH

De-cluster means ensuring that objects which all have one copy on a single
volume have their other copies spread all over the cloud.  This enables
many-to-many recovery.

The weights bias selection, e.g. so we can discourage placement on a device that is more full.

A "metric" is a unit or means of measuring an interesting quantity.
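
As a rough illustration of how a per-device weight can bias selection, the sketch below picks one device per key so that, in expectation, a device with twice the weight receives about twice as many keys. This is a simplified stand-in and not the CRUSH algorithm from the paper: mix64 and pick_device are hypothetical names, and the hash is a generic 64-bit mixer rather than the one CRUSH uses.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Generic 64-bit mixing function (the splitmix64 finalizer), used here
// only as a stand-in for a real placement hash.
inline uint64_t mix64(uint64_t x) {
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

// Pick one device index for `key`. Each device draws a deterministic
// pseudo-random score scaled by its weight, and the highest score wins.
inline std::size_t pick_device(uint64_t key, const std::vector<double>& weights) {
  assert(!weights.empty());
  std::size_t best = 0;
  double best_score = -std::numeric_limits<double>::infinity();
  for (std::size_t d = 0; d < weights.size(); ++d) {
    assert(weights[d] > 0.0);
    uint64_t h = mix64(key ^ mix64(static_cast<uint64_t>(d)));
    // Map the top 53 bits of the hash to a value u in (0, 1].
    double u = (static_cast<double>(h >> 11) + 1.0) / 9007199254740992.0;
    // log(u) <= 0, so dividing by a larger weight moves the score toward 0.
    double score = std::log(u) / weights[d];
    if (score > best_score) {
      best_score = score;
      best = d;
    }
  }
  return best;
}
```

The choice depends only on the key and the weights, so every client computes the same placement without consulting a central table, which is the property the paper is after.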
---mark---

Yehuda Sadeh Weinraub | 1 Feb 2012 09:04

Re: efficient removal of old objects

(resending to list, sorry tv)

On Tue, Jan 31, 2012 at 5:02 PM, Tommi Virtanen
<tommi.virtanen <at> dreamhost.com> wrote:
>
> On Tue, Jan 31, 2012 at 16:33, Sage Weil <sage <at> newdream.net> wrote:
> > Currently rgw logs objects it wants to delete after some period of time,
> > and an radosgw-admin command comes back later to process the log.  It
> > works, but is currently slow (one sync op at a time).
> >
> > A better approach would be to mark objects for later removal, and have the
> > OSD do it in some more efficient way.  wip-objs-expire has a client side
> > (librados) interface for this.
>
> Is there some reason why this would be significantly more performant
> when done by the OSD itself? It seems like the deletion times can be
> bucketed by time nicely, then each bucket just contains a set of ids
> -- a good fit for the map data type -- and the client for running this
> deletion just streams the bucket contents over and issues delete
> messages for everything. What makes that inherently slow?

Random access to random cold objects is generally slower than doing
the operations on a single pg. E.g., if doing it as part of the scrub,
then objects are accessed anyway and are hopefully cached.
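
One way to picture the locality argument is to group the pending deletions by placement group before acting on them, so that each PG's objects are handled together rather than in random order. The sketch below is an assumption-laden illustration: pg_of and group_by_pg are hypothetical helpers, and std::hash with a plain modulo stands in for Ceph's actual object-to-PG mapping, which is not reproduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical placement: hash the object name onto one of pg_num PGs.
// This is a stand-in, not Ceph's real mapping.
inline uint32_t pg_of(const std::string& oid, uint32_t pg_num) {
  assert(pg_num > 0);
  return static_cast<uint32_t>(std::hash<std::string>{}(oid) % pg_num);
}

// Group pending deletions by PG so that each PG's objects can be
// processed in one pass.
inline std::map<uint32_t, std::vector<std::string>>
group_by_pg(const std::vector<std::string>& oids, uint32_t pg_num) {
  std::map<uint32_t, std::vector<std::string>> groups;
  for (const auto& oid : oids) {
    groups[pg_of(oid, pg_num)].push_back(oid);
  }
  return groups;
}
```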

>
> > Should this be generalized to saying "do these osd ops at time X" instead
> > of "delete at time X".  Then it could setxattr, remove, call into a class,
> > whatever.
>

Yehuda Sadeh Weinraub | 1 Feb 2012 09:26

Re: efficient removal of old objects

On Tue, Jan 31, 2012 at 4:33 PM, Sage Weil <sage <at> newdream.net> wrote:
> Currently rgw logs objects it wants to delete after some period of time,
> and an radosgw-admin command comes back later to process the log.  It
> works, but is currently slow (one sync op at a time).

Intent log generation doesn't come free of charge; it adds some load
on the system.

>
> A better approach would be to mark objects for later removal, and have the
> OSD do it in some more efficient way.  wip-objs-expire has a client side
> (librados) interface for this.

Note that setting expiration on an object is a more lightweight
operation than appending the intent log, as it would be done as a sub
op in the compound operation that created the object.

>
> I think there are a couple questions:
>
> Should this be generalized to saying "do these osd ops at time X" instead
> of "delete at time X".  Then it could setxattr, remove, call into a class,
> whatever.

While I think it'd make a nice feature, I also think that the problem
space of garbage collection is a bit different, and given the time
constraints it wouldn't make sense to implement this right now anyway.
>
> How would the OSD implement this?  A kludgey way would be to do it during
> scrub.  The current scrub implementation may make that problematic because

Jim Schutt | 1 Feb 2012 16:54

[RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

Hi,

FWIW, I've been trying to understand op delays under very heavy write
load, and have been working a little with the policy throttler in hopes of
using throttling delays to help track down which ops were backing up.
Without much success, unfortunately.

When I saw the wip-osd-op-tracking branch, I wondered if any of this
stuff might be helpful.  Here it is, just in case.

-- Jim

Jim Schutt (6):
  msgr: print message sequence number and tid when receiving message
    envelope
  common/Throttle: track sleep/wake sequences in Throttle, report them
    for policy throttler
  common/Throttle: throttle in FIFO order
  common/Throttle: FIFO throttler doesn't need to signal waiters when
    max changes
  common/Throttle: make get() report number of waiters on entry/exit
  msg: log Message interactions with throttler

 src/common/Throttle.h      |   75 +++++++++++++++++++++++++++++++-------------
 src/msg/Message.h          |   71 +++++++++++++++++++++++++++++++++++------
 src/msg/SimpleMessenger.cc |   22 +++++++++---
 3 files changed, 129 insertions(+), 39 deletions(-)


Jim Schutt | 1 Feb 2012 16:54

[RFC PATCH 3/6] common/Throttle: throttle in FIFO order

Under heavy write load from many clients, many reader threads will
be waiting in the policy throttler, all on a single condition variable.
When a wakeup is signalled, any of those threads may receive the
signal.  This increases the variance in the message processing
latency, and in extreme cases can significantly delay a message.

This patch causes threads to exit a throttler in the same order
they entered.

Signed-off-by: Jim Schutt <jaschut <at> sandia.gov>
---
 src/common/Throttle.h |   42 ++++++++++++++++++++++++++++--------------
 1 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/src/common/Throttle.h b/src/common/Throttle.h
index 10560bf..ca72060 100644
--- a/src/common/Throttle.h
+++ b/src/common/Throttle.h
@@ -3,23 +3,31 @@

 #include "Mutex.h"
 #include "Cond.h"
+#include <list>

 class Throttle {
-  int64_t count, max, waiting;
+  int64_t count, max;
   uint64_t sseq, wseq;
   Mutex lock;
-  Cond cond;
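
Since the diff is cut off in this archive, here is a self-contained sketch of the FIFO idea the patch description states: each waiter gets its own condition variable, queued in arrival order, and only the waiter at the front is ever woken. It uses the standard library instead of Ceph's Mutex and Cond, and the class name, the fits() rule and current() are assumptions for illustration, not the contents of the patch.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <list>
#include <mutex>

// Sketch of a throttle whose waiters leave in the order they arrived.
class FifoThrottle {
  int64_t count = 0;
  int64_t max;
  std::mutex lock;
  std::list<std::condition_variable*> waiters;  // front = longest waiting

  // Assumed admission rule: a request fits if the throttle is empty
  // (so an oversized request cannot block forever) or stays within max.
  bool fits(int64_t c) const { return count == 0 || count + c <= max; }

public:
  explicit FifoThrottle(int64_t m) : max(m) { assert(m > 0); }

  // Block until `c` units fit and every earlier waiter has been served.
  void get(int64_t c) {
    assert(c >= 0);
    std::unique_lock<std::mutex> l(lock);
    if (!waiters.empty() || !fits(c)) {
      std::condition_variable cv;
      waiters.push_back(&cv);
      cv.wait(l, [&] { return waiters.front() == &cv && fits(c); });
      waiters.pop_front();
      // The next waiter may fit as well; let it re-check once we unlock.
      if (!waiters.empty()) waiters.front()->notify_one();
    }
    count += c;
  }

  // Return `c` units and wake only the waiter at the head of the queue.
  void put(int64_t c) {
    assert(c >= 0);
    std::lock_guard<std::mutex> l(lock);
    assert(count >= c);
    count -= c;
    if (!waiters.empty()) waiters.front()->notify_one();
  }

  int64_t current() {
    std::lock_guard<std::mutex> l(lock);
    return count;
  }
};
```

With a single shared condition variable any sleeping thread may win the wakeup; queueing one condition variable per waiter is what pins the exit order to the entry order, at the cost of a list node per blocked thread.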
