Patrick Barnes | 1 Feb 2010 07:43

Trees and leaf nodes

I have a large tree of documents that I want to display in a dynamic 
treeview, but I'm not sure how best to determine whether a given node 
has children or not.

Each document has a 'path' element giving its place in the tree. For 
example, the following doc has the parent 'AA', which in turn has the 
top-level parent 'A'.
doc = {
     '_id':'11203',
     'path':['A','AA','11203']
     ...
};

Fetching and displaying the entire tree is simple... this view:
function(doc) { emit(doc.path, doc.display_name); }
emits in lexicographical order, and some php code turns it into a nested 
list.
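
A minimal sketch of the nesting step, written in JavaScript rather than the PHP mentioned above. It assumes each view row looks like { key: [path segments], value: display_name }; the helper name buildTree and the sample display names are made up for illustration.

```javascript
// Hypothetical helper: build a nested tree from view rows shaped like
// { key: ['A', 'AA', '11203'], value: 'display name' }.
function buildTree(rows) {
  const root = { children: {} };
  for (const row of rows) {
    let node = root;
    // Walk down the path, creating any missing intermediate nodes.
    for (const segment of row.key) {
      if (!node.children[segment]) {
        node.children[segment] = { name: null, children: {} };
      }
      node = node.children[segment];
    }
    // The last segment is the document itself; record its display name.
    node.name = row.value;
  }
  return root;
}

// Sample rows (invented data), in the order the view would emit them.
const sampleRows = [
  { key: ['A'], value: 'Group A' },
  { key: ['A', 'AA'], value: 'Group AA' },
  { key: ['A', 'AA', '11203'], value: 'Some document' },
];
const tree = buildTree(sampleRows);
console.log(JSON.stringify(tree, null, 2));
```

Because intermediate nodes are created on demand, the sketch does not depend on the rows arriving in sorted order, although the view does return them that way.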

The entire tree is slow to retrieve and display in one go. With jquery 
treeview-async, I retrieve and display only the top-level nodes on page 
load, then on expanding a node it will call the server with the ID, 
asking for children.

I have a view that returns just the immediate children of the given key:
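function(doc) {
	if (doc.doc_type == 'group') {
		// Emit under the parent's path: the doc's own path minus its last element.
		// slice() copies the array; pop() would modify doc.path itself.
		emit(doc.path.slice(0, -1), doc.display_name);
	}
}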

Brian Candler | 1 Feb 2010 12:04

Re: Trees and leaf nodes

On Mon, Feb 01, 2010 at 05:43:29PM +1100, Patrick Barnes wrote:
> It looks messy if the node has a little 'expand' icon next to it,
> though - but there are no actual children below it. Is it possible
> to create a view that will emit as value whether each node has
> children?

I think the client needs to make two queries. Under a given node:

1. Get the list of the children
2. Determine whether each child also has children

You can do (2) using a single multi-key query, e.g.

  {keys:[c1, c2, c3, ...]}

Using your existing view you'll get all the grandchildren, which you
probably don't care about but can just check whether there are any or not.

To reduce the size of the returned data you could make a grouped query
instead.  With a suitable reduce function on your view, then you could get
back a count of the number of children for each child, which might be useful
to display in your UI anyway.
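
A rough sketch of what that could look like, assuming the children view emits one row per child, keyed by the parent's path. The names countChildren and groupedQueryBody are only for illustration: in a design document the reduce would be an anonymous function, and group=true would go in the query string next to the keys body.

```javascript
// Count reduce for the children view. On the first pass, `values` holds one
// entry per emitted row; on rereduce it holds partial counts to add up.
function countChildren(keys, values, rereduce) {
  if (rereduce) {
    return values.reduce((total, n) => total + n, 0);
  }
  return values.length;
}

// Request body for the multi-key query: one key per child path.
function groupedQueryBody(childPaths) {
  return JSON.stringify({ keys: childPaths });
}

// Simulated calls, to show the two phases of the reduce.
console.log(countChildren(null, ['x', 'y', 'z'], false)); // first pass over 3 rows
console.log(countChildren(null, [3, 2], true));           // rereduce of partial counts
console.log(groupedQueryBody([['A', 'AA'], ['A', 'AB']]));
```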

HTH,

Brian.

Patrick Barnes | 1 Feb 2010 14:15

Re: Trees and leaf nodes

Thanks, that sounds like it should work. The only boundary case for a query 
to get the number of children is that if there are no children for that 
key it will return a blank result, not a '0' record.
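
One way to handle that boundary case on the client is to start every requested key at 0 and overwrite it with whatever rows come back. A minimal sketch, assuming the grouped result rows look like { key: [...], value: count }; childCounts is a made-up helper name.

```javascript
// Hypothetical helper: map every requested key to a child count,
// defaulting to 0 for keys the grouped query returned no row for.
function childCounts(requestedKeys, rows) {
  // Array keys are compared by their JSON text, since arrays are not
  // equal by value in JavaScript.
  const counts = new Map(requestedKeys.map((k) => [JSON.stringify(k), 0]));
  for (const row of rows) {
    counts.set(JSON.stringify(row.key), row.value);
  }
  return counts;
}

const requested = [['A', 'AA'], ['A', 'AB']];
const returnedRows = [{ key: ['A', 'AA'], value: 2 }]; // no row for ['A','AB']
const counts = childCounts(requested, returnedRows);
console.log(counts.get(JSON.stringify(['A', 'AB'])));
```

A node whose count is 0 can then be drawn without the 'expand' icon.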

I thought - hey, if I only cared about presence or absence of children I 
could just use the get_children view, passing it key=$parent_id & 
limit=0 ... but the view summary (at least in 0.10.0) only returns total 
row count and offset, not how many rows would have been returned if 
limit=0 had not been present.

... okay, having just typed the last paragraph I realise - even if 
feasible, that would only work using one key at a time. *yawns*

But the grouped query makes sense, I'll test it tomorrow.

-P

On 1/02/2010 10:04 PM, Brian Candler wrote:
> On Mon, Feb 01, 2010 at 05:43:29PM +1100, Patrick Barnes wrote:
>> It looks messy if the node has a little 'expand' icon next to it,
>> though - but there are no actual children below it. Is it possible
>> to create a view that will emit as value whether each node has
>> children?
>
> I think the client needs to make two queries. Under a given node:
>
> 1. Get the list of the children
> 2. Determine whether each child also has children
>
> You can do (2) using a single multi-key query, e.g.

Karel Minařík | 1 Feb 2010 15:24

Re: How to import data quickly

Hi,

just for info, on a current project I needed to import 6 million+ docs,  
and the sweet spot was 10K docs per batch upload. Higher values  
gave worse results. I don't have the numbers handy, but it took a couple  
of hours to convert the docs from CSV and bulk upload them into Couch,  
I guess like 8hrs (on a rather old IBM Blade machine)... (And the real  
pain was handling malformed CSV parts, patching FasterCSV to not choke  
on it, etc.)
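
For illustration, splitting a document list into 10K batches could look like the sketch below. Each yielded object has the { docs: [...] } shape that CouchDB's _bulk_docs endpoint accepts; the generator name bulkBatches and the sample documents are made up, and the actual HTTP upload is left out.

```javascript
// Hypothetical helper: yield _bulk_docs request bodies of at most
// batchSize documents each.
function* bulkBatches(docs, batchSize = 10000) {
  for (let i = 0; i < docs.length; i += batchSize) {
    yield { docs: docs.slice(i, i + batchSize) };
  }
}

// Invented sample data: 25,000 small documents.
const sampleDocs = Array.from({ length: 25000 }, (_, i) => ({ _id: String(i) }));
const batchSizes = [...bulkBatches(sampleDocs)].map((body) => body.docs.length);
console.log(batchSizes);
```

The best batch size depends on the hardware and document size, so the default of 10000 here just mirrors the figure above.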

Karel

On 28.Jan, 2010, at 15:02 , Troy Kruthoff wrote:

> Just curious, what batch size did you use...  I was just getting to  
> run some test data to see where the sweet spot is for our hardware,  
> I remember reading somewhere that someone thought it was around 3k  
> docs.
>
> Troy
>
>
> On Jan 28, 2010, at 4:21 AM, Sean Clark Hess wrote:
>
>> Sweet... down to 28 minutes with bulk. Thanks
>>
>> On Thu, Jan 28, 2010 at 4:25 AM, Sean Clark Hess  
>> <seanhess@...> wrote:
>>
>>> Ah, I forgot about bulk! Thanks!

Karel Minařík | 1 Feb 2010 15:27

Re: Measuring duration of view index building

Hi Roger,

>> Of course, the best thing in my case would be, if couch itself would log
>> something like "start/end building index for _design/mydoc" -- but as
>> far as I know, there's no way to do that?
> Look in the log file.

Apologies, I am severely behind on my inbox. I was stupid enough to  
overlook the log entries for view index building. But how would you  
parse that relatively easily? If there isn't something à la "Starting  
building XXX", it's rather cumbersome to work with. I guess `time  
curl ...` is still my friend.

> Do a GET against /_active_tasks which will tell you if view indexing is
> running as well as progress.

And I was stupid enough to overlook that, and the nice GUI in Futon,  
thanks!!

Karel

Adam Wolff | 1 Feb 2010 17:24

Re: Trees and leaf nodes

If you only care about whether or not there are children, you could reduce
to the count of children for each key.

A

On Mon, Feb 1, 2010 at 5:15 AM, Patrick Barnes <mrtrick@...> wrote:

> Thanks, that sounds like should work. The only boundary case for a query to
> get the number of children is that if there are no children for that key it
> will return a blank result, not a '0' record.
>
> I thought - hey, if I only cared about presence or absence of children I
> could just use the get_children view, passing it key=$parent_id & limit=0
> ... but the view summary (at least in 0.10.0) only returns total row count
> and offset, not how many rows would have been returned if limit=0 had not
> been present.
>
> ... okay, having just typed the last paragraph I realise - even if
> feasible, that would only work using one key at a time. *yawns*
>
> But the grouped query makes sense, I'll test it tomorrow.
>
> -P
>
>
>
> On 1/02/2010 10:04 PM, Brian Candler wrote:
>
>> On Mon, Feb 01, 2010 at 05:43:29PM +1100, Patrick Barnes wrote:
>>

Santi Saez | 1 Feb 2010 17:27

Best way to store 2^32 IPs in CouchDB


Hi,

I'm doing some initial tests with CouchDB, trying to store 2^32 IP 
addresses (approximately 4.3 billion documents).

Documents have only required fields: _id and _rev, but I've noticed that 
the minimum space occupied by each document is approximately 3.7KB, so I 
need +14TB disk space only for the basic scheme without any extra field 
(using IP as unique identifier in integer format).

Note that playing with a simple Python script and a binary data file, 
this data can be stored in 16GB of space (4 bytes per IP * 2^32 addresses).
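
The arithmetic behind those two figures, as a small sketch. It assumes 3.7KB means 3.7 * 1024 bytes and reports sizes in binary units (GiB, TiB); ipToInt is an illustrative helper for the integer format mentioned above.

```javascript
// Convert a dotted-quad IPv4 address to its unsigned 32-bit integer value.
function ipToInt(ip) {
  return ip.split('.').reduce((acc, octet) => acc * 256 + Number(octet), 0);
}

const addressCount = 2 ** 32;

// Flat binary file: 4 bytes per address.
const flatFileGiB = (4 * addressCount) / 2 ** 30;

// CouchDB estimate: roughly 3.7KB per document (assumed 3.7 * 1024 bytes).
const couchTiB = (3.7 * 1024 * addressCount) / 2 ** 40;

console.log(ipToInt('192.168.0.1'));
console.log(flatFileGiB, 'GiB for the flat file');
console.log(couchTiB.toFixed(1), 'TiB for the CouchDB estimate');
```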

Is it possible to optimize the disk space for what I'm trying to do 
using CouchDB? Perhaps disabling "something", compressing, or changing 
_rev field format/size.. thanks!!

I have read the manual for CouchDB performance, but I didn't get it:

http://wiki.apache.org/couchdb/Performance

Regards,

-- 
Santi Saez
http://woop.es

Robert Newson | 1 Feb 2010 17:31

Re: Best way to store 2^32 IPs in CouchDB

Try database compaction?

B.

On Mon, Feb 1, 2010 at 4:27 PM, Santi Saez <santisaez@...> wrote:
>
> Hi,
>
> I'm doing some initial tests with CouchDB, trying to store 2^32 IP addresses
> (approximately 4.3 billions of documents).
>
> Documents have only required fields: _id and _rev, but I've noticed that the
> minimum space occupied by each document is approximately 3.7KB, so I need
> +14TB disk space only for the basic scheme without any extra field (using IP
> as unique identifier in integer format).
>
> Note that playing with a simple Python script and a binary data file, this
> data can be stored in 16GB space (each IP 4 = bytes * 2 ^32 addresses).
>
> Is it possible to optimize the disk space for what I'm trying to do using
> CouchDB? Perhaps disabling "something", compressing, or changing _rev field
> format/size.. thanks!!
>
> I haver read the manual for CouchDB perfomance, but I didn't get it:
>
> http://wiki.apache.org/couchdb/Performance
>
> Regards,
>
> --

Elf | 1 Feb 2010 17:32

Re: Best way to store 2^32 IPs in CouchDB

Do you plan to handle IPv6 in future versions of your program? :)

2010/2/1 Santi Saez <santisaez@...>:
>
> Hi,
>
> I'm doing some initial tests with CouchDB, trying to store 2^32 IP addresses
> (approximately 4.3 billions of documents).
>
> Documents have only required fields: _id and _rev, but I've noticed that the
> minimum space occupied by each document is approximately 3.7KB, so I need
> +14TB disk space only for the basic scheme without any extra field (using IP
> as unique identifier in integer format).
>
> Note that playing with a simple Python script and a binary data file, this
> data can be stored in 16GB space (each IP 4 = bytes * 2 ^32 addresses).
>
> Is it possible to optimize the disk space for what I'm trying to do using
> CouchDB? Perhaps disabling "something", compressing, or changing _rev field
> format/size.. thanks!!
>
> I haver read the manual for CouchDB perfomance, but I didn't get it:
>
> http://wiki.apache.org/couchdb/Performance
>
> Regards,
>
> --
> Santi Saez
> http://woop.es

Santi Saez | 1 Feb 2010 17:52

Re: Best way to store 2^32 IPs in CouchDB

On 01/02/10 17:31, Robert Newson wrote:

> Try database compaction?

I have tried database compaction on another test server (Debian Lenny 
box) using CouchDB 0.8.0-2, and the disk size is the same after 
compaction:

# curl http://localhost:5984/test
{"db_name":"test","doc_count":15999,"doc_del_count":0,"update_seq":15999,"compact_running":false,"disk_size":60330312}

# curl -X POST http://localhost:5984/test/_compact
{"ok":true}

# curl http://localhost:5984/test
{"db_name":"test","doc_count":15999,"doc_del_count":0,"update_seq":15999,"compact_running":false,"disk_size":60330312}

According to the documentation [1]: "Compaction rewrites the database 
file, removing outdated document revisions and deleted documents".

So that's expected, because in my test I have not deleted or updated any 
documents, only inserted them.
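
As a cross-check, the per-document figure can be derived from the database info shown above by dividing disk_size by doc_count. A minimal sketch using the JSON exactly as curl returned it:

```javascript
// Database info as returned by `curl http://localhost:5984/test` above.
const info = JSON.parse(
  '{"db_name":"test","doc_count":15999,"doc_del_count":0,' +
  '"update_seq":15999,"compact_running":false,"disk_size":60330312}'
);

// Average on-disk bytes per document, including B-tree and header overhead.
const bytesPerDoc = info.disk_size / info.doc_count;
console.log(Math.round(bytesPerDoc), 'bytes per document');
console.log((bytesPerDoc / 1024).toFixed(2), 'KB per document');
```

This is only an average over 15999 documents, so it says nothing certain about how the overhead scales to 2^32 documents.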

Thanks!

[1] http://wiki.apache.org/couchdb/Compaction

-- 
Santi Saez
http://woop.es

