Emmanuel Tabard | 6 Feb 09:10
Picon

Re: [sql] memory consumption

Feb  6 07:16:53 ks380342 kernel: Out of memory: Kill process 29782 (imdbpy2sql.py) score 718 or sacrifice child
Feb  6 07:16:53 ks380342 kernel: Killed process 29782 (imdbpy2sql.py) total-vm:18212980kB,
anon-rss:18149024kB, file-rss:64kB

Greedy boy :D 

Le 27 janv. 2012 à 09:18, Davide Alberani a écrit :

> On Thu, Jan 26, 2012 at 18:50, Emmanuel Tabard <manu@...> wrote:
>> 
>> # TIME FINAL : 1223min, 59sec (wall) 1095min, 16sec (user) 13min, 7sec
>> (system)
> 
> :-)
> 
>> You can notice that :
>>  - title 84% success
>>  - name 99% success
> 
> For sure movie titles change more and faster than user names (a user name
> can change only if a typo was found, if the person legally change his name or
> if a second person with the same name/surname is added to the db)
> 
> Anyway, 84% is a little too high... I guess some kind of titles (tv
> series episodes,
> maybe?) are not handled correctly.
> 
> 
> Thanks for your help!
> 
(Continue reading)

Davide Alberani | 7 Feb 09:20
Picon
Gravatar

Re: [sql] memory consumption

On Mon, Feb 6, 2012 at 09:10, Emmanuel Tabard <manu@...> wrote:
>
> Feb  6 07:16:53 ks380342 kernel: Out of memory: Kill process 29782 (imdbpy2sql.py) score 718 or
sacrifice child
> Feb  6 07:16:53 ks380342 kernel: Killed process 29782 (imdbpy2sql.py) total-vm:18212980kB,
anon-rss:18149024kB, file-rss:64kB
>
> Greedy boy :D

Eheh... :-)
As usual, I'm really busy right now... I hope to have time to give it
a look this weekend.

--

-- 
Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
Davide Alberani | 11 Feb 19:49
Picon
Gravatar

Re: [sql] memory consumption

On Tue, Feb 7, 2012 at 09:20, Davide Alberani
<davide.alberani@...> wrote:
>
> As usual, I'm really busy right now... I hope to have time to give it
> a look this weekend.

Ehi, snowstorms buy you a lot of free time... :-P

It was easier that I thought, mostly thanks to the fact the we already have
md5 checksum of names and title (a more or less recent feature).

In the mercurial repository there's a draft of solution.

How it works:
- titles/names with imdbID are stored in a dbm database, using their
md5 as keys.
- at restore time, imdbIDs are restored in batches of 10000 each time.

Notes:
- by default, the database are created in the current directory (and
not deleted);
  there's now the '-t dir' command line argument, to specify a
temporary directory.
- I've not tested it with huge amounts of data: if it's slow or fails,
let me know
  if it's while storing or restoring the IDs (and the error message).
- 10.000 entries for a batch is *totally* arbitrary: we've to choose a
good compromise
  between performances and the maximum size of a query.
- the batch is executed as a single query, like:
(Continue reading)

Emmanuel Tabard | 11 Feb 22:06
Picon

Re: [sql] memory consumption

Hey !

Ehi, snowstorms buy you a lot of free time... :-P

:D

How it works:
- titles/names with imdbID are stored in a dbm database, using their
md5 as keys.
- at restore time, imdbIDs are restored in batches of 10000 each time.

Looks nice !!!

It seems that you load all the datas in memory before storing it in the temp databases. 
"cls.select(ISNOTNULL(cls.q.imdbID))"

Maybe you should save the imdbids by batch of 10000entries ?


Tell me if you need the complete database dump to test with tons of datas !

Thank you very much for your help !


Le 11 févr. 2012 à 19:49, Davide Alberani a écrit :

On Tue, Feb 7, 2012 at 09:20, Davide Alberani <davide.alberani-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

As usual, I'm really busy right now... I hope to have time to give it
a look this weekend.

Ehi, snowstorms buy you a lot of free time... :-P

It was easier that I thought, mostly thanks to the fact the we already have
md5 checksum of names and title (a more or less recent feature).

In the mercurial repository there's a draft of solution.

How it works:
- titles/names with imdbID are stored in a dbm database, using their
md5 as keys.
- at restore time, imdbIDs are restored in batches of 10000 each time.

Notes:
- by default, the database are created in the current directory (and
not deleted);
 there's now the '-t dir' command line argument, to specify a
temporary directory.
- I've not tested it with huge amounts of data: if it's slow or fails,
let me know
 if it's while storing or restoring the IDs (and the error message).
- 10.000 entries for a batch is *totally* arbitrary: we've to choose a
good compromise
 between performances and the maximum size of a query.
- the batch is executed as a single query, like:
     UPDATE table SET imdb_id = CASE md5sum WHEN 'md5_1' THEN
'imdbID1' ... END WHERE md5sum IN ('md5_1', md5_2', ...)
 I don't really know if this syntax is valid for every SQL databases...
- I've simplified the code, maybe too much.
- I've not tested it with CSV support.

As usual, any test, bug report, comment and so on is welcome.


--
Davide Alberani <davide.alberani-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@...
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
Davide Alberani | 12 Feb 08:44
Picon
Gravatar

Re: [sql] memory consumption

On Sat, Feb 11, 2012 at 22:06, Emmanuel Tabard <manu@...> wrote:
>
> Looks nice !!!

Well, it's not exactly nice & clean code... let's hope it works.

> It seems that you load all the datas in memory before storing it in the temp
> databases.
> "cls.select(ISNOTNULL(cls.q.imdbID))"
>
> Maybe you should save the imdbids by batch of 10000entries ?

If you believe at the theory, the return value of a 'select' method is
a lazy object, and so it should not consume much memory.
It's true that it's then cycled over, one item at a time, and so it
may be slow (in which case, we can directly use a cursors and
process the results in batches).

> Tell me if you need the complete database dump to test with tons of datas !

It won't hurt... :)
A dump of "SELECT md5sum, imdb_id" from the 'name', 'title', 'char_name'
and 'company_name' will be enough (not on the public list, please :))

Bye,

--

-- 
Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
Emmanuel Tabard | 12 Feb 14:20
Picon

Re: [sql] memory consumption



Le 12 févr. 2012 à 08:44, Davide Alberani a écrit :

Well, it's not exactly nice & clean code... let's hope it works.

If prefer nice features to clean code :D

If you believe at the theory, the return value of a 'select' method is
a lazy object, and so it should not consume much memory.
It's true that it's then cycled over, one item at a time, and so it
may be slow (in which case, we can directly use a cursors and
process the results in batches).

Fair enough !
When it was selecting all the not null ids, the memory of the process grows up and the size of the .db never grows up.
My theory is that dbm save on close ? Does that make sense ?



It seems that you load all the datas in memory before storing it in the temp
databases.
"cls.select(ISNOTNULL(cls.q.imdbID))"

Maybe you should save the imdbids by batch of 10000entries ?

If you believe at the theory, the return value of a 'select' method is
a lazy object, and so it should not consume much memory.
It's true that it's then cycled over, one item at a time, and so it
may be slow (in which case, we can directly use a cursors and
process the results in batches).

Tell me if you need the complete database dump to test with tons of datas !

It won't hurt... :)
A dump of "SELECT md5sum, imdb_id" from the 'name', 'title', 'char_name'
and 'company_name' will be enough (not on the public list, please :))


Bye,

--
Davide Alberani <davide.alberani-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@...
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
Davide Alberani | 12 Feb 14:56
Picon
Gravatar

Re: [sql] memory consumption

On Sun, Feb 12, 2012 at 14:20, Emmanuel Tabard <manu@...> wrote:
>
> Fair enough !
> When it was selecting all the not null ids, the memory of the process grows
> up and the size of the .db never grows up.
> My theory is that dbm save on close ? Does that make sense ?

Strange (even if, being anydbm a generic interface to various underlying
modules, you can never tell).

This simple snippet, on my system, creates a 1.2 Gb files and in the process
the memory in not used much (besides for caches, but it doesn't matter):

#!/usr/bin/env python
import time
import anydbm

long_string = 'LALALALA' * 1024
db = anydbm.open('/tmp/big.db', 'n')
for x in xrange(100000):
    x = str(x)
    db[x] = long_string

print 'INSERT'
db.close()
print 'CLOSE'
time.sleep(10)
print 'DONE'
sys.exit()
#======================

I fear that the leak is in the cycle on the result of the 'select'. :-/

--

-- 
Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
Emmanuel Tabard | 12 Feb 15:52
Picon

Re: [sql] memory consumption

I was wondering, why don't you use the original dbs ?

Something like that takes 3 seconds: 

"CREATE TABLE title_extract SELECT imdb_id, md5sum FROM title WHERE imdb_id IS NOT NULL
CREATE TABLE name_extract SELECT imdb_id, md5sum FROM name WHERE imdb_id IS NOT NULL
"

And use your query to restore.

Should be freaking fast ...

Le 12 févr. 2012 à 14:56, Davide Alberani a écrit :

> On Sun, Feb 12, 2012 at 14:20, Emmanuel Tabard <manu@...> wrote:
>> 
>> Fair enough !
>> When it was selecting all the not null ids, the memory of the process grows
>> up and the size of the .db never grows up.
>> My theory is that dbm save on close ? Does that make sense ?
> 
> Strange (even if, being anydbm a generic interface to various underlying
> modules, you can never tell).
> 
> This simple snippet, on my system, creates a 1.2 Gb files and in the process
> the memory in not used much (besides for caches, but it doesn't matter):
> 
> #!/usr/bin/env python
> import time
> import anydbm
> 
> long_string = 'LALALALA' * 1024
> db = anydbm.open('/tmp/big.db', 'n')
> for x in xrange(100000):
>    x = str(x)
>    db[x] = long_string
> 
> print 'INSERT'
> db.close()
> print 'CLOSE'
> time.sleep(10)
> print 'DONE'
> sys.exit()
> #======================
> 
> I fear that the leak is in the cycle on the result of the 'select'. :-/
> 
> 
> -- 
> Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
> http://www.mimante.net/

------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
Emmanuel Tabard | 12 Feb 16:50
Picon

Re: [sql] memory consumption

Here is a little workarround :

-- Extract imdb_id and md5sum  (6sec)
CREATE TABLE title_extract SELECT imdb_id, md5sum FROM title WHERE imdb_id IS NOT NULL;
CREATE TABLE name_extract SELECT imdb_id, md5sum FROM name WHERE imdb_id IS NOT NULL;

-- Add indexes (12sec)
ALTER TABLE title_extract ADD INDEX md5sum_idx (md5sum)
ALTER TABLE name_extract ADD INDEX md5sum_idx (md5sum)

-- Reset imdb ids ...
UPDATE title SET imdb_id = NULL;
UPDATE name SET imdb_id = NULL;

-- Restore imdb ids for movies (2min)
UPDATE title
INNER JOIN title_extract USING (md5sum)
SET title.imdb_id = title_extract.imdb_id

-- Restore imdb ids for people (5min)
UPDATE name
INNER JOIN name_extract USING (md5sum)
SET name.imdb_id = name_extract.imdb_id

Total time save/restore : less than 10minutes

Le 12 févr. 2012 à 15:52, Emmanuel Tabard a écrit :

> I was wondering, why don't you use the original dbs ?
> 
> Something like that takes 3 seconds: 
> 
> "CREATE TABLE title_extract SELECT imdb_id, md5sum FROM title WHERE imdb_id IS NOT NULL
> CREATE TABLE name_extract SELECT imdb_id, md5sum FROM name WHERE imdb_id IS NOT NULL
> "
> 
> And use your query to restore.
> 
> Should be freaking fast ...
> 
> Le 12 févr. 2012 à 14:56, Davide Alberani a écrit :
> 
>> On Sun, Feb 12, 2012 at 14:20, Emmanuel Tabard <manu@...> wrote:
>>> 
>>> Fair enough !
>>> When it was selecting all the not null ids, the memory of the process grows
>>> up and the size of the .db never grows up.
>>> My theory is that dbm save on close ? Does that make sense ?
>> 
>> Strange (even if, being anydbm a generic interface to various underlying
>> modules, you can never tell).
>> 
>> This simple snippet, on my system, creates a 1.2 Gb files and in the process
>> the memory in not used much (besides for caches, but it doesn't matter):
>> 
>> #!/usr/bin/env python
>> import time
>> import anydbm
>> 
>> long_string = 'LALALALA' * 1024
>> db = anydbm.open('/tmp/big.db', 'n')
>> for x in xrange(100000):
>>   x = str(x)
>>   db[x] = long_string
>> 
>> print 'INSERT'
>> db.close()
>> print 'CLOSE'
>> time.sleep(10)
>> print 'DONE'
>> sys.exit()
>> #======================
>> 
>> I fear that the leak is in the cycle on the result of the 'select'. :-/
>> 
>> 
>> -- 
>> Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
>> http://www.mimante.net/
> 

------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
Davide Alberani | 12 Feb 19:23
Picon
Gravatar

Re: [sql] memory consumption

On Sun, Feb 12, 2012 at 16:50, Emmanuel Tabard <manu@...> wrote:
>
> Here is a little workarround :

Well, that's a very interesting solution, thanks. :-)

Anyway, I have to think a little about it: storing the data in the
filesystem granted
us the possibility to split the "CSV mode" into 2 separated steps:
first all data are
put in CSV/pickle/dbm files, then the db is re-created.

Hmmm... mumble, mumble... (d'oh, the weekend is over!)

--

-- 
Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/

Gmane