Ron Peacetree | 1 Oct 2005 01:40
Picon
Favicon

Re: [PERFORM] A Better External Sort?

25MBps should not be a CPU bound limit for IO, nor should it be
an OS limit.  It should be something ~100x (Single channel RAM)
to ~200x (dual channel RAM) that.

For an IO rate of 25MBps to be pegging the CPU at 100%, the CPU
is suffering some combination of
A= lot's of cache misses ("cache thrash"), 
B= lot's of random rather than sequential IO (like pointer chasing)
C= lot's of wasteful copying
D= lot's of wasteful calculations

In fact, this is crappy enough performance that the whole IO layer
should be rethought and perhaps reimplemented from scratch.
Optimization of the present code is unlikely to yield a 100-200x
improvement.

On the HD side, the first thing that comes to mind is that DBs are
-NOT- like ordinary filesystems in a few ways:
1= the minimum HD IO is a record that is likely to be larger than
a HD sector.  Therefore, the FS we use should be laid out with
physical segments of max(HD sector size, record size)

2= DB files (tables) are usually considerably larger than any other
kind of files stored.  Therefore the FS we should use should be laid
out using LARGE physical pages.  64KB-256KB at a _minimum_.

3= The whole "2GB striping" of files idea needs to be rethought.
Our tables are significantly different in internal structure from the
usual FS entity.

(Continue reading)

Jim C. Nasby | 1 Oct 2005 01:48

Re: Query in SQL statement

On Thu, Sep 29, 2005 at 09:28:38PM +0800, Christopher Kings-Lynne wrote:
> 
> >CREATE SEQUENCE ai_id;
> >CREATE TABLE badusers (
> >  id int DEFAULT nextval('ai_id') NOT NULL,
> >  UserName varchar(30),
> >  Date  datetime DEFAULT '0000-00-00 00:00:00' NOT NULL,
> >  Reason varchar(200),
> >  Admin varchar(30) DEFAULT '-',
> >  PRIMARY KEY (id),
> >  KEY UserName (UserName),
> >  KEY Date (Date)
> >);
> >
> >
> >Am always getting foll. Errors,
> >
> >ERROR:  relation "ai_id" already exists
> >ERROR:  syntax error at or near "(" at character 240
> 
> You have just copied the Mysql code to Postgresql.  It will in no way 
> work.  Your default for 'Date' is illegal in postgresql and hence it 
> must allow NULLs.  There is no such thing as a 'datetime' type.  There 
> is no such thing as 'Key'.  Also your mixed case identifiers won't be 
> preserved.  You want:
> 
> CREATE TABLE badusers (
>   id SERIAL PRIMARY KEY,
>   UserName varchar(30),
>   Date  timestamp,
(Continue reading)

Dann Corbit | 1 Oct 2005 01:52

Re: [PERFORM] A Better External Sort?

I have perused the tuple sort stuff.

The good:
The documentation of the sort algorithm from Knuth's TAOCP was
beautifully done.  Everyone who writes an algorithm should credit the
original source like this, and also where it deviates.  That was done
very nicely.

The bad:
With random access, tape style merging is not necessary.  A priority
queue based merge will be a lot faster.

The UGLY:
Please, someone, tell me I was hallucinating.  Is that code really
READING AND WRITING THE WHOLE TUPLE with every sort exchange?!  Maybe
there is a layer of abstraction that I am somehow missing.  I just can't
imagine that is what it is really doing.  If (somehow) it really is
doing that, a pointer based sort which forms a permutation based upon
the keys, would be a lot better.

The fundamental algorithm itself could also be improved somewhat.

Here is a {public domain} outline for an introspective quick sort that
Pete Filander and I wrote some time ago and contributed to FastDB.  It
is written as a C++ template, but it will take no effort to make it a
simple C routine.  It assumes that e_type has comparison operators, so
in C you would use a compare function instead.

/*
** Sorting stuff by Dann Corbit and Pete Filandr.
(Continue reading)

Gregory Maxwell | 1 Oct 2005 03:44
Picon
Gravatar

Re: [PERFORM] A Better External Sort?

On 9/30/05, Ron Peacetree <rjpeace <at> earthlink.net> wrote:
> 4= I'm sure we are paying all sorts of nasty overhead for essentially
> emulating the pg "filesystem" inside another filesystem.  That means
> ~2x as much overhead to access a particular piece of data.
>
> The simplest solution is for us to implement a new VFS compatible
> filesystem tuned to exactly our needs: pgfs.
>
> We may be able to avoid that by some amount of hacking or
> modifying of the current FSs we use, but I suspect it would be more
> work for less ROI.

On this point, Reiser4 fs already implements a number of things which
would be desirable for PostgreSQL. For example: write()s to reiser4
filesystems are atomic, so there is no risk of torn pages (this is
enabled because reiser4 uses WAFL like logging where data is not
overwritten but rather relocated). The filesystem is modular and
extensible so it should be easy to add whatever additional semantics
are needed.  I would imagine that all that would be needed is some
more atomicity operations (single writes are already atomic, but I'm
sure it would be useful to batch many writes into a transaction),some
layout and packing controls, and some flush controls.  A step further
would perhaps integrate multiversioning directly into the FS (the
wandering logging system provides the write side of multiversioning, a
little read side work would be required.). More importantly: the file
system was intended to be extensible for this sort of application.

It might make a good 'summer of code' project for someone next year,
... presumably by then reiser4 will have made it into the mainline
kernel by then. :)
(Continue reading)

Gregory Maxwell | 1 Oct 2005 04:07
Picon
Gravatar

Re: [PERFORM] A Better External Sort?

On 9/28/05, Ron Peacetree <rjpeace <at> earthlink.net> wrote:
> 2= We use my method to sort two different tables.  We now have these
> very efficient representations of a specific ordering on these tables.  A
> join operation can now be done using these Btrees rather than the
> original data tables that involves less overhead than many current
> methods.

If we want to make joins very fast we should implement them using RD
trees. For the example cases where a join against a very large table
will produce a much smaller output, a RD tree will provide pretty much
the optimal behavior at a very low memory cost.

On the subject of high speed tree code for in-core applications, you
should check out http://judy.sourceforge.net/ . The performance
(insert, remove, lookup, AND storage) is really quite impressive.
Producing cache friendly code is harder than one might expect, and it
appears the judy library has already done a lot of the hard work. 
Though it is *L*GPLed, so perhaps that might scare some here away from
it. :) and good luck directly doing joins with a LC-TRIE. ;)

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Dann Corbit | 1 Oct 2005 07:32

Re: [PERFORM] A Better External Sort?

Judy definitely rates a WOW!!

> -----Original Message-----
> From: pgsql-hackers-owner <at> postgresql.org [mailto:pgsql-hackers-
> owner <at> postgresql.org] On Behalf Of Gregory Maxwell
> Sent: Friday, September 30, 2005 7:07 PM
> To: Ron Peacetree
> Cc: Jeffrey W. Baker; pgsql-hackers <at> postgresql.org; pgsql-
> performance <at> postgresql.org
> Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
> 
> On 9/28/05, Ron Peacetree <rjpeace <at> earthlink.net> wrote:
> > 2= We use my method to sort two different tables.  We now have these
> > very efficient representations of a specific ordering on these
tables.
> A
> > join operation can now be done using these Btrees rather than the
> > original data tables that involves less overhead than many current
> > methods.
> 
> If we want to make joins very fast we should implement them using RD
> trees. For the example cases where a join against a very large table
> will produce a much smaller output, a RD tree will provide pretty much
> the optimal behavior at a very low memory cost.
> 
> On the subject of high speed tree code for in-core applications, you
> should check out http://judy.sourceforge.net/ . The performance
> (insert, remove, lookup, AND storage) is really quite impressive.
> Producing cache friendly code is harder than one might expect, and it
> appears the judy library has already done a lot of the hard work.
(Continue reading)

Tom Lane | 1 Oct 2005 08:01
Picon

Re: [PERFORM] A Better External Sort?

"Jeffrey W. Baker" <jwbaker <at> acm.org> writes:
> I think the largest speedup will be to dump the multiphase merge and
> merge all tapes in one pass, no matter how large M.  Currently M is
> capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
> the tape.  It could be done in a single pass heap merge with N*log(M)
> comparisons, and, more importantly, far less input and output.

I had more or less despaired of this thread yielding any usable ideas
:-( but I think you have one here.  The reason the current code uses a
six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first
edition) shows that there's not much incremental gain from using more
tapes ... if you are in the regime where number of runs is much greater
than number of tape drives.  But if you can stay in the regime where
only one merge pass is needed, that is obviously a win.

I don't believe we can simply legislate that there be only one merge
pass.  That would mean that, if we end up with N runs after the initial
run-forming phase, we need to fit N tuples in memory --- no matter how
large N is, or how small work_mem is.  But it seems like a good idea to
try to use an N-way merge where N is as large as work_mem will allow.
We'd not have to decide on the value of N until after we've completed
the run-forming phase, at which time we've already seen every tuple
once, and so we can compute a safe value for N as work_mem divided by
largest_tuple_size.  (Tape I/O buffers would have to be counted too
of course.)

It's been a good while since I looked at the sort code, and so I don't
recall if there are any fundamental reasons for having a compile-time-
constant value of the merge order rather than choosing it at runtime.
My guess is that any inefficiencies added by making it variable would
(Continue reading)

Dann Corbit | 1 Oct 2005 08:32

Re: [PERFORM] A Better External Sort?

> -----Original Message-----
> From: pgsql-hackers-owner <at> postgresql.org [mailto:pgsql-hackers-
> owner <at> postgresql.org] On Behalf Of Tom Lane
> Sent: Friday, September 30, 2005 11:02 PM
> To: Jeffrey W. Baker
> Cc: Luke Lonergan; Josh Berkus; Ron Peacetree; pgsql-
> hackers <at> postgresql.org; pgsql-performance <at> postgresql.org
> Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
> 
> "Jeffrey W. Baker" <jwbaker <at> acm.org> writes:
> > I think the largest speedup will be to dump the multiphase merge and
> > merge all tapes in one pass, no matter how large M.  Currently M is
> > capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes
over
> > the tape.  It could be done in a single pass heap merge with
N*log(M)
> > comparisons, and, more importantly, far less input and output.
> 
> I had more or less despaired of this thread yielding any usable ideas
> :-( but I think you have one here.  

I believe I made the exact same suggestion several days ago.

>The reason the current code uses a
> six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first
> edition) shows that there's not much incremental gain from using more
> tapes ... if you are in the regime where number of runs is much
greater
> than number of tape drives.  But if you can stay in the regime where
> only one merge pass is needed, that is obviously a win.
(Continue reading)

Simon Riggs | 1 Oct 2005 11:29
Favicon
Gravatar

Re: [PERFORM] A Better External Sort?

On Sat, 2005-10-01 at 02:01 -0400, Tom Lane wrote:
> "Jeffrey W. Baker" <jwbaker <at> acm.org> writes:
> > I think the largest speedup will be to dump the multiphase merge and
> > merge all tapes in one pass, no matter how large M.  Currently M is
> > capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
> > the tape.  It could be done in a single pass heap merge with N*log(M)
> > comparisons, and, more importantly, far less input and output.
> 
> I had more or less despaired of this thread yielding any usable ideas
> :-( but I think you have one here.  The reason the current code uses a
> six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first
> edition) shows that there's not much incremental gain from using more
> tapes ... if you are in the regime where number of runs is much greater
> than number of tape drives.  But if you can stay in the regime where
> only one merge pass is needed, that is obviously a win.
> 
> I don't believe we can simply legislate that there be only one merge
> pass.  That would mean that, if we end up with N runs after the initial
> run-forming phase, we need to fit N tuples in memory --- no matter how
> large N is, or how small work_mem is.  But it seems like a good idea to
> try to use an N-way merge where N is as large as work_mem will allow.
> We'd not have to decide on the value of N until after we've completed
> the run-forming phase, at which time we've already seen every tuple
> once, and so we can compute a safe value for N as work_mem divided by
> largest_tuple_size.  (Tape I/O buffers would have to be counted too
> of course.)
> 
> It's been a good while since I looked at the sort code, and so I don't
> recall if there are any fundamental reasons for having a compile-time-
> constant value of the merge order rather than choosing it at runtime.
(Continue reading)

hubert depesz lubaczewski | 1 Oct 2005 12:01
Picon

Re: database bloat, but vacuums are done, and fsm seems to be setup ok

On 9/30/05, Jim C. Nasby <jnasby <at> pervasive.com> wrote:

Looks like it's definately an issue with index bloat. Note that it's
normal to have some amount of empty space depending on vacuum and update
frequency, so 15G -> 20G isn't terribly surprising. I would suggest
using pg_autovacuum instead of the continuous vacuum; it's very possible
that some of your tables need more frequent vacuuming than they're
getting now. If you go this route, you might want to change the default
settings a bit to make pg_autovacuum more agressive.


actually i have a very bad experience with autovaccum - of course it is because i dont know how to setup it correctly, but for me it's just easier to setup continuos vacuums. and i know which tables are frequently updated, so i setup additional vacuums on them.
 

Also, I'd suggest posting to -hackers about the index bloat. Would you
be able to make a filesystem copy (ie: tar -cjf database.tar.bz2
$PGDATA) available? It might also be useful to keep an eye on index size
in pg_class.relpages and see exactly what indexes are bloating.


i'm watching it right now (which indices are bloating), but i cannot send copy of pgdata - it contains very sensitive information.
 
depesz

Gmane