henk de wit | 1 Dec 2007 14:21
Picon
Favicon

Utilizing multiple cores for one query

I wonder whether the current versions of postgres (i.e. either 8.2 or 8.3) are able to utilize multiple cores for the execution of a single query?

This is one thing that systems like SQL Server and Oracle have been able to do for quite some time. I haven't seen much in the documentation that hints that this may be possible in PG, nor did I find much in the mailinglists about this. The only thing I found was a topic that discussed some patches that may eventually lead to a sequence scan being handled by multiple cores.

Could someone shed some light on the current or future abilities of PG for making use of multiple cores to execute a single query?

Thanks in advance


Express yourself instantly with MSN Messenger! MSN Messenger
Jonah H. Harris | 1 Dec 2007 14:31
Picon
Gravatar

Re: Utilizing multiple cores for one query

On Dec 1, 2007 8:21 AM, henk de wit <henk53602 <at> hotmail.com> wrote:
> I wonder whether the current versions of postgres (i.e. either 8.2 or 8.3)
> are able to utilize multiple cores for the execution of a single query?

Nope.

> This is one thing that systems like SQL Server and Oracle have been able to
> do for quite some time. I haven't seen much in the documentation that hints
> that this may be possible in PG, nor did I find much in the mailinglists
> about this. The only thing I found was a topic that discussed some patches
> that may eventually lead to a sequence scan being handled by multiple cores.

I believe the threads you're talking about were related to scanning,
not parallel query.  Though, when Qingqing and I were discussing
parallel query a little over a year ago, I do seem to recall several
uninformed opinions stating that sequential scans were the only thing
it could be useful for.

> Could someone shed some light on the current or future abilities of PG for
> making use of multiple cores to execute a single query?

Currently, the only way to parallelize a query in Postgres is to use pgpool-II.

http://pgpool.projects.postgresql.org/

--

-- 
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation                | fax: 732.331.1301
499 Thornall Street, 2nd Floor          | jonah.harris <at> enterprisedb.com
Edison, NJ 08837                        | http://www.enterprisedb.com/

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

               http://www.postgresql.org/docs/faq

henk de wit | 1 Dec 2007 15:42
Picon
Favicon

Re: Utilizing multiple cores for one query

> > I wonder whether the current versions of postgres (i.e. either 8.2 or 8.3)
> > are able to utilize multiple cores for the execution of a single query?
> Nope.

I see, thanks for the clarification.

Btw, in this thread: http://archives.postgresql.org/pgsql-performance/2007-10/msg00159.php

the following is said:

>You can determine what runs in parellel based on the
>indentation of the output.
>Items at the same indentation level under the same
>"parent" line will run in parallel

Wouldn't this offer some opportunities for running things on multiple cores? Based on the above, many people already seem to think that PG is able to utilize multiple cores for 1 query. Of course, it can be easily "proved" that this does not happen by simply watching at the CPU utilization graphs when executing a query. Nevertheless, those people may wonder why (some of) those items that already run in parallel not actually run in parallel using multiple cores?

> Currently, the only way to parallelize a query in Postgres is to use pgpool-II.
>
> http://pgpool.projects.postgresql.org/

Yes, I noticed this project before. At the time it was not really clear how stable and/or how well supported this is. It indeed seems to support parallel queries automatically by being able to rewrite standard queries. It does seem it needs different DB nodes and is thus probably not able to use multiple cores of a single DBMS. Also, I could not really find how well pgpool-II is doing at making judgments of the level of parallelization it's going to use. E.g. when there are 16 nodes in the system with a currently low utilization, a single query may be split into 16 pieces. On the other hand, when 8 of these nodes are heavily utilized, splitting to 8 pieces might be better. etc.

Anyway, are there any plans for postgresql to support parallelizing queries natively?



Express yourself instantly with MSN Messenger! MSN Messenger
Jonah H. Harris | 1 Dec 2007 17:39
Picon
Gravatar

Re: Utilizing multiple cores for one query

On Dec 1, 2007 9:42 AM, henk de wit <henk53602 <at> hotmail.com> wrote:
> Wouldn't this offer some opportunities for running things on multiple cores?

No, it's not actually parallel in the same sense.

> Yes, I noticed this project before. At the time it was not really clear how
> stable and/or how well supported this is. It indeed seems to support
> parallel queries automatically by being able to rewrite standard queries. It
> does seem it needs different DB nodes and is thus probably not able to use
> multiple cores of a single DBMS.

I've seen it actually set up to use multiple connections to the same
DBMS.  How well it would work is pretty much dependent on your
application and the amount of parallelization you could actually gain.

> Also, I could not really find how well
> pgpool-II is doing at making judgments of the level of parallelization it's
> going to use. E.g. when there are 16 nodes in the system with a currently
> low utilization, a single query may be split into 16 pieces. On the other
> hand, when 8 of these nodes are heavily utilized, splitting to 8 pieces
> might be better. etc.

IIRC, it doesn't plan parallelization that way.  It looks at what is
partitioned (by default) on different nodes and parallelizes based on
that.  As I said earlier, you can partition a single node and put
pgpool-II on top of it to gain some parallelization.  Unfortunately,
it isn't capable of handling things like parallel index builds or
other useful maintenance features... but it can do fairly good query
result parallelization.

--

-- 
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation                | fax: 732.331.1301
499 Thornall Street, 2nd Floor          | jonah.harris <at> enterprisedb.com
Edison, NJ 08837                        | http://www.enterprisedb.com/

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

               http://archives.postgresql.org

Jean-David Beyer | 1 Dec 2007 18:03
Picon

Re: Utilizing multiple cores for one query


henk de wit wrote:
>> > I wonder whether the current versions of postgres (i.e. either 8.2
> or 8.3)
>> > are able to utilize multiple cores for the execution of a single query?
>> Nope.
> 
> I see, thanks for the clarification.
> 
> Btw, in this thread:
> http://archives.postgresql.org/pgsql-performance/2007-10/msg00159.php
> 
> the following is said:
> 
>>You can determine what runs in parellel based on the
>>indentation of the output.
>>Items at the same indentation level under the same
>>"parent" line will run in parallel
> 
> Wouldn't this offer some opportunities for running things on multiple
> cores? Based on the above, many people already seem to think that PG is
> able to utilize multiple cores for 1 query.

Of course, it depends on just what you mean. Since postgresql is a
client-server system, the client can run on one processor and the server on
another. And that _is_ parallelism in a way. For me in one application, my
client uses about 20% of a processor and the server uses around 80%. But in
more detail,

VIRT   RES  SHR SWAP %MEM %CPU     TIME+ P COMMAND

2019m  94m  93m 1.9g  1.2   79   2:29.97 3 postgres: jdbeyer stock [local]
INSERT
2019m 813m 813m 1.2g 10.2    2  23:38.67 0 postgres: writer process

2018m  29m  29m 1.9g  0.4    0   4:07.59 3 /usr/bin/postmaster -p 5432 -D ...
 8624  652  264 7972  0.0    0   0:00.10 2 postgres: logger process

 9624 1596  204 8028  0.0    0   0:01.07 2 postgres: stats buffer process

 8892  840  280 8052  0.0    0   0:00.74 1 postgres: stats collector process
 6608 2320 1980 4288  0.0   22   1:56.27 0 /home/jdbeyer/bin/enter

The P column shows the processor the process last ran on. In this case, I
might get away with using one processor, it is clearly using all four.

Now this is not processing a single query on multiple cores (in this case,
the "query" is running on core #3 only), but the ancillary stuff is running
on multiple cores and some of it should be charged to the query. And the OS
kernel takes time for IO and stuff as well.

> Of course, it can be easily
> "proved" that this does not happen by simply watching at the CPU
> utilization graphs when executing a query. Nevertheless, those people
> may wonder why (some of) those items that already run in parallel not
> actually run in parallel using multiple cores?
> 
>
--
  .~.  Jean-David Beyer          Registered Linux User 85642.
  /V\  PGP-Key: 9A2FC99A         Registered Machine   241939.
 /( )\ Shrewsbury, New Jersey    http://counter.li.org
 ^^-^^ 11:40:01 up 1 day, 2:02, 5 users, load average: 4.15, 4.14, 4.15
Mindaugas | 2 Dec 2007 11:26

Dealing with big tables

  Hello,

  Started to work with big tables (like 300GB) and performance problems started to appear. :(

  To simplify things - table has an index on From an index on To columns. And it also have several other not
indexed columns. There are 100000+ of different values for From and the same for To.

  I execute simple query "select * from bigtable where From='something'". Query returns like 1000 rows and
takes 5++ seconds to complete. As far as I understand the query is slow because:
  - first it has to retrieve pointers to rows with data from index. That goes fast.
  - retrieve all the rows one by one. There we have 100% random read because rows with the same From is
distributed evenly through all the 300GB and most probably nothing is cached. So there we are limited by
_one_ disk performance independently of how many disks we have in storage? And in case storage consists of
15k rpm Cheetahs with 3.5ms average read seek time we should expect not more than ~285 rows per second?

  I feel that I'm overlooking something here. But I'm new into data warehousing. :)

  Also this query should greatly benefit from parallel execution or async IO. Storage (seeks/second)
scales almost linearly when it has a lot of disks. And query is completely IO bound so it should scale well on
single server.

  And I cannot use some index organized table or table partitioned by From :) because there are at least 2
similar indexes by which queries can be executed - From and To.

  Ideas for improvement? Greenplum or EnterpriseDB? Or I forgot something from PostgreSQL features.

  Thanks,

  Mindaugas

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

                http://www.postgresql.org/about/donate

Sami Dalouche | 2 Dec 2007 12:05
Picon
Favicon

Re: Dealing with big tables

Hi,

my answer may be out of topic since you might be looking for a
postgres-only solution.. But just in case....

What are you trying to achieve exactly ? Is there any way you could
re-work your algorithms to avoid selects and use a sequential scan
(consider your postgres data as one big file) to retrieve each of the
rows, analyze / compute them (possibly in a distributed manner), and
join the results at the end ? 

A few pointers :
http://lucene.apache.org/hadoop/
http://www.gridgain.com/

Regards,
Sami Dalouche

On Sun, 2007-12-02 at 12:26 +0200, Mindaugas wrote:
> Hello,
> 
>   Started to work with big tables (like 300GB) and performance problems started to appear. :(
> 
>   To simplify things - table has an index on From an index on To columns. And it also have several other not
indexed columns. There are 100000+ of different values for From and the same for To.
> 
>   I execute simple query "select * from bigtable where From='something'". Query returns like 1000 rows and
takes 5++ seconds to complete. As far as I understand the query is slow because:
>   - first it has to retrieve pointers to rows with data from index. That goes fast.
>   - retrieve all the rows one by one. There we have 100% random read because rows with the same From is
distributed evenly through all the 300GB and most probably nothing is cached. So there we are limited by
_one_ disk performance independently of how many disks we have in storage? And in case storage consists of
15k rpm Cheetahs with 3.5ms average read seek time we should expect not more than ~285 rows per second?
> 
>   I feel that I'm overlooking something here. But I'm new into data warehousing. :)
> 
>   Also this query should greatly benefit from parallel execution or async IO. Storage (seeks/second)
scales almost linearly when it has a lot of disks. And query is completely IO bound so it should scale well on
single server.
> 
>   And I cannot use some index organized table or table partitioned by From :) because there are at least 2
similar indexes by which queries can be executed - From and To.
> 
>   Ideas for improvement? Greenplum or EnterpriseDB? Or I forgot something from PostgreSQL features.
> 
>   Thanks,
> 
>   Mindaugas
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
> 
>                 http://www.postgresql.org/about/donate

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Mindaugas | 2 Dec 2007 12:37

Re: Dealing with big tables


> my answer may be out of topic since you might be looking for a
> postgres-only solution.. But just in case....

  I'd like to stay with SQL.

> What are you trying to achieve exactly ? Is there any way you could
> re-work your algorithms to avoid selects and use a sequential scan
> (consider your postgres data as one big file) to retrieve each of the
> rows, analyze / compute them (possibly in a distributed manner), and
> join the results at the end ?

  I'm trying to improve performance - get answer from mentioned query 
faster.

  And since cardinality is high (100000+ different values) I doubt that it 
would be possible to reach select speed with reasonable number of nodes of 
sequential scan nodes.

  Mindaugas

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Guillaume Smet | 2 Dec 2007 12:55
Picon
Gravatar

Re: Dealing with big tables

On Dec 2, 2007 11:26 AM, Mindaugas <ml <at> kilimas.com> wrote:
>   I execute simple query "select * from bigtable where From='something'". Query returns like 1000 rows and
takes 5++ seconds to complete. As far as I understand the query is slow because:

Can you post an EXPLAIN ANALYZE? Which version of PostgreSQL do you use?

--
Guillaume

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

                http://www.postgresql.org/about/donate

Gregory Stark | 2 Dec 2007 13:07
Favicon

Re: Dealing with big tables


"Mindaugas" <ml <at> kilimas.com> writes:

>   I execute simple query "select * from bigtable where From='something'".
>   Query returns like 1000 rows and takes 5++ seconds to complete. 

As you pointed out that's not terribly slow for 1000 random accesses. It
sounds like your drive has nearly 5ms seek time which is pretty common.

What exactly is your goal? Do you need this query to respond in under a
specific limit? What limit? Do you need to be able to execute many instances
of this query in less than 5s * the number of executions? Or do you have more
complex queries that you're really worried about?

I do have an idea of how to improve Postgres for this case but it has to wait
until we're done with 8.3 and the tree opens for 8.4.

>   Ideas for improvement? Greenplum or EnterpriseDB? Or I forgot something
>   from PostgreSQL features.

Both Greenplum and EnterpriseDB have products in this space which let you
break the query up over several servers but at least in EnterpriseDB's case
it's targeted towards running complex queries which take longer than this to
run. I doubt you would see much benefit for a 5s query after the overhead of
sending parts of the query out to different machines and then reassembling the
results. If your real concern is with more complex queries they may make sense
though. It's also possible that paying someone to come look at your database
will find other ways to speed it up.

--

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
       subscribe-nomail command to majordomo <at> postgresql.org so that your
       message can get through to the mailing list cleanly


Gmane