Re: Parallel query order by / limit
PgPool has to get all of the rows to order them, then it can limit how many are returned. Unfortunately,
there is no magic distributed index that can help with this operation. (Imagine fetching a row through
an index on a backend database and comparing it to all other backend databases to determine if that row is or
is not included in the result set. Yikes!) You will find similar issues with group by and other clauses
that require a sort. Subqueries can be problematic, too.
Note the data is returned to the PG cluster running pgpool for your sort and limit operations. This
operation used dblink, which allocates memory to hold the rows and can exhaust memory if the amount of data
is too large. If you want to run a partitioned datawarehouse I suggest looking at Greenplum. If you
have targeted data to retrieve, pgpool can do that.
PgPool works well in parallel mode if you remember these limitations. I'm running several ginormous
databases with it and get excellent response.
From: Drew Robb <drewrobb@...>
Sent: Thursday, June 2, 2011 11:17 AM
Subject: [Pgpool-general] Parallel query order by / limit
Hello, I'm fairly new to pgpool (enjoying it so far) and I'm trying to
understand parallel query behavior. I'm using pgpool 3.0.4 and pqsql
I have a parallel table USERS distributed via the primary key user_id. When I execute the following query
select user_id, date_created from users order by date_created limit 10;
I get the following output from pgpool debug:
statement2: SELECT pool_parallel("SELECT user_id, users.date_created FROM users")
My understanding is that the parallel engine is unable to translate the order by
and limit clause to into the query executed on the worker nodes? This
creates a big performance problem since all records in the table are
seemingly being processed. I read the documentation, but was unable to
figure out why this is the case. It seems to me that the order by and
limit clauses could be applied to both the workers (returning upto N*10
records) and then reapplied to filter these results down to just 10
records. Is the behavior i'm seeing expected, does the optimization I had in mind make sense?
Either way, is there a way that I could specify exactly the parallel query
to be executing on the workers to avoid this problem? I know that
creating a particular view in and registering it in the dist_def table might work, but this isn't very dynamic.
Pgpool-general mailing list