Craig Ringer | 1 Sep 2008 03:49
Picon
Favicon
Gravatar

Re: slow update of index during insert/copy

Thomas Finneid wrote:
> Hi
> 
> I am working on a table which stores up to 125K rows per second and I
> find that the inserts are a little bit slow. The insert is in reality a
> COPY of a chunk of rows, up to 125K. A COPY og 25K rows, without an
> index, is fast enough, about 150ms. With the index, the insert takes
> about 500ms. The read though, is lightning fast, because of the index.
> It takes only 10ms to retrieve 1000 rows from a 15M row table. As the
> table grows to several billion rows, that might change though.
> 
> I would like the insert, with an index, to be a lot faster than 500ms,
> preferrably closer to 150ms. Any advice on what to do?
> Additionally, I dont enough about pg configuring to be sure I have
> included all the important directives and given them proportional
> values, so any help on that as well would be appreciated.
> 
> Here are the details:
> 
> postgres 8.2.7 on latest kubuntu, running on dual Opteron quad cores,
> with 8GB memory and 8 sata disks on a raid controller (no raid config)

Just on a side note, your system is pretty strangely heavy on CPU
compared to its RAM and disk configuration. Unless your workload in Pg
is computationally intensive or you have something else hosted on the
same machine, those CPUs will probably sit mostly idle.

The first thing you need to do is determine where, during your bulk
loads, the system is bottlenecked. I'd guess it's stuck waiting for disk
writes, personally, but I'd want to investigate anyway.
(Continue reading)

Scott Carey | 1 Sep 2008 10:08
Favicon

Re: slow update of index during insert/copy

Are you even getting COPY to work with JDBC?  As far as I am aware, COPY doesn't work with JDBC at the moment:
http://jdbc.postgresql.org/todo.html   Listed in the todo page, under "PG Extensions"   is "Add support for COPY."  I tried to use it with JDBC a while ago and gave up after a couple limited experiments and reading that -- but didn't dig very deep into it.

As suggested, you should determine if you are disk bound or CPU bound.  My experience with COPY is that it is suprisingly easy to make it CPU bound, but the conditions for that can vary quire a bit from schema to schema and hardware to hardware.
 
pg_bulkload may not be the tool for you for many reasons -- it requires a rigid data format and control file, very similar to Oracle's sqlloader.  It may not fit your needs at all -- its just worth a look to see if it does since if there's a match, it will be much faster. 

On Sun, Aug 31, 2008 at 6:49 PM, Craig Ringer <craig <at> postnewspapers.com.au> wrote:
Thomas Finneid wrote:
> Hi
>
> I am working on a table which stores up to 125K rows per second and I
> find that the inserts are a little bit slow. The insert is in reality a
> COPY of a chunk of rows, up to 125K. A COPY og 25K rows, without an
> index, is fast enough, about 150ms. With the index, the insert takes
> about 500ms. The read though, is lightning fast, because of the index.
> It takes only 10ms to retrieve 1000 rows from a 15M row table. As the
> table grows to several billion rows, that might change though.
>
> I would like the insert, with an index, to be a lot faster than 500ms,
> preferrably closer to 150ms. Any advice on what to do?
> Additionally, I dont enough about pg configuring to be sure I have
> included all the important directives and given them proportional
> values, so any help on that as well would be appreciated.
>
> Here are the details:
>
> postgres 8.2.7 on latest kubuntu, running on dual Opteron quad cores,
> with 8GB memory and 8 sata disks on a raid controller (no raid config)

Just on a side note, your system is pretty strangely heavy on CPU
compared to its RAM and disk configuration. Unless your workload in Pg
is computationally intensive or you have something else hosted on the
same machine, those CPUs will probably sit mostly idle.

The first thing you need to do is determine where, during your bulk
loads, the system is bottlenecked. I'd guess it's stuck waiting for disk
writes, personally, but I'd want to investigate anyway.

If you're not satisfied with the results from pg_bulkload you can look
into doing things like moving your indexes to separate tablespaces (so
they don't fight for I/O on the same disk sets as your tables),
separating your bulk load tables from other online/transactional tables,
etc.

Also, to relay common advice from this list:

If you land up considering hardware as a performance answer, getting a
decent SAS RAID controller with a battery backed cache (so you can
enable its write cache) and a set of fast SAS disks might be worth it.
For that matter, a good SATA RAID controller and some 10kRPM SATA disks
could help too. It all appears to depend a lot on the particular
workload and the characteristics of the controller, though.

--
Craig Ringer

--
Sent via pgsql-performance mailing list (pgsql-performance <at> postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Thomas Finneid | 1 Sep 2008 13:16
Picon
Picon

Re: slow update of index during insert/copy


Scott Carey wrote:
> Are you even getting COPY to work with JDBC?  As far as I am aware, COPY 
> doesn't work with JDBC at the moment:

I used a patched jdbc driver, provided by someone on the list, dont have 
the reference at hand. It works perfectly and its about 5 times faster, 
for my job, than insert.

> As suggested, you should determine if you are disk bound or CPU bound.  
> My experience with COPY is that it is suprisingly easy to make it CPU 
> bound, but the conditions for that can vary quire a bit from schema to 
> schema and hardware to hardware.

COPY is not the problem, as far as I see. The problem is the update 
speed of the index. I tested the same procedure on a table with and 
without an index. Having an index makes it 200-250% slower, than without.

But as you state I should check whether the problem is cpu or disk 
bound. In addition, as someone else suggested, I might need to move the 
indexes to a different disk, which is not a bad idea considering the 
index becomes quite large with up 125K rows a second.

But I haver another consern, which is the db server configuration. I am 
not entirely convinced the db is configured prperly. I had one problem 
where the disk started thrashing after the table had reached a certainb 
size, so when I configured shmmax, and the corresponding in pg, properly 
I got rid of the trashing. I will have to read through the documentation 
  properly.

regards

thomas

--

-- 
Sent via pgsql-performance mailing list (pgsql-performance <at> postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Thomas Finneid | 1 Sep 2008 13:29
Picon
Picon

Re: slow update of index during insert/copy


Craig Ringer wrote:
> Just on a side note, your system is pretty strangely heavy on CPU
> compared to its RAM and disk configuration. Unless your workload in Pg
> is computationally intensive or you have something else hosted on the
> same machine, those CPUs will probably sit mostly idle.

Its a devel machine for experimenting with pg and the disk performance 
and for experimenting with multithreaded java programs. Its not going to 
be particularily demanding on memory, but 8GB is good enough, I think.

> The first thing you need to do is determine where, during your bulk
> loads, the system is bottlenecked. I'd guess it's stuck waiting for disk
> writes, personally, but I'd want to investigate anyway.

Will investigate.

> If you're not satisfied with the results from pg_bulkload you can look
> into doing things like moving your indexes to separate tablespaces (so
> they don't fight for I/O on the same disk sets as your tables),
> separating your bulk load tables from other online/transactional tables,
> etc.

(Btw, its jdbc copy, not commandline.)
I dont think its the bulkload thats the problem, in it self, because 
loading it without an index is quite fast (and 5 times faster than 
ordinary insert). But of course, the bulkload process affects other 
parts of the system which can cause a bottleneck.

> Also, to relay common advice from this list:
> 
> If you land up considering hardware as a performance answer, getting a
> decent SAS RAID controller with a battery backed cache (so you can
> enable its write cache) and a set of fast SAS disks might be worth it.
> For that matter, a good SATA RAID controller and some 10kRPM SATA disks
> could help too. It all appears to depend a lot on the particular
> workload and the characteristics of the controller, though.

It does have a sata raid controller, but not have the battery pack, 
because its a develmachine and not a production machine, I thought it 
was not needed. But if you are saying the battery pack enables a cache 
which enables faster disk writes I will consider it.
Its the first time I have worked with a raid controller, so I suspect I 
have to read up on the features to understand how to utilise it best.

regards

thomas

--

-- 
Sent via pgsql-performance mailing list (pgsql-performance <at> postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Duan Ligong | 1 Sep 2008 13:22
Picon

too many clog files

Hi, there.

I have encountered an issue that there are too many 
clog file under the .../pg_clog/ directory. Some of them 
were even produced one month ago.

My questions:
- Does Vacuum delete the old clog files?
- Can we controll the maximum number of the clog files?
- When, or in what case is  a new clog file produced?
- Is there a mechanism that the clog files are recycled?

#The version of my postgresql is 8.1

Regards
Duan

--

-- 
Sent via pgsql-performance mailing list (pgsql-performance <at> postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

David West | 1 Sep 2008 14:18

limit clause breaks query planner?

Hi,

 

I have a single table with about 10 million rows, and two indexes.  Index A is on a column A with 95% null values.  Index B is on a column B with about 10 values, ie. About a million rows of each value.

 

When I do a simple query on the table (no joins) with the following condition:

A is null AND

B = ‘21’

 

it uses the correct index, index B.  However, when I add a limit clause of 15, postgres decides to do a sequential scan :s.  Looking at the results from explain:

 

"Limit  (cost=0.00..3.69 rows=15 width=128)"

"  ->  Seq Scan on my_table this_  (cost=0.00..252424.24 rows=1025157 width=128)"

"        Filter: ((A IS NULL) AND ((B)::text = '21'::text))"

 

It appears that postgres is (very incorrectly) assuming that it will only have to retrieve 15 rows on a sequential scan, and gives a total cost of 3.69.  In reality, it has to scan many rows forward until it finds the correct value, yielding very poor performance for my table.

 

If I disable sequential scan (set enable_seqscan=false) it then incorrectly uses the index A that has 95% null values: it seems to incorrectly apply the same logic again that it will only have to retrieve 15 rows with the limit clause, and thinks that the index scan using A is faster than index scan B.

 

Only by deleting the index on A and disabling sequential scan will it use the correct index, which is of course by far the fastest.

 

Is there an assumption in the planner that a limit of 15 will mean that postgres will only have to read 15 rows?  If so is this a bad assumption?  If a particular query is faster without a limit, then surely it will also be faster with the limit.

 

Any workarounds for this?

 

Thanks

David

Pavel Stehule | 1 Sep 2008 14:52
Picon
Gravatar

Re: limit clause breaks query planner?

Hello

you should partial index

create index foo(b) on mytable where a is null;

regards
Pavel Stehule

2008/9/1 David West <david.west <at> cusppoint.com>:
> Hi,
>
>
>
> I have a single table with about 10 million rows, and two indexes.  Index A
> is on a column A with 95% null values.  Index B is on a column B with about
> 10 values, ie. About a million rows of each value.
>
>
>
> When I do a simple query on the table (no joins) with the following
> condition:
>
> A is null AND
>
> B = '21'
>
>
>
> it uses the correct index, index B.  However, when I add a limit clause of
> 15, postgres decides to do a sequential scan :s.  Looking at the results
> from explain:
>
>
>
> "Limit  (cost=0.00..3.69 rows=15 width=128)"
>
> "  ->  Seq Scan on my_table this_  (cost=0.00..252424.24 rows=1025157
> width=128)"
>
> "        Filter: ((A IS NULL) AND ((B)::text = '21'::text))"
>
>
>
> It appears that postgres is (very incorrectly) assuming that it will only
> have to retrieve 15 rows on a sequential scan, and gives a total cost of
> 3.69.  In reality, it has to scan many rows forward until it finds the
> correct value, yielding very poor performance for my table.
>
>
>
> If I disable sequential scan (set enable_seqscan=false) it then incorrectly
> uses the index A that has 95% null values: it seems to incorrectly apply the
> same logic again that it will only have to retrieve 15 rows with the limit
> clause, and thinks that the index scan using A is faster than index scan B.
>
>
>
> Only by deleting the index on A and disabling sequential scan will it use
> the correct index, which is of course by far the fastest.
>
>
>
> Is there an assumption in the planner that a limit of 15 will mean that
> postgres will only have to read 15 rows?  If so is this a bad assumption?
> If a particular query is faster without a limit, then surely it will also be
> faster with the limit.
>
>
>
> Any workarounds for this?
>
>
>
> Thanks
>
> David

--

-- 
Sent via pgsql-performance mailing list (pgsql-performance <at> postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Fernando Hevia | 1 Sep 2008 15:24
Picon

Re: Best hardware/cost tradoff?


> -----Mensaje original-----
> De: pgsql-performance-owner <at> postgresql.org 
> [mailto:pgsql-performance-owner <at> postgresql.org] En nombre de cluster
> Enviado el: Sábado, 30 de Agosto de 2008 07:21
> Para: pgsql-performance <at> postgresql.org
> Asunto: Re: [PERFORM] Best hardware/cost tradoff?
> 
> We are now leaning towards just buying 4 SAS disks.
> 
> So should I just make one large RAID-10 partition or make two 
> RAID-1's having the log on one RAID and everything else on 
> the second RAID?
> How can I get the best read/write performance out of these four disks?
> (Remember, that it is a combined web-/database server).
> 

Make a single RAID 10. It´s simpler and it will provide you better write
performance which is where your bottleneck will be. I think you should
minimize the web server role in this equation as it should mostly work on
cached data.

--

-- 
Sent via pgsql-performance mailing list (pgsql-performance <at> postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Alvaro Herrera | 1 Sep 2008 16:33
Favicon
Gravatar

Re: too many clog files

Duan Ligong wrote:
> Hi, there.
> 
> I have encountered an issue that there are too many 
> clog file under the .../pg_clog/ directory. Some of them 
> were even produced one month ago.

If you're going to repost a question, it is only polite that you link to
the answers already provided.  Particularly so when some of your
questions were already answered.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

--

-- 
Sent via pgsql-performance mailing list (pgsql-performance <at> postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

David West | 1 Sep 2008 16:44

Re: limit clause breaks query planner?

Thanks for your suggestion but the result is the same.

Here is the explain analyse output from different queries.
Select * from my_table where A is null and B = '21' limit 15

"Limit  (cost=0.00..3.68 rows=15 width=128) (actual time=85837.043..85896.140 rows=15 loops=1)"
"  ->  Seq Scan on my_table this_  (cost=0.00..258789.88 rows=1055580 width=128) (actual
time=85837.038..85896.091 rows=15 loops=1)"
"        Filter: ((A IS NULL) AND ((B)::text = '21'::text))"
"Total runtime: 85896.214 ms"

As you can see the estimated cost was 3.68: a long way from the true value.

Doing 'set enable_seqscan=false' and repeating the select:
"Limit  (cost=0.00..5.58 rows=15 width=128) (actual time=4426.438..4426.834 rows=15 loops=1)"
"  ->  Index Scan using idx_A on my_table this_  (cost=0.00..392956.76 rows=1055970 width=128) (actual
time=4426.433..4426.768 rows=15 loops=1)"
"        Index Cond: (A IS NULL)"
"        Filter: ((B)::text = '21'::text)"
"Total runtime: 4426.910 ms"

Probably some caching made this query faster, but it's still too slow, and using the wrong index.

Deleting index A gives:
"Limit  (cost=0.00..56.47 rows=15 width=128) (actual time=10.298..10.668 rows=15 loops=1)"
"  ->  Index Scan using idx_B on my_table this_  (cost=0.00..3982709.15 rows=1057960 width=128) (actual
time=10.293..10.618 rows=15 loops=1)"
"        Index Cond: ((B)::text = '21'::text)"
"        Filter: (A IS NULL)"
"Total runtime: 10.735 ms"
Much better.  However I need index A for another query so I can't just delete it.

Looking at the estimated cost, you can see why it's choosing the order that it is choosing, but it just
doesn't seem to reflect reality at all.

Now here's the result of the query, with both indexes in place and sequential scan enabled
Select * from my_table where A is null and B = '21'
"Bitmap Heap Scan on my_table this_  (cost=20412.89..199754.37 rows=1060529 width=128) (actual
time=470.772..7432.062 rows=1020062 loops=1)"
"  Recheck Cond: ((B)::text = '21'::text)"
"  Filter: (A IS NULL)"
"  ->  Bitmap Index Scan on idx_B (cost=0.00..20147.76 rows=1089958 width=0) (actual
time=466.545..466.545 rows=1020084 loops=1)"
"        Index Cond: ((B)::text = '21'::text)"
"Total runtime: 8940.119 ms"

In this case it goes for the correct index.  It appears that the query planner makes very simplistic
assumptions when it comes to LIMIT?

Thanks
David

-----Original Message-----
From: Pavel Stehule [mailto:pavel.stehule <at> gmail.com] 
Sent: 01 September 2008 13:53
To: David West
Cc: pgsql-performance <at> postgresql.org
Subject: Re: [PERFORM] limit clause breaks query planner?

Hello

you should partial index

create index foo(b) on mytable where a is null;

regards
Pavel Stehule

2008/9/1 David West <david.west <at> cusppoint.com>:
> Hi,
>
>
>
> I have a single table with about 10 million rows, and two indexes.  Index A
> is on a column A with 95% null values.  Index B is on a column B with about
> 10 values, ie. About a million rows of each value.
>
>
>
> When I do a simple query on the table (no joins) with the following
> condition:
>
> A is null AND
>
> B = '21'
>
>
>
> it uses the correct index, index B.  However, when I add a limit clause of
> 15, postgres decides to do a sequential scan :s.  Looking at the results
> from explain:
>
>
>
> "Limit  (cost=0.00..3.69 rows=15 width=128)"
>
> "  ->  Seq Scan on my_table this_  (cost=0.00..252424.24 rows=1025157
> width=128)"
>
> "        Filter: ((A IS NULL) AND ((B)::text = '21'::text))"
>
>
>
> It appears that postgres is (very incorrectly) assuming that it will only
> have to retrieve 15 rows on a sequential scan, and gives a total cost of
> 3.69.  In reality, it has to scan many rows forward until it finds the
> correct value, yielding very poor performance for my table.
>
>
>
> If I disable sequential scan (set enable_seqscan=false) it then incorrectly
> uses the index A that has 95% null values: it seems to incorrectly apply the
> same logic again that it will only have to retrieve 15 rows with the limit
> clause, and thinks that the index scan using A is faster than index scan B.
>
>
>
> Only by deleting the index on A and disabling sequential scan will it use
> the correct index, which is of course by far the fastest.
>
>
>
> Is there an assumption in the planner that a limit of 15 will mean that
> postgres will only have to read 15 rows?  If so is this a bad assumption?
> If a particular query is faster without a limit, then surely it will also be
> faster with the limit.
>
>
>
> Any workarounds for this?
>
>
>
> Thanks
>
> David

--

-- 
Sent via pgsql-performance mailing list (pgsql-performance <at> postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Gmane