Jiří Nádvorník | 29 Jul 12:11 2014

Re: Cursor + upsert (astronomical data)

Hi Oleg, Sergey,

The problem would be a crossmatch if I had a catalog to crossmatch against. But I am actually trying to
build that catalog.

Crossmatching can actually be used to solve this, by crossmatching the observations with themselves
via q3c_join with a 1 arcsec radius. But as I said, that falls apart because we have about a thousand images
overlapping. That is the quadratic-complexity factor of self-crossmatching the table (self-joining
is the same thing, if I understand the crossmatching term correctly).

I actually managed to write a silver-bullet query, which did the self-join in the innermost subquery
and then worked over the results with analytic functions such as count and rank. It found the best candidate
among the self-joined tuples, computed the average coordinates, grouped on them, and that gave me my
identifier. I can send you the code if you are interested.
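
For illustration, a minimal sketch of what such a self-crossmatch with analytic functions could look like (table and column names are taken from the later post in this thread; this is not the author's actual query, and the final grouping of the averaged coordinates is omitted):

    -- Pair each observation with its neighbours within 1 arcsec (1.0/3600 deg),
    -- then average the coordinates of each neighbourhood.
    SELECT a.oid,
           count(*)       AS n_neighbours,
           avg(b.raj2000) AS raj2000_center,
           avg(b.dej2000) AS dej2000_center
    FROM observation a
    JOIN observation b
      ON q3c_join(a.raj2000, a.dej2000, b.raj2000, b.dej2000, 1.0/3600)
    GROUP BY a.oid;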

It worked like a charm for smaller data sets - fast, with a small error rate (<1%). But when self-joining 3e8 rows, in the
highest-density areas each observation is multiplied by a factor of roughly 1000^2, and the temporary results run
out of disk space (I didn't have more than 1.5 TB). So I managed to work around this by dividing the dataset into
smaller parts (about 25,000 rows per part). When run in parallel on 8 cores it used them quite well (about 6 cores
fully loaded at any one time) and the 24 GB of memory was at ~75% load.

If the time scaled linearly with the number of processed rows, processing all 3e8 rows would take ~9 days. However, I
tried this only once (for obvious reasons) and the RDBMS crashed after about 4 days of this heavy load. I don't
know whether I was stretching Postgres past its limits here, so at that point I went looking for an algorithm with linear
complexity. That is how I arrived at where I am now.

P.S.: I consulted this with people from Catalina, and they do this with a Friends-of-Friends
algorithm - but they don't have most of their stars in high-density areas like the SMC, as we do. That's why I
decided not to use it: I think it would break down badly when the separations become comparable to
the threshold of your astrometry.

Jiří Nádvorník | 26 Jul 12:46 2014

Cursor + upsert (astronomical data)

Hello guys.

 

My issue kind of hits multiple topics, but the main question is about performance. I think you need to understand the background a little to be able to help me, so I will first define the problem and my approach to it, and put my questions for you at the end of this message.

 

Problem:

I have a table of observations of objects on the sky. The most important columns are the coordinates (x,y); all the other columns are just additional information about the observation. The problem is that when I take an image of the same object on the sky twice, the coordinates x,y won't be identical, only close to each other. My task is to generate a common identifier for all observations of the same object and assign the observations to it (an N:1 relation). The criterion is that all observations within 1 arcsec of this identifier are considered the same object. I keep the identifiers in a separate table (objcat) and have a foreign key in the observations table.

The reason I am worried about performance here is that the observations table currently has about 3e8 rows after 1.5 years of gathering data, and the row count grows linearly.
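
For orientation, a minimal sketch of the two tables as described (hypothetical DDL; only the columns referenced later in the procedure are shown):

    CREATE TABLE objcat (
        id_cat   bigserial PRIMARY KEY,
        ipix_cat bigint,              -- Q3C pixel index of the averaged position
        raj2000  double precision,    -- running average RA of the member observations
        dej2000  double precision,    -- running average Dec
        weight   integer              -- number of observations averaged so far
    );

    CREATE TABLE observation (
        oid      bigserial PRIMARY KEY,
        raj2000  double precision,
        dej2000  double precision,
        id_cat   bigint REFERENCES objcat (id_cat)  -- N:1 link to the identifier
        -- ... plus the other per-observation columns
    );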

 

Technical background:

I'm trying to keep the whole algorithm in the DB if possible, because I have a good PostgreSQL plugin, Q3C, for indexing the coordinates of objects on the sky (https://code.google.com/p/q3c/source/browse/README.q3c). It also has quite a few stored procedures for looking up near neighbours in that table, which is what I'm doing. The function q3c_join(x1, y1, x2, y2, radius) returns true if object 2 is within radius of object 1. It simply generates a list of bitmap-ORed index range conditions with the <= and >= operators that bound the position on the sky, so asking for the nearest neighbours ends up as index scans only.
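
To make the lookup concrete, here is a hedged example of the kind of query this enables (the functional index on q3c_ang2ipix is the standard Q3C setup; the position values are made up, and the radius is given in degrees, so 1 arcsec = 1.0/3600):

    -- Setup (presumably already done): Q3C lookups are driven by a
    -- functional index on the pixel number.
    CREATE INDEX observation_q3c_idx ON observation (q3c_ang2ipix(raj2000, dej2000));

    -- Find all observations within 1 arcsec of a reference position.
    SELECT oid, raj2000, dej2000
    FROM observation
    WHERE q3c_join(10.684, 41.269,      -- reference RA, Dec in degrees (made up)
                   raj2000, dej2000,
                   1.0/3600);           -- search radius in degrees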

 

Solution:

After a lot of experimentation with clustering the objects and trying to process them all together in one "silver-bullet" SQL query, I decided to take a simpler approach. The main problem with the "process everything at once" approach is that finding neighbours for each observation is by definition quadratic, and for 3e8 rows it simply runs out of disk space (~TBs of temporary results).

The simplest approach I could think of is to process each of the 3e8 rows sequentially and ask:

Do I already have an identifier within a radius of 1 arcsec?

No: generate one and assign this observation to it.

Yes: update it and assign this observation to it. The update is a weighted average - I keep a count of how many observations the identifier has been computed from. The result is that the identifier carries the average coordinates of all the observations it identifies - it is their center.

 

So here is my procedure. It has 3 parameters. The first two are the range of oids to process in the table, used for scaling and parallelizing the algorithm. The third is the radius in which to search for neighbours.

 

DROP TYPE IF EXISTS coords;
CREATE TYPE coords AS (
    raj2000 double precision,
    dej2000 double precision
);

DROP FUNCTION IF EXISTS build_catalog(int, int, double precision);
CREATE OR REPLACE FUNCTION build_catalog(fromOID int, toOID int, radius double precision)
RETURNS VOID AS $$
DECLARE
    cur1 CURSOR FOR
        SELECT raj2000, dej2000
        FROM \schema.observation AS obs
        WHERE obs.oid >= fromOID
          AND obs.oid < toOID;
    curr_raj2000 double precision;
    curr_dej2000 double precision;
    curr_coords_cat coords;
    cnt int;
BEGIN
    /*SELECT current_setting('transaction_isolation') into tmp;
    raise notice 'Isolation level %', tmp;*/
    OPEN cur1;
    cnt := 0;
    LOCK TABLE \schema.objcat IN SHARE ROW EXCLUSIVE MODE;
    LOOP
        FETCH cur1 INTO curr_raj2000, curr_dej2000;
        EXIT WHEN NOT found;

        WITH upsert AS (
            UPDATE \schema.objcat
            SET ipix_cat = q3c_ang2ipix(
                    (raj2000 * weight + curr_raj2000) / (weight + 1),
                    (dej2000 * weight + curr_dej2000) / (weight + 1)),
                raj2000 = (raj2000 * weight + curr_raj2000) / (weight + 1),
                dej2000 = (dej2000 * weight + curr_dej2000) / (weight + 1),
                weight = weight + 1
            WHERE q3c_join(curr_raj2000, curr_dej2000, raj2000, dej2000, radius)
            RETURNING *
        ),
        ins AS (
            INSERT INTO \schema.objcat (ipix_cat, raj2000, dej2000, weight)
            SELECT q3c_ang2ipix(curr_raj2000, curr_dej2000),
                   curr_raj2000,
                   curr_dej2000,
                   1
            WHERE NOT EXISTS (SELECT * FROM upsert)
            RETURNING *
        )
        UPDATE \schema.observation
        SET id_cat = (SELECT DISTINCT id_cat FROM upsert
                      UNION
                      SELECT id_cat FROM ins WHERE id_cat IS NOT NULL
                      LIMIT 1)
        WHERE CURRENT OF cur1;

        cnt := cnt + 1;
        IF ((cnt % 100000) = 0) THEN
            RAISE NOTICE 'Processed % entries', cnt;
        END IF;
    END LOOP;
    CLOSE cur1;
END;
$$ LANGUAGE plpgsql;
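
A hedged usage sketch of how the function is then invoked, one oid slice per worker session (assuming the radius is passed in degrees, as q3c_join expects, so 1 arcsec = 1.0/3600):

    -- worker 1
    SELECT build_catalog(0, 25000, 1.0/3600);

    -- worker 2, in a separate session
    SELECT build_catalog(25000, 50000, 1.0/3600);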

 

Results: When I run the procedure with only one client thread, it processes about 1 million rows per hour, which means days for the whole dataset. When I run it in parallel with that lock to ensure pessimistic synchronization, it effectively runs sequentially too :) - the other threads just wait. When I drop the lock and hope to resolve the resulting conflicts later, the SSD serves up to 4 threads relatively effectively, which divides my days of runtime by 4 - still unacceptable.

 

The reason is quite clear here - in one cycle of the loop I write something to a table, and in the following cycle I need to read that information back.

 

Questions for you:

1. The first question is whether you can think of a better way to do this, and whether SQL is even capable of such a thing - or do I have to do it in C? Would rewriting the SQL function in C help?

2. Could I somehow bend the committing behaviour during the algorithm to my needs? For example, ensure that within one cycle the relevant part of the identifiers table is kept in memory for faster lookups?

3. Why is this so slow? :) It is comparable to the quadratic algorithm in terms of speed, it just does not use any memory.

 

I tried to sum this up as best I could - for more information please don't hesitate to contact me.

 

Thank you very much for even reading this far.

 

Best Regards,

 

Jiri Nadvornik

 

 

Josh Berkus | 25 Jul 23:47 2014

Re: High rate of transaction failure with the Serializable Isolation Level

On 07/23/2014 06:18 PM, Reza Taheri wrote:
> [By way of introduction, we are a TPC subcommittee that is developing a
> benchmark with cloud-like characteristics for virtualized databases. The
> end-to-end benchmarking kit will be publicly available, and will run on
> PGSQL]

Awesome!  Any idea when it will be available?  Our community could
really use some updated benchmark tooling ...

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Borodin Vladimir | 24 Jul 16:22 2014

Debugging writing load

Hi all.

I have a database with a quite heavy write load (about 20k tps, 5k of which do writes). And I see lots of write I/O (I mean amount of data, not iops) to this database, much more than I expect. My question is: how can I debug why backend processes are doing so many writes to the $PGDATA/base directory? Below are some details.

The database runs on a machine with 128 GB of RAM and an md RAID10 of 8 SSD disks (INTEL SSDSC2BB480G4, 480 GB). It runs PostgreSQL 9.3.4 on Red Hat 6.5 with the following postgresql.conf - http://pastebin.com/LNLHppcb. Sysctl parameters for the page cache are:

# sysctl -a | grep vm.dirty
vm.dirty_background_ratio = 0
vm.dirty_background_bytes = 104857600
vm.dirty_ratio = 40
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 100
vm.dirty_expire_centisecs = 300
#

Total database size is now a bit more than 500 GB.

I have separate RAID10 arrays for PGDATA and the pg_xlog directory (with different mount options). Under load, iostat shows about 20 MB/s written to the array with xlogs and about 200 MB/s to the array with PGDATA. Iotop shows that ~80-100 MB/s is written by pdflush (which is expected behavior for me), and the other ~100 MB/s is written by backend processes (varying from 1 MB/s to 30 MB/s each). The checkpointer, bgwriter, and autovacuum workers do very little work (3-5 MB/s).

Lsof on several backend processes shows that each backend has open only database files (tables and indexes) and the last xlog file. Is there any way to understand why backends are writing so much data to the $PGDATA/base directory? I have tried pg_stat_statements but haven't found a good way to see what is happening. Is there a way to see something like "this backend process wrote these pages to disk while performing this query"?
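
Since pg_stat_statements is already mentioned, one hedged starting point is its per-query block write counters, which at least rank normalized statements by how much they write (these columns exist in 9.2 and later):

    SELECT query,
           calls,
           shared_blks_dirtied,
           shared_blks_written,
           temp_blks_written
    FROM pg_stat_statements
    ORDER BY shared_blks_written + temp_blks_written DESC
    LIMIT 20;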

Would be very grateful for any help. Thanks.

--
Vladimir




Reza Taheri | 24 Jul 03:18 2014

High rate of transaction failure with the Serializable Isolation Level

Hello PGSQL performance community,

[By way of introduction, we are a TPC subcommittee that is developing a benchmark with cloud-like characteristics for virtualized databases. The end-to-end benchmarking kit will be publicly available, and will run on PGSQL]

 

I am running into very high failure rates when I run with the Serializable Isolation Level. I have simplified our configuration to a single database with a constant workload, a TPC-E workload if you will, to focus on this problem. We are running PGSQL 9.2.4, ODBC 2.2.14 (as well as 2.3.3pre, which didn't help), RHEL 6.4, and a 6-way VM with 96 GB of memory on a 4-socket Westmere server.

 

With our 9 transactions running with a mix of SQL_TXN_READ_COMMITTED and SQL_TXN_REPEATABLE_READ, we get less than 1% deadlocks, all of which occur because each row in one table, BROKER, may be read or written by multiple transactions at the same time. So, there are legitimate conflicts, which we deal with using an exponential backoff algorithm that sleeps for 10ms/30ms/90ms/etc.

 

When we raise the Trade-Result transaction to SQL_TXN_SERIALIZABLE, we face a storm of conflicts. Out of 37,342 Trade-Result transactions, 15,707 hit an error and have to be rolled back and retried one or more times. The total failure count (since many transactions fail more than once) is 31,388.

 

What is unusual is that the majority of the failures occur in a statement that should not have any isolation conflicts. About 17K of the failures come from the statement below:

2014-07-23 11:27:15 PDT 26085 ERROR:  could not serialize access due to read/write dependencies among transactions
2014-07-23 11:27:15 PDT 26085 DETAIL:  Reason code: Canceled on identification as a pivot, during write.
2014-07-23 11:27:15 PDT 26085 HINT:  The transaction might succeed if retried.
2014-07-23 11:27:15 PDT 26085 CONTEXT:  SQL statement "update   TRADE
                set     T_COMM = comm_amount,
                        T_DTS = trade_dts,
                        T_ST_ID = st_completed_id,
                        T_TRADE_PRICE = trade_price
                where   T_ID = trade_id"
        PL/pgSQL function traderesultframe5(ident_t,value_t,character,timestamp without time zone,trade_t,s_price_t) line 15 at SQL statement

 

This doesn't make sense, since at any given time only one transaction could possibly be accessing the row that is being updated. There should be no conflicts if we have row-level locking/isolation.


The second most common conflict happens 7.6K times, in the statement below:

2014-07-23 11:27:23 PDT 26039 ERROR:  could not serialize access due to read/write dependencies among transactions
2014-07-23 11:27:23 PDT 26039 DETAIL:  Reason code: Canceled on identification as a pivot, during conflict in checking.
2014-07-23 11:27:23 PDT 26039 HINT:  The transaction might succeed if retried.
2014-07-23 11:27:23 PDT 26039 CONTEXT:  SQL statement "insert
                into    SETTLEMENT (    SE_T_ID,
                                        SE_CASH_TYPE,
                                        SE_CASH_DUE_DATE,
                                        SE_AMT)
                values (        trade_id,
                                cash_type,
                                due_date,
                                se_amount
                        )"
        PL/pgSQL function traderesultframe6(ident_t,timestamp without time zone,character varying,value_t,timestamp without time zone,trade_t,smallint,s_qty_t,character) line 23 at SQL statement

 

I don't understand why an insert would hit a serialization conflict.

 

We also have 4.5K conflicts when we try to commit:

2014-07-23 11:27:23 PDT 26037 ERROR:  could not serialize access due to read/write dependencies among transactions
2014-07-23 11:27:23 PDT 26037 DETAIL:  Reason code: Canceled on identification as a pivot, during commit attempt.
2014-07-23 11:27:23 PDT 26037 HINT:  The transaction might succeed if retried.
2014-07-23 11:27:23 PDT 26037 STATEMENT:  COMMIT

 

 

Does PGSQL raise locks to page level when we run with SQL_TXN_SERIALIZABLE? Are there any knobs I can play with to alleviate this?  FWIW, the same transactions on MS SQL Server see almost no conflicts.
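
Not an authoritative answer, but the granularity question can be checked directly while the workload runs: SSI predicate locks appear in pg_locks with mode SIReadLock, and their locktype says whether they are tuple-, page-, or relation-level. Many page- or relation-level entries would suggest lock promotion, which the max_pred_locks_per_transaction setting influences.

    -- Breakdown of SSI predicate locks by granularity.
    SELECT locktype, count(*)
    FROM pg_locks
    WHERE mode = 'SIReadLock'
    GROUP BY locktype
    ORDER BY count(*) DESC;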

 

Thanks,
Reza

Rural Hunter | 23 Jul 15:21 2014

Very slow planning performance on partition table

Hi,

I have a table partitioned into about 60 child tables. I have found that the planning time of a simple query on the partition key is very slow.
# explain analyze select count(*) as cnt from article where pid=88 and hash_code='2ca3ff8b17b163f0212c2ba01b80a064';
                                                              QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16.55..16.56 rows=1 width=0) (actual time=0.259..0.259 rows=1 loops=1)
   ->  Append  (cost=0.00..16.55 rows=2 width=0) (actual time=0.248..0.250 rows=1 loops=1)
         ->  Seq Scan on article  (cost=0.00..0.00 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
               Filter: ((pid = 88) AND (hash_code = '2ca3ff8b17b163f0212c2ba01b80a064'::bpchar))
         ->  Index Scan using article_88_hash_idx on article_88 article  (cost=0.00..16.55 rows=1 width=0) (actual time=0.246..0.248 rows=1 loops=1)
               Index Cond: (hash_code = '2ca3ff8b17b163f0212c2ba01b80a064'::bpchar)
               Filter: (pid = 88)
 Total runtime: 3.816 ms
(8 rows)

Time: 30999.986 ms

You can see from the timing output that the actual run time of the 'explain analyze' is 30 seconds, while the select itself takes only 3 ms. My partition key is article.pid and the constraints are simple, like this: CONSTRAINT article_88_pid_check CHECK (pid = 88). What's wrong, and how can I improve the planning performance?



estimate btree index size without creating

Dear all,

Is it possible to estimate the size of a btree index on a text field before
creating it? For example, I have a table like this:

fortest=# \d index_estimate
i          integer
somestring text

This is the average number of bytes in the text field:
fortest=# select round(avg(octet_length(somestring))) from index_estimate ;
4

And the number of tuples is:
fortest=# select reltuples from pg_class where relname = 'index_estimate';
10001

Can I estimate the index size from only this data? Maybe there is a
formula or some other way to make this estimate.
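
A rough back-of-the-envelope sketch (an approximation with assumed constants, not an exact formula): each leaf entry costs about 4 bytes of line pointer plus the index tuple, which is an 8-byte header plus the key rounded up to an 8-byte boundary, and leaf pages are filled to roughly 90% by default. For a ~4-byte text key that comes to roughly 20 bytes per entry:

    -- ~20 bytes per entry = 4 (line pointer) + MAXALIGN(8 + 1 + 4) for a short text key.
    SELECT pg_size_pretty(
               (ceil(reltuples * 20 / (8192 * 0.9)) * 8192)::bigint
           ) AS estimated_leaf_size
    FROM pg_class
    WHERE relname = 'index_estimate';

Inner pages and the metapage add a few percent on top of the leaf level.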

-- 
Best Regards,
Seliavka Evgenii


johno | 21 Jul 23:09 2014

Slow query with indexed ORDER BY and LIMIT when using OR'd conditions

Hi there,

I am trying to optimize a simple query that returns the first 100 rows that have been updated since a given timestamp (ordered by timestamp and id). If there are several rows with the same timestamp, I need a second condition stating that I want to return rows having the given timestamp and id > the given id.

The obvious query is 

SELECT * FROM register_uz_accounting_entities
    WHERE effective_on > '2014-07-11' OR (effective_on = '2014-07-11' AND id > 1459)
    ORDER BY effective_on, id
    LIMIT 100

With a composite index on (effective_on, id)

Query plan

"Limit  (cost=4613.70..4613.95 rows=100 width=1250) (actual time=0.122..0.130 rows=22 loops=1)"
"  Buffers: shared hit=28"
"  ->  Sort  (cost=4613.70..4617.33 rows=1453 width=1250) (actual time=0.120..0.122 rows=22 loops=1)"
"        Sort Key: effective_on, id"
"        Sort Method: quicksort  Memory: 30kB"
"        Buffers: shared hit=28"
"        ->  Bitmap Heap Scan on register_uz_accounting_entities  (cost=35.42..4558.17 rows=1453 width=1250) (actual time=0.036..0.083 rows=22 loops=1)"
"              Recheck Cond: ((effective_on > '2014-07-11'::date) OR ((effective_on = '2014-07-11'::date) AND (id > 1459)))"
"              Buffers: shared hit=28"
"              ->  BitmapOr  (cost=35.42..35.42 rows=1453 width=0) (actual time=0.026..0.026 rows=0 loops=1)"
"                    Buffers: shared hit=6"
"                    ->  Bitmap Index Scan on idx2  (cost=0.00..6.49 rows=275 width=0) (actual time=0.017..0.017 rows=15 loops=1)"
"                          Index Cond: (effective_on > '2014-07-11'::date)"
"                          Buffers: shared hit=3"
"                    ->  Bitmap Index Scan on idx2  (cost=0.00..28.21 rows=1178 width=0) (actual time=0.007..0.007 rows=7 loops=1)"
"                          Index Cond: ((effective_on = '2014-07-11'::date) AND (id > 1459))"
"                          Buffers: shared hit=3"
"Total runtime: 0.204 ms"


Everything works as expected. However, if I change the constraint to a timestamp/date further in the past (resulting in far more matching rows), the query slows down drastically.

SELECT * FROM register_uz_accounting_entities
WHERE effective_on > '2014-06-11' OR (effective_on = '2014-06-11' AND id > 1459)
ORDER BY effective_on, id
LIMIT 100
 
"Limit  (cost=0.42..649.46 rows=100 width=1250) (actual time=516.125..516.242 rows=100 loops=1)"
"  Buffers: shared hit=576201"
"  ->  Index Scan using idx2 on register_uz_accounting_entities  (cost=0.42..106006.95 rows=16333 width=1250) (actual time=516.122..516.229 rows=100 loops=1)"
"        Filter: ((effective_on > '2014-06-11'::date) OR ((effective_on = '2014-06-11'::date) AND (id > 1459)))"
"        Rows Removed by Filter: 797708"
"        Buffers: shared hit=576201"
"Total runtime: 516.304 ms"


I've tried to optimize this query by pushing down the limit and order by's into explicit subselects.

SELECT * FROM (
   SELECT * FROM register_uz_accounting_entities 
   WHERE effective_on > '2014-06-11'
   ORDER BY effective_on, id LIMIT 100
   ) t1
UNION
  SELECT * FROM (
    SELECT * FROM register_uz_accounting_entities
    WHERE effective_on = '2014-06-11' AND id > 1459
    ORDER BY effective_on, id LIMIT 100
) t2
ORDER BY effective_on, id
LIMIT 100
 
-- query plan
"Limit  (cost=684.29..684.54 rows=100 width=1250) (actual time=2.648..2.708 rows=100 loops=1)"
"  Buffers: shared hit=203"
"  ->  Sort  (cost=684.29..684.79 rows=200 width=1250) (actual time=2.646..2.672 rows=100 loops=1)"
"        Sort Key: register_uz_accounting_entities.effective_on, register_uz_accounting_entities.id"
"        Sort Method: quicksort  Memory: 79kB"
"        Buffers: shared hit=203"
"        ->  HashAggregate  (cost=674.65..676.65 rows=200 width=1250) (actual time=1.738..1.971 rows=200 loops=1)"
"              Buffers: shared hit=203"
"              ->  Append  (cost=0.42..661.15 rows=200 width=1250) (actual time=0.054..0.601 rows=200 loops=1)"
"                    Buffers: shared hit=203"
"                    ->  Limit  (cost=0.42..338.62 rows=100 width=1250) (actual time=0.053..0.293 rows=100 loops=1)"
"                          Buffers: shared hit=101"
"                          ->  Index Scan using idx2 on register_uz_accounting_entities  (cost=0.42..22669.36 rows=6703 width=1250) (actual time=0.052..0.260 rows=100 loops=1)"
"                                Index Cond: (effective_on > '2014-06-11'::date)"
"                                Buffers: shared hit=101"
"                    ->  Limit  (cost=0.42..318.53 rows=100 width=1250) (actual time=0.037..0.228 rows=100 loops=1)"
"                          Buffers: shared hit=102"
"                          ->  Index Scan using idx2 on register_uz_accounting_entities register_uz_accounting_entities_1  (cost=0.42..30888.88 rows=9710 width=1250) (actual time=0.036..0.187 rows=100 loops=1)"
"                                Index Cond: ((effective_on = '2014-06-11'::date) AND (id > 1459))"
"                                Buffers: shared hit=102"
"Total runtime: 3.011 ms"

=> Very fast.

The question is... why is the query planner unable to make this optimization for the slow query? What am I missing?

Queries with syntax highlighting https://gist.github.com/jsuchal/0993fd5a2bfe8e7242d1

Thanks in advance.
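
A hedged rewrite worth trying, assuming the composite index on (effective_on, id): the OR condition is exactly a lexicographic "greater than", so it can be written as a single row-value comparison, which PostgreSQL can evaluate as one forward range scan on that index:

    SELECT *
    FROM register_uz_accounting_entities
    WHERE (effective_on, id) > (DATE '2014-06-11', 1459)
    ORDER BY effective_on, id
    LIMIT 100;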


Albe Laurenz | 21 Jul 10:02 2014

Re: Blocking every 20 sec while mass copying.

Please keep the list on CC: in your responses.

Benjamin Dugast wrote:
> 2014-07-18 13:11 GMT+02:00 Albe Laurenz <laurenz.albe <at> wien.gv.at>:
>> This sounds a lot like checkpoint I/O spikes.
>>
>> Check with the database server log if the freezes coincide with checkpoints.
>>
>> You can increase checkpoint_segments when you load data to have them occur less often.
>>
>> If you are on Linux and you have a lot of memory, you might hit spikes because too
>> much dirty data are cached; check /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio.

> The checkpoint_segments is set to 64 already
> 
> the dirty_ratio was set by default to 10, I put it down to 5
> the dirty_background_ratio was set to 5 and I changed it to 2
> 
> There are fewer freezes, but the insert is much slower than before.

That seems to indicate that my suspicion was right.

I would say that your I/O system is saturated.
Have you checked with "iostat -mNx 1"?

If you really cannot drop the indexes during loading, there's probably not much more
you can do to speed up the load.
You can try to increase checkpoint_segments beyond 64 and see if that buys you anything.
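
A hedged way to see whether checkpoints are being forced by WAL consumption rather than by the timeout, without trawling the logs: if checkpoints_req keeps climbing during the load, raising checkpoint_segments should space the checkpoints out.

    SELECT checkpoints_timed,
           checkpoints_req,        -- checkpoints forced by xlog consumption
           buffers_checkpoint,
           buffers_backend         -- buffers written directly by backends
    FROM pg_stat_bgwriter;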

Tuning the file system write cache will not reduce the amount of I/O necessary, but it
should reduce the spikes (which is what I thought was your problem).

Yours,
Laurenz Albe

Benjamin Dugast | 18 Jul 12:52 2014

Blocking every 20 sec while mass copying.

Hello,

I'm working on Postgres 9.3.4 for a project.

We are using Scala, Akka and JDBC to insert data into the database. We have around 25M inserts to do, which are basically lines from 5000 files. We issue a DELETE according to the file (mandatory) and then a COPY for every 1000 lines of that file.

DELETE request : DELETE FROM table WHERE field1 = ? AND field2 = ?;
COPY request : COPY table FROM STDIN WITH CSV

We have indexes on our tables that we cannot drop for the load.

When we insert the data there are freezes in the database between requests. The freezes occur about every 20 seconds.

Here is a screenshot from yourkit.

We tried different solutions:
  • splitting 1 table into 5 tables to reduce lock contention
  • fillfactor on indexes
  • commit_delay
  • fsync set to off (that helped, but we can't do this)

We mainly want to know why this is happening, because it is slowing the inserts down too much for us.

Chris Ruprecht | 18 Jul 00:47 2014

Building multiple indexes on one table.

Is there any way that I can build multiple indexes on one table without having to scan the table multiple
times? For small tables that's probably not an issue, but if I have a 500 GB table on which I need to create 6
indexes, I don't want to read that table 6 times.
I couldn't find anything in the manual other than REINDEX, but that doesn't help, since it only rebuilds
indexes that already exist, and I don't know whether it reads the table once or multiple times. If I could
create the indexes as inactive and then run REINDEX, which would read the table once, I would have a solution. But
that doesn't seem to exist either.
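
There is no single-scan multi-index build in core, but a hedged workaround: a plain CREATE INDEX takes only a ShareLock, and ShareLock does not conflict with itself, so the six builds can be started in six separate sessions at once. Each build still reads the table, but the wall-clock time drops sharply if the I/O subsystem keeps up (table and index names below are made up):

    -- session 1
    CREATE INDEX bigtable_col1_idx ON bigtable (col1);
    -- session 2, started at the same time
    CREATE INDEX bigtable_col2_idx ON bigtable (col2);
    -- ... sessions 3-6 for the remaining columns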

best regards,
chris
-- 
chris ruprecht
database grunt and bit pusher extraordinaíre
