
estimate btree index size without creating

Dear all,

Is it possible to estimate the size of a btree index on a text field before
creating it? For example, I have a table like this:

fortest=# \d index_estimate
   Column   |  Type
------------+---------
 i          | integer
 somestring | text

This is the average number of bytes in the text field:
fortest=# select round(avg(octet_length(somestring))) from index_estimate ;
4

And the number of tuples is:
fortest=# select reltuples from pg_class where relname = 'index_estimate';
10001

Can I estimate the index size using only this data? Is there a formula or
some other way to make this estimation?
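
A rough back-of-the-envelope approach (an approximation only): each btree leaf entry is roughly the key padded to 8 bytes plus an 8-byte index tuple header and a 4-byte line pointer, and leaf pages are filled to about 90% by default. A sketch of that arithmetic in SQL, assuming 8 kB pages and ignoring everything above the first level of inner pages:

-- Rough btree size estimate from reltuples and the average key length.
-- Assumes 8 kB pages, ~90% default fillfactor, 8-byte index tuple header,
-- 4-byte line pointer, and the text key padded to an 8-byte boundary.
SELECT pg_size_pretty(
         (ceil(c.reltuples::numeric /
               floor((8192 - 24 - 16) * 0.9
                     / (4 + 8 + 8 * ceil((s.avg_len + 1) / 8.0)))) + 1
         )::bigint * 8192
       ) AS approx_index_size
FROM pg_class c,
     (SELECT round(avg(octet_length(somestring))) AS avg_len
        FROM index_estimate) s
WHERE c.relname = 'index_estimate';

For the numbers above (4-byte average text, 10001 tuples) this comes out at roughly 30 pages, i.e. a bit over 200 kB; treat it as an order-of-magnitude estimate, not an exact figure.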

-- 
Best Regards,
Seliavka Evgenii



johno | 21 Jul 23:09 2014

Slow query with indexed ORDER BY and LIMIT when using OR'd conditions

Hi there,

I am trying to optimize a simple query that returns the first 100 rows that have been updated since a given timestamp (ordered by timestamp and id desc).  If there are several rows with the same timestamp, I need a second condition stating that I want rows with the given timestamp and id > the given id.

The obvious query is 

SELECT * FROM register_uz_accounting_entities
    WHERE effective_on > '2014-07-11' OR (effective_on = '2014-07-11' AND id > 1459)
    ORDER BY effective_on, id
    LIMIT 100

With a composite index on (effective_on, id)

Query plan

"Limit  (cost=4613.70..4613.95 rows=100 width=1250) (actual time=0.122..0.130 rows=22 loops=1)"
"  Buffers: shared hit=28"
"  ->  Sort  (cost=4613.70..4617.33 rows=1453 width=1250) (actual time=0.120..0.122 rows=22 loops=1)"
"        Sort Key: effective_on, id"
"        Sort Method: quicksort  Memory: 30kB"
"        Buffers: shared hit=28"
"        ->  Bitmap Heap Scan on register_uz_accounting_entities  (cost=35.42..4558.17 rows=1453 width=1250) (actual time=0.036..0.083 rows=22 loops=1)"
"              Recheck Cond: ((effective_on > '2014-07-11'::date) OR ((effective_on = '2014-07-11'::date) AND (id > 1459)))"
"              Buffers: shared hit=28"
"              ->  BitmapOr  (cost=35.42..35.42 rows=1453 width=0) (actual time=0.026..0.026 rows=0 loops=1)"
"                    Buffers: shared hit=6"
"                    ->  Bitmap Index Scan on idx2  (cost=0.00..6.49 rows=275 width=0) (actual time=0.017..0.017 rows=15 loops=1)"
"                          Index Cond: (effective_on > '2014-07-11'::date)"
"                          Buffers: shared hit=3"
"                    ->  Bitmap Index Scan on idx2  (cost=0.00..28.21 rows=1178 width=0) (actual time=0.007..0.007 rows=7 loops=1)"
"                          Index Cond: ((effective_on = '2014-07-11'::date) AND (id > 1459))"
"                          Buffers: shared hit=3"
"Total runtime: 0.204 ms"


Everything works as expected. However, if I change the constraint to a timestamp/date further in the past (resulting in far more matching rows), the query slows down drastically.

SELECT * FROM register_uz_accounting_entities
WHERE effective_on > '2014-06-11' OR (effective_on = '2014-06-11' AND id > 1459)
ORDER BY effective_on, id
LIMIT 100
 
"Limit  (cost=0.42..649.46 rows=100 width=1250) (actual time=516.125..516.242 rows=100 loops=1)"
"  Buffers: shared hit=576201"
"  ->  Index Scan using idx2 on register_uz_accounting_entities  (cost=0.42..106006.95 rows=16333 width=1250) (actual time=516.122..516.229 rows=100 loops=1)"
"        Filter: ((effective_on > '2014-06-11'::date) OR ((effective_on = '2014-06-11'::date) AND (id > 1459)))"
"        Rows Removed by Filter: 797708"
"        Buffers: shared hit=576201"
"Total runtime: 516.304 ms"


I've tried to optimize this query by pushing the LIMIT and ORDER BY down into explicit subselects.

SELECT * FROM (
   SELECT * FROM register_uz_accounting_entities 
   WHERE effective_on > '2014-06-11'
   ORDER BY effective_on, id LIMIT 100
   ) t1
UNION
  SELECT * FROM (
    SELECT * FROM register_uz_accounting_entities
    WHERE effective_on = '2014-06-11' AND id > 1459
    ORDER BY effective_on, id LIMIT 100
) t2
ORDER BY effective_on, id
LIMIT 100
 
-- query plan
"Limit  (cost=684.29..684.54 rows=100 width=1250) (actual time=2.648..2.708 rows=100 loops=1)"
"  Buffers: shared hit=203"
"  ->  Sort  (cost=684.29..684.79 rows=200 width=1250) (actual time=2.646..2.672 rows=100 loops=1)"
"        Sort Key: register_uz_accounting_entities.effective_on, register_uz_accounting_entities.id"
"        Sort Method: quicksort  Memory: 79kB"
"        Buffers: shared hit=203"
"        ->  HashAggregate  (cost=674.65..676.65 rows=200 width=1250) (actual time=1.738..1.971 rows=200 loops=1)"
"              Buffers: shared hit=203"
"              ->  Append  (cost=0.42..661.15 rows=200 width=1250) (actual time=0.054..0.601 rows=200 loops=1)"
"                    Buffers: shared hit=203"
"                    ->  Limit  (cost=0.42..338.62 rows=100 width=1250) (actual time=0.053..0.293 rows=100 loops=1)"
"                          Buffers: shared hit=101"
"                          ->  Index Scan using idx2 on register_uz_accounting_entities  (cost=0.42..22669.36 rows=6703 width=1250) (actual time=0.052..0.260 rows=100 loops=1)"
"                                Index Cond: (effective_on > '2014-06-11'::date)"
"                                Buffers: shared hit=101"
"                    ->  Limit  (cost=0.42..318.53 rows=100 width=1250) (actual time=0.037..0.228 rows=100 loops=1)"
"                          Buffers: shared hit=102"
"                          ->  Index Scan using idx2 on register_uz_accounting_entities register_uz_accounting_entities_1  (cost=0.42..30888.88 rows=9710 width=1250) (actual time=0.036..0.187 rows=100 loops=1)"
"                                Index Cond: ((effective_on = '2014-06-11'::date) AND (id > 1459))"
"                                Buffers: shared hit=102"
"Total runtime: 3.011 ms"

=> Very fast.

The question is... why is the query planner unable to make this optimization for the slow query? What am I missing?
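For what it's worth, the same lexicographic condition can also be written as a single row-value comparison, which maps directly onto the composite (effective_on, id) index (a sketch of the equivalent form, untested here):

SELECT * FROM register_uz_accounting_entities
WHERE (effective_on, id) > (DATE '2014-06-11', 1459)
ORDER BY effective_on, id
LIMIT 100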

Queries with syntax highlighting https://gist.github.com/jsuchal/0993fd5a2bfe8e7242d1

Thanks in advance.


Albe Laurenz | 21 Jul 10:02 2014

Re: Blocking every 20 sec while mass copying.

Please keep the list on CC: in your responses.

Benjamin Dugast wrote:
> 2014-07-18 13:11 GMT+02:00 Albe Laurenz <laurenz.albe <at> wien.gv.at>:
>> This sounds a lot like checkpoint I/O spikes.
>>
>> Check with the database server log if the freezes coincide with checkpoints.
>>
>> You can increase checkpoint_segments when you load data to have them occur less often.
>>
>> If you are on Linux and you have a lot of memory, you might hit spikes because too
>> much dirty data are cached; check /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio.

> The checkpoint_segments is set to 64 already
> 
> the dirty_ratio was set by default to 10; I put it down to 5
> the dirty_background_ratio was set to 5 and I changed it to 2
> 
> There are fewer freezes, but the insert is much slower than before.

That seems to indicate that my suspicion was right.

I would say that your I/O system is saturated.
Have you checked with "iostat -mNx 1"?

If you really cannot drop the indexes during loading, there's probably not much more
you can do to speed up the load.
You can try to increase checkpoint_segments beyond 64 and see if that buys you anything.

Tuning the file system write cache will not reduce the amount of I/O necessary, but it
should reduce the spikes (which is what I thought was your problem).
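
As a side note, the relevant settings can be inspected from SQL, and turning on log_checkpoints (in postgresql.conf) makes it easy to see in the server log whether the freezes line up with checkpoints; a quick sketch:

-- Current checkpoint-related settings; log_checkpoints = on logs each checkpoint.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('checkpoint_segments',
               'checkpoint_completion_target',
               'log_checkpoints');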

Yours,
Laurenz Albe

Benjamin Dugast | 18 Jul 12:52 2014

Blocking every 20 sec while mass copying.

Hello,

I'm working on Postgres 9.3.4 for a project.

We are using Scala, Akka and JDBC to insert data in the database; we have around 25M inserts to do, which are basically lines from 5000 files. We issue a DELETE for each file (mandatory) and then a COPY every 1000 lines of that file.

DELETE request : DELETE FROM table WHERE field1 = ? AND field2 = ?;
COPY request : COPY table FROM STDIN WITH CSV

We have indexes on our database that we can't delete to insert our data.

When we insert the data there are freezes on the database between requests. They occur about every 20 seconds.

Here is a screenshot from yourkit.

We tried different solutions:
  • splitting 1 table into 5 tables to reduce lock contention
  • fillfactor on indexes
  • commit_delay
  • fsync = off (that helped, but we can't keep it)

We mainly want to know why this is happening, because it slows the inserts down too much for us.

Chris Ruprecht | 18 Jul 00:47 2014

Building multiple indexes on one table.

Is there any way that I can build multiple indexes on one table without having to scan the table multiple
times? For small tables, that's probably not an issue, but if I have a 500 GB table that I need to create 6
indexes on, I don't want to read that table 6 times.
I couldn't find anything in the manual other than REINDEX, but that doesn't help, since it only rebuilds
indexes that already exist, and I don't know whether it reads the table once or multiple times. If I could
create indexes as inactive and then run REINDEX, which would then read the table once, I would have a
solution. But that doesn't seem to exist either.

best regards,
chris
-- 
chris ruprecht
database grunt and bit pusher extraordinaíre


Magers, James | 14 Jul 00:55 2014

Query Performance question

I am using a Pentaho process to access the database and select the appropriate information to update the DB tables and records.  I am trying to select the previous subscription key in order to update the fact table so that any records that have the previous key get the current subscription key. This query is intended to use the current subscription key and subscription info to select the previous subscription key, allowing the information to be updated.  I would like to optimize the query to execute more efficiently.

The database table has about 60K records in it, and when I run an EXPLAIN ANALYZE it indicates that the query optimizer chooses a bitmap heap scan, which seems like an inefficient method for this query.

Query:
Select subscription_key as prev_sub_key
from member_subscription_d
where subscription_value ='noname <at> mailinator.com'
and newsletter_nme = 'newsletter_member'
and subscription_platform = 'email'
and version = (select version -1 as mtch_vers
               from member_subscription_d
               where subscription_key = 4037516)
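
One thing that might be worth trying (a sketch only, with a hypothetical index name; whether it helps at 60K rows is not certain) is a composite index covering the equality filters plus version, so the lookup can be satisfied by a single index scan:

-- Hypothetical composite index matching the WHERE clause above.
CREATE INDEX member_subscription_d_value_vers_idx
    ON member_subscription_d (subscription_value, newsletter_nme,
                              subscription_platform, version);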


Current Data in Database for this address:
 subscription_key | version |       date_from        |          date_to           |  newsletter_nme   | subscription_platform | subscription_value | subscription_status | list_status | current_status | unsubscribetoken |    transaction_date    | newsletter_sts
------------------+---------+------------------------+----------------------------+-------------------+-----------------------+--------------------+---------------------+-------------+----------------+------------------+------------------------+----------------
          4001422 |       1 | 2000-02-09 00:00:00-05 | 2014-04-19 09:57:24-04     | newsletter_member | email                 | noname <at> mailinator.com       | VALID               | pending     | f              |                  | 2000-02-09 00:00:00-05 |              2
          4019339 |       2 | 2014-04-19 09:57:24-04 | 2014-06-04 12:27:34-04     | newsletter_member | email                 | noname <at> mailinator.com       | VALID               | subscribe   | f              |                  | 2014-04-19 09:57:24-04 |              1
          4037516 |       3 | 2014-06-04 12:27:34-04 | 2199-12-31 23:59:59.999-05 | newsletter_member | email                 | noname <at> mailinator.com       | VALID               | subscribe   | t              |                  | 2014-06-04 12:27:34-04 |              1
(3 rows)


System information:
Postgres Version: 9.2
OS : Linux cmprodpgsql1 3.2.0-37-virtual #58-Ubuntu SMP Thu Jan 24 15:48:03 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
Pentaho: 5.0.1-stable

postgresql.conf
checkpoint_segments = '8'
data_directory = '/var/lib/postgresql/9.2/main'
datestyle = 'iso, mdy'
default_text_search_config = 'pg_catalog.english'
effective_cache_size = '2GB'
external_pid_file = '/var/run/postgresql/9.2-main.pid'
hba_file = '/etc/postgresql/9.2/main/pg_hba.conf'
ident_file = '/etc/postgresql/9.2/main/pg_ident.conf'
listen_addresses = '*'
log_line_prefix = '%t '
max_connections = '200'
max_wal_senders = '3'
port = 5432
shared_buffers = '1024MB'
ssl = off
ssl_cert_file = '/etc/ssl/certs/ssl-cert-snakeoil.pem'
ssl_key_file = '/etc/ssl/certs/ssl-cert-snakeoil.key'
timezone = 'localtime'
unix_socket_directory = '/var/run/postgresql'
wal_keep_segments = '8'
wal_level = 'hot_standby'
work_mem = '100MB'

Huang, Suya | 11 Jul 06:14 2014

GIN index not used

Hi,

 

I've got a table with a GIN index on an integer[] column. When I run a query that filters on that column, the GIN index is not used at all; it still does a full table scan. Wondering why?

 

Table is analyzed.

 

dev=# \d+ booking_weekly
                            Table "booking_weekly"
    Column    |          Type          | Modifiers | Storage  | Stats target | Description
--------------+------------------------+-----------+----------+--------------+-------------
 date         | date                   |           | plain    |              |
 id           | character varying(256) |           | extended |              |
 t_wei        | double precision       |           | plain    |              |
 booking_ts   | integer[]              |           | extended |              |
Indexes:
    "idx_booking_weekly_1_1" btree (id), tablespace "tbs_data"
    "idx_booking_weekly_1_2" gin (booking_ts), tablespace "tbs_data"

dev=# select * from booking_weekly limit 1;
-[ RECORD 1 ]----------------------------------------------------------------
date       | 2014-05-03
id         | 148f8ecbf40
t_wei      | 0.892571268670041
booking_ts | {2446685,4365133,5021137,2772581,1304970,6603422,262511,5635455,4637460,5250119,3037711,6273424,3198590,3581767,6612741,5813035,3074851}

dev=# explain analyze select * FROM booking_weekly
WHERE date = '2014-05-03' AND booking_ts @> array[2446685];

                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on booking_weekly  (cost=10000000000.00..10000344953.64 rows=1288 width=1233) (actual time=0.015..1905.657 rows=1 loops=1)
   Filter: ((booking_ts @> '{2446685}'::integer[]) AND (date = '2014-05-03'::date))
   Rows Removed by Filter: 1288402
 Total runtime: 1905.687 ms

 

Thanks,

Suya

Josh Berkus | 9 Jul 02:06 2014

Re: PGSQL 9.3 - billion rows

On 07/07/2014 06:59 AM, Nicolas Paris wrote:
> - Can PostgreSQL support such a table? (This table is the fact table of a
> datamart, so there will be many join queries against the dimension tables.)

Yes, it can.

> - If yes, I would like to test it (say, insert 2 billion test rows); what
> server configuration do I need? How much RAM?
> - If not, would it be better to think about a cluster or something else?
> - (Do you have any ideas for optimizing this table?)

Consider also trying cstore_fdw: https://github.com/citusdata/cstore_fdw
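
For anyone who hasn't used it, the basic setup looks roughly like this (a sketch following the cstore_fdw README; the foreign table definition is only illustrative, not the full observation_fact column list):

-- Requires cstore_fdw to be installed and loaded on the server.
CREATE EXTENSION cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

-- Illustrative columnar copy of part of the fact table.
CREATE FOREIGN TABLE observation_fact_cstore (
    encounter_num integer,
    patient_num   integer,
    concept_cd    character varying(50),
    start_date    timestamp without time zone,
    nval_num      numeric(18,5)
) SERVER cstore_server
  OPTIONS (compression 'pglz');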

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Nicolas Paris | 7 Jul 15:59 2014

PGSQL 9.3 - billion rows

Hello,

I have a fact table (table and indexes are below) that will probably get around 2 billion rows.

- Can PostgreSQL support such a table? (This table is the fact table of a datamart, so there will be many join queries against the dimension tables.)
- If yes, I would like to test it (say, insert 2 billion test rows); what server configuration do I need? How much RAM?
- If not, would it be better to think about a cluster or something else?
- (Do you have any ideas for optimizing this table?)

Thanks a lot !


CREATE TABLE observation_fact
(
  encounter_num integer NOT NULL,
  patient_num integer NOT NULL,
  concept_cd character varying(50) NOT NULL,
  provider_id character varying(50) NOT NULL,
  start_date timestamp without time zone NOT NULL,
  modifier_cd character varying(100) NOT NULL DEFAULT '@'::character varying,
  instance_num integer NOT NULL DEFAULT 1,
  valtype_cd character varying(50),
  tval_char character varying(255),
  nval_num numeric(18,5),
  valueflag_cd character varying(50),
  quantity_num numeric(18,5),
  units_cd character varying(50),
  end_date timestamp without time zone,
  location_cd character varying(50),
  observation_blob text,
  confidence_num numeric(18,5),
  update_date timestamp without time zone,
  download_date timestamp without time zone,
  import_date timestamp without time zone,
  sourcesystem_cd character varying(50),
  upload_id integer,
  text_search_index serial NOT NULL,
  CONSTRAINT observation_fact_pk PRIMARY KEY (patient_num, concept_cd, modifier_cd, start_date, encounter_num, instance_num, provider_id)
)
WITH (
  OIDS=FALSE
);


CREATE INDEX of_idx_allobservation_fact
  ON i2b2databeta.observation_fact
  USING btree
  (patient_num, encounter_num, concept_cd COLLATE pg_catalog."default", start_date, provider_id COLLATE pg_catalog."default", modifier_cd COLLATE pg_catalog."default", instance_num, valtype_cd COLLATE pg_catalog."default", tval_char COLLATE pg_catalog."default", nval_num, valueflag_cd COLLATE pg_catalog."default", quantity_num, units_cd COLLATE pg_catalog."default", end_date, location_cd COLLATE pg_catalog."default", confidence_num);


CREATE INDEX of_idx_clusteredconcept
  ON i2b2databeta.observation_fact
  USING btree
  (concept_cd COLLATE pg_catalog."default");


CREATE INDEX of_idx_encounter_patient
  ON i2b2databeta.observation_fact
  USING btree
  (encounter_num, patient_num, instance_num);


CREATE INDEX of_idx_modifier
  ON i2b2databeta.observation_fact
  USING btree
  (modifier_cd COLLATE pg_catalog."default");

CREATE INDEX of_idx_sourcesystem_cd
  ON i2b2databeta.observation_fact
  USING btree
  (sourcesystem_cd COLLATE pg_catalog."default");


CREATE INDEX of_idx_start_date
  ON i2b2databeta.observation_fact
  USING btree
  (start_date, patient_num);


CREATE INDEX of_idx_uploadid
  ON i2b2databeta.observation_fact
  USING btree
  (upload_id);


CREATE UNIQUE INDEX of_text_search_unique
  ON i2b2databeta.observation_fact
  USING btree
  (text_search_index);
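
I am also wondering whether range partitioning (for example by start_date) would help, so that queries and maintenance only touch a few child tables; a sketch of what I have in mind, using 9.3-style inheritance partitioning (names and ranges are purely illustrative):

-- One child table per month; the CHECK constraints let the planner skip other months.
CREATE TABLE observation_fact_2014_07 (
    CHECK (start_date >= TIMESTAMP '2014-07-01'
       AND start_date <  TIMESTAMP '2014-08-01')
) INHERITS (observation_fact);

-- Indexes are created per child table.
CREATE INDEX observation_fact_2014_07_concept_idx
    ON observation_fact_2014_07 (concept_cd);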

piuschan | 4 Jul 23:04 2014

stored procedure suddenly runs slowly in HOT STANDBY but fast in primary

We have two database servers running streaming replication between them:

Primary server
==============
OS: Linux version 2.6.18-371.3.1.el5
PostgreSQL: 9.1.11

HOT standby server
==================
OS: Linux version 2.6.32-431.11.2.el6.x86_64
PostgreSQL: 9.1.11

Since July 1, one SP has suddenly been running slowly on the HOT STANDBY server. After
investigation, I can narrow the problem down to one particular query in the SP. The
weird things are:

(1) The SP takes about 25 seconds to run on HOT STANDBY, but only 0.5
seconds on the primary
(2) If I extract the query from the SP and run it in a psql session, it runs
fine even on HOT STANDBY
(3) The SP is:
CREATE OR REPLACE FUNCTION tmp_test (p_beacon_id bigint, p_rpt_start_ts bigint, p_rpt_end_ts bigint)
RETURNS bigint AS $$
DECLARE
   --
   v_min_locate_id     bigint;
   --
BEGIN
   --
   SELECT MIN(locate_id) INTO v_min_locate_id
   FROM   event_startstop
   WHERE  beacon_id = p_beacon_id
   AND    locate_id IS NOT NULL
   AND    network_timestamp BETWEEN p_rpt_start_ts AND p_rpt_end_ts;
   --
   RETURN v_min_locate_id;
   --
EXCEPTION
   WHEN OTHERS THEN
      RAISE EXCEPTION 'tmp_test %, %', SQLSTATE, SQLERRM;
END
$$ LANGUAGE 'plpgsql' STABLE;

(4) explain analyze buffers in HOT STANDBY:
DB=# explain (analyze, buffers true) select * from tmp_test (55627, 1403989199, 1404187199);
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Function Scan on tmp_test  (cost=0.25..0.26 rows=1 width=8) (actual time=25300.000..25300.002 rows=1 loops=1)
   Buffers: shared hit=25357218 read=880466 written=4235
 Total runtime: 25300.067 ms
(3 rows)

(5) if running the SQL from psql:
DB=# explain (analyze, buffers true) select min(locate_id) from event_startstop where beacon_id=55627 and locate_id is not null and network_timestamp between 1403989199 and 1404187199;
                                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=79.11..79.12 rows=1 width=8) (actual time=0.342..0.342 rows=1 loops=1)
   Buffers: shared hit=8 read=7
   ->  Append  (cost=0.00..79.06 rows=20 width=8) (actual time=0.190..0.326 rows=11 loops=1)
         Buffers: shared hit=8 read=7
         ->  Seq Scan on event_startstop  (cost=0.00..0.00 rows=1 width=8) (actual time=0.002..0.002 rows=0 loops=1)
               Filter: ((locate_id IS NOT NULL) AND (network_timestamp >= 1403989199) AND (network_timestamp <= 1404187199) AND (beacon_id = 55627))
         ->  Bitmap Heap Scan on event_startstop_201406_b54to56k event_startstop  (cost=4.71..79.06 rows=19 width=8) (actual time=0.186..0.310 rows=11 loops=1)
               Recheck Cond: ((beacon_id = 55627) AND (network_timestamp >= 1403989199) AND (network_timestamp <= 1404187199))
               Filter: (locate_id IS NOT NULL)
               Buffers: shared hit=8 read=7
               ->  Bitmap Index Scan on event_startstop_201406_b54to56k_bidntslid_idx  (cost=0.00..4.71 rows=19 width=0) (actual time=0.170..0.170 rows=11 loops=1)
                     Index Cond: ((beacon_id = 55627) AND (network_timestamp >= 1403989199) AND (network_timestamp <= 1404187199))
                     Buffers: shared hit=5 read=1
 Total runtime: 0.485 ms
(14 rows)

Time: 159.359 ms

(6) event_startstop is a parent table with 406 child tables
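(7) In case it is relevant: inside PL/pgSQL the query is planned with parameters rather than with the literal values, which might prevent the planner from pruning the 406 child tables the way it does for the hand-typed query. A sketch of the same function using dynamic SQL so it is planned with the actual values on each call (only a sketch, not a confirmed diagnosis):

CREATE OR REPLACE FUNCTION tmp_test (p_beacon_id bigint, p_rpt_start_ts bigint, p_rpt_end_ts bigint)
RETURNS bigint AS $$
DECLARE
   v_min_locate_id bigint;
BEGIN
   -- EXECUTE ... USING re-plans with the actual parameter values on every
   -- call, which lets constraint exclusion prune the child tables.
   EXECUTE 'SELECT MIN(locate_id)
              FROM event_startstop
             WHERE beacon_id = $1
               AND locate_id IS NOT NULL
               AND network_timestamp BETWEEN $2 AND $3'
      INTO v_min_locate_id
     USING p_beacon_id, p_rpt_start_ts, p_rpt_end_ts;
   RETURN v_min_locate_id;
END
$$ LANGUAGE plpgsql STABLE;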



Huang, Suya | 4 Jul 03:44 2014

DB sessions 100 times of DB connections

Hi,

 

We experienced a DB issue yesterday, and after checking the log we found that the peak session count was 3000 while the peak number of DB connections was only around 30. The application was having problems pulling data, but there were no warnings in the DB log since max_connections was not exceeded.

 

 

How could this happen? How do sessions and connections work in Postgres?
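
As far as I understand, in Postgres itself a session and a connection are the same thing (one backend process per client connection), so a figure like 3000 sessions presumably comes from the application or a connection pooler rather than from the server. A quick sketch of checking what the server itself sees:

-- Backends currently connected vs. the configured limit.
SELECT count(*) AS current_backends,
       current_setting('max_connections') AS max_connections
FROM pg_stat_activity;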

 

Thanks,

Suya

