Christian Gough | 17 Apr 20:25 2015

Postgresql Host Swapping Hard With Abundant Free Memory


We have been having some pretty painful site outages related to heavy swapping, even though the system may
appear to have as much as 10GB of free memory.
It seems that as soon as the system hits ~22GB (of 32GB) of memory usage it starts to swap.  As soon as we go below
~22GB, swap is released.  

During the worst outages we see:

heavy swapping      (10-15GB)
heavy disk IO       (up to 600ms)
heavy CPU load: 4.0 (load over 100+ with 26 cores)
available memory:   (6-8GB)

Here's the host information, it's a dedicated physical host, not a VM.

ubuntu-14.04.1 LTS
Linux db004 3.13.0-34-generic #60-Ubuntu SMP Wed Aug 13 15:45:27 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
24 cores, 32GB RAM, 32GB swapfile
RAID 10 with SSDs

Postgres version:

postgis-2.1.4 extension

Here's our /etc/sysctl.conf:

kernel.shmmax = 16850395136
kernel.shmall = 4113866
(Continue reading)

Andomar | 22 Apr 19:37 2015

Query plan with missing timespans

This is a question about how to read "explain analyze".  I've anonymized 
column names and table names.

In the output of "explain analyze" below, what was the query doing 
between actual time 1.426 and 17.077?

Kind regards,

  HashAggregate  (cost=862.02..862.62 rows=48 width=90) (actual 
time=17.077..17.077 rows=0 loops=1)
    Group Key: col, col, col
    Buffers: shared hit=6018
    ->  Nested Loop  (cost=1.52..861.18 rows=48 width=90) (actual 
time=17.077..17.077 rows=0 loops=1)
          Buffers: shared hit=6018
          ->  Nested Loop  (cost=1.09..26.74 rows=303 width=41) (actual 
time=0.033..1.426 rows=384 loops=1)
                Buffers: shared hit=845
                ->  Index Scan using ind on tbl  (cost=0.42..8.44 rows=1 
width=8) (actual time=0.010..0.011 rows=1 loops=1)
                      Index Cond: (col = 123)
                      Buffers: shared hit=4
                ->  Index Scan using ind on tbl (cost=0.67..18.28 rows=2 
width=49) (actual time=0.020..1.325 rows=384 loops=1)
                      Index Cond: (col = col)
                      Filter: (col = 'value')
                      Rows Removed by Filter: 2720
                      Buffers: shared hit=841
          ->  Index Scan using index on tbl (cost=0.42..2.74 rows=1 
(Continue reading)

Jon Dufresne | 19 Apr 19:16 2015

extract(year from date) doesn't use index but maybe could?

Given the table:


With an *index* on field d. The following two queries are functionally

1. SELECT * FROM dates WHERE d >= '1900-01-01'
2. SELECT * FROM dates WHERE EXTRACT(year from d) >= 1900'

By functionally equivalent, they will return the same result set.

Query 2 does not use the index, adding a performance cost. It seems
there is an opportunity for optimization to handle these two queries
equivalently to take advantage of the index.

Some database abstraction layers have attempted to workaround this
limitation by rewriting EXTRACT(year ...) queries into a query more
like query 1. For example: Django's ORM does exctly this. Rather than
all abstraction layers trying to optimize this case, maybe it could be
pushed to the database layer.

I have written a test script that demonstrates that these functionally
equivalent queries have different performance characteristics. The
script and results are provide below:


EXPLAIN SELECT * FROM dates WHERE d >= '1900-01-01'
(Continue reading)

Andreas Joseph Krogh | 15 Apr 03:46 2015

Performance of vacuumlo

Hi all.
I have a pg_largeobject of ~300GB size and when I run "vacuumlo -n <dbname>", I get:
Would remove 82172 large objects from database "<dbname>".
So I'm running without "-n" to do the actual work, but it seems to take forever. The disks are 8 SAS 10K HDD drives in RAID5.
Any hints on how long this is supposed to take?
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
Josh Berkus | 14 Apr 21:58 2015

Re: Some performance testing?


> Can confirm that for pg purposes, 3.2 is basically broken in some not
> to great ways. We've had servers that were overloaded at load factors
> of 20 or 30 drop down to 5 or less with just a change from 3.2 to
> 3.11/3.13 on ubuntu 12.04

That's correct, and 3.5 shares the same problems.  The underlying issue
was that 3.X was tweaked to be MUCH more aggressive about
cache-clearing, to the point where it would be evicting data from the FS
cache which had just been read in and hadn't even been used yet.  For
some reason, this aggressive eviction got worse the more processes on
the system which were using the FS cache, so where you really see it is
when you have more processes with cache than you have cores.

It's pretty easy to demonstrate just using pgbench, with a database
larger than RAM, and 2X as many clients as cores.  You'll see that
kernels 3.2 and 3.5 will do 3X to 5X as much IO for the same workload as
3.10 and later will do.


On 04/09/2015 04:01 AM, Graeme B. Bell wrote:> performance with 2.6:
(pgbench, size 100, 32 clients)
> 48 651 transactions per second (read only)
> 6 504 transactions per second (read-write)
> performance with 3.18 (pgbench, size 100, 32 clients)
> 129 303 transactions per second  (read only)
> 16 895 transactions (read-write)

Thanks for that data!  I'm glad to see that 3.18 has improved so much.

Josh Berkus
PostgreSQL Experts Inc.


Sent via pgsql-performance mailing list (pgsql-performance <at>
To make changes to your subscription:

Andreas Joseph Krogh | 9 Apr 23:39 2015

Cannot get query to use btree-gin index when ORDER BY

Hi all.
Using PG-9.4
I have a query which goes something like this
The simplified schema is like this:
id serial PRIMARY KEY,
subject varchar NOT NULL,
folder_id integer NOT NULL REFERENCES folder(id),
received timestamp not null,
fts_all tsvector
create index message_fts_all_folder_idx ON message using gin (fts_all, folder_id);
SELECT, m.subject
  FROM message m
WHERE m.folder_id = 1 AND m.fts_all <at> <at> to_tsquery('simple', 'foo')
ORDER BY received DESC LIMIT 10;
On my dataset it uses an index I have on (folder_id, received DESC), then filters the result, which is not optimal when searching in > 1million messages and the result is large and I'm only after the first (newest) 10.
What I'd like is to have an index like this:
create index message_fts_all_folder_idx ON message using gin (fts_all, folder_id, received DESC);
but GIN doesn't allow ASC/DESC modifiers.
Any hints on how to optimize this?
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
Graeme B. Bell | 9 Apr 16:45 2015

NVMe or AHCI PCI-express? A comment for people benchmarking...

A tangent to the performance testing thread here, but an important issue that you will see come up in your
work this year or next. 

"PCIe SSD" may include AHCI PCI SSD or   NVMe PCI SSD.

AHCI = old style, basically it's faster than SATA3 but quite similar in terms of how the operating system
sees the flash device.
NVMe = new style, requires a very new motherboard, operating system & drivers, but extremely fast and low
latency, very high IOPS. 
For example, Macbooks have PCIe SSDs in them, but not NVMe (currently).

The difference is very important since NVMe offers multiples of performance in terms of everything we
love: lower latency, higher IOPS, lower CPU overhead and higher throughput.
scroll down to the "App to SSD IO Read Latency" graph. Look at the two bottom lines.

So I'd suggest it's probably worth noting in any benchmark if you are using NVMe and if so which driver
version, since development is ongoing.

On the topic of PCIe NVMe SSDs, some interesting reading:

"it can deliver 750,000 random read IOPS and 115,000 write IOPS "

- or any of these nice toys...

all with capacitor backing (which you should plug-pull test, of course).


> I currently have access to a matched pair of 20-core, 128GB RAM servers
> with SSD-PCI storage, for about 2 weeks before they go into production.
> Are there any performance tests people would like to see me run on
> these?  Otherwise, I'll just do some pgbench and DVDStore.


Sent via pgsql-performance mailing list (pgsql-performance <at>
To make changes to your subscription:

Josh Berkus | 8 Apr 21:05 2015

Re: Some performance testing?

On 04/07/2015 11:07 AM, Mel Llaguno wrote:
> Care to elaborate? We usually do not recommend specific kernel versions
> for our customers (who run on a variety of distributions). Thanks, M.

You should.

Performance is literally 2X to 5X different between kernels.

Josh Berkus
PostgreSQL Experts Inc.


Sent via pgsql-performance mailing list (pgsql-performance <at>
To make changes to your subscription:

Pietro Pugni | 8 Apr 12:37 2015

Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

Ciao Pietro,
stavo seguendo thread sulla mailing list Postgresql.

Puoi farmi un piccolo riassunto delle conclusioni perchè non sono sicuro di aver capito tutto?

Ciao Domenico,
sì effettivamente la mailing list è un po’ dispersiva.
Utilizzando il collation di tipo “C” il Dell T420 impiega meno tempo rispetto al Mac Mini nell’esecuzione di qualsiasi query. La differenza finora è del 20%-30% circa. Adesso sto facendo un caricamento dati massiccio che sul Dell impiegava 3 giorni circa mentre sul Mac Mini impiega 1 giorno e mezzo circa e vi farò sapere il tempo di esecuzione.

Alla fine la differenza di performance tra il DELL R420 e il Mac MINI è dovuta al tipo di "collate" utilizzato nell'inizializzazione del DB?

Possono essere adoperati diversi collation all’interno dello stesso DB. Ogni tabella può averne uno diverso, persino ogni attributo può avere un collation diverso dagli altri attributi della stessa tabella o della stessa SELECT.

Il comando:
initdb —no-locale

imposta il collate “C” come predefinito per il DB creato.
Tuttavia nel file di postgresql.conf sono presenti le seguenti variabili di configurazione che definiscono il collate predefinito per l’istanza di Postgres avviata:
# These settings are initialized by initdb, but they can be changed.
lc_messages = 'C'                       # locale for system error message
                                        # strings
lc_monetary = 'C'                       # locale for monetary formatting
lc_numeric = 'C'                        # locale for number formatting
lc_time = 'C'                           # locale for time formatting

In ogni caso mi aspettavo performance migliori da questo Dell, soprattutto considerando il livello della macchina (controller RAID di un certo livello, 10 dischi rigidi SAS da 10k rpm per i dati e 2 dischi SAS da 15k rpm per sistema operativo e WAL, 128GB di RAM e 2 CPU Xeon per un totale di 16 core fisici e 16 logici).
Oltre al collation è bene modificare le impostazioni del BIOS e del controller RAID; il primo perché su questi sistemi sono impostate opzioni di risparmio energetico che tagliano la potenza, mentre per il RAID è bene scegliere una cache in scrittura del tipo WriteBack e in lettura del tipo ReadAhead e poi controllare che la batteria della cache di scrittura sia carica, altrimenti non viene adoperata nessuna cache.
Ovviamente poi la configurazione RAID fa tanto; per ora utilizziamo RAID5 che non è affatto performante in scrittura ma lo è in lettura.
Il tutto dipende dall’applicazione specifica. Nel nostro caso è una sorta di data warehouse ibrido, quindi una soluzione mono utenza che fa uso di grandi moli di dati (siamo nell’ordine dei 1.8TB di database): poche transazioni ma massicce. Le impostazioni sul kernel agiscono soprattutto sulle performance in ambienti multi utenza (migliaia di connessioni) e con le ultime versioni di Postgres e del kernel linux il sistema è già abbastanza auto bilanciato.

Spero di essere stato sufficientemente chiaro ed esaustivo.
Cordiali saluti,

Pietro Pugni | 7 Apr 19:45 2015

Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

Hi Jeff

The default collation for the database cluster is set when you create the cluster with initdb (the package you used to install postgresql might provide scripts that wrap initdb and call it something else, sorry I can't be much use with those).  
You can set it with --lc-collate flag to initdb, otherwise it is set based on the environment variables (LANG or LC_* variables) set in the shell you use to run initdb.

Note that you can create a new database in the cluster which has its own default which is different from the cluster's default.

guess what? it worked! Now I run the following command:
/usr/local/pgsql/bin/initdb -D /mnt/raid5/pg_data --no-locale --encoding=UTF8

Time execution for my reference transaction went from 2m9s to 1m18s where Mac Mini is 1m43s.
This is the best improvement after modifying BIOS settings.
I’ll do some testing. In the meanwhile I’ve made some kernel changes which may help in heavy workloads (at the moment they don't affect performance).

I should offer you a dinner…
Thank you a lot. Now I’m looking to push it faster (I think it can goes faster than this!).

Best regards,
Josh Berkus | 7 Apr 19:41 2015

Re: Some performance testing?

On 04/07/2015 09:46 AM, Mel Llaguno wrote:
> FYI - all my tests were conducted using Ubuntu 12.04 x64 LTS (which I
> believe are all 3.xx series kernels).

If it's 3.2 or 3.5, then your tests aren't useful, I'm afraid.  Both of
those kernels have known, severe, memory management issues.

Josh Berkus
PostgreSQL Experts Inc.


Sent via pgsql-performance mailing list (pgsql-performance <at>
To make changes to your subscription: