pgsnapshot composite index results
I frequently use my pgsnapshot database for unusual purposes and end up
running non-standard queries.
The standard indexes for pgsnapshot nodes include a GiST index on geom.
Another common index suggested by the jxapi installation instructions is
a GIN index on tags.
These indexes work well when you have a query that is highly selective
spatially or against the tags but are frequently not ideal against queries
combining a medium selective spatial condition with a medium selective tag
While working on addressmerge I encountered a situation where the query
SELECT * FROM local_all; was quicker than SELECT * FROM local_all WHERE tags
? 'addr:housenumber'; local_all was a view of the nodes, ways and
multipolygons in the local area. The speed difference was caused by a
non-optimal query plan of a query of the form SELECT * FROM nodes WHERE
st_intersects (geom,'my_geom'::geometry) AND tags ? 'addr:housenumber';
where my_geom was the EWKT for a polygon covering the area of interest.
The query plan for the first query involved an index scan of the geom gist
index. The second involved a bitmap and of the geom gist and tags gin
indexes. Unfortunately, due to the limitations of hstore statistics this was
likely not the optimal plan. An exploration of options in #postgresql lead
to the discussion of a composite gist index on (geom, tags) as an
alternative indexing strategy, which is what this message is about (after
this rather lengthy preamble.)
A composite index would be created with a statement like CREATE INDEX [
CONCURRENTLY ] ON nodes USING gist (geom, tags); This index can benefit
statements that are moderately selective in both geom and tags, but is more
important that geom be selective than that tags be selective.
All tests were done with replication stopped on my home server on a 6 7200
RPM drive RAID10 array, 32GB RAM, queries repeated to ensure consistent
caching (i.e. everything in memory). The initial runs of the queries were
obviously substantially slower from disk, but similar behavior was observed
The creation of the composite index took 24 hours, non-currently. I do not
have the creation time for the non-composite index, but I would estimate it
at 18 hours. The indexes are 153GB and 84GB respectively.
With use of transactions it is possible to drop an index then ROLLBACK the
transaction, allowing for easy testing of different combinations of queries
For the following table to make sense, use a fixed-width font.
With the WHERE tags ? 'addr:housenumber' restriction:
geom index (geom, tags) index
Total time: 3000ms 222ms
Total cost: 61622 1166
Estimated rows: 28 28
Actual rows: 78873 78873
Without the WHERE restriction:
geom index (geom, tags) index
Total time: 386ms 400ms
Total cost: 345222 347339
Estimated rows: 27986 27986
Actual rows: 184644 184644
Index scan time: 47ms 59ms
The run to run variation in total speed without the tags restriction is
greater than the different in indexes, but there is a noticeable difference
in index scan time.
Using a rectangle covering the southwest of BC I ran some further queries to
investigate the index scan time. Total query time was about 9 seconds, but
it's the index scan part we're interested in.
Forcing the composite index to be used increased the scan time from 1.28s to
1.58s, an approximately 20% increase. The rest of the query time remained
Putting these results into an xapi context, the use of a composite index
would slow down map? type queries. The index scan is a small part of the
total response time. If most of the time is spent retrieving nodes for
backfilling (done by ID), serializing XML or doing joins with way_nodes or
relation_members, the time spent scanning the index is a minor issue.
As a gist composite index is substantially slower than a gin index it would
not replace it. This would mean there would be essentially no speed change
for *[key=value] queries without a bbox restriction.
Where it would substantially speed up queries is for moderately selective
ones, e.g. fetch all Starbucks in the bounding polygon for the US, or the
case where I fetched all addresses in a city.
Something that I haven't touched on yet is updates. The speed of osmosis
updates to pgsnapshot databases is not well explored. Toby has investigated
slow queries that occur during diff processing and the queries he
investigated did not involve any use of geometry indexes, but the statement
that he looked at does involve an update of the ways linestring. These
updates would require updates to the geometry index which would presumably
be slower with a composite index. I don't have much experience in reading
EXPLAIN results for updates, but I think about 25% of the time is spent
updating the row and indexes. I have no idea how much of this is spent on
linestring index updates.
What I also don't know is how much time is spent inserting nodes. I would
expect that most changes to OSM are the creation of nodes and even if these
queries are individually quick they may compose a significant portion of the
overall update time by sheer volume.
Is a composite index worth it? It depends on your use case. If you are
purely using a pgsnapshot database from osmosis which never uses the tags in
queries then it is clearly not worth it. For xapi map? queries it is also
not worth it. Anything involving both geographic filters and tag filters may
benefit from it, but at cost of potentially slower queries for purely
spatial queries and an unknown impact on updates. There is also a disk space
hit to consider, although an additional 70GB of indexes on a database that
is already 750GB may not be a huge issue.
Something to keep in mind for an xapi situation is that the more IO time
spent on updates the less that can be spent on queries, balancing out the
speed increase from faster queries. On the other hand, a 10x increase (or
better!) on the right queries is significant.
On case where it's a clear winner is where *all* queries involve both a
spatial and tag component and there isn't a need for a separate gin index if
the composite index is used. The separate gist geom and gin tag indexes
could then be replaced by one gist (geom, tags) index, saving space and not
slowing down updates with additional index updates.
: Complete MP tag handling is not required for this application so