### pgsnapshot alternative query optimizations: ways using nodes

In part 1 of a N+1 part series, I'm intending to outline some of the
benchmarking I have done into different queries and indexes with a
pgsnapshot database.
Meta: Where on the wiki should this type of stuff go?
All benchmarking times are with a hot cache, but similar trends were
observed with a cold cache. I tested with two areas: the large area
-87.75,41.80,-87.65,41.85 (322k nodes, 58.1k ways) and the small area
-111.92,33.82,-111.91,33.83 (159 nodes, 30 ways)
My testing machine is a 6 core AMD 1090T with all databases on a 6 x 7200
RPM drive array, 32GB ram.
The traditional way of running a query to get all ways which contain a set
of nodes as their children is to do two joins. This type of query is used in
extracting a bbox from the database.
First you need a set of nodes. This could be generated by this query for a
bbox
CREATE TEMPORARY TABLE bbox_nodes AS
SELECT * FROM nodes
WHERE geom &&
ST_SetSRID(ST_MakeBox2D(ST_Point(-87.75,41.80),ST_Point(-87.65,41.85)),4326)
;
The exact query doesn't matter much - it could be a polygon and
ST_Intersects(geom, area_geom) and everything below should still apply. I am
assuming that some kind of geometry query is used. A random list of nodes
scattered across the globe is a different problem. There's also not
optimization to be done here. What's important is the number of nodes in the
table.
The next step is to get rows for all ways that use at least one of the
nodes. There are two alternative strategies which give different results
that are ***not*** equivalent.
1) Use PostGIS geometry operations and take advantage of the fact that any
way that has a node in the bbox is also in the bbox. This can be done two
ways:
1a) Get a list of ways where their linestring is in the area. This would be
SELECT ways.id, ways.version, ways.user_id,
ways.tstamp, ways.changeset_id, ways.tags, ways.nodes
FROM ways WHERE ST_Intersects(linestring,
ST_SetSRID(ST_MakeBox2D(ST_Point(-87.75,41.80),ST_Point(-87.65,41.85)),4326)
)
This will return all the ways required plus any ways that pass through the
bbox but do not have a node there. Returning extra ways is generally
acceptable. To run this query the linestring column must be built and
maintained which adds significant time requirements for the initial import
and updates and increases the database size by about 40GB for the column +
15GB for the index.
This query took 125ms hot for the large area and 3ms for the small one.
Using && instead of ST_Intersects changes the time to 60ms for the large
area and 6ms for the small one.
1b) As with 1a, but use bboxes instead of linestrings. This will return all
the ways returned by 1a plus ways where the bbox and way bbox intersect even
though the way itself does not intersect the bbox.
This should take less space and be faster than 1a, but return more ways.
This may be an issue with long diagonal ways like some country boundary
ways.
2) Get a list of ways where one of their nodes is in the bbox. This can be
done two ways
2a) With the way_nodes table and JOINs. The way_nodes tables and indexes are
required for updates and require 112GB for the table + 155GB for indexes.
This takes a query of the form
SELECT DISTINCT ways.id, ways.version, ways.user_id,
ways.tstamp, ways.changeset_id, ways.tags, ways.nodes
FROM ways
JOIN way_nodes ON (ways.id=way_nodes.way_id)
JOIN bbox_nodes ON (way_nodes.node_id=bbox_nodes.id);
The query plan is hit by a query planner bug[1] but if appropriate costs are
set, it uses the index on way_nodes.node_id, ways.id and a sequential scan
of bbox_nodes.
This takes 4.7s for the large area and 10ms for the small area. About half
this time is the sort and de-duplicate for the DISTINCT. It takes about 50s
for the large and 1s for the small with unprepared caches.
2b) Use the ways.nodes column, a bigint[] column. This column contains an
array which is an ordered list of the nodes of the way. This is actually
what I originally set out to evaluate, but
Obviously an index is required, and a GIN index can be created. This index
is about 120GB and takes about 3 days to create. This is likely only worth
osmosis is recoded to use it instead of way_nodes
The GIN index allows the array operators < <at> , <at> >, = and && to use the index.
The first way is a query like
SELECT DISTINCT ways.id, ways.version, ways.user_id,
ways.tstamp, ways.changeset_id, ways.tags, ways.nodes
FROM ways
JOIN bbox_nodes ON (ARRAY[bbox_nodes.id] && ways.nodes)
A second way is a query like
SELECT ways.id, ways.version, ways.user_id,
ways.tstamp, ways.changeset_id, ways.tags, ways.nodes
FROM ways
WHERE ways.nodes && array(SELECT id FROM bbox_nodes);
Note: DISTINCT is not required here.
The second form is much faster, but still slow. A 75k bbox_nodes table takes
about 15 minutes and I never let the 350k node one finish. The small area
takes 1.3 seconds from hot caches.
Conclusions:
Clearly a GIN index on ways.nodes is not useful. It uses up slightly less
space but it's very slow.
Geometry indexes are fast, but take up space. It is necessary to "backfill"
nodes as well to include nodes outside the bbox that are members of ways in
the bbox, and this would need to be done outside of the geometry indexes.
A bbox column is likely fastest with minimal space requirements, but it
might be necessary to filter the results with JOINs.
[1]:
http://lists.openstreetmap.org/pipermail/osmosis-dev/2013-April/001538.html