Nicolas Colomer | 12 Mar 10:28 2013

OSM entity processing order

Hi Osmosis community!

When manipulating an OSM file (compressed or not) with Osmosis, can we assume that entities will always be processed in this order: 1. bound, 2. node, 3. way, 4. relation?

This seems logical since the OSM file format guarantees that "blocks come in this order" (see the OSM XML #File format wiki page).

In addition, I came across a post where Brett said:

> This is due to the way Osmosis processing works because it finishes processing nodes before it sees the ways.

I just want to make sure my understanding is correct :)

Thank you very much!

Best regards,
Nicolas
Brett Henderson | 17 Feb 04:05 2013

Osmosis 0.42 Released

Hi All,

I've just released Osmosis 0.42.  It was easier to create a new release than to continue responding to limitations in 0.41 :-)
http://bretth.dev.openstreetmap.org/osmosis-build/osmosis-0.42.tgz
http://bretth.dev.openstreetmap.org/osmosis-build/osmosis-0.42.zip
http://dev.openstreetmap.de:23457/hudson/job/osmosis-release/8/

From changes.txt:
  • Fix PostgreSQL timestamp bugs in apidb replication logic.
  • Fix replication file merging boundary corrections.  Occurs when catching up after outages.
  • Replication logic correctly honours the max timestamp parameter.
  • Prevent replication file downloader from reading beyond maximum available replication interval.
  • Prevent replication file downloader from stalling if interval is too long.
  • Improve error reporting when an unknown global option is specified.
  • Disable automatic state.txt creation for --read-replication-interval.
  • Add --tag-transform plugin and task.
  • Reduce number of file handles consumed by file-based sorting.
  • Make the default id tracker Dynamic for --used-node and --used-way.
  • Use Gradle for the automated build scripts instead of Ant/Ivy.
  • Fix PostgreSQL ident authentication.
  • Remove obsolete debian build scripts.
  • Eliminate use of deprecated Spring SimpleJdbcTemplate.
  • Improve handling of invalid geometries in --pgsql-xxx tasks.
  • Default keepInvalidWays option on --pgsql-xxx tasks to true.
  • Enable keepInvalidWays functionality for --pgsql-xxx replication.
  • Fix pgsnapshot COPY load script to use ST_ prefix for all PostGIS functions.
    Let me know if you see any issues.

    Brett

    
    Ilya Zverev | 6 Feb 10:05 2013

    32-bit limit in IdTrackers

    Hi! As some of you have read 
    (http://lists.openstreetmap.org/pipermail/dev/2013-February/026495.html), 
    in three days node ids are expected to surpass 2147483647, and this 
    method
    
    https://github.com/openstreetmap/osmosis/blob/master/core/src/main/java/org/openstreetmap/osmosis/core/util/LongAsInt.java#L30 
    will throw an exception "Cannot represent " + value + " as an integer." 
    It is used in every IdTracker implementation, so id trackers will become 
    unusable.
    
    This will affect tag and area filters. Regional extracts that are made 
    with osmosis will break. There is a comment at the start of each 
    IdTracker class: "The current implementation only supports 31 bit 
    numbers, but will be enhanced if and when required." I guess now is the 
    time. Can anybody fix that? There must be a reason why this hasn't been 
    done sooner.
    
    IZ
    
    Oliver Schrenk | 4 Feb 15:42 2013

    Eclipse Setup, Missing task types

    Hi,
    
    Are there more current notes about how to set up Eclipse for osmosis development than the notes in [1]?
    
    I know that ant has been deprecated in favor of gradle, so I installed Eclipse Gradle Support via [2] and ran
    
    	$ git clone https://github.com/openstreetmap/osmosis
    	$ cd osmosis
    	$ ./gradlew assemble
    
    and proceeded to import the osmosis multi-module project using `File > Import > Gradle`. Everything compiles fine.
    
    But when I try to execute a command like
    
    	osmosis --read-xml file="bremen.osm.bz2" --write-apidb-0.6 host="127.0.0.1"
    database="api06_test" user="osm" password="osm" validateSchemaVersion=no
    
    using a Run Configuration with `org.openstreetmap.osmosis.core.Osmosis` as the main class
    and
    
    	--read-xml file="bremen.osm.bz2" --write-apidb-0.6 host="127.0.0.1" database="api06_test"
    user="osm" password="osm" validateSchemaVersion=no
    
    as program arguments I get
    
    	Feb 04, 2013 3:31:39 PM org.openstreetmap.osmosis.core.Osmosis run
    	INFO: Osmosis Version 0.41-55-gb44b7d7-dirty
    	Feb 04, 2013 3:31:39 PM org.openstreetmap.osmosis.core.Osmosis run
    	INFO: Preparing pipeline.
    	Feb 04, 2013 3:31:39 PM org.openstreetmap.osmosis.core.Osmosis main
    	SEVERE: Execution aborted.
    	org.openstreetmap.osmosis.core.OsmosisRuntimeException: Task type read-xml doesn't exist.
    		at org.openstreetmap.osmosis.core.pipeline.common.TaskManagerFactoryRegister.getInstance(TaskManagerFactoryRegister.java:60)
    		at org.openstreetmap.osmosis.core.pipeline.common.Pipeline.buildTasks(Pipeline.java:51)
    		at org.openstreetmap.osmosis.core.pipeline.common.Pipeline.prepare(Pipeline.java:112)
    		at org.openstreetmap.osmosis.core.Osmosis.run(Osmosis.java:86)
    		at org.openstreetmap.osmosis.core.Osmosis.main(Osmosis.java:37)
    
    It doesn't seem to pick up the various tasks. 
    
    My end goal is to debug write-apidb-0.6: I'm trying to write data to an unsupported database, running into
    problems with duplicate user entries, and I want to use Eclipse's debugger to step through the code.
    
    Best regards
    Oliver
    
    [1] http://wiki.openstreetmap.org/wiki/Osmosis/Development#Eclipse_Setup
    [2] http://static.springsource.org/sts/docs/latest/reference/html/gradle/installation.html
    
    OSX 10.8.2
    Java 1.7.0_11-b21
    osmosis 0.41-55-gb44b7d7-dirty
    
    Toby Murray | 31 Jan 02:47 2013

    Duplicate ways in pgsnapshot database

    Today my minutely replication started failing with a unique constraint
    violation error from postgres. Upon further investigation I found that
    there were *already* two copies of a way in my database. An incoming
    change was trying to modify the way which caused postgres to notice
    the duplication and error out. Basically a "hey wait there are two of
    them. Which one do you want me to modify?" Here is the osmosis output:
    
    Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key
    value violates unique constraint "pk_ways"
      Detail: Key (id)=(26926573) already exists.
    
    It was erroring on this way:
    http://www.openstreetmap.org/browse/way/26926573/history
    
    So a few questions immediately come to mind.
    
    1) How did a duplicate record get into the database? There is
    definitely a primary key constraint on the id column. In this
    particular case it looks like it happened during the initial planet
    import. I did this from the January 2nd pbf file. The two rows are
    identical in every way and the way was last touched (before today's
    edit) in 2009. All constraints are disabled during the \copy operation
    so I can see a duplicate way being able to get in. Although this
    implies that there are either two copies of the way in the planet file
    or a bug in osmosis. I would have thought the primary key constraint
    would have been checked when it was recreated after the \copy
    operation though. Apparently not.
    
    2) How do I fix this? I believe deleting one of the rows would fix
    this but I can't actually delete only one since *every* column is the
    same. I think it was suggested on #osm-dev that I create a copy of one
    in temp table, delete both and then reinsert the copy. This is
    probably what I will try.
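    
    Something along these lines is what I have in mind (just a sketch against
    the standard pgsnapshot ways table, not tested yet):
    
        BEGIN;
        -- keep a single copy of the duplicated row in a temp table
        CREATE TEMP TABLE ways_dedup AS
            SELECT DISTINCT * FROM ways WHERE id = 26926573;
        -- remove both identical rows, then put the single copy back
        DELETE FROM ways WHERE id = 26926573;
        INSERT INTO ways SELECT * FROM ways_dedup;
        COMMIT;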
    
    3) Are there any others? Turns out: yes, there are 4 duplicated ways
    in my database. This may not come through with good formatting but
    here they are:
        id    | version | user_id |       tstamp
    ----------+---------+---------+---------------------
     26245218 |      12 |  163673 | 2011-02-06 06:54:10
     26245218 |      13 |  290680 | 2013-01-28 02:37:56
     26709186 |       4 |   64721 | 2008-09-02 04:39:21
     26709186 |       4 |   64721 | 2008-09-02 04:39:21
     26709284 |       4 |   70621 | 2008-10-26 14:06:03
     26709284 |       5 |   64721 | 2013-01-28 02:38:30
     26926573 |       4 |  118011 | 2009-12-27 07:13:28
     26926573 |       4 |  118011 | 2009-12-27 07:13:28
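    
    (For anyone wanting to check their own database, a query roughly like the
    following should turn up such duplicates - a sketch, assuming the standard
    pgsnapshot schema:)
    
        SELECT id, version, user_id, tstamp
        FROM ways
        WHERE id IN (SELECT id FROM ways GROUP BY id HAVING count(*) > 1)
        ORDER BY id, version;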
    
    A couple of interesting things here.
    - Two of them have identical duplicates (26709186 and 26926573). These
    can both be explained by an error in the planet file or import
    process.
    - The other two however are not the same and both of them must have
    been created during diff application because it happened 2 days ago -
    within 10 seconds of each other. It is possible that there were
    duplicates of these ways as well and for some reason they just didn't
    hit this error during diff application and one of the records was
    successfully updated.
    
    Soo... wtf? Does anyone have ideas about how postgres' primary
    key check could be circumvented? Is my theory about the \copy getting
    around it during import feasible? But what about the ones created
    during diff processing? Looking at my system monitoring I don't see
    anything unusual going on 2 days ago. I've been having problems with X
    on this machine but that won't affect postgres and osmosis is running
    inside of screen. Soo... yeah. Anything? :)
    
    Toby
    
    Frederik Ramm | 28 Jan 10:49 2013

    Un-Redacting Stuff

    Hi,
    
        with the license change we introduced the concept of "redacted" 
    objects. Since "redacting" an old version touches that version in the 
    database, initially such redactions made Osmosis issue diffs that 
    contained that old version; we then introduced a quick fix to stop that:
    
    https://github.com/openstreetmap/osmosis/blob/master/apidb/src/main/java/org/openstreetmap/osmosis/apidb/v0_6/impl/EntityDao.java#L450
    
    We're now also using "redaction" to suppress objects where a copyright 
    violation has occurred - but mistakes are possible, so we need to have a 
    way to un-redact things if necessary, i.e. remove the "redaction_id" 
    from a historic version again.
    
    Simply setting the column to NULL will, again, make Osmosis issue a diff 
    that contains the old version; this is unwanted.
    
    How could we proceed?
    
    Ideas:
    
    1. Introduce special value "0" (not NULL) to denote an un-redacted 
    object; leave Osmosis unchanged (so it treats NULL and 0 differently, 
    will only issue .osc for objects with redaction_id=NULL), and modify 
    other API code to treat 0 and NULL the same (so historic versions can be 
    accessed through the API if redaction_id=NULL or 0). Cheap, easy, but a 
    bit ugly.
    
    2. Introduce an additional column "suppress_diff" to the 
    nodes/ways/relations tables; on un-redaction, set redaction_id=NULL and 
    suppress_diff=TRUE; modify Osmosis by adding an "and not suppress_diff" 
    to the SQL query (sketched below). Would increase database size by 
    something like 4 GB for the extra column.
    
    3. Introduce an additional table "un-redacted objects", store object 
    type, version, and id; when an object is un-redacted, add it to that 
    table and clear the object's redaction_id, then modify the Osmosis query 
    to only output objects that are not found in that table. Uses little 
    space but makes diff creation slower.
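    
    At the SQL level the three ideas would look roughly like this (a sketch
    only; as noted above, the current behaviour amounts to a "redaction_id IS
    NULL" filter in the diff query, and the "unredactions" table and its
    columns in option 3 are purely illustrative names):
    
        -- 1. special value 0 means "un-redacted"; Osmosis diff query unchanged
        SELECT node_id, version FROM nodes WHERE redaction_id IS NULL;
    
        -- 2. extra boolean column; diff query gains one predicate
        SELECT node_id, version FROM nodes
        WHERE redaction_id IS NULL AND NOT suppress_diff;
    
        -- 3. extra table of un-redacted versions; diff query anti-joins against it
        SELECT n.node_id, n.version FROM nodes n
        WHERE n.redaction_id IS NULL
          AND NOT EXISTS (SELECT 1 FROM unredactions u
                          WHERE u.object_type = 'node'
                            AND u.object_id   = n.node_id
                            AND u.version     = n.version);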
    
    There might be more...
    
    Bye
    Frederik
    
    --
    
    -- 
    Frederik Ramm  ##  eMail frederik@...  ##  N49°00'09" E008°23'33"
    
    Daniel Kaneider | 24 Jan 21:47 2013

    pgsimple/pgsnapshot possible bug

    Hi,

    I did an import of OSM data into a PostgreSQL 9.2 DB using osmosis 0.41. The pgsnapshot_load script stopped because some functions could not be found (Envelope, Collect). If I am not wrong, then

    UPDATE ways SET bbox = (
        SELECT Envelope(Collect(geom))
        FROM nodes JOIN way_nodes ON way_nodes.node_id = nodes.id
        WHERE way_nodes.way_id = ways.id
    );

    should be changed to

    UPDATE ways SET bbox = (
        SELECT ST_Envelope(ST_Collect(geom))
        FROM nodes JOIN way_nodes ON way_nodes.node_id = nodes.id
        WHERE way_nodes.way_id = ways.id
    );

    This should also apply to the pgsimple script.

    Best,
    Daniel Kaneider




    
    Frederik Ramm | 23 Jan 23:02 2013

    Question regarding the replication file structure

    Hi,
    
        I'm toying with the idea of offering regionalised diffs - i.e. a 
    series of daily diffs for every regional extract that 
    download.geofabrik.de has to offer. To make it easy for consumers to 
    keep their extracts up to date, I thought about making an Osmosis-style 
    directory for each extract, e.g. something like
    
    download.geofabrik.de/openstreetmap/europe/germany/nordrhein-westfalen/000/000/001.osc.gz 
    
    or so. Just to be safe: What are the conventions that I will have to 
    follow so that this works seamlessly with existing clients? Simply have 
    a xxx.osc.gz and matching xxx.state.txt in the leaf directory, count 
    from 000 to 999 then wrap to the next directory, and have the most 
    recent state.txt file at the root directory as well - anything else?
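    
    Concretely, I picture the layout per extract looking roughly like this
    (sequence numbers and timestamps purely illustrative):
    
        nordrhein-westfalen/
            state.txt                 <- copy of the newest xxx.state.txt
            000/000/001.osc.gz
            000/000/001.state.txt
            000/000/002.osc.gz
            000/000/002.state.txt
    
    with each xxx.state.txt being a Java properties file containing at least
    something like
    
        sequenceNumber=2
        timestamp=2013-01-23T22\:00\:00Z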
    
    If the frequency wasn't exactly daily - if, say, because of some sort of 
    glitch there was no extract for one day and therefore the diff is missing, 
    or if there were two extracts in one day - would that matter?
    
    Bye
    Frederik
    
    --
    
    -- 
    Frederik Ramm  ##  eMail frederik@...  ##  N49°00'09" E008°23'33"
    
    Paul Norman | 22 Jan 11:19 2013

    Non-standard pgsnapshot indexes

    I've talked in other places about the non-standard indexes that I have on my
    pgsnapshot database, but I don't believe I've ever produced a full listing.
    I believe the following are all the non-standard indexes I have, with the
    size and applicable comments in []
    	
    On nodes:
    
    btree (changeset_id) [37GB, DWG stuff tends to involve a lot of changeset queries]
    
    gist (geom, tags) [153GB,
    http://lists.openstreetmap.org/pipermail/osmosis-dev/2013-January/001485.html]
    
    gin (tags) [24GB, xapi]
    
    btree (array_length(akeys(tags), 1)) WHERE array_length(akeys(tags), 1) > 10
    [92MB, for finding weirdly tagged stuff]
    
    On ways: 
    
    btree (changeset_id) [5.9GB]
    
    btree ((tags -> 'name'::text) text_pattern_ops) WHERE tags ? 'name'::text
    [1.3GB, for running tags -> 'name' LIKE queries as well as potentially
    quicker name queries]
    
    btree ((tags -> 'name_1'::text) text_pattern_ops) WHERE tags ?
    'name_1'::text [49MB]
    
    btree ((tags -> 'name_2'::text) text_pattern_ops) WHERE tags ?
    'name_2'::text [4.2MB]
    
    gin (tags) [19GB, xapi]
    
    btree (array_length(akeys(tags), 1)) WHERE array_length(akeys(tags), 1) > 10
    [274MB]
    
    On relations:
    gin (tags)
    btree (array_length(akeys(tags), 1)) WHERE array_length(akeys(tags), 1) > 10
    [1.9MB]
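    
    Spelled out as DDL, a few of those look like this (index names are just
    what I would pick, and CONCURRENTLY is advisable on a live database):
    
        CREATE INDEX CONCURRENTLY idx_nodes_changeset_id
            ON nodes USING btree (changeset_id);
    
        CREATE INDEX CONCURRENTLY idx_nodes_geom_tags
            ON nodes USING gist (geom, tags);
    
        CREATE INDEX CONCURRENTLY idx_ways_name
            ON ways USING btree ((tags -> 'name') text_pattern_ops)
            WHERE tags ? 'name';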
    
    Paul Norman | 22 Jan 10:53 2013

    pgsnapshot composite index results

    I frequently use my pgsnapshot database for unusual purposes and end up
    running non-standard queries.
    
    The standard indexes for pgsnapshot nodes include a GiST index on geom.
    Another common index suggested by the jxapi installation instructions[1] is
    a GIN index on tags.
    
    These indexes work well when you have a query that is highly selective
    spatially or against the tags but are frequently not ideal against queries
    combining a medium selective spatial condition with a medium selective tag
    condition.
    
    While working on addressmerge[2] I encountered a situation where the query
    SELECT * FROM local_all; was quicker than SELECT * FROM local_all WHERE tags
    ? 'addr:housenumber'; local_all was a view of the nodes, ways and
    multipolygons[3] in the local area. The speed difference was caused by a
    non-optimal query plan of a query of the form SELECT * FROM nodes WHERE
    st_intersects (geom,'my_geom'::geometry) AND tags ? 'addr:housenumber';
    where my_geom was the EWKT for a polygon covering the area of interest.
    
    The query plan for the first query involved an index scan of the geom gist
    index. The second involved a bitmap and of the geom gist and tags gin
    indexes. Unfortunately, due to the limitations of hstore statistics this was
    likely not the optimal plan. An exploration of options in #postgresql led
    to the discussion of a composite gist index on (geom, tags) as an
    alternative indexing strategy, which is what this message is about (after
    this rather lengthy preamble.)
    
    A composite index would be created with a statement like CREATE INDEX [
    CONCURRENTLY ] ON nodes USING gist (geom, tags); This index can benefit
    statements that are moderately selective in both geom and tags, but it is
    more important that geom be selective than that tags be selective.
    
    All tests were done with replication stopped, on my home server with a RAID10
    array of six 7200 RPM drives and 32GB RAM. Queries were repeated to ensure
    consistent caching (i.e. everything in memory). The initial runs of the queries were
    obviously substantially slower from disk, but similar behavior was observed
    there.
    
    The creation of the composite index took 24 hours, non-concurrently. I do not
    have the creation time for the non-composite index, but I would estimate it
    at 18 hours. The indexes are 153GB and 84GB respectively.
    
    With use of transactions it is possible to drop an index then ROLLBACK the
    transaction, allowing for easy testing of different combinations of queries
    and indexes.
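    
    In other words, something like the following (a sketch only; the index name
    is the one from the standard pgsnapshot schema, and note that DROP INDEX
    holds an exclusive lock on the table until the ROLLBACK, so don't do this
    while replication is running):
    
        BEGIN;
        -- hide the plain geometry index from the planner
        DROP INDEX idx_nodes_geom;
        -- run the query under test and inspect plan and timings
        EXPLAIN ANALYZE
            SELECT count(*) FROM nodes
            WHERE ST_DWithin(geom, ST_SetSRID(ST_MakePoint(-123.1, 49.25), 4326), 0.1)
              AND tags ? 'addr:housenumber';
        -- put everything back exactly as it was
        ROLLBACK;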
    
    For the following table to make sense, use a fixed-width font.
    
    With the WHERE tags ? 'addr:housenumber' restriction:
                    geom index   (geom, tags) index
    Total time:       3000ms        222ms
    Total cost:      61622         1166
    Estimated rows:     28           28
    Actual rows:     78873        78873
    
    Without the WHERE restriction:
                    geom index  (geom, tags) index
    Total time:         386ms       400ms
    Total cost:      345222      347339
    Estimated rows:   27986       27986
    Actual rows:     184644      184644
    Index scan time:     47ms        59ms
    
    The run to run variation in total speed without the tags restriction is
    greater than the difference between the indexes, but there is a noticeable difference
    in index scan time.
    
    Using a rectangle covering the southwest of BC I ran some further queries to
    investigate the index scan time. Total query time was about 9 seconds, but
    it's the index scan part we're interested in.
    
    Forcing the composite index to be used increased the scan time from 1.28s to
    1.58s, an approximately 20% increase. The rest of the query time remained
    approximately constant.
    
    Putting these results into an xapi context, the use of a composite index
    would slow down map? type queries. The index scan is a small part of the
    total response time. If most of the time is spent retrieving nodes for
    backfilling (done by ID), serializing XML or doing joins with way_nodes or
    relation_members, the time spent scanning the index is a minor issue. 
    
    As a gist composite index is substantially slower than a gin index it would
    not replace it. This would mean there would be essentially no speed change
    for *[key=value] queries without a bbox restriction.
    
    Where it would substantially speed up queries is for moderately selective
    ones, e.g. fetch all Starbucks in the bounding polygon for the US, or the
    case where I fetched all addresses in a city.
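    
    That kind of query is along the lines of (bounding box instead of a real US
    polygon, purely for illustration):
    
        SELECT id, tags
        FROM nodes
        WHERE ST_Intersects(geom, ST_MakeEnvelope(-125.0, 24.0, -66.0, 49.5, 4326))
          AND tags -> 'name' = 'Starbucks';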
    
    Something that I haven't touched on yet is updates. The speed of osmosis
    updates to pgsnapshot databases is not well explored. Toby has investigated
    slow queries that occur during diff processing[4] and the queries he
    investigated did not involve any use of geometry indexes, but the statement
    that he looked at does involve an update of the ways linestring. These
    updates would require updates to the geometry index which would presumably
    be slower with a composite index. I don't have much experience in reading
    EXPLAIN results for updates, but I think about 25% of the time is spent
    updating the row and indexes. I have no idea how much of this is spent on
    linestring index updates.
    
    What I also don't know is how much time is spent inserting nodes. I would
    expect that most changes to OSM are the creation of nodes and even if these
    queries are individually quick they may compose a significant portion of the
    overall update time by sheer volume. 
    
    Is a composite index worth it? It depends on your use case. If you are
    purely using a pgsnapshot database from osmosis which never uses the tags in
    queries then it is clearly not worth it. For xapi map? queries it is also
    not worth it. Anything involving both geographic filters and tag filters may
    benefit from it, but at the cost of potentially slower queries for purely
    spatial queries and an unknown impact on updates. There is also a disk space
    hit to consider, although an additional 70GB of indexes on a database that
    is already 750GB may not be a huge issue.
    
    Something to keep in mind for an xapi situation is that the more IO time
    spent on updates the less that can be spent on queries, balancing out the
    speed increase from faster queries. On the other hand, a 10x increase (or
    better!) on the right queries is significant.
    
    One case where it's a clear winner is where *all* queries involve both a
    spatial and tag component and there isn't a need for a separate gin index if
    the composite index is used. The separate gist geom and gin tag indexes
    could then be replaced by one gist (geom, tags) index, saving space and not
    slowing down updates with additional index updates.
    
    [1]: https://github.com/iandees/xapi-servlet
    
    [2]:
    https://github.com/pnorman/addressmerge/blob/c4a26eb6/addressmerge.py#L56
    
    [3]: Complete MP tag handling is not required for this application so
    https://github.com/pnorman/addressmerge/blob/c4a26eb6/addressmerge.py#L66 is
    sufficient
    
    [4]:
    http://lists.openstreetmap.org/pipermail/osmosis-dev/2013-January/001478.html
    
    Brian Cavagnolo | 17 Jan 22:35 2013

    cleaning up osmosis temp files?

    Is there a recommended way to clean up the temp files left behind by
    osmosis?  I've been just poking around the /tmp directory blowing away
    the copy* and nodeLocation* files.
    
    Thanks,
    Brian
    
