My issue touches several topics, but the main question is about performance. I think you need a little background to be able to help me, so I will first define the problem and my solution to it, and put my questions for you at the end of this message.
I have a table of observations of objects on the sky. The most important columns are the coordinates (x, y); all other columns are just additional information about the observation. The problem is that when I take an image of the same object on the sky twice, the coordinates x, y won't be identical, only close to each other. My task is to generate a common identifier for all observations of the same object and assign the observations to it (an N:1 relation). The criterion is that all observations within 1 arcsec of an identifier are considered the same object. I keep the identifiers in a separate table (objcat) and have a foreign key to it in the observations table.
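For concreteness, the 1 arcsec criterion is an angular separation on the sphere, not a flat Euclidean distance. A minimal Python sketch of the check (the haversine formula; coordinates assumed to be in degrees, names are illustrative only):

```python
import math

def ang_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation between two sky positions.
    Inputs in degrees, result in arcseconds (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a))) * 3600

def same_object(ra1, dec1, ra2, dec2, radius_arcsec=1.0):
    """True if the second position lies within radius_arcsec of the first."""
    return ang_sep_arcsec(ra1, dec1, ra2, dec2) <= radius_arcsec
```

This is only the matching predicate; in the database the same test is done by Q3C, which can also use an index instead of computing the separation row by row.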
The reason I am tackling performance here is that the observations table currently has about 3e8 rows after 1.5 years of gathering data, and the row count grows linearly.
I'm trying to keep the whole algorithm in the DB if possible, because I have a good PostgreSQL extension, Q3C, for indexing the coordinates of objects on the sky (https://code.google.com/p/q3c/source/browse/README.q3c). It also provides quite a few stored procedures for finding near neighbours in a table, which is what I'm doing. The function q3c_join(ra1, dec1, ra2, dec2, radius) returns true if the second object is within radius of the first. Internally it expands into a set of range conditions (<=, >=) on the indexed sky position, so nearest-neighbour lookups become plain index (or bitmap index) scans.
After a lot of experimentation with clustering the objects and trying to process them all together in one "silver-bullet" SQL query, I decided on a simpler approach. The main problem with the process-everything-at-once approach is that finding neighbours for each observation is by definition quadratic, and for 3e8 rows it simply runs out of disk space (~TBs for the temporary results).
The simplest approach I could think of is to process each of the 3e8 rows sequentially and ask:
Is there already an identifier within a radius of 1 arcsec?
No: generate one and assign the observation to it.
Yes: update it and assign the observation to it. The update is a weighted average: I keep a count of how many observations the identifier has been computed from. As a result, the identifier carries the average coordinates of all the observations it identifies; it is their center.
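The per-row logic above can be sketched in a few lines of in-memory Python. This is only a toy: a linear scan stands in for the Q3C index lookup, and flat Euclidean distance stands in for proper angular separation, just to keep it short:

```python
def build_catalog(observations, radius):
    """Sequentially assign each observation (x, y) to an existing
    identifier within `radius`, maintaining a running centroid.
    Returns (catalog, assignments); catalog rows are [x, y, weight]."""
    catalog = []      # stand-in for the objcat table
    assignments = []  # one catalog index per observation
    for x, y in observations:
        match = None
        for i, (cx, cy, w) in enumerate(catalog):
            # toy distance check; the real thing is q3c_join on an index
            if ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5 <= radius:
                match = i
                break
        if match is None:
            catalog.append([x, y, 1])              # "No: generate one"
            match = len(catalog) - 1
        else:
            cx, cy, w = catalog[match]             # "Yes: weighted average"
            catalog[match] = [(cx * w + x) / (w + 1),
                              (cy * w + y) / (w + 1),
                              w + 1]
        assignments.append(match)
    return catalog, assignments
```

Note the essential property (and the performance trap): every iteration reads the catalog that previous iterations just wrote, so the rows cannot be processed independently.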
So here is my procedure. It has three parameters. The first two give the range of oids to process, used for scaling and parallelizing the algorithm. The third is the radius within which to search for neighbours.
DROP TYPE IF EXISTS coords;
CREATE TYPE coords AS (
    raj2000 double precision,
    dej2000 double precision
);

DROP FUNCTION IF EXISTS build_catalog(int, int, double precision);
CREATE OR REPLACE FUNCTION build_catalog(fromOID int, toOID int, radius double precision)
RETURNS VOID AS $$
DECLARE
    cur1 CURSOR FOR
        SELECT obs.raj2000, obs.dej2000
        FROM \schema.observation AS obs
        WHERE obs.oid >= fromOID
          AND obs.oid < toOID
        FOR UPDATE;  -- needed for WHERE CURRENT OF below
    curr_raj2000 double precision;
    curr_dej2000 double precision;
    cnt integer := 0;
BEGIN
    /*SELECT current_setting('transaction_isolation') INTO tmp;
    RAISE NOTICE 'Isolation level %', tmp;*/
    LOCK TABLE \schema.objcat IN SHARE ROW EXCLUSIVE MODE;
    OPEN cur1;
    LOOP
        FETCH cur1 INTO curr_raj2000, curr_dej2000;
        EXIT WHEN NOT found;

        -- Upsert: move the matching identifier to the weighted average
        -- of its observations, or create a new identifier if none is in range.
        WITH upsert AS (
            UPDATE \schema.objcat
            SET raj2000 = (raj2000 * weight + curr_raj2000) / (weight + 1),
                dej2000 = (dej2000 * weight + curr_dej2000) / (weight + 1),
                weight  = weight + 1
            WHERE q3c_join(curr_raj2000, curr_dej2000, raj2000, dej2000, radius)
            RETURNING *
        )
        INSERT INTO \schema.objcat
            (ipix_cat, raj2000, dej2000, weight)
        SELECT q3c_ang2ipix(curr_raj2000, curr_dej2000),
               curr_raj2000, curr_dej2000, 1
        WHERE NOT EXISTS
            (SELECT * FROM upsert);

        -- Point the current observation at the identifier it belongs to.
        UPDATE \schema.observation
        SET id_cat = (SELECT DISTINCT id_cat
                      FROM \schema.objcat
                      WHERE q3c_join(curr_raj2000, curr_dej2000,
                                     raj2000, dej2000, radius)
                        AND id_cat IS NOT NULL)
        WHERE CURRENT OF cur1;

        cnt := cnt + 1;
        IF ((cnt % 100000) = 0) THEN
            RAISE NOTICE 'Processed % entries', cnt;
        END IF;
    END LOOP;
    CLOSE cur1;
END;
$$ LANGUAGE plpgsql;
Results: when I run the procedure with a single client thread, it processes roughly 1 million rows per hour, which means days for the whole dataset. When I run it in parallel with the lock in place for pessimistic synchronization, it effectively runs sequentially anyway, with the other threads just waiting. When I remove the lock and plan to resolve the resulting conflicts later, the SSD serves up to 4 threads reasonably well, which would divide my days of runtime by four. Still unacceptable.
The root cause is quite clear: in one iteration of the loop I write something to a table, and in the following iteration I need to read that information back.
Questions for you:
1. Can you think of a better way to do this, and is SQL even capable of such a thing, or do I have to do it in C? Would rewriting the SQL function in C help?
2. Could I somehow bend the commit behaviour during the algorithm to my needs? For example, ensure that within one run the relevant part of the identifiers table is kept in memory for faster lookups?
3. Why is this so slow? :-) It is comparable in speed to the quadratic algorithm; it just doesn't use any memory.
I tried to sum this up as best I could; for more information, please don't hesitate to contact me.
Thank you very much for even reading this far.