Olly Betts | 1 Nov 03:29 2010

Re: No new articles indexed for gmane.emacs.gnus.general?

On 2010-10-22, Olly Betts <olly <at> survex.com> wrote:
> After mysteriously getting stuck in an apparently infinite loop which didn't
> recur when I retried, rain should now be updating daily once again.

Sigh, my "delete the old database" code was too cautious, and refused to
delete anything unless it was on the expected partition, so it hit disk
full again, and then when I restarted, it got stuck on the truncated
file of articles to add.

But it's now up-to-date again, and should hopefully continue to update
daily.

Cheers,
    Olly
Lars Magne Ingebrigtsen | 2 Nov 19:06 2010

Reticule redux

Last night while trying to fall asleep (very unsuccessfully), I started
thinking about what it would take to move Gmane to a storage model that
satisfies our needs on a long-term basis.  inn works fine for 99% of
what Gmane does, but there's that niggling last percent that's virtually
impossible to do with the inn model.

So here's an article to just sum up my current ideas, so that I have
them somewhere.  I won't have time to do anything about this until next
year at the earliest, and by then I'll have forgotten everything I was
thinking about while dozing off yesterday.  And we all know that the
thoughts we have while half-asleep are the best, right?

The main problems today are with:

* crossposting, post-hoc and otherwise
* "Replaces:", i.e., Gwene articles that update
* group renames
* many-small-files-in-a-big-directory scalability
* replication, backup and redundant setups
* recovery from inadvertent file deletion
* renumbering articles after importing

And these are my architectural ideas for dealing with these issues:

* all groups are identified by a token -- for instance, a number

* storage is based on these tokens

* still one file per article, because that's very convenient for so many
  things
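
A minimal sketch of how the token idea above might look in practice, not the actual Gmane design. The name-to-token table, the token value 1017, the `fanout` sharding scheme, and the function names are all assumptions made for illustration. It shows that when storage is keyed by token, a group rename only touches the lookup table, and that sharding by article number keeps any single directory small.

```python
from pathlib import Path

# Hypothetical name -> token table; the token value is made up.
GROUP_TOKENS = {"gmane.emacs.gnus.general": 1017}


def article_path(root: Path, token: int, article: int, fanout: int = 1000) -> Path:
    """Return the path of one article file under a token-based layout.

    Articles are sharded into subdirectories of `fanout` entries each
    (an assumption, not something stated in the post) so that no single
    directory holds millions of files.
    """
    shard = article // fanout
    return root / str(token) / f"{shard:06d}" / str(article)


def rename_group(old: str, new: str) -> None:
    """Rename a group by updating the lookup table only; no files move."""
    GROUP_TOKENS[new] = GROUP_TOKENS.pop(old)
```

The article files stay where they are across a rename because their paths never contain the group name.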

Ted Zlatanov | 2 Nov 21:35 2010

Re: Reticule redux

On Tue, 02 Nov 2010 19:06:23 +0100 Lars Magne Ingebrigtsen <larsi <at> gnus.org> wrote: 

LMI> The main problems today are with:

LMI> * crossposting, post-hoc and otherwise
LMI> * "Replaces:", i.e., Gwene articles that update
LMI> * group renames
LMI> * many-small-files-in-a-big-directory scalability
LMI> * replication, backup and redundant setups
LMI> * recovery from inadvertent file deletion
LMI> * renumbering articles after importing

Consider using SQLite instead of a directory structure (one DB file per
group).  It would make storage much easier and allow you to define
pretty much any metadata per group and per article.  But you lose the
one-file-per-article mapping.  Still, from a system management
viewpoint, it's much cleaner than millions of small files all over the
filesystem.

SQLite is significantly faster than a traditional RDBMS and supports
multithreaded access.  The command-line tools are pretty nice too.
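
A rough sketch of what one database file per group could look like, using Python's standard `sqlite3` module. The table layout, the column names (including `replaces`), and the helper functions are assumptions for illustration only; nothing in the thread specifies a schema.

```python
import sqlite3


def open_group_db(path: str) -> sqlite3.Connection:
    """Open (or create) the database file for a single group."""
    conn = sqlite3.connect(path)
    # Hypothetical schema: one row per article, with room for metadata.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        " number INTEGER PRIMARY KEY,"
        " message_id TEXT UNIQUE NOT NULL,"
        " replaces TEXT,"
        " body BLOB NOT NULL)"
    )
    return conn


def store_article(conn, number, message_id, body, replaces=None):
    """Insert one article; the `with` block commits or rolls back."""
    with conn:
        conn.execute(
            "INSERT INTO articles (number, message_id, replaces, body)"
            " VALUES (?, ?, ?, ?)",
            (number, message_id, replaces, body),
        )


def fetch_article(conn, number):
    """Return the raw article bytes, or None if the number is unknown."""
    row = conn.execute(
        "SELECT body FROM articles WHERE number = ?", (number,)
    ).fetchone()
    return None if row is None else row[0]
```

Reading an article then requires going through SQLite rather than opening a file directly, which is the loss of the one-file-per-article mapping noted above.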

Ted
Lars Magne Ingebrigtsen | 2 Nov 21:39 2010

Re: Reticule redux

Ted Zlatanov <tzz <at> lifelogs.com> writes:

> Consider using SQLite instead of a directory structure (one DB file per
> group).

Nope.  Having the files right there is so useful and convenient for
feeding to search engines, accessing from PHP, accessing from C, etc,
etc. 

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi <at> gnus.org * Lars Magne Ingebrigtsen
Duncan | 3 Nov 00:40 2010

Re: Reticule redux

Lars Magne Ingebrigtsen posted on Tue, 02 Nov 2010 21:39:12 +0100 as
excerpted:

> Ted Zlatanov <tzz <at> lifelogs.com> writes:
> 
>> Consider using SQLite instead of a directory structure (one DB file per
>> group).
> 
> Nope.  Having the files right there is so useful and convenient for
> feeding to search engines, accessing from PHP, accessing from C, etc,
> etc.

Exactly.

There's a reason *ix strongly prefers plain text (tho possibly structured 
as XML in particular cases, but that too has strong resistance) config and 
text-data storage, to some binary format like the MS Windows Registry, or 
various databases.  Text files are simply easier to work with and easier 
to recover in case of disaster, because if it comes to it, they're human 
readable and editable with simple text management tools (text editors, 
grep, sed, etc).  For *ix admins, that tends to trump all the supposed 
better efficiencies of whatever human unreadable database or other format 
you may propose.  Let the filesystems do what they do best, and the text 
files do what they do best, and don't mess with a solution with a half 
century of demonstrated robustness and scalability behind it. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

Ted Zlatanov | 3 Nov 12:09 2010

Re: Reticule redux

On Tue, 2 Nov 2010 23:40:23 +0000 (UTC) Duncan <1i5t5.duncan <at> cox.net> wrote: 

D> There's a reason *ix strongly prefers plain text (tho possibly structured 
D> as XML in particular cases, but that too has strong resistance) config and 
D> text-data storage, to some binary format like the MS Windows Registry, or 
D> various databases.

(This feels like a lecture, so I'll respond.  Lars' point about
convenience is valid and I won't argue with that, but you're preaching
from a tilted soapbox IMO.)

Configuration, yes.  General text data storage, no.  When efficiency
matters, Unix uses whatever works.  There are quite a few solutions
built on BerkeleyDB and all the *DB* file formats, MySQL, SQLite,
PostgreSQL, etc. that would have been much harder with a file spool.

D> Text files are simply easier to work with and easier to recover in
D> case of disaster, because if it comes to it, they're human readable
D> and editable with simple text management tools (text editors, grep,
D> sed, etc).

Yes, but they are terribly inefficient in other ways.  For instance,
it's hard to store metadata about a file.  Or a checksum.  Or compress
it.  Or encrypt it.  Or insert text in the middle of a file without
rewriting the whole thing.  So they are simple at first, but as features
are needed they tend to either hold development back or become the
foundation for arguments why features should be rejected.

D> For *ix admins, that tends to trump all the supposed better
D> efficiencies of whatever human unreadable database or other format

Adam Sjøgren | 3 Nov 17:48 2010

Re: Reticule redux

On Wed, 03 Nov 2010 06:09:49 -0500, Ted wrote:

> Also, filesystems are not that good with millions of files.

Backup/restore also quite quickly becomes a nightmare when you have many
millions of small files.

  Best regards,

    Adam

-- 
 "Gravity is arbitrary!"                                      Adam Sjøgren
                                                         asjo <at> koldfront.dk
Lars Magne Ingebrigtsen | 3 Nov 20:27 2010

Re: Reticule redux

asjo <at> koldfront.dk (Adam Sjøgren) writes:

> Backup/restore also quite quickly becomes a nightmare when you have many
> millions of small files.

Backup/restore is also a nightmare if you don't have millions of small
files.  :-)  Doing a backup is pretty trivial when you have small files,
because you just copy over the new ones.  When you have ever-growing
huge files, doing incremental backup is a real pain.

Restoring is as simple -- copy them over the other way.
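
A minimal sketch of the "just copy over the new ones" approach, assuming articles are never modified once written (which would not hold for "Replaces:" updates). The function name and directory layout are hypothetical.

```python
import shutil
from pathlib import Path


def backup_new_files(src: Path, dst: Path) -> int:
    """Copy every file under src that does not yet exist under dst.

    Returns the number of files copied. Existing files are skipped
    without comparing contents, on the assumption that an article file
    never changes after it has been written.
    """
    copied = 0
    for path in src.rglob("*"):
        if not path.is_file():
            continue
        target = dst / path.relative_to(src)
        if target.exists():
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # copy2 also preserves timestamps
        copied += 1
    return copied
```

Restoring is the same call with the arguments swapped. Note that each run still walks the whole tree to find the new files.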

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi <at> gnus.org * Lars Magne Ingebrigtsen
Adam Sjøgren | 4 Nov 18:08 2010

Re: Reticule redux

On Wed, 03 Nov 2010 20:27:03 +0100, Lars wrote:

> asjo <at> koldfront.dk (Adam Sjøgren) writes:

>> Backup/restore also quite quickly becomes a nightmare when you have many
>> millions of small files.

> Backup/restore is also a nightmare if you don't have millions of small
> files.  :-)

Yeah, but if you have solved that nightmare (somebody else bought and
set up the tape robot, somebody else configured something like Bacula)
it works quite okay. Until you have millions of small files.

Don't get me wrong: I think the file-per-article setup is the preferable
solution for Gmane, but as Duncan was lecturing Ted on the merits of
filesystems over databases in general, I thought I'd inject some input
about one general painpoint that isn't necessarily obvious until you
actually have many millions of small files you need to do backup of.

> Doing a backup is pretty trivial when you have small files, because
> you just copy over the new ones. When you have ever-growing huge
> files, doing incremental backup is a real pain.

What I have seen is that there is a trade-off. If you have millions of
small, say, cache-files, merely traversing the filesystem looking for
changes can kill your backup schedule, where putting the same
information in a database, say, PostgreSQL, will - from a
backup/filesystem point of view - batch them up in GB-sized chunks,
which mostly won't change when the database just grows (nice for

Lars Magne Ingebrigtsen | 4 Nov 18:16 2010

Re: Reticule redux

asjo <at> koldfront.dk (Adam Sjøgren) writes:

> Again, my point was more in regards to the general "Unix has solved all
> problems in the filesystem" story, and not for the specific Gmane case.

Right.  Yeah, all file systems suck.  They're adding stuff like fanotify
which might help with doing incremental backups, but it's still early
days...  and this is 2010!  Where's my fast file system!  Where's my
hovercraft!  Where's my teleporter!

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi <at> gnus.org * Lars Magne Ingebrigtsen
