Lars Magne Ingebrigtsen | 1 Dec 20:44 2002

Search engine redux redux

So, the new search engine is coming along pretty nicely.  It remains
to be seen how fast it'll be once it's given a full data feed, but it
seems pretty quick.

And the indexer and the search interface will probably be running in
the same daemon, which will take both "index" and "search" commands.
This will allow me to let the indexer/searcher use a lot of memory to
keep often-used blocks in memory to speed stuff up.  This also means
that new messages will be available for searching immediately after
they've arrived -- even if the indexer hasn't written the data out to
disk yet, they can be searched for.  So it'll be an almost real-time
indexer.
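
A minimal sketch of the single-daemon idea, assuming a line-based command
format ("index <id> <text>" and "search <terms>") and a plain in-memory
inverted index.  The class name, command syntax and `unflushed` list are
illustrative assumptions, not the actual we:search design.  The point is
only that a message becomes searchable as soon as it is indexed, before
anything is written to disk.

```python
from collections import defaultdict


class SearchDaemon:
    """Toy indexer/searcher sharing one in-memory inverted index."""

    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of message ids
        self.unflushed = []                # ids indexed but not yet written to disk

    def handle(self, line):
        """Dispatch one command line: 'index <id> <text>' or 'search <terms>'."""
        command, _, rest = line.partition(" ")
        if command == "index":
            msg_id, _, text = rest.partition(" ")
            for term in text.lower().split():
                self.postings[term].add(msg_id)
            self.unflushed.append(msg_id)   # a real daemon would flush these later
            return "ok"
        if command == "search":
            terms = rest.lower().split()
            if not terms:
                return ""
            # A message matches only if it contains every query term.
            hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
            return " ".join(sorted(hits))
        return "error: unknown command"
```

Because both commands go through the same process, a "search" issued right
after an "index" sees the new message even while its id is still sitting in
`unflushed`.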

Now we come to the most difficult part -- the name.  I first thought
about "msearch" (short for "Gmane search"), but that name was already
taken, so how about "re:search"?  But no, that's also taken (by a
publisher), so I thought I could just call it the same thing, but in
Received Pronunciation, so it's "we:search".  I think.

But I can't find anything that "we" is an acronym for, and that's a
must.  Perhaps.

-- 
(domestic pets only, the antidote for overdose, milk.)
   larsi <at> gnus.org * Lars Magne Ingebrigtsen
Steinar Bang | 1 Dec 23:13 2002

Re: Search engine redux redux

There was some talk earlier about a proposed NNTP extension for
searching (found in an expired internet draft?).  Would it be possible
to support this kind of search in INN and in Gnus?
Lars Magne Ingebrigtsen | 1 Dec 23:24 2002

Re: Search engine redux redux

Steinar Bang <sb <at> dod.no> writes:

> There was some talk earlier about a proposed NNTP extension for
> searching (found in an expired internet draft?).  Would it be possible
> to support this kind of search in INN and in Gnus?

Yup.  Those are my longer-range plans -- get the search engine up and
running, then do the INN extensions, and then implement Gnus support
for the extensions.

And since we:search is a general news search engine, hopefully other
news servers and/or agents will follow the lead.  :-)

-- 
(domestic pets only, the antidote for overdose, milk.)
   larsi <at> gnus.org * Lars Magne Ingebrigtsen
Samuel Liddicott | 2 Dec 10:47 2002

Re: Selecting X-Report-Spam renders background images


"Lars Magne Ingebrigtsen" <larsi <at> gnus.org> wrote in message
news:m37kew2edb.fsf <at> quimbies.gnus.org...
> "Samuel Liddicott" <sam <at> liddicott.com> writes:
>
> > It is highly flexible and very good in terms of speed and results and
> > flexibility of searches.
>
> How does it scale?  Has it been tested with tens of millions of
> documents?

It does scale very well if you scale the hardware too.  It generally
requires enough RAM to hold half the DB if you want to handle many
simultaneous queries well, but is good with less than this if you stripe
heavily.  At Orange we run with 8GB RAM on a 4-CPU box (4x 700 MHz) and
between half a million and one million documents.  It can go much higher,
but we only want 30 days' worth of news.  We add about 20,000 documents a day;
on a big database the add limit can come down to a max of 30,000-40,000
documents a day (depending on hardware).

It also supports dynamic aggregation of DBs, so you can perform a search
across multiple DBs and have the results combined.
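
As a rough illustration of what combining results can look like (this is
not the API of the system described here; the `aggregate` function and the
`(score, doc_id)` hit format are assumptions), each DB returns its hits
sorted by descending score and the lists are merged into one ranking:

```python
import heapq
from itertools import islice


def aggregate(result_lists, limit=10):
    """Merge per-database hit lists, each already sorted by descending score."""
    # heapq.merge expects ascending order, so sort on the negated score.
    merged = heapq.merge(*result_lists, key=lambda hit: -hit[0])
    return list(islice(merged, limit))
```

The merge is lazy, so asking for the top few hits does not require reading
every DB's full result list.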

It doesn't have the 2GB limit that some systems have and is not memory
mapped; it supports update-while-reading very efficiently.  If the
lone-update application can gather updates and commit in bulk (perhaps once
every minute, no more), then you get a very good update-while-in-service
search system, which is not common in the open source arena.
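
The gather-then-commit pattern could be sketched like this.  The
`BatchingWriter` class and its `commit` callback are hypothetical names for
illustration, and the 60-second default simply mirrors the "once every
minute" figure above; the real update application may work differently.

```python
import time


class BatchingWriter:
    """Collect updates and commit them in bulk at most once per interval."""

    def __init__(self, commit, interval=60.0, clock=time.monotonic):
        self.commit = commit        # callable taking a list of pending documents
        self.interval = interval    # minimum seconds between commits
        self.clock = clock          # injectable so the timing can be tested
        self.pending = []
        self.last_commit = clock()

    def add(self, document):
        self.pending.append(document)
        self.maybe_commit()

    def maybe_commit(self):
        now = self.clock()
        if self.pending and now - self.last_commit >= self.interval:
            self.commit(self.pending)   # one bulk write instead of many small ones
            self.pending = []
            self.last_commit = now
```

A long-running updater would also call `maybe_commit()` from a timer, so
that a quiet period does not leave documents waiting in `pending`.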

The codebase was derived from the new product of SmartLogik, a once famous

Satoshi Nagayasu | 2 Dec 21:37 2002

pgsql-hackers placement

Hi,

pgsql-hackers is now placed in gmane.comp.db.postgresql.devel.general.
But I think it should be in gmane.comp.db.postgresql.hackers.

Why is pgsql-hackers in g.c.d.postgresql.devel.general?
I don't think the name 'devel.general' matches 'pgsql-hackers'.

The pgsql-* lists already include a pgsql-general mailing list,
so I hope that pgsql-hackers will be in g.c.d.postgresql.hackers.

Regards,

-- 
NAGAYASU Satoshi <snaga <at> snaga.org>
Lars Magne Ingebrigtsen | 3 Dec 20:13 2002

Re: Selecting X-Report-Spam renders background images

"Samuel Liddicott" <sam <at> liddicott.com> writes:

> It does scale very well if you scale the hardware too.  It generally
> requires enough RAM to hold half the DB if you want to handle many
> simultaneous queries well, but is good with less than this if you stripe
> heavily.  At Orange we run with 8GB RAM on a 4 CPU  box (4x 700 Mhz) and
> between half a million and one million documents but it can go much higher -
> but we only want 30 days worth of news. We add about 20,000 documents a day,
> on a big database the add limit can come down to a max of 30,000-40,000
> documents a day (depending on hardware).

That definitely sounds interesting.  I've got a 2GB 2xMP1900+ for the
search engine machine, so that sounds roughly similar, actually.
(Except for having a quarter of the RAM.  :-)

If the search engine I'm working on doesn't work out, this definitely
sounds like a candidate.

-- 
(domestic pets only, the antidote for overdose, milk.)
   larsi <at> gnus.org * Lars Magne Ingebrigtsen
Lars Magne Ingebrigtsen | 3 Dec 20:15 2002

Re: pgsql-hackers placement

Satoshi Nagayasu <snaga <at> snaga.org> writes:

> pgsql-hackers is now placed gmane.comp.db.postgresql.devel.general.
> But I think it should be in gmane.comp.db.postgresql.hackers.

Anything sounding vaguely like a list for developers is called "devel"
on Gmane.  "dev", "devel", "developers", "hackers" and "workers" => "devel".
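
Expressed as code, the rule is just a synonym table.  This sketch covers
only the mapping listed above; the function name is made up for
illustration, and how the rest of a group name is assembled is not shown.

```python
DEVEL_SYNONYMS = {"dev", "devel", "developers", "hackers", "workers"}


def gmane_component(list_suffix):
    """Map a list-name suffix to the Gmane group component used for it."""
    suffix = list_suffix.lower()
    return "devel" if suffix in DEVEL_SYNONYMS else suffix
```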

-- 
(domestic pets only, the antidote for overdose, milk.)
   larsi <at> gnus.org * Lars Magne Ingebrigtsen
Samuel Liddicott | 3 Dec 21:47 2002

Re: Selecting X-Report-Spam renders background images


"Lars Magne Ingebrigtsen" <larsi <at> gnus.org> wrote in message
news:m3isyasyom.fsf <at> quimbies.gnus.org...
> "Samuel Liddicott" <sam <at> liddicott.com> writes:
>
> > It does scale very well if you scale the hardware too.  It generally
> > requires enough RAM to hold half the DB if you want to handle many
> > simultaneous queries well, but is good with less than this if you stripe
> > heavily.  At Orange we run with 8GB RAM on a 4 CPU  box (4x 700 Mhz) and
> > between half a million and one million documents but it can go much higher -
> > but we only want 30 days worth of news. We add about 20,000 documents a day,
> > on a big database the add limit can come down to a max of 30,000-40,000
> > documents a day (depending on hardware).
>
> That definitely sounds interesting.  I've got a 2GB 2xMP1900+ for the
> search engine machine, so that sounds roughly similar, actually.
> (Except for having a quarter of the RAM.  :-)

Yeah, well the 8GB turned out to be overkill for us as the DB size was cut
since, so we end up buffering the whole DB and never use swap!!
But we also use it as a power-postgres box for dynamic TV listings at
http://www.ananova.com/tv_listings/_tv_full_listings.html?menu=tv.personalisedlistings
It also runs our web spider fetching back 20,000 stories per day, so I
reckon 2GB will do you fine.

> If the search engine I'm working on doesn't work out, this definitely
> sounds like a candidate.

Jon Ericson | 3 Dec 23:22 2002

Approval interface

Hi,

I know this has been discussed in the past, but I wanted to throw
in my two cents on the spam approval process.

1) I'd rather have the choice between SPAM and NOT SPAM, rather
   than Approve and Reject.  I have to go through an extra layer of
   context to work out what I should pick.  Or to put it another way:
   when I first look at the article I try to figure out if the article
   is spam or not, and then I have to decide whether to approve it or
   not.  After a few articles, my mind is trained to do the right
   thing, but it is an annoyance.

   In any case, I had to search through the gmane.discuss archives to
   find out if I was rejecting the article or rejecting the decision
   to call the article spam.  This should be clear from the interface
   itself.

2) It'd be nice to have a progress indicator.  I noticed that
   articles are presented in date order, so it is possible to guess
   how far you've gone.  On the other hand, it would be nice to see
   "54 of 73 spam squashed!" or something at the top of the screen.
   Not only would it help volunteers allocate their time, it would
   give us positive feedback and a sense of accomplishment.

3) The group name should stand out more.  An article in Spanish in
   an English group is more likely to be spam than in a Spanish group.
   Sometimes job offers aren't spam, but it depends on the group.
   Context is important.


Jon Ericson | 4 Dec 02:16 2002

Re: Approval interface

Jon Ericson <Jon.Ericson <at> jpl.nasa.gov> writes:
<snip other issues>

By coincidence, I ran across another issue.  There was an article
improperly marked as spam in one of the groups I read
(gmane.games.advanced-squad-leader:10027), so I followed the link in
the X-Report-Unspam header.  The webpage told me that I had just
reported the article as Spam.  I clicked the link to change my mind
and it still said I'd reported it as Spam.

Later I decided to look at the approval page and there was the article
marked as spam.  I can't figure out if I've misunderstood what was
supposed to happen or if this is a bug.

Jon
-- 
If a man has recently married, he must not be sent to war or have any
other duty laid on him. For one year he is to be free to stay at home
and bring happiness to the wife he has married.
-- Deuteronomy 24:5 (NIV)
