River Tarnell | 1 Sep 04:24 2006

PHP upgrade

PHP has been upgraded to 5.1.6. Please report any problems to the
list.

	- river.
Gregory Maxwell | 1 Sep 17:09 2006

Toolserver dewiki_p database corruption

It would appear that the dewiki replica on toolserver is experiencing
significant corruption.

SELECT count(el_to) FROM externallinks JOIN page ON el_from=page_id
WHERE page_title='Fabrixx' AND page_namespace=0;

returns a count of 1346.

I just loaded the recent dumps locally, and the same query returns 13,
which appears to be the correct result.

This is just an example; the condition occurs on many pages. Either
*links rows have not been deleted, or they have been inserted under the
wrong id, or there is some other problem that causes a relational
integrity violation with respect to the page table.

I would be surprised if this corruption was limited to externallinks.
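
A minimal sketch of how one could look for this kind of violation, assuming an in-memory SQLite database as a stand-in for the MySQL replica. Only the table and column names (externallinks, page, el_from, el_to, page_id, page_namespace, page_title) come from the query above; the function names and demo rows are hypothetical. It catches only rows whose el_from matches no page at all, plus a per-page count to spot outliers; rows attached to the wrong existing page would still need a comparison against a dump.

```python
import sqlite3


def find_orphaned_links(conn):
    """Return (el_from, row_count) for link rows that point at no existing page."""
    return conn.execute(
        """
        SELECT el_from, COUNT(*) AS n
        FROM externallinks
        LEFT JOIN page ON el_from = page_id
        WHERE page_id IS NULL
        GROUP BY el_from
        ORDER BY el_from
        """
    ).fetchall()


def link_counts(conn, limit=10):
    """Return (page_title, link_count) for main-namespace pages, largest first."""
    return conn.execute(
        """
        SELECT page_title, COUNT(el_to) AS n
        FROM externallinks
        JOIN page ON el_from = page_id
        WHERE page_namespace = 0
        GROUP BY page_id, page_title
        ORDER BY n DESC, page_title
        LIMIT ?
        """,
        (limit,),
    ).fetchall()


def build_demo_db():
    """Hypothetical demo data: two pages, plus two link rows left over for id 99."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (page_id INTEGER PRIMARY KEY,
                           page_namespace INTEGER, page_title TEXT);
        CREATE TABLE externallinks (el_from INTEGER, el_to TEXT);
        INSERT INTO page VALUES (1, 0, 'Fabrixx');
        INSERT INTO page VALUES (2, 0, 'KZ_Ladelund');
        INSERT INTO externallinks VALUES (1, 'http://example.org/a');
        INSERT INTO externallinks VALUES (1, 'http://example.org/b');
        INSERT INTO externallinks VALUES (2, 'http://example.org/c');
        INSERT INTO externallinks VALUES (99, 'http://example.org/stale1');
        INSERT INTO externallinks VALUES (99, 'http://example.org/stale2');
        """
    )
    return conn
```

Against the real replica the same two SELECT statements could be run as-is from the mysql client; the Python wrapper is only there to make the sketch self-contained.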

Leon Weber | 2 Sep 06:47 2006

Re: Toolserver dewiki_p database corruption

Gregory Maxwell schrieb:
> It would appear that the dewiki replica on toolserver is experiencing
> significant corruption.
>
> SELECT count(el_to) FROM externallinks JOIN page ON el_from=page_id
> WHERE page_title='Fabrixx' AND page_namespace=0;
>
> returns a count of 1346.
>
> I just loaded the recent dumps locally, and the same query returns 13,
> which appears to be the correct result.
>
> This is just an example; the condition occurs on many pages. Either
> *links rows have not been deleted, or they have been inserted under the
> wrong id, or there is some other problem that causes a relational
> integrity violation with respect to the page table.
>
> I would be surprised if this corruption was limited to externallinks.
The same query returns a better result to me now (yesterday it returned
Greg's result):

mysql> SELECT count(el_to) FROM dewiki_p.externallinks JOIN
dewiki_p.page ON el_from=page_id WHERE page_title='Fabrixx' AND
page_namespace=0\G
*************************** 1. row ***************************
count(el_to): 14
1 row in set (0.00 sec)

-- Leon

Gregory Maxwell | 2 Sep 06:59 2006

Re: Toolserver dewiki_p database corruption

On 9/2/06, Leon Weber <leon.weber <at> leonweber.de> wrote:
> The same query returns a better result to me now (yesterday it returned
> Greg's result):
>
> mysql> SELECT count(el_to) FROM dewiki_p.externallinks JOIN
> dewiki_p.page ON el_from=page_id WHERE page_title='Fabrixx' AND
> page_namespace=0\G
> *************************** 1. row ***************************
> count(el_to): 14
> 1 row in set (0.00 sec)

Yes, the page has been resaved since.
Other pages are just as bad.

SELECT count(el_to) FROM dewiki_p.externallinks JOIN dewiki_p.page ON
el_from=page_id WHERE page_title='KZ_Ladelund' and page_namespace=0;
+--------------+
| count(el_to) |
+--------------+
|         1201 |
+--------------+
1 row in set (0.41 sec)

Unfortunately this is just evidence that it may not be one-time corruption.

Leon Weber | 2 Sep 07:27 2006

Re: Toolserver dewiki_p database corruption

Gregory Maxwell schrieb:
> On 9/2/06, Leon Weber <leon.weber <at> leonweber.de> wrote:
>> The same query returns a better result to me now (yesterday it returned
>> Greg's result):
>>
>> mysql> SELECT count(el_to) FROM dewiki_p.externallinks JOIN
>> dewiki_p.page ON el_from=page_id WHERE page_title='Fabrixx' AND
>> page_namespace=0\G
>> *************************** 1. row ***************************
>> count(el_to): 14
>> 1 row in set (0.00 sec)
>
> Yes, the page has been resaved since.
> Other pages are just as bad.
>
> SELECT count(el_to) FROM dewiki_p.externallinks JOIN dewiki_p.page ON
> el_from=page_id WHERE page_title='KZ_Ladelund' and page_namespace=0;
> +--------------+
> | count(el_to) |
> +--------------+
> |         1201 |
> +--------------+
> 1 row in set (0.41 sec)
I've asked Tim Starling to run that query on one of the live DB servers.
It returns 1201 there. His current guess is a software bug.

--Leon

Rob Church | 2 Sep 13:19 2006

Re: Toolserver dewiki_p database corruption

On 02/09/06, Leon Weber <leon.weber <at> leonweber.de> wrote:
> I've asked Tim Starling to run that query on one of the live DB servers.
> It returns 1201 there. His current guess is a software bug.

Ask him to predict next week's lottery numbers, while he's there.

Rob Church

Edward Chernenko | 7 Sep 13:56 2006

About long query/queries

Hello all,

There is a project on the Russian Wikipedia to analyze all articles against
strict quality requirements (e.g. at least 500 characters, or at least
1500 characters for {{stub}}s, or at least 3 internal links and one
external link, or at least one section subheader, etc.). The result should
be:
  * Total number of "normal articles"
  * Lists of articles filtered by each requirement
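
A rough sketch of what the per-article checks could look like, reading the thresholds literally from the list of requirements above. Everything here is an assumption made for illustration: the function name, the result keys, and the deliberately naive regular expressions (for example, category links count as internal links, and the Russian Wikipedia may name its stub templates differently from {{stub}}).

```python
import re

# Naive wikitext patterns, good enough for a first pass only.
INTERNAL_LINK = re.compile(r"\[\[[^\[\]]+\]\]")
EXTERNAL_LINK = re.compile(r"https?://[^\s\]]+")
SECTION_HEADER = re.compile(r"^==+[^=\n]+==+\s*$", re.MULTILINE)
STUB_TEMPLATE = re.compile(r"\{\{\s*stub\s*\}\}", re.IGNORECASE)


def check_article(text):
    """Return one boolean per requirement so articles can be listed by failure."""
    is_stub = bool(STUB_TEMPLATE.search(text))
    min_length = 1500 if is_stub else 500  # thresholds as stated in the message
    return {
        "long_enough": len(text) >= min_length,
        "enough_links": (
            len(INTERNAL_LINK.findall(text)) >= 3
            and len(EXTERNAL_LINK.findall(text)) >= 1
        ),
        "has_section": bool(SECTION_HEADER.search(text)),
    }
```

Returning a dictionary rather than a single pass/fail value makes it straightforward to build both the total and the per-requirement lists from one pass over the articles.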

To do so, I should either run one long query that iterates through all
articles, or run many small queries like 'SELECT page_title, page_latest
FROM page WHERE page_title > ? ORDER BY page_title LIMIT 1' (substituting
the previously fetched page_title each time).

The problem is that the first way is much more efficient but I'm not
sure that someone will not kill this query.
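
One possible middle ground between the two approaches is to fetch the titles in batches, so that no single query stays open for long. The following is only a sketch under stated assumptions: SQLite's sqlite3 module stands in for the MySQL connection, iter_pages and batch_size are hypothetical names, and the page_namespace = 0 filter is an addition (titles are only unique within a namespace).

```python
import sqlite3


def iter_pages(conn, batch_size=500):
    """Yield (page_title, page_latest) in title order, one short query per batch."""
    last_title = ""  # assumes no page has an empty title
    while True:
        rows = conn.execute(
            "SELECT page_title, page_latest FROM page "
            "WHERE page_namespace = 0 AND page_title > ? "
            "ORDER BY page_title LIMIT ?",
            (last_title, batch_size),
        ).fetchall()
        if not rows:
            return
        yield from rows
        # Resume after the last title seen, exactly as in the LIMIT 1 variant.
        last_title = rows[-1][0]
```

Each batch is a complete, short query, so slow per-article processing between batches does not keep a result set open on the server.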

-- 
Edward Chernenko <edwardspec <at> gmail.com>

Platonides | 7 Sep 15:22 2006

Re: About long query/queries

From: "Edward Chernenko"
Sent: Thursday, September 07, 2006 1:56 PM
Subject: [Toolserver-l] About long query/queries

> Hello all,
>
> There is a project on the Russian Wikipedia to analyze all articles against
> strict quality requirements (e.g. at least 500 characters, or at least
> 1500 characters for {{stub}}s, or at least 3 internal links and one
> external link, or at least one section subheader, etc.). The result should
> be:
>  * Total number of "normal articles"
>  * Lists of articles filtered by each requirement
>
> [......]

What about doing it locally with a dump? It seems much more efficient to me.
Especially as the toolserver doesn't have direct access to the article text...

Gregory Maxwell | 7 Sep 15:54 2006

Re: About long query/queries

On 9/7/06, Edward Chernenko <edwardspec <at> gmail.com> wrote:
[snip]
> The problem is that the first way is much more efficient but I'm not
> sure that someone will not kill this query.

Are you talking about a query that will be run once, or a query that
will be executed from a CGI script?

select page_namespace, page_title from page;  on ruwiki_p takes under
a second... I wouldn't call that a long query.

Edward Chernenko | 7 Sep 18:28 2006

Re: About long query/queries

2006/9/7, Platonides <platonides <at> gmail.com>:
> What about doing it locally with a dump? It seems much more efficient to me.

Good idea, but I think the dump should be placed outside my account:
1. other users could use it for tasks that don't require
complex SQL queries; 2. I have a 256 MB disk quota, while the ruwiki dump is
about 400 MB.

2006/9/7, Gregory Maxwell <gmaxwell <at> gmail.com>
> Are you talking about a query that will be run once, or a query that
> will be executed from a CGI script?
Not from a CGI script; it will be run manually (or from cron, once per day).

> select page_namespace, page_title from page;  on ruwiki_p takes under
> a second... I wouldn't call that a long query.

Not all rows of the result are fetched right after the query is executed.
The normal 'mysql' client receives all rows, prints them and exits. My
application needs (after getting each row of the result) to:

 1. make one more SQL query to fetch the page text:
    SELECT old_text, old_flags FROM text WHERE old_id = (SELECT rev_text_id
    FROM revision WHERE rev_id = ? )
    (where '?' is page_latest from the first query)
 2. uncompress the text if there is 'gzip' in old_flags.
 3. analyze the text (that's fast, we can ignore this step).
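
Step 2 above could be sketched as follows. This rests on assumptions that should be checked against the actual data: that text stored with the 'gzip' flag is a raw deflate stream (as produced by PHP's gzdeflate(), with no gzip or zlib header), that old_flags is a comma-separated string, and that the text is UTF-8. The function name is hypothetical, and other storage flags (such as 'object' or external storage) are not handled.

```python
import zlib


def decode_old_text(old_text, old_flags):
    """Turn an (old_text, old_flags) pair into a Python string.

    old_text is the raw blob as bytes; old_flags is e.g. "utf-8,gzip".
    """
    flags = old_flags.split(",") if old_flags else []
    data = old_text
    if "gzip" in flags:
        # Negative wbits tells zlib to expect a raw deflate stream (assumption).
        data = zlib.decompress(data, -zlib.MAX_WBITS)
    return data.decode("utf-8")
```

If the blobs turn out to carry a zlib or gzip header instead, only the second argument to zlib.decompress would need to change.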

As you can see, there is a small pause between fetching successive rows
of the first query's result. If this pause is only 0.05 seconds, the first
query will finish after ~83 minutes (for the 100000 articles of ruwiki).
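
The estimate above is simply the number of articles multiplied by the pause per row. As a small worked check (the function name is hypothetical, and the per-row pause is the only cost modelled):

```python
def total_minutes(articles, pause_seconds):
    """Minutes the first query stays open if each fetched row costs pause_seconds."""
    return articles * pause_seconds / 60


# 100000 articles at 0.05 s each is 5000 s, a little over 83 minutes.
estimate = total_minutes(100000, 0.05)
```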

