Chrisil J. Arackaparambil | 1 Jul 18:20 2010

Re: Number of pages on Wikipedia

Thanks everybody!  I just got the figure for the number of redirects as
4.5 million:
~/7zip/p7zip_9.13/bin/7z -so e enwiki-20100130-pages-meta-history.xml.7z 
2>/dev/null | perl -ne 'print if m{<redirect />}' | wc -l
4493204

Chrisil

Greg Hewgill wrote:
> On Mon, Jun 28, 2010 at 06:06:07PM -0600, Chrisil J. Arackaparambil wrote:
>> enwiki-20100130-pages-meta-history.xml.7z.  What I found to my surprise
>> is that there are (at least) 7 million pages in the main namespace.  I
>> got this figure by grepping for page titles that do not contain a ":"
>> character.  Is this really the case or am I missing something?
> 
> Your page count likely includes redirect pages. Normally article counts
> exclude redirects.
> 
> Greg Hewgill
> http://hewgill.com
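
The two counts discussed in this thread (titles without a ":" and pages carrying a `<redirect />` marker) can be combined in one streaming pass. The following is a minimal sketch, not the method used above: the function name `count_main_pages` is hypothetical, and it assumes that each `<title>` element and each page-level `<redirect />` marker sits on its own line of the decompressed XML. The ":" test is only a heuristic, since article titles such as "Brothers in Arms: Road to Hill 30" contain a colon too.

```python
import re

# Matches the page title element as it appears on a single line of the dump.
TITLE_RE = re.compile(r"<title>(.*?)</title>")


def count_main_pages(lines):
    """Return (pages, redirects) for pages whose title contains no ':'.

    `lines` is any iterable of text lines from the decompressed dump.
    """
    pages = 0
    redirects = 0
    in_main = False  # True while inside a page whose title has no ':'
    for line in lines:
        match = TITLE_RE.search(line)
        if match:
            in_main = ":" not in match.group(1)
            if in_main:
                pages += 1
        elif in_main and "<redirect />" in line:
            redirects += 1
            in_main = False  # count at most one redirect marker per page
    return pages, redirects
```

To use it on a real dump, the 7z output from the pipeline above could be piped into a small script that calls `count_main_pages(sys.stdin)` and prints the two numbers and their difference.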

Chrisil J. Arackaparambil | 3 Jul 01:54 2010

Order of pages/revisions

Hello folks,

I had some questions about the order of pages and revisions in the dump.
As I understand, the order is according to the respective IDs.  But
where do these IDs come from?  Are they the keys of the corresponding
table in the database?  So then they are more or less in order of
creation?  If that's the case, why does the dump begin with pages with
titles mostly beginning with "A"?

Thank you,
Chrisil

Greg Hewgill | 3 Jul 02:06 2010

Re: Order of pages/revisions

On Fri, Jul 02, 2010 at 05:54:35PM -0600, Chrisil J. Arackaparambil wrote:
> table in the database?  So then they are more or less in order of
> creation?  If that's the case, why does the dump begin with pages with
> titles mostly beginning with "A"?

As far as I can tell, the Mediawiki database was preloaded with an
alphabetical list of articles from some previous system. You will find
that there is some number of articles in roughly alphabetical order at
the start, followed by arbitrarily ordered articles. There are some
out-of-order articles in the first mostly-alphabetical section due to
article renames (which do not change their unique id).

Greg Hewgill
http://hewgill.com
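
One way to see where the mostly-alphabetical section ends is to measure how sorted the titles are in successive windows of the dump. This is only an illustrative sketch: the function names and the window size are hypothetical, and it assumes the titles have already been extracted in dump order.

```python
def sorted_fraction(titles):
    """Fraction of adjacent title pairs that are in non-decreasing order."""
    pairs = list(zip(titles, titles[1:]))
    if not pairs:
        return 1.0
    return sum(a <= b for a, b in pairs) / len(pairs)


def windowed_sortedness(titles, window=1000):
    """Compute sorted_fraction for consecutive windows of `window` titles."""
    return [
        sorted_fraction(titles[i:i + window])
        for i in range(0, len(titles), window)
    ]
```

Windows taken from an alphabetical section should score close to 1.0, while arbitrarily ordered titles should score around 0.5.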

Andreas Meier | 6 Jul 22:56 2010

elwiki and simplewiki stopped

  Hello,

look at http://download.wikipedia.org/simplewiki/20100705/ and 
http://download.wikipedia.org/elwiki/20100705/

Best regards

Andreas

Ariel T. Glenn | 7 Jul 23:26 2010

Re: elwiki and simplewiki stopped

These jobs are hung because my code changes were overwritten on the host
running them.  I suspect this has to do with the outage and the recovery
on the 4th-5th.  

First I need to see which recent jobs may have completed incorrectly so
that they can be cleared out of the way.  After that I'll be able to get
these jobs restarted.  There will be a bit of a delay as I am at a
conference (and facilitating a couple of sessions) with intermittent
internet access.

Ariel

On 06-07-2010, Tue, at 22:56 +0200, Andreas Meier
wrote:
> Hello,
> 
> look at http://download.wikipedia.org/simplewiki/20100705/ and 
> http://download.wikipedia.org/elwiki/20100705/
> 
> Best regards
> 
> Andreas
> 
> _______________________________________________
> Xmldatadumps-l mailing list
> Xmldatadumps-l@...
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l

Nicolas Vervelle | 15 Jul 13:36 2010

Database dumps are stopped ?

Hi,
 
The database dump progress page (http://dumps.wikimedia.org/backup-index.html) seems to indicate that no dump has been made for more than a week for any Wikipedia.
 
The first line is about the enwiki dump which is still in progress and seems to be updated.
But all the other lines are dated back to 2010-07-06 or older.
 
Nico

Ariel T. Glenn | 15 Jul 14:55 2010

Re: Database dumps are stopped ?

Yes.  I thought I had answered a similar enquiry to this list earlier.
Due to a regression in deployed code on the host, the dumps are stuck.
I was on the road and did not have a chunk of time to clear out the
dumps that have bad data.  I expect to get to that today actually, but
I'll send an update when it's fixed.

Ariel

On 15-07-2010, Thu, at 13:36 +0200, Nicolas Vervelle
wrote:
> Hi,
>  
> The database dump progress page
> (http://dumps.wikimedia.org/backup-index.html) seems to indicate that
> no dump has been made for more than a week for any Wikipedia.
>  
> The first line is about the enwiki dump which is still in progress and
> seems to be updated.
> But all the other lines are dated back to 2010-07-06 or older.
>  
> Nico

Federico Leva (Nemo) | 15 Jul 14:57 2010

Re: Database dumps are stopped ?

Nicolas Vervelle, 15/07/2010 13:36:
> The first line is about the enwiki dump which is still in progress and 
> seems to be updated.
> But all the other lines are dated back to 2010-07-06 or older.

According to Ganglia (if I am reading it correctly), storage2 and storage3
are down, while dataset1 is working (a little bit):
http://ganglia.wikimedia.org/?r=month&c=Miscellaneous&h=dataset1.wikimedia.org
http://ganglia.wikimedia.org/?c=Miscellaneous&h=storage2.wikimedia.org&m=load_one&r=hour&s=descending&hc=3&mc=3
http://ganglia.wikimedia.org/?c=Miscellaneous&h=storage3.wikimedia.org&m=load_one&r=hour&s=descending&hc=3&mc=3

Nemo

Dmitry Chichkov | 21 Jul 00:30 2010

enwiki dump progress on 20100622 - failed again

Subj: http://download.wikimedia.org/enwiki/20100622/

Is there anything that can be done to alleviate that problem?

By the way, what's the point of producing .bz2 version of the pages-meta-history.xml dump? Is it easier on the system to produce .bz2 first and .7z after that? From the user's perspective I can tell that .7z is all I need, there is simply no point in working with .bz2 (if .7z is available).

-- Regards, Dmitry

Jamie Morken | 21 Jul 13:03 2010

Re: enwiki dump progress on 20100622 - failed again


Hi,

I was polling the http://download.wikimedia.org/enwiki/20100622/ page during the pages-meta-history.xml.bz2 database dump, and below is some timestamped output from that page showing the errors that caused the dump to fail.  Regarding the .bz2 dump format, Tomasz earlier suggested removing it and using .7z.  I thought it might be good to keep the .bz2 format because several programs use it (e.g. WikiTaxi, BzReader).  The 7z format is probably the way to go in the future, but I don't know whether switching would fix the database dump errors.

cheers,
Jamie


-----------------------------------------------

20100719 2:22:14am

# 2010-07-02 14:33:44  in-progress All pages with complete page edit history (.bz2)
2010-07-19 09:22:11: enwiki 889057 pages (0.613/sec), 110108000 revs (75.931/sec), 83.6% prefetched, ETA 2010-08-28 05:12:01 [max 371385750]

    * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
    * pages-meta-history.xml.bz2 119.7 GB (written)



-----------------------------------------------

20100719 3:07:16am PST

# 2010-07-02 14:33:44  in-progress All pages with complete page edit history (.bz2)
2010-07-19 10:07:15: enwiki 894194 pages (0.615/sec), 110399000 revs (75.990/sec), 83.6% prefetched, ETA 2010-08-28 04:08:46 [max 371385750]

    * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
    * pages-meta-history.xml.bz2 119.9 GB (written)

-----------------------------------------------

20100719 3:22:17am PST

# 2010-07-02 14:33:44  in-progress All pages with complete page edit history (.bz2)
Error 2 of allowed 5 retrieving revision text for text id 10595737! Pausing 5 seconds before retry...

    * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
    * pages-meta-history.xml.bz2 119.9 GB (written)

-----------------------------------------------

20100719 3:37:18am PST

# 2010-07-02 14:33:44  in-progress All pages with complete page edit history (.bz2)
Error 3 of allowed 5 retrieving revision text for text id 13930238! Pausing 5 seconds before retry...

    * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
    * pages-meta-history.xml.bz2 119.9 GB (written)


-----------------------------------------------

20100719 3:52:19am PST

# 2010-07-02 14:33:44  in-progress All pages with complete page edit history (.bz2)
Error 4 of allowed 5 retrieving revision text for text id 355313550! Pausing 5 seconds before retry...

    * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
    * pages-meta-history.xml.bz2 119.9 GB (written)

-----------------------------------------------

20100719 4:07:20am PST

# 2010-07-02 14:33:44  in-progress All pages with complete page edit history (.bz2)
Error 3 of allowed 5 retrieving revision text for text id 346806445! Pausing 5 seconds before retry...

    * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
    * pages-meta-history.xml.bz2 119.9 GB (written)

-----------------------------------------------

20100719 4:22:21am PST

# 2010-07-02 14:33:44  in-progress All pages with complete page edit history (.bz2)
Error 4 of allowed 5 retrieving revision text for text id 351921561! Pausing 5 seconds before retry...

    * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
    * pages-meta-history.xml.bz2 119.9 GB (written)

-----------------------------------------------

20100719 4:37:21am PST

# 2010-07-02 14:33:44  in-progress All pages with complete page edit history (.bz2)
Error 5 of allowed 5 retrieving revision text for text id 358280940! Pausing 5 seconds before retry...

    * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
    * pages-meta-history.xml.bz2 119.9 GB (written)

-----------------------------------------------

20100719 4:52:24am PST

# 2010-07-19 11:37:24 failed All pages with complete page edit history (.bz2)
#6 {main}

    * These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
    * pages-meta-history.xml.bz2

-----------------------------------------------
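
The ETA in the progress lines above looks consistent with a simple estimate: the remaining revisions (`max` minus `revs`) divided by the reported revision rate, added to the line's timestamp. The sketch below checks that arithmetic. It is an assumption about how the figure is derived, not something confirmed by the dump code, and `estimate_eta` is a hypothetical helper name.

```python
import re
from datetime import datetime, timedelta

# Pattern for the progress lines quoted above.
PROGRESS_RE = re.compile(
    r"(?P<now>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): \w+ "
    r"(?P<pages>\d+) pages \((?P<page_rate>[\d.]+)/sec\), "
    r"(?P<revs>\d+) revs \((?P<rev_rate>[\d.]+)/sec\), "
    r"(?P<prefetched>[\d.]+)% prefetched, "
    r"ETA (?P<eta>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[max (?P<max>\d+)\]"
)
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def estimate_eta(line):
    """Return (estimated_eta, reported_eta), or None if the line does not match.

    Assumes ETA = timestamp + (max - revs) / rev_rate, which is a guess.
    """
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    now = datetime.strptime(match.group("now"), TIME_FORMAT)
    remaining = int(match.group("max")) - int(match.group("revs"))
    seconds = remaining / float(match.group("rev_rate"))
    reported = datetime.strptime(match.group("eta"), TIME_FORMAT)
    return now + timedelta(seconds=seconds), reported
```

For the two progress lines quoted above, the estimate and the reported ETA agree to within a few seconds, which is what one would expect given that the rate is printed to three decimal places.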




pages referenced in the above errors:

-----------------------------------------------

http://en.wikipedia.org/w/index.php?oldid=10595737

Brothers in Arms: Road to Hill 30
"This is an old revision of this page, as edited by Colonel Cow (talk | contribs) at 01:02, 17 February 2005."

-----------------------------------------------

http://en.wikipedia.org/w/index.php?oldid=13930238

Brothers in Arms: Road to Hill 30
"This is an old revision of this page, as edited by 213.212.58.66  (talk) at 12:34, 19 May 2005."

-----------------------------------------------

http://en.wikipedia.org/w/index.php?oldid=355313550

User:Peter I. Vardy/sandbox
This is an old revision of this page, as edited by Peter I. Vardy (talk | contribs)  at 10:53, 11 April 2010.

-----------------------------------------------

http://en.wikipedia.org/w/index.php?oldid=346806445

Talk:Amy Shearn
"This is an old revision of this page, as edited by Yobot (talk | contribs) at 02:49, 28 February 2010."

-----------------------------------------------

http://en.wikipedia.org/w/index.php?oldid=351921561

User:Ohms Law Bot/Cleanup/Roy D. Bridges, Jr.
"This is an old revision of this page, as edited by Ohms Law Bot (talk | contribs) at 06:26, 25 March 2010."

-----------------------------------------------

http://en.wikipedia.org/w/index.php?oldid=358280940

The Tower Treasure
"This is an old revision of this page, as edited by 69.144.24.63  (talk) at 21:36, 25 April 2010."

-----------------------------------------------
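
The lookup above can be automated by pulling the ids out of the error lines and building the same URLs. This sketch follows the message in using the reported text id as the `oldid` parameter; whether a text id always coincides with a revision id is an assumption not confirmed in this thread. `parse_dump_error` is a hypothetical name.

```python
import re

# Pattern for the retry messages quoted above.
ERROR_RE = re.compile(
    r"Error (\d+) of allowed (\d+) retrieving revision text for text id (\d+)!"
)


def parse_dump_error(line):
    """Extract the retry counter, retry limit, and text id from an error line."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    attempt, allowed, text_id = (int(group) for group in match.groups())
    return {
        "attempt": attempt,
        "allowed": allowed,
        "text_id": text_id,
        # Assumption carried over from the message: text id used as oldid.
        "url": "http://en.wikipedia.org/w/index.php?oldid=%d" % text_id,
    }
```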




----- Original Message -----
From: Dmitry Chichkov <dchichkov <at> gmail.com>
Date: Tuesday, July 20, 2010 3:31 pm
Subject: [Xmldatadumps-l] enwiki dump progress on 20100622 - failed again
To: xmldatadumps-l <at> lists.wikimedia.org

> Subj: http://download.wikimedia.org/enwiki/20100622/
>
> Is there anything that can be done to alleviate that problem?
>
> By the way, what's the point of producing .bz2 version of the
> pages-meta-history.xml dump? Is it easier on the system to
> produce .bz2
> first and .7z after that? From the user's perspective I can tell
> that .7z is
> all I need, there is simply no point in working with .bz2 (if
> .7z is
> available).
>
> -- Regards, Dmitry
>
