Ariel T. Glenn | 1 Feb 2011 07:07

Re: Problem with jawiki-dump

I am rerunning the history dump phase of jawiki.

A third thread for the large dumps has been restarted.

Ariel

On Mon, 31-01-2011 at 15:34 +0100, Andreas Meier wrote:
> Hello,
> there seems to be a problem with the current jawiki dump. The complete history dump is only 4.3 GB,
> but the previous dump was 19 GB.
> 
> Another issue: According to http://wikitech.wikimedia.org/view/Dumps#Worker_nodes there
> should be 3 threads for the large dumps, but for the past few days only 2 threads have been running.
> 
> Best regards,
> Andreas
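
A size drop like the one reported above (4.3 GB against 19 GB) is the kind of thing a simple sanity check could flag. The following is only a minimal sketch, not part of the actual dump scripts; the function name and the 0.8 threshold are assumptions chosen for illustration.

```python
GB = 1024 ** 3


def looks_truncated(new_size: int, previous_size: int, min_ratio: float = 0.8) -> bool:
    """Return True when the new dump is suspiciously smaller than the previous one.

    min_ratio is a hypothetical threshold: a full-history dump normally grows
    over time, so a new dump well below the previous size deserves a look.
    """
    if previous_size <= 0:
        return False  # no earlier dump to compare against
    return new_size < previous_size * min_ratio


# Sizes taken from the report above: 4.3 GB now, 19 GB before.
suspicious = looks_truncated(int(4.3 * GB), 19 * GB)
```

With the sizes from the report, `suspicious` is True, while a dump that is the same size or larger than its predecessor would pass.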

Ariel T. Glenn | 6 Feb 2011 07:56

upcoming 1.17 deployment and the xml dumps

A little bit before the scheduled deployment of the 1.17 branch on our
production servers, I will be halting production of XML dumps.
Deployment is set for Tuesday Feb 8 at 07:00 UTC, so a few hours before
that I'll start shutting down processes. 

This is a precautionary measure; after the deployment and any hasty
fixes that may be needed, I will be doing some testing to ensure that
dumps are not impacted, before we restart them. Barring some bizarre
problem, we should be back up and running within a day or two.

Ariel

Jamie Morken | 9 Feb 2011 22:44

Re: upcoming 1.17 deployment and the xml dumps


Hi Ariel,

I don't really understand why the dumps need to be halted, as I thought the MediaWiki code and the database dump code were basically two separate entities already*.  I guess the 1.17 branch code changes the structure of the database, potentially causing errors in the database dump?  I also don't understand the "precautionary" logic of halting the dumps: a dump with errors is better than no dump, especially when there is a limited supply of recent dumps due to the RAID server failure.  If it's only a halt of a couple of days, as you mentioned, that's probably irrelevant, but your last wikitech email makes it sound like a longer period of limited testing, which makes me wonder if it is even worth halting the dumps in the first place.  Also, wouldn't potential dump errors be detected better if the dumps continued to be produced and were checked for errors, rather than halted?

cheers,
Jamie


*
http://svn.wikimedia.org/viewvc/mediawiki/branches/REL1_17/
http://svn.wikimedia.org/viewvc/mediawiki/branches/ariel/xmldumps-backup/


Ariel T. Glenn | 9 Feb 2011 23:14

Re: upcoming 1.17 deployment and the xml dumps

We halted them because bad data can creep in during times when
the codebase is badly broken.  I don't want to have to walk through and
determine later which 30 or 50 wiki dumps those are and toss them, so I
have them on hold until things are sorted out or until we have a date for
deployment that is a number of days off.

A dump with errors isn't better than no dump in that it is possible for
bad data to be carried forward into subsequent dumps, even with the
revision length check in the code.

The only certain check involves doing an md5sum of the revision text,
something that can only be accomplished right now by retrieving the text
from the database, thus making prefetch from the previous dump file a
pointless exercise.
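
To illustrate why a length check is weaker than a checksum, here is a minimal
Python sketch.  It is not the dump code itself: the function names and sample
strings are made up for illustration, and it assumes a trusted MD5 of the
original revision text is available to compare against.

```python
import hashlib


def length_matches(text: str, expected_len: int) -> bool:
    """Cheap check: only compares the byte length of the revision text."""
    return len(text.encode("utf-8")) == expected_len


def md5_matches(text: str, expected_md5: str) -> bool:
    """Stronger check: compares an MD5 digest of the revision text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest() == expected_md5


# Hypothetical revision text and a corrupted copy of exactly the same length.
good = "Tokyo is the capital of Japan."
bad = "Kyoto is the capital of Japan."

expected_len = len(good.encode("utf-8"))
expected_md5 = hashlib.md5(good.encode("utf-8")).hexdigest()

length_ok = length_matches(bad, expected_len)  # corrupted text passes
md5_ok = md5_matches(bad, expected_md5)        # corrupted text is caught
```

The corrupted text passes the length check but fails the MD5 comparison, which
is how bad data of the right length could be carried forward from one dump to
the next.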

After a brief meeting just now about deployment, it appears we are going
to make another stab at testing tomorrow at this time.  (Check
http://techblog.wikimedia.org/ in a couple of hours for the details.)

After that we should have several days of a break; if that pans out,
I'll happily crank dumps back up for that interval.

Ariel

Ariel T. Glenn | 9 Feb 2011 23:16

Re: [Wikitech-l] upcoming 1.17 deployment and the xml dumps

The xmldatadumps branch that I have modifications in contains only the
Python code; the maintenance scripts run out of the deployment branch.
When 1.17 is deployed, those are the versions that run.

Ariel

Roan Kattouw | 9 Feb 2011 23:08

Re: [Wikitech-l] upcoming 1.17 deployment and the xml dumps

The dump code and MW code are intertwined, they're not separate. I
think the xmldumps-backup branch you're linking to is just a branch
Ariel is playing around in or working on dump-specific code or
something.

We really don't want any maintenance scripts running when doing 1.17
stuff, and the dumps use maintenance scripts.

Roan Kattouw (Catrope)

xiang wang | 25 Feb 2011 03:35

Problem about zhwiki

Hello,
As you know, Chinese has two similar written forms, "Traditional Chinese" and "Simplified Chinese", but it is hard to translate between them correctly. I know the wiki can do this translation properly. Why not release a separate "Traditional Chinese" dump and "Simplified Chinese" dump, rather than a single combined one?  This would save Chinese language researchers a lot of time.
Thanks. This is just a serious suggestion!

Ariel T. Glenn | 25 Feb 2011 20:52

post-1.17 deployment restart of dumps

I have done a small amount of testing, and the tests look good. Accordingly,
I have started up one process to do dumps; please get your eyeballs on
them and let me know thumbs up or down.  I'd like to start up the rest
of the processes by tomorrow at this time, so if you can squeeze in some
time to look at them sooner rather than later, that would be awesome.
Thanks!

Ariel

p.s. Yes this means I am done travelling for a while, thank goodness.  I
think I am sick of airplanes.  And *very* sick of jet lag.

White Cat | 26 Feb 2011 16:43

[xmldatadumps-l] Torrents

http://dumps.wikimedia.org/enwiki/20110115/

Hi, does anyone have plans to create individual torrents for "All pages with complete page edit history (.bz2)"? I downloaded them, and it turns out that several of my files seem to be corrupted. I am unable to re-download them, but I think a torrent would be able to repair the corrupted parts. All of the individual parts of the dump except the 1st, 8th, 9th and 10th are complete.

I need these dumps because I will analyse revisions in the hope of better identifying vandalism on the wikis through machine learning. However, I need the database soon in order to process it, as my assignment is due in about a month.
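
One way to confirm which downloaded .bz2 parts are damaged is to decompress each one from start to finish and see whether the stream ends cleanly. The following is a minimal Python sketch using only the standard library; the function name and chunk size are illustrative choices, and the check only shows that a file decompresses without error, not that its contents match what the server published.

```python
import bz2


def bz2_is_intact(path: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the whole .bz2 file decompresses without error.

    The file is read in chunks so that multi-gigabyte dump parts do not
    have to fit in memory.
    """
    try:
        with bz2.open(path, "rb") as fh:
            while fh.read(chunk_size):
                pass  # discard the data; only the absence of errors matters
    except (OSError, EOFError, ValueError):
        # OSError: invalid compressed data; EOFError: truncated file.
        return False
    return True
```

Calling `bz2_is_intact` on each downloaded part would show which ones need to be fetched again.
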
Ariel T. Glenn | 26 Feb 2011 18:07

Re: post-1.17 deployment restart of dumps

Irritatingly enough we haven't quite switched all the paths of
everything to use php-1.17.  For example, the dumps.

So the previous tests aren't very useful.  I'm shooting the svwiki dump
in process and doing another round of tests with the correct path; after
that I'll restart svwiki from its current step and call for community
review again.  Sorry for the mixup.

Ariel

