emijrp | 2 Sep 2011 13:44

Re: another month...

What about the Library of Congress? Any news about that old contact attempt?

I heard that the Internet Archive downloads the dumps several times a year, but there has been no official confirmation.

2011/8/18 Erik Moeller <erik@...>
On Tue, Aug 16, 2011 at 3:00 AM, Ariel T. Glenn <ariel@...> wrote:
> ...another dump.  August is done, July 7z are done, the last of the May
> history and 7z are done.  That brings us up to date.

\o/

Great to see we're back on track. :-)

We talked a while ago about doing more to promote mirroring of the
dumps, and you said you'd been thinking about an approach to that.
Would now be a good time to start making a push for more mirrors?

Thanks,
Erik

--
Erik Möller
Deputy Director, Wikimedia Foundation

Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate

_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l <at> lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l

Ariel T. Glenn | 8 Sep 2011 09:22

another month, another dump. ho hum :-P

The September en wikipedia dumps are done.  Folks who use them, note
that this is the first run with the generation of a pile of smaller
files.  The naming scheme, as you will have noticed, has an additional
string: -p<first-page-id-contained>p<last-page-id-contained>.  Expect the
specific groupings to change from one run to the next; the split is
time-based rather than based on the number of pages or revisions.

You may notice a gap of a few numbers between files; this would indicate
that those pages were deleted and not included in the dump at all.
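
The naming scheme and the gaps described above can be checked with a few lines of code. What follows is a minimal Python sketch, not part of the dump tooling: the function names page_range and find_gaps are hypothetical, and the regular expression is only inferred from the -p<first>p<last> pattern. The sample names are the ones quoted later in this thread.

```python
import re

# Assumed pattern, inferred from the naming scheme described above:
# ...-p<first-page-id>p<last-page-id>.<extension>
RANGE_RE = re.compile(r"-p(\d+)p(\d+)\.")


def page_range(filename):
    """Return (first_page_id, last_page_id) parsed from a dump file name."""
    match = RANGE_RE.search(filename)
    if match is None:
        raise ValueError(f"no page range in {filename!r}")
    return int(match.group(1)), int(match.group(2))


def find_gaps(filenames):
    """Return the (start, end) page-id ranges skipped between consecutive files."""
    ranges = sorted(page_range(name) for name in filenames)
    gaps = []
    for (_, prev_last), (next_first, _) in zip(ranges, ranges[1:]):
        if next_first > prev_last + 1:
            gaps.append((prev_last + 1, next_first - 1))
    return gaps


if __name__ == "__main__":
    names = [
        "enwiki-20110901-pages-meta-history1.xml-p000000010p000002326.7z",
        "enwiki-20110901-pages-meta-history1.xml-p000002327p000004609.7z",
        "enwiki-20110901-pages-meta-history1.xml-p000004610p000006654.7z",
    ]
    for name in names:
        print(name, page_range(name))
    print("gaps:", find_gaps(names))
```

Any range reported by find_gaps would, per the note above, correspond to page ids that were deleted and so left out of the dump.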

Since there were no issues with the network, database servers, broken MW
deployments etc., the run finished without any need for restarts of a
particular step; this is probably the fastest we'll ever see it run, in
a little under 8 days.

Any issues, please let me know.  I expect people will need a script to
download these files easily; didn't someone on this list have a tool in
the works?

Ariel

Jamie Morken | 8 Sep 2011 22:49

Re: another month, another dump. ho hum :-P



----- Original Message -----
From: "Ariel T. Glenn" <ariel <at> wikimedia.org>
Date: Thursday, September 8, 2011 12:22 am
Subject: [Xmldatadumps-l] another month, another dump. ho hum :-P
To: xmldatadumps-l <at> lists.wikimedia.org

> Any issues, please let me know.  I expect people will need a script to
> download these files easily; didn't someone on this list have a tool in
> the works?

Hi Ariel,

This download add-on for Firefox works quite well and is cross-platform:

http://en.wikipedia.org/wiki/DownThemAll!
https://addons.mozilla.org/en-US/firefox/addon/downthemall/
http://www.downthemall.net/

cheers,
Jamie

Eric Sun | 9 Sep 2011 01:05

Re: another month, another dump. ho hum :-P

Just to confirm, the enwiki-20110901-pages-articles.xml.bz2 file is the concatenation of all those sub-files, right?


Would it be possible to restore the filename of this file to enwiki-latest-pages-articles.xml.bz2 for consistency with all the other wikipedias?

For example, the latest full dump in http://dumps.wikimedia.org/dewiki/latest/ 
is called dewiki-latest-pages-articles.xml.bz2 and it's the same in all other languages.

Thanks,
Eric


Bernd Fehling | 9 Sep 2011 08:28

dump of ptwikibooks broken

Dear list,

I noticed that the ptwikibooks dump has had problems since the end of May.
Is someone able to fix this?

Regards,
Bernd Fehling

Ariel T. Glenn | 9 Sep 2011 08:49

Re: another month, another dump. ho hum :-P

Yes it is, and the new naming scheme of the "latest" files is a bug, I
need to fix that. Grrr.

Ariel

On Thu, 08-09-2011 at 16:05 -0700, Eric Sun wrote:
> Just to confirm, the enwiki-20110901-pages-articles.xml.bz2 file is
> the concatenation of all those sub-files, right?
> 
> 
> Would it be possible to restore the filename of this file
> to enwiki-latest-pages-articles.xml.bz2 for consistency with all the
> other wikipedias?
> 
> 
> For example, the latest full dump
> in http://dumps.wikimedia.org/dewiki/latest/ 
> is called dewiki-latest-pages-articles.xml.bz2 and it's the same in
> all other languages.
> 
> 
> Thanks,
> Eric

Ariel T. Glenn | 9 Sep 2011 11:42

Re: another month, another dump. ho hum :-P

Please have a look at these links again.  If folks see any anomalies,
please let me know.  The names should be fixed at any rate.

Ariel

On Fri, 09-09-2011 at 09:49 +0300, Ariel T. Glenn wrote:
> Yes it is, and the new naming scheme of the "latest" files is a bug, I
> need to fix that. Grrr.
> 
> Ariel

fox | 16 Sep 2011 14:52

Re: another month, another dump. ho hum :-P

On 08/09/2011 09:22, Ariel T. Glenn wrote:
> I expect people will need a script to download these files easily;
> didn't someone on this list have a tool in the works?

I wrote this simple bash script
https://github.com/SoNetFBK/wiki-network/blob/master/download_dumps.sh

It's really simple to use.
Usage: download_dumps.sh LANG [OUTPUT_DIR] [MATCHING_STRING]

Examples:
- download_dumps.sh en -> downloads every latest file from enwiki
- download_dumps.sh en /mydata/dumps -> the same but saves everything in
/mydata/dumps
- download_dumps.sh en /mydata/dumps history -> the same but downloads
only the files that contain the word "history" in the name (you can use
regex too!)
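
The MATCHING_STRING idea from the usage examples above can be illustrated in a few lines. This is a Python sketch of the filtering step only, under the assumption that the argument is applied as a regular-expression search on each file name; it is not the code of download_dumps.sh, and select_files is a hypothetical name.

```python
import re


def select_files(filenames, matching_string=None):
    """Keep the file names that match matching_string, treated as a regex.

    With no matching_string every name is kept, which mirrors the
    "download everything" default in the usage examples.
    """
    if matching_string is None:
        return list(filenames)
    pattern = re.compile(matching_string)
    return [name for name in filenames if pattern.search(name)]


if __name__ == "__main__":
    names = [
        "enwiki-20110901-pages-articles.xml.bz2",
        "enwiki-20110901-pages-meta-history1.xml-p000000010p000002326.7z",
    ]
    print(select_files(names, "history"))    # plain substring
    print(select_files(names, r"\.bz2$"))    # regular expression
```

Using a regex search rather than a plain substring test keeps both kinds of argument working, since an ordinary word such as "history" is also a valid pattern.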

P.S.: in the same repo you'll find other interesting tools for analyzing
the dumps (extracting a social network from user talk pages, content
analysis, etc.). If you need more info, write to me ;)

-- 
f.

  "I didn't try, I succeeded"
  (Dr. Sheldon Cooper, PhD)

()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments

http://about.me/fox91

Platonides | 16 Sep 2011 18:02

Re: another month, another dump. ho hum :-P

fox wrote:
> On 08/09/2011 09:22, Ariel T. Glenn wrote:
>> I expect people will need a script to download these files easily;
>> didn't someone on this list have a tool in the works?
>
> I wrote this simple bash script
> https://github.com/SoNetFBK/wiki-network/blob/master/download_dumps.sh

I like your trick to fetch the date. What I used to do was to manually 
download the md5sums file, then parse it to fetch everything from there.
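
The md5sums approach can be sketched as follows. This is a hedged Python illustration, assuming the checksum file uses the usual md5sum layout of one "<digest>  <file name>" pair per line; the function names parse_md5sums, dump_urls and md5_matches are hypothetical, and the digest in the demo is a placeholder, not a real dump checksum.

```python
import hashlib


def parse_md5sums(text):
    """Map file name -> expected MD5 hex digest.

    Assumes md5sum-style lines: "<32 hex digits>  <file name>".
    """
    checksums = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        digest, name = parts
        # md5sum marks binary mode with a leading "*" on the name
        checksums[name.strip().lstrip("*")] = digest.lower()
    return checksums


def dump_urls(checksums, base_url):
    """Build one download URL per file listed in the checksum file."""
    return [base_url.rstrip("/") + "/" + name for name in sorted(checksums)]


def md5_matches(path, expected, chunk_size=1 << 20):
    """Hash a local file in chunks and compare it with the expected digest."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest() == expected.lower()


if __name__ == "__main__":
    # Placeholder digest for illustration only.
    sample = "d41d8cd98f00b204e9800998ecf8427e  enwiki-20110901-pages-articles.xml.bz2\n"
    sums = parse_md5sums(sample)
    for url in dump_urls(sums, "http://dumps.wikimedia.org/enwiki/20110901/"):
        print(url)
```

Working from the checksum file has the side benefit that each download can be verified with md5_matches once it finishes.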

What doesn't seem to work is the files one.
In the output of
> elinks -no-references -no-numbering -dump http://dumps.wikimedia.org/enwiki/20110901/
there's nothing matching "enwiki-"

fox | 19 Sep 2011 14:09

Re: another month, another dump. ho hum :-P

On 16/09/2011 18:02, Platonides wrote:
> I like your trick to fetch the date. What I used to do was to manually
> download the md5sums file, then parse it to fetch everything from there.

Oh yes! I didn't see that; maybe it's better.

> What doesn't seem to work is the files one.
> In the output of
>> elinks -no-references -no-numbering -dump
>> http://dumps.wikimedia.org/enwiki/20110901/
> there's nothing matching "enwiki-"

I don't understand your problem, sorry. That page contains a lot of
strings matching "enwiki-"

┌[fox☮MachI]-(~)
└> elinks -no-references -no-numbering -dump http://dumps.wikimedia.org/enwiki/20110901/ | grep enwiki

enwiki dump progress on 20110901
* enwiki-20110901-pages-meta-history1.xml-p000000010p000002326.7z
* enwiki-20110901-pages-meta-history1.xml-p000002327p000004609.7z
* enwiki-20110901-pages-meta-history1.xml-p000004610p000006654.7z
[...]

-- 
f.

