Ashish Mukherjee | 3 Mar 11:28 2012

Wikipedia data extraction

Hi,

I am using the following Perl modules to extract data from Wikipedia and Wikitravel, respectively:

- WWW::Wikipedia
- MediaWiki::API

From both these APIs and also by looking at the MediaWiki APIs, I seem to get the entire chunk of text in the Web Service response. To extract different sections of the Wiki entry, I have to rely on pattern matching and regular expressions.

Is there a better way to achieve this? Is there some sample code in any language (preferably Perl) that anyone can share, or is there some tool that does this out of the box?

Any help would be appreciated.

Regards,
Ashish



_______________________________________________
Mediawiki-api mailing list
Mediawiki-api <at> lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api
Amir E. Aharoni | 3 Mar 11:31 2012

Re: Wikipedia data extraction

MediaWiki::DumpFile has some facilities for this, although they were
very basic the last time I checked.

Its developer is active and responsive to bug reports, enhancements and patches.

Ashish Mukherjee | 3 Mar 12:32 2012

Re: Wikipedia data extraction

Thanks, Amir.

Do the dumps give very granular-level data for a Wiki entry?

- Ashish

Amir E. Aharoni | 3 Mar 12:58 2012

Re: Wikipedia data extraction

2012/3/3 Ashish Mukherjee <ashish.mukherjee <at> gmail.com>:
> Thanks, Amir.
>
> Do the dumps give very granular-level data for a Wiki entry?

The XML dumps give the complete text of every page, in the same
wikitext format that you see when you edit it. They also include
metadata such as title, authors, timestamp and namespace.
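
To illustrate the kind of data a dump holds, here is a minimal Python
sketch (not MediaWiki::DumpFile, just the standard library) that reads
title, namespace, timestamp and wikitext from a MediaWiki XML export.
The element names are assumed to follow the export schema; iter_pages,
local_name and the sample document are made up for this example.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield one dict per <page> element of a MediaWiki XML export.

    Hypothetical helper. Assumes the elements are named page, title, ns,
    timestamp and text. If a page holds several revisions, the last one
    read wins.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        page = {"title": None, "ns": None, "timestamp": None, "text": None}
        for child in elem.iter():
            name = local_name(child.tag)
            if name in page:
                page[name] = child.text
        yield page
        elem.clear()  # release the subtree; real dumps are very large


# Hand-written sample in the shape of an export file (not real dump data).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.6/">
  <page>
    <title>Pittsburgh</title>
    <ns>0</ns>
    <revision>
      <timestamp>2012-03-01T00:00:00Z</timestamp>
      <contributor><username>ExampleUser</username></contributor>
      <text>'''Pittsburgh''' is a city.\n\n== History ==\nText.</text>
    </revision>
  </page>
</mediawiki>"""

for page in iter_pages(io.BytesIO(SAMPLE)):
    print(page["title"], page["ns"], page["timestamp"])
```

The text field is still raw wikitext, so splitting it into sections
would still need pattern matching on the "== Heading ==" lines.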

The MediaWiki::DumpFile module also provides some functions that allow
you to analyze page info even if it doesn't necessarily come from a
dump, but these functions are relatively basic. Just see the module's
docs and check whether it has the particular thing that you need.

I used this module quite a lot; you can find the biggest thing I did
with it here:
http://perlwikibot.svn.sourceforge.net/viewvc/perlwikibot/trunk/no-interwiki/prepare_noiw_list.pl?revision=93&view=markup

I haven't maintained it in a long while, but it should still be
functional and you are welcome to recycle the functions and the
regular expressions there.

If there's any particular kind of data that you need, let me know -
maybe I already have code that can extract it.

--
Amir
Amir E. Aharoni | 3 Mar 20:34 2012

Re: Wikipedia data extraction

MaxSem on IRC gave a solution that may help you.

Using the following call, you can get section titles, numbers and
offsets from the beginning of the page:
https://en.wikipedia.org/w/api.php?action=parse&page=Pittsburgh&prop=sections

Using the following call, you can get a section's text by its number:
https://en.wikipedia.org/w/api.php?action=parse&page=Pittsburgh&prop=wikitext&section=2

You can tweak your calls using the API sandbox:
https://en.wikipedia.org/wiki/Special:ApiSandbox
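
The two calls above could be combined roughly as in this Python sketch:
list the sections, pick one by heading, then build the call for its
wikitext. It only builds URLs and parses a response; fetching is left
out. The format=json parameter, the function names and the sample
response are assumptions added for this example, and the "line" and
"index" field names are what prop=sections is expected to return.

```python
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def sections_url(page):
    """URL that lists a page's sections (titles, numbers, offsets)."""
    return API + "?" + urlencode(
        {"action": "parse", "page": page, "prop": "sections", "format": "json"}
    )


def section_text_url(page, index):
    """URL that returns the wikitext of one section, chosen by its number."""
    return API + "?" + urlencode(
        {"action": "parse", "page": page, "prop": "wikitext",
         "section": index, "format": "json"}
    )


def find_section(response_text, heading):
    """Return the 'index' of the section titled `heading`, or None."""
    data = json.loads(response_text)
    for section in data["parse"]["sections"]:
        if section["line"] == heading:
            return section["index"]
    return None


# Hand-written sample in the assumed shape of a prop=sections response.
SAMPLE_RESPONSE = json.dumps({
    "parse": {
        "title": "Pittsburgh",
        "sections": [
            {"line": "History", "number": "1", "index": "1", "byteoffset": 100},
            {"line": "Geography", "number": "2", "index": "2", "byteoffset": 900},
        ],
    }
})

index = find_section(SAMPLE_RESPONSE, "Geography")
print(sections_url("Pittsburgh"))
print(section_text_url("Pittsburgh", index))
```

Looking the section up by heading rather than hard-coding section=2
keeps the second call working if sections are added or reordered.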

--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
‪“We're living in pieces,
I want to live in peace.” – T. Moore‬

Timothy Helck | 8 Mar 03:09 2012

Re: can I use the API to search for images in commons.wikimedia.org?

Gentlemen,

Thanks for your previous suggestions. Unfortunately I've been busy with other things and I am just getting back to this.

I think I have found a reasonable strategy for searching images. I can use this query (similar to what was suggested by Platonides):
https://commons.wikimedia.org/w/api.php?action=query&list=search&srnamespace=6&srsearch=%22chartres+cathedral%22&srlimit=20&sroffset=20&prop=imageinfo

Then I can parse the result set and create another query which gets me the URLs:
https://commons.wikimedia.org/w/api.php?action=query&titles=File:Chartres%20cathedral%202881.jpg|File:Chartres%20cathedral%202880.jpg|File:Chartres%20cathedral%202879.jpg&prop=imageinfo&iiprop=url
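
A rough Python sketch of that second step: take the titles out of a
list=search response and build the imageinfo query from them, with the
titles joined by "|" and URL-encoded. The format=json parameter, the
function names and the sample response are assumptions for this
example; the "query" / "search" / "title" keys are the expected shape
of a search result.

```python
import json
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"


def titles_from_search(response_text):
    """Collect the page titles from a list=search JSON response."""
    data = json.loads(response_text)
    return [hit["title"] for hit in data["query"]["search"]]


def imageinfo_url(titles):
    """Build one prop=imageinfo query for several titles at once."""
    return API + "?" + urlencode({
        "action": "query",
        "titles": "|".join(titles),  # the API separates titles with '|'
        "prop": "imageinfo",
        "iiprop": "url",
        "format": "json",
    })


# Hand-written sample in the assumed shape of a search response.
SAMPLE_SEARCH = json.dumps({
    "query": {
        "search": [
            {"ns": 6, "title": "File:Chartres cathedral 2881.jpg"},
            {"ns": 6, "title": "File:Chartres cathedral 2880.jpg"},
        ]
    }
})

print(imageinfo_url(titles_from_search(SAMPLE_SEARCH)))
```

Letting urlencode do the escaping avoids hand-writing %20 and keeps
the "|" separators from being mangled.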

However, I am still encountering a problem -- commons.wikimedia.org doesn't like requests that don't come from a browser. When I put these queries into PHP I get a 403 error. Is there a URL where I can search Wikimedia programmatically?

regards,

Tim



On Sun, Jan 1, 2012 at 3:51 PM, Platonides <platonides <at> gmail.com> wrote:
On 30/12/11 21:01, Timothy Helck wrote:
> Roan,
>
> I've looked at search, but it only seems to return names of pages, not
> images. Is there a way to make it return images?
>
> Tim

Files are at namespace 6, which wasn't included in the provided query:
https://commons.wikimedia.org/w/api.php?action=query&list=search&srnamespace=6&srsearch=%22chartres+cathedral%22


Liangent | 8 Mar 07:30 2012

Re: can I use the API to search for images in commons.wikimedia.org?

Please see http://meta.wikimedia.org/wiki/User-Agent_policy

-Liangent
Timothy Helck | 8 Mar 14:49 2012

Re: can I use the API to search for images in commons.wikimedia.org?

thanks!

Platonides | 9 Mar 00:36 2012

Re: can I use the API to search for images in commons.wikimedia.org?

On 08/03/12 03:09, Timothy Helck wrote:
> I think I have found a reasonable strategy for searching images. I can
> use this query (similar to what was suggested by Platonides):
> https://commons.wikimedia.org/w/api.php?action=query&list=search&srnamespace=6&srsearch=%22chartres+cathedral%22&srlimit=20&sroffset=20&prop=imageinfo
>
> Then I can parse the result set and create another query which gets me
> the URLs:
> https://commons.wikimedia.org/w/api.php?action=query&titles=File:Chartres%20cathedral%202881.jpg|File:Chartres%20cathedral%202880.jpg|File:Chartres%20cathedral%202879.jpg&prop=imageinfo&iiprop=url

No need to perform two requests. The MediaWiki API generators can
combine them for you:
 https://commons.wikimedia.org/w/api.php?action=query&generator=search&gsrnamespace=6&gsrsearch=%22chartres+cathedral%22&gsrlimit=20&gsroffset=20&prop=imageinfo&iiprop=url
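
As a sketch of how the generator call might be used from code, the
Python below builds the same query and pulls the file URLs out of the
response. The format=json parameter, the function names and the sample
response (including the upload.wikimedia.org path in it) are
assumptions for this example; the "query" / "pages" / "imageinfo"
nesting is the expected JSON shape for prop=imageinfo.

```python
import json
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"


def generator_url(search, limit=20, offset=0):
    """Search the File namespace and request imageinfo in one call."""
    return API + "?" + urlencode({
        "action": "query",
        "generator": "search",
        "gsrnamespace": 6,       # namespace 6 holds files
        "gsrsearch": search,
        "gsrlimit": limit,
        "gsroffset": offset,
        "prop": "imageinfo",
        "iiprop": "url",
        "format": "json",
    })


def image_urls(response_text):
    """Map each file title to its URL; pages without imageinfo are skipped."""
    data = json.loads(response_text)
    pages = data.get("query", {}).get("pages", {})
    return {
        page["title"]: page["imageinfo"][0]["url"]
        for page in pages.values()
        if page.get("imageinfo")
    }


# Hand-written sample in the assumed response shape (not real API output).
SAMPLE_RESPONSE = json.dumps({
    "query": {
        "pages": {
            "101": {
                "pageid": 101,
                "ns": 6,
                "title": "File:Chartres cathedral 2881.jpg",
                "imageinfo": [
                    {"url": "https://upload.wikimedia.org/example/2881.jpg"}
                ],
            },
            "102": {"pageid": 102, "ns": 6, "title": "File:No info.jpg"},
        }
    }
})

print(generator_url('"chartres cathedral"', offset=20))
print(image_urls(SAMPLE_RESPONSE))
```

The search parameters keep their meaning but gain a "g" prefix
(gsrsearch, gsrlimit, ...) because they now belong to the generator.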

> However, I am still encountering a problem -- commons.wikimedia.org
> doesn't like requests that don't come from a browser. When I put these
> queries into PHP I get a 403 error. Is there a URL where I can search
> Wikimedia programmatically?

You need to use a User-Agent which identifies your tool.
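
A minimal Python sketch of sending such a header with the standard
library; the same idea applies to a PHP client. The User-Agent string,
the contact details in it and the function names are placeholders
invented for this example, so substitute your own tool name and a way
to reach you.

```python
import urllib.request

# Hypothetical identification string: tool name/version plus contact info.
USER_AGENT = "ExampleImageSearch/0.1 (https://example.org/contact; tool@example.org)"


def build_request(url):
    """Create a request that carries an identifying User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url, timeout=30):
    """Fetch the URL and return the response body as text."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode("utf-8")


request = build_request(
    "https://commons.wikimedia.org/w/api.php?action=query&format=json"
)
print(request.get_header("User-agent"))
```

Note that urllib stores header names with only the first letter
capitalized, which is why the lookup uses "User-agent".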
Timothy Helck | 9 Mar 14:10 2012

Re: can I use the API to search for images in commons.wikimedia.org?

Platonides,

Thank you, that's really interesting.

Tim
