Sebastian Hellmann | 4 Mar 2012 09:11

Re: Wiki Query Language

Dear all,
I had a misconfigured mail client and did not receive any of your 
answers in January, so I concluded that the mailing list was inactive. 
I apologize for not replying to your answers.

Since we assumed that nobody had replied, we already started to develop 
a generic, configurable scraper and used it on the English and German 
Wiktionary. The config files and data can be found here (it is part of 
DBpedia): [1] [2] [3] . We hope that it is generic enough to be applied 
to all languages of Wiktionary and that it can also be used on other 
MediaWikis (e.g. travelwiki.org).
Normally a transformation is done by an Extract-Transform-Load (ETL) 
process. Generally the E (extract) can also be considered a "select" or 
"query" procedure. Hence my initial question about the "Wiki Query 
Language".  If you have a good language for E, then T and L are easy ;)

One of the main problems still unsolved is scraping information from 
templates: building a generic scraper effectively requires being able to 
"interpret" templates correctly. Templates are a good way to structure 
information and are, technically speaking, easy to scrape. The problem 
is rather that you would need one config file per template to get 
"good" data. In Wikipedia, infoboxes can all be parsed with the same 
algorithm, but in DBpedia we still have to create so-called "mappings" 
to get good data: http://mappings.dbpedia.org/ . Infoboxes are a special 
case, however, as they are all structured in a similar way, so the 
"mapping solution" only works for infoboxes.
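
To illustrate why template parameters are technically easy to scrape 
yet still need per-template configuration, here is a minimal Python 
sketch. The function name parse_template is hypothetical and not part of 
the DBpedia scraper, and it assumes a flat template call without nested 
templates or piped links.

```python
from typing import Dict, Tuple


def parse_template(wikitext: str) -> Tuple[str, Dict[str, str]]:
    """Split a flat {{Name | key = value | ...}} call into name and parameters.

    Assumption: no nested templates and no piped links inside the values.
    """
    inner = wikitext.strip()
    if not (inner.startswith("{{") and inner.endswith("}}")):
        raise ValueError("not a template call")
    parts = [part.strip() for part in inner[2:-2].split("|")]
    name = parts[0]
    params: Dict[str, str] = {}
    positional = 0
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            params[key.strip()] = value.strip()
        else:
            # Unnamed parameters are numbered 1, 2, 3, ... as in MediaWiki.
            positional += 1
            params[str(positional)] = part
    return name, params
```

The result is only a dictionary of raw strings. Nothing in it says that 
birth_place is a place or that a value is a date, which is the gap a 
per-template config file or mapping has to fill.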

It comes down to these two options:
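a) create one scraper configuration for each template, which captures 
the intention of the creator and allows the data to be scraped 
"correctly" from all pages.
b) load all necessary template definitions into MediaWiki, then 
transform the pages to HTML or XML and use XPath (or jQuery).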

Gabriel Wicke | 4 Mar 2012 14:09

Re: Wiki Query Language

Hello Sebastian,

> It comes down to these two options:
> a) create one scraper configuration for each template, which captures
> the intention of the creator and allows to "correctly" scrape the data
> from all pages.
> b) load all necessary template definitions into MediaWiki and then do a
> transformation to HTML or XML and use XPath (or JQuery)
>
> On 01/12/2012 03:38 PM, Oren Bochman wrote:
>> 2. the only application which (correctly!?) expands templates is
>> MediaWiki itself.
> (Thanks for your answer) I agree, that only Mediawiki can "correctly"
> expand templates, as it can interpret the code on the template pages.
> The MediaWiki parser can transform Wiki Markup into XML and HTML. (I am
> currently not aware of any other transformation options.)

we are currently working on http://www.mediawiki.org/wiki/Parsoid, a JS 
parser that by now expands templates well and also supports a few parser 
functions. We need to mark up template parameters for the visual editor 
in any case, and plan to employ HTML5 microdata or RDFa for this purpose 
(see http://www.mediawiki.org/wiki/Parsoid/HTML5_DOM_with_microdata). I 
intend to start implementing this sometime this month. Let us know if 
you have feedback / ideas on the microdata or RDFa design.
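
For readers who want to picture what consuming such markup could look 
like, here is a rough Python sketch that collects itemprop values from 
expanded HTML. The markup shape and the names ItempropCollector and 
extract_itemprops are assumptions for illustration only; the actual 
Parsoid output format is still being designed.

```python
from html.parser import HTMLParser
from typing import Dict, Optional


class ItempropCollector(HTMLParser):
    """Collect the text content of elements carrying an itemprop attribute.

    Simplification: any closing tag ends the current property, so nested
    markup inside a property value is only partially captured.
    """

    def __init__(self) -> None:
        super().__init__()
        self.values: Dict[str, str] = {}
        self._current: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        prop = dict(attrs).get("itemprop")
        if prop is not None:
            self._current = prop
            self.values[prop] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.values[self._current] += data

    def handle_endtag(self, tag):
        self._current = None


def extract_itemprops(html: str) -> Dict[str, str]:
    """Return a mapping from itemprop name to the element's text."""
    collector = ItempropCollector()
    collector.feed(html)
    collector.close()
    return collector.values
```

With parameters marked up this way, a scraper would no longer need to 
expand templates itself; it could read the values back out of the HTML.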

> To ask more precisely:
> Is there a best practice for scraping data from Wikipedia? What is the
> smartest way to resolve templates for scraping? Am I not seeing any
> third option?


Amgine | 5 Mar 2012 05:40

Re: Wiki Query Language

On 03/04/2012 05:09 AM, Gabriel Wicke wrote:
> Hello Sebastian,
>
>> It comes down to these two options:
>> a) create one scraper configuration for each template, which captures
>> the intention of the creator and allows to "correctly" scrape the data
>> from all pages.
>> b) load all necessary template definitions into MediaWiki and then do a
>> transformation to HTML or XML and use XPath (or JQuery)
>>
>> On 01/12/2012 03:38 PM, Oren Bochman wrote:
>>> 2. the only application which (correctly!?) expands templates is
>>> MediaWiki itself.
>> (Thanks for your answer) I agree, that only Mediawiki can "correctly"
>> expand templates, as it can interpret the code on the template pages.
>> The MediaWiki parser can transform Wiki Markup into XML and HTML. (I am
>> currently not aware of any other transformation options.)
>
> we are currently working on http://www.mediawiki.org/wiki/Parsoid, a
> JS parser that by now expands templates well and also supports a few
> parser functions. We need to mark up template parameters for the
> visual editor in any case, and plan to employ HTML5 microdata or RDFa
> for this purpose (see
> http://www.mediawiki.org/wiki/Parsoid/HTML5_DOM_with_microdata). I
> intend to start implementing this sometime this month. Let us know if
> you have feedback / ideas on the microdata or RDFa design.
>
>> To ask more precisely:
>> Is there a best practice for scraping data from Wikipedia? What is the
>> smartest way to resolve templates for scraping? Am I not seeing any

Sebastian Hellmann | 5 Mar 2012 17:49

Re: Wiki Query Language

On 03/04/2012 02:09 PM, Gabriel Wicke wrote:
> Hello Sebastian,
>
>> It comes down to these two options:
>> a) create one scraper configuration for each template, which captures
>> the intention of the creator and allows to "correctly" scrape the data
>> from all pages.
>> b) load all necessary template definitions into MediaWiki and then do a
>> transformation to HTML or XML and use XPath (or JQuery)
>>
>> On 01/12/2012 03:38 PM, Oren Bochman wrote:
>>> 2. the only application which (correctly!?) expands templates is
>>> MediaWiki itself.
>> (Thanks for your answer) I agree, that only Mediawiki can "correctly"
>> expand templates, as it can interpret the code on the template pages.
>> The MediaWiki parser can transform Wiki Markup into XML and HTML. (I am
>> currently not aware of any other transformation options.)
>
> we are currently working on http://www.mediawiki.org/wiki/Parsoid, a 
> JS parser that by now expands templates well and also supports a few 
> parser functions. We need to mark up template parameters for the 
> visual editor in any case, and plan to employ HTML5 microdata or RDFa 
> for this purpose (see 
> http://www.mediawiki.org/wiki/Parsoid/HTML5_DOM_with_microdata). I 
> intend to start implementing this sometime this month. Let us know if 
> you have feedback / ideas on the microdata or RDFa design.

Awesome!! I forwarded it to the DBpedia developers. I think the Parsoid 
project might interest some of our people. How is it possible to join? 
Or is it Wikimedia-internal development? Is there a Parsoid mailing list?


Sumana Harihareswara | 5 Mar 2012 18:39

Re: Wiki Query Language

On 03/05/2012 08:49 AM, Sebastian Hellmann wrote:
> On 03/04/2012 02:09 PM, Gabriel Wicke wrote:
>> we are currently working on http://www.mediawiki.org/wiki/Parsoid, a
>> JS parser that by now expands templates well and also supports a few
>> parser functions. We need to mark up template parameters for the
>> visual editor in any case, and plan to employ HTML5 microdata or RDFa
>> for this purpose (see
>> http://www.mediawiki.org/wiki/Parsoid/HTML5_DOM_with_microdata). I
>> intend to start implementing this sometime this month. Let us know if
>> you have feedback / ideas on the microdata or RDFa design.
> Awesome!! I forwarded it to DBpedia developers. I think, the Parsoid
> project might interest some of our people. How is it possible to join?
> Or is it Wikimedia internal development? Is there a parsoid mailing list?

This list (wikitext-l) is also the Parsoid mailing list.

As part of the MediaWiki project, Parsoid development is also open
source and available for anyone to help with.  We'd welcome your help!

Check out https://www.mediawiki.org/wiki/Parsoid#Getting_started and
https://www.mediawiki.org/wiki/Parsoid/Todo for ways to jump in.

-- 
Sumana Harihareswara
Volunteer Development Coordinator
Wikimedia Foundation

Gabriel Wicke | 5 Mar 2012 18:43

Re: Wiki Query Language

> Awesome!! I forwarded it to DBpedia developers. I think, the Parsoid
> project might interest some of our people. How is it possible to join?
> Or is it Wikimedia internal development? Is there a parsoid mailing list?

You are very welcome to join: http://www.mediawiki.org/wiki/Parsoid has 
most of the information to get you started. We are using this mailing 
list for discussions. You can also catch me in the #mediawiki IRC 
channel as gwicke.

> Can JS handle this? I read somewhere, that it was several magnitudes
> slower than other languages... Maybe this is not true for node-JS.

Competition between JS runtimes has improved performance a lot in 
recent years. See for example the fun Computer Language Benchmarks Game: 
http://shootout.alioth.debian.org/u32/which-programming-languages-are-fastest.php 

It is still hard to beat C or C++ performance for memory-dominated tasks 
of course.

> All the data in our mappings wiki was created to "mark up" Wikipedia
> template parameters. So please try to reuse it. I think there are almost
> 200 active users in http://mappings.dbpedia.org/ who have added extra
> parsing information to thousands of templates in Wikipedia across 20
> languages. You can download and reuse it or we can also add your
> requirements to it.

Our primary requirement is marking up all top-level template arguments 
(and generated content like image thumbnails) to enable editing in the 
visual editor. The editor could however also benefit from type 
information, so refining vocabulary information (and perhaps mapping 
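into an ontology) is also interesting to us. We should definitely 
collaborate on this.

What do you think about embedding schema information (maybe RDFa 
profiles?) into the noinclude section of a template page?

Gabriel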

Inez Korczynski | 7 Mar 2012 01:21

VisualEditor - massive code changes

Hello,


As the VisualEditor team agreed last week to go forward with the approach of using ContentEditable in a view layer, I performed some major cleanup in the source code (structural changes and class naming):


Basically, the ContentEditable approach is no longer a "hack" built around EditingSurface and now has its own set of view files.

The EditingSurface demo no longer works properly, but I don't think it has to (this is due to changes in the model classes, which now refer to ve.ce rather than ve.es as before).

Let me know if you have any questions.

Thanks,
Inez

_______________________________________________
Wikitext-l mailing list
Wikitext-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitext-l

Sebastian Hellmann | 9 Mar 2012 06:50

Re: Wiki Query Language

Dear Gabriel,
I cross-posted to the dbpedia-developers list.
@DBpedia Team: although the text below might seem out of context, the 
Wikitext list is actually the one that overlaps the most with our main 
topic: parsing Wiki syntax and templates. Please have a look at the 
http://www.mediawiki.org/wiki/Parsoid project.

@Gabriel:
I think we should not include the markup in the <noinclude> section, but 
on the doc page of the template, so that it also helps normal editors to 
better understand what the templates mean.
Back then we actually designed our approach to work in this way and also 
attempted to add it to WP.
Of course, we were using a naive WP:BOLD approach, which got deleted: 
http://en.wikipedia.org/w/index.php?title=Template:Infobox_person/doc&oldid=324287076#DBpedia_Template_Annotation

But we nevertheless used template syntax, hoping that one day it will be 
included in Wikipedia.

see http://mappings.dbpedia.org/index.php/Mapping:Infobox_actor

{{TemplateMapping
| mapToClass = Actor
| mappings =
    {{ PropertyMapping | templateProperty = name | ontologyProperty = foaf:name }}
    {{ PropertyMapping | templateProperty = birth_place | ontologyProperty = birthPlace }}
    {{ DateIntervalMapping | templateProperty = yearsactive | startDateOntologyProperty = activeYearsStartYear | endDateOntologyProperty = activeYearsEndYear }}
....
}}

This helps to interpret the template and parse out the values 
correctly. It seems that you are trying to do something similar. Maybe 
we can just modify or change our approach so that it also fits your 
requirements.
I will be on holiday for the rest of March, so there will not be any 
more mails from me until then.
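
As a rough illustration of how the Infobox_actor mapping above could be 
applied to already-parsed template parameters, here is a small Python 
sketch. The function map_actor, the dictionary layout, and the year-range 
pattern are assumptions for illustration and do not reproduce the DBpedia 
extraction framework.

```python
import re
from typing import Dict, Union

# templateProperty -> ontologyProperty, copied from the mapping above.
PROPERTY_MAPPINGS = {
    "name": "foaf:name",
    "birth_place": "birthPlace",
}

# Assumed value shape for yearsactive: two four-digit years separated by
# one non-digit character, e.g. "1990-2005". Real values are messier.
YEAR_RANGE = re.compile(r"\s*(\d{4})\s*\D\s*(\d{4})\s*")


def map_actor(params: Dict[str, str]) -> Dict[str, Union[str, int]]:
    """Translate raw template parameters into ontology properties."""
    result: Dict[str, Union[str, int]] = {}
    for template_prop, ontology_prop in PROPERTY_MAPPINGS.items():
        if template_prop in params:
            result[ontology_prop] = params[template_prop]
    # DateIntervalMapping: one template value yields two properties.
    match = YEAR_RANGE.fullmatch(params.get("yearsactive", ""))
    if match:
        result["activeYearsStartYear"] = int(match.group(1))
        result["activeYearsEndYear"] = int(match.group(2))
    return result
```

Parameters without a mapping are simply dropped, which mirrors the idea 
that only annotated template properties produce data.
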
Sebastian

On 03/05/2012 06:43 PM, Gabriel Wicke wrote:
>> Awesome!! I forwarded it to DBpedia developers. I think, the Parsoid
>> project might interest some of our people. How is it possible to join?
>> Or is it Wikimedia internal development? Is there a parsoid mailing 
>> list?
>
> You are very welcome to join- http://www.mediawiki.org/wiki/Parsoid 
> has most of the information to get you started. We are using this 
> mailing list for discussions. You can also catch me in the #mediawiki 
> IRC channel as gwicke.
>
>> Can JS handle this? I read somewhere, that it was several magnitudes
>> slower than other languages... Maybe this is not true for node-JS.
>
> Competition between JS runtimes has improved performance a lot in the 
> last years. See for example the fun Computer Language Benchmarks Game: 
> http://shootout.alioth.debian.org/u32/which-programming-languages-are-fastest.php 
>
>
> It is still hard to beat C or C++ performance for memory-dominated 
> tasks of course.
>
>> All the data in our mappings wiki was created to "mark up" Wikipedia
>> template parameters. So please try to reuse it. I think there are almost
>> 200 active users in http://mappings.dbpedia.org/ who have added extra
>> parsing information to thousands of templates in Wikipedia across 20
>> languages. You can download and reuse it or we can also add your
>> requirements to it.
>
> Our primary requirement is marking up all top-level template arguments 
> (and generated content like image thumbnails) to enable editing in the 
> visual editor. The editor could however also benefit from type 
> information, so refining vocabulary information (and perhaps mapping 
> into an ontology) is also interesting to us. We should definitely 
> collaborate on this.
>
> What do you think about embedding schema information (maybe RDFa 
> profiles?) into the noinclude section of a template page?
>
> Gabriel

-- 
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Projects: http://nlp2rdf.org , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org

Jonas Brekle | 13 Mar 2012 13:22

[ANN] wiktionary.dbpedia.org online - Linked Data, SPARQL and Dumps

Hi lists,

we are proud to announce that we now host the data we extract from
wiktionary publicly on wiktionary.dbpedia.org.

We offer Linked Data: http://wiktionary.dbpedia.org/resource/word
a SPARQL endpoint: http://wiktionary.dbpedia.org/sparql
and N-Triple Dumps: http://downloads.dbpedia.org/wiktionary/

There is also a wiki explaining some details:
http://wiki.dbpedia.org/Wiktionary/
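
A minimal Python sketch of how a request to the SPARQL endpoint could be 
built. It only constructs the URL and sends nothing. The function name 
build_query_url is hypothetical, and it assumes that the last path 
segment of the Linked Data URL above stands for the word being looked 
up. The query parameter itself comes from the SPARQL protocol.

```python
from urllib.parse import quote, urlencode

ENDPOINT = "http://wiktionary.dbpedia.org/sparql"
RESOURCE_BASE = "http://wiktionary.dbpedia.org/resource/"


def build_query_url(word: str, limit: int = 10) -> str:
    """Build a GET URL asking for all properties of one word resource."""
    resource = RESOURCE_BASE + quote(word)
    query = (
        "SELECT ?p ?o WHERE { "
        f"<{resource}> ?p ?o "
        f"}} LIMIT {int(limit)}"
    )
    # urlencode percent-encodes the query text for use in the URL.
    return ENDPOINT + "?" + urlencode({"query": query})
```

The resulting URL can be fetched with any HTTP client; the result format 
is usually negotiated through the Accept header.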

So far we have extracted data from the English and German Wiktionary
(28M triples and 3.7M triples, respectively), but we plan to extend that
to at least the five biggest Wiktionaries within the next weeks, as our
approach focuses on extensibility. The data for each word is structured
hierarchically (as Wiktionary is) and contains information about
language, part of speech, definitions, translations, synonyms,
hypernyms, hyponyms etc.
There might be some quality issues, but we want to release early, so
bear with us and report major problems.

Thanks go to the Wiktionary community, which does a great job creating
this dataset. We hope to enable new use cases and consequently to
promote contributions to the Wiktionary project.

Regards,
Jonas Brekle
Department of Computer Science, University of Leipzig
Research Group: http://aksw.org

Adam Wight | 21 Mar 2012 08:43

Long live the visual editor!

In-place editing is going to be the future of all
collaborative content, or at least, it's hard to imagine
otherwise.

This is possibly the greatest step forward that the
VisualEditor promises to bring to MediaWiki.  Odd that it was
hardly mentioned in the tl;dr thread "Odd plans on
Future/Parser_plan - are they true?".  Perhaps some people
have not tried the new editor?
  http://www.mediawiki.org/wiki/Special:VisualEditorSandbox

Let me please not reopen the questions of l33tism or
wikitext whimsy; I simply wanted to state that clicking
directly on the location you want to edit is a big
deal.  One's poor brain does not have to shift gears; the
computer actually saves you a little bit of your life (how
often can you say that?); this is a good interface.

...and having a parser in JavaScript will really broaden the
horizons for Wikipedia: embedding wiki content in another
application, offline editing, abandoning legacy MediaWiki
code, to name a few.

-Adam
