4 Mar 2012 09:11
Re: Wiki Query Language
Sebastian Hellmann <hellmann <at> informatik.uni-leipzig.de>
2012-03-04 08:11:08 GMT
Dear all,

I had a misconfigured mail client and did not receive any of your answers in January. I concluded that the mailing list was not populated. I really have to apologize for not replying to your answers.

Since we assumed that nobody had replied, we already started to develop a generic, configurable scraper and used it on the English and German Wiktionary. The config files and data can be found here (it is part of DBpedia): [1] [2] [3]. We hope that it is generic enough to be applied to all languages of Wiktionary and that it can also be used on other MediaWikis (e.g. travelwiki.org).

Normally a transformation is done by an Extract-Transform-Load (ETL) process. Generally the E (extract) can also be considered a "select" or "query" procedure. Hence my initial question about the "Wiki Query Language". If you have a good language for E, then T and L are easy ;)

One of the main problems still unsolved is scraping information from templates: to build a generic scraper effectively, it would need to be able to "interpret" templates correctly.

Templates are a good way to structure information, and are easy to scrape (technically speaking). The problem is rather that you would need one config file for each template to get "good" data. In Wikipedia, infoboxes can all be parsed with the same algorithm, but in DBpedia we still have to do so-called "mappings" to get good data: http://mappings.dbpedia.org/

Infoboxes are a special case, however, as they are all structured in a similar way. So the "mapping solution" only works for infoboxes. It comes down to these two options:

a) create one scraper configuration for each template, which captures the intention of the creator and allows the data to be scraped "correctly" [...]
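To make the "one configuration per template" idea from option a) concrete, here is a minimal sketch in Python. It is not the DBpedia scraper and does not use its config format; the template name `example-noun`, the property names, and the functions `parse_template` and `extract` are all hypothetical and exist only for illustration. It also assumes a flat template call, with no nested templates and no wiki links containing "|".

```python
def parse_template(wikitext):
    """Split a flat template call '{{name|a|k=v}}' into (name, params).

    Positional parameters are stored under the keys "1", "2", ...
    Nested templates and links containing '|' are NOT handled.
    """
    inner = wikitext.strip()
    if not (inner.startswith("{{") and inner.endswith("}}")):
        raise ValueError("not a template call")
    parts = [p.strip() for p in inner[2:-2].split("|")]
    name, params, position = parts[0], {}, 1
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            params[key.strip()] = value.strip()
        else:
            params[str(position)] = part
            position += 1
    return name, params


# Hypothetical per-template configuration:
# template name -> {template parameter: output property}
TEMPLATE_CONFIG = {
    "example-noun": {"1": "gender", "pl": "plural"},
}


def extract(wikitext, config=TEMPLATE_CONFIG):
    """Apply the configuration for this template, if one exists."""
    name, params = parse_template(wikitext)
    mapping = config.get(name)
    if mapping is None:
        return None  # no configuration written for this template yet
    return {prop: params[key] for key, prop in mapping.items() if key in params}


print(extract("{{example-noun|m|pl=Hunde}}"))
print(extract("{{unconfigured-template|x}}"))
```

The parsing step is the same for every template; only the entry in `TEMPLATE_CONFIG` carries the meaning of the parameters, which is why each new template needs its own configuration before it yields "good" data.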