Improvements to the rewrite branch
Santiago M. Mola <cooldwind <at> gmail.com>
2010-03-07 12:07:07 GMT
I'm currently using the rewrite branch for a project. This project is
not a bot, but a tool for vandalism analysis.
Here I'll explain how I used it and what changes I made, so it may be
useful for the new design of the rewrite. Also, I'd like to get
recommendations about my approaches so I can make them suitable for
integration with pywikipedia.
First of all, my main unit of information is the Edit. An Edit is an
object composed of a Page and two consecutive revision IDs of that
page. Edit supports operations such as getting the edit comment,
user, timestamp and the old and the new text.
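
A minimal sketch of such an Edit object, to make the description above
concrete. The Revision and Page classes here are simplified stand-ins
invented for illustration, not the classes in the rewrite branch, and
all attribute names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Revision:
    # Hypothetical stand-in, not pywikipedia's Revision class.
    revid: int
    user: str
    timestamp: str
    comment: str
    text: str


@dataclass
class Page:
    # Hypothetical stand-in: a title plus revisions keyed by revision ID.
    title: str
    revisions: Dict[int, Revision] = field(default_factory=dict)


@dataclass
class Edit:
    # A Page plus two consecutive revision IDs of that page.
    old_revid: int
    new_revid: int
    page: Optional[Page] = None  # may be unknown until it is fetched

    def _revision(self, revid: int) -> Revision:
        if self.page is None:
            raise ValueError("this Edit has no Page attached yet")
        return self.page.revisions[revid]

    # Comment, user and timestamp describe the newer revision.
    @property
    def comment(self) -> str:
        return self._revision(self.new_revid).comment

    @property
    def user(self) -> str:
        return self._revision(self.new_revid).user

    @property
    def timestamp(self) -> str:
        return self._revision(self.new_revid).timestamp

    @property
    def old_text(self) -> str:
        return self._revision(self.old_revid).text

    @property
    def new_text(self) -> str:
        return self._revision(self.new_revid).text


page = Page("Example")
page.revisions[1] = Revision(1, "Alice", "2010-03-01T10:00:00Z", "created", "Hello")
page.revisions[2] = Revision(2, "Bob", "2010-03-02T11:00:00Z", "expanded", "Hello world")
edit = Edit(old_revid=1, new_revid=2, page=page)
print(edit.user, edit.comment)
```
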
I had to implement a method similar to BaseSite.loadrevisions():
given a list of edits that carry their revision IDs but NOT their
Page, fetch the revisions and associate each edit with its Page object.
This method retrieves all the revisions, creates Page objects for them
and Revision objects which are assigned to the corresponding
Then, I have to store all this info on disk for later use. So I wrote
a function for exporting my list of edits to XML, using MediaWiki's
XML export format, version 0.4. To ease this process, I added a to_element() method
to Page and Revision objects. to_element() returns an Element object
(from the ElementTree API) representing the object. So, exporting is
as easy as iterating over all Pages, calling their to_element()
method and appending the result to a common root. What do you think about
this? Should it be included in pywikipedia? Do you prefer a different
approach for exporting to XML?
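
The to_element() idea can be sketched roughly as follows. The Page and
Revision classes are again simplified stand-ins, export_pages() is a
hypothetical helper name, and the element names follow the usual layout
of the export format (page, title, revision, id, timestamp, contributor,
comment, text) without the namespace and schema attributes a real
export carries.

```python
import xml.etree.ElementTree as ET


class Revision:
    # Hypothetical stand-in, not pywikipedia's Revision class.
    def __init__(self, revid, timestamp, user, comment, text):
        self.revid = revid
        self.timestamp = timestamp
        self.user = user
        self.comment = comment
        self.text = text

    def to_element(self):
        """Return a <revision> Element describing this revision."""
        rev = ET.Element("revision")
        ET.SubElement(rev, "id").text = str(self.revid)
        ET.SubElement(rev, "timestamp").text = self.timestamp
        contributor = ET.SubElement(rev, "contributor")
        ET.SubElement(contributor, "username").text = self.user
        ET.SubElement(rev, "comment").text = self.comment
        ET.SubElement(rev, "text").text = self.text
        return rev


class Page:
    # Hypothetical stand-in, not pywikipedia's Page class.
    def __init__(self, title, revisions):
        self.title = title
        self.revisions = list(revisions)

    def to_element(self):
        """Return a <page> Element containing all known revisions."""
        page = ET.Element("page")
        ET.SubElement(page, "title").text = self.title
        for revision in self.revisions:
            page.append(revision.to_element())
        return page


def export_pages(pages):
    """Append every page element to a common root and serialize it."""
    root = ET.Element("mediawiki")
    for page in pages:
        root.append(page.to_element())
    return ET.tostring(root, encoding="unicode")


pages = [Page("Example", [Revision(1, "2010-03-01T10:00:00Z", "Alice", "created", "Hello")])]
xml_text = export_pages(pages)
print(xml_text)
```
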
To import from XML again, I adapted the old XmlDump. My version
yields Page objects instead of revisions. Of course this might be a
performance nightmare when working with XML dumps with full history,
so it could be modified to yield Revision objects instead.
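
One way to yield revisions one at a time is to stream the dump with
ElementTree's iterparse(). This is only a sketch, not the adapted
XmlDump: iter_revisions() is a hypothetical name, it yields plain tuples
rather than Revision objects, and it assumes each <title> appears before
the revisions of its page.

```python
import io
import xml.etree.ElementTree as ET


def _local(tag):
    # Drop a "{namespace}" prefix, if any, from an element tag.
    return tag.rsplit("}", 1)[-1]


def iter_revisions(source):
    """Yield (title, revid, text) for each <revision> in the dump."""
    title = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = _local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            # Only direct children are inspected, so the contributor's
            # nested <id> is not confused with the revision ID.
            fields = {_local(child.tag): child.text for child in elem}
            yield title, int(fields["id"]), fields.get("text") or ""
            # Release the revision's text once the caller is done with it.
            elem.clear()


dump = io.BytesIO(
    b"<mediawiki><page><title>Example</title><id>7</id>"
    b"<revision><id>10</id><text>Hello</text></revision>"
    b"<revision><id>11</id><text>Hello world</text></revision>"
    b"</page></mediawiki>"
)
revisions = list(iter_revisions(dump))
print(revisions)
```
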
I think the Revision class should include a page attribute, containing
the Page object that the Revision belongs to. That would be of use,
for example, when writing an XmlDump yielding Revisions and, in
general, for other Revision-oriented applications.
And last but not least, currently it's easy to end up with multiple
Page objects representing the same page, but with different object
state. Do you think that BaseSite should implement a Page factory or
some way to "create a Page object for this title if it doesn't exist
or give me the one that already exists"?
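
Such a factory could look roughly like the sketch below. Site and Page
are simplified stand-ins, the method name page() is an assumption, and
titles are used as given, without the normalization a real site object
would need. Weak references are one possible design choice, so that the
cache does not keep unused Page objects alive.

```python
import weakref


class Page:
    # Hypothetical stand-in, not pywikipedia's Page class.
    def __init__(self, site, title):
        self.site = site
        self.title = title


class Site:
    # Hypothetical stand-in for BaseSite, holding a per-site page cache.
    def __init__(self, name):
        self.name = name
        # Entries disappear once no one else references the Page.
        self._pages = weakref.WeakValueDictionary()

    def page(self, title):
        """Return the existing Page for this title, or create one."""
        page = self._pages.get(title)
        if page is None:
            page = Page(self, title)
            self._pages[title] = page
        return page


site = Site("example")
first = site.page("Main Page")
second = site.page("Main Page")
print(first is second)
```
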
Well, that's all at the moment.
Santiago M. Mola
Jabber ID: cooldwind <at> gmail.com