As a test though I then removed portal_historiesstorage (IIRC - I think that's the one that physically stores the version objects) entirely from a test site and then repacked. The database went from 34G to 4G. So at least it's thoroughly confirmed now that old versions of binary content are the sole cause of ZODB bloat.
I'm going to keep investigating this. One option I'm considering (not yet sure how seriously) is dumping CMFEditions and just keeping wiki history in a VCS repository (probably Bazaar). The opencore UI doesn't really support versioning of anything other than the body of a wikipage -- titles are technically versioned but I believe old page titles don't ever show up in the UI, and we don't version any complex objects -- so CMFEditions' object-versioning may just be unnecessary overkill anyway. I'm already planning on writing code to transform a project wiki into a VCS repo for project export purposes, so it might make sense. Curious if anyone feels otherwise.
(I also noticed on plone-dev recently there was talk of replacing CMFEditions, and IIRC Martin Aspeli proposed a VCS backend. This is by the way, but does make me think it's not crazy to jettison CMFEditions as a dependency.)
I checked in some code on trunk which converts a project wiki history to a bzr repository. The immediate goal is just getting a complete history for project export, but if this continues to seem promising I'll be looking into using it for live backups as well.
Sorting all commits globally by date across a given project is important to me. This is tricky, because CMFEditions storage is CVS-like (each page has its own history and there is no global revision number), so the commits have to be gathered from every page before they can be ordered, and an in-memory sort really doesn't work for large wikis. Instead, I do the procedure in two phases: first, store all the commits' metadata in a sqlite db; then, retrieve them from that db sorted by timestamp and commit them to a bzr repo in order. This is working surprisingly well.
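To make the two phases concrete, here is a minimal sketch. It is not the code checked in on trunk: the table layout, the column names, and the `apply_commit` callback (which stands in for whatever actually writes the page and commits to bzr) are all assumptions made for illustration.

```python
import sqlite3


def store_commits(conn, commits):
    """Phase 1: record commit metadata in whatever order it is found.

    `commits` is any iterable of (page, version, author, timestamp, message)
    tuples. This layout is hypothetical, not the real schema.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS commits ("
        " page TEXT, version INTEGER, author TEXT,"
        " timestamp REAL, message TEXT)"
    )
    conn.executemany("INSERT INTO commits VALUES (?, ?, ?, ?, ?)", commits)
    conn.commit()


def replay_in_order(conn, apply_commit):
    """Phase 2: stream rows back sorted by timestamp and replay each one.

    sqlite does the sorting, so Python only ever holds one row at a time.
    Page and version break ties between equal timestamps.
    """
    cursor = conn.execute(
        "SELECT page, version, author, timestamp, message"
        " FROM commits ORDER BY timestamp, page, version"
    )
    for row in cursor:
        apply_commit(*row)
```

With a db file on disk rather than `:memory:`, phase 1 can walk the wiki page by page and phase 2 can replay the commits without ever loading the full history into memory.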
I've tested it against some pretty difficult data from Coactivate: nearly up-to-date copies of the opencore project wiki; another large and long-running wiki; and a smaller/younger wiki with lots of non-ASCII text. All are working well, after a few rounds of trial and error.[*]
Interesting stats:
opencore wiki: 5551 checkins on 694 pages; 4.3M .bzr directory; 4.1M checkout
another wiki: 5540 checkins on 193 pages; 6.7M .bzr directory; 4.2M checkout
Next steps:
* Look into running this during project export, or as a separate project export action; maybe split it up into batches to be run by separate calls to the export queue
* Look into implementing edit functionality by checking out a copy of the repo on edit and checking it back in after save
Later steps:
* Look into providing the opencore wiki history / revert / version interfaces on top of a live bzr repo
* Look into what interesting analyses can be done (just from casual glancing at `bzr log` output I'm seeing potentially interesting patterns)
* Research bzr plugins
If I do move forward with this for a live wiki history backend, it will also be important to figure out how to cleanly uninstall CMFEditions and all its persistent objects -- which was my original motivation back in July.
[*] On Coactivate, some very old wiki pages have versions that used to be WickedDocs instead of our current object type. These objects break the portal_repository storage - they're actually not even accessible on the web. A patch to CMFEditions worked around this and allowed me to retrieve their contents and metadata:
Index: src/opencore-bundle/CMFEditions/StandardModifiers.py
===================================================================
--- src/opencore-bundle/CMFEditions/StandardModifiers.py (revision 62256)
+++ src/opencore-bundle/CMFEditions/StandardModifiers.py (working copy)
@@ -350,7 +350,10 @@
             if attr_name not in clone_ids:
                 new_ob = getattr(obj, attr_name, None)
                 if new_ob is not None:
-                    repo_clone._setOb(attr_name, new_ob)
+                    try:
+                        repo_clone._setOb(attr_name, new_ob)
+                    except AttributeError:
+                        pass
 
         # Delete references that are no longer relevant
         for attr_name in clone_ids:
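
For anyone who wants to see the effect of that workaround outside Zope, here is a small standalone sketch of the same pattern. `FakeClone` and `copy_missing_subobjects` are hypothetical stand-ins, not CMFEditions classes or functions, and unlike the patch (which silently passes) this version records the names it had to skip.

```python
class FakeClone:
    """Hypothetical stand-in for a clone retrieved from the repository."""

    def __init__(self):
        self.objects = {}

    def objectIds(self):
        return list(self.objects)

    def _setOb(self, name, ob):
        # Simulate an old-style object that cannot be attached to the clone.
        if name.startswith("legacy_"):
            raise AttributeError(name)
        self.objects[name] = ob


def copy_missing_subobjects(obj, repo_clone, orig_ids):
    """Copy subobjects missing from the clone, tolerating AttributeError."""
    clone_ids = repo_clone.objectIds()
    skipped = []
    for attr_name in orig_ids:
        if attr_name not in clone_ids:
            new_ob = getattr(obj, attr_name, None)
            if new_ob is not None:
                try:
                    repo_clone._setOb(attr_name, new_ob)
                except AttributeError:
                    # The patch uses `pass` here; keeping the name is an
                    # addition so the skipped objects can be reviewed later.
                    skipped.append(attr_name)
    return skipped
```

The trade-off is that a broken subobject is left off the retrieved clone instead of aborting the whole retrieval, which is fine for a one-off export but worth logging if it ever runs against live data.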