1 Jan 14:58
Re: Mendeley: beyond PDF, annotations in general, other thoughts (some OT)
One as-yet-undocumented iDEA lab project at Edinburgh is to generate topic indexes for browsing relatively large collections of academic papers (currently several thousand, with plans for 10x to 100x that). (See http://homepages.inf.ed.ac.uk/mfourman/research/topics/uoe.xml for an early test example. It is best viewed with a WebKit browser [Safari, Chrome], but also works with the latest Firefox [with some UI features missing].)

We're mining online PDF texts, and find that around one third of the PDFs that academics at Edinburgh publish online don't easily yield text. My needs are slightly different from those of someone wanting a text version for annotation (I just need a bag of words). I'm resorting to OCR, using a combination of convert (ImageMagick), tesseract (code.google.com/p/tesseract-ocr/), aspell, and a stemmer to produce the bag of words I need.

The ocropus project (code.google.com/p/ocropus/), which also builds on tesseract, may be closer to what you want. VelOCRaptor (http://blog.velocraptor.com/) provides an OS X tool (not open, but based on ocropus) for using OCR to add searchable text to PDFs.

It would be good to establish an open version of something similar, together with tools for manual correction and for learning from manual corrections to improve automation. I plan to propose an MSc project along these lines.

With best wishes for the New Year,
Michael

On 1 Jan 2010, at 12:00, okfn-discuss-request@... wrote:
> On Fri, Dec 4, 2009 at 9:44 AM, Philippe Aigrain
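
For anyone wanting to try something similar, here is a rough sketch of how the convert, tesseract, aspell and stemmer steps mentioned in the message above might fit together. It is not the iDEA lab code. The function names, the file-name pattern and the 300 dpi setting are assumptions made for illustration, the in-memory word set stands in for aspell, and the suffix stripper is a crude stand-in for a real stemmer such as Porter's. The external commands are only constructed, not run.

```python
import re
from collections import Counter


def ocr_commands(pdf_path, stem="page"):
    """Build (but do not run) the external commands for one PDF.

    Hypothetical helper: file names and density are illustrative choices.
    """
    return [
        # ImageMagick: rasterise each PDF page to a numbered PNG
        ["convert", "-density", "300", pdf_path, f"{stem}-%03d.png"],
        # tesseract: OCR the first page image, writing <stem>-000.txt
        ["tesseract", f"{stem}-000.png", f"{stem}-000"],
    ]


def naive_stem(word):
    """Strip one common English suffix (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(ocr_text, known_words):
    """Count stemmed tokens, dropping anything not in known_words.

    known_words plays the role of the spell checker: OCR noise that is
    not a dictionary word is discarded before stemming.
    """
    tokens = re.findall(r"[a-z]+", ocr_text.lower())
    kept = (t for t in tokens if t in known_words)
    return Counter(naive_stem(t) for t in kept)
```

With real data the word filter could be replaced by piping the OCR output through `aspell list`, which prints the words it does not recognise, and discarding those.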
I wonder whether you'd consider putting the content under, e.g.,
either CC-BY or CC-BY-SA or using a license for data such as those at: