Can pdfminer be used to get a list of the absolute position
(bounding box) of every text fragment in a pdf?
By fragment I mean the raw fragments as they occur in the file,
not merged using some word spacing heuristics.
Starting with 'doc' as found here
and coding very informally:
    for p in doc.get_pages():
        for c in p.contents:
            r = c
            while not hasattr(r, 'decode'):
                r = r.resolve()  # resolve the reference
            if not r.data:
                r.decode()  # decode the stream
This gives the raw PostScript-like commands, including text fragments and a mix of absolute and relative positioning commands. By contrast, the following approach (starting with 'layout', also found here) yields bigger chunks of text, with fragments merged according to laparams, the layout analysis parameters:
for el in layout:
Is there a way to get still-raw text fragments that are not yet merged but are processed enough to know their absolute positions?
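
To make "processed enough to know their absolute positions" concrete, here is a minimal sketch in plain Python. It does not use pdfminer's API. It assumes a heavily simplified content stream containing only the operators BT, ET, Tm (absolute), Td (relative) and Tj, with no nested parentheses in strings. The function name fragment_origins and the sample stream are made up for illustration. The page-level transformation matrix, font size and glyph widths are ignored, so it yields only the origin of each fragment rather than a full bounding box.

```python
import re

# A token is either a (string) literal or a run of non-space, non-paren characters.
TOKEN = re.compile(r"\((?:[^()\\]|\\.)*\)|[^\s()]+")
NUMBER = re.compile(r"[-+]?(\d+\.?\d*|\.\d+)")
IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def fragment_origins(stream):
    """Return (x, y, text) for every Tj in a simplified content stream."""
    line_matrix = IDENTITY  # text line matrix as (a, b, c, d, e, f)
    operands = []
    fragments = []
    for tok in TOKEN.findall(stream):
        if tok.startswith("("):
            operands.append(tok[1:-1])       # string operand, unmerged
        elif tok.startswith("/"):
            operands.append(tok)             # name operand, e.g. /F1
        elif NUMBER.fullmatch(tok):
            operands.append(float(tok))      # numeric operand
        else:
            # Anything else is treated as an operator.
            if tok == "BT":
                line_matrix = IDENTITY       # a text object starts fresh
            elif tok == "Tm" and len(operands) >= 6:
                line_matrix = tuple(operands[-6:])   # absolute positioning
            elif tok == "Td" and len(operands) >= 2:
                tx, ty = operands[-2:]               # relative positioning
                a, b, c, d, e, f = line_matrix
                line_matrix = (a, b, c, d,
                               tx * a + ty * c + e,
                               tx * b + ty * d + f)
            elif tok == "Tj" and operands:
                fragments.append((line_matrix[4], line_matrix[5], operands[-1]))
            operands = []                    # operands are consumed by the operator
    return fragments


sample = "BT /F1 12 Tf 1 0 0 1 72 700 Tm (Hello) Tj 0 -14 Td (World) Tj ET"
for x, y, text in fragment_origins(sample):
    print(x, y, text)
```

Because glyph widths are not tracked, two Tj operators in a row with no positioning command between them would be reported at the same origin. A real interpreter would have to advance the position by the width of the text just shown.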