Robert Schroll | 9 Jan 22:45 2014

API Changes

PDFMiner developers,

I just learned from a user that the PDFMiner API changed, breaking my 
application.  It's easy enough to fix, but it's a bit annoying to need 
to support two versions of the API at once.  I'd like to make two 
suggestions about future changes to the API.

First, announce them here on the list!  Preferably a few days before 
the new version is released, so we developers can update our 
applications to use the new API.

Second, leave the old API in place and mark it as deprecated where 
possible, instead of removing it completely.  For instance, the way to 
make pages from a document changed from the instance method 
PDFDocument.get_pages() to the class method PDFPages.create_pages(doc). 
 There's no reason not to leave the get_pages() method in place and 
have it call PDFPages.create_pages(self).  Mark it as deprecated so new 
developers won't use it, but leave it around so existing code still 
works.  Similarly, LTAnon was renamed to LTAnno (I'm sure there was a 
good reason...), but existing code wouldn't have been broken if the 
module included the line "LTAnon = LTAnno  # Deprecated name; do not 
use in new code!".
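
A rough sketch of what both suggestions could look like. The classes below are small stand-ins named after the ones in this message, not PDFMiner's actual source, and the `_pages` attribute is invented purely for illustration:

```python
import warnings


class PDFPages(object):
    """Stand-in for the new API mentioned above (not PDFMiner's real code)."""

    @classmethod
    def create_pages(cls, doc):
        # Hypothetical body: yield whatever pages the document holds.
        for page in doc._pages:
            yield page


class PDFDocument(object):
    """Stand-in document; `_pages` is an assumption made for this sketch."""

    def __init__(self, pages):
        self._pages = list(pages)

    def get_pages(self):
        # Old entry point kept alive: warn, then delegate to the new API.
        warnings.warn(
            "PDFDocument.get_pages() is deprecated; "
            "use PDFPages.create_pages(doc) instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return PDFPages.create_pages(self)


class LTAnno(object):
    """Stand-in for the renamed layout class."""


# Deprecated name; do not use in new code!
LTAnon = LTAnno
```

With something like this in place, existing callers of get_pages() and LTAnon keep working. Note that DeprecationWarning is hidden by default in many Python configurations, so developers usually see it only when running with `-W default` or under a test runner.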

I really do appreciate all the work that's gone into PDFMiner -- it's a 
great tool that's made my life much easier.

Thanks,
Robert

Baruch Volkov | 10 Dec 18:52 2013

ImportError: No module named psparser

Hi,

Can somebody point out what I am doing wrong?


C:\tools\pdfminer-20131113>\Python27\python.exe build\scripts-2.7\pdfinterp.py --help
Traceback (most recent call last):
  File "build\scripts-2.7\pdfinterp.py", line 9, in <module>
    from psparser import PSException, PSSyntaxError, PSTypeError, PSEOF, \
ImportError: No module named psparser


Thank you

edabxv

--
You received this message because you are subscribed to the Google Groups "pdfminer-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pdfminer-users+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pdfminer-users-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
Visit this group at http://groups.google.com/group/pdfminer-users.
For more options, visit https://groups.google.com/groups/opt_out.
Baruch Volkov | 10 Dec 18:36 2013

No module named psparser

Hi,

I am new to this and am getting this error when trying to run pdfinterp.
What am I missing?

C:\tools\pdfminer-20131113>\Python27\python.exe build\scripts-2.7\pdfinterp.py --help
Traceback (most recent call last):
  File "build\scripts-2.7\pdfinterp.py", line 9, in <module>
    from psparser import PSException, PSSyntaxError, PSTypeError, PSEOF, \
ImportError: No module named psparser


Thank you

john saponara | 11 Oct 17:42 2013

position of each text fragment

Can pdfminer be used to get a list of the absolute position (bounding box) of every text fragment in a pdf?

By fragment I mean the raw fragments as they occur in the file
(lik)..(e th)..(is.)
not merged using some word spacing heuristics.

Starting with 'doc' as found here and coding very informally:

a=[]
for p in doc.get_pages():
  for c in p.contents:
    r=c
    while not hasattr(r,'decode'):
        r=r.resolve()  # resolve the reference
    if not r.data:
        r.decode()     # decode the stream
    assert r.data
    a.append(r.data)
print '\n'.join(a)

gives the raw postscript commands, including text fragments and (a mix of absolute and relative) positioning commands.  Whereas we get bigger chunks of text (fragments merged according to laparams which are layout analysis parameters) with this approach (starting with 'layout', also found here):

b=[]
for el in layout:
    if hasattr(el,'get_text'):
        b.append(str((el.x0,el.y0,el.get_text())))
print '\n'.join(b)

Is there a way to get still-raw text fragments that are not yet merged but are processed enough to know their absolute positions?
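
As an illustration of the "mix of absolute and relative positioning" mentioned above, here is a minimal sketch that does not use pdfminer at all. It assumes the content stream has already been tokenized into a hypothetical list of tuples, handles only the PDF operators BT, Tm (set the text matrix, i.e. absolute), Td (move relative to the current line start) and Tj (show text), and ignores both the page transformation matrix and the advance caused by glyph widths. Under those assumptions it reports where each raw fragment starts:

```python
def fragment_positions(ops):
    """Return (x, y, text) for each shown fragment.

    Assumptions for this sketch: `ops` is a hypothetical, already-tokenized
    list of tuples; only BT, Tm, Td and Tj are handled; the page's current
    transformation matrix and glyph-width advances are ignored.
    """
    identity = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)
    line = identity  # text line matrix as (a, b, c, d, e, f)
    out = []
    for op in ops:
        name, args = op[0], op[1:]
        if name == "BT":
            # Beginning a text object resets the text matrices.
            line = identity
        elif name == "Tm":
            # Absolute: replace the matrix outright.
            line = tuple(float(v) for v in args)
        elif name == "Td":
            # Relative: translate by (tx, ty) in the current text space.
            tx, ty = args
            a, b, c, d, e, f = line
            line = (a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f)
        elif name == "Tj":
            # Record the origin at which this fragment is shown.
            out.append((line[4], line[5], args[0]))
    return out


ops = [
    ("BT",),
    ("Tm", 1, 0, 0, 1, 72, 700),
    ("Tj", "lik"),
    ("Td", 20, 0),
    ("Tj", "e th"),
    ("Td", 0, -14),
    ("Tj", "is."),
]
for x, y, text in fragment_positions(ops):
    print(x, y, repr(text))
```

A real implementation would also have to apply font metrics and the graphics state, which is the work a PDF interpreter does before layout analysis merges the fragments.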

Thanks,
John

rimilythomas | 19 Feb 10:16 2013

PDF to XML conversion

Hi all,
 
I am new to this and also to Python. Please help me with how to use pdfminer to convert PDF to XML.
How do I use the command?
 
I have used the command: pdf2txt.py -o output.xml -t xml samples/xyz.pdf
and it gave me some error: <open file 'xyz.pdf, mode rb at 0x016bcbd0
 
 
Thanks to all

rimilythomas | 19 Feb 10:05 2013

pdf to xml

How do I get started with it? What is the command used to convert a PDF into XML?
 
I have tried the command: pdf2txt.py -o output.xml -t xml samples/xyz.pdf
 
where xyz.pdf is a file I have placed in the samples folder of pdfminer.

Orion Osborn | 14 Feb 00:20 2013

pdfminer module error

I keep getting the following error when I attempt the test at http://www.unixuser.org/~euske/python/pdfminer/index.html#source

orion@socrates:/usr/lib/pymodules/python2.7/pdfminer$ ./tools/pdf2txt.py samples/simple1.pdf
Traceback (most recent call last):
  File "./tools/pdf2txt.py", line 3, in <module>
    from pdfminer.pdfparser import PDFDocument, PDFParser
ImportError: No module named pdfminer.pdfparser
orion@socrates:/usr/lib/pymodules/python2.7/pdfminer$
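
One way to narrow down an ImportError like this is to ask the interpreter where, if anywhere, it would load the module from. The snippet below is only a diagnostic sketch, not a fix. It uses `importlib.util.find_spec`, which is available on Python 3.4 and later (the traceback above is from Python 2.7), and the helper name `locate` is made up for this example:

```python
import importlib.util


def locate(module_name):
    """Return the file a module would be imported from, or None if not found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ImportError:
        # Raised for dotted names when the parent package itself is missing.
        return None
    if spec is None:
        return None
    return spec.origin


for name in ("pdfminer", "pdfminer.pdfparser"):
    print(name, "->", locate(name))
```

If the package is found but the submodule is not, the installed version may simply lay out its modules differently from what the script expects. If neither is found, the package is not on the import path of the interpreter running the script.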

guyga123 | 18 Feb 22:23 2013

Unreadable fonts extracted

I tried to use pdf2txt to extract text from a pdf bill and got unreadable fonts.

debug log and output xml generated can be downloaded from:

out.xml - ~3MB
https://docs.google.com/file/d/0B00aqeKcQExYbnFnUHJfREJ0ZXM/edit?usp=sharing

debug.log - 27KB
https://docs.google.com/a/betterbill.com/file/d/0B00aqeKcQExYWlJDTnpyeDFQRkE/edit?usp=sharing

pdf has some private data, but if necessary I can share it as well or send it directly.

I tried googling and found:
http://stackoverflow.com/questions/12675471/pdfminer-gives-strange-letters

Not sure if it's the same issue.

Any direction or hint will be appreciated.

Thanks in advance,
Guy

Derek Dohler | 14 Nov 12:50 2012

pdf2txt.py fails with AssertionError in layout.py

Hi,


I'm having trouble getting pdf2txt.py to work on the attached file; it fails with an AssertionError unless I turn layout analysis off. These are automatically-generated files that I'm analyzing as part of a web scraping process. I have noticed that the PDF generator appears to be using an ugly hack to generate bolded text by layering multiple copies of the same character on top of one another; I don't know if that might be contributing to the problem.

I've tried both the version on PyPI and the GitHub version, without success. Suggestions appreciated!

Derek

pdf2txt.py main1.pdf 
Traceback (most recent call last):
  File "/home/ddohler/Development/pubreg_scrapy/bin/pdf2txt.py", line 105, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/home/ddohler/Development/pubreg_scrapy/bin/pdf2txt.py", line 99, in main
    caching=caching, check_extractable=True)
  File "/home/ddohler/Development/pubreg_scrapy/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 832, in process_pdf
    interpreter.process_page(page)
  File "/home/ddohler/Development/pubreg_scrapy/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 758, in process_page
    self.device.end_page(page)
  File "/home/ddohler/Development/pubreg_scrapy/local/lib/python2.7/site-packages/pdfminer/converter.py", line 35, in end_page
    self.cur_item.analyze(self.laparams)
  File "/home/ddohler/Development/pubreg_scrapy/local/lib/python2.7/site-packages/pdfminer/layout.py", line 635, in analyze
    self.groups = self.group_textboxes(laparams, textboxes)
  File "/home/ddohler/Development/pubreg_scrapy/local/lib/python2.7/site-packages/pdfminer/layout.py", line 618, in group_textboxes
    assert len(plane) == 1

--
You received this message because you are subscribed to the Google Groups "pdfminer-users" group.
To view this discussion on the web visit https://groups.google.com/d/msg/pdfminer-users/-/mvGlsbVvBSYJ.
To post to this group, send email to pdfminer-users-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To unsubscribe from this group, send email to pdfminer-users+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/pdfminer-users?hl=en.
Attachment (main1.pdf): application/pdf, 44 KiB
Aaron Joseph | 19 Nov 23:07 2012

How to extract all pdf form fields, field name, field type and coordinates using pdfminer?

So I have tried pdf2txt.py a couple of times, but I really want to ignore all the non-field values and extract only these values from a PDF:


  • form fields
  • field name
  • field type
  • coordinates
If there is a good way to format this as JSON rather than XML, that would be much appreciated.
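
For the JSON part of the question, a minimal sketch using only the standard library. It assumes the field records have already been pulled out of the PDF by some other means and are available as a hypothetical list of dicts keyed by the PDF dictionary entries T (field name), FT (field type) and Rect (the widget rectangle); the function name and output keys are invented for illustration:

```python
import json

# Field-type codes defined by the PDF specification for the /FT entry.
FIELD_TYPES = {"Tx": "text", "Btn": "button", "Ch": "choice", "Sig": "signature"}


def fields_to_json(fields):
    """Serialize extracted field records to a JSON string.

    `fields` is a hypothetical list of dicts using the PDF dictionary keys
    T (field name), FT (field type) and Rect ([x0, y0, x1, y1]).
    """
    records = []
    for field in fields:
        records.append({
            "name": field.get("T"),
            # Fall back to the raw code if it is not one of the known types.
            "type": FIELD_TYPES.get(field.get("FT"), field.get("FT")),
            "coordinates": list(field.get("Rect", [])),
        })
    return json.dumps(records, indent=2)


sample = [{"T": "surname", "FT": "Tx", "Rect": [72, 700, 300, 720]}]
print(fields_to_json(sample))
```

How the records are obtained from the document is left open here; in a real file the Rect entry may live on a separate widget annotation rather than on the field dictionary itself.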

mjsirajahmed | 17 Sep 10:20 2012

how to extract the coordinates for an image in PDF?

Can anyone tell me how to extract images and the coordinates of those images?

