rimilythomas | 19 Feb 2013 10:16

PDF to XML conversion

Hi all,

I am new to pdfminer and also to Python. Please help me use pdfminer to convert a PDF to XML.
How should the command be used?

I used this command:
pdf2txt.py -o output.xml -t xml samples/xyz.pdf
and it gave me an error: <open file 'xyz.pdf, mode rb at 0x016bcbd0

Thanks to all

--
You received this message because you are subscribed to the Google Groups "pdfminer-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pdfminer-users+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pdfminer-users-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
Visit this group at http://groups.google.com/group/pdfminer-users?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.
 
 
rimilythomas | 19 Feb 2013 10:05

PDF to XML

How do I get started? What command is used to convert a PDF into XML?

I have tried this command:
pdf2txt.py -o output.xml -t xml samples/xyz.pdf

where xyz.pdf is a file I have placed in the samples folder of pdfminer.

 
 
Orion Osborn | 14 Feb 2013 00:20

pdfminer module error

I keep getting the following error when I attempt the test at http://www.unixuser.org/~euske/python/pdfminer/index.html#source

orion@socrates:/usr/lib/pymodules/python2.7/pdfminer$ ./tools/pdf2txt.py samples/simple1.pdf
Traceback (most recent call last):
  File "./tools/pdf2txt.py", line 3, in <module>
    from pdfminer.pdfparser import PDFDocument, PDFParser
ImportError: No module named pdfminer.pdfparser
orion@socrates:/usr/lib/pymodules/python2.7/pdfminer$

 
 
guyga123 | 18 Feb 2013 22:23

Unreadable fonts extracted

I tried to use pdf2txt to extract text from a PDF bill and got unreadable characters.

The debug log and the generated output XML can be downloaded from:

out.xml - ~3MB
https://docs.google.com/file/d/0B00aqeKcQExYbnFnUHJfREJ0ZXM/edit?usp=sharing

debug.log - 27KB
https://docs.google.com/a/betterbill.com/file/d/0B00aqeKcQExYWlJDTnpyeDFQRkE/edit?usp=sharing

The PDF has some private data, but if necessary I can share it as well or send it directly.

I tried googling and found:
http://stackoverflow.com/questions/12675471/pdfminer-gives-strange-letters

I am not sure whether it's the same issue.

Any direction or hint will be appreciated.

Thanks in advance,
Guy

 
 
Derek Dohler | 14 Nov 2012 12:50

pdf2txt.py fails with AssertionError in layout.py

Hi,


I'm having trouble getting pdf2txt.py to work on the attached file; it fails with an AssertionError unless I turn layout analysis off. These are automatically-generated files that I'm analyzing as part of a web scraping process. I have noticed that the PDF generator appears to be using an ugly hack to generate bolded text by layering multiple copies of the same character on top of one another; I don't know if that might be contributing to the problem.

I've tried both the version on PyPI and the GitHub version, without success. Suggestions appreciated!

Derek

pdf2txt.py main1.pdf 
Traceback (most recent call last):
  File "/home/ddohler/Development/pubreg_scrapy/bin/pdf2txt.py", line 105, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/home/ddohler/Development/pubreg_scrapy/bin/pdf2txt.py", line 99, in main
    caching=caching, check_extractable=True)
  File "/home/ddohler/Development/pubreg_scrapy/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 832, in process_pdf
    interpreter.process_page(page)
  File "/home/ddohler/Development/pubreg_scrapy/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 758, in process_page
    self.device.end_page(page)
  File "/home/ddohler/Development/pubreg_scrapy/local/lib/python2.7/site-packages/pdfminer/converter.py", line 35, in end_page
    self.cur_item.analyze(self.laparams)
  File "/home/ddohler/Development/pubreg_scrapy/local/lib/python2.7/site-packages/pdfminer/layout.py", line 635, in analyze
    self.groups = self.group_textboxes(laparams, textboxes)
  File "/home/ddohler/Development/pubreg_scrapy/local/lib/python2.7/site-packages/pdfminer/layout.py", line 618, in group_textboxes
    assert len(plane) == 1

Attachment (main1.pdf): application/pdf, 32 KiB
Aaron Joseph | 19 Nov 2012 23:07

How to extract all pdf form fields, field name, field type and coordinates using pdfminer?

I have tried pdf2txt.py a couple of times, but what I really want is to ignore all the non-field content and extract only these values from a PDF:


  • form fields
  • field name
  • field type
  • coordinates
If there is a good way to format this as JSON rather than XML, that would be much appreciated.

mjsirajahmed | 17 Sep 2012 10:20

How to extract the coordinates of an image in a PDF?

Can anyone tell me how to extract images and the coordinates of those images?

Marc Stober | 5 Aug 2012 23:32

Re: Page links from TOC (outlines)


Thank you!!! Was looking for a way to get the page numbers from the 
get_outlines method and this worked perfectly.

Paulo Scardine | 14 Jun 2012 05:29

bug in utils.drange?

I'm trying to mine a document and got a couple AssertionErrors at layout.py:533:


neighbors = line.find_neighbors(plane, laparams.line_margin) 
assert line in neighbors, line

Well, the LTTextLineHorizontal is not in neighbors because it is not in the plane._grid, so find_neighbors returns [].

[Dbg]>>> line
<LTTextLineHorizontal 225.900,500.061,359.781,500.734 u'reclamante o importe de R$ 18.750,00 em 09/11/09'>

I've traced it to a call of utils.drange in Plane._getrange at utils.py:230. 

In this case, drange(line.y0, line.y1, plane.gridsize) returns xrange(10, 10), which is empty, so this line never gets into plane._grid.

# drange
def drange(v0, v1, d):
    """Returns a discrete range."""
    assert v0 < v1
    return xrange(int(v0)/d, int(v1+d-1)/d)

Should it ever be able to return a value like xrange(10, 10)? I mean, if int(v0)/d == int(v1+d-1)/d, should we return xrange(int(v0)/d, int(v0)/d+1) instead? If so, I can provide a patch.

Thanks in advance,
--
Paulo
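
For readers following this thread, the empty range described above can be reproduced without pdfminer installed. The sketch below ports the quoted drange to Python 3 (// instead of /, range instead of xrange) and adds a hypothetical drange_nonempty that applies the change proposed in the message. The grid size of 50 is an assumption, chosen because it reproduces the xrange(10, 10) reported above; this is an illustration of the proposal, not the library's own fix.

```python
def drange(v0, v1, d):
    """The quoted drange, ported to Python 3 (// and range)."""
    assert v0 < v1
    return range(int(v0) // d, int(v1 + d - 1) // d)


def drange_nonempty(v0, v1, d):
    """Hypothetical variant of the proposal: always cover at least one cell."""
    assert v0 < v1
    start = int(v0) // d
    stop = int(v1 + d - 1) // d
    # If both ends land in the same cell, widen the range to that one cell.
    return range(start, max(stop, start + 1))


GRIDSIZE = 50  # assumption: the message does not state plane.gridsize
y0, y1 = 500.061, 500.734  # y0 and y1 of the LTTextLineHorizontal above

print(list(drange(y0, y1, GRIDSIZE)))           # empty: the line is never indexed
print(list(drange_nonempty(y0, y1, GRIDSIZE)))  # one cell: the line is indexed
```

For spans that already cross a cell boundary, both functions return the same range, so the variant only changes the degenerate case.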

sagar priyadarshi | 4 Jan 2012 11:41

Problem in getting xml file while using dumppdf.py

Dear All

I am using pdfminer's dumppdf.py program to extract text from a PDF,
using this command:

dumppdf.py -a [pdf file] > [output xml file]

This works well with all my other PDFs, but with one PDF it suddenly gave
me the errors below:

Traceback (most recent call last):
  File "/usr/local/bin/dumppdf.py", line 226, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/usr/local/bin/dumppdf.py", line 223, in main
    dumpall=dumpall, codec=codec)
  File "/usr/local/bin/dumppdf.py", line 162, in dumppdf
    doc.set_parser(parser)
  File "/usr/local/lib64/python2.6/site-packages/pdfminer/pdfparser.py", line 327, in set_parser
    self.info.append(dict_value(trailer['Info']))
  File "/usr/local/lib64/python2.6/site-packages/pdfminer/pdftypes.py", line 132, in dict_value
    x = resolve1(x)
  File "/usr/local/lib64/python2.6/site-packages/pdfminer/pdftypes.py", line 60, in resolve1
    x = x.resolve()
  File "/usr/local/lib64/python2.6/site-packages/pdfminer/pdftypes.py", line 49, in resolve
    return self.doc.getobj(self.objid)
  File "/usr/local/lib64/python2.6/site-packages/pdfminer/pdfparser.py", line 418, in getobj
    (strmid, index) = xref.get_pos(objid)
  File "/usr/local/lib64/python2.6/site-packages/pdfminer/pdfparser.py", line 211, in get_pos
    pos = nunpack(ent[self.fl1:self.fl1+self.fl2])
  File "/usr/local/lib64/python2.6/site-packages/pdfminer/utils.py", line 116, in nunpack
    raise TypeError('invalid length: %d' % l)
TypeError: invalid length: 8

What could be the possible reasons? I am using the latest pdfminer build, and
the installed Python is 2.6.
The PDF is not secured or protected and has properties similar to any
other PDF.

Waiting for responses!

Regards
Sagar

sagar priyadarshi | 12 Nov 2011 11:52

How to find page no using dumppdf.py

Hi All!

My basic purpose is to get the XFDF for any PDF that contains
comments (annotations). I am trying to use dumppdf.py like this:
dumppdf.py -a mypdf.pdf > output.xml

and then parse output.xml to get the attributes and values that are
required to form the XFDF. I am getting all the attributes in output.xml
except the page number.

For each comment there is one object in the XML, but that object tag does
not contain any information about the location of the comment (its page number in
the original PDF file).

Kindly share your ideas. Or is there any other way to get XFDF from
a given annotated PDF?

Regards
Sagar
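
One possible direction, offered as a sketch rather than a confirmed answer: in a PDF, the link between a comment and its page normally lives on the page object (its Annots entry), not on the annotation itself, so the page number can be recovered by walking the page tree in the dump. The example below assumes output.xml has the general shape of the hand-written SAMPLE string (object, dict, key, value, literal, list and ref elements); SAMPLE is illustrative and is not real dumppdf.py output. All function names are hypothetical, and the sketch only handles Annots and Kids stored as inline lists.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape the dump is assumed to have (not real output).
SAMPLE = """<pdf>
<object id="1"><dict size="2">
<key>Type</key><value><literal>Pages</literal></value>
<key>Kids</key><value><list size="2"><ref id="2" /><ref id="3" /></list></value>
</dict></object>
<object id="2"><dict size="2">
<key>Type</key><value><literal>Page</literal></value>
<key>Parent</key><value><ref id="1" /></value>
</dict></object>
<object id="3"><dict size="3">
<key>Type</key><value><literal>Page</literal></value>
<key>Parent</key><value><ref id="1" /></value>
<key>Annots</key><value><list size="1"><ref id="4" /></list></value>
</dict></object>
<object id="4"><dict size="2">
<key>Type</key><value><literal>Annot</literal></value>
<key>Contents</key><value><string size="4">note</string></value>
</dict></object>
</pdf>"""


def read_objects(xml_text):
    """Return {object id: {key name: <value> element}} for dict objects."""
    objects = {}
    for obj in ET.fromstring(xml_text).iter("object"):
        d = obj.find("dict")
        if d is None:
            continue  # e.g. streams or bare values; ignored in this sketch
        keys = [c for c in d if c.tag == "key"]
        values = [c for c in d if c.tag == "value"]
        objects[int(obj.get("id"))] = {k.text: v for k, v in zip(keys, values)}
    return objects


def literal(entry, key):
    """Text of the <literal> stored under key, or None."""
    value = entry.get(key)
    lit = value.find("literal") if value is not None else None
    return lit.text if lit is not None else None


def refs(entry, key):
    """Object ids referenced under key (empty list if the key is absent)."""
    value = entry.get(key)
    return [] if value is None else [int(r.get("id")) for r in value.iter("ref")]


def page_order(objects):
    """Walk the page tree from its root; return page object ids in order."""
    root = next(i for i, e in objects.items()
                if literal(e, "Type") == "Pages" and "Parent" not in e)
    order = []

    def walk(objid):
        entry = objects[objid]
        if literal(entry, "Type") == "Page":
            order.append(objid)
        else:
            for kid in refs(entry, "Kids"):
                walk(kid)

    walk(root)
    return order


def annotation_pages(xml_text):
    """Map each annotation object id to a 1-based page number."""
    objects = read_objects(xml_text)
    result = {}
    for page_no, page_id in enumerate(page_order(objects), start=1):
        for annot_id in refs(objects[page_id], "Annots"):
            result[annot_id] = page_no
    return result


print(annotation_pages(SAMPLE))
```

If a page stores Annots as a reference to a separate array object instead of an inline list, that extra level of indirection would need to be resolved before the mapping is correct.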

