A.M. Sabuncu | 24 Dec 21:30 2014

Parsing PDF files

I am following the examples at http://wiki.apache.org/tika/TikaJAXRS and using the following curl command to test text extraction from PDF files:
curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"

On trivial PDF files (e.g. one created using Word 2010's convert-to-pdf functionality and containing only the text "Testing", about 81 KB in size), nothing is returned from the curl command, and on the tika-server end I see the following errors:

<lots of garbage characters displayed on screen, followed by>

WARNING: Did not found XRef object at specified startxref position 0

Being new to Tika, I would like to know whether I am doing something wrong, or if PDF parsing is not yet an exact science.
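One thing worth ruling out before blaming the parser: curl's `-d @file` reads the file as ASCII data and strips carriage returns and newlines, which corrupts binary formats; `--data-binary @GeoSPARQL.pdf` (or `-T GeoSPARQL.pdf`) sends the bytes verbatim, and a shifted xref offset is exactly what that startxref warning points at. A JDK-only sketch of the corruption (synthetic bytes, no Tika involved):

```java
// Why "Did not found XRef object at specified startxref position 0" can
// appear for a perfectly good PDF: if the uploaded body shrinks, every
// byte offset recorded in the xref table points at the wrong place.
public class PdfUploadCheck {

    // Simulates the stripping curl's -d flag applies to a body read from a file.
    static byte[] stripNewlines(byte[] body) {
        byte[] tmp = new byte[body.length];
        int n = 0;
        for (byte b : body) {
            if (b != '\r' && b != '\n') tmp[n++] = b;
        }
        byte[] out = new byte[n];
        System.arraycopy(tmp, 0, out, 0, n);
        return out;
    }

    // PDFs must start with the "%PDF-" magic bytes.
    static boolean looksLikePdf(byte[] body) {
        return body.length >= 5 && "%PDF-".equals(new String(body, 0, 5));
    }

    public static void main(String[] args) {
        byte[] original = "%PDF-1.4\n1 0 obj\r\n<<>>\r\nendobj\n%%EOF".getBytes();
        byte[] mangled = stripNewlines(original);
        // The header survives, but the body is shorter, so the startxref
        // offset in the trailer no longer lands on the xref table.
        System.out.println("original " + original.length
                + " bytes, after -d stripping " + mangled.length + " bytes");
    }
}
```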

Many thanks in advance.

Sabuncu
Allison, Timothy B. | 18 Dec 16:51 2014

Tika 2.0???

I feel Tika 2.0 coming soon (well, April-ish?!), and with it the breaking of some other areas of back compat, esp.
parser class loading -> config ... 

What other areas for breaking or revamping do others see for 2.0?

We need a short-term fix to get the Tesseract OCR integration + metadata out the door with 1.7, of course.


-----Original Message-----
From: Chris Mattmann [mailto:chris.mattmann <at> gmail.com] 
Sent: Thursday, December 18, 2014 10:42 AM
To: user <at> tika.apache.org
Subject: Re: Outputting JSON from tika-server/meta

Yeah I think we should probably combine them..and make
JSON the default (which unfortunately would break back
compat, but in my mind would make a lot more sense)

------------------------
Chris Mattmann
chris.mattmann <at> gmail.com




-----Original Message-----
From: "Allison, Timothy B." <tallison <at> mitre.org>
Reply-To: <user <at> tika.apache.org>
Date: Thursday, December 18, 2014 at 7:20 AM
To: "user <at> tika.apache.org" <user <at> tika.apache.org>
Subject: RE: Outputting JSON from tika-server/meta

>Do you have any luck if you call /metadata instead of /meta?
>
>That should trigger MetadataEP which will return Json, no?
>
>I'm not sure why we have both handlers, but we do...
>
>
>-----Original Message-----
>From: Sergey Beryozkin [mailto:sberyozkin <at> gmail.com]
>Sent: Thursday, December 18, 2014 9:56 AM
>To: user <at> tika.apache.org
>Subject: Re: Outputting JSON from tika-server/meta
>
>Hi Peter
>Thanks, you are too nice, it is a minor bug :-)
>Cheers, Sergey
>On 18/12/14 14:50, Peter Bowyer wrote:
>> Thanks Sergey, I have opened TIKA-1497 for this enhancement.
>>
>> Best wishes,
>> Peter
>>
>> On 18 December 2014 at 14:31, Sergey Beryozkin <sberyozkin <at> gmail.com
>> <mailto:sberyozkin <at> gmail.com>> wrote:
>>
>>     Hi,
>>     I see MetadataResource returning StreamingOutput and it has
>>      <at> Produces(text/csv) only. As such this MBW has no effect at the
>>moment.
>>
>>     We can update MetadataResource to return Metadata directly if
>>     application/json is requested or update MetadataResource to directly
>>     convert Metadata to JSON in case of JSON being accepted
>>
>>     Can you please open a JIRA issue ?
>>
>>     Cheers, Sergey
>>
>>
>>
>>     On 18/12/14 13:58, Peter Bowyer wrote:
>>
>>         Hi,
>>
>>         I suspect this has a really simple answer, but it's eluding me.
>>
>>         How do I get the response from
>>         curl -X PUT -T /path/to/file.pdf http://localhost:9998/meta

>>         to be JSON and not CSV?
>>
>>         I've discovered JSONMessageBodyWriter.java
>>         (https://github.com/apache/tika/blob/af19f3ea04792cad81b428f1df9f5ebbb2501913/tika-server/src/main/java/org/apache/tika/server/JSONMessageBodyWriter.java)
>>         so I think the functionality is present, tried adding --header
>>         "Accept:
>>         application/json" to the cURL call, in line with the
>>         documentation for
>>         outputting CSV, but no luck so far.
>>
>>         Many thanks,
>>         Peter
>>
>>
>>
>>
>> --
>> Maple Design Ltd
>> http://www.mapledesign.co.uk
>> +44 (0)845 123 8008
>>
>> Reg. in England no. 05920531
>
>


Peter Bowyer | 18 Dec 14:58 2014

Outputting JSON from tika-server/meta

Hi,

I suspect this has a really simple answer, but it's eluding me.

How do I get the response from 
curl -X PUT -T /path/to/file.pdf http://localhost:9998/meta
to be JSON and not CSV? 

I've discovered JSONMessageBodyWriter.java (https://github.com/apache/tika/blob/af19f3ea04792cad81b428f1df9f5ebbb2501913/tika-server/src/main/java/org/apache/tika/server/JSONMessageBodyWriter.java), so I think the functionality is present. I tried adding --header "Accept: application/json" to the cURL call, in line with the documentation for outputting CSV, but no luck so far.
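For anyone scripting this from Java rather than cURL, here is a JDK-only sketch of the same PUT. The endpoint name and whether it honours the Accept header depend on your tika-server build (the /meta vs /metadata split discussed elsewhere in this thread), so treat both as assumptions:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

// Assumes a tika-server on localhost:9998; the /metadata endpoint
// returning JSON is an assumption to verify against your build.
public class MetaAsJson {

    // Configure the request without sending it.
    static HttpURLConnection prepare(String endpoint, String contentType)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", contentType);
        conn.setRequestProperty("Accept", "application/json");
        return conn;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) return; // usage: MetaAsJson /path/to/file.pdf
        HttpURLConnection conn =
                prepare("http://localhost:9998/metadata", "application/pdf");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(Files.readAllBytes(Paths.get(args[0]))); // raw bytes, unmangled
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```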

Many thanks,
Peter
Peter Bowyer | 11 Dec 18:42 2014

Encrypted PDF issues & build issues

Hi list,

I'm having issues with encrypted PDFs.

The PDF test cases pass, but Tika fails on my own encrypted PDF (sample file at https://dl.dropboxusercontent.com/u/2460167/encryption.pdf; its password is 'testing123').

To rule out a problem with the PDF I tested with Xpdf, and pdftotext extracts the text without issue. Unfortunately I need the metadata too.

$ pdftotext -opw testing123 encrypted.pdf

I'm running on Centos 6.6, and the Java packages installed are:
java-1.6.0-openjdk.x86_64                       1:1.6.0.33-1.13.5.1.el6_6
java-1.6.0-openjdk-devel.x86_64                 1:1.6.0.33-1.13.5.1.el6_6
java-1.7.0-openjdk.x86_64                       1:1.7.0.71-2.5.3.1.el6 <at> updates
java-1.7.0-openjdk-devel.x86_64                 1:1.7.0.71-2.5.3.1.el6 <at> updates


Some outputs:

$ java -jar tika-app-1.7-SNAPSHOT.jar --password=testing123 ~/sample.pdf
INFO - Document is encrypted
Exception in thread "main" org.apache.tika.exception.TikaException: Unable to extract PDF content
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:150)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:161)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:146)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:440)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:116)
Caused by: java.io.IOException: javax.crypto.IllegalBlockSizeException: Input length must be multiple of 16 when decrypting with padded cipher
        at javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:115)
        at javax.crypto.CipherInputStream.read(CipherInputStream.java:233)
        at javax.crypto.CipherInputStream.read(CipherInputStream.java:209)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.encryptData(SecurityHandler.java:312)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptStream(SecurityHandler.java:413)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:386)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptObject(SecurityHandler.java:361)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.proceedDecryption(SecurityHandler.java:192)
        at org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.decryptDocument(StandardSecurityHandler.java:158)
        at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1597)
        at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:943)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:337)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:134)
        ... 7 more
Caused by: javax.crypto.IllegalBlockSizeException: Input length must be multiple of 16 when decrypting with padded cipher
        at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:750)
        at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:676)
        at com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:420)
        at javax.crypto.Cipher.doFinal(Cipher.java:1805)
        at javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:112)
        ... 19 more

I searched the PDFBox issue tracker and found https://issues.apache.org/jira/browse/PDFBOX-2469 and https://issues.apache.org/jira/browse/PDFBOX-2510, which in turn link to related issues. The ticket statuses say a number of these issues are fixed in the 1.8.8 snapshot, provided you run with the non-sequential parser.
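One environmental factor worth ruling out before rebuilding: on some older OpenJDK builds the JCE policy caps AES at 128 bits, which breaks AES-256-protected PDFs. That limit usually surfaces as "Illegal key size" rather than the block-size error above, so treat this as a side check, not a diagnosis:

```java
import javax.crypto.Cipher;

// Quick check of the JCE policy in the running JRE.
public class CryptoCheck {
    public static void main(String[] args) throws Exception {
        int max = Cipher.getMaxAllowedKeyLength("AES");
        System.out.println("Max AES key length: " + max);
        if (max < 256) {
            System.out.println("Unlimited-strength JCE policy files are not installed");
        }
    }
}
```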

So I edited `tika-parsers/pom.xml` and set <pdfbox.version>1.8.8-SNAPSHOT</pdfbox.version>. I also edited `tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties` to enable the non-sequential parser.

Now Tika won't build. I changed PDFParser.properties back, and it still won't build.

Running org.apache.tika.parser.pdf.PDFParserTest
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0 (origin offset 0)
 INFO [main] (PDFParser.java:259) - Document is encrypted
[Fatal Error] :1:1: Content is not allowed in prolog.
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0 (origin offset 0)
[Fatal Error] :1:1: Content is not allowed in prolog.
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0 (origin offset 0)
Tests run: 29, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 12.359 sec <<< FAILURE!
...
Results :

Tests in error:
  testSequentialParser(org.apache.tika.parser.pdf.PDFParserTest): Non-Sequential Parser failed on test file /root/tika-trunk/tika-parsers/target/test-classes/test-documents/testPDF_protected.pdf
  testProtectedPDF(org.apache.tika.parser.pdf.PDFParserTest): Unable to extract PDF content

  
System info:
root <at> 31 [~/tika-trunk]# java -version
java version "1.7.0_71"
OpenJDK Runtime Environment (rhel-2.5.3.1.el6-x86_64 u71-b14)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

root <at> 31 [~/tika-trunk]# mvn -version
Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4; 2014-08-11T21:58:10+01:00)
Maven home: /usr/share/apache-maven
Java version: 1.7.0_71, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71.x86_64/jre
Default locale: en_GB, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-504.el6.x86_64", arch: "amd64", family: "unix"

I tried originally with both Java 1.7 and Java 1.6. In the latest attempts I've tested only with Java 1.7.

Can anyone advise please?

Thanks,
Peter

Allison, Timothy B. | 20 Nov 15:48 2014

internal vs external property?

All,

  What is the difference between an internal and an external Property?  I’m not (quickly) seeing how Metadata is using that Boolean.  Are there other pieces of code that make use of the distinction?

  Thank you.

  Best,
  Tim

Milos Kovacevic | 30 Oct 12:34 2014

Setting tesseract properties when using tika-server

Hello,
I am using tika-server-1.7-SNAPSHOT.jar, which incorporates the Tesseract OCR
engine. How can I set different Tesseract parameters, such as the default
language or the output format (hOCR), on a per-request basis when calling the
tika server?
Regards, Milos

Karol Abramczyk | 24 Oct 17:07 2014

How to add Parser to existing DefaultParser object

Hello,

I’m using Apache Tika to parse different document formats in my application. I created a custom CSV parser that has to be configured at runtime. I tried to add this custom CSV parser by getting the parsers map from the DefaultParser, adding my CSV parser to this map, and setting it back into the DefaultParser.

```
DefaultParser parser = (DefaultParser) TikaConfig.getDefaultConfig().getParser();
Map<MediaType, Parser> parsers = parser.getParsers(context);
CsvParserWrapper csvParser = new CsvParserWrapper(settingscsvHeaderscsvHh);
parsers.put(MediaType.text("csv"), csvParser);
parser.setParsers(parsers);
```
Unfortunately, setting parsers this way changes the ordering and the number of parsers in the DefaultParser (duplicates, I think). After this I started to receive errors because, for example, the content type of a parsed HTML file (http://lucene.apache.org/solr/solrnews.html) is recognized as "application/xml" and it is parsed by DcXMLParser.

So my questions are:
1) How can I add a custom parser to the DefaultParser at runtime? I can’t initialize it at startup, because it has to be configurable.
2) How can I get the parsers list from the DefaultParser with the parsers in the same order as in the DefaultParser?

Kind regards,
Karol Abramczyk
Aeham Abushwashi | 21 Oct 01:27 2014

Tika 1.6 update in Maven Central?

Hi,

We use Tika 1.6, which is pulled, along with all of its dependencies, via Maven. We've hit some issues with the conversion of 7z files, but I believe these issues are addressed by recent changes (r1623593).
Unfortunately, the 1.6 artifacts in the central maven repository are a couple of months old and predate the fix.

Any ideas if/when the artifacts would be updated with the latest and greatest working code?

Any suggestions for workarounds would be greatly appreciated too.

Many thanks,
Aeham
Lewis John Mcgibbney | 16 Oct 06:55 2014

[ANNOUNCEMENT] crawler-commons 0.5 is released

15th October 2014 - crawler-commons 0.5 is released

We are glad to announce the 0.5 release of Crawler Commons. This release mainly improves Sitemap parsing and upgrades to Apache Tika 1.6.

See the CHANGES.txt file included with the release for a full list of details. Additionally the Java documentation can be found here.

We suggest all users upgrade to this version. The Crawler Commons artifacts are released as Maven artifacts and can be found at Maven Central.

Thank you

Lewis

On Behalf of the Crawler Commons Team


--
Lewis
Kamil Żyta | 14 Oct 12:55 2014

External parser

Hi,
I want to use an external parser, but there isn't a comprehensive howto/tutorial
on the web. I only found the parser/external/tika-external-parsers.xml sample
configuration, but I don't know how to register/enable such a parser among the
Tika parsers.

I would be thankful for any help.
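Not a full tutorial, but the mechanics an external parser boils down to can be sketched with the JDK alone: run a command-line tool and capture its stdout as the extracted text. (In Tika itself this is driven by tika-external-parsers.xml; exactly where that file must live on the classpath for the ExternalParser machinery to find it is an assumption to verify against your Tika version.)

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.List;

// Rough sketch of what an external parser does under the hood.
public class RunExternal {

    static String run(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // fold stderr into stdout
        Process p = pb.start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        p.waitFor();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a real tool like pdftotext or exiftool.
        System.out.print(run(Arrays.asList("echo", "hello from external tool")));
    }
}
```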

regards,
KŻ

imyuka | 14 Oct 08:04 2014

proceed with the limitation of character length

Hi all,

    I get a 'more than 100000 characters' exception while processing a document. To avoid this, I can either use the truncated text or raise the maximum limit. How can I increase the limit, or retrieve only the first 100000 characters of the document, without an exception being thrown?
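Both routes exist in Tika (I believe `new BodyContentHandler(-1)` disables the limit, and `WriteOutContentHandler.isWriteLimitReached(ex)` lets you keep the truncated text after catching the exception), but the truncate-without-throwing behaviour itself is easy to see in a JDK-only sketch:

```java
import java.io.Writer;

// Keep the first N characters and record that the limit was hit,
// instead of propagating an exception (plain JDK, no Tika classes).
public class CappedWriter extends Writer {
    private final StringBuilder buf = new StringBuilder();
    private final int limit;
    private boolean limitReached = false;

    public CappedWriter(int limit) { this.limit = limit; }

    @Override public void write(char[] cbuf, int off, int len) {
        int room = limit - buf.length();
        if (room <= 0) { limitReached = true; return; } // drop silently
        int take = Math.min(room, len);
        if (take < len) limitReached = true;            // partial write
        buf.append(cbuf, off, take);
    }

    @Override public void flush() {}
    @Override public void close() {}

    public boolean isLimitReached() { return limitReached; }
    @Override public String toString() { return buf.toString(); }
}
```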

    Thanks. 

