Shannon Brown | 17 Jul 20:23 2014

Avoiding Out of Memory Errors

Tika 1.5, Java 7, Desktop Java application using Tika for content
parsing (to feed into a machine learning application)

Problem:
How to avoid Out of Memory errors during Tika parsing.
One user reported issues with Out of Memory errors (Java heap) with
files when attempting to pre-process documents.
The application essentially parses each file (of varying sizes and
content-type) using Tika and then generates a data dictionary entry for
each document.
The parsing uses a fairly simple Tika method.
StringBuffer bodytext = new StringBuffer();
StringBuffer metatext = new StringBuffer();
int writelimit = -1;  // -1 disables the write limit
BodyContentHandler content = new BodyContentHandler(writelimit);
parser.parse(stream, content, metadata, context);  // parser is a Tika Parser instance

The body content is placed into bodytext and metadata into metatext.
This is returned to the dataset dictionary routine.

What appears to be happening is some files are throwing
OutOfMemoryErrors during Tika parsing. This fails the entire system and
results in an unresponsive application.

I understand that new BodyContentHandler(writelimit) can limit the
number of chars via the writelimit but simply cutting off the output at
a fixed amount will not reasonably work for this application--as content
might be missed.
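
One way to keep all of the text without holding it on the heap is to stream it out while the parser runs. The sketch below is a plain SAX handler built only on JDK classes; the class name StreamingTextHandler and its helper method are hypothetical, and the parser's calls are simulated. Tika's BodyContentHandler also has a constructor that takes a Writer, which may make a custom class unnecessary. Note that this only bounds the memory used for the extracted text, not whatever the parser itself allocates.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: streams extracted text to a Writer instead of buffering it. */
final class StreamingTextHandler extends DefaultHandler {
    private final Writer out;
    private long written;

    StreamingTextHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            out.write(ch, start, length);   // hand the text on instead of keeping it
            written += length;
        } catch (IOException e) {
            throw new SAXException("Could not write extracted text", e);
        }
    }

    long charsWritten() {
        return written;
    }

    /** Simulates a parser delivering one block of text; returns the characters written. */
    static long stream(String text, Writer out) {
        StreamingTextHandler handler = new StreamingTextHandler(out);
        char[] chars = text.toCharArray();
        try {
            handler.characters(chars, 0, chars.length);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.charsWritten();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("tika-body", ".txt");
        try (BufferedWriter w = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            long n = stream("some extracted text", w);
            System.out.println(n + " chars written to " + tmp);
        }
        Files.delete(tmp);
    }
}
```

The dictionary routine could then read the temp file back in pieces rather than receiving one large string.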

Has anyone else needed to handle this issue in a simple desktop, Java

Clemens Wyss DEV | 7 Jul 08:17 2014

Determine binary pdf?

What, if at all possible, is the preferred way to determine if a document (namely a pdf) is of "binary nature"?

I am extracting text from many PDF user manuals for Lucene indexing, and some of them deliver "absurd binary terms", which I would like to omit.

Thx
Clemens
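
A rough heuristic for the question above, assuming the "binary" output shows up as control characters, replacement characters (U+FFFD) or private-use code points in the extracted text. The class name, method names and the threshold are hypothetical, and garbage made of ordinary letters would not be caught by this check.

```java
/** Hypothetical helper: flags extracted text that contains many non-text characters. */
final class TextQuality {
    /** Fraction of chars that are control, replacement, private-use or unassigned. */
    static double suspiciousRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int suspicious = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int type = Character.getType(c);
            boolean control = Character.isISOControl(c) && !Character.isWhitespace(c);
            if (control || c == '\uFFFD'
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED) {
                suspicious++;
            }
        }
        return (double) suspicious / text.length();
    }

    /** True if more than maxRatio of the characters look like non-text. */
    static boolean looksBinary(String text, double maxRatio) {
        return suspiciousRatio(text) > maxRatio;
    }

    public static void main(String[] args) {
        System.out.println(looksBinary("A normal sentence from a manual.", 0.1));
        System.out.println(looksBinary("\uFFFD\uFFFD\uFFFDx", 0.1));
    }
}
```

The same check can be applied per token before indexing instead of per document.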
Sergey Beryozkin | 2 Jul 14:27 2014

How to index the parsed content effectively

Hi All,

We've been experimenting with indexing the parsed content in Lucene and
our initial attempt was to index the output from
ToTextContentHandler.toString() as a Lucene Text field.

This is unlikely to be effective for large files. So I wonder what
strategies exist for a more effective indexing/tokenization of the
possibly large content.

Perhaps a custom ContentHandler can index content fragments in a unique
Lucene field every time its characters(...) method is called, something
I've been planning to experiment with.
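
A minimal sketch of that idea using only JDK SAX classes. ChunkingHandler and FragmentSink are hypothetical names; what the sink does with each fragment (for example adding it to an index) is left open. Fragments are cut at a fixed size, so a word can be split across two fragments, which a real indexer would need to handle.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: hands text to a sink in fragments of at most maxChars. */
final class ChunkingHandler extends DefaultHandler {
    /** Receives one fragment at a time. */
    interface FragmentSink {
        void accept(String fragment);
    }

    private final int maxChars;
    private final FragmentSink sink;
    private final StringBuilder buffer = new StringBuilder();

    ChunkingHandler(int maxChars, FragmentSink sink) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        this.maxChars = maxChars;
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        while (buffer.length() >= maxChars) {      // emit every full fragment
            sink.accept(buffer.substring(0, maxChars));
            buffer.delete(0, maxChars);
        }
    }

    @Override
    public void endDocument() {
        if (buffer.length() > 0) {                 // flush the remainder
            sink.accept(buffer.toString());
            buffer.setLength(0);
        }
    }

    /** Drives the handler with one block of text and collects the fragments. */
    static List<String> split(String text, int maxChars) {
        final List<String> out = new ArrayList<>();
        ChunkingHandler handler = new ChunkingHandler(maxChars, new FragmentSink() {
            @Override
            public void accept(String fragment) {
                out.add(fragment);
            }
        });
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endDocument();
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("abcdefgh", 3));
    }
}
```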

The feedback will be appreciated
Cheers, Sergey

Allison, Timothy B. | 1 Jul 16:45 2014

RE: Stack Overflow Question

Good to hear.  Let us know if you have any other questions or when you run into surprises.

 

From: yeshwanth kumar [mailto:yeshwanth43 <at> gmail.com]
Sent: Tuesday, July 01, 2014 10:23 AM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

 

hi tim,

 

i forgot to change the BodyContentHandler to ToXMLContentHandler in RecursiveMetada, i changed it only in my 

calling method,

 

now i am getting the entire document as the structure u specified.

 

thanks a ton.

 

-yeshwanth

 

On Tue, Jul 1, 2014 at 7:16 PM, Allison, Timothy B. <tallison <at> mitre.org> wrote:

Hmmm….

 

When I use the ToXMLContentHandler on the test doc submitted with TIKA-1329, I see this:

 

<div class="embedded" id="embed4.zip" />
<div class="package-entry"><h1>embed4.zip</h1>
<div class="embedded" id="embed4.txt" />
<div class="package-entry"><h1>embed4.txt</h1>
<p>embed_4</p>
</div>
</div>
</div>
</div>

 

That’s a text file inside of a zip file that is itself embedded.  I could see doing some parsing on the XML to scrape out <div class="package-entry"> contents and grab the file name from the <h1> element.
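
A sketch of that scraping step with the JDK's DOM parser, assuming the handler output is well-formed XML and that each package-entry div holds its file name in an h1 and its text in p elements. PackageEntryScraper is a hypothetical name, and the sample input is a simplified, hand-written version of the output above, not real Tika output.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: maps embedded file names to their text content. */
final class PackageEntryScraper {
    static Map<String, String> scrape(String xhtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        Map<String, String> result = new LinkedHashMap<>();
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if (!"package-entry".equals(div.getAttribute("class"))) {
                continue;
            }
            String name = null;
            StringBuilder text = new StringBuilder();
            // Look at direct children only, so nested entries keep their own text.
            for (Node n = div.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() != Node.ELEMENT_NODE) {
                    continue;
                }
                if ("h1".equals(n.getNodeName()) && name == null) {
                    name = n.getTextContent().trim();
                } else if ("p".equals(n.getNodeName())) {
                    if (text.length() > 0) {
                        text.append('\n');
                    }
                    text.append(n.getTextContent());
                }
            }
            if (name != null) {
                result.put(name, text.toString());   // a repeated name overwrites
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<body><div class=\"package-entry\"><h1>embed4.zip</h1>"
                + "<div class=\"package-entry\"><h1>embed4.txt</h1>"
                + "<p>embed_4</p></div></div></body>";
        System.out.println(scrape(xml));
    }
}
```

Entries with the same file name in different containers would collide in this map; keying on a path built from the enclosing entries would avoid that.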

 

If I committed TIKA-1329, would that be of any use to you?   That returns a list of metadata objects.  There is one metadata object per embedded file.  The text content of each file can be retrieved from each metadata object by this key: “tika:content.”

 

Best,

 

        Tim

From: yeshwanth kumar [mailto:yeshwanth43 <at> gmail.com]
Sent: Tuesday, July 01, 2014 9:00 AM


To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

 

output is same even with ToXMLContentHandler

 

On Tue, Jul 1, 2014 at 5:59 PM, Allison, Timothy B. <tallison <at> mitre.org> wrote:

Did you try the ToXMLContentHandler?

 

From: yeshwanth kumar [mailto:yeshwanth43 <at> gmail.com]
Sent: Monday, June 30, 2014 4:50 PM


To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

 

hi tim,

 

i tried in all possible ways,

instead of reading entire zip file i parsed individual zipentries,

but even then i faced exceptions such as

 

 

org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@37ba3e33

Caused by: java.io.IOException: Invalid header signature; read 0x725020706968736E, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document

 

org.apache.tika.exception.TikaException: Unable to unpack document stream

 

org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@6f0ee75a

 

org.apache.tika.exception.TikaException: Error creating OOXML extractor

 

 

any suggestions regarding these issues,

 

thanks,

yeshwanth

 

 

On Tue, Jul 1, 2014 at 2:00 AM, yeshwanth kumar <yeshwanth43 <at> gmail.com> wrote:

 

hi tim,

 

thanks, for sharing the resources but i am unable to figure out how to implement it in my code,

what i didn't understand is the flow and recursive steps, when i ran the RecursiveMetadataParser 

it still giving the same kind of output as filenames combined with content of the files,

 

i am totally confused.

 

On Tue, Jul 1, 2014 at 1:29 AM, Allison, Timothy B. <tallison <at> mitre.org> wrote:

Or use the ToXMLContentHandler and parse the XML?

 

From: Allison, Timothy B. [mailto:tallison <at> mitre.org]
Sent: Monday, June 30, 2014 3:55 PM
To: yeshwanth kumar
Cc: user <at> tika.apache.org
Subject: RE: Stack Overflow Question

 

Might want to look into RecursiveMetadata Parser

http://wiki.apache.org/tika/RecursiveMetadata

 

Or

 

https://issues.apache.org/jira/i#browse/TIKA-1329?issueKey=TIKA-1329&serverRenderedViewIssue=true

From: yeshwanth kumar [mailto:yeshwanth43 <at> gmail.com]
Sent: Monday, June 30, 2014 3:24 PM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

 

hi tim,

 

thanks for quick reply,

 

i changed the contenthandler to bodyContentHandler i got exception for maximum word limit,

i used -1 in the bodycontenthandler constructor,

 

now its another problem, filenames and content are present in string returned from handler.tostring()

 

how can i map a fileName to its content.

 

thanks,

yeshwanth

 

On Tue, Jul 1, 2014 at 12:35 AM, Allison, Timothy B. <tallison <at> mitre.org> wrote:

DefaultHandler is effectively a NullHandler; it doesn't store or do anything.

 

Try BodyContentHandler or ToXMLContentHandler, or maybe WriteOutContentHandler.
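
To illustrate the point about DefaultHandler with JDK classes only: the sketch below sends the same SAX text event to a DefaultHandler and to a small collecting handler. CollectingHandler and HandlerDemo are hypothetical names; the collecting handler only mimics, in a very reduced form, what a text-gathering handler such as BodyContentHandler does.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class HandlerDemo {
    /** Minimal handler that keeps the text it receives. */
    static final class CollectingHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Sends one text event to the handler and returns handler.toString(). */
    static String feed(DefaultHandler handler, String content) {
        char[] chars = content.toCharArray();
        try {
            handler.characters(chars, 0, chars.length);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        // DefaultHandler ignores the event, so only Object.toString() is printed.
        System.out.println(feed(new DefaultHandler(), "hello"));
        // The collecting handler returns the text it was given.
        System.out.println(feed(new CollectingHandler(), "hello"));
    }
}
```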

 

 

If you want to write out each embedded file as a binary, try subclassing EmbeddedResourceHandler.

 

QUOTE:

i am using Apache Tika 1.5 for parsing the contents present in a zip file,

here's my sample code

    Parser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    context.set(Parser.class, parser);
    ContentHandler handler = new DefaultHandler();
    Metadata metadata = new Metadata();
    InputStream stream = null;
    try {
        stream = TikaInputStream.get(new File(zipFilePath));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    try {
        parser.parse(stream, handler, metadata, context);
        logger.info("Content:\t" + handler.toString());
    } catch (IOException e) {
        e.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (TikaException e) {
        e.printStackTrace();
    } finally {
        try {
            stream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

in the logger statement all i see is org.xml.sax.helpers.DefaultHandler@5bd8e367

i am missing something, unable to figure it out, looking for some help

 

-----Original Message-----

From: yeshwanth kumar [mailto:yeshwanth43 <at> gmail.com]

Sent: Monday, June 30, 2014 1:28 PM

To: dev <at> tika.apache.org

Subject: Stack Overflow Question

 

Unable to read zipfile using Apache Tika

http://stackoverflow.com/q/24495504/1899893?sem=2

 

 

 

 

 

 

Allison, Timothy B. | 1 Jul 15:46 2014

RE: Stack Overflow Question

Hmmm….

 

When I use the ToXMLContentHandler on the test doc submitted with TIKA-1329, I see this:

 

<div class="embedded" id="embed4.zip" />
<div class="package-entry"><h1>embed4.zip</h1>
<div class="embedded" id="embed4.txt" />
<div class="package-entry"><h1>embed4.txt</h1>
<p>embed_4</p>
</div>
</div>
</div>
</div>

 

That’s a text file inside of a zip file that is itself embedded.  I could see doing some parsing on the XML to scrape out <div class="package-entry"> contents and grab the file name from the <h1> element.

 

If I committed TIKA-1329, would that be of any use to you?   That returns a list of metadata objects.  There is one metadata object per embedded file.  The text content of each file can be retrieved from each metadata object by this key: “tika:content.”

 

Best,

 

        Tim

Allison, Timothy B. | 1 Jul 14:29 2014

RE: Stack Overflow Question

Did you try the ToXMLContentHandler?

 


Allison, Timothy B. | 30 Jun 21:54 2014

RE: Stack Overflow Question

Might want to look into RecursiveMetadata Parser

http://wiki.apache.org/tika/RecursiveMetadata

 

Or

 

https://issues.apache.org/jira/i#browse/TIKA-1329?issueKey=TIKA-1329&serverRenderedViewIssue=true


PRANEESH KUMAR | 30 Jun 09:17 2014

Getting IOException: Resetting to invalid mark while resetting the stream

Using Tika 1.5, I am getting java.io.IOException: Resetting to invalid mark when resetting the stream I passed in.

The IOException occurs mostly when parsing pdf and zip formats.

The code snippet I used is:


try {
    // I have set the stream as BufferedInputStream of some sample.pdf
    stream.mark(Integer.MAX_VALUE);
    Tika t = new Tika();
    String content = t.parseToString(stream);
} finally {
    if (stream != null) {
        stream.reset();
    }
}


Has anybody else experienced this? Is it a bug or expected behaviour?
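
A possible explanation, offered as an assumption rather than a confirmed diagnosis: a stream holds only one mark, so if the code that consumes the stream calls mark() itself (type detection typically marks and resets), the caller's mark is replaced, and reading past the new, smaller read limit invalidates it. The Javadoc of Tika.parseToString(InputStream) also says the stream is closed by that method. The JDK-only sketch below reproduces the exception with a stand-in consumer (MarkResetDemo and consume are hypothetical) and shows the safer pattern of keeping the bytes and opening a fresh stream per use.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class MarkResetDemo {
    /** Stand-in for a library that marks the stream itself and then reads on. */
    static void consume(InputStream in) throws IOException {
        in.mark(4);                        // replaces any mark the caller set
        for (int i = 0; i < 32; i++) {     // reads well past the 4-byte limit
            if (in.read() < 0) {
                break;
            }
        }
    }

    /** Returns true if the caller's reset() still works after consume(). */
    static boolean callerResetSurvives(byte[] data) {
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data), 8)) {
            in.mark(Integer.MAX_VALUE);
            consume(in);
            in.reset();
            return true;
        } catch (IOException e) {
            return false;                  // "Resetting to invalid mark"
        }
    }

    /** Safer pattern: keep the bytes and open a fresh stream for each pass. */
    static int firstByteOfSecondPass(byte[] data) {
        try {
            consume(new ByteArrayInputStream(data));        // first pass
            return new ByteArrayInputStream(data).read();   // second pass from the start
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(callerResetSurvives(new byte[100]));
        System.out.println(firstByteOfSecondPass(new byte[] {7, 8, 9}));
    }
}
```

For large files, a temp file reopened for each pass serves the same purpose without holding the bytes in memory.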


Thanks,
Praneesh
Daniel Gibby | 27 Jun 19:08 2014

IOException should be TikaException?

Using the latest 1.5 Tika release (not snapshot), I get an IOException 
when a PDF doesn't have certain headers.

java.io.IOException: Error: Header doesn't contain versioninfo
     at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:335)
     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:177)
     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)
     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)
     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)
...

Since I'm dealing in my code with file uploads, IOExceptions can easily 
happen in other ways.
Shouldn't this be a TikaException of some type, or at least something 
other than just an IOException?
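
Until the exception type changes, one workaround is to separate the two phases in the calling code: read the upload completely first, then parse from the local copy, so that any exception in the second phase can be attributed to the document. The sketch below uses only JDK classes; UploadParsing, DocumentParser and DocumentParseException are hypothetical names, and the parser is a stand-in for the real parse call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class UploadParsing {
    /** Hypothetical stand-in for the real parse call. */
    interface DocumentParser {
        String parse(InputStream in) throws Exception;
    }

    /** Signals that the document itself could not be parsed. */
    static final class DocumentParseException extends Exception {
        DocumentParseException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Phase 1: an IOException here is a transport problem. */
    static byte[] readFully(InputStream upload) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = upload.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Phase 2: parsing from memory, so failures point at the document or parser. */
    static String parse(byte[] document, DocumentParser parser) throws DocumentParseException {
        try {
            return parser.parse(new ByteArrayInputStream(document));
        } catch (Exception e) {
            throw new DocumentParseException("Could not parse document: " + e.getMessage(), e);
        }
    }

    /** Runs both phases and reports which one failed, if any. */
    static String classify(InputStream upload, DocumentParser parser) {
        byte[] bytes;
        try {
            bytes = readFully(upload);
        } catch (IOException e) {
            return "upload-failed";
        }
        try {
            parse(bytes, parser);
            return "ok";
        } catch (DocumentParseException e) {
            return "parse-failed";
        }
    }

    public static void main(String[] args) {
        DocumentParser failing = new DocumentParser() {
            @Override
            public String parse(InputStream in) throws Exception {
                throw new IOException("Error: Header doesn't contain versioninfo");
            }
        };
        System.out.println(classify(new ByteArrayInputStream(new byte[] {1, 2}), failing));
    }
}
```

Buffering the upload in memory is only reasonable for small files; a temp file gives the same separation for large ones.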

--

-- 

Thanks,

Daniel Gibby

Krüger, Sven | 25 Jun 15:22 2014

getLanguage returns "lt" if pdf-file contains only images

Hello,

 

if a pdf-file only contains graphics without extractable text, getLanguage returns "lt".

 

Currently I can filter that because the length of the extracted content is 2 * metadata.get("xmpTPg:NPages") - but I don't think this is supposed to work that way.

 

Is there any way to get a value that indicates the probability of the detected language, or another way to get a proper (in this case, no) language?
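
Independent of what the detector offers, a simple guard is to skip language detection when the extracted text holds too few letters to support any guess. The sketch below is JDK-only; LanguageGuard and the threshold of 20 letters are hypothetical. If I remember the Tika 1.x API correctly, LanguageIdentifier also exposes isReasonablyCertain(), which may be worth checking alongside getLanguage().

```java
/** Hypothetical guard: decides whether text is long enough for language detection. */
final class LanguageGuard {
    /** Counts letters, ignoring whitespace, digits and punctuation. */
    static int letterCount(String text) {
        int letters = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                letters++;
            }
            i += Character.charCount(cp);
        }
        return letters;
    }

    /** True if there is enough text for a language guess to mean anything. */
    static boolean worthDetecting(String text, int minLetters) {
        return text != null && letterCount(text) >= minLetters;
    }

    public static void main(String[] args) {
        // An image-only PDF might yield nothing but line breaks.
        System.out.println(worthDetecting("\n\n\n\n", 20));
        System.out.println(worthDetecting("This is a short English sentence with letters.", 20));
    }
}
```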

Regards Sven

 

george | 17 Jun 10:03 2014

Detecting html file which is utf-16 encoded

I want to be able to detect when a file is html even when it is utf-16 encoded. I can see from the default tika-mimetypes.xml that files with a BOM will normally be detected as text/plain, which is what happens. I have tried creating my own versions of the html and text mime types in a custom-mimetypes.xml, and these successfully override the original ones, but changing their priority does not force the utf-16 files to be identified as html. Even removing the BOM matches completely from the text mime type in the custom-mimetypes.xml does not work.

So I tried another approach: removing the BOM from the input stream before detecting. However, the utf-16 file is still not recognised as html, despite the text having multiple matches. It seems that the detect method does not know what encoding the file uses. Is there a way to tell a detector what encoding a file is in to aid detection?
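
One reason stripping the BOM may not help: in UTF-16 every ASCII character is stored as two bytes, so a byte pattern such as "<html" never appears contiguously in the file. A workaround outside the detector is to read the BOM, decode the first bytes with the charset it indicates, and look for HTML markers in the decoded text. The sketch below is JDK-only; Utf16HtmlSniffer and its methods are hypothetical, it ignores UTF-32, and it falls back to ISO-8859-1 when no BOM is present.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helper: checks a file prefix for HTML markers after decoding it. */
final class Utf16HtmlSniffer {
    /** Returns the charset indicated by a UTF-16 or UTF-8 BOM, or null if none. */
    static Charset charsetFromBom(byte[] b) {
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        return null;
    }

    /** Decodes the prefix and looks for common HTML markers. */
    static boolean looksLikeHtml(byte[] prefix) {
        Charset cs = charsetFromBom(prefix);
        if (cs == null) {
            cs = StandardCharsets.ISO_8859_1;   // assumption: single-byte fallback
        }
        String head = new String(prefix, cs).toLowerCase(Locale.ROOT);
        return head.contains("<html") || head.contains("<!doctype html");
    }

    public static void main(String[] args) {
        byte[] html = "\uFEFF<html><body>x</body></html>".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(looksLikeHtml(html));
    }
}
```

If the check succeeds, the decoded text could be re-encoded as UTF-8 before it is handed to the detector.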

Thanks

George

