AarKay | 31 Jul 06:29 2014

Tika - Outlook msg file with another Outlook msg as an attachment - OutlookExtractor passes empty stream

I am using Tika Server (TikaJaxRs) for text extraction.
I also need to extract the attachments from a file and save them to disk in
their native format.
I was able to do this with a CustomParser that writes the file to disk from the
'stream' passed to its parse method.

Here is the post I used as a reference for building the CustomParser:
http://stackoverflow.com/questions/20172465/get-embedded-resourses-in-doc-files-using-apache-tika

This works fine as long as the attachment is anything but an Outlook msg
file.

I am running into an issue when the attachment is an Outlook msg file.
When the CustomParser.parse method is invoked, the stream passed to it is
empty, so the file written to disk is always 0 KB.

Digging through the code, I noticed that in OutlookExtractor.java the
attachment is handled by OfficeParser, because msg.attachData is always null
when the attachment is an Outlook msg, and that is where the empty stream is
sent to CustomParser.

Here is the snippet of code from OutlookExtractor where it iterates through
the attachment files and uses the handleEmbeddedResource method only when
msg.attachData is not null.
But msg.attachData is always null if the attachment is an Outlook msg, so the
stream is always empty when the request is delegated to the
CustomParser.parse method.

Can someone please tell me how I can access the msg attachment and save it
[...]

Mattmann, Chris A (3980 | 28 Jul 06:22 2014

[VOTE] Apache Tika 1.6 release candidate #1

Hi Folks,

A candidate for the Tika 1.6 release is available at:

http://people.apache.org/~mattmann/apache-tika-1.6/rc1/

The release candidate is a zip archive of the sources in:

    http://svn.apache.org/repos/asf/tika/tags/1.6/

The SHA1 checksum of the archive is
076ad343be56a540a4c8e395746fa4fda5b5b6d3.
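
For anyone checking the candidate before voting, a minimal sketch of verifying a SHA1 checksum with only the JDK is shown here. The class name Sha1Check and the command-line arguments are assumptions for illustration; the archive file name is whatever you downloaded locally.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Sha1Check {
    /** Returns the lowercase hex SHA-1 digest of everything read from the stream. */
    static String sha1Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            digest.update(buffer, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the downloaded archive (hypothetical local file)
        // args[1]: expected checksum, e.g. the value quoted in the vote mail
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String actual = sha1Hex(in);
            System.out.println(actual.equalsIgnoreCase(args[1])
                    ? "checksum OK" : "MISMATCH: " + actual);
        }
    }
}
```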

A Maven staging repository is available at:

https://repository.apache.org/content/repositories/orgapachetika-1003/

Please vote on releasing this package as Apache Tika 1.6.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

    [ ] +1 Release this package as Apache Tika 1.6
    [ ] -1 Do not release this package because...

Thank you!

Cheers,
Chris

P.S. Here is my +1!

Avi Hayun | 23 Jul 14:50 2014

How to identify a language of text

Hi,

I saw that Tika can identify language of a given text by using the following:
http://tika.apache.org/1.4/api/org/apache/tika/language/LanguageIdentifier.html


How many languages does Tika support?
Where can I find more information about it?

Shannon Brown | 17 Jul 20:23 2014

Avoiding Out of Memory Errors

Tika 1.5, Java 7, Desktop Java application using Tika for content
parsing (to feed into a machine learning application)

Problem:
How to avoid Out of Memory errors during Tika parsing.
One user reported issues with Out of Memory errors (Java heap) with
files when attempting to pre-process documents.
The application essentially parses each file (of varying sizes and
content-type) using Tika and then generates a data dictionary entry for
each document.
The parsing uses a fairly simple Tika method.
StringBuffer bodytext = new StringBuffer();
StringBuffer metatext = new StringBuffer();
int writelimit = -1;
BodyContentHandler content = new BodyContentHandler(writelimit);
(Parser).parse(stream, content, metadata, context);

The body content is placed into bodytext and metadata into metatext.
This is returned to the dataset dictionary routine.

What appears to be happening is some files are throwing
OutOfMemoryErrors during Tika parsing. This fails the entire system and
results in an unresponsive application.

I understand that new BodyContentHandler(writelimit) can limit the
number of chars via the writelimit but simply cutting off the output at
a fixed amount will not reasonably work for this application--as content
might be missed.

Has anyone else needed to handle this issue in a simple desktop, Java
application? I thought of various fixes including using
Runtime.maxMemory, freeMemory, etc. to indirectly detect low memory
situations before parsing but that has not worked well. Also a Java
OutOfMemoryError essentially freezes the system and limits any recovery
ability so this is not nice for the users. I also thought of perhaps
some type of object-to-disk caching but the one implementation that I
saw was for J2EE applications and I am not sure how it could be
integrated into Tika. I also thought of processing files in chunks but
the BodyContentHandler does not seem to handle chunking (with offsets)
right now. NOTE: I have already tweaked the Java heap at runtime via
-Xmx (max heap) and -Xms (initial heap) but some files exceed the
physical RAM in the system.
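
One way to keep the extracted text out of the heap is to stream it to disk as it arrives instead of collecting it in a StringBuffer. The sketch below is only an illustration built on the JDK's SAX classes: WriterTextHandler is a hypothetical name, and the tiny XML document in main stands in for the events a parser would emit. It addresses only the memory held by the extracted text; it does not help if the parser itself exhausts the heap while reading the file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Forwards every character event straight to a Writer instead of buffering it. */
class WriterTextHandler extends DefaultHandler {
    private final Writer out;

    WriterTextHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            out.write(ch, start, length);
        } catch (IOException e) {
            throw new SAXException("could not write extracted text", e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Demo input: a small XML document parsed with the JDK SAX parser.
        Path tmp = Files.createTempFile("extracted", ".txt");
        try (Writer w = Files.newBufferedWriter(tmp)) {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader("<doc><p>hello </p><p>world</p></doc>")),
                    new WriterTextHandler(w));
        }
        System.out.println(Files.size(tmp) + " bytes written to " + tmp);
    }
}
```

A handler like this could be passed to the parse call in place of the in-memory one. As far as I know, BodyContentHandler also has a constructor that takes a Writer, which may make a custom class unnecessary.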

Any ideas?

Shannon

--
Shannon Brown
sbrown@...

"[Courage is] when you know you're licked
before you begin but you begin anyway and
you see it through no matter what. You
rarely win, but sometimes you do."

Atticus Finch in
To Kill a Mockingbird by Harper Lee

Clemens Wyss DEV | 7 Jul 08:17 2014

Determine binary pdf?

What, if at all possible, is the preferred way to determine whether a document (namely a PDF) is of a "binary nature"?

I am extracting text from many PDF user manuals for Lucene indexing, and some of them deliver "absurd binary
terms", which I would like to omit.

Thx
Clemens

Sergey Beryozkin | 2 Jul 14:27 2014

How to index the parsed content effectively

Hi All,

We've been experimenting with indexing the parsed content in Lucene and
our initial attempt was to index the output from
ToTextContentHandler.toString() as a Lucene Text field.

This is unlikely to be effective for large files, so I wonder what
strategies exist for more effective indexing/tokenization of
possibly large content.

Perhaps a custom ContentHandler can index content fragments in a unique
Lucene field every time its characters(...) method is called, something
I've been planning to experiment with.
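
A rough sketch of that idea, using only the JDK's SAX classes: the handler buffers character events and hands fixed-size fragments to a callback, which is where a Lucene field would be created. ChunkingHandler, the chunk size, and the Consumer callback are all assumptions for illustration; no Lucene API is shown.

```java
import java.util.function.Consumer;
import org.xml.sax.helpers.DefaultHandler;

/** Buffers text events and passes them on in fragments of at most chunkSize chars. */
class ChunkingHandler extends DefaultHandler {
    private final int chunkSize;
    private final Consumer<String> sink; // hypothetical hook, e.g. "add one index field"
    private final StringBuilder buffer = new StringBuilder();

    ChunkingHandler(int chunkSize, Consumer<String> sink) {
        this.chunkSize = chunkSize;
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        // Emit as many full chunks as the buffer currently holds.
        while (buffer.length() >= chunkSize) {
            sink.accept(buffer.substring(0, chunkSize));
            buffer.delete(0, chunkSize);
        }
    }

    @Override
    public void endDocument() {
        // Flush whatever is left once parsing finishes.
        if (buffer.length() > 0) {
            sink.accept(buffer.toString());
            buffer.setLength(0);
        }
    }
}
```

Note that fixed-size chunks can split a word in the middle; a real indexer would probably cut at whitespace instead.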

Any feedback will be appreciated.
Cheers, Sergey

Allison, Timothy B. | 1 Jul 16:45 2014

RE: Stack Overflow Question

Good to hear.  Let us know if you have any other questions or when you run into surprises.

 

From: yeshwanth kumar [mailto:yeshwanth43 <at> gmail.com]
Sent: Tuesday, July 01, 2014 10:23 AM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

 

hi tim,

 

I forgot to change the BodyContentHandler to ToXMLContentHandler in RecursiveMetadataParser; I had changed it only in my calling method.

 

Now I am getting the entire document in the structure you specified.

 

thanks a ton.

 

-yeshwanth

 

On Tue, Jul 1, 2014 at 7:16 PM, Allison, Timothy B. <tallison <at> mitre.org> wrote:

Hmmm….

 

When I use the ToXMLHandler on the test doc submitted with TIKA-1329, I see this:

 

<div class="embedded" id="embed4.zip" />

<div class="package-entry"><h1>embed4.zip</h1>

<div class="embedded" id="embed4.txt" />

<div class="package-entry"><h1>embed4.txt</h1>

<p>embed_4</p>

</div>

</div>

</div>

</div>

 

That's a text file inside of a zip file that is itself embedded.  I could see doing some parsing on the XML to scrape out <div class="package-entry"> contents and grab the file name from the <h1> element.
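
The scraping idea could look roughly like the following sketch. It uses only the JDK's SAX parser; the class name PackageEntryScraper and the well-formed sample markup in main are assumptions made for illustration, and real Tika output may be structured differently.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Maps each package-entry file name (its <h1>) to the text of its <p> elements. */
class PackageEntryScraper extends DefaultHandler {
    private final Map<String, String> contentByName = new LinkedHashMap<>();
    private final Deque<Boolean> openDivIsEntry = new ArrayDeque<>(); // one flag per open <div>
    private final Deque<String> openEntryNames = new ArrayDeque<>();  // innermost entry on top
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("div".equals(qName)) {
            boolean isEntry = "package-entry".equals(atts.getValue("class"));
            openDivIsEntry.push(isEntry);
            if (isEntry) openEntryNames.push(""); // name is filled in when its <h1> closes
        } else if ("h1".equals(qName) || "p".equals(qName)) {
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("div".equals(qName)) {
            if (openDivIsEntry.pop()) openEntryNames.pop();
        } else if (openEntryNames.isEmpty()) {
            return; // text outside any package entry is ignored
        } else if ("h1".equals(qName)) {
            openEntryNames.pop();
            openEntryNames.push(text.toString());
            contentByName.putIfAbsent(text.toString(), "");
        } else if ("p".equals(qName)) {
            contentByName.merge(openEntryNames.peek(), text.toString(), String::concat);
        }
    }

    static Map<String, String> scrape(String xml) throws Exception {
        PackageEntryScraper handler = new PackageEntryScraper();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.contentByName;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical, well-formed sample input.
        System.out.println(scrape("<body><div class=\"package-entry\"><h1>a.txt</h1>"
                + "<p>some text</p></div></body>"));
    }
}
```

Container entries such as a nested zip end up with an empty string, because their text belongs to the entries inside them.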

 

If I committed TIKA-1329, would that be of any use to you?  That returns a list of metadata objects, one per embedded file.  The text content of each file can be retrieved from its metadata object with the key "tika:content".

 

Best,

 

        Tim

From: yeshwanth kumar [mailto:yeshwanth43 <at> gmail.com]
Sent: Tuesday, July 01, 2014 9:00 AM


To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

 

output is same even with ToXMLHandler

 

On Tue, Jul 1, 2014 at 5:59 PM, Allison, Timothy B. <tallison <at> mitre.org> wrote:

Did you try the ToXMLHandler?

 

From: yeshwanth kumar [mailto:yeshwanth43 <at> gmail.com]
Sent: Monday, June 30, 2014 4:50 PM


To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

 

hi tim,

 

I tried every way I could think of.

Instead of reading the entire zip file I parsed the individual zip entries,

but even then I hit exceptions such as:

 

 

org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@37ba3e33

Caused by: java.io.IOException: Invalid header signature; read 0x725020706968736E, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document

 

org.apache.tika.exception.TikaException: Unable to unpack document stream

 

org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@6f0ee75a

 

org.apache.tika.exception.TikaException: Error creating OOXML extractor

 

 

Any suggestions regarding these issues?

 

thanks,

yeshwanth

 

 

On Tue, Jul 1, 2014 at 2:00 AM, yeshwanth kumar <yeshwanth43 <at> gmail.com> wrote:

 

hi tim,

 

Thanks for sharing the resources, but I am unable to figure out how to implement this in my code.

What I don't understand is the flow and the recursive steps. When I ran the RecursiveMetadataParser,

it still gave the same kind of output: file names combined with the content of the files.

 

i am totally confused.

 

On Tue, Jul 1, 2014 at 1:29 AM, Allison, Timothy B. <tallison <at> mitre.org> wrote:

Or use the ToXMLHandler and parse the XML?

 

From: Allison, Timothy B. [mailto:tallison <at> mitre.org]
Sent: Monday, June 30, 2014 3:55 PM
To: yeshwanth kumar
Cc: user <at> tika.apache.org
Subject: RE: Stack Overflow Question

 

Might want to look into RecursiveMetadata Parser

http://wiki.apache.org/tika/RecursiveMetadata

 

Or

 

https://issues.apache.org/jira/i#browse/TIKA-1329?issueKey=TIKA-1329&serverRenderedViewIssue=true

From: yeshwanth kumar [mailto:yeshwanth43 <at> gmail.com]
Sent: Monday, June 30, 2014 3:24 PM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

 

hi tim,

 

thanks for quick reply,

 

I changed the content handler to BodyContentHandler and got an exception for the maximum word limit,

so I passed -1 to the BodyContentHandler constructor.

 

Now there is another problem: file names and content are both present in the string returned from handler.toString().

 

How can I map a file name to its content?

 

thanks,

yeshwanth

 

On Tue, Jul 1, 2014 at 12:35 AM, Allison, Timothy B. <tallison <at> mitre.org> wrote:

DefaultHandler is effectively a NullHandler; it doesn't store or do anything.

 

Try BodyContentHandler or ToXMLContentHandler or maybe WriteOutContentHandler.

 

 

If you want to write out each embedded file as a binary, try subclassing EmbeddedResourceHandler.
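
To make the first point concrete, here is a small JDK-only sketch. HandlerDemo and CollectingHandler are hypothetical names, and the one-element XML document stands in for parser output. DefaultHandler ignores every event and inherits Object.toString(), so logging it shows a class name and hash code rather than content.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class HandlerDemo {
    /** Collects character events; toString() returns what was collected. */
    static class CollectingHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Parses a fixed sample document and returns handler.toString(). */
    static String parseWith(DefaultHandler handler) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader("<doc>hello</doc>")), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseWith(new DefaultHandler()));    // class name and hash, no content
        System.out.println(parseWith(new CollectingHandler())); // the collected text
    }
}
```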

 

QUOTE:

i am using Apache Tika 1.5 for parsing the contents present in a zip file,

here's my sample code

    Parser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    context.set(Parser.class, parser);
    ContentHandler handler = new DefaultHandler();
    Metadata metadata = new Metadata();
    InputStream stream = null;
    try {
        stream = TikaInputStream.get(new File(zipFilePath));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    try {
        parser.parse(stream, handler, metadata, context);
        logger.info("Content:\t" + handler.toString());
    } catch (IOException e) {
        e.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (TikaException e) {
        e.printStackTrace();
    } finally {
        try {
            stream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

In the logger statement all I see is org.xml.sax.helpers.DefaultHandler@5bd8e367.

I am missing something but cannot figure out what; looking for some help.

 

-----Original Message-----

From: yeshwanth kumar [mailto:yeshwanth43 <at> gmail.com]

Sent: Monday, June 30, 2014 1:28 PM

To: dev <at> tika.apache.org

Subject: Stack Overflow Question

 

Unable to read zipfile using Apache Tika

http://stackoverflow.com/q/24495504/1899893?sem=2

 

 

 

 

 

 

Allison, Timothy B. | 1 Jul 15:46 2014

RE: Stack Overflow Question

Hmmm….

 

When I use the ToXMLHandler on the test doc submitted with TIKA-1329, I see this:

 

<div class="embedded" id="embed4.zip" />

<div class="package-entry"><h1>embed4.zip</h1>

<div class="embedded" id="embed4.txt" />

<div class="package-entry"><h1>embed4.txt</h1>

<p>embed_4</p>

</div>

</div>

</div>

</div>

 

That's a text file inside of a zip file that is itself embedded.  I could see doing some parsing on the XML to scrape out <div class="package-entry"> contents and grab the file name from the <h1> element.

 

If I committed TIKA-1329, would that be of any use to you?  That returns a list of metadata objects, one per embedded file.  The text content of each file can be retrieved from its metadata object with the key "tika:content".

 

Best,

 

        Tim

Allison, Timothy B. | 1 Jul 14:29 2014

RE: Stack Overflow Question

Did you try the ToXMLHandler?

 

Allison, Timothy B. | 30 Jun 21:54 2014

RE: Stack Overflow Question

Might want to look into RecursiveMetadata Parser

http://wiki.apache.org/tika/RecursiveMetadata

 

Or

 

https://issues.apache.org/jira/i#browse/TIKA-1329?issueKey=TIKA-1329&serverRenderedViewIssue=true


PRANEESH KUMAR | 30 Jun 09:17 2014

Getting IOException: Resetting to invalid mark while resetting the stream

Using Tika 1.5, I am getting java.io.IOException: Resetting to invalid mark when resetting the stream I passed in.

The IOException occurs mostly when parsing pdf and zip formats.

The code snippet I used is:

try {
    // stream is a BufferedInputStream over some sample.pdf
    stream.mark(Integer.MAX_VALUE);
    Tika t = new Tika();
    String content = t.parseToString(stream);
} finally {
    if (stream != null) {
        stream.reset();
    }
}


Has anybody else experienced this? Is it a bug or expected behaviour?
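
One possible explanation, offered as an assumption and not confirmed anywhere in this thread, is that an InputStream keeps only one mark: if the code consuming the stream calls mark() with a smaller limit and then reads past it, the caller's earlier mark is gone. The JDK-only sketch below reproduces that effect with a stand-in consumer. MarkResetDemo and consume are hypothetical names, and nothing here claims to show what Tika does internally.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class MarkResetDemo {
    /** Stand-in for a library that marks, peeks, resets and then reads everything. */
    static void consume(InputStream in) throws IOException {
        in.mark(4);                 // replaces any mark the caller set earlier
        in.read(new byte[4]);       // peek at a few header bytes
        in.reset();
        while (in.read() != -1) { } // read to the end, far past the 4-byte limit
    }

    /** Returns true if the caller's reset() fails after the consumer has run. */
    static boolean callerResetFails() throws IOException {
        // Small 8-byte buffer so the effect shows up with only 64 bytes of data.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(new byte[64]), 8);
        in.mark(Integer.MAX_VALUE);
        consume(in);
        try {
            in.reset();
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    /** Safer pattern: keep the bytes and give every consumer a fresh stream. */
    static int bytesAvailableAfterConsume() throws IOException {
        byte[] data = new byte[64];
        consume(new ByteArrayInputStream(data));
        return new ByteArrayInputStream(data).available();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("caller reset fails: " + callerResetFails());
        System.out.println("bytes in fresh stream: " + bytesAvailableAfterConsume());
    }
}
```

If the content has to be read twice, holding it in a byte array or a file and opening a new stream for each pass avoids depending on mark/reset surviving someone else's reads.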


Thanks,
Praneesh
