Sergey Beryozkin | 2 Jul 14:27 2014

How to index the parsed content effectively

Hi All,

We've been experimenting with indexing the parsed content in Lucene and
our initial attempt was to index the output from
ToTextContentHandler.toString() as a Lucene Text field.

This is unlikely to be effective for large files, so I wonder what
strategies exist for more effective indexing/tokenization of
possibly large content.

Perhaps a custom ContentHandler can index content fragments in a unique
Lucene field every time its characters(...) method is called, something
I've been planning to experiment with.
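
A minimal sketch of that idea, using only the JDK's SAX classes. The class name ChunkingTextHandler, the chunkSize parameter and the Consumer-based sink are assumptions made for illustration; the sink is where a Lucene field would be added, and no Lucene or Tika calls are shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import org.xml.sax.helpers.DefaultHandler;

/** Buffers SAX character events and passes them on in bounded chunks. */
final class ChunkingTextHandler extends DefaultHandler {
    private final int chunkSize;          // flush threshold in chars (assumed parameter)
    private final Consumer<String> sink;  // hypothetical hook, e.g. "add one field per chunk"
    private final StringBuilder buffer = new StringBuilder();

    ChunkingTextHandler(int chunkSize, Consumer<String> sink) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        // Emit full chunks as soon as they exist so memory use stays bounded.
        while (buffer.length() >= chunkSize) {
            sink.accept(buffer.substring(0, chunkSize));
            buffer.delete(0, chunkSize);
        }
    }

    @Override
    public void endDocument() {
        // Flush whatever is left once the parser signals the end of the document.
        if (buffer.length() > 0) {
            sink.accept(buffer.toString());
            buffer.setLength(0);
        }
    }

    public static void main(String[] args) {
        List<String> chunks = new ArrayList<>();
        ChunkingTextHandler handler = new ChunkingTextHandler(4, chunks::add);
        char[] text = "abcdefghij".toCharArray();
        handler.characters(text, 0, 6);
        handler.characters(text, 6, 4);
        handler.endDocument();
        System.out.println(chunks); // [abcd, efgh, ij]
    }
}
```

Note that fixed-size chunks can split a word across two chunks, which would affect tokenization; flushing at whitespace is one possible refinement.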

Any feedback will be appreciated.
Cheers, Sergey

Allison, Timothy B. | 1 Jul 16:45 2014

RE: Stack Overflow Question

Good to hear.  Let us know if you have any other questions or when you run into surprises.

 

From: yeshwanth kumar [mailto:yeshwanth43 <at> gmail.com]
Sent: Tuesday, July 01, 2014 10:23 AM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

 

hi tim,

 

I forgot to change the BodyContentHandler to ToXMLContentHandler in RecursiveMetada; I had changed it only in my calling method.

Now I am getting the entire document in the structure you specified.

 

thanks a ton.

 

-yeshwanth

 

On Tue, Jul 1, 2014 at 7:16 PM, Allison, Timothy B. <tallison <at> mitre.org> wrote:

Hmmm….

 

When I use the ToXMLContentHandler on the test doc submitted with TIKA-1329, I see this:

 

<div class="embedded" id="embed4.zip" />
<div class="package-entry"><h1>embed4.zip</h1>
  <div class="embedded" id="embed4.txt" />
  <div class="package-entry"><h1>embed4.txt</h1>
    <p>embed_4</p>
  </div>
</div>
</div>
</div>

 

That's a text file inside a zip file that is itself embedded. I could see doing some parsing on the XML to scrape out the <div class="package-entry"> contents and grab the file name from the <h1> element.
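
A rough sketch of that scraping approach, using only the JDK's DOM parser. It assumes the handler output is well-formed XML and that each package-entry div starts with an <h1> holding the file name, as in the fragment above. The class name PackageEntryScraper and the wrapping <body> element in the sample are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class PackageEntryScraper {

    /** Maps each package-entry's <h1> name to the text that belongs directly to it. */
    static Map<String, String> scrape(String xhtml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        Map<String, String> result = new LinkedHashMap<>();
        NodeList divs = root.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if (!"package-entry".equals(div.getAttribute("class"))) {
                continue;
            }
            String name = null;
            StringBuilder text = new StringBuilder();
            for (Node n = div.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (!(n instanceof Element)) {
                    continue; // ignore whitespace between elements
                }
                Element child = (Element) n;
                if (name == null && "h1".equals(child.getTagName())) {
                    name = child.getTextContent();
                } else if (!"div".equals(child.getTagName())) {
                    // Nested <div>s are separate embedded entries, so skip them here.
                    if (text.length() > 0) {
                        text.append('\n');
                    }
                    text.append(child.getTextContent());
                }
            }
            if (name != null) {
                result.put(name, text.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "<body>"
                + "<div class=\"embedded\" id=\"embed4.zip\" />"
                + "<div class=\"package-entry\"><h1>embed4.zip</h1>"
                + "<div class=\"embedded\" id=\"embed4.txt\" />"
                + "<div class=\"package-entry\"><h1>embed4.txt</h1>"
                + "<p>embed_4</p>"
                + "</div></div></body>";
        System.out.println(scrape(sample)); // {embed4.zip=, embed4.txt=embed_4}
    }
}
```

The container entry (embed4.zip) maps to an empty string because its only content is the nested entry; whether that is the desired behaviour depends on how the names are used downstream.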

 

If I committed TIKA-1329, would that be of any use to you? That returns a list of metadata objects, one per embedded file. The text content of each file can be retrieved from its metadata object with this key: "tika:content".

 

Best,

 

        Tim

From: yeshwanth kumar [mailto:yeshwanth43 <at> gmail.com]
Sent: Tuesday, July 01, 2014 9:00 AM


To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

 

The output is the same even with ToXMLContentHandler.

 

On Tue, Jul 1, 2014 at 5:59 PM, Allison, Timothy B. <tallison <at> mitre.org> wrote:

Did you try the ToXMLContentHandler?

 

From: yeshwanth kumar [mailto:yeshwanth43 <at> gmail.com]
Sent: Monday, June 30, 2014 4:50 PM


To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

 

hi tim,

 

I tried every way I could think of. Instead of reading the entire zip file, I parsed the individual zip entries, but even then I hit exceptions such as:

 

 

org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@37ba3e33
Caused by: java.io.IOException: Invalid header signature; read 0x725020706968736E, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document

org.apache.tika.exception.TikaException: Unable to unpack document stream

org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@6f0ee75a

org.apache.tika.exception.TikaException: Error creating OOXML extractor

 

 

Any suggestions regarding these issues?

 

thanks,

yeshwanth

 

 

On Tue, Jul 1, 2014 at 2:00 AM, yeshwanth kumar <yeshwanth43 <at> gmail.com> wrote:

 

hi tim,

 

Thanks for sharing the resources, but I am unable to figure out how to implement this in my code. What I don't understand is the flow and the recursive steps. When I ran the RecursiveMetadataParser, it still gave the same kind of output: file names combined with the content of the files.

I am totally confused.

 

On Tue, Jul 1, 2014 at 1:29 AM, Allison, Timothy B. <tallison <at> mitre.org> wrote:

Or use the ToXMLContentHandler and parse the XML?

 

From: Allison, Timothy B. [mailto:tallison <at> mitre.org]
Sent: Monday, June 30, 2014 3:55 PM
To: yeshwanth kumar
Cc: user <at> tika.apache.org
Subject: RE: Stack Overflow Question

 

Might want to look into RecursiveMetadata Parser

http://wiki.apache.org/tika/RecursiveMetadata

 

Or

 

https://issues.apache.org/jira/i#browse/TIKA-1329?issueKey=TIKA-1329&serverRenderedViewIssue=true

From: yeshwanth kumar [mailto:yeshwanth43 <at> gmail.com]
Sent: Monday, June 30, 2014 3:24 PM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

 

hi tim,

 

Thanks for the quick reply.

I changed the ContentHandler to BodyContentHandler and got an exception about exceeding the maximum character limit, so I passed -1 to the BodyContentHandler constructor.

Now there is another problem: both file names and content are present in the string returned from handler.toString().

How can I map a file name to its content?

 

thanks,

yeshwanth

 

On Tue, Jul 1, 2014 at 12:35 AM, Allison, Timothy B. <tallison <at> mitre.org> wrote:

DefaultHandler is effectively a NullHandler; it doesn't store or do anything.

 

Try BodyContentHandler or ToXMLContentHandler or maybe WriteOutContentHandler.

 

 

If you want to write out each embedded file as a binary, try subclassing EmbeddedResourceHandler.
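
To illustrate the point about DefaultHandler, here is a small sketch that uses only the JDK's SAX classes. TextCollector is a hypothetical stand-in written for this example, not Tika's BodyContentHandler; it only shows why a handler has to store the character events itself before toString() can return anything useful.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class HandlerDemo {

    /** Minimal text-collecting handler (illustrative only). */
    static final class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // keep what the parser reports
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Parses the XML with the given handler and returns handler.toString(). */
    static String parseWith(DefaultHandler handler, String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc><p>hello</p><p>world</p></doc>";
        // DefaultHandler ignores every event, so toString() is Object's default.
        System.out.println(parseWith(new DefaultHandler(), xml));
        // The collecting handler returns the accumulated text.
        System.out.println(parseWith(new TextCollector(), xml)); // helloworld
    }
}
```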

 

QUOTE:


i am using Apache Tika 1.5 for parsing the contents present in a zip file,

here's my sample code

    Parser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    context.set(Parser.class, parser);
    ContentHandler handler = new DefaultHandler();
    Metadata metadata = new Metadata();
    InputStream stream = null;
    try {
        stream = TikaInputStream.get(new File(zipFilePath));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    try {
        parser.parse(stream, handler, metadata, context);
        logger.info("Content:\t" + handler.toString());
    } catch (IOException e) {
        e.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (TikaException e) {
        e.printStackTrace();
    } finally {
        try {
            stream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

In the logger statement all I see is org.xml.sax.helpers.DefaultHandler@5bd8e367

I am missing something but am unable to figure it out; looking for some help.

 

-----Original Message-----

From: yeshwanth kumar [mailto:yeshwanth43 <at> gmail.com]

Sent: Monday, June 30, 2014 1:28 PM

To: dev <at> tika.apache.org

Subject: Stack Overflow Question

 

Unable to read zipfile using Apache Tika

http://stackoverflow.com/q/24495504/1899893?sem=2

Allison, Timothy B. | 1 Jul 15:46 2014

RE: Stack Overflow Question

Hmmm….

 

When I use the ToXMLContentHandler on the test doc submitted with TIKA-1329, I see this:

 

<div class="embedded" id="embed4.zip" />
<div class="package-entry"><h1>embed4.zip</h1>
  <div class="embedded" id="embed4.txt" />
  <div class="package-entry"><h1>embed4.txt</h1>
    <p>embed_4</p>
  </div>
</div>
</div>
</div>

 

That's a text file inside a zip file that is itself embedded. I could see doing some parsing on the XML to scrape out the <div class="package-entry"> contents and grab the file name from the <h1> element.

 

If I committed TIKA-1329, would that be of any use to you? That returns a list of metadata objects, one per embedded file. The text content of each file can be retrieved from its metadata object with this key: "tika:content".

 

Best,

 

        Tim


Allison, Timothy B. | 1 Jul 14:29 2014

RE: Stack Overflow Question

Did you try the ToXMLContentHandler?

 


Allison, Timothy B. | 30 Jun 21:54 2014

RE: Stack Overflow Question

Might want to look into RecursiveMetadata Parser

http://wiki.apache.org/tika/RecursiveMetadata

 

Or

 

https://issues.apache.org/jira/i#browse/TIKA-1329?issueKey=TIKA-1329&serverRenderedViewIssue=true


PRANEESH KUMAR | 30 Jun 09:17 2014

Getting IOException: Resetting to invalid mark while resetting the stream

Using Tika 1.5, I am getting java.io.IOException: Resetting to invalid mark when resetting the stream I passed in.

The IOException occurs mostly when parsing PDF and zip formats.

The code snippet I used is:


try {
    // I have set the stream as BufferedInputStream of some sample.pdf
    stream.mark(Integer.MAX_VALUE);
    Tika t = new Tika();
    String content = t.parseToString(stream);
} finally {
    if (stream != null) {
        stream.reset();
    }
}


Has anybody else experienced this? Is it a bug or expected behaviour?
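
One possible explanation, offered as an assumption and not confirmed anywhere in this thread: a BufferedInputStream remembers only one mark, so if the code consuming the stream calls mark() itself with a smaller read limit, the caller's earlier mark is replaced and can become invalid once more data is read. The sketch below reproduces that effect with JDK classes only (no Tika involved); the class name MarkResetDemo and the buffer sizes are made up for illustration. It is also worth checking whether parseToString closes the stream it is given, since a closed stream cannot be reset either.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class MarkResetDemo {

    /** Returns the message of the IOException thrown by reset(), or null if reset() worked. */
    static String resetAfterNestedMark() {
        // Small 8-byte buffer so the effect shows up with little data.
        try (InputStream stream =
                new BufferedInputStream(new ByteArrayInputStream(new byte[64]), 8)) {
            stream.mark(Integer.MAX_VALUE); // the caller's mark
            stream.mark(4);                 // a consumer re-marks; this replaces the caller's mark
            while (stream.read() != -1) {
                // read well past the 4-byte limit of the second mark
            }
            stream.reset();
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    /** Workaround sketch: keep the bytes and open a fresh stream for each consumer. */
    static InputStream freshCopy(byte[] data) {
        return new ByteArrayInputStream(data);
    }

    public static void main(String[] args) throws IOException {
        System.out.println("reset() failed with: " + resetAfterNestedMark());

        byte[] data = {1, 2, 3};
        InputStream first = freshCopy(data);   // would be handed to the parser
        InputStream second = freshCopy(data);  // independent of the first stream
        first.read();
        System.out.println("second stream still starts at: " + second.read());
    }
}
```

Buffering the whole document in memory is only reasonable for files of moderate size; for large uploads a temporary file would play the same role.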


Thanks,
Praneesh

Daniel Gibby | 27 Jun 19:08 2014

IOException should be TikaException?

Using the latest 1.5 Tika release (not snapshot), I get an IOException 
when a PDF doesn't have certain headers.

java.io.IOException: Error: Header doesn't contain versioninfo
     at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:335)
     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:177)
     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)
     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)
     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)
...

Since my code deals with file uploads, IOExceptions can easily arise
from other causes.
Shouldn't this be a TikaException of some type, or at least something
other than just an IOException?

-- 

Thanks,

Daniel Gibby

Krüger, Sven | 25 Jun 15:22 2014

getLanguage returns "lt" if pdf-file contains only images

Hello,

 

if a pdf-file only contains graphics without extractable text, getLanguage returns "lt".

 

Currently I can filter that because the length of the extracted content is 2 * metadata.get("xmpTPg:NPages") - but I don't think this is supposed to work that way.

 

Is there any way to get a value that indicates the probability of the detected language, or another way to get a proper result (in this case, no language)?
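
One possible guard, sketched with plain JDK code: skip language detection when the extracted text contains too few letters to be meaningful. The class name LanguageGuard and the threshold of 20 letters are arbitrary assumptions, not values taken from Tika.

```java
final class LanguageGuard {

    /** True if the content holds at least minLetters alphabetic characters. */
    static boolean hasEnoughText(String content, int minLetters) {
        if (content == null) {
            return false;
        }
        long letters = content.codePoints().filter(Character::isLetter).count();
        return letters >= minLetters;
    }

    public static void main(String[] args) {
        // Image-only PDFs tend to yield little more than whitespace.
        System.out.println(hasEnoughText("\n\n\n\n", 20));                                     // false
        System.out.println(hasEnoughText("This sentence has plenty of letters in it.", 20)); // true
    }
}
```

If I recall the 1.x API correctly, LanguageIdentifier also exposes isReasonablyCertain(), which may be worth checking before trusting the result.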

Regards Sven

 

george | 17 Jun 10:03 2014

Detecting html file which is utf-16 encoded

I want to be able to detect when a file is html even when it is utf-16 encoded. I can see from the default tika-mimetypes.xml that normally files with a BOM will be detected as text/plain, which is the case.  I have tried creating my own versions of the html and text mime types in a custom-mimetypes.xml and these successfully overwrite the original ones but changing the priority of these does not force the utf-16 files to be identified as html. Even removing the BOM matches completely from the text mimetype in the custom-mimetypes.xml does not work. 

So I tried another approach: removing the BOM from the input stream before detecting. However, the utf-16 file is still not recognised as html, despite the text having multiple matches. It seems that the detect method does not know which encoding the file uses. Is there a way to tell a detector what encoding a file is in to aid detection?
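
As a possible pre-check outside Tika's detector, the sketch below looks for a UTF-16 BOM, decodes the start of the file accordingly, and searches for an HTML marker. It uses JDK classes only; the class name Utf16HtmlSniffer and the two markers it searches for are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class Utf16HtmlSniffer {

    /** Returns the charset indicated by a UTF-16 BOM, or null if there is none. */
    static Charset utf16FromBom(byte[] prefix) {
        if (prefix.length < 2) {
            return null;
        }
        int b0 = prefix[0] & 0xFF;
        int b1 = prefix[1] & 0xFF;
        if (b0 == 0xFE && b1 == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (b0 == 0xFF && b1 == 0xFE) {
            return StandardCharsets.UTF_16LE; // note: UTF-32LE starts with the same two bytes
        }
        return null;
    }

    /** True if the bytes start with a UTF-16 BOM and the decoded text contains an HTML marker. */
    static boolean looksLikeUtf16Html(byte[] prefix) {
        Charset charset = utf16FromBom(prefix);
        if (charset == null) {
            return false;
        }
        int length = (prefix.length - 2) & ~1; // decode whole 2-byte code units only
        String head = new String(prefix, 2, length, charset).toLowerCase(Locale.ROOT);
        return head.contains("<html") || head.contains("<!doctype html");
    }

    public static void main(String[] args) {
        byte[] body = "<html><body>hi</body></html>".getBytes(StandardCharsets.UTF_16LE);
        byte[] withBom = new byte[body.length + 2];
        withBom[0] = (byte) 0xFF;
        withBom[1] = (byte) 0xFE;
        System.arraycopy(body, 0, withBom, 2, body.length);
        System.out.println(looksLikeUtf16Html(withBom)); // true
        System.out.println(looksLikeUtf16Html(body));    // false: no BOM present
    }
}
```

A caller could run this on the first few kilobytes of the file and fall back to the normal detector when it returns false.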

Thanks

George

UTF-16 encoded HTML files detected as text/plain

I can successfully detect valid html files in other encodings, but when a valid file is encoded as UTF-16 it is identified as text/plain. I can see that in tika-mimetypes.xml the UTF-16 BOMs are used to identify files as text/plain with a priority of 20, and *.html identification is set to a priority of 40. I'm not sure why this is the case.

 

I see the advice here is not to alter tika-mimetypes.xml (and indeed that would be a pain to maintain), and that custom-mimetypes.xml should be used for new file types. However, I want to override the definition of the existing text/plain type, either to reduce its priority or to remove the UTF-16 magic matches, so that my valid UTF-16 html files are correctly identified.

 

Is this possible or is there a better way to achieve my aim of correctly identifying my UTF-16 html files as I can with those in other encodings?

 

George

 

 

 

 



 

David Meikle | 16 Jun 06:44 2014

ApacheCon CFP closes June 25

Dear Apache Tika Enthusiast,

As you may be aware, ApacheCon will be held this year in Budapest, on
November 17-23. (See http://apachecon.eu for more info.)

The Call For Papers for that conference is still open, but will be
closing soon. We need your talk proposals, to represent Apache Tika at
ApacheCon. We need all kinds of talks - deep technical talks, hands-on
tutorials, introductions for beginners, or case studies about the
awesome stuff you're doing with Apache Tika.

Please consider submitting a proposal, at
http://events.linuxfoundation.org//events/apachecon-europe/program/cfp

Thanks!
