Lewis John Mcgibbney | 16 Oct 06:55 2014

[ANNOUNCEMENT] crawler-commons 0.5 is released

15th October 2014 - crawler-commons 0.5 is released

We are glad to announce the 0.5 release of Crawler Commons. This release mainly improves Sitemap parsing and upgrades to Apache Tika 1.6.

See the CHANGES.txt file included with the release for a full list of details. Additionally, the Java documentation can be found here.

We suggest that all users upgrade to this version. The Crawler Commons project artifacts are released as Maven artifacts and can be found at Maven Central.

Thank you


On Behalf of the Crawler Commons Team

Kamil Żyta | 14 Oct 12:55 2014

External parser

I want to use an external parser, but I can't find a comprehensive how-to or tutorial on the web.
I only found the parser/external/tika-external-parsers.xml sample configuration,
but I don't know how to register/enable this parser among the Tika parsers.

I would be thankful for any help.


imyuka | 14 Oct 08:04 2014

proceed with the limitation of character length

Hi all,

    I get a 'more than 100000 characters' exception while processing a document. To avoid this, I can either use the abridged text or increase the maximum limit. How can I increase the limit, or retrieve only the first 100000 characters of the document without an exception being thrown?
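
A note on the two options asked about here. As far as the Tika API goes, BodyContentHandler has a constructor taking an int write limit, and passing -1 disables the limit. For the second option (keep only the first N characters), the idea can be sketched with JDK classes only. LimitedTextHandler and firstChars below are hypothetical names for illustration, not Tika classes, and the sketch parses plain XML rather than a real document:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): collects text up to a fixed limit.
class LimitedTextHandler extends DefaultHandler {
    private final int limit;
    private final StringBuilder text = new StringBuilder();
    private boolean truncated = false;

    LimitedTextHandler(int limit) {
        this.limit = limit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        int remaining = limit - text.length();
        if (length > remaining) {
            // Keep what still fits, then abort the parse.
            text.append(ch, start, remaining);
            truncated = true;
            throw new SAXException("text limit of " + limit + " characters reached");
        }
        text.append(ch, start, length);
    }

    // Returns at most `limit` characters of text; the limit exception is swallowed.
    static String firstChars(String xml, int limit) {
        LimitedTextHandler handler = new LimitedTextHandler(limit);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // Only ignore the exception we raised ourselves.
            if (!handler.truncated) {
                throw new IllegalStateException(e);
            }
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(firstChars("<d><p>Hello</p><p>World</p></d>", 7));
    }
}
```

The same pattern (catch the limit exception, then read whatever the handler has collected so far) should carry over to a Tika handler, but that is an assumption to verify against the Tika version in use.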


Lewis John Mcgibbney | 12 Oct 02:30 2014

Problematic PDF

Hi Folks,
I have a problematic PDF which keeps crashing my Nutch crawl.
I am trying to get all data from the PDF, so content is not truncated at all.
Can someone please try to see if they have any issues parsing this document with Tika 1.6?
I have tried it locally, and it seems OK. If I can confirm this with some other folks then I can isolate this to my Nutch crawl.
Thank you

imyuka | 9 Oct 14:22 2014

Formatted Content Extraction and Title Detection

Hi all,

    Here is my problem: I have extracted plain text from a series of doc(x) documents, and their titles via the "dc:title" metadata key, but I'm not sure this is the right way to obtain the title of a document. In many cases the title inside a document is the text with the largest font size and bold style, which I would like to use to extract the actual title. However, I have no idea how to get formatted content or how to detect font size and bold style. Please let me know if I am missing something.
    Thank you very much!

Can Duruk | 9 Oct 02:59 2014

Customizing Metadata Keys

Hi all,

My question is regarding setting the metadata keys coming from the parsers to my own keys.

For my application, I am using Tika to extract the metadata for a bunch of files. I am using the embedded HTTP server, which I modified for my needs to return JSON instead of CSV. (Hoping to submit that as a patch soon.)

However, the keys in the JSON are all in different formats and I need them to conform to my own requirements.

So for example in this redacted example this is what I get:

  "meta:author": "Maxim Valyanskiy",

However, what I need is this:

  "my_author_key": "Maxim Valyanskiy",

I have a bunch (several dozens) of these modifications I need to make on the metadata keys in various places.

What is the best way to approach this problem? I've thought about extending each of the parsers, but that seems a bit too decentralized. Ideally it'd be something I can manage in a single file.

Thanks a lot in advance.
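
One way to keep all renames in a single place is a post-processing step applied after extraction, instead of touching the parsers. The sketch below uses plain JDK maps only. MetadataKeyMapper is a hypothetical name, the rename table could equally be loaded from a properties file, and single-valued keys are assumed for brevity (Tika metadata can hold several values per key):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Tika): renames metadata keys from one table.
class MetadataKeyMapper {
    private final Map<String, String> renames;

    MetadataKeyMapper(Map<String, String> renames) {
        this.renames = renames;
    }

    // Returns a copy of the input with mapped keys renamed; unmapped keys pass through.
    Map<String, String> apply(Map<String, String> extracted) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : extracted.entrySet()) {
            String key = renames.getOrDefault(entry.getKey(), entry.getKey());
            out.put(key, entry.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> renames = new LinkedHashMap<>();
        renames.put("meta:author", "my_author_key");

        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("meta:author", "Maxim Valyanskiy");

        System.out.println(new MetadataKeyMapper(renames).apply(extracted));
    }
}
```

In the modified server, the input map would be filled from the Metadata object (its names() and get(name) methods) just before the JSON is written.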

Harsh Singh | 29 Sep 04:14 2014

Tika - XHTML to Json

Hi All,

I am trying to build an XHTML to JSON parser in Tika. After some research, I decided to go with ToTextContentHandler to parse the XHTML data and convert it to JSON. For this I am mostly overriding the methods of ToTextContentHandler to create the custom JSON files.
So I was wondering whether this approach is appropriate, or whether I should look at a more generic approach such as implementing ContentHandler directly, or try some other ContentHandler implementations?
Any suggestions or comments are highly appreciated.

Best Regards,
Harsh Singh
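
For comparison, here is a minimal sketch of the more generic route: a handler built directly on the SAX callbacks, with no Tika base class. Tika parsers report XHTML as SAX events to any org.xml.sax.ContentHandler, so a handler of this shape could in principle be passed to a parser. XhtmlToJsonHandler, its flat output format (one JSON object per text run, tagged with the enclosing element), and the hand-written escaping are all assumptions made for illustration. The demo feeds it from the JDK's own SAX parser:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: emits [{"element":..., "text":...}, ...] for each text run.
class XhtmlToJsonHandler extends DefaultHandler {
    private final StringBuilder json = new StringBuilder("[");
    private final StringBuilder text = new StringBuilder();
    private final Deque<String> open = new ArrayDeque<>();
    private boolean first = true;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flush();            // text seen so far belongs to the enclosing element
        open.push(qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flush();
        open.pop();
    }

    // Writes the pending text (if not blank) as one JSON object.
    private void flush() {
        String value = text.toString().trim();
        text.setLength(0);
        if (value.isEmpty()) {
            return;
        }
        if (!first) {
            json.append(',');
        }
        first = false;
        String element = open.isEmpty() ? "" : open.peek();
        json.append("{\"element\":\"").append(escape(element))
            .append("\",\"text\":\"").append(escape(value)).append("\"}");
    }

    String toJson() {
        return json + "]";
    }

    // Minimal JSON string escaping: quotes, backslashes, control characters.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                out.append('\\').append(c);
            } else if (c < 0x20) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    static String convert(String xhtml) {
        XhtmlToJsonHandler handler = new XhtmlToJsonHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.toJson();
    }

    public static void main(String[] args) {
        System.out.println(convert("<html><body><h1>Title</h1><p>Hello <b>world</b>!</p></body></html>"));
    }
}
```

A real converter would more likely hand the values to a JSON library than build strings by hand; the point here is only that the element structure is available in the callbacks, which a text-oriented handler discards.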

Vineet Ghatge Hemantkumar | 26 Sep 05:05 2014

Apache Tika - JSON?

Hello all, 

I was wondering if there is any built-in parser that helps with converting XHTML to JSON.

My research showed that there is one named org.apache.io.json, which has just one method implemented. I also tried the GJSON library for this, but it does not seem to work with Tika. Any suggestions will be appreciated.

Nick Burch | 23 Sep 00:21 2014

Tika at ApacheCon Europe - 2 months time!

Hi All

It's only 2 months to go until ApacheCon Europe in Budapest. I'm 
simultaneously excited by all the great Tika stuff going on, and worried 
by how many talks I need to finish writing...

As usual for an ApacheCon, we've a number of talks about Tika going on, 
and almost certainly a hackathon and/or meetup one evening. There are also 
lots of related talks, covering technologies that Tika builds on, and 
ones you can use Tika with. For a full schedule, see:

If you're based in Europe, and involved in Tika, we'd love to see you 
there in November!

For those who can't make it to Budapest, we hope to have a similar level 
of talks at ApacheCon US 2015 in Austin, Texas in April, so save the date! 


Mugat Gurkowsky | 13 Sep 15:05 2014

very large xml-file parsing


I am trying to use Tika in combination with Lucene to parse and index very large XML files. So far without success, because of memory limitations. Tika's BodyContentHandler seems to copy the whole content into memory, which doesn't work because the files are several gigabytes in size.

Is there a way of getting around this problem? Can I use any other handler which can deal with streams?

Thanks in advance
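
On the streaming question: the general idea is to hand text onward in bounded pieces as SAX events arrive, instead of collecting it in one buffer. On the Tika side, BodyContentHandler also has a constructor that takes a Writer, and Tika.parse(InputStream) returns a Reader, both of which avoid the in-memory string. The sketch below shows the principle with JDK classes only. ChunkingTextHandler and its chunk size are hypothetical, and the sink would be whatever consumes the text (for example an indexing step):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: forwards text in fixed-size chunks, holding at most
// one chunk plus one SAX character event in memory.
class ChunkingTextHandler extends DefaultHandler {
    private final int chunkSize;
    private final Consumer<String> sink;
    private final StringBuilder buffer = new StringBuilder();

    ChunkingTextHandler(int chunkSize, Consumer<String> sink) {
        this.chunkSize = chunkSize;
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        // Emit every complete chunk and keep only the remainder.
        while (buffer.length() >= chunkSize) {
            sink.accept(buffer.substring(0, chunkSize));
            buffer.delete(0, chunkSize);
        }
    }

    @Override
    public void endDocument() {
        // Emit the final, possibly shorter, chunk.
        if (buffer.length() > 0) {
            sink.accept(buffer.toString());
            buffer.setLength(0);
        }
    }

    // For real files, build the InputSource from an InputStream so the
    // document is never loaded as a whole.
    static void stream(InputSource source, int chunkSize, Consumer<String> sink) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(source, new ChunkingTextHandler(chunkSize, sink));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        InputSource source = new InputSource(new StringReader("<doc><a>abcdef</a><b>ghij</b></doc>"));
        stream(source, 4, chunk -> System.out.println(chunk));
    }
}
```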

Devaraja Swami | 9 Sep 04:12 2014

HTML parsing error with <a> tag inside <h1> tag

In the following HTML document, the <a> tag is inside the <h1> tag, which is inside the <p> tag:
<!DOCTYPE html>
<div><h1><a href="http://www.google.com">GOOGLE!</a></h1></div>
But when I parse it with Tika 1.5 HtmlParser, 
it adds both the <a> and <h1> tag nodes as direct children of the <p> tag.

The same error happens when I replace the <h1> tag with other header tags <h2> ... <h5>, and/or the <p> tag with a <div> or <span> tag.
[Haven't experimented with other replacements].

This seems to be a basic issue.
Any help would be deeply appreciated.