Ben Gould | 30 Jul 20:34 2015

Charset Encoding

Hi all,

I'm working on dynamically parsing a large set of Farsi documents 
(mostly txt, pdf, doc and docx), and am having issues when I come across 
text files encoded in CP1256 (an old windows-arabic format).

I'm using the Tika facade to return a Reader implementation (wrapping 
the input in a TikaInputStream) and then tokenizing the Reader using a 
Lucene Analyzer.  However, whenever it hits CP1256 encoded text files, 
it tries to decode them as (Content-Type -> text/plain; 
charset=x-MacCyrillic).  In the input metadata, I do provide the 
following properties:

Content-Encoding: CP1256
Content-Type: text/plain; charset=CP1256
Content-Type-Hint: text/plain; charset=CP1256

Any ideas on how I can force the TXTParser to use CP1256?
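If the files are known to be CP1256 up front, one way to sidestep Tika's charset detection entirely is to decode the stream yourself and hand the resulting Reader straight to the Lucene Analyzer. A minimal sketch using only the JDK (no Tika involved), assuming your JDK ships the extended charsets; "windows-1256" is the JDK's canonical name for the CP1256 code page:

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

public class Cp1256Decode {
    // "windows-1256" is the JDK's canonical name for the CP1256 code page.
    static final Charset CP1256 = Charset.forName("windows-1256");

    // Decode raw CP1256 bytes into a Java String.
    static String decode(byte[] raw) {
        return new String(raw, CP1256);
    }

    // Build a Reader that a Lucene Analyzer can tokenize directly,
    // bypassing Tika's charset detection for known-encoding .txt files.
    static Reader readerFor(InputStream in) {
        return new BufferedReader(new InputStreamReader(in, CP1256));
    }

    public static void main(String[] args) {
        // 0xC7 and 0xC8 are ARABIC LETTER ALEF (U+0627) and BEH (U+0628) in CP1256
        byte[] raw = {(byte) 0xC7, (byte) 0xC8};
        System.out.println(decode(raw));
    }
}
```

This only helps for the plain-text files where the encoding is already known; whether the charset hints in the Metadata should override detection inside TXTParser is a separate question.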


Stephan Mühlstrasser | 30 Jul 12:10 2015

Overriding built-in parser for TikaCLI with Tika 1.9


I have an existing example of how to override a built-in parser with the 
ServiceRegistry mechanism. The file 
"META-INF/services/org.apache.tika.parser.Parser" lists my desired 
parser class, and the directory containing the META-INF directory is 
added to the class path.

With Tika 1.0 this worked fine for both TikaCLI and direct use of the 
Tika API.

Now I tested with Tika 1.9, and the TikaCLI class seems to ignore the 
ServiceRegistry mechanism. The only way I could get TikaCLI to use my 
externally supplied parser was by creating a Tika XML configuration 
file, and by specifying that one with the "--config" option.

Is that intended behavior for TikaCLI now? It looks like the "--config" 
option is fairly new.

And is there any documentation on the syntax of the Tika XML 
configuration file? I was able to find some examples of configuration 
files that I used as blueprints, but I could not find a description of 
the XML syntax on the Tika website.
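For reference, a minimal configuration file of the shape accepted by the "--config" option in Tika 1.9 looks roughly like the following. There is no published schema that I know of, so the element names below are taken from circulating example files; com.example.MyParser and the mime type are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- keep all the built-in parsers -->
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <!-- override handling of one mime type with an external parser -->
    <parser class="com.example.MyParser">
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>
```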


Christian Wolfe | 23 Jul 04:11 2015

TesseractOCRParser on Linux

Hi folks,

It looks to me like TesseractOCRParser doesn't work on Linux unless the Tesseract executable and the 'tessdata' folder are in the same location on the filesystem. This makes sense in a Windows environment (where everything is installed together by default), but on Linux, package managers (*and* source installations) tend to split the files up across the filesystem.

I believe this could be alleviated by adding a second property to TesseractOCRConfig that points to the 'tessdata' folder separately from the Tesseract executable. That, or a bit of documentation clarifying that the files need to be together.
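For what it's worth, in Tika 1.x the TesseractOCRParser defaults can be overridden by placing a TesseractOCRConfig.properties file on the classpath under org/apache/tika/parser/ocr/. A sketch (property names as found in the 1.x sources, so double-check against your version; the tessdataPath key is the *proposed* addition, not an existing one):

```properties
# org/apache/tika/parser/ocr/TesseractOCRConfig.properties
tesseractPath=/usr/bin/
language=eng
timeout=120
# hypothetical -- the separate-tessdata property proposed above does not exist yet:
# tessdataPath=/usr/share/tesseract-ocr/tessdata/
```

Until something like tessdataPath exists, setting the TESSDATA_PREFIX environment variable (which the tesseract binary itself honors) is the usual way to point at a tessdata directory that lives apart from the executable.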

I would be more than willing to work on either solution, but only if the team considers it worthwhile.

Anyway, thanks for making a great library, and for taking time to read this.
David Meikle | 16 Jul 19:09 2015

Travel Assistance for ACEU closes tomorrow!

Hi Folks,

A little reminder from the Travel Assistance Committee on the ApacheCon EU Travel Assistance Deadline.

Hope to see some fellow Tika community members in Budapest.



Hi all,

This is a reminder that currently applications are open for Travel Assistance to go to ApacheCon EU Budapest 
this coming September/October.

Applications close tomorrow night so if you have not applied yet and intend to do so, please act now!

For those who have submitted talks for this event and have not yet heard back on whether they will be 
accepted: if you intend to apply for assistance based on getting your talk accepted, please DO 
apply for assistance now anyway; should your talk not be accepted, your assistance application can be 
withdrawn later.
See for more info. 

Thanks and hope to see you all in Budapest!

Gav… (On behalf of the Travel Assistance Committee)
Allison, Timothy B. | 15 Jul 13:38 2015

robust Tika and Hadoop



  I’d like to fill out our Wiki a bit more on using Tika robustly within Hadoop.  I’m aware of Behemoth [0], Nanite [1] and Morphlines [2].  I haven’t looked carefully into these packages yet.


  Does anyone have any recommendations for specific configurations/design patterns that will defend against OOM errors and permanent hangs within Hadoop?
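One pattern that helps with permanent hangs (though not OOM, which really needs process isolation along the lines of a forked JVM or tika-server) is to time-box each parse in its own thread and abandon it on timeout. A stdlib-only sketch; the Callable here is a stand-in for a real parser.parse(...) call on one file:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeBoxedTask {
    // Run a task with a hard time limit; on timeout, cancel it and move on.
    // In a Tika-in-Hadoop setting the Callable would wrap parser.parse(...)
    // for a single input document.
    static <T> T runWithTimeout(Callable<T> task, long seconds, T fallback) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<T> future = pool.submit(task);
        try {
            return future.get(seconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);   // interrupt the hung parse
            return fallback;
        } catch (Exception e) {
            return fallback;       // parse error: record it and skip the file
        } finally {
            pool.shutdownNow();    // don't leak a thread per record
        }
    }

    public static void main(String[] args) {
        String ok = runWithTimeout(() -> "parsed", 5, "skipped");
        String hung = runWithTimeout(() -> {
            Thread.sleep(60_000);  // simulate a parser that never returns
            return "parsed";
        }, 1, "skipped");
        System.out.println(ok + " / " + hung); // parsed / skipped
    }
}
```

Note the caveat: cancel(true) only interrupts; a parser stuck in a tight non-interruptible loop will keep burning a thread until the JVM exits, which is why a separate process per batch is the more robust defense.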


  Thank you!

Nazar Hussain | 15 Jul 08:37 2015

Per Page Document Content


I am using Apache Tika 1.9 and content extraction is working great.

The problem I am facing is with pages. I can extract the total page count from the document metadata, but I can't find any way to extract the content of each individual page.

I have searched a lot and tried some solutions suggested by other users, but they did not work for me, possibly because of the newer Tika version.

Please suggest a solution or a direction for further research.

I will be thankful. 
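One direction: Tika's PDF parser emits each page as a `<div class="page">` element in its XHTML output, so a SAX ContentHandler can collect content per page. A sketch using only the JDK's SAX classes (it assumes non-namespace-aware parsing and that page divs are not nested; with Tika you would pass the handler to parser.parse(...) instead of running the standalone main below):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Collects the text of each <div class="page"> element separately.
public class PageSplittingHandler extends DefaultHandler {
    final List<String> pages = new ArrayList<>();
    private StringBuilder current;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("div".equals(qName) && "page".equals(atts.getValue("class"))) {
            current = new StringBuilder();   // a new page starts
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("div".equals(qName) && current != null) {
            pages.add(current.toString().trim());
            current = null;                  // page finished
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) current.append(ch, start, length);
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                + "<div class=\"page\">first page text</div>"
                + "<div class=\"page\">second page text</div>"
                + "</body></html>";
        PageSplittingHandler h = new PageSplittingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), h);
        System.out.println(h.pages); // [first page text, second page text]
    }
}
```

This relies on the page-div markup, which holds for the PDF parser; other formats (doc, docx) may not carry a page structure in the XHTML at all.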

Nazar Hussain

Inconsistent (buggy) behavior when using tika-server

Hi Folks,
I am using Tika trunk (1.10-SNAPSHOT) and posting documents to it. An example would be the following:

curl -T MOD09GA.A2014010.h30v12.005.2014012183944.vegetation_fraction.tif  http://localhost:9998/meta --header "Accept: application/json"

curl -T MOD09GA.A2014010.h30v12.005.2014012183944.vegetation_fraction.tif  http://localhost:9998/meta --header "Accept: application/rdf+xml"

curl -T MOD09GA.A2014010.h30v12.005.2014012183944.vegetation_fraction.tif  http://localhost:9998/meta --header "Accept: text/csv"

I am using a Python script to iterate through all the files in a folder. It works for about 50% to 80% of the files; for the rest it returns an HTTP 500 error. When I post a previously failing file individually (outside the script), it sometimes works. When done in an ad hoc manner it works most of the time but still fails occasionally. At times it succeeds for the application/rdf+xml format but fails for application/json. The behavior is inconsistent.

Here is an example trace of when it does not work as expected [0]

A sample of the data being used can be found here [1]

Any help would be appreciated. 

Namrata Malarout

James Baker | 13 Jul 10:39 2015

Licensing of Tika


Apache Tika is licensed under the ASL2 license, but a number of its dependencies aren't - for example, Java UnRar is licensed under the UnRar license.

Can someone explain to me how this works? If I am looking at releasing my own software that is dependent on Tika, can I release it under ASL2 or do I also need to take into account the licenses of the sub-dependencies?

Gabriele Lanaro | 10 Jul 17:36 2015

Configuring Logging

Hi, I would like to know if it is possible to configure logging in Tika. For example, I'd like to log to a file instead of standard output. Redirection is not an option because I'm using Tika as a library inside another application, and I'd like to log other parts of the application as well.

Is there an easy way to configure tika logging globally?
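Tika itself does little logging, but its parser libraries (PDFBox, POI, and friends) log through commons-logging and similar facades, so in practice logging is configured through whichever backend is bound on your application's classpath. Assuming a log4j binding, a log4j.properties along these lines sends everything to a file (paths and category names below are illustrative):

```properties
# log4j.properties on the application classpath,
# assuming a commons-logging -> log4j binding
log4j.rootLogger=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=/var/log/myapp/tika.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d %-5p %c - %m%n
# quiet down chatty parser internals if desired
log4j.logger.org.apache.pdfbox=WARN
```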

Andrea Asta | 6 Jul 12:11 2015

Extract PDF inline images

I'm trying to store the inline images from a PDF to a local folder, but can't find any valid example. I can only use the RecursiveParserWrapper to get all the available metadata, but not the binary image content.

This is my code:

RecursiveParserWrapper parser = new RecursiveParserWrapper(
      new AutoDetectParser(),
      new BasicContentHandlerFactory(HANDLER_TYPE.XML, -1));
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
PDFParserConfig config = new PDFParserConfig();
config.setExtractInlineImages(true); // ask the PDF parser to surface inline images
context.set(org.apache.tika.parser.pdf.PDFParserConfig.class, config);
context.set(org.apache.tika.parser.Parser.class, parser);

InputStream is = PdfRecursiveExample.class.getResourceAsStream("/BA200PDE.PDF");
// parsing the file
ToXMLContentHandler handler = new ToXMLContentHandler(new FileOutputStream(new File("out.txt")), "UTF-8");
parser.parse(is, handler, metadata, context);

How can I store each image file to a folder?
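For context: once inline-image extraction is enabled via PDFParserConfig.setExtractInlineImages(true), the PDF parser hands each image to the org.apache.tika.extractor.EmbeddedDocumentExtractor registered in the ParseContext (context.set(EmbeddedDocumentExtractor.class, myExtractor)), and writing files to a folder happens inside its parseEmbedded(...) method. The file-saving core of such an extractor is plain JDK code; a sketch, with the Tika wiring described in comments and names like suggestedName being illustrative:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class EmbeddedImageSaver {
    // The core of a parseEmbedded(...) implementation: stream one embedded
    // image out to a target folder under a safe file name. In Tika, the
    // suggested name would come from the embedded document's Metadata
    // (its RESOURCE_NAME_KEY entry).
    static Path saveEmbedded(InputStream stream, String suggestedName, Path outDir)
            throws IOException {
        // keep only a safe single-path-component style name
        String name = suggestedName == null ? "embedded.bin"
                : suggestedName.replaceAll("[^A-Za-z0-9._-]", "_");
        Files.createDirectories(outDir);
        Path target = outDir.resolve(name);
        Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("images");
        byte[] fakeImage = {(byte) 0x89, 'P', 'N', 'G'};
        Path saved = saveEmbedded(new ByteArrayInputStream(fakeImage), "image0.png", out);
        System.out.println(saved.getFileName()); // image0.png
    }
}
```

The RecursiveParserWrapper in the snippet above only captures metadata and text per embedded document; to get the image bytes on disk you need an extractor like this in the ParseContext.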

Stefan Alder | 29 Jun 23:17 2015

systemd script?

Is there a systemd script/config recommended for tika-server?  I'm planning to run on Centos 7.  Thanks!