Ron Grabowski | 24 Oct 04:30 2013

Adding Tika to existing non-Spring Apache CXF web app

If I have an existing non-Spring Apache CXF web application up and running and I want to add support for basic metadata extraction via Tika, do I just need to register Meta/TikaResource alongside my existing resources:

<!-- web.xml -->

Will there be conflicts with the CXF jars that ship with Tika if my application is stuck on CXF v2.5.x?
Samuel Desseaux | 23 Oct 14:53 2013

extract metadata of pdf files with tika


I'm fairly new to Tika and would appreciate some help.

I have many PDF files from which I would like to extract metadata, in order
to produce an XML file (one that respects Dublin Core).

I've followed these links 

Do I have to write a program with Tika to do it?

How could I do that?
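
A possible starting point for the XML side of this question, offered as a sketch rather than a confirmed answer from the list. It assumes the metadata has already been extracted (for example by copying names and values out of a Tika Metadata object) and that Dublin Core entries use "dc:"-prefixed keys such as "dc:title". The class name DublinCoreXml and the root element name "metadata" are hypothetical choices; only the Dublin Core namespace URI is standard.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper, not part of Tika: serializes "dc:"-prefixed entries as XML.
final class DublinCoreXml {
    // Namespace of the Dublin Core element set, version 1.1.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Assumption: keys look like "dc:title"; entries with other keys are skipped.
    static String toXml(Map<String, String> metadata) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartDocument("UTF-8", "1.0");
        xml.writeStartElement("metadata"); // root element name is an arbitrary choice
        xml.writeNamespace("dc", DC_NS);
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String key = entry.getKey();
            if (!key.startsWith("dc:")) {
                continue;
            }
            xml.writeStartElement("dc", key.substring(3), DC_NS);
            xml.writeCharacters(entry.getValue()); // escapes &, < and > as needed
            xml.writeEndElement();
        }
        xml.writeEndElement();
        xml.writeEndDocument();
        xml.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        Map<String, String> metadata = new LinkedHashMap<String, String>();
        metadata.put("dc:title", "Example title");
        metadata.put("dc:creator", "Example author");
        System.out.println(toXml(metadata));
    }
}
```

Running this once per PDF would give one XML document per file; how the map is filled from Tika is left out here on purpose.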

Best regards


Bratislav Stojanovic | 18 Oct 11:42 2013

Fwd: Tika OSGi bundle does not produce UTF-8 output


I've tried exactly the same code in two scenarios:

Tika tika = new Tika();
Metadata metadata = new Metadata();

Reader reader = tika.parse(new File("..."));
FileWriter fw = new FileWriter(new File("..."));

int data = reader.read();
StringBuilder sb = new StringBuilder();
while (data != -1){
char dataChar = (char) data;
sb.append(dataChar);
data = reader.read();
}

When I put this code in a simple Java project with tika-app-1.4.jar as a dependency, it
generates UTF-8 output (correct).
When I put this code inside a bundle with tika-bundle and tika-core as dependencies and deploy it
inside karaf, it generates ANSI output (blah).
Both projects are managed with maven and Eclipse 4.2.

Do I have to additionally set something or should I embed tika-app inside my bundle (using

I'm using Tika 1.4, Java 1.6.45, Win 7 x64 and karaf 2.3.3.
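
One possible cause, not confirmed in this thread: FileWriter(File) encodes with the JVM's default charset, which on Windows is usually not UTF-8 and can differ between an Eclipse launch and a karaf process. A minimal sketch that fixes the output encoding explicitly, independent of where the code runs, is shown here. The class and method names are hypothetical; the Reader would be the one returned by tika.parse(...).

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical helper: copies characters to a file, always encoding them as UTF-8.
final class Utf8Copy {
    static void writeUtf8(Reader reader, File target) throws IOException {
        // OutputStreamWriter with an explicit charset ignores the platform default.
        Writer writer = new OutputStreamWriter(
                new FileOutputStream(target), Charset.forName("UTF-8"));
        try {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        } finally {
            writer.close();
            reader.close();
        }
    }
}
```

If the output is still not UTF-8 with an explicit charset, the difference would have to come from the parsing side rather than from the writer.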

Bratislav Stojanovic, M.Sc.
Florin P | 10 Oct 11:15 2013

Unsubscribe address

Can you please provide me with the unsubscribe address for the Tika list?
Mr Havercamp | 10 Oct 01:50 2013

Using TikaJAXRS with remote files


Been working with tika jaxrs and it is working great.

One thing I'm wondering: the standalone Tika app can extract remote
files when given a URL (in both GUI and command-line mode). Is the
same at all possible with TikaJAXRS, or with the Tika app launched in
server mode?

The reason being I may run an indexing client on a separate server so it 
wouldn't necessarily have direct access to the file system where the 
files to be indexed reside.
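
Whether the server can fetch URLs itself is not answered here. A client-side workaround can be sketched instead: the indexing client opens the remote URL and streams the bytes to the server in a PUT request. This is an assumption-laden sketch: RemoteFileRelay is a hypothetical class, and the endpoint URL (for example a /meta or /tika resource on the server) has to be supplied by the caller.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper: fetches a remote document and forwards it to an HTTP endpoint.
final class RemoteFileRelay {
    // Copies all bytes from in to out without closing either stream.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    // Reads the document at 'source', PUTs it to 'endpoint', returns the response body.
    static String relay(URL source, URL endpoint) throws IOException {
        HttpURLConnection put = (HttpURLConnection) endpoint.openConnection();
        put.setDoOutput(true);
        put.setRequestMethod("PUT");
        InputStream in = source.openStream();
        try {
            OutputStream out = put.getOutputStream();
            try {
                copy(in, out);
            } finally {
                out.close();
            }
        } finally {
            in.close();
        }
        InputStream response = put.getInputStream(); // throws on 4xx/5xx responses
        try {
            ByteArrayOutputStream body = new ByteArrayOutputStream();
            copy(response, body);
            return body.toString("UTF-8");
        } finally {
            response.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: URL of the remote file, args[1]: URL of the server resource.
        System.out.println(relay(new URL(args[0]), new URL(args[1])));
    }
}
```

By default HttpURLConnection buffers the request body in memory, so very large files may call for a streaming mode instead.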



Markus Jelsma | 9 Oct 11:24 2013

Script element not reported in custom handler


I'm building a new ContentHandler that needs to do some work on script elements as well. But they are not
reported in my startElement method. The context has the IdentityHtmlMapper set and script does not get
discarded in Tika's own HtmlHandler. Instead, the script element is reported in HtmlHandler but not in my
custom handler.

The confusing thing is that I am able to get it in my handler when adding the script element to TagSoup inside
HtmlParser's constructor:
        HTML_SCHEMA.elementType("script", HTMLSchema.M_EMPTY, 65535, 0);

Without this, script and its characters are only reported inside HtmlHandler, never in custom handlers.

I must be doing something wrong here, any hints?
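
One way to narrow this down, sketched under assumptions: run a handler of the same shape against the JDK's own SAX parser on well-formed XHTML. If script elements and their characters arrive there, the handler logic is fine and the difference lies upstream in the Tika/TagSoup pipeline. ScriptCollector is a hypothetical stand-in for the custom handler, not the poster's code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text content of every script element it is told about.
final class ScriptCollector extends DefaultHandler {
    private final List<String> scripts = new ArrayList<String>();
    private StringBuilder current; // non-null only while inside a script element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = localName.length() > 0 ? localName : qName;
        if ("script".equalsIgnoreCase(name)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = localName.length() > 0 ? localName : qName;
        if ("script".equalsIgnoreCase(name) && current != null) {
            scripts.add(current.toString());
            current = null;
        }
    }

    List<String> getScripts() {
        return scripts;
    }

    // Parses well-formed XHTML with the JDK SAX parser (no TagSoup, no Tika involved).
    static List<String> collect(String xhtml)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        ScriptCollector handler = new ScriptCollector();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getScripts();
    }
}
```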


Tad Wimmer | 26 Sep 20:45 2013

Extracting Metadata from MS Office (2007 +) Files on Glassfish


I’m a Tika newbie, and running into an issue with Tika on Glassfish.

I'm using Tika to extract metadata from documents uploaded to a JSF 2.0 Web application using PrimeFaces p:fileupload (PrimeFaces 3.5) and running on Glassfish 3.2.2. Here are the essentials of my code:

private void extract(InputStream stream, Parser parser, Metadata metadata) {
    try {
        // The metadata object is filled in as a side effect of parsing.
        parser.parse(stream, new BodyContentHandler(), metadata, new ParseContext());
    } catch (IOException | SAXException | TikaException e) {
        LOGGER.debug("Exception parsing file for metadata: {}", e);
    }
}

Passing in an AutoDetectParser with the InputStream from the fileupload works fine for MS Office OOXML (.docx, .xlsx, etc.) documents in JUnit tests, but fails when I run it in the Glassfish container with the following stack trace:

Caused by: java.lang.NoClassDefFoundError: org/dom4j/Namespace
at org.apache.poi.openxml4j.opc.internal.unmarshallers.PackagePropertiesUnmarshaller.<clinit>( ~[na:3.6]
at org.apache.poi.openxml4j.opc.OPCPackage.init( ~[na:3.6]
at org.apache.poi.openxml4j.opc.OPCPackage.<init>( ~[na:3.6]
at org.apache.poi.openxml4j.opc.Package.<init>( ~[na:3.6]
at org.apache.poi.openxml4j.opc.ZipPackage.<init>( ~[na:3.6]
at ~[na:3.6]
at org.apache.poi.extractor.ExtractorFactory.createExtractor( ~[na:3.6]
at ~[na:na]
at ~[na:na]
at org.apache.tika.parser.CompositeParser.parse( ~[na:na]
at org.apache.tika.parser.AutoDetectParser.parse( ~[na:na]
at com.spillman.fileupload.FileMetadataExtractor.extract( ~[FileMetadataExtractor.class:na]

The tika-core-0.7, tika-parsers-0.7, tika-app-0.8, poi-3.6, poi-ooxml-3.6, and dom4j-1.6.1 jars are all in the build path and marked for export. I've even gone so far as to put the jars in the Glassfish endorsed directory. Web research hasn't produced anything directly related, but I did find a few references to this exception in CF related to class loader conflicts that I wasn't able to extrapolate to our Glassfish implementation (which may be a lack of understanding on my part). What do I need to configure or change to get this to work on Glassfish?
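
A diagnostic sketch that may help here, assuming the problem is class loader visibility: a NoClassDefFoundError raised from a POI static initializer is consistent with the POI class being loaded by a loader that cannot see dom4j (for example when some jars sit in a container-level directory and others in the web application). ClassOrigin is a hypothetical helper that reports which loader and location a class comes from, using only standard JDK calls.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper: reports where a class is loaded from, if it is visible at all.
final class ClassOrigin {
    static String describe(String className, ClassLoader loader) {
        try {
            // 'false' avoids running static initializers while probing.
            Class<?> type = Class.forName(className, false, loader);
            CodeSource source = type.getProtectionDomain().getCodeSource();
            String location = (source == null || source.getLocation() == null)
                    ? "bootstrap/unknown"
                    : source.getLocation().toString();
            return className + " loaded by " + type.getClassLoader() + " from " + location;
        } catch (ClassNotFoundException e) {
            return className + " NOT visible to " + loader;
        } catch (LinkageError e) {
            return className + " failed to link via " + loader + ": " + e;
        }
    }

    public static void main(String[] args) {
        ClassLoader context = Thread.currentThread().getContextClassLoader();
        // Class names taken from the stack trace in the post.
        System.out.println(describe("org.dom4j.Namespace", context));
        System.out.println(describe("org.apache.poi.openxml4j.opc.OPCPackage", context));
    }
}
```

Logging both lines from inside the web application, just before the parser.parse call, would show whether POI and dom4j come from the same place. Mixing tika-app with separate tika-core and tika-parsers jars can also put duplicate copies of the same classes on the path, which this output would reveal.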

Thanks in advance,







Jukka Zitting | 26 Sep 16:40 2013

Re: Unsubscribe


To unsubscribe from this list, send a message to user-unsubscribe@...

See for more details.


Jukka Zitting

Julien Nioche | 25 Sep 09:52 2013

Nutch talk at Lucene/SOLR Revolution EU 2013


I will be giving a talk on Nutch at Lucene/SOLR Revolution in Dublin (4/7 Nov).

There should be quite a few interesting presentations as you can  see on as well as the training sessions ( 

Ping me on twitter if you will be there and want to meet for a chat.



Open Source Solutions for Text Engineering

Ryan McKinley | 16 Sep 21:32 2013

register multiple mime types to the same extension?

What approach do people suggest to handle the case where different types use the same extension?

In this case (Lidar vs ASCII log file) I can see if the first character is '~' and pick the right mime type.

How is this supported in Tika?  If I register two mime types with:

<glob pattern="*.las" />

the second one fails.

Is there a general approach to this problem, or does this collision need to be sorted out in a custom Detector?
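
The content check described above can be sketched in plain Java; in Tika the same logic would sit inside a custom Detector implementation. The assumptions are labeled in the code: the media type names are hypothetical, and the sketch assumes ASCII log files begin with '~' while Lidar files begin with the bytes "LASF". LasSniffer itself is a hypothetical class.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sniffer: tells two kinds of *.las files apart by their first bytes.
final class LasSniffer {
    // Hypothetical media type names, chosen only for illustration.
    static final String WELL_LOG = "text/x-las-well-log";
    static final String LIDAR = "application/x-las-lidar";
    static final String UNKNOWN = "application/octet-stream";

    // Peeks at up to four bytes and rewinds, so the caller can still parse the stream.
    static String sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(4);
        byte[] head = new byte[4];
        int n = 0;
        try {
            while (n < 4) {
                int r = in.read(head, n, 4 - n);
                if (r == -1) {
                    break;
                }
                n += r;
            }
        } finally {
            in.reset();
        }
        if (n >= 1 && head[0] == '~') {
            return WELL_LOG; // assumption: ASCII log files start with a '~' section marker
        }
        if (n == 4 && head[0] == 'L' && head[1] == 'A' && head[2] == 'S' && head[3] == 'F') {
            return LIDAR; // assumption: Lidar files start with the signature "LASF"
        }
        return UNKNOWN;
    }
}
```

Wrapping a plain stream in a BufferedInputStream is enough to satisfy the mark/reset requirement.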



Bratislav Stojanovic | 30 Aug 13:58 2013

Fwd: Parsing jnilib file throws exception


I'm trying to parse a folder with a jnilib file inside, but Tika 1.4 throws an exception:
at Source)
at ca.cloudscraper.core.impl.Engine.process(
at ca.cloudscraper.core.impl.Engine.process(
at ca.cloudscraper.core.impl.Engine.process(
at ca.cloudscraper.core.impl.Engine.process(
at ca.cloudscraper.core.impl.Engine.execute(
at ca.cloudscraper.core.tests.LuceneServiceImplTest.test5(
at ca.cloudscraper.core.tests.LuceneServiceImplTest.main(
Caused by: org.apache.tika.exception.TikaException: Failed to parse a Java class
at org.apache.tika.parser.asm.XHTMLClassVisitor.parse(
at org.apache.tika.parser.asm.ClassParser.parse(
at org.apache.tika.parser.CompositeParser.parse(
at org.apache.tika.parser.CompositeParser.parse(
at org.apache.tika.parser.AutoDetectParser.parse(
at org.apache.tika.parser.ParsingReader$
at Source)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
at org.objectweb.asm.ClassReader.readClass(
at org.objectweb.asm.ClassReader.accept(
at org.objectweb.asm.ClassReader.accept(
at org.apache.tika.parser.asm.XHTMLClassVisitor.parse(
... 6 more

Seems like Tika tries to parse this file as a Java class file, but that obviously doesn't work.
I've tried to create a custom-mimetypes.xml file like this:

<?xml version="1.0" encoding="UTF-8"?>
<mime-info>
  <mime-type type="application/octet-stream">
    <_comment>Mac OSX jnilib</_comment>
    <glob pattern="*.jnilib"/>
  </mime-type>
</mime-info>

and after I repack tika-app-1.4.jar with this file in org.apache.tika.mime folder, the problem still persists.

Please help.
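
A possible explanation, not confirmed in this thread: Mach-O universal ("fat") binaries start with the same four magic bytes, 0xCAFEBABE, as Java class files, so content-based detection can mistake one for the other, and a glob rule alone may not override that. The sketch below shows a commonly used heuristic for telling them apart by the next four bytes. CafeBabe is a hypothetical class, the threshold of 45 is an assumption based on the lowest class-file major version, and whether this particular jnilib is a universal binary is not known.

```java
// Hypothetical helper: distinguishes Java class files from Mach-O universal binaries.
final class CafeBabe {
    enum Kind { JAVA_CLASS, MACH_O_UNIVERSAL, OTHER }

    // Expects at least the first eight bytes of the file.
    static Kind classify(byte[] header) {
        if (header == null || header.length < 8) {
            return Kind.OTHER;
        }
        if ((header[0] & 0xFF) != 0xCA || (header[1] & 0xFF) != 0xFE
                || (header[2] & 0xFF) != 0xBA || (header[3] & 0xFF) != 0xBE) {
            return Kind.OTHER;
        }
        // Class file: bytes 4-5 are the minor version, bytes 6-7 the major version (45 or more).
        // Universal binary: bytes 4-7 are the big-endian number of architectures (a small count).
        long word = ((header[4] & 0xFFL) << 24) | ((header[5] & 0xFFL) << 16)
                | ((header[6] & 0xFFL) << 8) | (header[7] & 0xFFL);
        // Heuristic (assumption): values below 45 cannot be a class-file version.
        return word < 45 ? Kind.MACH_O_UNIVERSAL : Kind.JAVA_CLASS;
    }
}
```

Checking the first eight bytes of the jnilib with this helper would show whether the magic-number collision is what sends the file to the class parser.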

Bratislav Stojanovic, M.Sc.
