Brian Young | 25 Mar 22:07 2016

file-related metadata

Hello,

I'm getting back two or three metadata properties that relate to a temp file that Tika is apparently creating under the hood:

File Modified Date (the current date)
File Name (temp file name: apache-tika-3021300783416279997.tmp)
File Size 

I assume this is because I only have a stream to give Tika and no longer have a physical file.  However, the users are seeing these (particularly the modified date) and misinterpreting them.

I'd like to exclude these, which I could of course do with a simple string-based filter.  However, that feels a little hackish... I was hoping there might be some way to deactivate file metadata when Tika is the one that created the temp file.  I tried to find the spot in Tika where these are added by grepping all the source, but I seem to have come up empty for some reason.
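
The string-based filter mentioned above could look roughly like the sketch below. It is only an illustration, not a Tika feature: it assumes the extracted metadata has already been copied into a plain Map, the class and method names (FileMetadataFilter, withoutTempFileKeys) are made up for this example, and the key names are taken from the list above. The sample values in main are invented.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class FileMetadataFilter {

    // Key names as they appear in the list above; adjust them if the
    // actual output spells them differently (assumption).
    private static final Set<String> TEMP_FILE_KEYS = new HashSet<>(
            Arrays.asList("File Modified Date", "File Name", "File Size"));

    /** Returns a copy of the metadata without the temp-file keys, keeping insertion order. */
    static Map<String, String> withoutTempFileKeys(Map<String, String> metadata) {
        Map<String, String> filtered = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            if (!TEMP_FILE_KEYS.contains(entry.getKey())) {
                filtered.put(entry.getKey(), entry.getValue());
            }
        }
        return filtered;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, not real Tika output.
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", "image/jpeg");
        metadata.put("File Name", "apache-tika-3021300783416279997.tmp");
        metadata.put("File Size", "12345 bytes");
        System.out.println(withoutTempFileKeys(metadata));
    }
}
```

This always drops the three keys, so it would also hide them when a real file was supplied; a flag recording whether the input was a stream would be needed to tell the two cases apart.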

Thanks for any pointers,
Brian



Thamme Gowda N. | 23 Mar 22:53 2016

Fwd: How to enable multiple parsers for content type ?

Hi Tika experts,

Question: How do I enable multiple parsers for specific MIME types?

I am using Tika to parse HTML pages.

My requirement is that both NamedEntityParser and HtmlParser have to be enabled for specific web-related MIME types like text/html and application/xhtml+xml.

From my findings on the Tika wiki, this should be possible with CompositeParser, but I am not getting it right. Only the last parser registered for the MIME type seems to be working.

My configuration is given below. 

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
    </parser>

    <parser class="org.apache.tika.parser.ner.NamedEntityParser">
      <mime>text/plain</mime>
      <mime>text/html</mime>
      <mime>text/x-php</mime>
      <mime>text/x-jsp</mime>
      <mime>application/atom+xml</mime>
      <mime>application/xhtml+xml</mime>
      <mime>application/xml</mime>
      <mime>application/rss+xml</mime>
      <mime>application/pdf</mime>
      <mime>application/atom+xml</mime>
      <mime>application/msword</mime>
      <mime>text/asp</mime>
    </parser>

    <parser class="org.apache.tika.parser.html.HtmlParser">
      <mime>text/html</mime>
      <mime>text/x-php</mime>
      <mime>text/x-jsp</mime>
      <mime>application/atom+xml</mime>
      <mime>application/xhtml+xml</mime>
      <mime>application/xml</mime>
      <mime>application/rss+xml</mime>
      <mime>application/atom+xml</mime>
      <mime>text/asp</mime>
    </parser>
  </parsers>
</properties>


-
Thanks in advance
Thamme.

--
Thamme Gowda N. 
Grad Student at usc.edu 

John Patrick | 3 Mar 00:13 2016

Logging

Tika appears to use two logging frameworks, Commons Logging and SLF4J.

Is that correct?

Commons Logging is used by:
tika-app
tika-parsers
tika-server

SLF4J is used by:
tika-batch
tika-core
tika-parsers
tika-translate

If I write a patch, which way should I refactor? My personal preference is to use SLF4J.

John

raghu vittal | 1 Mar 07:40 2016

Unable to extract content from 250 MB document giving me TikaException: "Zip bomb detected!"

Hi All,


I am using the Tika server REST API to extract the content from large files.


I was able to extract the content of files up to 100 MB. When I try to send files larger than 100 MB, I get a "Zip bomb detected" Tika exception.


I tried three ways of sending the file to Tika:


1. multipart/form-data http://localhost:9998/tika/form

2. zip stream http://localhost:9998/unpack/all

3. tika string http://localhost:9998/tika


With all three options I get the same exception.


I am working on a .NET project and using HttpClient to send requests to the Tika server.

I even tried setting large timeouts for HttpClient, but no luck. Any help would be much appreciated.



Regards,

Raghu. 

Luke Noel-Storr | 25 Feb 16:35 2016

ForkParser

Hi,

I am trying to use the ForkParser, but am getting an exception:

org.apache.tika.exception.TikaException: Unable to serialize ParseContext to pass to the Forked Parser

Caused by: java.io.NotSerializableException: java.net.URLClassLoader


I'm constructing the parser like this:

return new Tika(
    TikaConfig.getDefaultConfig().getDetector(),
    new ForkParser()
);

and using it like this:

final Metadata metadata = new Metadata();

final String content
    = tika.parseToString(newInputStream(file), metadata);


I'm using Java 8 on OS X 10.11, and my application (in dev mode) is running inside Jetty inside an sbt console session.

Any pointers as to what I am doing wrong, or how I can get it working?
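
For what it's worth, the NotSerializableException itself can be reproduced without Tika. The sketch below is not Tika code and says nothing about which ParseContext entry is responsible here; it only shows how Java serialization fails as soon as a Serializable object holds a reference to a URLClassLoader. The names SerializationProbe and Holder are made up for the illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;

final class SerializationProbe {

    /** A Serializable wrapper; serialization still fails if the payload is not Serializable. */
    static final class Holder implements Serializable {
        private static final long serialVersionUID = 1L;
        final Object payload;

        Holder(Object payload) {
            this.payload = payload;
        }
    }

    /** Tries to serialize the value in memory and reports what happened. */
    static String trySerialize(Object value) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return "ok";
        } catch (NotSerializableException e) {
            // The message is the name of the class that could not be serialized.
            return "not serializable: " + e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(trySerialize(new Holder("plain string")));
        System.out.println(trySerialize(new Holder(new URLClassLoader(new URL[0]))));
    }
}
```

If something similar is happening here, the thing to look for would be an object placed in (or reachable from) the ParseContext that keeps a class loader reference.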


Many Thanks,

Luke Noel-Storr.
----------------

Principal Software Engineer @ integrate
Tel: +44 (0)1926 889199
http://www.integrate.co.uk
John Patrick | 23 Feb 18:46 2016

Jackson & Fat tika-server jar question

hiya,

I'm working with an existing code base that is using Jackson 2.6.3. I'm now adding Tika, but because the tika-server jar contains Jackson 2.4.0, I'm getting lots of compile issues.

1) Was it intentional to have a bloated/fat tika-server jar containing all dependencies?

2) Can Tika be upgraded to use Jackson 2.6.3 or newer?

3) Can tika-server be corrected so it's not bloated with dependencies, or split into a tika-server with just org.apache.tika.server and a tika-server-all, which is the bloated version with dependencies?

Cheers,
John


font metrics issue

Hi all,
I'm extracting some text from a PDF. As a result, some important words end up with spaces between characters. For example, the word "Subtitle" that I want to detect might be written as "S u b t i t l e". If I parse the text with a standard tokenizer, the word is lost.
I think (after asking on the Solr list) that this might be related to fonts.
Is there any way to cope with this through Tika configuration?
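
Failing a configuration option, one fallback outside Tika would be to repair the text before tokenizing. The sketch below is only an illustration of that idea, not a Tika setting: the class name LetterSpacingRepair and the regular expression are assumptions made for this example. It removes a single space whenever it sits between two one-letter "words".

```java
import java.util.regex.Pattern;

final class LetterSpacingRepair {

    // A single space with a one-letter word on its left and a one-letter word on its right.
    private static final Pattern GAP = Pattern.compile("(?<=\\b\\p{L}) (?=\\p{L}\\b)");

    /** Joins runs of single letters separated by single spaces, e.g. "S u b" becomes "Sub". */
    static String collapse(String text) {
        return GAP.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(collapse("the S u b t i t l e here"));
    }
}
```

The heuristic is crude: it would also join legitimate one-letter words that happen to be adjacent, and two letter-spaced words separated by only one space would be merged into one.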
Many Thanks,

Francisco
Steven White | 19 Feb 18:05 2016

Removing cryptographic JARs from Tika

Hi,

Is there a wiki page or set of instructions on how to remove cryptographic software from Tika?  Is it enough to simply remove the Bouncy Castle libraries from Tika's JAR and still be able to use Tika to its full capacity, minus encrypted files?

I have no need to extract text from encrypted files, but because Tika includes cryptographic JARs, I won't be able to use it.

Thank you!!

Steve
raghu vittal | 19 Feb 10:37 2016

Unable to extract content from chunked portion of large file

Hi All

We have very large PDF, .docx, and .xlsx files. We are using Tika to extract content and load the data into Elasticsearch for full-text search.
Sending very large files to Tika causes an out-of-memory exception.

We want to chunk the file and send it to Tika for content extraction. When we pass a chunked portion of a file to Tika, it returns empty text.
I assume Tika relies on the file structure, which is why it returns no content.

We are using Tika Server (REST API) in our .NET application.

Please suggest a better approach for this scenario.
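
The assumption about file structure can be illustrated without Tika. Since .docx and .xlsx files are ZIP containers, the sketch below builds a small ZIP archive in memory and then tries to read byte-range halves of it with the JDK's own ZipInputStream. Everything here (the class name ChunkedZipDemo, the entry name, the generated content) is made up for the illustration, and it shows plain ZIP behaviour, not what Tika does internally.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ChunkedZipDemo {

    /** Builds a small ZIP archive in memory with one text entry (hypothetical content). */
    static byte[] buildZip() {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            text.append("line ").append(i).append('\n');
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(text.toString().getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Returns how many entries could be read completely, or -1 if reading failed. */
    static int countReadableEntries(byte[] data) {
        byte[] buffer = new byte[4096];
        int count = 0;
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            while (zip.getNextEntry() != null) {
                while (zip.read(buffer) != -1) {
                    // Drain the entry so that truncated data is actually detected.
                }
                count++;
            }
            return count;
        } catch (IOException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        byte[] whole = buildZip();
        byte[] firstHalf = Arrays.copyOfRange(whole, 0, whole.length / 2);
        byte[] secondHalf = Arrays.copyOfRange(whole, whole.length / 2, whole.length);
        System.out.println("whole file:  " + countReadableEntries(whole));
        System.out.println("first half:  " + countReadableEntries(firstHalf));
        System.out.println("second half: " + countReadableEntries(secondHalf));
    }
}
```

Only the complete archive yields a readable entry; neither half does. That is consistent with a byte-range chunk of such a document producing no text.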

Regards,
Raghu.

Sreenivasa Kallu | 17 Feb 00:34 2016

tika is unable to extract outlook messages

Hi,
I am currently indexing individual Outlook messages, and searching is working fine.
I created the Solr core using the following command:
 ./solr create -c sreenimsg1 -d data_driven_schema_configs 

I am using the following command to index individual messages.

This setup is working fine.

But the new requirement is to extract messages from an Outlook PST file.
I tried the following command to extract messages from the Outlook PST file.


This command extracts only high-level tags and extracts all messages into one message. I am not getting all the tags that I get when extracting individual messages. Is the above command correct? Is the problem that I am not using recursion? How do I add recursion to the above command? Is it a Tika library problem?

Please help me solve the above problem.

Thanks in advance.
--sreenivasa kallu
Chris Mattmann | 15 Feb 19:45 2016

[ANNOUNCE] Apache Tika 1.12 release

The Apache Tika project is pleased to announce the release of Apache
Tika 1.12. The release contents have been pushed out to the main
Apache release site and to the Central sync, so the releases should
be available as soon as the mirrors get the syncs.

Apache Tika is a toolkit for detecting and extracting metadata and
structured text content from various documents using existing parser
libraries.

Apache Tika 1.12 contains a number of improvements and bug fixes.
Details can be found in the changes file:
http://www.apache.org/dist/tika/CHANGES-1.12.txt

Apache Tika is available in source form from the following download
page: http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.12-src.zip

Apache Tika is also available in binary form or for use using Maven
2 from the Central Repository:
http://repo1.maven.org/maven2/org/apache/tika/

In the initial 48 hours, the release may not be available on all
mirrors. When downloading from a mirror site, please remember to
verify the downloads using signatures found on the Apache site:
https://people.apache.org/keys/group/tika.asc

For more information on Apache Tika, visit the project home page:
http://tika.apache.org/

— Chris Mattmann, on behalf of the Apache Tika community

