Brian Young | 9 Sep 21:55 2015

tesseract issue

Hello,

On OS X at least, tesseract and tessdata may not be under a common root.  e.g.:

/opt/local/share/tessdata

/opt/local/bin/tesseract


Unfortunately, it looks like TesseractOCRParser does not accommodate this, since a single configuration value is used both for finding the binary and for setting the TESSDATA_PREFIX environment variable.


Now, TESSDATA_PREFIX does not get set if I do not pass in the path on the config object.  However, even though tesseract is in my path, it isn't found when the ProcessBuilder executes unless I've given it the full path... which of course sets the TESSDATA_PREFIX to the wrong thing.


It seems like maybe it would be best to handle these as two separate configuration values?  But short of that and a new version of Tika, does anyone have any other advice?


Thank you

Brian
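The "two separate configuration values" idea can be sketched outside of Tika with a plain ProcessBuilder. This is only an illustration under stated assumptions: TesseractLauncher and its parameters are hypothetical names, not part of TesseractOCRParser, and whether TESSDATA_PREFIX should point at the tessdata directory itself or at its parent depends on the Tesseract version installed.

```java
// Hypothetical helper, not part of Tika: keeps the binary location and the
// tessdata location as two independent settings.
class TesseractLauncher {

    // Builds (but does not start) a tesseract process for one image.
    static ProcessBuilder build(String tesseractBinary, String tessdataPrefix,
                                String inputImage, String outputBase) {
        // "tesseract <image> <outputbase>" is the basic command-line form.
        ProcessBuilder pb = new ProcessBuilder(tesseractBinary, inputImage, outputBase);
        // Set TESSDATA_PREFIX independently of where the binary lives.
        pb.environment().put("TESSDATA_PREFIX", tessdataPrefix);
        // Merge stderr into stdout so one stream carries all diagnostics.
        pb.redirectErrorStream(true);
        return pb;
    }

    public static void main(String[] args) {
        // Paths taken from the message above; the prefix value is an assumption
        // (parent of tessdata), adjust it for the local Tesseract version.
        ProcessBuilder pb = build("/opt/local/bin/tesseract", "/opt/local/share/",
                                  "page.png", "page");
        System.out.println(pb.command());
        System.out.println("TESSDATA_PREFIX=" + pb.environment().get("TESSDATA_PREFIX"));
    }
}
```

Calling start() on the returned ProcessBuilder would run the binary by its full path while the environment variable points somewhere else entirely, which is the separation the message asks for.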






Mr Havercamp | 9 Sep 00:15 2015

WMV Support in TikaJAXRS

I have a Windows Media Video (WMV) file whose metadata I can retrieve using the tika app, but in tika jaxrs I get an extraction failed error.

Could this be attributed to a server configuration issue, and is there some way of dealing with it?

The error is:

WARNING: meta: Text extraction failed
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.server.resource.TikaResource$1@29cfc20e
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:238)
    at org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:135)
    at org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:68)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181)
    at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
    at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
    at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
    at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
    at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:370)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
    at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:651)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
    at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:745)
Caused by: javax.ws.rs.WebApplicationException: HTTP 415 Unsupported Media Type
    at org.apache.tika.server.resource.TikaResource$1.parse(TikaResource.java:111)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    ... 36 more

zahlenmeer | 1 Sep 10:44 2015

Use TikaJAXRS with HDD offsets instead of urls

Hey everyone,
I am parsing file systems in HDD images in a C++ program. For further analysis I would like to parse the files I
find with Tika. The recommended way I found was to set up a Tika server and send and receive files and
responses through its RESTful interface. Unfortunately I can only send "real" files over that interface,
but I need to send just the bytes of a file that my file system parser found (described by the length of the file
in bytes and its offset on the HDD).
How can I achieve this? Or what would be an alternative? (The file system parser I use is C/C++ only.)
Regards
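Since an HTTP request body is just bytes, one possible approach is to read the (offset, length) range out of the image and PUT it to the server without ever creating a file. The sketch below is in Java for illustration; the same two steps should carry over to C/C++ with any HTTP client library. ImageRangeSender, readRange and putBytes are hypothetical names, the server URL (for example http://localhost:9998/tika, as used elsewhere in this archive) is an assumption about the local setup, and ranges larger than 2 GB are not handled.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: sends a byte range of a disk image as a request body.
class ImageRangeSender {

    // Reads `length` bytes starting at `offset` from the image file.
    static byte[] readRange(String imagePath, long offset, int length) throws IOException {
        try (RandomAccessFile image = new RandomAccessFile(imagePath, "r")) {
            byte[] buffer = new byte[length];
            image.seek(offset);
            // readFully throws EOFException if the range runs past the end of the image.
            image.readFully(buffer);
            return buffer;
        }
    }

    // PUTs the raw bytes to the given URL and returns the response body as text.
    static String putBytes(String serverUrl, byte[] body) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(serverUrl).openConnection();
        try {
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            conn.setRequestProperty("Accept", "text/plain");
            conn.setFixedLengthStreamingMode(body.length);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            try (InputStream in = conn.getInputStream()) {
                ByteArrayOutputStream response = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    response.write(chunk, 0, n);
                }
                return new String(response.toByteArray(), StandardCharsets.UTF_8);
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

A call such as putBytes("http://localhost:9998/tika", readRange(imagePath, offset, length)) would then send only the bytes the file system parser identified.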

Mungeol Heo | 1 Sep 03:38 2015

Does tika support "HWP"?

Hi,

I am trying to use Tika to parse HWP files, and there is something,
described below, that confuses me.

> java -jar tika-app-1.10.jar --list-supported-types | grep hwp
> application/x-hwp

> java -jar tika-app-1.10.jar --detect sample.hwp
> application/x-tika-msoffice

Does this mean that Tika cannot detect HWP files correctly?

Another thing: 'application/x-hwp' does not appear in the supported
formats list on the
'http://tika.apache.org/1.10/formats.html' page.

So, does tika support "HWP"?
If it does, is there a way to auto-detect HWP files correctly?
Any help will be great!
Thank you.

- mungeol
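A possible explanation, offered as an assumption and not confirmed in this thread: HWP 5.x files are stored as OLE2 compound documents, the same container used by legacy MS Office files, so detection based on the leading bytes alone can only report the generic container type. The sketch below checks a file for the OLE2 signature; Ole2Check and its methods are hypothetical names and are not part of Tika.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical helper: tests whether data begins with the OLE2 signature.
class Ole2Check {

    // The 8-byte signature that opens an OLE2 compound document.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, (byte) 0x1A, (byte) 0xE1
    };

    // True if the given bytes start with the OLE2 signature.
    static boolean startsWithOle2Magic(byte[] header) {
        return header.length >= OLE2_MAGIC.length
            && Arrays.equals(Arrays.copyOf(header, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    // Reads the first 8 bytes of the file and checks them against the signature.
    static boolean isOle2File(String path) throws IOException {
        try (FileInputStream in = new FileInputStream(path)) {
            byte[] header = new byte[OLE2_MAGIC.length];
            int read = 0;
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) {
                    break; // file is shorter than the signature
                }
                read += n;
            }
            return read == header.length && startsWithOle2Magic(header);
        }
    }
}
```

If isOle2File("sample.hwp") returns true, the application/x-tika-msoffice result would be consistent with container-level detection rather than with a broken file.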

Andrea Asta | 27 Aug 10:38 2015

TikaConfig with constructor args

Hi all,
I've developed a new Parser for my custom file type.
This parser needs some configuration to initialize an external connection. Is there a way to specify the constructor params (or bean properties to set) in the Tika XML format?

Thanks
Andrea

Sergey Tsalkov | 20 Aug 08:19 2015

want to disable tesseract ocr parser

Hey awesome Tika folks!
The reason I'm writing is that I want to disable the
TesseractOCRParser. The reason is that it makes Tika take longer to
finish, and I don't need the OCRed results.

I can't simply uninstall tesseract from the system because I use it
for other things.

I thought about sending Tika a custom PATH that excludes /usr/bin so
it can't find tesseract, but that seems ugly and likely to break
things.

Then I thought I could pass a custom config.xml to disable it, but I
can't figure out how to write the config file.

I would greatly appreciate any help!

Thanks,
Sergey
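A config file along the following lines is one way this might be expressed, mirroring the detector-exclude syntax that appears in another thread of this archive. Treat it as a sketch: it assumes the Tika version in use supports the parser-exclude element inside DefaultParser, which is not confirmed in this thread.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep every default parser except the Tesseract OCR parser -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
```

The file would be loaded the same way as any custom config, for example through new TikaConfig(stream) as shown in the OutlookPSTParser thread.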

Sznajder ForMailingList | 17 Aug 17:51 2015

Extracting the structure of an HTML Document

Hi

I am a new user of Tika.

I am handling HTML documents, and I have succeeded in parsing them into a "clean" text string.

However, I am interested in getting the structure of the documents: what the different sections are, what the titles of these sections are, etc.

Is there a way to do that with Tika?

Thanks!

Benjamin
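Tika parsers report their output as XHTML SAX events, so one way to recover structure is to supply a ContentHandler that watches for heading elements and ignores everything else. The sketch below uses only JDK classes; HeadingCollector is a hypothetical name, and it assumes the HTML parser passes h1 to h6 elements through to the handler. The collect method feeds it from the JDK's own SAX parser purely to demonstrate the handler.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of h1..h6 elements in document order.
class HeadingCollector extends DefaultHandler {
    private final List<String> headings = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a heading
    private String currentTag;

    // Namespace-aware parsers fill localName; others only fill qName.
    private static String tagName(String localName, String qName) {
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        return name.toLowerCase(Locale.ROOT);
    }

    private static boolean isHeading(String tag) {
        return tag.length() == 2 && tag.charAt(0) == 'h'
            && tag.charAt(1) >= '1' && tag.charAt(1) <= '6';
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String tag = tagName(localName, qName);
        if (current == null && isHeading(tag)) {
            currentTag = tag;
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && tagName(localName, qName).equals(currentTag)) {
            headings.add(currentTag + ": " + current.toString().trim());
            current = null;
        }
    }

    List<String> headings() {
        return headings;
    }

    // Demonstration only: runs the handler over a small XHTML string.
    static List<String> collect(String xhtml) throws Exception {
        HeadingCollector handler = new HeadingCollector();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.headings();
    }
}
```

With Tika, the same handler instance would be passed as the ContentHandler argument of the parser's parse(...) call in place of a text-only handler, and headings() read afterwards.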
Justin | 14 Aug 04:40 2015

Re: How to configure OutlookPSTParser

I replaced:

    <detector class="org.apache.tika.parser.microsoft.POIFSContainerDetector"/>
    <detector class="org.apache.tika.mime.MimeTypes"/>

with:

    <detector class="org.apache.tika.detect.DefaultDetector">
      <detector-exclude class="org.apache.tika.parser.pkg.ZipContainerDetector"/>
    </detector>

...and this works. So I think there's a problem in that CompositeDetector does not behave like DefaultDetector with the same set of detectors.



On Wednesday, August 12, 2015 8:47 AM, Justin <crynax-/E1597aS9LQAvxtiuMwx3w@public.gmane.org> wrote:


More information: I stepped through with debugger and found differences. When I use TikaConfig.getDefaultConfig(), then getDetector() returns a DefaultDetector whose MimeTypes successfully detects the PST. When I use my configuration file, then getDetector() instead returns a CompositeDetector whose MimeTypes fails to detect the PST.



On Wednesday, August 12, 2015 3:15 AM, Justin <crynax-/E1597aS9LQAvxtiuMwx3w@public.gmane.org> wrote:


Sorry, that was just a copy/paste omission. I have the closing tag and my config works for XLS, not for PST. Because the default config works, I know I have all the dependencies.



On Aug 12, 2015, 2:10:27 AM, Nick Burch wrote:
On 12/08/15 02:07, Justin wrote:

> ---tika-config.xml---
>
>
>
>
>
>
>
>
>
>
>
>
>
> I do not get anything back from BodyContentHandler when parsing a PST
> file whereas I do when I use TikaConfig.getDefaultConfig() instead. Am I
> missing something?


Your config file looks invalid - you need to close the <parsers> tag
with a </parsers> before you move onto the detectors


I'd also suggest you try some of the things listed in the
Troubleshooting page, to ensure you really have the parsers you expected:
http://wiki.apache.org/tika/Troubleshooting%20Tika

Nick






David Meikle | 13 Aug 19:05 2015

[CVE-2015-3271] Apache Tika information disclosure vulnerability

CVE-2015-3271: Apache Tika information disclosure vulnerability 

Severity: Important

Vendor:
The Apache Software Foundation

Versions Affected:
Apache Tika 1.9

Description:

Apache Tika provides optional functionality to run itself as a web service to allow remote use. When used in this manner, 
it's possible for a 3rd party to pass a 'fileUrl' header to the Apache Tika Server (tika-server). This header lets a remote
client request that the server fetches content from the URL provided, including files from the server's local filesystem.
Depending on the file permissions set on the local filesystem, this could be used to return sensitive content from 
the server machine.

Note this vulnerability only exists if you are running the tika-server version 1.9, and you allow un-trusted access to the tika-server
URL. Usage of Apache Tika as a standard library is not affected.

Mitigation:
Apache Tika 1.9 users should upgrade to Apache Tika 1.10

Example:
wget https://repo1.maven.org/maven2/org/apache/tika/tika-server/1.9/tika-server-1.9.jar && java -jar tika-server-1.9.jar
curl -i -H "fileUrl:file:///etc/passwd" -H "Accept: text/plain" -X PUT http://localhost:9998/tika

Credit:
This issue was discovered by Tim Allison from the Apache Tika Community.

Justin | 12 Aug 03:07 2015

How to configure OutlookPSTParser

Hi all,

I'm loading my own configuration file like so:

        // try-with-resources closes the stream, and skips the close if the resource was not found (null)
        try (InputStream stream = MyParser.class.getResourceAsStream("tika-config.xml")) {
            tikaConfig = new TikaConfig(stream);
        } catch (IOException | SAXException | TikaException e) {
            // fall back to the built-in defaults if the custom config cannot be loaded
            tikaConfig = TikaConfig.getDefaultConfig();
        }

---tika-config.xml---
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.mail.RFC822Parser"/>
    <parser class="org.apache.tika.parser.mbox.MboxParser"/>
    <parser class="org.apache.tika.parser.mbox.OutlookPSTParser"/>
    <parser class="org.apache.tika.parser.microsoft.JackcessParser"/>
    <parser class="org.apache.tika.parser.microsoft.OldExcelParser"/>
    <parser class="org.apache.tika.parser.microsoft.OfficeParser"/>
    <parser class="org.apache.tika.parser.microsoft.TNEFParser"/>
    <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
  <detectors>
    <detector class="org.apache.tika.parser.microsoft.POIFSContainerDetector"/>
    <detector class="org.apache.tika.parser.pkg.ZipContainerDetector"/>
    <detector class="org.apache.tika.mime.MimeTypes"/>
  </detectors>
  <translator class="org.apache.tika.language.translate.DefaultTranslator"/>
</properties>
---

I do not get anything back from BodyContentHandler when parsing a PST file whereas I do when I use TikaConfig.getDefaultConfig() instead. Am I missing something?

Thanks!
Justin

David Meikle | 8 Aug 16:01 2015

[ANNOUNCE] Apache Tika 1.10 release

The Apache Tika project is pleased to announce the release of Apache Tika 1.10. The release contents have been pushed out to the main Apache release site and to the Central sync, so the releases should be available as soon as the mirrors get the syncs.

Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Apache Tika 1.10 contains a number of improvements and bug fixes. Details can be found in the changes file:
http://www.apache.org/dist/tika/CHANGES-1.10.txt

Apache Tika is available in source form from the following download page:
http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.10-src.zip

Apache Tika is also available in binary form or for use using Maven 2 from the Central Repository:
http://repo1.maven.org/maven2/org/apache/tika/

In the initial 48 hours, the release may not be available on all mirrors. When downloading from a mirror site, please remember to verify the downloads using signatures found on the Apache site:
https://people.apache.org/keys/group/tika.asc

For more information on Apache Tika, visit the project home page:
http://tika.apache.org/

-- David Meikle, on behalf of the Apache Tika community

