Mattmann, Chris A (3980) | 15 Feb 18:02 2016

[RESULT] [VOTE] Apache Tika 1.12 Release Candidate #1

Team,

Sorry for the long delay. This VOTE has PASSED with the following
tallies:

+1
Chris Mattmann*
Markus Jelsma
Oleg Tikhonov*
Ken Krugler*
Tim Allison*
Konstantin Gribov*
David Meikle*
Lewis John McGibbney*
Tyler Palsulich*

* - Tika PMC

I’ll go update the website and the mirrors, and complete the
rest of the tasks.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527

Melissa Warnkin | 11 Feb 19:23 2016

ApacheCon NA 2016 - Important Dates!!!

Hello everyone!

I hope this email finds you well, and that everyone is as excited about ApacheCon as I am!

I'd like to remind you all of a couple of important dates, as well as ask for your assistance in spreading the word! Please use your social media platform(s) to get the word out! The more visibility, the better ApacheCon will be for all!! :)

CFP Close: February 12, 2016
CFP Notifications: February 29, 2016
Schedule Announced: March 3, 2016



Apache: Big Data North America 2016 Registration Fees:

Attendee Registration Fee: US$599 through March 6, US$799 through April 10, US$999 thereafter
Committer Registration Fee: US$275 through April 10, US$375 thereafter
Student Registration Fee: US$275 through April 10, US$375 thereafter

Planning to attend ApacheCon North America 2016 May 11 - 13, 2016? There is an add-on option on the registration form to join the conference for a discounted fee of US$399, available only to Apache: Big Data North America attendees.

So, please tweet away!!

I look forward to seeing you in Vancouver! Have a groovy day!!

~Melissa
on behalf of the ApacheCon Team


Steven White | 11 Feb 00:04 2016

Using tika-app-1.11.jar

Hi everyone,

I'm including tika-app-1.11.jar with my application and see that Tika bundles "slf4j", which conflicts with my own "slf4j".  If I remove it from Tika's JAR, will that cause any issues?

I tested after removing it and didn't see any issues, but I'd like to hear from the community.

A side note: when I removed "slf4j", logs in my Eclipse console are now in red (they used to be black before I integrated tika-app-1.11.jar).  If I leave it in, I still get red output, but obviously I also get the "SLF4J: Class path contains multiple SLF4J bindings" message.

Thanks

Steve
Steven White | 8 Feb 19:37 2016

Preventing OutOfMemory exception

Hi everyone,

I'm integrating Tika with my application and need your help to figure out if the OOM I'm getting is due to the way I'm using Tika or if it is an issue with parsing XML files.

The following example code causes an OOM on the 7th iteration with -Xmx2g.  The test passes with -Xmx4g.  The XML file I'm trying to parse is 51 MB.  I have not seen this issue with any other file type I have tested so far: memory usage keeps growing with XML files, but stays constant with other file types.

    public class Extractor {
        private BodyContentHandler contentHandler = new BodyContentHandler(-1);
        private AutoDetectParser parser = new AutoDetectParser();
        private Metadata metadata = new Metadata();

        public String extract(File file) throws Exception {
            TikaInputStream stream = null;
            try {
                stream = TikaInputStream.get(file);
                parser.parse(stream, contentHandler, metadata);
                return contentHandler.toString();
            }
            finally {
                if (stream != null) {
                    stream.close();
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Extractor extractor = new Extractor();
        File file = new File("C:\\temp\\test.xml");
        for (int i = 0; i < 20; i++) {
            extractor.extract(file);
        }
    }
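
For reference, here is a variant with a fresh BodyContentHandler and Metadata per call. This is only a sketch based on a guess I have not verified: BodyContentHandler(-1) buffers all extracted text in memory with no write limit, so reusing one instance across iterations would make the buffered text accumulate from call to call.

    import java.io.File;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class Extractor {
        // The parser itself is stateless and safe to reuse.
        private final AutoDetectParser parser = new AutoDetectParser();

        public String extract(File file) throws Exception {
            // Fresh handler and metadata per document, so nothing
            // accumulates across calls.
            BodyContentHandler contentHandler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            try (TikaInputStream stream = TikaInputStream.get(file)) {
                parser.parse(stream, contentHandler, metadata);
                return contentHandler.toString();
            }
        }
    }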

Any idea if this is an issue with XML files or if the issue is in my code?

Thanks

Steve

Carlos A | 6 Feb 01:24 2016

Issues extracting contents of .doc and .txt files after upgrading to Tika 1.11

Hello all,

This was not an issue before but now it is.

I tried checking the manual and online resources to see what has changed so I can update my code, but with no success, hence I decided to email the users list with a detailed walk-through of my code and the debugger.

Basically I was doing the following quite successfully until 1.11:

1) First I read a file into bytes:

String originalFilename = "/MyBio.doc";

InputStream stream = this.getClass().getResourceAsStream(originalFilename);
byte[] bytes = null;
try {
    bytes = IOUtils.toByteArray(stream);
} catch (Exception e) {
    e.printStackTrace();
}

So far, so good: bytes is now filled.

The following then used to work fine, but does not anymore:


ByteArrayInputStream is = new ByteArrayInputStream(bytes);
Metadata metadata = new Metadata();
if (originalFilename.length() > 0) {
    metadata.set(Metadata.RESOURCE_NAME_KEY, originalFilename);
}
Parser parser = new AutoDetectParser(); // Should auto-detect!
StringWriter textBuffer = new StringWriter();
BodyContentHandler handler = new BodyContentHandler(textBuffer);
ParseContext context = new ParseContext();
parser.parse(is, handler, metadata, context);
// How I originally got the output
System.out.println(textBuffer.toString());
// Tried this as well; it doesn't work either
System.out.println(handler.toString());

In the debugger all is fine. The Metadata object is properly created.

I have a BodyContentHandler initialized with an empty textBuffer.

It is passed to the parser along with the ByteArrayInputStream is (which is full), the metadata and the ParseContext.

Looking inside the method parser.parse, I can see that the variables are correctly populated.

The mediaType is properly identified as application/msword

The Metadata object reads: resourceName=/MyBio.doc Content-Type=application/msword

The Stream object has the full buffer as passed on the call.

From AutoDetectParser.parse() method:

The TikaInputStream object has the stream as passed.

The MediaType object is correctly : application/msword



The SecureContentHandler is properly created at the line: 

// TIKA-216: Zip bomb prevention
SecureContentHandler sch =
        handler != null ? new SecureContentHandler(handler, tis) : null;


In the CompositeParser instance's parse() method I have:

The TikaInputStream taggedStream is correctly populated with the stream contents.

The TaggedContentHandler taggedHandler wraps the BodyContentHandler that was passed in, and it is not null.

However on the call:

if (parser instanceof ParserDecorator) {
    metadata.add("X-Parsed-By", ((ParserDecorator) parser).getWrappedParser().getClass().getName());
} else {
    metadata.add("X-Parsed-By", parser.getClass().getName());
}

It goes to the else branch and records the EmptyParser, so the Metadata object now reads:

X-Parsed-By=org.apache.tika.parser.EmptyParser resourceName=/MyBio.doc Content-Type=application/msword

No exceptions are thrown.

When the original call parser.parse(is, handler, metadata, context) returns, handler.toString() is empty, as is textBuffer.toString(). This used to work well before Tika 1.11.

I wonder if I need to do something so that the EmptyParser is not used, since this was working before.
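
From reading the code, EmptyParser appears to be the fallback that CompositeParser uses when no registered parser claims the detected media type, which makes me suspect tika-parsers (or one of its dependencies, such as Apache POI for application/msword) is missing from my classpath. Here is a small diagnostic sketch I put together to check which parser is actually mapped to the type (an assumption about the cause, not a fix):

    import java.util.Map;
    import org.apache.tika.mime.MediaType;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;

    public class ParserCheck {
        public static void main(String[] args) {
            AutoDetectParser parser = new AutoDetectParser();
            // Map from media type to the parser that would handle it;
            // with tika-parsers and POI on the classpath, application/msword
            // should map to a real parser rather than fall back to EmptyParser.
            Map<MediaType, Parser> parsers = parser.getParsers(new ParseContext());
            System.out.println(parsers.get(MediaType.parse("application/msword")));
        }
    }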

Thank you,

C.

Steven White | 5 Feb 21:40 2016

Detecting if a file type is supported or not

Hi everyone,

How do I detect if a file type is supported or not?  Also, how do I detect if a file type is supported but it cannot be processed because the parser for it is missing (the required JARs are missing)?
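
Here is a sketch of what I have in mind, under the assumption that "supported" means some registered parser advertises the detected type (with parser JARs missing, the type simply would not appear in the set); I'd welcome corrections:

    import java.io.File;
    import org.apache.tika.Tika;
    import org.apache.tika.mime.MediaType;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;

    public class SupportCheck {
        public static void main(String[] args) throws Exception {
            File file = new File(args[0]);
            // Type detection needs only tika-core.
            String type = new Tika().detect(file);
            // A parser is available only if some registered parser
            // advertises the type; missing parser JARs shrink this set.
            boolean parsable = new AutoDetectParser()
                    .getSupportedTypes(new ParseContext())
                    .contains(MediaType.parse(type));
            System.out.println(type + " parsable=" + parsable);
        }
    }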

For the missing JAR part, when I tried to parse a JAR file, I got this exception:



Thanks

Steve
Andrea Asta | 3 Feb 15:35 2016

RTF exception

Hi,
I'm getting an exception when converting an RTF document with the standard new Tika().parseToString():

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@47f4e407
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.Tika.parseToString(Tika.java:496)
    at com.expertsystem.experiments.tika.tika_test_dtra.App.main(App.java:48)
Caused by: org.apache.tika.metadata.PropertyTypeException: meta:page-count : SIMPLE
    at org.apache.tika.metadata.Metadata.add(Metadata.java:337)
    at org.apache.tika.parser.rtf.TextExtractor.processControlWord(TextExtractor.java:830)
    at org.apache.tika.parser.rtf.TextExtractor.parseControlWord(TextExtractor.java:562)
    at org.apache.tika.parser.rtf.TextExtractor.parseControlToken(TextExtractor.java:488)
    at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:450)
    at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:439)
    at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:86)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 4 more

Thanks
Andrea
Steven White | 3 Feb 01:01 2016

Using Tika that comes with Solr 5.2

Hi everyone,

I have written a standalone application that works with Solr 5.2.  I'm using the existing JARs that come with Solr to index data off a file system.  My application scans the file system looking for files, uses Tika to extract the raw text, and then sends the raw text to Solr for indexing, using SolrJ.

What I'm finding is that Tika will not extract the raw text from PDF, PowerPoint, etc. files, but it will from plain text files.

Here is the code:

public static void parseWithTika() throws Exception {
  File file = new File("C:\\temp\\test.pdf");

  FileInputStream in = new FileInputStream(file);
  AutoDetectParser parser = new AutoDetectParser();
  Metadata metadata = new Metadata();
  BodyContentHandler contentHandler = new BodyContentHandler();

  parser.parse(in, contentHandler, metadata);

  String content = contentHandler.toString();  // <=== 'content' is always an empty string

  in.close();
}

In the above code, 'content' is always empty (the code is based on https://tika.apache.org/1.8/examples.html).
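
As a sanity check, here is a variant using the Tika facade to take the handler and stream handling out of the equation (a sketch, not a confirmed diagnosis; note also that the no-argument BodyContentHandler above caps output at 100,000 characters, while BodyContentHandler(-1) removes the limit):

    import java.io.File;
    import org.apache.tika.Tika;

    public class FacadeCheck {
        public static void main(String[] args) throws Exception {
            File file = new File("C:\\temp\\test.pdf");
            // parseToString detects the type and extracts text in one call;
            // if this also returns nothing, the parser JARs (tika-parsers
            // and its dependencies, e.g. PDFBox) are the likely suspects.
            String content = new Tika().parseToString(file);
            System.out.println("extracted " + content.length() + " chars");
        }
    }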

Solr 5.2 comes with the following Tika JARs, all of which I have included: tika-core-1.7.jar, tika-java7-1.7.jar, tika-parsers-1.7.jar, tika-xmp-1.7.jar, vorbis-java-tika-0.6.jar, kite-morphlines-tika-core-0.12.1.jar and kite-morphlines-tika-decompress-0.12.1.jar

Any idea why this isn't working?

Thanks!

Steve
Giovanni Usai | 2 Feb 15:28 2016

How can I reference a custom config file for TesseractOCR without recompiling the code?

Hello,
I would gladly welcome the community's input on the following subject:

We are using Tika embedded in Solr server.

I would like to know if it is possible to pass a specific config file to TesseractOCR, as run by the Solr extractor, without recompiling any source code.

Instead of the default TesseractOCRConfig.properties packaged inside the Tika JAR, we need to use our own file that overrides some parameters.

For the moment, we have modified the Tika source code and replaced the body of the TesseractOCRConfig default constructor,
from:
init(this.getClass().getResourceAsStream("TesseractOCRConfig.properties"));
to:
init(new FileInputStream("/opt/datafari/tomcat/conf/datafari-tika-ocr.properties"));

Now we would like a cleaner solution to the problem.
I had a look at the Tika source code, and TesseractOCRConfig also has a constructor taking a parameter:

public TesseractOCRConfig(InputStream is) {
    init(is);
}

With this method of TesseractOCRParser:
public Set<MediaType> getSupportedTypes(ParseContext context) {
    // If Tesseract is installed, offer our supported image types
    TesseractOCRConfig config = context.get(TesseractOCRConfig.class, DEFAULT_CONFIG);
    if (hasTesseract(config))
        return SUPPORTED_TYPES;

    // Otherwise don't advertise anything, so the other image parsers
    //  can be selected instead
    return Collections.emptySet();
}
From this, it looks like it's possible to pass an instance of TesseractOCRConfig by means of a ParseContext.
If such an instance is set in the context, the code uses it; otherwise it creates a default instance.
The TesseractOCRConfig instance could be created with the constructor above, passing an input stream reading from our own file: /opt/datafari/tomcat/conf/datafari-tika-ocr.properties
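
For reference, here is a minimal sketch of what that would look like when calling Tika directly (the properties path is ours; whether the Solr extraction handler lets us inject such a ParseContext is exactly my question below):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.ocr.TesseractOCRConfig;
    import org.apache.tika.sax.BodyContentHandler;

    public class OcrConfigSketch {
        public static void main(String[] args) throws Exception {
            // Build the config from our own properties file instead of the
            // TesseractOCRConfig.properties packaged in the Tika JAR.
            TesseractOCRConfig config;
            try (InputStream props = new FileInputStream(
                    "/opt/datafari/tomcat/conf/datafari-tika-ocr.properties")) {
                config = new TesseractOCRConfig(props);
            }
            // TesseractOCRParser reads the config from the ParseContext,
            // falling back to DEFAULT_CONFIG when none is set.
            ParseContext context = new ParseContext();
            context.set(TesseractOCRConfig.class, config);
            try (InputStream in = new FileInputStream(args[0])) {
                new AutoDetectParser().parse(in,
                        new BodyContentHandler(-1), new Metadata(), context);
            }
        }
    }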

So, do you know if it is possible to call Tika from Solr passing a specific context?
And, if it's doable, any hints on how to do it?

FYI: we are using Tika for our open source enterprise search engine "Datafari".

Thanks and best regards,
--
Giovanni Usai
giovanni.usai-wsFEBRdYfwE9XoPSrs7Ehg@public.gmane.org


www.francelabs.com

CEEI Nice Premium
1 Bd. Maître Maurice Slama
06200 Nice FRANCE

Ph: +33 (0)9 72 43 72 85

James Brooking | 2 Feb 14:16 2016

Fwd: Issues adding custom content-type

Hello Tika People,

I am trying to add a custom content-type to Tika and am finding it difficult. I'm not sure whether the tutorial I am following is out of date, but it could be the case.

I am using Tika 1.11, which I downloaded from here: https://www.apache.org/dist/tika/tika-server-1.11.jar

Once I have this file I can successfully run it on my PC using:
java -jar tika-server-1.11.jar -h 0.0.0.0

I created a custom content-type like so:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/hello</mime-exclude>
    </parser>
    <parser class="org.apache.tika.parser.hello.HelloWorldParser">
      <mime>application/hello</mime>
    </parser>
  </parsers>
</properties>

This was saved into a file called parsers.xml.

Then I followed the example in https://tika.apache.org/1.5/parser_guide.html#Create_your_Parser_class and added the parser class.

My question is: what do I need to add to the "java -jar tika-server-1.11.jar -h 0.0.0.0" command for it to load my custom parser?
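
For reference, here is the approach I am considering (a sketch; hello-parser.jar is a hypothetical name for a JAR containing my compiled parser). Tika's DefaultParser discovers parsers through the Java service-loader mechanism, so a JAR that ships a META-INF/services/org.apache.tika.parser.Parser file naming the class should be picked up automatically once it is on the classpath. Because "java -jar" ignores the classpath, the server's main class would have to be started explicitly:

    # contents of META-INF/services/org.apache.tika.parser.Parser in hello-parser.jar:
    org.apache.tika.parser.hello.HelloWorldParser

    # start the server with both JARs on the classpath
    # (use ':' as the path separator on Linux/macOS):
    java -cp tika-server-1.11.jar;hello-parser.jar org.apache.tika.server.TikaServerCli -h 0.0.0.0

If service-loader discovery works this way, the parsers.xml above may not be needed just to get the parser found, but I'd still like to know how to make the server honor it.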

Thanks in advance,
James Brooking

Mattmann, Chris A (3980) | 25 Jan 20:58 2016

[VOTE] Apache Tika 1.12 Release Candidate #1

Hi Folks,

A first candidate for the Tika 1.12 release is available at:

  https://dist.apache.org/repos/dist/dev/tika/


The release candidate is a zip archive of the sources in:
https://git-wip-us.apache.org/repos/asf?p=tika.git;a=tag;h=203a26ba5e65db2427f9e84bc4ff31e569ae661c


The SHA1 checksum of the archive is:
30e64645af643959841ac3bb3c41f7e64eba7e5f

In addition, a staged maven repository is available here:

https://repository.apache.org/content/repositories/orgapachetika-1015/



Please vote on releasing this package as Apache Tika 1.12.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.12
[ ] -1 Do not release this package because…

Cheers,
Chris

P.S. Of course here is my +1.



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


