
font metrics issue

Hi all,
I'm extracting text from PDF files. As a result, some important words end up with spaces between their characters. For example, the word "Subtitle", which I want to detect, might come out as "S u b t i t l e". If I parse the text with a standard tokenizer, the word is lost.
I think (after consulting the Solr list) that this might be related to fonts.
Is there any way to cope with this through Tika configuration?
Many thanks,

Francisco
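
If no extraction-side setting helps, one possible fallback is to repair the text after extraction. The sketch below is a hypothetical post-processing helper (SpacedLetterJoiner is an invented name, not part of Tika or Solr). It assumes the broken words appear as runs of three or more single letters separated by exactly one space. It will also merge genuine single-letter sequences, and two spaced-out words separated by only one space, so treat it as a heuristic.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a Tika API): rejoins words extracted as "S u b t i t l e". */
final class SpacedLetterJoiner {
    // One letter followed by at least two more " letter" pairs,
    // with no other letter directly before or after the run.
    private static final Pattern SPACED_RUN =
            Pattern.compile("(?<!\\p{L})\\p{L}(?: \\p{L}){2,}(?!\\p{L})");

    static String join(String text) {
        Matcher m = SPACED_RUN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Drop the single spaces inside the matched run.
            String joined = m.group().replace(" ", "");
            m.appendReplacement(out, Matcher.quoteReplacement(joined));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "Subtitle of the report"
        System.out.println(join("S u b t i t l e of the report"));
    }
}
```

Running this before the tokenizer would let "Subtitle" survive as a single token; ordinary text without such runs passes through unchanged.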
Steven White | 19 Feb 18:05 2016

Removing cryptographic JARs from Tika

Hi,

Is there a wiki or a set of instructions on how to remove cryptographic software from Tika?  Is it enough to simply remove the Bouncy Castle libraries from Tika's JAR and still be able to use Tika to its full capacity, minus encrypted files?

I have no need to extract text from encrypted files, but because Tika includes cryptographic JARs, I won't be able to use it.

Thank you!!

Steve
raghu vittal | 19 Feb 10:37 2016

Unable to extract content from chunked portion of large file

Hi All

We have very large PDF, .docx and .xlsx files. We are using Tika to extract content and load the data into Elasticsearch for full-text search.
Sending very large files to Tika causes an out-of-memory exception.

We want to chunk each file and send the chunks to Tika for content extraction. When we passed a chunk of a file to Tika, it returned empty text.
I assume Tika relies on the file structure, which is why it returns no content.

We are using Tika Server (REST API) in our .NET application.

Please suggest a better approach for this scenario.

Regards,
Raghu.

Sreenivasa Kallu | 17 Feb 00:34 2016

Tika is unable to extract Outlook messages

Hi,
I am currently indexing individual Outlook messages, and searching is working fine.
I created the Solr core using the following command.
 ./solr create -c sreenimsg1 -d data_driven_schema_configs

I am using the following command to index individual messages.

This setup is working fine.

But the new requirement is to extract messages from an Outlook PST file.
I tried the following command to extract messages from an Outlook PST file.


This command extracts only the high-level tags and merges all messages into one message. I am not getting all the tags that I got when extracting individual messages. Is the above command correct? Is the problem that it does not use recursion? How do I add recursion to the above command? Is it a Tika library problem?

Please help me solve the above problem.

Thanks in advance.
--sreenivasa kallu
Chris Mattmann | 15 Feb 19:45 2016

[ANNOUNCE] Apache Tika 1.12 release

The Apache Tika project is pleased to announce the release of Apache
Tika 1.12. The release contents have been pushed out to the main
Apache release site and to the Central sync, so the releases should
be available as soon as the mirrors get the syncs.

Apache Tika is a toolkit for detecting and extracting metadata and
structured text content from various documents using existing parser
libraries.

Apache Tika 1.12 contains a number of improvements and bug fixes.
Details can be found in the changes file:
http://www.apache.org/dist/tika/CHANGES-1.12.txt

Apache Tika is available in source form from the following download
page: http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.12-src.zip

Apache Tika is also available in binary form or for use using Maven
2 from the Central Repository:
http://repo1.maven.org/maven2/org/apache/tika/

In the initial 48 hours, the release may not be available on all
mirrors. When downloading from a mirror site, please remember to
verify the downloads using signatures found on the Apache site:
https://people.apache.org/keys/group/tika.asc

For more information on Apache Tika, visit the project home page:
http://tika.apache.org/

— Chris Mattmann, on behalf of the Apache Tika community
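
The announcement asks downloaders to verify what they fetch from a mirror. Signature verification itself is a PGP step against the keys linked above. As a complementary integrity check, a published checksum (the vote thread further down lists a SHA1 value for the release candidate archive) can be compared with a locally computed one. The sketch below shows one way to do that with the JDK alone. The class name Sha1Check and the command-line arguments are illustrative assumptions, and a matching checksum does not replace checking the signature.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative helper (not part of Tika): computes a SHA-1 checksum as lowercase hex. */
final class Sha1Check {
    static String sha1Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[8192];
        int read;
        // Stream the input through the digest so large archives are never held in memory.
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: local path of the downloaded archive; args[1]: the published checksum.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String actual = sha1Hex(in);
            System.out.println(actual.equalsIgnoreCase(args[1]) ? "OK" : "MISMATCH: " + actual);
        }
    }
}
```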

Mattmann, Chris A (3980 | 15 Feb 18:02 2016

[RESULT] [VOTE] Apache Tika 1.12 Release Candidate #1

Team,

Sorry for the long delay. This VOTE has PASSED with the following
tallies:

+1
Chris Mattmann*
Markus Jelsma
Oleg Tikhonov*
Ken Krugler*
Tim Allison*
Konstantin Gribov*
David Meikle*
Lewis John McGibbney*
Tyler Palsulich*

* - Tika PMC

I’ll go update the website and update the mirrors and complete
the rest of the tasks.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann <at> nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: jpluser <chris.a.mattmann <at> jpl.nasa.gov>
Date: Monday, January 25, 2016 at 11:57 AM
To: "user <at> tika.apache.org" <user <at> tika.apache.org>, "dev <at> tika.apache.org"
<dev <at> tika.apache.org>
Subject: [VOTE] Apache Tika 1.12 Release Candidate #1

>Hi Folks,
>
>A first candidate for the Tika 1.12 release is available at:
>
>  https://dist.apache.org/repos/dist/dev/tika/
>
>The release candidate is a zip archive of the sources in:
>https://git-wip-us.apache.org/repos/asf?p=tika.git;a=tag;h=203a26ba5e65db2427f9e84bc4ff31e569ae661c
>
>
>The SHA1 checksum of the archive is:
>30e64645af643959841ac3bb3c41f7e64eba7e5f
>
>In addition, a staged maven repository is available here:
>
>https://repository.apache.org/content/repositories/orgapachetika-1015/
>
>
>Please vote on releasing this package as Apache Tika 1.12.
>The vote is open for the next 72 hours and passes if a majority of at
>least three +1 Tika PMC votes are cast.
>
>[ ] +1 Release this package as Apache Tika 1.12
>[ ] -1 Do not release this package because…
>
>Cheers,
>Chris
>
>P.S. Of course here is my +1.
>
>
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattmann <at> nasa.gov
>WWW:  http://sunset.usc.edu/~mattmann/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>

Melissa Warnkin | 11 Feb 19:23 2016

ApacheCon NA 2016 - Important Dates!!!

Hello everyone!

I hope this email finds you well.  I hope everyone is as excited about ApacheCon as I am!

I'd like to remind you all of a couple of important dates, as well as ask for your assistance in spreading the word! Please use your social media platform(s) to get the word out! The more visibility, the better ApacheCon will be for all!! :)

CFP Close: February 12, 2016
CFP Notifications: February 29, 2016
Schedule Announced: March 3, 2016



Apache: Big Data North America 2016 Registration Fees:

Attendee Registration Fee: US$599 through March 6, US$799 through April 10, US$999 thereafter
Committer Registration Fee: US$275 through April 10, US$375 thereafter
Student Registration Fee: US$275 through April 10, US$375 thereafter

Planning to attend ApacheCon North America 2016 May 11 - 13, 2016? There is an add-on option on the registration form to join the conference for a discounted fee of US$399, available only to Apache: Big Data North America attendees.

So, please tweet away!!

I look forward to seeing you in Vancouver! Have a groovy day!!

~Melissa
on behalf of the ApacheCon Team


Steven White | 11 Feb 00:04 2016

Using tika-app-1.11.jar

Hi everyone,

I'm including tika-app-1.11.jar with my application and see that Tika includes "slf4j".  This conflicts with my own "slf4j".  If I remove it from Tika's JAR, will that cause any issues?

I tested after removing it and didn't see any issues, but I would like to hear from the community.

As a side note, when I removed "slf4j", the logs in my Eclipse console turned red (they used to be black before I integrated tika-app-1.11.jar).  If I leave it in, I still get red output, but obviously I also get the "SLF4J: Class path contains multiple SLF4J bindings" message.

Thanks

Steve
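
To see which JARs contribute to the "multiple SLF4J bindings" message, it can help to list every copy of the binding class on the classpath. The sketch below uses only the JDK. It assumes an SLF4J 1.x setup, where each binding ships org/slf4j/impl/StaticLoggerBinder.class; the class name ClasspathCopies is made up for illustration.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

/** Illustrative helper: lists every classpath location that provides a given resource. */
final class ClasspathCopies {
    static List<URL> find(String resourcePath) throws IOException {
        ClassLoader loader = ClasspathCopies.class.getClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        // getResources returns one URL per JAR or directory containing the resource.
        return Collections.list(loader.getResources(resourcePath));
    }

    public static void main(String[] args) throws IOException {
        // Assumption: SLF4J 1.x, where each binding contains this class.
        for (URL url : find("org/slf4j/impl/StaticLoggerBinder.class")) {
            System.out.println(url);
        }
    }
}
```

More than one printed URL means more than one binding is visible; the URLs show which JARs they come from.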
Steven White | 8 Feb 19:37 2016

Preventing OutOfMemory exception

Hi everyone,

I'm integrating Tika with my application and need your help to figure out if the OOM I'm getting is due to the way I'm using Tika or if it is an issue with parsing XML files.

The following example code causes an OOM on the 7th iteration with -Xmx2g.  The test passes with -Xmx4g.  The XML file I'm trying to parse is 51 MB in size.  I do not see this issue with the other file types that I have tested so far.  Memory usage keeps growing with XML files, but stays constant with other file types.

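    import java.io.File;

    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class Extractor {
        // A write limit of -1 disables BodyContentHandler's character limit.
        private BodyContentHandler contentHandler = new BodyContentHandler(-1);
        private AutoDetectParser parser = new AutoDetectParser();
        private Metadata metadata = new Metadata();

        public String extract(File file) throws Exception {
            // try-with-resources closes the stream even if parsing fails.
            try (TikaInputStream stream = TikaInputStream.get(file)) {
                parser.parse(stream, contentHandler, metadata);
                return contentHandler.toString();
            }
        }

        public static void main(String[] args) throws Exception {
            Extractor extractor = new Extractor();
            File file = new File("C:\\temp\\test.xml");
            for (int i = 0; i < 20; i++) {
                extractor.extract(file);
            }
        }
    }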

Any idea if this is an issue with XML files or if the issue is in my code?

Thanks

Steve
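
One thing that may be worth ruling out in the code above: contentHandler is a field, so all 20 calls write into the same handler. The sketch below is not Tika code. It uses a plain StringWriter as a stand-in, on the assumption that BodyContentHandler(-1) buffers all extracted text internally in a similar way. It only illustrates how a shared buffer grows with every call while a per-call buffer does not; whether this explains the OOM in the post is not confirmed here.

```java
import java.io.StringWriter;

/** Stand-in demo (no Tika involved): shared buffer versus a fresh buffer per call. */
final class BufferReuseDemo {
    // Plays the role of a handler stored in a field and reused across calls.
    private final StringWriter shared = new StringWriter();

    int extractWithSharedBuffer(String documentText) {
        shared.write(documentText);          // appends to everything written before
        return shared.toString().length();
    }

    static int extractWithFreshBuffer(String documentText) {
        StringWriter fresh = new StringWriter();   // new buffer for every call
        fresh.write(documentText);
        return fresh.toString().length();
    }

    public static void main(String[] args) {
        BufferReuseDemo demo = new BufferReuseDemo();
        String doc = "0123456789";
        for (int i = 1; i <= 3; i++) {
            // prints "10 vs 10", then "20 vs 10", then "30 vs 10"
            System.out.println(demo.extractWithSharedBuffer(doc) + " vs " + extractWithFreshBuffer(doc));
        }
    }
}
```

If the same effect applies to the extractor, creating a new BodyContentHandler and Metadata inside extract() for each file would keep memory use per call bounded by a single document.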

Carlos A | 6 Feb 01:24 2016

Issues extracting contents of .doc and .txt files after upgrading to Tika 1.11

Hello all,

This was not an issue before but now it is.

I checked the manual and searched online to see what has changed so that I could update my code, but without success, so I decided to email the users list with a detailed walkthrough of my code and the debugger.

Basically I was doing the following quite successfully until 1.11:

1) First I read a file into bytes:

String originalFilename = "/MyBio.doc";

InputStream stream = this.getClass().getResourceAsStream(originalFilename);
byte[] bytes;
try {
    bytes = IOUtils.toByteArray(stream);
} catch (Exception e) {
    e.printStackTrace();
}

So far, so good, as bytes is now filled.

The next step used to work fine, but no longer does.


ByteArrayInputStream is = new ByteArrayInputStream(bytes);
Metadata metadata = new Metadata();
if (originalFilename.length() > 0) {
    metadata.set(Metadata.RESOURCE_NAME_KEY, originalFilename);
}
Parser parser = new AutoDetectParser(); // Should auto-detect!
StringWriter textBuffer = new StringWriter();
BodyContentHandler handler = new BodyContentHandler(textBuffer);
ParseContext context = new ParseContext();
parser.parse(is, handler, metadata, context);
// How I originally got the output
System.out.println(textBuffer.toString());
// Tried this as well; it doesn't work either
System.out.println(handler.toString());

In the debugger everything looks fine. The Metadata object is properly created.

I have a BodyContentHandler initialized with an empty textBuffer.

It is passed to the parser together with the ByteArrayInputStream is (which is full), the metadata and the ParseContext.

Looking inside the method parser.parse, I can see that the variables are correctly populated.

The mediaType is properly identified as application/msword

The Metadata object has resourceName=/MyBio.doc Content-Type=application/msword

The stream object has the full buffer as passed in the call.

From AutoDetectParser.parse() method:

The TikaInputStream object has the stream as passed.

The MediaType object is correctly : application/msword



The SecureContentHandler is properly created at the line: 

// TIKA-216: Zip bomb prevention
SecureContentHandler sch =
    handler != null ? new SecureContentHandler(handler, tis) : null;


In the parse() method of the CompositeParser instance I have:

TikaInputStream taggedStream correctly populated with the stream contents.

TaggedContentHandler taggedHandler gets the BodyContentHandler object passed in, and it is not null.

However, at this code:

if (parser instanceof ParserDecorator) {
    metadata.add("X-Parsed-By", ((ParserDecorator) parser).getWrappedParser().getClass().getName());
} else {
    metadata.add("X-Parsed-By", parser.getClass().getName());
}

It goes to the else branch and records the EmptyParser, so the Metadata object now reads:

X-Parsed-By=org.apache.tika.parser.EmptyParser resourceName=/MyBio.doc Content-Type=application/msword

No exceptions

When the original call above, parser.parse(is, handler, metadata, context), returns, both handler.toString() and textBuffer.toString() are empty. It used to work really well before Tika 1.11.

I wonder if I need to do something so that the EmptyParser is not used, as it was working before.

Thank you,

C.
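
A quick diagnostic that may help here: Tika discovers its parsers through service registration files named META-INF/services/org.apache.tika.parser.Parser. The JDK-only sketch below lists the parser classes registered on the current classpath. The class name ServiceFileLister is invented for illustration, and the link to the EmptyParser result is an assumption, not something established in the post: an empty list would be consistent with no parser implementations being visible, for example if the parsers JAR is missing or the service files were lost when JARs were repackaged.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative helper: lists provider classes named in META-INF/services files. */
final class ServiceFileLister {
    /** Reads provider class names from one service file, skipping comments and blank lines. */
    static List<String> readProviders(Reader source) throws IOException {
        List<String> providers = new ArrayList<>();
        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);   // strip trailing comment
            }
            line = line.trim();
            if (!line.isEmpty()) {
                providers.add(line);
            }
        }
        return providers;
    }

    /** Collects providers from every matching service file visible to the class loader. */
    static List<String> providersOnClasspath(String serviceInterface) throws IOException {
        List<String> all = new ArrayList<>();
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        String path = "META-INF/services/" + serviceInterface;
        for (URL url : Collections.list(loader.getResources(path))) {
            try (Reader in = new InputStreamReader(url.openStream(), StandardCharsets.UTF_8)) {
                all.addAll(readProviders(in));
            }
        }
        return all;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(providersOnClasspath("org.apache.tika.parser.Parser"));
    }
}
```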

Steven White | 5 Feb 21:40 2016

Detecting if a file type is supported or not

Hi everyone,

How do I detect if a file type is supported or not?  Also, how do I detect if a file type is supported but it cannot be processed because the parser for it is missing (the required JARs are missing)?

For the missing JAR part, when I tried to parse a JAR file, I got this exception:



Thanks

Steve
