Alessandro Benedetti | 18 May 18:29 2015

[Date Format] Render dates in single format

Hi guys,
Is there any configuration parameter in Tika that forces all dates to be rendered in one specific format, independently of the parser?
I would like all my dates to be returned in UTC (Zulu).

I want this because I later index the dates in Solr (I am using the Apache Tika Transformation connector inside Apache ManifoldCF).

Any suggestions?
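
For illustration only, here is a minimal sketch of the kind of normalisation being asked for, done as a post-processing step outside Tika. It assumes the date value is already an ISO 8601 string with an explicit offset; the class and method names (UtcDates, toUtc) are hypothetical and are not part of Tika or ManifoldCF. Whether Tika offers a configuration parameter for this is not settled by this sketch.

```java
import java.time.Instant;
import java.time.OffsetDateTime;

// Hypothetical helper, not a Tika API: rewrites an ISO 8601 date-time
// that carries an explicit offset into its UTC ("Zulu") form.
class UtcDates {

    static String toUtc(String isoWithOffset) {
        // Parse the offset date-time and reduce it to an instant on the UTC timeline.
        Instant instant = OffsetDateTime.parse(isoWithOffset).toInstant();
        // Instant.toString() renders ISO 8601 in UTC with a trailing 'Z'.
        return instant.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUtc("2015-05-18T18:29:00+02:00")); // prints 2015-05-18T16:29:00Z
    }
}
```

Dates that arrive without an offset, or in a non-ISO layout, would need their own parsing rules before a step like this could be applied.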

--
--------------------------

Benedetti Alessandro 
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright     
In the forests of the night,     
What immortal hand or eye     
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
Stefan Alder | 13 May 21:03 2015

Embedded images in PDF - detect, extract and/or OCR

Ultimately I'm trying to (1) determine whether images, particularly full-page images, are embedded in a PDF, (2) extract the images, and/or (3) OCR the text.

Does tika-app support this?  When I run java -jar tika-app-1.8.jar test.pdf, I get all of the metadata and see <page></page> tags, but no images.

Running with -z doesn't output any images.


ViolinWoman | 11 May 22:18 2015

XML Parsing in Tika

I'm trying to parse content from an XML file that contains data that looks like this:

<metadata name="key1">value1</metadata>
<metadata name="key2">value2</metadata>
<metadata name="key3">value3</metadata>
<metadata name="key4">value4</metadata>

I'd like to add the key=value pairs to the Tika metadata.

I've tried two different things:

ContentHandler elementMetadataHandler = new ElementMetadataHandler("", "metadata", metadata, "metadata");

will pull out
    metadata=value1, metadata=value2, metadata=value3, metadata=value4

And the second one:

ContentHandler attributeHandler = new AttributeMetadataHandler("", "name", metadata, "name");

will pull out
    name=key1, name=key2, name=key3, name=key4

What I really want is this:
    key1=value1, key2=value2, key3=value3, key4=value4

Is there a way to do this using Tika's built-in parsing? If not, what do I need to do to extend the parsing for this purpose?

Thank you!
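
As an illustration of the pairing described above, here is a minimal sketch using only the JDK's SAX API. It is not one of Tika's built-in handlers: the class name NamedMetadataHandler is hypothetical, and the wrapping <doc> root element is an assumption, since the sample shows only the <metadata> elements.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not part of Tika: pairs the "name" attribute of each
// <metadata> element with that element's text content.
class NamedMetadataHandler extends DefaultHandler {
    private final Map<String, String> pairs = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();
    private String currentKey; // non-null only while inside a named <metadata> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("metadata".equals(qName)) {
            currentKey = atts.getValue("name");
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so accumulate it.
        if (currentKey != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("metadata".equals(qName) && currentKey != null) {
            pairs.put(currentKey, text.toString().trim());
            currentKey = null;
        }
    }

    Map<String, String> pairs() {
        return pairs;
    }

    // Parses an XML string with the default (non-namespace-aware) SAX parser.
    static Map<String, String> extract(String xml) throws Exception {
        NamedMetadataHandler handler = new NamedMetadataHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.pairs();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc>"
                + "<metadata name=\"key1\">value1</metadata>"
                + "<metadata name=\"key2\">value2</metadata>"
                + "</doc>";
        System.out.println(extract(xml)); // prints {key1=value1, key2=value2}
    }
}
```

Inside a custom Tika parser, the same endElement logic could presumably call metadata.add(currentKey, value) on the Tika Metadata object instead of filling a Map.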


Clemens Wyss DEV | 8 May 17:33 2015

extracting text from an "encrypted" pdf

When I try to extract text from an "encrypted" document (one that can still be read in Acrobat Reader) with:

// PDFBox used directly: load the file and strip its text
PDDocument pdfDocument = PDDocument.load( TIKA_FILES_DIR + "doc1.pdf" );
PDFTextStripper pdfStripper = new PDFTextStripper();
String parsedText = pdfStripper.getText( pdfDocument );

I get an empty string, and " o.apache.pdfbox.pdfparser.PDFParser - Document is encrypted" is logged.

When, on the other hand, I do:

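// 'is' is an InputStream opened on the same PDF
ContentHandler handler = new BodyContentHandler( -1 ); // -1 disables the write limit
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
Parser parser = new AutoDetectParser();
context.set( Parser.class, parser ); // lets embedded documents be parsed recursively
parser.parse( is, handler, metadata, context );
String parsedText = handler.toString();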

I do get some text content from that very PDF.

1) What is the preferred way to extract text from a PDF (one "that can be read in Acrobat Reader")?
2) Does the second approach possibly return more than text? Blobs? Binary data?

Tyler Palsulich | 20 Apr 23:09 2015

[ANNOUNCE] Apache Tika 1.8 Released

The Apache Tika project is pleased to announce the release of Apache Tika 1.8. The release
contents have been pushed out to the main Apache release site and to the Maven Central sync, so the releases should be available as soon as the mirrors get the syncs.

Apache Tika is a toolkit for detecting and extracting metadata and structured text content
from various documents using existing parser libraries.

Apache Tika 1.8 contains a number of improvements and bug fixes. Details can be found in the changes file: http://www.apache.org/dist/tika/CHANGES-1.8.txt

Apache Tika is available in source form from the following download page: http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.8-src.zip

Apache Tika is also available in binary form or for use with Maven 2 from the Central Repository: http://repo1.maven.org/maven2/org/apache/tika/

In the initial 48 hours, the release may not be available on all mirrors. When downloading from a mirror site, please remember to verify the downloads using signatures found on the Apache site: https://people.apache.org/keys/group/tika.asc

For more information on Apache Tika, visit the project home page: http://tika.apache.org/

-- Tyler Palsulich, on behalf of the Apache Tika community
Tyler Palsulich | 20 Apr 22:09 2015

[RESULT] [VOTE] Apache Tika 1.8 Release Candidate #2

Hi Everyone,

The VOTE to release Tika 1.8 RC #2 has passed with the following tally:

+1:
Chris Mattmann
Hong-Thai Nguyen
Konstantin Gribov
Lewis John Mcgibbney
Oleg Tikhonov
Tim Allison
Tyler Palsulich

±0:
None

-1:
None

I'll move forward with the release process now.

Thank you all for your VOTE and collaboration,
Tyler
Tyler Palsulich | 13 Apr 19:56 2015

[VOTE] Apache Tika 1.8 Release Candidate #2

Hi Folks,

A candidate for the Tika 1.8 release is available at:

The release candidate is a zip archive of the sources in:

The SHA1 checksum of the archive is
  5e22fee9079370398472e59082d171ae2d7fdd31.
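
For anyone checking the archive by hand, here is a minimal sketch of computing a SHA-1 digest with the JDK and comparing it to the value above. The class name Sha1Check is hypothetical, and the archive path is taken from the command line because the message does not name the file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: computes the SHA-1 digest of a stream as lowercase hex.
class Sha1Check {

    static String sha1Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[8192];
        int read;
        // Feed the stream to the digest in chunks so large archives are not held in memory.
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: local path of the downloaded source archive (supplied by the caller).
        String expected = "5e22fee9079370398472e59082d171ae2d7fdd31";
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String actual = sha1Hex(in);
            System.out.println(actual.equals(expected)
                    ? "checksum OK"
                    : "checksum MISMATCH: " + actual);
        }
    }
}
```

A matching checksum only shows the download is intact; verifying the release signature is a separate step.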

In addition, a staged maven repository is available here:

Please vote on releasing this package as Apache Tika 1.8. The vote is open for the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.8
[ ] ±0 I don't object to this release, but I haven't checked it
[ ] -1 Do not release this package because...

Thanks,
Tyler
Merrill, Jeremy | 10 Apr 20:46 2015

Detecting standards-non-compliant emails as message/rfc822

Hi friends,

*tl;dr*: I've added an extra line to tika-mimetypes.xml for detecting certain rfc822 non-compliant emails that are exported by a certain U.S. politician's email server. Would this be useful to add to the official Tika repo? https://github.com/jeremybmerrill/tika/commit/32931d3438b868c2d2bcea754236756944ab5eb7

longer version:

Big fan of Tika -- we're using it a fair amount here to do document search for emails/files we receive in big dumps from various public officials.

These dumps frequently come directly from these officials' mailservers. I believe that because the dumps are not intended to be transmitted over the wire, they are sometimes slightly non-compliant. Many begin with the non-standard RFC 822 header `Status: `.

It's important to note that Tika (and the underlying library, James Mime4J) does properly parse these emails, despite the non-compliant header. The problem is getting Tika to *detect* the file as an email so that Mime4J gets chosen to parse it.

Tika does not properly detect these emails as `message/rfc822`. I've added `Status: ` as a magic detection line in tika-mimetypes.xml. This solves my problem and does not appear to cause test failures. Perhaps there's another, easier solution? Also, I don't know if it'll cause problems for other people or whether it would be useful to them -- that's why I'm asking you. If it is, I'd be happy to contribute it as a patch. Please let me know.
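
To make the idea concrete, here is a minimal sketch of the check that such a magic line amounts to: does the file begin, at offset 0, with the bytes `Status: `? This is plain JDK code for illustration and is not how Tika implements magic matching; the class name StatusHeaderSniffer is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration, not Tika code: tests whether a stream starts
// with the non-standard "Status: " header described above.
class StatusHeaderSniffer {
    private static final byte[] PREFIX = "Status: ".getBytes(StandardCharsets.US_ASCII);

    // Note: this consumes the leading bytes; a real detector would mark and reset the stream.
    static boolean startsWithStatusHeader(InputStream in) throws IOException {
        byte[] head = new byte[PREFIX.length];
        int filled = 0;
        // read() may return fewer bytes than requested, so loop until the prefix length is filled.
        while (filled < head.length) {
            int n = in.read(head, filled, head.length - filled);
            if (n == -1) {
                return false; // stream is shorter than the prefix
            }
            filled += n;
        }
        return Arrays.equals(head, PREFIX);
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "Status: RO\r\nFrom: someone@example.com\r\n\r\nbody"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(startsWithStatusHeader(new ByteArrayInputStream(sample))); // prints true
    }
}
```

A prefix this short could also match non-email text files that happen to start with the same word, which is the kind of side effect worth checking before adopting the rule more widely.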

---
Jeremy B. Merrill
The New York Times

Kovalan R | 24 Mar 08:07 2015

Tika - Tesseract integration

I would like to know whether the Tika parser and Tesseract can be integrated.
If so, please point me to some clear documentation.

Or

is there any way to use the Tika parser to handle files that need OCR?

Thanks in advance

Dmitry Minkovsky | 10 Mar 02:59 2015

Facade uses the EmptyParser despite correct type detection

I am trying to use the Tika facade. Here's my test code:


Tika tika = new Tika();
Metadata md = new Metadata();

try {
    String content = tika.parseToString(src, md, 100000);

    System.out.println("Content length: " + content.length());

    for (String s : md.names()) {
        System.out.println(s + ": " + md.get(s));
    }
} catch (TikaException e) {
    System.out.println(e);
}


Here's the output:

> Content length: 0
> X-Parsed-By: org.apache.tika.parser.EmptyParser
> Content-Type: text/html

So:
 
* If Tika correctly identifies the input as text/html, why does it use the EmptyParser?
* If I'm supposed to pass a parser, which parser should I pass for best results, assuming that autodetection succeeds, as it seems to above?

Thank you,
Dmitry
Nick Burch | 6 Mar 07:33 2015

Tika at ApacheCon in Austin, 13-16 April

Hi All

As many of you will hopefully know, the next ApacheCon takes place in
April in Austin, Texas. What you may not know is quite how many
Tika-related talks we have taking place.

If you take a look at the schedule, you'll discover there are 6 different 
talks on or related to Apache Tika this time!
http://apacheconna2015.sched.org/?s=tika&iframe=no

In addition, with that many talks, there will be lots of Tika developers 
around and about at the conference. We're almost certain to do a 
hackathon, along with probably trying out some new ideas, discussing and 
writing up plans on the wiki etc. With all that, it will be a great chance 
for anyone interested in Tika to get more involved, get mentoring on 
bringing in new parsers etc. If you've been looking for a chance to take 
your Tika use or participation up to the next level, this is probably your 
best chance of the year :)

You can find more information on the conference, including location, 
registration details, full schedule etc on the conference website
http://events.linuxfoundation.org/events/apachecon-north-america

We'll hopefully see lots of you in Austin!

Thanks
Nick