imyuka | 9 Oct 14:22 2014

Formatted Content Extraction and Title Detection

Hi all,

    Here is my problem: I have extracted plain text from a series of doc(x) documents, along with their titles via the "dc:title" metadata field, but I'm not sure this is the right way to obtain a document's title. In many cases the title inside a document is the text with the largest font size and bold style, which I would like to use to extract the actual title. However, I have no idea how to get formatted content or how to detect font size and bold style. Please let me know if I am missing something.
    Thank you very much!
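
The heuristic described in this post (largest font size wins, bold breaks ties) can be sketched independently of any parsing library. The sketch below assumes the paragraphs have already been read out of the document together with a font size and a bold flag; `TitleGuesser`, `StyledParagraph` and `guessTitle` are hypothetical names for illustration, not part of Tika. For .docx input, Apache POI's XWPF run objects expose font size and bold properties that could be used to populate such a structure.

```java
import java.util.ArrayList;
import java.util.List;

class TitleGuesser {
    /** Hypothetical holder for one paragraph and its dominant formatting. */
    static final class StyledParagraph {
        final String text;
        final double fontSize;
        final boolean bold;

        StyledParagraph(String text, double fontSize, boolean bold) {
            this.text = text;
            this.fontSize = fontSize;
            this.bold = bold;
        }
    }

    /**
     * Returns the text of the paragraph with the largest font size.
     * On equal size a bold paragraph beats a non-bold one; otherwise
     * the earliest paragraph wins. Returns null if nothing qualifies.
     */
    static String guessTitle(List<StyledParagraph> paragraphs) {
        StyledParagraph best = null;
        for (StyledParagraph p : paragraphs) {
            if (p.text == null || p.text.trim().isEmpty()) {
                continue; // skip empty paragraphs
            }
            boolean larger = best == null || p.fontSize > best.fontSize;
            boolean sameSizeButBold = best != null
                    && p.fontSize == best.fontSize && p.bold && !best.bold;
            if (larger || sameSizeButBold) {
                best = p;
            }
        }
        return best == null ? null : best.text.trim();
    }

    public static void main(String[] args) {
        List<StyledParagraph> doc = new ArrayList<>();
        doc.add(new StyledParagraph("Draft, do not distribute", 10, false));
        doc.add(new StyledParagraph("Annual Report", 24, true));
        doc.add(new StyledParagraph("Body text follows here.", 11, false));
        System.out.println(guessTitle(doc));
    }
}
```

This is only a heuristic: a document whose largest text is a cover-page logo or a pull quote would fool it, so falling back to "dc:title" when the two disagree may still be worthwhile.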
Can Duruk | 9 Oct 02:59 2014

Customizing Metadata Keys

Hi all,

My question is regarding setting the metadata keys coming from the parsers to my own keys.

For my application, I am using Tika to extract the metadata for a bunch of files. I am using the embedded HTTP server, which I modified for my needs to return JSON instead of CSV. (I am hoping to submit that as a patch soon.)

However, the keys in the JSON are all in different formats and I need them to conform to my own requirements.

For example, in this redacted sample, this is what I get:

{
  "meta:author": "Maxim Valyanskiy",
}


However, what I need is this:

{
  "my_author_key": "Maxim Valyanskiy",
}

I have a bunch (several dozens) of these modifications I need to make on the metadata keys in various places.

What is the best way to approach this problem? I've thought about extending each of the parsers, but that seems a bit too decentralized. Ideally it'd be something I can manage in a single file.
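
One centralized option is to leave the parsers alone and rename keys in a single post-processing step just before serialization, driven by one mapping table. The sketch below works on a plain `Map<String, String>` rather than Tika's `Metadata` class so that it stays self-contained; the `KeyRemapper` class and the mapping entries are assumptions for illustration, and in practice the table could be loaded from a single properties file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class KeyRemapper {
    // Parser-supplied key -> application-specific key.
    private final Map<String, String> renames;

    KeyRemapper(Map<String, String> renames) {
        this.renames = new LinkedHashMap<>(renames);
    }

    /** Returns a copy of the metadata with mapped keys renamed; other keys pass through. */
    Map<String, String> remap(Map<String, String> metadata) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            String newKey = renames.getOrDefault(e.getKey(), e.getKey());
            out.put(newKey, e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> renames = new LinkedHashMap<>();
        renames.put("meta:author", "my_author_key"); // example mapping only

        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("meta:author", "Maxim Valyanskiy");
        extracted.put("Content-Type", "application/pdf");

        System.out.println(new KeyRemapper(renames).remap(extracted));
    }
}
```

Keeping the rename step outside the parsers means a parser upgrade cannot silently undo it, at the cost of one extra pass over each metadata set.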

Thanks a lot in advance.
Harsh Singh | 29 Sep 04:14 2014

Tika - XHTML to Json

Hi All,

I am trying to build an XHTML to JSON parser in Tika. After some research, I decided to go with ToTextContentHandler to parse the XHTML data and convert it to JSON. For this I am mostly overriding the methods of ToTextContentHandler to create the custom JSON files.
So I was wondering whether this approach is appropriate, or whether I should look at a more generic approach such as using ContentHandler directly, or try some other content handlers?
Any suggestion or comments are highly appreciated.
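
Since Tika delivers its XHTML as SAX events, the more generic route would be a plain `org.xml.sax.ContentHandler`. A minimal sketch of that idea follows, using only the JDK's SAX API and feeding it an XHTML string instead of real Tika output. The class name `XhtmlToJsonHandler` and the flat `[{"tag":...,"text":...}]` layout are assumptions for illustration; a real converter would likely want nesting and attributes as well.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class XhtmlToJsonHandler extends DefaultHandler {
    // One text buffer per currently open element.
    private final Deque<StringBuilder> textStack = new ArrayDeque<>();
    private final List<String> entries = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        textStack.push(new StringBuilder());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!textStack.isEmpty()) {
            textStack.peek().append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Emit an entry only for elements that directly contain text.
        String text = textStack.pop().toString().trim();
        if (!text.isEmpty()) {
            entries.add("{\"tag\":\"" + escape(qName) + "\",\"text\":\"" + escape(text) + "\"}");
        }
    }

    String toJson() {
        return "[" + String.join(",", entries) + "]";
    }

    /** Escapes the characters that JSON strings require to be escaped. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Parses a well-formed XHTML string and returns the JSON text. */
    static String convert(String xhtml) {
        XhtmlToJsonHandler handler = new XhtmlToJsonHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XHTML", e);
        }
        return handler.toJson();
    }

    public static void main(String[] args) {
        System.out.println(convert("<html><body><h1>Title</h1><p>Some text</p></body></html>"));
    }
}
```

Because the handler only depends on SAX events, the same class could in principle be passed to a Tika parser in place of a text handler.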

Best Regards,
Harsh Singh

Vineet Ghatge Hemantkumar | 26 Sep 05:05 2014

Apache Tika - JSON?

Hello all, 

I was wondering if there is any built-in parser to help with conversion from XHTML to JSON.

My research showed that there is one named org.apache.io.json, which has just one method implemented. I also tried the GJSON library to do this, but it does not seem to work with Tika. Any suggestions would be appreciated.

Regards,
Vineet
Nick Burch | 23 Sep 00:21 2014

Tika at ApacheCon Europe - 2 months time!

Hi All

It's only 2 months to go until ApacheCon Europe in Budapest. I'm 
simultaneously excited by all the great Tika stuff going on, and worried 
by how many talks I need to finish writing...

As usual for an ApacheCon, we've a number of talks about Tika going on, 
and almost certainly a hackathon and/or meetup one evening. There's also 
lots of related talks too, covering technologies that Tika builds on, and 
ones you can use Tika with. For a full schedule, see:
http://events.linuxfoundation.org/events/apachecon-europe/program/schedule

If you're based in Europe, and involved in Tika, we'd love to see you 
there in November!

For those who can't make it to Budapest, we hope to have a similar level 
of talks at ApacheCon US 2015 in Austin, Texas in April, so save the date! 
http://events.linuxfoundation.org/events/apachecon-north-america

Nick

Mugat Gurkowsky | 13 Sep 15:05 2014

very large xml-file parsing

hallo,

i am trying to use tika in combination with lucene to parse and index very large xml files. so far without success, because of memory limitations. tika's BodyContentHandler seems to try to copy the whole content into memory, which doesn't work as the files are several gigabytes large.

is there a way of getting around this problem? can i use any other handler which can deal with streams?
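
A handler that forwards text as it arrives, instead of accumulating it, keeps memory use bounded regardless of file size. Tika's `BodyContentHandler` also has constructors that accept a `Writer` or an `OutputStream`, which may already be enough. The sketch below uses only the JDK's SAX API; `ChunkingHandler`, the chunk size, and the `Consumer` callback (standing in for something like adding one Lucene document per chunk) are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ChunkingHandler extends DefaultHandler {
    private final int chunkSize;
    private final Consumer<String> sink;
    // Holds at most chunkSize characters plus one SAX callback's worth of text.
    private final StringBuilder buffer = new StringBuilder();

    ChunkingHandler(int chunkSize, Consumer<String> sink) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        // Hand off every full chunk immediately so the buffer stays small.
        while (buffer.length() >= chunkSize) {
            sink.accept(buffer.substring(0, chunkSize));
            buffer.delete(0, chunkSize);
        }
    }

    @Override
    public void endDocument() {
        // Flush whatever is left at the end of the document.
        if (buffer.length() > 0) {
            sink.accept(buffer.toString());
            buffer.setLength(0);
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><item>first entry</item><item>second entry</item></root>";
        ChunkingHandler handler =
                new ChunkingHandler(8, chunk -> System.out.println("[" + chunk + "]"));
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
    }
}
```

Fixed-size chunks can split a word across two chunks; splitting on element boundaries instead would avoid that if the xml has a natural record structure.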

thanks in advance
zenpunk
Devaraja Swami | 9 Sep 04:12 2014

HTML parsing error with <a> tag inside <h1> tag

In the following HTML document, the <a> is inside the <h1> tag which is inside the <p> tag:
-------------------
<!DOCTYPE html>
<html>
<body>
<div><h1><a href="http://www.google.com">GOOGLE!</a></h1></div>
</body>
</html>
-------------------
But when I parse it with Tika 1.5 HtmlParser, 
it adds both the <a> and <h1> tag nodes as direct children of the <p> tag.

The same error happens when I replace the <h1> tag with other header tags <h2> ... <h5>, and/or the <p> tag with a <div> or <span> tag.
[Haven't experimented with other replacements].

This seems to be a basic issue.
Any help would be deeply appreciated.

Cheers,
Devarajan

Matthew Caruana Galizia | 8 Sep 21:51 2014

Permission to make wiki edits for MatthewCaruana

Hi,

I've created API bindings for nodejs and would like to add the project to the Tika wiki's list of bindings.

Here's the project page: https://github.com/mattcg/node-tika

Matthew
Chris Mattmann | 5 Sep 22:48 2014

[ANNOUNCE] Apache Tika 1.6 release

The Apache Tika project is pleased to announce the release of Apache Tika 1.6.
The release contents have been pushed out to the main Apache release site and
to the Maven Central sync, so the releases should be available as soon as the
mirrors get the syncs.

Apache Tika is a toolkit for detecting and extracting metadata and structured
text content from various documents using existing parser libraries.

Apache Tika 1.6 contains a number of improvements and bug fixes. Details can
be found in the changes file:
http://www.apache.org/dist/tika/CHANGES-1.6.txt

Apache Tika is available in source form from the following download page:
http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.6-src.zip

Apache Tika is also available in binary form or for use using Maven 2 from the
Central Repository:
http://repo1.maven.org/maven2/org/apache/tika/

In the initial 48 hours, the release may not be available on all mirrors. When
downloading from a mirror site, please remember to verify the downloads using
signatures found on the Apache site:
https://people.apache.org/keys/group/tika.asc

For more information on Apache Tika, visit the project home page:
http://tika.apache.org/

-- Chris Mattmann, on behalf of the Apache Tika community

Mattmann, Chris A (3980 | 5 Sep 05:53 2014

Waiting for infra to create new release area

Hey Guys,

To get with the times, I'm having infra create us new release and
staging areas via the new way:

http://www.apache.org/dev/release.html#upload-ci

Issue here:

https://issues.apache.org/jira/browse/INFRA-8309

I won't be able to push the release until this is done.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@...
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Mattmann, Chris A (3980 | 5 Sep 05:48 2014

[RESULT] [VOTE] Release Apache Tika 1.6 RC #2

Hi Everyone,

The VOTE has passed with the following tallies:

+1 PMC

Chris Mattmann
Hong-Thai Nguyen
Tyler Palsulich
Oleg Tikhonov
Tim Allison
Lewis John McGibbney
David Meikle
Sergey Beryozkin

I'll go ahead and push the release out to the mirrors
and finish updating the website, etc.

Thanks everyone!

Cheers,
Chris


-----Original Message-----
From: <Mattmann>, Chris Mattmann <Chris.A.Mattmann@...>
Reply-To: "dev@..." <dev@...>
Date: Sunday, August 31, 2014 10:16 PM
To: "dev@..." <dev@...>
Cc: "user@..." <user@...>
Subject: [VOTE] Release Apache Tika 1.6 RC #2

>Hi Folks,
>
>A candidate for the Tika 1.6 release is available at:
>
>    http://people.apache.org/~mattmann/apache-tika-1.6/rc2/
>
>The release candidate is a zip archive of the sources in:
>
>http://svn.apache.org/repos/asf/tika/tags/1.6-rc2/
>
>
>The SHA1 checksum of the archive is
>65644121446130fa29f1b62bcd75fb33344a6ba3.
>
>A Maven staging repository is at:
>
>https://repository.apache.org/content/repositories/orgapachetika-1004/
>
>
>Please vote on releasing this package as Apache Tika 1.6.
>The vote is open for the next 72 hours and passes if a majority of at
>least three +1 Tika PMC votes are cast.
>
>    [ ] +1 Release this package as Apache Tika 1.6
>    [ ] -1 Do not release this package because...
>
>
>Cheers,
>Chris
>
>P.S. Here is my +1!

