Mugat Gurkowsky | 13 Sep 15:05 2014

very large xml-file parsing

hello,

i am trying to use tika in combination with lucene to parse and index very large xml files. so far without success, because of memory limitations. tika's BodyContentHandler seems to copy the whole content into memory, which doesn't work as the files are several gigabytes in size.

is there a way of getting around this problem? can i use any other handler which can deal with streams?

thanks in advance
zenpunk
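
One possible way around the memory limit, sketched here with the plain JDK SAX API rather than Tika itself: Tika parsers report content as SAX events to an org.xml.sax.ContentHandler, so a handler that forwards text to a Writer, instead of collecting it in a string, should keep memory use roughly constant. The class and method names below are hypothetical. BodyContentHandler also has a constructor that takes a Writer, which may already be enough.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: forwards character data to a Writer instead of buffering it. */
class StreamingTextHandler extends DefaultHandler {
    private final Writer out;

    StreamingTextHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            out.write(ch, start, length); // pass each chunk on as soon as it arrives
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }
}

class StreamingTextDemo {
    /** Parses XML from the stream and writes its text content to out while parsing. */
    static void extractText(InputStream in, Writer out) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse(in, new StreamingTextHandler(out));
        out.flush();
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<doc><p>hello</p><p>world</p></doc>".getBytes(StandardCharsets.UTF_8);
        // StringWriter is only a stand-in; for large files the sink would be a file or similar.
        StringWriter sink = new StringWriter();
        extractText(new ByteArrayInputStream(xml), sink);
        System.out.println(sink);
    }
}
```

With Tika, the same handler could presumably be passed to the parser's parse method in place of BodyContentHandler; that wiring is not shown here because it depends on the Tika version in use.
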
Devaraja Swami | 9 Sep 04:12 2014

HTML parsing error with <a> tag inside <h1> tag

In the following HTML document, the <a> is inside the <h1> tag which is inside the <p> tag:
-------------------
<!DOCTYPE html>
<html>
<body>
<div><h1><a href="http://www.google.com">GOOGLE!</a></h1></div>
</body>
</html>
-------------------
But when I parse it with Tika 1.5 HtmlParser, 
it adds both the <a> and <h1> tag nodes as direct children of the <p> tag.

The same error happens when I replace the <h1> tag with other header tags <h2> ... <h5>, and/or the <p> tag with a <div> or <span> tag.
[Haven't experimented with other replacements].

This seems to be a basic issue.
Any help would be deeply appreciated.

Cheers,
Devarajan

Matthew Caruana Galizia | 8 Sep 21:51 2014

Permission to make wiki edits for MatthewCaruana

Hi,

I've created API bindings for nodejs and would like to add the project to the Tika wiki's list of bindings.

Here's the project page: https://github.com/mattcg/node-tika

Matthew
Chris Mattmann | 5 Sep 22:48 2014

[ANNOUNCE] Apache Tika 1.6 release

The Apache Tika project is pleased to announce the release of Apache Tika 1.6. The release contents have been pushed out to the main Apache release site and to the Maven Central sync, so the releases should be available as soon as the mirrors get the syncs.

Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Apache Tika 1.6 contains a number of improvements and bug fixes. Details can be found in the changes file:
http://www.apache.org/dist/tika/CHANGES-1.6.txt

Apache Tika is available in source form from the following download page:
http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.6-src.zip

Apache Tika is also available in binary form or for use using Maven 2 from the Central Repository:
http://repo1.maven.org/maven2/org/apache/tika/

In the initial 48 hours, the release may not be available on all mirrors. When downloading from a mirror site, please remember to verify the downloads using signatures found on the Apache site:
https://people.apache.org/keys/group/tika.asc

For more information on Apache Tika, visit the project home page:
http://tika.apache.org/

-- Chris Mattmann, on behalf of the Apache Tika community

Mattmann, Chris A (3980 | 5 Sep 05:53 2014

Waiting for infra to create new release area

Hey Guys,

To get with the times, I'm having infra create us new release and
staging areas via the new way:

http://www.apache.org/dev/release.html#upload-ci

Issue here:

https://issues.apache.org/jira/browse/INFRA-8309

I won't be able to push the release until this is done.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@...
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Mattmann, Chris A (3980 | 5 Sep 05:48 2014

[RESULT] [VOTE] Release Apache Tika 1.6 RC #2

Hi Everyone,

The VOTE has passed with the following tallies:

+1 PMC

Chris Mattmann
Hong-Thai Nguyen
Tyler Palsulich
Oleg Tikhonov
Tim Allison
Lewis John McGibbney
David Meikle
Sergey Beryozkin

I'll go ahead and push the release out to the mirrors
and finish updating the website, etc.

Thanks everyone!

Cheers,
Chris



Baldwin, David | 2 Sep 21:44 2014

Tika versions compatibility

I am looking for information on Tika compatibility across version releases. We are still on an old version, Tika 0.6. We would like to upgrade to the latest released version, 1.5, and prepare for 1.6 as well.

Is there any information, which I may have missed while googling and searching the site, on changes between 0.6 and the current 1.5 version that make it incompatible at the API/usage level?

We are also still using Lucene 2.9.2 with it (although we plan to upgrade to 4.9 before long).

David Baldwin 

Mattmann, Chris A (3980 | 1 Sep 07:16 2014

[VOTE] Release Apache Tika 1.6 RC #2

Hi Folks,

A candidate for the Tika 1.6 release is available at:

    http://people.apache.org/~mattmann/apache-tika-1.6/rc2/

The release candidate is a zip archive of the sources in:

http://svn.apache.org/repos/asf/tika/tags/1.6-rc2/

The SHA1 checksum of the archive is
65644121446130fa29f1b62bcd75fb33344a6ba3.
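
For anyone who wants to check the archive against the checksum programmatically, here is a minimal sketch using only the JDK. The class and method names are hypothetical, and the demo input is a fixed string rather than the real archive.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

class Sha1Check {
    /** Streams the input through SHA-1 and returns the digest as lowercase hex. */
    static String sha1Hex(InputStream in) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n); // feed the digest chunk by chunk
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** True if the computed digest equals the published one, ignoring case and surrounding whitespace. */
    static boolean matches(InputStream in, String expectedHex) throws Exception {
        return sha1Hex(in).equalsIgnoreCase(expectedHex.trim());
    }

    public static void main(String[] args) throws Exception {
        // For a real check, open the downloaded archive with Files.newInputStream(path)
        // and pass the checksum from this mail as expectedHex.
        InputStream demo = new ByteArrayInputStream("abc".getBytes(StandardCharsets.US_ASCII));
        System.out.println(sha1Hex(demo)); // a9993e364706816aba3e25717850c26c9cd0d89d
    }
}
```

Reading in fixed-size chunks keeps memory use independent of the archive size.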

A Maven staging repository is at:

https://repository.apache.org/content/repositories/orgapachetika-1004/

Please vote on releasing this package as Apache Tika 1.6.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

    [ ] +1 Release this package as Apache Tika 1.6
    [ ] -1 Do not release this package because...

Cheers,
Chris

P.S. Here is my +1!


Lewis John Mcgibbney | 15 Aug 04:46 2014

Support for HDF5 and netCDF

Hi Folks,

@Annie Brynant in particular,

I would like to establish on list the current state of our support for the following MIME types:

 * NetCDF4
 * HDF5

I know that we maintain parsers for these types; however, I possibly have an extension use case which I would like to discuss.
I am looking to ensure that I can obtain the metadata defined by the Attribute Conventions Dataset Discovery [0] effort. Please see the elaboration below:

Attribute Name | Type | Description | Example Implementation
---------------|------|-------------|-----------------------
date_created | string | The date and time the data file was created in the form “yyyymmddThhmmssZ”. This time format is ISO 8601 compliant. | Date_created = “2012-04-06T16:26:33Z”;
time_coverage_start | string | Representative date and time of the start of the granule in the ISO 8601 compliant format of “yyyymmddThhmmssZ”. | Time_coverage_start = “2012001013102483”
time_coverage_end | string | Representative date and time of the end of the granule in the ISO 8601 compliant format of “yyyymmddThhmmssZ”. | Time_coverage_end = “2012002000843304”
geospatial_lat_max | float | Decimal degrees north, range -90 to +90. | Geospatial_lat_max = 90.0f
geospatial_lat_min | float | Decimal degrees north, range -90 to +90. | Geospatial_lat_min = -90.0f
geospatial_lon_max | float | Decimal degrees east, range -180 to +180. | Geospatial_lon_max = -180.0f
geospatial_lon_min | float | Decimal degrees east, range -180 to +180. | Geospatial_lon_min = 180.0f
geospatial_lat_resolution | float | Latitude resolution in units matching geospatial_lat_units. | Geospatial_lat_resolution = 1
geospatial_lon_resolution | float | Longitude resolution in units matching geospatial_lon_units. | Geospatial_lon_resolution = 1
geospatial_lat_units | string | Units of the latitudinal resolution. Typically “degrees_north”. | geospatial_lat_units = “degrees_north”
geospatial_lon_units | string | Units of the longitudinal resolution. Typically “degrees_east”. | geospatial_lon_units = “degrees_east”
platform | string | Satellite(s) used to create this data file. | platform: “Aquarius/SAC-D”
sensor | string | Sensor(s) used to create this data file. | Sensor = “Aquarius”
project | string | Project/mission name. | project = “Aquarius”
product_version | string | The product version of this data file, which may be different from the file version used in the file naming convention. | Product_version = “1.3”
processing_level | string | Product processing level (e.g. L2, L3, L4). | processing_level = 3
keywords | string | Comma-separated list of GCMD Science Keywords from http://gcmd.nasa.gov/learn/keyword_list.html | keywords_vocabulary = "SURFACE SALINITY, SALINITY,  AQUARIUS SAC-D"


and also the Climate Forecast (CF) metadata convention... which looks like this

Attribute Name | Type | Description | Example Implementation
---------------|------|-------------|-----------------------
Conventions | string | Version of Convention standard implemented by the file, interpreted as a directory name relative to a directory that is a repository of documents describing sets of discipline-specific conventions. | Conventions = "CF-1.6";
title | string | A succinct description of what is in the dataset. | title = "Aquarius CAP Level-3 1x1 Deg Gridded 7-Day Bin Averaged Maps";
history | string | Used to document provenance. Provides an audit trail for modifications to the original data. We recommend that each line begin with a timestamp indicating the date and time of day that the program was executed. | history = "L2_1.3CAP2.1.4";
institution | string | Specifies where the original data was produced. | institution = "JPL";
source | string | The method of production of the original data. If it was model-generated, source should name the model and its version, as specifically as could be useful. If it is observational, source should characterize it (e.g., "surface observation" or "radiosonde"). | source = "CAPV1.3-HDF5";
comment | string | Miscellaneous information about the data or methods used to produce it. | comment = "rolling 7 day means at 1 degree spatial resolution";
references | string | Published or web-based references that describe the data or methods used to produce it. | references = "Yueh,S.,Tang, W.,Fore,A.,Freedman,A.,Neumann,G.,Chaubell,J.,Hayashi,A (2012).SIMULTANEOUS SALINITY AND WIND RETRIEVAL USING THE CAP ALGORITHM FOR AQUARIUS. http://www.igarss2012.org/Papers/viewpapers.asp?papernum=1596";
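
To make the constraints in the ACDD attribute descriptions concrete, here is a minimal sketch of how extracted values might be checked once a parser exposes them. It uses only the JDK, and the class and method names are hypothetical; it does not reflect how the existing NetCDF or HDF parsers work.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;

class AcddChecks {
    /** True if the value parses as an ISO 8601 UTC instant such as 2012-04-06T16:26:33Z. */
    static boolean isIsoInstant(String value) {
        try {
            Instant.parse(value);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    /** Latitude in decimal degrees north must lie in [-90, +90]. */
    static boolean isValidLatitude(float degreesNorth) {
        return degreesNorth >= -90.0f && degreesNorth <= 90.0f;
    }

    /** Longitude in decimal degrees east must lie in [-180, +180]. */
    static boolean isValidLongitude(float degreesEast) {
        return degreesEast >= -180.0f && degreesEast <= 180.0f;
    }

    public static void main(String[] args) {
        System.out.println(isIsoInstant("2012-04-06T16:26:33Z")); // true
        System.out.println(isIsoInstant("2012001013102483"));     // false
        System.out.println(isValidLatitude(90.0f));               // true
        System.out.println(isValidLongitude(-180.0f));            // true
    }
}
```

Note that the compact time_coverage values in the examples do not parse as ISO 8601 instants with this check, so a real implementation would need a separate format for them.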




--
Lewis
Avi Hayun | 8 Aug 08:28 2014

How to identify binary content ?

Hi,

I am crawling my site and am using Tika for binary content parsing.

But, how can I know if a certain url contains binary content or plain text ?

I can get the contentType.


So for now I am using:
if (typeStr.contains("image") || typeStr.contains("audio")
        || typeStr.contains("video") || typeStr.contains("application")) {
    return true;
}


Which is dumb code.

I will replace the plain strings with Tika's MediaType objects but still I need better code

Does anyone have any better idea ?




Thank you for your help,
Avi
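
One possible refinement, sketched with the plain JDK: invert the test so that anything not recognisably text counts as binary, and strip parameters such as the charset before comparing. The class name and the list of text-based application types are assumptions, not something Tika defines. With Tika on the classpath, MediaTypeRegistry.isSpecializationOf(type, MediaType.TEXT_PLAIN) might serve a similar purpose.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class ContentKind {
    // Assumed list of text-based types outside text/*; extend as needed.
    private static final Set<String> TEXTUAL_APPLICATION_TYPES = new HashSet<>(Arrays.asList(
            "application/xml", "application/json",
            "application/javascript", "application/xhtml+xml"));

    /** Returns true unless the content type is recognisably text. */
    static boolean isBinary(String contentType) {
        if (contentType == null) {
            return true; // unknown type: treat as binary and let a parser decide
        }
        // Drop parameters such as "; charset=UTF-8" and normalise case.
        String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (base.startsWith("text/")) {
            return false;
        }
        if (base.endsWith("+xml") || base.endsWith("+json")) {
            return false; // structured-syntax suffixes indicate text-based formats
        }
        return !TEXTUAL_APPLICATION_TYPES.contains(base);
    }

    public static void main(String[] args) {
        System.out.println(isBinary("text/html; charset=UTF-8")); // false
        System.out.println(isBinary("application/pdf"));          // true
    }
}
```

The original check would flag application/json or application/xml as binary; listing the text-based exceptions explicitly avoids that.
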
Nick Burch | 7 Aug 17:09 2014

Re: Compression of Tika server output files

On Thu, 7 Aug 2014, Bratislav Stojanovic wrote:
> OK, but I don't really have to use http...does tika support extracting 
> all resources in one call by some other method?

The Tika App does - the -z / --extract will do that. You might also want 
to use the --extract-dir=<dir> flag to set where they go

Nick

