Baldwin, David | 2 Sep 21:44 2014

Tika versions compatibility

I am looking for information on Tika compatibility across version releases. We are still on an old version, Tika
0.6. We would like to upgrade to the latest released version, 1.5, and prepare for 1.6 as well.

Is there any information, beyond what I have found by googling and searching the site, that documents changes
from 0.6 to the current 1.5 that might be incompatible at the API/usage level?

We are also still using Lucene 2.9.2 with it (although we are upgrading to 4.9 in the near future).

David Baldwin 

Mattmann, Chris A (3980) | 1 Sep 07:16 2014

[VOTE] Release Apache Tika 1.6 RC #2

Hi Folks,

A candidate for the Tika 1.6 release is available at:

    http://people.apache.org/~mattmann/apache-tika-1.6/rc2/

The release candidate is a zip archive of the sources in:

http://svn.apache.org/repos/asf/tika/tags/1.6-rc2/

The SHA1 checksum of the archive is
65644121446130fa29f1b62bcd75fb33344a6ba3.

A Maven staging repository is at:

https://repository.apache.org/content/repositories/orgapachetika-1004/

Please vote on releasing this package as Apache Tika 1.6.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

    [ ] +1 Release this package as Apache Tika 1.6
    [ ] -1 Do not release this package because...

Cheers,
Chris

P.S. Here is my +1!

Lewis John Mcgibbney | 15 Aug 04:46 2014

Support for HDF5 and netCDF

Hi Folks,

@Annie Brynant in particular,

I would like to discuss on-list the current state of our support for the following MIME types:

 * NetCDF4
 * HDF5

I know that we maintain parsers for these types; however, I have a possible extension use case which I would like to discuss.
I am looking to ensure that I can obtain metadata defined by the Attribute Conventions Dataset Discovery [0] effort. Please see the elaboration below:

date_created (string)
  The date and time the data file was created in the form “yyyymmddThhmmssZ”. This time format is ISO 8601 compliant.
  Example: Date_created = “2012-04-06T16:26:33Z”;

time_coverage_start (string)
  Representative date and time of the start of the granule in the ISO 8601 compliant format of “yyyymmddThhmmssZ”.
  Example: Time_coverage_start = “2012001013102483”

time_coverage_end (string)
  Representative date and time of the end of the granule in the ISO 8601 compliant format of “yyyymmddThhmmssZ”.
  Example: Time_coverage_end = “2012002000843304”

geospatial_lat_max (float)
  Decimal degrees north, range -90 to +90.
  Example: Geospatial_lat_max = 90.0f

geospatial_lat_min (float)
  Decimal degrees north, range -90 to +90.
  Example: Geospatial_lat_min = -90.0f

geospatial_lon_max (float)
  Decimal degrees east, range -180 to +180.
  Example: Geospatial_lon_max = -180.0f

geospatial_lon_min (float)
  Decimal degrees east, range -180 to +180.
  Example: Geospatial_lon_min = 180.0f

geospatial_lat_resolution (float)
  Latitude resolution in units matching geospatial_lat_units.
  Example: Geospatial_lat_resolution = 1

geospatial_lon_resolution (float)
  Longitude resolution in units matching geospatial_lon_units.
  Example: Geospatial_lon_resolution = 1

geospatial_lat_units (string)
  Units of the latitudinal resolution. Typically “degrees_north”.
  Example: geospatial_lat_units = “degrees_north”

geospatial_lon_units (string)
  Units of the longitudinal resolution. Typically “degrees_east”.
  Example: geospatial_lon_units = “degrees_east”

platform (string)
  Satellite(s) used to create this data file.
  Example: platform: “Aquarius/SAC-D”

sensor (string)
  Sensor(s) used to create this data file.
  Example: Sensor = “Aquarius”

project (string)
  Project/mission name.
  Example: project = “Aquarius”

product_version (string)
  The product version of this data file, which may be different from the file version used in the file naming convention.
  Example: Product_version = “1.3”

processing_level (string)
  Product processing level (e.g. L2, L3, L4).
  Example: processing_level = 3

keywords (string)
  Comma-separated list of GCMD Science Keywords from http://gcmd.nasa.gov/learn/keyword_list.html
  Example: keywords_vocabulary = "SURFACE SALINITY, SALINITY, AQUARIUS SAC-D"
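As a quick illustration of the geospatial ranges above, a range check could look like this (a minimal sketch; the class and method names are my own, not part of Tika or ACDD):

```java
public class GeoBounds {

    /** True when lat/lon fall inside the ACDD ranges quoted above. */
    static boolean inRange(float lat, float lon) {
        return lat >= -90.0f && lat <= 90.0f     // decimal degrees north
                && lon >= -180.0f && lon <= 180.0f; // decimal degrees east
    }

    public static void main(String[] args) {
        System.out.println(inRange(90.0f, -180.0f)); // boundary values are valid
        System.out.println(inRange(91.0f, 0.0f));    // latitude out of range
    }
}
```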


and also the Climate and Forecast (CF) metadata convention, which looks like this:

Conventions (string)
  Version of the Convention standard implemented by the file, interpreted as a directory name relative to a directory that is a repository of documents describing sets of discipline-specific conventions.
  Example: Conventions = "CF-1.6";

title (string)
  A succinct description of what is in the dataset.
  Example: title = "Aquarius CAP Level-3 1x1 Deg Gridded 7-Day Bin Averaged Maps";

history (string)
  Used to document provenance. Provides an audit trail for modifications to the original data. We recommend that each line begin with a timestamp indicating the date and time of day that the program was executed.
  Example: history = "L2_1.3CAP2.1.4";

institution (string)
  Specifies where the original data was produced.
  Example: institution = "JPL";

source (string)
  The method of production of the original data. If it was model-generated, source should name the model and its version, as specifically as could be useful. If it is observational, source should characterize it (e.g., "surface observation" or "radiosonde").
  Example: source = "CAPV1.3-HDF5";

comment (string)
  Miscellaneous information about the data or methods used to produce it.
  Example: comment = "rolling 7 day means at 1 degree spatial resolution";

references (string)
  Published or web-based references that describe the data or methods used to produce it.
  Example: references = "Yueh,S.,Tang,W.,Fore,A.,Freedman,A.,Neumann,G.,Chaubell,J.,Hayashi,A (2012). SIMULTANEOUS SALINITY AND WIND RETRIEVAL USING THE CAP ALGORITHM FOR AQUARIUS. http://www.igarss2012.org/Papers/viewpapers.asp?papernum=1596";
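Since both tables lean on the ISO 8601 instant format, here is a minimal java.time sketch (my own illustration, not part of any Tika parser) that round-trips the date_created example value:

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;

public class AcddDate {
    public static void main(String[] args) {
        // The date_created example value from the ACDD table above
        Instant created = Instant.parse("2012-04-06T16:26:33Z");
        // ISO_INSTANT emits the UTC, "Z"-suffixed form that ISO 8601 expects
        System.out.println(DateTimeFormatter.ISO_INSTANT.format(created));
        // prints 2012-04-06T16:26:33Z
    }
}
```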




--
Lewis
Avi Hayun | 8 Aug 08:28 2014

How to identify binary content?

Hi,

I am crawling my site and am using Tika to parse binary content.

But how can I know whether a certain URL contains binary content or plain text?

I can get the contentType.


So for now I am using:

if (typeStr.contains("image") || typeStr.contains("audio")
        || typeStr.contains("video") || typeStr.contains("application")) {
    return true;
}

which is dumb code.

I will replace the plain strings with Tika's MediaType objects, but I still need better code.

Does anyone have a better idea?
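For what it's worth, one possible direction, sketched here without Tika (the helper name and the whitelist are mine): parse the Content-Type header value, treat text/* plus a few known textual application types as text, and everything else as binary. The raw strings could later be swapped for Tika's MediaType objects.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class ContentTypeCheck {

    // Application types that are textual despite the "application/" prefix
    // (illustrative list; extend as needed)
    private static final Set<String> TEXTUAL = new HashSet<>(Arrays.asList(
            "application/xml", "application/json",
            "application/xhtml+xml", "application/javascript"));

    /** True when the Content-Type header value denotes binary content. */
    static boolean isBinary(String contentType) {
        if (contentType == null) {
            return true; // unknown content: assume binary as the safe default
        }
        // Drop parameters such as "; charset=UTF-8" and normalize the case
        String type = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return !type.startsWith("text/") && !TEXTUAL.contains(type);
    }

    public static void main(String[] args) {
        System.out.println(isBinary("text/html; charset=UTF-8")); // false
        System.out.println(isBinary("image/png"));                // true
        System.out.println(isBinary("application/json"));         // false
    }
}
```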

Thank you for your help,
Avi
Nick Burch | 7 Aug 17:09 2014

Re: Compression of Tika server output files

On Thu, 7 Aug 2014, Bratislav Stojanovic wrote:
> OK, but I don't really have to use http...does tika support extracting 
> all resources in one call by some other method?

The Tika App does: the -z / --extract flag will do that. You might also want
to use the --extract-dir=<dir> flag to set where the extracted files go.

Nick

Bratislav Stojanovic | 7 Aug 16:31 2014

Running tika-server in the background?

Hi,

Is there any way to run tika-server in the background? I want to start this
command and immediately continue execution:

java -jar target\tika-server-1.5-SNAPSHOT.jar

I've tried adding & at the end, but no luck (Win7 x64 command prompt).
The only way I have found to run it asynchronously is to use the start command, like this:

start java -jar target\tika-server-1.5-SNAPSHOT.jar

Wrapping this jar with Tanuki's wrapper is not really an option for me, since that
product is commercial.

Any other solutions, please? I think tika-app has a -f (--fork) parameter which
would be useful to have here.

--
Bratislav Stojanovic, M.Sc.
Bratislav Stojanovic | 7 Aug 14:44 2014

Compression of Tika server output files

Hi,

I'm trying to get text, metadata, and attachments all in one request using tika-server (JAX-RS), but
the only thing I can get as output is either an uncompressed ZIP or a TAR.

Is there any way to:

- set the compression level? An uncompressed ZIP/TAR with resources actually occupies more space than the plain __METADATA__, __TEXT__, and other files, because of the additional ZIP/TAR headers. If I decide to use ZIP/TAR, I would like to save some disk space.

- or use a simple folder, instead of an output file, with all the extracted resources inside? This is desirable
for me because then I don't have to decompress the output to reach the extracted resources.

Basically, I would like to specify compression or a folder in this command:

curl -T example.doc http://localhost:9998/all > outputFolder

I haven't found any related info on http://wiki.apache.org/tika/TikaJAXRS or in the mailing list archives, so
please help :)

--
Bratislav Stojanovic, M.Sc.
Mattmann, Chris A (3980) | 31 Jul 20:34 2014

[ANNOUNCE] Welcome Tyler Palsulich as an Apache Tika PMC member and committer

Hi Folks,

The Tika PMC has elected to add Tyler Palsulich to our ranks. Tyler, please
feel free to introduce yourself. Welcome!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@...
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

AarKay | 31 Jul 06:29 2014

Tika - Outlook msg file with another Outlook msg as an attachment - OutlookExtractor passes empty stream

I am using Tika Server (TikaJAXRS) for my text extraction needs.
I also need to extract the attachments in a file and save them to disk in
their native format.
I was able to do this with a CustomParser that writes the file to disk using
the 'stream' argument of its parse method.

Here is the post I used as a reference for building the CustomParser:
http://stackoverflow.com/questions/20172465/get-embedded-resourses-in-doc-files-using-apache-tika

It works fine if the attachment is anything but an Outlook msg file.

I am running into an issue when the attachment is an Outlook msg file.
When the CustomParser.parse method gets invoked, the stream passed to it is
empty, so the file that is written to disk is always 0 KB.

Digging through the code, I noticed that in the OutlookExtractor class the
attachment is handled by OfficeParser, because msg.attachData is always null
when the attachment is an Outlook msg; that is why an empty stream is always
sent to the CustomParser.

Here is the snippet of code from OutlookExtractor where it iterates through
the attachment files and uses the handleEmbeddedResource method only when
msg.attachData is not null.
But msg.attachData is always null if the attachment is an Outlook msg, so the
stream is always empty when the request is delegated to the CustomParser.parse
method.

Can someone please tell me how I can access the msg attachment and save it
to disk in its native format?

for (AttachmentChunks attachment : msg.getAttachmentFiles()) {
    xhtml.startElement("div", "class", "attachment-entry");
    String filename = null;
    if (attachment.attachLongFileName != null) {
        filename = attachment.attachLongFileName.getValue();
    } else if (attachment.attachFileName != null) {
        filename = attachment.attachFileName.getValue();
    }
    if (filename != null && filename.length() > 0) {
        xhtml.element("h1", filename);
    }
    if (attachment.attachData != null) {
        handleEmbeddedResource(
                TikaInputStream.get(attachment.attachData.getValue()),
                filename, null, xhtml, true);
    }
    if (attachment.attachmentDirectory != null) {
        handleEmbededOfficeDoc(
                attachment.attachmentDirectory.getDirectory(), xhtml);
    }
    xhtml.endElement("div");
}

Thanks
-AarKay

Mattmann, Chris A (3980) | 28 Jul 06:22 2014

[VOTE] Apache Tika 1.6 release candidate #1

Hi Folks,

A candidate for the Tika 1.6 release is available at:

http://people.apache.org/~mattmann/apache-tika-1.6/rc1/

The release candidate is a zip archive of the sources in:

    http://svn.apache.org/repos/asf/tika/tags/1.6/

The SHA1 checksum of the archive is
076ad343be56a540a4c8e395746fa4fda5b5b6d3.

A Maven staging repository is available at:

https://repository.apache.org/content/repositories/orgapachetika-1003/

Please vote on releasing this package as Apache Tika 1.6.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

    [ ] +1 Release this package as Apache Tika 1.6
    [ ] -1 Do not release this package because...

Thank you!

Cheers,
Chris

P.S. Here is my +1!

Avi Hayun | 23 Jul 14:50 2014

How to identify the language of a text

Hi,

I saw that Tika can identify the language of a given text using the following:
http://tika.apache.org/1.4/api/org/apache/tika/language/LanguageIdentifier.html


How many languages does Tika support?
Where can I find more information about it?
