David Meikle | 16 Jun 06:44 2014

ApacheCon CFP closes June 25

Dear Apache Tika Enthusiast,

As you may be aware, ApacheCon will be held this year in Budapest, on
November 17-23. (See http://apachecon.eu for more info.)

The Call For Papers for that conference is still open, but will be
closing soon. We need your talk proposals to represent Apache Tika at
ApacheCon. We need all kinds of talks - deep technical talks, hands-on
tutorials, introductions for beginners, or case studies about the
awesome stuff you're doing with Apache Tika.

Please consider submitting a proposal, at
http://events.linuxfoundation.org//events/apachecon-europe/program/cfp

Thanks!
Mattmann, Chris A (3980 | 12 Jun 05:27 2014

Re: Question re installing Tika

Hi Richard,

Hope you are well, will try and answer below:

-----Original Message-----

From: Richard <rgwlawson@...>
Date: Friday, June 6, 2014 6:07 AM
To: "user@..." <user@...>,
"dev-owner@..." <dev-owner@...>
Subject: Question re installing Tika

>Hello
> 
>I am new to the Apache suite of products and dealing with text in pdfs,
>more generally. In particular I am trying to install Tika (the
>tika-app_1.5.jar) as well as Solr on my Windows 7 pc.
>
> 
>However I am confused about how to do the Tika installation.
>
> 
>From reading various webpages (eg
>http://tika.apache.org/1.5/gettingstarted.html
><http://tika.apache.org/1.5/gettingstarted.html>) it seems I need to
> 
>1) Download the .jar from
>http://tika.apache.org/download.html
><http://tika.apache.org/download.html> (do I need to put it in a specific
>Windows folder?)

Nope, you don't have to put it in any specific folder; anywhere you are
comfortable calling the jar from will do.

>2) Download Maven 2 (from http://maven.apache.org/ ) and follow the
>instructions for Windows on
>http://maven.apache.org/download.cgi#Installation

No need to do this unless you are building from scratch.

>3) Also where do I set the base directory?

You just need to install Apache Tika and its *-app.jar file into some
folder, and then call it by running:

    java -jar /path/to/tika-*version*-app.jar --help

> 
>4) Where do I run the command "mvn install" from? Is it the command line?

If you are building from source, then you would run this at the top-level
directory containing files like pom.xml, tika-parent, tika-parsers, etc.

>
>
>Any help would be most gratefully received.

Cheers!

Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@...
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Carlos Scheidecker | 10 Jun 16:16 2014

extract content of an uploaded file on the fly

Hello all,

I have a Spring controller to handle uploads and I would like to extract the contents of a pdf, doc, txt, html file as it is uploaded.

The problem is that I can see the file being uploaded and I can see the byte payload, but when I try to use the AutoDetectParser I cannot get the contents. Here is what I am doing.

Notice that I extract the bytes from the MultipartFile and then build a ByteArrayInputStream.

I can see from the debugger that it is not empty and has the contents. But when I try to extract them with Tika I get an empty string, with no errors.

@Controller
@RequestMapping(value = "/documents")
public class DocumentController {

    @RequestMapping(value = "/parse", method = RequestMethod.POST)
    public @ResponseBody String handleFileUpload(
            @RequestParam("file") MultipartFile file) {
        if (!file.isEmpty()) {
            try {
                byte[] source = file.getBytes();
                long size = file.getSize();
                ByteArrayInputStream is = new ByteArrayInputStream(source);
                Metadata metadata = new Metadata();
                //metadata.set(Metadata.RESOURCE_NAME_KEY, file.getOriginalFilename());

                Parser parser = new AutoDetectParser(); // Should auto-detect!
                StringWriter textBuffer = new StringWriter();
                BodyContentHandler handler = new BodyContentHandler(textBuffer);
                ParseContext context = new ParseContext();
                parser.parse(is, handler, metadata, context);
                String content2 = textBuffer.toString();
                String content1 = handler.toString();

                Tika tk = new Tika();
                String text = tk.parseToString(is, metadata);
                is.close();
                // TODO : return structure instead of text
                return text;

            } catch (Exception e) {
                return "You failed to upload the file => " + e.getMessage();
            }
        } else {
            return "You failed to upload the file because the file was empty.";
        }
    }

}

I thought that calling either handler.toString(), or passing a textBuffer to the handler constructor and then calling textBuffer.toString(), would give me the contents of the text file or PDF being uploaded.

I get an empty string instead.

I do not want to save the file but just extract its text content. How shall I do it?

Thanks.
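
One possible cause, judging only from the code in this message: the same ByteArrayInputStream is passed first to parser.parse(...) and then to tk.parseToString(...), and the value returned is the result of the second call. If the first call reads the stream to its end, the second call has nothing left to read. The sketch below illustrates that effect using only the JDK, with no Tika calls; readAll is a hypothetical stand-in for any consumer that drains a stream, and the class name is made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class StreamReuseDemo {

    // Hypothetical stand-in for a consumer (such as a parser) that reads a stream to its end.
    static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] source = "hello upload".getBytes(StandardCharsets.UTF_8);

        // One stream shared by two consumers: the second finds it already drained.
        InputStream shared = new ByteArrayInputStream(source);
        String first = readAll(shared);
        String second = readAll(shared);
        System.out.println("first=" + first + ", second is empty: " + second.isEmpty());

        // Giving each consumer its own stream over the same bytes avoids the problem.
        String fresh = readAll(new ByteArrayInputStream(source));
        System.out.println("fresh=" + fresh);
    }
}
```

Applied to the controller above, this would suggest either keeping only one of the two extraction calls, or creating a new ByteArrayInputStream(source) before the second one. Whether that accounts for the empty handler output as well is not something this sketch can confirm.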
Érico | 3 Jun 22:44 2014

Tika and JBoss 5.1 compatibility

Hi 

I need to use Tika embedded in JBoss EAP 5.1.

Which Tika version would be suitable for this?

Also, is there any how-to or tutorial for using Tika, with sample code?

Regards
Érico
Yi, EungJun | 27 May 06:32 2014

What exception does CharsetDetector.detect() throw?

Hi.

According to the javadoc of CharsetDetector.detect(), it raises an
exception if no charset appears to match the data:

     * Raise an exception if
     *  <ul>
     *    <li>no charsets appear to match the input data.</li>
     *    <li>no input text has been provided</li>
     *  </ul>

But it seems to me that the method returns null but does not raise an
exception. What exception does the method throw?

Thanks in advance.

Best Regards,
EungJun Yi

Annie Burgess | 8 May 01:16 2014

tika install fail on os x 10.9.2

Hi all, 

I have a new computer running  OS X 10.9.2 (13C64).  I am attempting to get Tika up and running, but am getting errors in the Maven install phase.  My steps are as follows:


[annies-mbp:~/tika/] % svn co https://svn.apache.org/repos/asf/tika/trunk tmp
[annies-mbp:~/tika/tmp]% setenv MAVEN_OPTS "-Xms128m -Xmx256m"
[annies-mbp:~/tika/tmp]% mvn install

Results :

Tests in error:

  testiBooksParser(org.apache.tika.parser.ibooks.iBooksParserTest): Premature end of file.

Tests run: 506, Failures: 0, Errors: 1, Skipped: 1

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] Apache Tika parent ................................ SUCCESS [  0.626 s]
[INFO] Apache Tika core .................................. SUCCESS [  6.631 s]
[INFO] Apache Tika parsers ............................... FAILURE [ 23.323 s]

.
.
.

[INFO] ------------------------------------------------------------------------

[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------

My Maven version is:

[annies-mbp:~/Development/tika/tmp]% mvn --version
Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 2014-02-14T08:37:52-09:00)
Maven home: /usr/local/Cellar/maven/3.2.1/libexec
Java version: 1.8.0_05, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.9.2", arch: "x86_64", family: "mac"


Does anyone have any insight as to why this is failing at 'iBooksParserTest'?
Thanks!
Annie

------------------------------------------------------------------------------------------
Ann Bryant Burgess, PhD

Postdoctoral Fellow
Computer Science Department
University of Southern California
Viterbi School of Engineering                        
Los Angeles, CA

Alaska Science Center/USGS
Anchorage, AK                  

Cell:  (585) 738-7549
Office:  (907) 786-7059
Fax:  (907) 786-7150
E-mail: anniebryant.burgess-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
Office Address: 4210 University Dr., Anchorage, AK 99508-4626
-------------------------------------------------------------------------------------------
Tamás Cservenák | 7 May 12:57 2014

Inconsistent priorities in bundled tika-mimetypes.xml

Hi all,

I just created an issue

In short: it's about the Tika Detector detecting a JAR file (a correct ZIP file, with proper magic bytes, etc.) as "text/html" instead of the expected "application/java-archive".

The reason is clear to me (we already created a PR in the Nexus project for that), but what bothers me is _why_ the Detector behaves correctly with tika-parsers on the classpath.

How does the presence of tika-parsers affect the MIME magic detection and, most interestingly, why does it affect it? (I am aware of the added org.apache.tika.parser.pkg.ZipContainerDetector.)

Isn't MIME magic detection based on the bundled tika-mimetypes.xml? Even the globs defined there for text/html (*.htm and *.html) do not match the JAR file above (*.jar), yet Tika still selects the HTML MIME type.


Thanks,
~t~
Milos Kovacevic | 4 May 22:21 2014

Setting output format and content length in tika-server 1.5

Hello,
I would like to use tika-server-1.5.jar as a parsing server for my
application but I have two problems:

When I post a file for parsing to tika-server from my app, I get the output
in HTML format, but I would like it to be in plain text. When I use
tika-server-1.3.jar from the same app, the output is plain text.

In addition, I would like to set the maximum content length for the output
to be more than 100 000 chars.

How do I set the output format and content length for tika-server-1.5.jar?
Do I have to change the source code, or can I just use a command-line switch?

Regards, Milos

Allison, Timothy B. | 30 Apr 20:40 2014

tika server jax-rs and recursive file processing

All,

  As always, apologies for the cluelessness the following reveals… I’m starting to move from embedded Tika to a server option for greater robustness.  Is the jax-rs server intended not to handle embedded files recursively?  If so, how are users currently handling multiply embedded documents with the jax-rs server?  Would it be worthwhile to add another service that uses AutoDetectParser as the embedded parser/extractor instead of MyEmbeddedDocumentExtractor?

 

Best,

Tim

Timothy B. Allison, Ph.D.
Lead Artificial Intelligence Engineer
Group Lead
K83A/Human Language Technology
The MITRE Corporation
7515 Colshire Drive, McLean, VA  22102
703-983-2473 (phone); 703-983-1379 (fax)
 

Chris Bamford | 28 Apr 17:00 2014

Plans for Tika 1.6


Hi,

I am wondering when 1.6 is planned for release?
I recently worked with Tim Allison on some changes to the RTF parser (see https://issues.apache.org/jira/browse/TIKA-1010) and am keen to start using them.

Thanks,

Chris


Chris Bamford m: +44 7860 405292 w: www.mimecast.com
Senior Developer p: +44 207 847 8700
 


אברהם חיון | 23 Apr 10:35 2014

Correct use of Tika's MediaType

I want to use Tika's MediaType class to compare mediaTypes.

I first use Tika to detect the MediaType. Then I want to start an action according to the MediaType.

So if the MediaType is an XML type I want to perform one action, and if it is a compressed file I want to start another action.

My problem is that there are many XML types, so how do I check whether it is XML using the MediaType?

Here is my previous (before Tika) implementation:

if (contentType.contains("text/xml")
        || contentType.contains("application/xml")
        || contentType.contains("application/x-xml")
        || contentType.contains("application/atom+xml")
        || contentType.contains("application/rss+xml")) {
    processXML();
} else if (contentType.contains("application/gzip")
        || contentType.contains("application/x-gzip")
        || contentType.contains("application/x-gunzip")
        || contentType.contains("application/gzipped")
        || contentType.contains("application/gzip-compressed")
        || contentType.contains("application/x-compress")
        || contentType.contains("gzip/document")
        || contentType.contains("application/octet-stream")) {
    processGzip();
}

I want to switch it to use Tika, with something like the following:

MediaType mediaType = MediaType.parse(contentType);
if (mediaType == APPLICATION_XML) {
    return processXml();
} else if (mediaType == APPLICATION_ZIP || mediaType == OCTET_STREAM) {
    return processGzip();
}

But the problem is that Tika.detect(...) returns many different types which don't have a MediaType constant.

How can I identify whether the MediaType is an XML type, or a compressed type? I need a "parent" type that includes all of its children, or perhaps a method such as "boolean isXML()" that covers application/xml, text/xml and application/x-xml, or "boolean isCompress()" that covers all of the zip and gzip types, etc.
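
A minimal sketch of the kind of "family" check asked for here, using only the JDK and plain content-type strings rather than any Tika API. The class ContentTypeFamilies and its methods are hypothetical names made up for illustration; the type lists are copied from the pre-Tika code in this message, except that application/octet-stream is left out of the gzip set because it is a generic fallback type. Treating any "+xml" suffix as XML is an assumption of this sketch. Tika's MediaTypeRegistry may offer a supertype lookup that serves the same purpose; that is worth checking in the javadoc for the version in use. Note also that comparing MediaType objects with == tests object identity, so equals() is the safer comparison.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class ContentTypeFamilies {

    // Base types treated as XML, taken from the list in the message above.
    private static final Set<String> XML_TYPES = new HashSet<>(Arrays.asList(
            "text/xml", "application/xml", "application/x-xml"));

    // Base types treated as gzip-style compressed data, also from the message above.
    private static final Set<String> GZIP_TYPES = new HashSet<>(Arrays.asList(
            "application/gzip", "application/x-gzip", "application/x-gunzip",
            "application/gzipped", "application/gzip-compressed",
            "application/x-compress", "gzip/document"));

    // Strips parameters such as "; charset=UTF-8" and lower-cases the remainder.
    static String baseType(String contentType) {
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // True for the listed XML types and for any type carrying a "+xml" suffix.
    static boolean isXml(String contentType) {
        String base = baseType(contentType);
        return XML_TYPES.contains(base) || base.endsWith("+xml");
    }

    // True only for the listed gzip-style types.
    static boolean isGzip(String contentType) {
        return GZIP_TYPES.contains(baseType(contentType));
    }

    public static void main(String[] args) {
        System.out.println(isXml("application/atom+xml"));
        System.out.println(isXml("text/xml; charset=UTF-8"));
        System.out.println(isGzip("application/x-gzip"));
        System.out.println(isXml("text/html"));
    }
}
```

Matching on the exact base type instead of String.contains avoids accidental hits on longer type names, and the suffix rule covers types such as application/rss+xml without listing each one.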

