george | 17 Jun 10:03 2014

Detecting html file which is utf-16 encoded

I want to be able to detect when a file is html even when it is utf-16 encoded. I can see from the default tika-mimetypes.xml that normally files with a BOM will be detected as text/plain, which is the case.  I have tried creating my own versions of the html and text mime types in a custom-mimetypes.xml and these successfully overwrite the original ones but changing the priority of these does not force the utf-16 files to be identified as html. Even removing the BOM matches completely from the text mimetype in the custom-mimetypes.xml does not work. 

So I tried another approach by removing the BOM from the input stream before detecting. However, the utf-16 file is still not recognised as html, despite the text having multiple matches. It seems that the detect method does not realise what encoding is being used for the file. Is there a way to tell a detector what encoding a file is in to aid detection?

Thanks

George
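
A possible workaround along the lines of the second approach above: rather than only stripping the BOM, decode the bytes with the charset the BOM indicates and re-encode them as UTF-8, so that byte-oriented patterns such as "<html" have a chance to match. The sketch below uses only the JDK. The class name BomTranscoder and its method are hypothetical, and whether the transcoded bytes are then detected as text/html by a given Tika version is an assumption that would need testing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Tika: normalises UTF-16 input to UTF-8 before detection. */
final class BomTranscoder {

    private BomTranscoder() {}

    /**
     * If data starts with a UTF-16 BOM, decodes the remainder with the matching
     * charset and returns the text re-encoded as UTF-8 (without a BOM).
     * Otherwise returns the input unchanged. UTF-32 BOMs are not handled here.
     */
    static byte[] toUtf8IfUtf16(byte[] data) {
        Charset charset = null;
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                charset = StandardCharsets.UTF_16BE;
            } else if (b0 == 0xFF && b1 == 0xFE) {
                charset = StandardCharsets.UTF_16LE;
            }
        }
        if (charset == null) {
            return data; // no UTF-16 BOM found; leave the bytes as they are
        }
        // Skip the two BOM bytes, decode the rest, and re-encode as UTF-8.
        String text = new String(data, 2, data.length - 2, charset);
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

The result could then be wrapped in a ByteArrayInputStream and handed to the detector in place of the original stream.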

UTF-16 encoded HTML files detected as text/plain

I can successfully detect valid html files in other encodings, but when a valid file is encoded as UTF-16 it is identified as text/plain. I can see that in tika-mimetypes.xml the UTF-16 BOMs are used to identify files as text/plain with a priority of 20 and *.html identification is set to a priority of 40. I’m not sure why this is the case.

I see the advice here is not to alter tika-mimetypes.xml (and indeed that would be a pain to maintain) and suggests that custom-mimetypes.xml should be used for new file types. However, I want to override the definition of the existing text/plain type, either to reduce its priority or to remove the UTF-16 magic entries, so that my valid UTF-16 html files are correctly identified.

Is this possible, or is there a better way to achieve my aim of correctly identifying my UTF-16 html files as I can with those in other encodings?

George
David Meikle | 16 Jun 06:44 2014

ApacheCon CFP closes June 25

Dear Apache Tika Enthusiast,

As you may be aware, ApacheCon will be held this year in Budapest, on
November 17-23. (See http://apachecon.eu for more info.)

The Call For Papers for that conference is still open, but will be
closing soon. We need your talk proposals, to represent Apache Tika at
ApacheCon. We need all kinds of talks - deep technical talks, hands-on
tutorials, introductions for beginners, or case studies about the
awesome stuff you're doing with Apache Tika.

Please consider submitting a proposal, at
http://events.linuxfoundation.org//events/apachecon-europe/program/cfp

Thanks!
Mattmann, Chris A (3980 | 12 Jun 05:27 2014

Re: Question re installing Tika

Hi Richard,

Hope you are well, will try and answer below:

-----Original Message-----

From: Richard <rgwlawson@...>
Date: Friday, June 6, 2014 6:07 AM
To: "user@..." <user@...>,
"dev-owner@..." <dev-owner@...>
Subject: Question re installing Tika

>Hello
>
>I am new to the Apache suite of products and, more generally, to dealing
>with text in PDFs. In particular I am trying to install Tika (the
>tika-app_1.5.jar) as well as Solr on my Windows 7 PC.
>
>However I am confused about how to do the Tika installation.
>
>From reading various webpages (e.g.
>http://tika.apache.org/1.5/gettingstarted.html) it seems I need to
>
>1) Download the .jar from http://tika.apache.org/download.html
>(do I need to put it in a specific Windows folder?)

Nope, you don't have to put it in any specific folder; anywhere you are
comfortable calling the jar from is fine.

>2) Download Maven 2 (from http://maven.apache.org/ ) and follow the
>instructions for Windows on
>http://maven.apache.org/download.cgi#Installation

No need to do this unless you are building from scratch.

>3) Also where do I set the base directory?

You just need to put the Apache Tika *-app.jar file into some
folder, and then
call it with: java -jar /path/to/tika-app-*version*.jar --help

>
>4) Where do I run the command "mvn install" from? Is it the command line?

If you are building from source, then you would run this at the top level
directory containing
files like pom.xml, tika-parent, tika-parsers, etc.

>
>
>Any help would be most gratefully received.

Cheers!

Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@...
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Carlos Scheidecker | 10 Jun 16:16 2014

extract content of an uploaded file on the fly

Hello all,

I have a Spring controller to handle uploads and I would like to extract the contents of a pdf, doc, txt, html file as it is uploaded.

Problem is that I can see the file being uploaded and I can see the bytes payload, but when I try to use the AutodetectParser I cannot get the contents, here is what I am doing.

Notice that I extract the bytes from the MultipartFile and then build a ByteArrayInputStream.

I can see from the debugger that it is not empty and has the contents. But when I try to extract them with Tika I get an empty string but no errors.

@Controller
@RequestMapping(value = "/documents")
public class DocumentController {

    @RequestMapping(value = "/parse", method = RequestMethod.POST)
    public @ResponseBody String handleFileUpload(
            @RequestParam("file") MultipartFile file) {
        if (!file.isEmpty()) {
            try {
                byte[] source = file.getBytes();
                long size = file.getSize();
                ByteArrayInputStream is = new ByteArrayInputStream(source);
                Metadata metadata = new Metadata();
                //metadata.set(Metadata.RESOURCE_NAME_KEY, file.getOriginalFilename());

                Parser parser = new AutoDetectParser(); // Should auto-detect!
                StringWriter textBuffer = new StringWriter();
                BodyContentHandler handler = new BodyContentHandler(textBuffer);
                ParseContext context = new ParseContext();
                parser.parse(is, handler, metadata, context);
                String content2 = textBuffer.toString();
                String content1 = handler.toString();

                Tika tk = new Tika();
                String text = tk.parseToString(is, metadata);
                is.close();
                // TODO : return structure instead of text
                return text;

            } catch (Exception e) {
                return "You failed to upload the file => " + e.getMessage();
            }
        } else {
            return "You failed to upload the file because the file was empty.";
        }
    }

}

I thought that calling either handler.toString() or passing a textBuffer to the handler constructor and then calling textBuffer.toString() would give me the contents of the text file or pdf being uploaded.

I get an empty string instead.

I do not want to save the file but just extract its text content. How shall I do it?

Thanks.
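
One thing that may be worth checking in the controller above: the same ByteArrayInputStream is handed first to parser.parse(...) and then to tk.parseToString(...), and only the second result is returned. A stream that has already been read to the end has nothing left for a second reader. The JDK-only sketch below illustrates that effect with a hypothetical readAll helper standing in for a parser. It is not Tika code, and stream reuse may not be the only cause of the empty output.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class: shows why reusing a consumed stream yields no content. */
final class StreamReuseDemo {

    private StreamReuseDemo() {}

    /** Stand-in for a parser: reads everything that is left in the stream. */
    static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] source = "uploaded file contents".getBytes(StandardCharsets.UTF_8);

        ByteArrayInputStream is = new ByteArrayInputStream(source);
        String first = readAll(is);   // consumes the whole stream
        String second = readAll(is);  // the stream is at its end, so nothing is read
        System.out.println("first=[" + first + "] second=[" + second + "]");

        // A fresh stream over the same bytes can be read again from the start.
        String third = readAll(new ByteArrayInputStream(source));
        System.out.println("third=[" + third + "]");
    }
}
```

In the controller, that would suggest either dropping one of the two parse calls or building a new ByteArrayInputStream(source) for each of them.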
Érico | 3 Jun 22:44 2014

Tika and JBoss 5.1 compatibility

Hi 

I need to use Tika embedded in JBoss EAP 5.1.

Which Tika version would be suitable for this?

Also, is there a how-to or tutorial for using Tika, with sample code?

Regards
Érico
Yi, EungJun | 27 May 06:32 2014

What exception does CharsetDetector.detect() throw?

Hi.

According to the javadoc of CharsetDetector.detect(), it raises an
exception if no charset appears to match the data:

     * Raise an exception if
     *  <ul>
     *    <li>no charsets appear to match the input data.</li>
     *    <li>no input text has been provided</li>
     *  </ul>

But it seems to me that the method returns null but does not raise an
exception. What exception does the method throw?

Thanks in advance.

Best Regards,
EungJun Yi

Annie Burgess | 8 May 01:16 2014

tika install fail on os x 10.9.2

Hi all, 

I have a new computer running  OS X 10.9.2 (13C64).  I am attempting to get Tika up and running, but am getting errors in the Maven install phase.  My steps are as follows:


[annies-mbp:~/tika/] % svn co https://svn.apache.org/repos/asf/tika/trunk tmp
[annies-mbp:~/tika/tmp]% setenv MAVEN_OPTS "-Xms128m -Xmx256m"
[annies-mbp:~/tika/tmp]% mvn install

Results :

Tests in error:

  testiBooksParser(org.apache.tika.parser.ibooks.iBooksParserTest): Premature end of file.

Tests run: 506, Failures: 0, Errors: 1, Skipped: 1

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] Apache Tika parent ................................ SUCCESS [  0.626 s]
[INFO] Apache Tika core .................................. SUCCESS [  6.631 s]
[INFO] Apache Tika parsers ............................... FAILURE [ 23.323 s]

.
.
.

[INFO] ------------------------------------------------------------------------

[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------

My Maven version is:

[annies-mbp:~/Development/tika/tmp]% mvn --version
Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 2014-02-14T08:37:52-09:00)
Maven home: /usr/local/Cellar/maven/3.2.1/libexec
Java version: 1.8.0_05, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.9.2", arch: "x86_64", family: "mac"


Does anyone have any insight as to why this is failing at 'iBooksParserTest'?
Thanks!
Annie

------------------------------------------------------------------------------------------
Ann Bryant Burgess, PhD

Postdoctoral Fellow
Computer Science Department
University of Southern California
Viterbi School of Engineering                        
Los Angeles, CA

Alaska Science Center/USGS
Anchorage, AK                  

Cell:  (585) 738-7549
Office:  (907) 786-7059
Fax:  (907) 786-7150
E-mail: anniebryant.burgess-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
Office Address: 4210 University Dr., Anchorage, AK 99508-4626
-------------------------------------------------------------------------------------------
Tamás Cservenák | 7 May 12:57 2014

Inconsistent priorities in bundled tika-mimetypes.xml

Hi all,

I just created an issue

In short: it's about Tika Detector detecting a JAR file (correct ZIP file, with proper magic bytes, etc) as "text/html" instead of expected "application/java-archive".

The reason is clear to me (we already created a PR in the Nexus project for that), but what puzzles me is _why_ the Detector behaves correctly with tika-parsers on the classpath.

How does the presence of tika-parsers affect the MIME magic detection and, most interestingly, why does it affect it? (I am aware of the added org.apache.tika.parser.pkg.ZipContainerDetector.)

Isn't MIME magic detection based on the bundled tika-mimetypes.xml? Even though the globs defined for text/html (*.htm and *.html) do not match the JAR file above (*.jar), Tika still selects the HTML MIME type.


Thanks,
~t~
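
For readers following along, here is a toy model of priority-based magic matching. It illustrates how a file with a valid ZIP signature could still be reported as text/html when an HTML pattern with a higher priority also matches somewhere in the first bytes. The class, the patterns, the search windows, and the priority numbers are all illustrative assumptions; they are not copied from tika-mimetypes.xml and this is not how Tika's detector is implemented.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: not Tika's detector and not its real rules. */
final class ToyMagicDetector {

    /** One magic rule: a byte pattern that may start anywhere in the first `window` bytes. */
    private static final class Rule {
        final String type;
        final byte[] pattern;
        final int window;
        final int priority; // higher priority wins when several rules match

        Rule(String type, byte[] pattern, int window, int priority) {
            this.type = type;
            this.pattern = pattern;
            this.window = window;
            this.priority = priority;
        }

        boolean matches(byte[] data) {
            for (int start = 0; start < window && start + pattern.length <= data.length; start++) {
                boolean same = true;
                for (int i = 0; i < pattern.length; i++) {
                    if (data[start + i] != pattern[i]) {
                        same = false;
                        break;
                    }
                }
                if (same) {
                    return true;
                }
            }
            return false;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    void addRule(String type, byte[] pattern, int window, int priority) {
        rules.add(new Rule(type, pattern, window, priority));
    }

    /** Returns the type of the highest-priority matching rule, or a generic fallback. */
    String detect(byte[] data) {
        Rule best = null;
        for (Rule rule : rules) {
            if (rule.matches(data) && (best == null || rule.priority > best.priority)) {
                best = rule;
            }
        }
        return best == null ? "application/octet-stream" : best.type;
    }
}
```

With a ZIP rule anchored at offset 0 and an HTML rule that scans a wider window at a higher priority, input that starts with the ZIP signature but contains "<html" shortly afterwards comes back as text/html in this model; lowering the HTML rule's priority below the ZIP rule's flips the result.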
Milos Kovacevic | 4 May 22:21 2014

Setting output format and content length in tika-server 1.5

Hello,
I would like to use tika-server-1.5.jar as a parsing server for my
application but I have two problems:

When I post a file for parsing to the Tika server from my app, I get the output
in HTML format, but I would like it to be plain text. When I use
tika-server-1.3.jar from the same app the output is plain text.

In addition, I would like to set the maximum content length for the output
to be more than 100 000 chars.

How do I set the output format and content length for tika-server-1.5.jar?
Do I have to change the source code, or can I just use a command line switch?

Regards, Milos

Allison, Timothy B. | 30 Apr 20:40 2014

tika server jax-rs and recursive file processing

All,

As always, apologies for the cluelessness the following reveals… I’m starting to move from embedded Tika to a server option for greater robustness.  Is the jax-rs server intended not to handle embedded files recursively?  If so, how are users currently handling multiply embedded documents with the jax-rs server?  Would it be worthwhile to add another service that uses AutoDetectParser as the embedded parser/extractor instead of MyEmbeddedDocumentExtractor?

Best,

Tim

Timothy B. Allison, Ph.D.
Lead Artificial Intelligence Engineer
Group Lead
K83A/Human Language Technology
The MITRE Corporation
7515 Colshire Drive, McLean, VA  22102
703-983-2473 (phone); 703-983-1379 (fax)