Ron Grabowski | 24 Oct 04:30 2013

Adding Tika to existing non-Spring Apache CXF web app

If I have an existing non-Spring Apache CXF web application up and running and I want to add support for basic metadata extraction via Tika, do I just need to register MetadataResource and TikaResource alongside my existing resources, like this?


<!-- web.xml -->
<servlet>
  <servlet-name>MyApp</servlet-name>
  <servlet-class>org.apache.cxf.jaxrs.servlet.CXFNonSpringJaxrsServlet</servlet-class>
  <init-param>
    <param-name>jaxrs.serviceClasses</param-name>
    <param-value>
      org.apache.tika.server.MetadataResource
      org.apache.tika.server.TikaResource
      com.example.ExistingResources
    </param-value>
  </init-param>
</servlet>

Will there be conflicts with the CXF jars that ship with Tika if my application is stuck on CXF v2.5.x?
Samuel Desseaux | 23 Oct 14:53 2013

extract metadata of pdf files with tika

Hi,

I'm fairly new to Tika and need some help.

I have many PDF files from which I would like to extract metadata, in order
to produce an XML file that follows Dublin Core.

I've followed these links

http://www.hascode.com/2012/12/content-detection-metadata-and-content-extraction-with-apache-tika/#Extracting_Metadata_from_a_PDF_using_a_concrete_Parser 
and http://tika.apache.org/0.8/api/org/apache/tika/metadata/Metadata.html

Do I have to write a program with Tika to do it?

How could I do that?

Best regards

Samuel
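
The XML half of this task can be sketched with the JDK alone. This is a minimal sketch, not a Tika API: the class `DublinCoreXml`, the root element name `metadata`, and the idea of passing the fields as a map are all assumptions made for illustration. The map is assumed to be filled by the caller from Tika's `Metadata` object (via `names()` and `get(String)`), with keys already mapped to Dublin Core element names such as `title` or `creator`.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper: serializes name/value pairs as Dublin Core elements.
class DublinCoreXml {
    // Namespace of the Dublin Core Metadata Element Set 1.1.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Keys are assumed to be Dublin Core element names ("title", "creator", ...).
    static String toXml(Map<String, String> fields) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartDocument();
        xml.writeStartElement("metadata");   // root element name is an assumption
        xml.writeNamespace("dc", DC_NS);
        for (Map.Entry<String, String> field : fields.entrySet()) {
            xml.writeStartElement("dc", field.getKey(), DC_NS);
            xml.writeCharacters(field.getValue()); // escapes &, < and >
            xml.writeEndElement();
        }
        xml.writeEndElement();
        xml.writeEndDocument();
        xml.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("title", "Annual report");
        fields.put("creator", "Example Author");
        System.out.println(toXml(fields));
    }
}
```

Which Tika metadata names correspond to which Dublin Core elements depends on the Tika version and the parser, so that mapping is left to the caller here.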

Bratislav Stojanovic | 18 Oct 11:42 2013

Fwd: Tika OSGi bundle does not produce UTF-8 output

Hi,

I've tried exactly the same code in two scenarios:

Tika tika = new Tika();
Metadata metadata = new Metadata();

Reader reader = tika.parse(new File("..."));
FileWriter fw = new FileWriter(new File("..."));

int data = reader.read();
StringBuilder sb = new StringBuilder();
while (data != -1) {
    char dataChar = (char) data;
    sb.append(dataChar);
    fw.write(dataChar);
    data = reader.read();
}

When I put this code in a simple Java project with tika-app-1.4.jar as a dependency, it
generates UTF-8 output (correct).
When I put this code inside a bundle with tika-bundle and tika-core as dependencies and deploy it
inside Karaf, it generates ANSI output (wrong).
Both projects are managed with Maven and Eclipse 4.2.

Do I have to set something additionally, or should I embed tika-app inside my bundle (using
maven-bundle-plugin)?

I'm using Tika 1.4, Java 1.6.45, Win 7 x64 and Karaf 2.3.3.


--
Bratislav Stojanovic, M.Sc.
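
One thing worth ruling out in the code above: `FileWriter` always encodes with the JVM's default charset, which can differ between a standalone run and a container. Below is a minimal sketch that pins the output encoding to UTF-8 instead. The class name `Utf8Copy` is hypothetical, and whether the default charset is the actual cause in the Karaf deployment is an assumption, not something the post confirms.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

// Hypothetical helper: copies characters to a file using an explicit charset.
class Utf8Copy {
    static void copy(Reader in, File target) throws IOException {
        try {
            // OutputStreamWriter takes the charset name explicitly,
            // so the platform default encoding is never consulted.
            Writer out = new BufferedWriter(
                    new OutputStreamWriter(new FileOutputStream(target), "UTF-8"));
            try {
                char[] buffer = new char[4096];
                int count;
                while ((count = in.read(buffer)) != -1) {
                    out.write(buffer, 0, count);
                }
            } finally {
                out.close(); // flushes buffered characters to disk
            }
        } finally {
            in.close();
        }
    }
}
```

With this helper, the `Reader` returned by `tika.parse(...)` would be passed as `in`. The code sticks to Java 6 APIs to match the environment described in the post.
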
Florin P | 10 Oct 11:15 2013

Unsubscribe address

Hello!
Can you please give me the unsubscribe address for the Tika list?
Thanks!
Mr Havercamp | 10 Oct 01:50 2013

Using TikaJAXRS with remote files

Hi

I've been working with Tika JAX-RS and it is working great.

One thing I'm wondering about: the standalone Tika app can extract remote
files when given a URL (both in GUI and command-line mode). Is the same
possible with Tika JAX-RS, or with the Tika app launched in
server mode?

The reason is that I may run an indexing client on a separate server, so it
wouldn't necessarily have direct access to the file system where the
files to be indexed reside.

Cheers

Hayden

Markus Jelsma | 9 Oct 11:24 2013

Script element not reported in custom handler

Hi,

I'm building a new ContentHandler that needs to do some work on script elements as well. But they are not
reported in my startElement method. The context has the IdentityHtmlMapper set and script does not get
discarded in Tika's own HtmlHandler. Instead, the script element is reported in HtmlHandler but not in my
custom handler.

The confusing thing is that I am able to get it in my handler when adding the script element to TagSoup inside
HtmlParser's constructor:
        HTML_SCHEMA.elementType("script", HTMLSchema.M_EMPTY, 65535, 0);

Without this, script and its characters are only reported inside HtmlHandler, never in custom handlers.

I must be doing something wrong here, any hints?

Thanks,
Markus
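
For reference, the handler side of this can be exercised without Tika or TagSoup. The sketch below is an assumption-laden illustration: `ScriptCollector` is a hypothetical class, and it is fed by the JDK's own SAX parser on well-formed XHTML rather than by HtmlParser. It only shows what a handler sees when script events do reach it; it does not explain why TagSoup withholds them in the setup described above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text content of every <script> element.
class ScriptCollector extends DefaultHandler {
    private final List<String> scripts = new ArrayList<String>();
    private StringBuilder current; // non-null only while inside <script>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("script".equalsIgnoreCase(nameOf(localName, qName))) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && "script".equalsIgnoreCase(nameOf(localName, qName))) {
            scripts.add(current.toString());
            current = null;
        }
    }

    // Parsers that are not namespace-aware leave localName empty.
    private static String nameOf(String localName, String qName) {
        return (localName == null || localName.length() == 0) ? qName : localName;
    }

    List<String> scripts() {
        return scripts;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head><script>var a = 1;</script></head>"
                + "<body><p>x</p></body></html>";
        ScriptCollector handler = new ScriptCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        System.out.println(handler.scripts());
    }
}
```

If a handler like this receives script events from a plain SAX parser but not from Tika, the difference lies upstream of the handler, which is consistent with the TagSoup schema observation above.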

Tad Wimmer | 26 Sep 20:45 2013

Extracting Metadata from MS Office (2007 +) Files on Glassfish

Hello.

I’m a Tika newbie, and am running into an issue with Tika on Glassfish.

I'm using Tika to extract metadata from documents uploaded to a JSF 2.0 web application using PrimeFaces p:fileupload (PrimeFaces 3.5) and running on Glassfish 3.2.2. Here are the essentials of my code:

private void extract(InputStream stream, Parser parser, Metadata metadata) {
    try {
        parser.parse(stream, new BodyContentHandler(), metadata, new ParseContext());
    } catch (IOException | SAXException | TikaException e) {
        LOGGER.debug("Exception parsing file for metadata: {}", e);
    }
}

Passing in an AutoDetectParser with the InputStream from the fileupload works fine for MS Office OOXML (.docx, .xlsx, etc.) documents in JUnit tests, but fails when I run it in the Glassfish container with the following stack trace:

Caused by: java.lang.NoClassDefFoundError: org/dom4j/Namespace
at org.apache.poi.openxml4j.opc.internal.unmarshallers.PackagePropertiesUnmarshaller.<clinit>(PackagePropertiesUnmarshaller.java:49) ~[na:3.6]
at org.apache.poi.openxml4j.opc.OPCPackage.init(OPCPackage.java:149) ~[na:3.6]
at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:136) ~[na:3.6]
at org.apache.poi.openxml4j.opc.Package.<init>(Package.java:54) ~[na:3.6]
at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:81) ~[na:3.6]
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:220) ~[na:3.6]
at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:86) ~[na:3.6]
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:53) ~[na:na]
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:69) ~[na:na]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:132) ~[na:na]
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:99) ~[na:na]
at com.spillman.fileupload.FileMetadataExtractor.extract(FileMetadataExtractor.java:220) ~[FileMetadataExtractor.class:na]

The tika-core-0.7, tika-parsers-0.7, tika-app-0.8, poi-3.6, poi-ooxml-3.6, and dom4j-1.6.1 jars are all in the build path and marked for export. I've even gone so far as to put the jars in the Glassfish endorsed directory. Web research hasn't produced anything directly related, but I did find a few references to this exception in CF related to class loader conflicts that I wasn't able to extrapolate to our Glassfish implementation (which may be a lack of understanding on my part). What do I need to configure or change to get this to work on Glassfish?

Thanks in advance,

 

Tad

 

TAD B WIMMER | Spillman Technologies | JAVA DEVELOPER – HOSTED SOLUTIONS

Toll Free 800.860.8026 ext. 1747 | Phone 801.902.1747 | Fax 801.902.1210

4625 Lake Park Blvd., Salt Lake City, UT 84120

twimmer-BxGJAMkO+kJWk0Htik3J/w@public.gmane.org www.spillman.com | www.citadex.com
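
A `NoClassDefFoundError` inside a container usually comes down to which class loader can see which jar. The following is a small diagnostic sketch using only JDK calls. The class name `ClassLocator` is hypothetical, and calling it from inside the web application (for example just before the `parser.parse(...)` call) is an assumption about where it would be most informative.

```java
import java.security.CodeSource;

// Hypothetical diagnostic: reports whether a class is visible to a given
// class loader and, if so, which loader and jar it actually came from.
class ClassLocator {
    static String describe(String className, ClassLoader loader) {
        try {
            // 'false' avoids running static initializers while probing.
            Class<?> type = Class.forName(className, false, loader);
            CodeSource source = type.getProtectionDomain().getCodeSource();
            String location = (source == null || source.getLocation() == null)
                    ? "bootstrap or unknown location"
                    : source.getLocation().toString();
            return className + " loaded by " + type.getClassLoader() + " from " + location;
        } catch (ClassNotFoundException e) {
            return className + " is not visible to " + loader;
        } catch (LinkageError e) {
            return className + " failed to link: " + e;
        }
    }

    public static void main(String[] args) {
        ClassLoader context = Thread.currentThread().getContextClassLoader();
        // The class names below are taken from the stack trace above.
        System.out.println(describe("org.dom4j.Namespace", context));
        System.out.println(describe("org.apache.poi.openxml4j.opc.OPCPackage", context));
    }
}
```

Comparing the output for the dom4j and POI classes shows whether they are resolved by the same loader; a jar placed in a container-level directory is typically loaded by a parent loader that cannot see jars packaged inside the web application.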

 

Jukka Zitting | 26 Sep 16:40 2013

Re: Unsubscribe

Hi,

To unsubscribe from this list, send a message to user-unsubscribe@...

See http://tika.apache.org/mail-lists.html for more details.

BR,

Jukka Zitting

Julien Nioche | 25 Sep 09:52 2013

Nutch talk at Lucene/SOLR Revolution EU 2013

Hi,

I will be giving a talk on Nutch at Lucene/SOLR Revolution in Dublin (4/7 Nov).

There should be quite a few interesting presentations, as you can see at http://lucenerevolution.org/sessions, as well as training sessions (http://lucenerevolution.org/training).

Ping me on twitter if you will be there and want to meet for a chat.

Julien

--

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Ryan McKinley | 16 Sep 21:32 2013

register multiple mime types to the same extension?

What approach do people suggest for handling the case where different types use the same extension?

In this case (Lidar vs ASCII log file) I can see if the first character is '~' and pick the right mime type.

How is this supported in Tika? If I register two mime types with:

<glob pattern="*.las" />


the second one fails.


Is there a general approach to this problem, or does this collision need to be sorted out in a custom Detector?


thanks

ryan
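
The first-character check described above can be sketched as a small content sniffer. This is a standalone illustration, not a Tika `Detector` implementation: the class `LasSniffer` and both media type strings are hypothetical placeholders, and treating every non-'~' file as Lidar is an assumption that only holds if the input is already known to be a `.las` file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sniffer for the two formats that share the .las extension.
class LasSniffer {
    // Placeholder type names, not registered media types.
    static final String ASCII_LOG = "text/x-las-log";
    static final String LIDAR = "application/x-las-lidar";
    static final String UNKNOWN = "application/octet-stream";

    // Peeks at the first byte and restores the stream position afterwards,
    // so the caller can still hand the untouched stream to a parser.
    static String sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(1);
        try {
            int first = in.read();
            if (first == -1) {
                return UNKNOWN;          // empty file: nothing to go on
            }
            return first == '~' ? ASCII_LOG : LIDAR;
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream log = new ByteArrayInputStream("~Version".getBytes("US-ASCII"));
        System.out.println(sniff(log));
    }
}
```

The same logic could live inside a custom detector that is only consulted once the file name matches `*.las`; wrapping a plain stream in a `BufferedInputStream` is enough to satisfy the mark/reset requirement.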


Bratislav Stojanovic | 30 Aug 13:58 2013

Fwd: Parsing jnilib file throws exception

Hi,

I'm trying to parse a folder with a jnilib file inside, but Tika 1.4 throws an exception:

java.io.IOException: 
at org.apache.tika.parser.ParsingReader.read(ParsingReader.java:260)
at java.io.Reader.read(Unknown Source)
at ca.cloudscraper.core.impl.Engine.process(Engine.java:63)
at ca.cloudscraper.core.impl.Engine.process(Engine.java:34)
at ca.cloudscraper.core.impl.Engine.process(Engine.java:34)
at ca.cloudscraper.core.impl.Engine.process(Engine.java:34)
at ca.cloudscraper.core.impl.Engine.execute(Engine.java:117)
at ca.cloudscraper.core.tests.LuceneServiceImplTest.test5(LuceneServiceImplTest.java:140)
at ca.cloudscraper.core.tests.LuceneServiceImplTest.main(LuceneServiceImplTest.java:176)
Caused by: org.apache.tika.exception.TikaException: Failed to parse a Java class
at org.apache.tika.parser.asm.XHTMLClassVisitor.parse(XHTMLClassVisitor.java:66)
at org.apache.tika.parser.asm.ClassParser.parse(ClassParser.java:51)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:221)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
at org.objectweb.asm.ClassReader.readClass(ClassReader.java:2157)
at org.objectweb.asm.ClassReader.accept(ClassReader.java:542)
at org.objectweb.asm.ClassReader.accept(ClassReader.java:506)
at org.apache.tika.parser.asm.XHTMLClassVisitor.parse(XHTMLClassVisitor.java:61)
... 6 more

It seems that Tika tries to parse this file as a Java class file, but that obviously doesn't work.
I've tried to create a custom-mimetypes.xml file like this:

<?xml version="1.0" encoding="UTF-8"?>
<mime-info>

  <mime-type type="application/octet-stream">
    <_comment>Mac OSX jnilib</_comment>
    <glob pattern="*.jnilib"/>
  </mime-type>

</mime-info>

and after I repack tika-app-1.4.jar with this file in the org.apache.tika.mime folder, the problem still
exists.

Please help.

--
Bratislav Stojanovic, M.Sc.



