Chris Bamford | 28 Apr 17:00 2014

Plans for Tika 1.6


Hi,

I am wondering when 1.6 is planned for release?
I recently worked with Tim Allison on some changes to the RTF parser (see https://issues.apache.org/jira/browse/TIKA-1010) and am keen to start using them.

Thanks,

Chris


Chris Bamford | Senior Developer
m: +44 7860 405292 | p: +44 207 847 8700 | w: www.mimecast.com


אברהם חיון | 23 Apr 10:35 2014

Correct use of Tika's MediaType

I want to use Tika's MediaType class to compare mediaTypes.

I first use Tika to detect the MediaType. Then I want to start an action according to the MediaType.

So if the MediaType is an XML type I want to run one action, and if it is a compressed file I want to run another action.

My problem is that there are many XML types, so how do I check whether a MediaType is XML?

Here is my previous (before Tika) implementation:

    if (contentType.contains("text/xml")
            || contentType.contains("application/xml")
            || contentType.contains("application/x-xml")
            || contentType.contains("application/atom+xml")
            || contentType.contains("application/rss+xml")) {
        processXML();
    } else if (contentType.contains("application/gzip")
            || contentType.contains("application/x-gzip")
            || contentType.contains("application/x-gunzip")
            || contentType.contains("application/gzipped")
            || contentType.contains("application/gzip-compressed")
            || contentType.contains("application/x-compress")
            || contentType.contains("gzip/document")
            || contentType.contains("application/octet-stream")) {
        processGzip();
    }

I want to switch it to use Tika something like the following:

    MediaType mediaType = MediaType.parse(contentType);
    if (mediaType == APPLICATION_XML) {
        return processXml();
    } else if (mediaType == APPLICATION_ZIP || mediaType == OCTET_STREAM) {
        return processGzip();
    }

But the problem is that Tika.detect(...) returns many different types which don't have a MediaType constant.

How can I simply identify whether the MediaType is an XML type, or a compressed type? I need a "parent" type that includes all of its children, or perhaps methods such as "boolean isXML()" (covering application/xml, text/xml, application/x-xml, etc.) and "boolean isCompress()" (covering all of the zip and gzip types).
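A minimal sketch of one possible approach, assuming Tika's MediaTypeRegistry and its isSpecializationOf() method, which walks the sub-type relationships declared in tika-mimetypes.xml; exactly which aliases are registered as sub-types of application/xml (or as binary types under application/octet-stream) depends on the Tika version, so treat this as a sketch rather than a guaranteed mapping:

    import org.apache.tika.config.TikaConfig;
    import org.apache.tika.mime.MediaType;
    import org.apache.tika.mime.MediaTypeRegistry;

    public class MediaTypeCheck {

        private static final MediaTypeRegistry REGISTRY =
                TikaConfig.getDefaultConfig().getMediaTypeRegistry();

        // True if 'detected' equals 'base' or is registered as a specialization of it
        // (e.g. application/rss+xml -> application/xml) in the loaded mime registry.
        static boolean isA(MediaType detected, MediaType base) {
            return detected.equals(base) || REGISTRY.isSpecializationOf(detected, base);
        }

        public static void main(String[] args) {
            MediaType detected = MediaType.parse("application/rss+xml");
            System.out.println(isA(detected, MediaType.APPLICATION_XML));
            System.out.println(isA(MediaType.parse("application/gzip"),
                    MediaType.OCTET_STREAM));
        }
    }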

Luke Miller | 1 Apr 15:48 2014

Apache Tika Skill Test

Hi everyone,

I'd love your feedback on this short Apache Tika skills test I created: 


Any and all feedback is incredibly helpful. I also have tests on Nutch and Tika.

Feel free to email me at luke-asoCgOi28Rh54TAoqtyWWQ@public.gmane.org if you have any questions or suggestions.

Cheers,
Luke

--
Luke Miller
Content Management Associate


Annie Burgess | 29 Mar 01:23 2014

class paths

Hi Tika users.  

I'm having trouble importing Tika .class files. I have written a basic .java file that is saved in the same directory as toolsUI-4.3.jar.

I compile it at the command line as:

      [abryant:~/tika/tika] abryant% javac -cp '.:toolsUI-4.3.jar' NetDump_tikalist.java


Within that file I have the import statements:

     import org.apache.tika.metadata.Metadata;
     import org.apache.tika.metadata.Property;

When I compile it I get the errors:

     NetDump_tikalist.java:11: package org.apache.tika.metadata does not exist
     import org.apache.tika.metadata.Metadata;
                               ^
     NetDump_tikalist.java:12: package org.apache.tika.metadata does not exist
     import org.apache.tika.metadata.Property;

The 'metadata' files are in: 

    /Users/annbryant/tika/tika/tika-core/target/classes/org/apache/tika/metadata

If I run the script with the two tika import commands commented out, it works fine.  

As a Java beginner, I'd appreciate any input on what needs to happen behind the scenes for the compiler to find the Tika .class files.
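One thing worth checking, sketched below: javac and java also need the Tika classes on the classpath, for example by appending the tika-core classes directory mentioned above (or a built tika-core jar). The exact paths here are assumptions based on that directory:

      javac -cp '.:toolsUI-4.3.jar:/Users/annbryant/tika/tika/tika-core/target/classes' NetDump_tikalist.java
      java  -cp '.:toolsUI-4.3.jar:/Users/annbryant/tika/tika/tika-core/target/classes' NetDump_tikalist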

Many thanks!

Annie


*********************************SCRIPT*********************************

// NetCDF imports
import ucar.nc2.NetcdfFile;
import ucar.nc2.Attribute;
import ucar.nc2.Variable;
import ucar.nc2.Dimension;

import java.io.IOException;
import java.util.logging.Logger;

// Desired TIKA imports
import org.apache.tika.metadata.Metadata; // comment this out to get it to run
import org.apache.tika.metadata.Property; // comment this out to get it to run

public class NetDump_tikalist {

    private static Logger log = Logger.getLogger("InfoLogging");

    public static void main(String[] args) {
        String name = "lsmask.nc.nc"; // NetCDF file
        // NetCDF file downloaded from: ftp://ftp.cdc.noaa.gov/Datasets/noaa.oisst.v2/lsmask.nc
        System.out.println(name); // Print file name

        try {
            NetcdfFile ncFile = NetcdfFile.open(name); // Attempt to open NetCDF file

            System.out.println("Dimensions:");
            for (Dimension dim : ncFile.getDimensions()) {
                System.out.println(dim);
            }

            System.out.println("Variables:");
            for (Variable var : ncFile.getVariables()) {
                //Property property = resolveMetadataKey(var.getName()); //**need tika import
                if (var.getDataType().isString()) { // data type of the variable is a string
                    System.out.println(var);
                    //metadata.add(property, attr.getStringValue()); //**need tika import
                } else if (var.getDataType().isNumeric()) { // data type of the variable is numeric
                    //int value = var.getNumericValue().intValue();
                    //metadata.add(property, String.valueOf(value)); //**need tika import
                    System.out.println(var);
                }
            }
        } catch (IOException ioe) {
            log.info("error");
        }
    }
}

/*  SCREEN OUTPUT WITH SUCCESSFUL RUN

From the command line:

[abryant:~/tika/tika] abryant% javac -cp '.:toolsUI-4.3.jar' NetDump_test.java
[abryant:~/tika/tika] abryant% java -cp '.:toolsUI-4.3.jar' NetDump_test

 Output will mimic that of NCdumpW:  
 
Dimensions
lon = 321;
lat = 161;
time = UNLIMITED;   // (1 currently

Variables
float lat(lat=161);
  :units = "degrees_north";
  :long_name = "Latitude";
  :actual_range = 20.0f, 60.0f; // float

float lon(lon=321);
  :units = "degrees_east";
  :long_name = "Longitude";
  :actual_range = 220.0f, 300.0f; // float

double time(time=1);
  :units = "hours since 1-1-1 00:00:00";
  :long_name = "Time";
  :actual_range = 0.0, 0.0; // double
  :delta_t = "0000-00-01 00:00:00";
  :avg_period = "0000-00-01 00:00:00";

short lsmask(time=1, lat=161, lon=321);
  :long_name = "Land Sea Mask";
  :valid_range = -1S, 1S; // short
  :actual_range = -1.0f, 1.0f; // float
  :add_offset = 0.0f; // float
  :scale_factor = 1.0f; // float
  :missing_value = 32766S; // short
  :var_desc = "Land-sea mask";
  :dataset = "CPC Unified Gauge-Based Analysis of Daily Precipitation over CONUS";
  :level_desc = "Surface";
  :statistic = "Other";

*/
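Once the imports resolve, the commented-out Tika calls could be fleshed out along these lines. A minimal sketch, assuming the metadata of interest lives in the file's global attributes; the class name and the attribute-to-property mapping are invented for illustration:

    import java.io.IOException;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.metadata.Property;

    import ucar.nc2.Attribute;
    import ucar.nc2.NetcdfFile;

    public class NetDumpMetadata {
        public static void main(String[] args) throws IOException {
            Metadata metadata = new Metadata();
            NetcdfFile ncFile = NetcdfFile.open("lsmask.nc.nc");
            try {
                // Map each global attribute onto a Tika metadata property
                for (Attribute attr : ncFile.getGlobalAttributes()) {
                    Property property = Property.internalText(attr.getName());
                    if (attr.getDataType().isString()) {
                        metadata.add(property, attr.getStringValue());
                    } else if (attr.getDataType().isNumeric()) {
                        metadata.add(property, String.valueOf(attr.getNumericValue().intValue()));
                    }
                }
            } finally {
                ncFile.close();
            }
            for (String name : metadata.names()) {
                System.out.println(name + " = " + metadata.get(name));
            }
        }
    }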


--
------------------------------------------------------------------------------------------
Ann Bryant Burgess, PhD

University of Southern California
Postdoctoral Fellow                        
Los Angeles, CA

Alaska Science Center/USGS
Anchorage, AK                  

Cell:  (585) 738-7549
Office:  (907) 786-7059
Office Address: 4210 University Dr., Anchorage, AK 99508-4626
-------------------------------------------------------------------------------------------
Mattmann, Chris A (3980) | 26 Mar 22:37 2014

Fwd: Search broken at Apache Tika site?



Sent from my iPhone

Begin forwarded message:

From: Patrick Durusau <patrick-Q/T9HJUWdxzR7s880joybQ@public.gmane.org>
Date: March 26, 2014 at 5:35:36 PM EDT
To: <dev-owner <at> tika.apache.org>
Subject: Search broken at Apache Tika site?


Hello,

I'm not sure if I should subscribe to the user-p0YNe5MqqUkPKjDvHGQMeg@public.gmane.org list to
report site/documentation errors or not.

I also noticed that the "Search with Apache Solr" feature is "broken" on the homepage.

By "broken" I mean that I searched for "www.pdfbox.org" the incorrect
location for PDFBox that occurs more than once in the documentation
and got this response:

*****
Not Found

The requested URL /p:tika was not found on this server.

Apache/2.2.22 (Ubuntu) Server at search.lucidimagination.com Port 80
*****

?

Thanks!

Hope you are having a great day!

Patrick

--
Patrick Durusau
patrick-Q/T9HJUWdxzR7s880joybQ@public.gmane.org
Technical Advisory Board, OASIS (TAB)
Co-Chair, OpenDocument Format TC (OASIS)
Editor, OpenDocument Format TC, Project Editor ISO/IEC 26300
Former Chair, V1 - US TAG to JTC 1/SC 34
Convener, JTC 1/SC 34/WG 3 (Topic Maps)
Co-Editor, ISO 13250-5 (Topic Maps)

Another Word For It (blog): http://tm.durusau.net
Homepage: http://www.durusau.net
Twitter: patrickDurusau
Jose Carlos Canova | 25 Mar 23:55 2014

Regarding the previous question (threads in Tika when the sizeLimit is exceeded by the document).

I followed the recommendation to use a Reader and it worked fine.

Tks. 

Jose Carlos Canova | 25 Mar 18:36 2014

PDFBox Parser Threads.

Hi, 

I've been testing Tika with PDF files, and for some reason I haven't tracked down yet, it seems that PDFBox manages its own threads when the length of the text about to be parsed is bigger than the default maximum string length (a Tika parameter). The same probably happens with other parsers (not worried about that for now) whose output exceeds that default maximum length.

Does anyone know where such threads are started by Tika, or whether this is a native thread not managed by the JVM? I took a look at the code and haven't found any "extra thread" started by the Tika component.

Regards,

How to index only the pdf content/text

I searched for a way to index only the content/text part of a PDF (without all the other fields Tika creates) and I found the "solution" of setting "uprefix" to ignored_ together with <dynamicField name="ignored_*" type="ignored" multiValued="true" indexed="false" stored="false" />.

 

The problem is that uprefix only works on fields that are not specified in the schema. In my schema I specified two fields (id and rmDocumentTitle), and these two fields are added to the content too (which I want to avoid).

 

How can I exclude these two fields from being added to the fullText?

 

Here are my config files:

 

schema.xml

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="simple" version="1.1">
    <types>
        <fieldtype name="string" class="solr.StrField" postingsFormat="SimpleText" />
        <fieldtype name="ignored" class="solr.TextField" />
        <fieldtype name="text" class="solr.TextField" postingsFormat="SimpleText">
            <analyzer type="index">
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <!--<filter class="solr.ASCIIFoldingFilterFactory"/>--> <!-- Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters into their ASCII equivalents, if one exists. -->
                <filter class="solr.LowerCaseFilterFactory" /> <!-- Lowercases the letters in each token. Leaves non-letter tokens alone. -->
                <filter class="solr.TrimFilterFactory"/> <!-- Trims whitespace at either end of a token. -->
                <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/> <!-- Discards common words. -->
                <filter class="solr.PorterStemFilterFactory"/>
                <!--<filter class="solr.SnowballPorterFilterFactory" language="German2" /> -->
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.TrimFilterFactory"/>
                <filter class="solr.PorterStemFilterFactory"/>
                <!--<filter class="solr.SnowballPorterFilterFactory" language="German2" /> -->
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
        </fieldtype>
    </types>

    <fields>
        <field name="signatureField" type="string" indexed="true" stored="true" multiValued="false" />
        <dynamicField name="ignored_*" type="ignored" multiValued="true" indexed="false" stored="false" />
        <field name="id" type="string" indexed="true" stored="true" multiValued="false" />
        <field name="rmDocumentTitle" type="string" indexed="true" stored="true" multiValued="true"/>
        <field name="fullText" indexed="true" type="text" multiValued="true" />
    </fields>

    <defaultSearchField>fullText</defaultSearchField>

    <solrQueryParser defaultOperator="OR" />
    <uniqueKey>id</uniqueKey>
</schema>

 

 

solrconfig.xml

<?xml version="1.0" encoding="UTF-8" ?>
<config>
    …
    <requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
        <lst name="defaults">
            <str name="captureAttr">true</str>
            <str name="lowernames">false</str>
            <str name="overwrite">false</str>
            <str name="captureAttr">true</str>
            <str name="literalsOverride">true</str>
            <str name="uprefix">ignored_</str>
            <str name="fmap.a">link</str>
            <str name="fmap.content">fullText</str>
            <!-- the configuration here could be useful for tests -->
            <str name="update.chain">deduplication</str>
        </lst>
    </requestHandler>

    <updateRequestProcessorChain name="deduplication">
        <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
            <bool name="overwriteDupes">false</bool>
            <str name="signatureField">signatureField</str>
            <bool name="enabled">true</bool>
            <str name="fields">content</str>
            <str name="minTokenLen">10</str>
            <str name="quantRate">.2</str>
            <str name="signatureClass">solr.update.processor.TextProfileSignature</str>
        </processor>
        <processor class="solr.LogUpdateProcessorFactory" />
        <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

    <requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />

    <lockType>none</lockType>

    <admin>
        <defaultQuery>*:*</defaultQuery>
    </admin>
</config>
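For context, a request against this /update/extract handler would typically be issued along the lines below; a sketch only, with the core name, document id, title, and file name invented for illustration:

    curl "http://localhost:8983/solr/collection1/update/extract?literal.id=doc1&literal.rmDocumentTitle=Example&commit=true" -F "file=@example.pdf"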

 

 

Thank you for any help.

Francesco

 

Matthew Snape | 19 Mar 17:17 2014

problem with duplicate json output

Hi,

 

I am attempting to extract metadata from a PDF into JSON with the following command:

 

java -jar tika-app-1.5.jar -j example.pdf

 

This appears to give the output twice. For example, the PDF below produces the output shown beneath it, which throws off my JSON parser. Am I doing something wrong?

 

Thanks.

 

Example PDF

 

http://www.dadsgarage.com/~/media/Files/example.ashx

 

Output

 

{ "Author":null,

"Content-Length":194007,

"Content-Type":"application/pdf",

"Keywords":null,

"cp:subject":null,

"creator":null,

"dc:creator":null,

"dc:subject":null,

"dc:title":null,

"meta:author":null,

"meta:keyword":null,

"producer":"dvips + GNU Ghostscript 7.05",

"resourceName":"example.pdf",

"subject":null,

"title":null,

"xmp:CreatorTool":"LaTeX with hyperref package",

"xmpTPg:NPages":10 }{ "Author":null,

"Content-Length":194007,

"Content-Type":"application/pdf",

"Keywords":null,

"cp:subject":null,

"creator":null,

"dc:creator":null,

"dc:subject":null,

"dc:title":null,

"meta:author":null,

"meta:keyword":null,

"producer":"dvips + GNU Ghostscript 7.05",

"resourceName":"example.pdf",

"subject":null,

"title":null,

"xmp:CreatorTool":"LaTeX with hyperref package",

"xmpTPg:NPages":10 }



Indexing only “readable/parsable” text from pdf

I have to index a list of PDFs. For some of them there is no problem, but for others, when I look at the indexed content I only see a lot of diamonds with a question mark in them.

 

I think the problem is the font used in the document, or that the content is "encapsulated" in a picture.

 

Is there a way to tell Tika to extract only the "readable/parsable" text of a PDF?

 

When I query all the documents (with my Java application), this is an example of what I see in the logfile for the content of the problematic files:

 

    DEBUG org.apache.http.wire -  << " [\n]">

    DEBUG org.apache.http.wire -  << "  [0xe8]?[0x1]d41d8cd98f00b204e9800998ecf8427e[0xb][0xa4][0xe5][0x81](Diverses[0xe6]=aabhpdtyan3vfsujquccemebqr4m3[0xe7][0x81]?[0xc1][0x4] [\n]">

    DEBUG org.apache.http.wire -  << " [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << " E-Mail zur Archivierung [\n]">

    DEBUG org.apache.http.wire -  << " [\n]">

    DEBUG org.apache.http.wire -  << "    [\n]">

    DEBUG org.apache.http.wire -  << " [0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0x9][0xef][0xbf][0xbd][\n]">

    DEBUG org.apache.http.wire -  << "[0xef][0xbf][0xbd][0xef][0xbf][0xbd][0x9][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][\r][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][\r][\n]">

    DEBUG org.apache.http.wire -  << " [\n]">

    DEBUG org.apache.http.wire -  << " [0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][\n]">

    DEBUG org.apache.http.wire -  << " [\n]">

    DEBUG org.apache.http.wire -  << " [0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][\n]">

    DEBUG org.apache.http.wire -  << "[0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][\n]">

    DEBUG org.apache.http.wire -  << " [\n]">

    DEBUG org.apache.http.wire -  << " [0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0x9][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][\n]">

    DEBUG org.apache.http.wire -  << " [\n]">

    DEBUG org.apache.http.wire -  << " [0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0x9][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][\n]">

    DEBUG org.apache.http.wire -  << "[0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][\r][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][\r][\n]">

    DEBUG org.apache.http.wire -  << "[0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][\n]">

    DEBUG org.apache.http.wire -  << "[0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][\n]">

    DEBUG org.apache.http.wire -  << "[0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][\r][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][\r][\n]">

    DEBUG org.apache.http.wire -  << "[0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][\n]">

    DEBUG org.apache.http.wire -  << " [\n]">

    DEBUG org.apache.http.wire -  << " [0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][0xef][0xbf][0xbd][\n]">

    DEBUG org.apache.http.wire -  << " [\n]">

    DEBUG org.apache.http.wire -  << " [0xef][0xbf][0xbd] [\n]">

    DEBUG org.apache.http.wire -  << "  [\n]">

    DEBUG org.apache.http.wire -  << " [\n]">

    DEBUG org.apache.http.wire -  << " [0x9] data1.pdf [\n]">

 

 

 

Another problem is that for all the files (also the "good ones") there is a long run of `\n` at the beginning of the content field, as you can also see above. How can I avoid this?

 

 

Here is my schema.xml:

 

    <?xml version="1.0" encoding="UTF-8" ?>
    <schema name="simple" version="1.1">
        <types>
            <fieldtype name="string" class="solr.StrField" postingsFormat="SimpleText" />
            <fieldtype name="ignored" class="solr.TextField" />
            <fieldtype name="text" class="solr.TextField" postingsFormat="SimpleText">
                <analyzer>
                    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\n" replacement=""/>
                    <tokenizer class="solr.StandardTokenizerFactory"/>
                    <filter class="solr.LowerCaseFilterFactory" /> <!-- Lowercases the letters in each token. Leaves non-letter tokens alone. -->
                    <filter class="solr.ClassicFilterFactory" /> <!-- Removes dots from acronyms and 's from the end of tokens. Works only on typed tokens produced by ClassicTokenizer or equivalent. -->
                    <filter class="solr.TrimFilterFactory"/> <!-- Trims whitespace at either end of a token. -->
                    <filter class="solr.StopFilterFactory" ignoreCase="true"/> <!-- Discards common words. -->
                    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
                </analyzer>
            </fieldtype>
        </types>

        <fields>
            <field name="signatureField" type="string" indexed="true" stored="true" multiValued="false" />
            <dynamicField name="ignored_*" type="ignored" multiValued="true" indexed="false" stored="false" />
            <field name="id" type="string" indexed="true" stored="true" multiValued="false" />
            <field name="rmDocumentTitle" type="string" indexed="true" stored="true" multiValued="true"/>
            <field name="fullText" indexed="true" type="text" multiValued="true" />
        </fields>

        <defaultSearchField>fullText</defaultSearchField>

        <solrQueryParser defaultOperator="OR" />
        <uniqueKey>id</uniqueKey>
    </schema>

 

and my solrconfig.xml:

 

    <?xml version="1.0" encoding="UTF-8" ?>
    <config>
        <luceneMatchVersion>LUCENE_45</luceneMatchVersion>
        <directoryFactory name='DirectoryFactory' class='solr.MMapDirectoryFactory' />

        <codecFactory name="CodecFactory" class="solr.SchemaCodecFactory" />

        <lib dir='${solr.core.instanceDir}\lib' />
        <lib dir="${solr.core.instanceDir}\dist\" regex="solr-cell-\d.*\.jar" />
        <lib dir="${solr.core.instanceDir}\contrib\extraction\lib" regex=".*\.jar" />

        <requestHandler name="standard" class="solr.StandardRequestHandler" default="true" />

        <requestHandler name="/update" class="solr.UpdateRequestHandler">
            <lst name="defaults">
                <str name="update.chain">deduplication</str>
            </lst>
        </requestHandler>

        <requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
            <lst name="defaults">
                <str name="captureAttr">true</str>
                <str name="lowernames">false</str>
                <str name="overwrite">false</str>
                <str name="captureAttr">true</str>
                <str name="literalsOverride">true</str>
                <str name="uprefix">ignored_</str>
                <str name="fmap.a">link</str>
                <str name="fmap.content">fullText</str>
                <!-- the configuration here could be useful for tests -->
                <str name="update.chain">deduplication</str>
            </lst>
        </requestHandler>

        <updateRequestProcessorChain name="deduplication">
            <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
                <bool name="overwriteDupes">false</bool>
                <str name="signatureField">signatureField</str>
                <bool name="enabled">true</bool>
                <str name="fields">content</str>
                <str name="minTokenLen">10</str>
                <str name="quantRate">.2</str>
                <str name="signatureClass">solr.update.processor.TextProfileSignature</str>
            </processor>
            <processor class="solr.LogUpdateProcessorFactory" />
            <processor class="solr.RunUpdateProcessorFactory" />
        </updateRequestProcessorChain>

        <requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />

        <lockType>none</lockType>

        <admin>
            <defaultQuery>*:*</defaultQuery>
        </admin>
    </config>

Grant Ingersoll | 13 Mar 19:28 2014

Parsers, DefaultConfig and such

A colleague and I were parsing the Enron dataset the other day and noticed that a number of emails that had message bodies in them were not getting extracted.

In particular, when running our Tika parsing code in Hadoop distributed mode, the body was going missing. If I run the exact same code in my IDE in Hadoop local mode (i.e. no cluster), the message body gets extracted fine.

To isolate things, we tried with the testLotusEml.eml file in Tika's test document suite (many of the Enron emails are Lotus) and noticed the same thing. Digging in further, I thought the issue might be something in the RFC822Parser, since this is the MIME type of the document. (In particular, I thought it would be a threading issue.)

Turns out, however, the problem seems to be in my understanding of how TikaConfig.getDefaultConfig().getParser works (or doesn't work). Namely, if you run the test below (I added it to RFC822ParserTest locally), the first two checkParser calls pass just fine, and the third one fails.

So, I guess my questions are:
- What's different between how I use getDefaultConfig in local mode vs. Hadoop mode? I haven't customized the config at all in either case and I am not aware of any SPIs registered. (I've also reproduced the problem in non-dev environments -- i.e. machines only doing this workload with a clean OS.)
- What's different in this test, which is run in the Tika development environment and presumably has the same core configuration?

(Note to Julien Nioche, if you are reading this: this problem exists in the Behemoth TikaProcessor, or at least it did in the snapshot of the version I have.)

 @Test
 public void testLotus() throws Exception {
   checkParser(new RFC822Parser());
   checkParser(new AutoDetectParser());
   checkParser(TikaConfig.getDefaultConfig().getParser());
 }

 private void checkParser(Parser parser) {
   Metadata metadata = new Metadata();
   InputStream stream = getStream("test-documents/testLotusEml.eml");
   ContentHandler handler = new BodyContentHandler();

   try {
     parser.parse(stream, handler, metadata, new ParseContext());
     String bodyText = handler.toString();
     assertTrue(bodyText.contains("Message body"));
   } catch (Exception e) {
     fail("Exception thrown: " + e.getMessage());
   }
 }
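A hedged debugging sketch, assuming the default config's parser is a CompositeParser (as it normally is): print which parser is resolved for message/rfc822 in each environment, to see whether the RFC822Parser is actually registered there:

    import java.util.Map;

    import org.apache.tika.config.TikaConfig;
    import org.apache.tika.mime.MediaType;
    import org.apache.tika.parser.CompositeParser;
    import org.apache.tika.parser.Parser;

    public class WhichParser {
        public static void main(String[] args) {
            Parser parser = TikaConfig.getDefaultConfig().getParser();
            if (parser instanceof CompositeParser) {
                // Map from media type to the parser that will handle it
                Map<MediaType, Parser> parsers = ((CompositeParser) parser).getParsers();
                System.out.println(parsers.get(MediaType.parse("message/rfc822")));
            }
        }
    }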

Thanks,
Grant

--------------------------------------------
Grant Ingersoll | @gsingers
http://www.lucidworks.com

