Alexandre Rafalovitch | 25 May 2013 14:25

Wiki editing: please add AlexandreRafalovitch

I want to document the new DIH Tika flag I introduced in SOLR-4530

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)

Gian Maria Ricci | 25 May 2013 10:44

Tika: How can I automatically import all metadata without specifying them explicitly

Hi everyone,

I've configured the import of a document folder with FileListEntityProcessor, and everything went smoothly on the first try, but I have a simple question. I can map metadata without any problem, but I'd like to import all metadata into my index, not only the fields I've configured with field nodes. In this example I've imported Author and title, but I don't know in advance which metadata a document might have, and I'd like all of it in my index.

Here is my import config. This is my first try at importing with Tika, so I'm probably missing something simple.

 

<dataConfig>
    <dataSource type="BinFileDataSource" />
    <document>
        <entity name="files" dataSource="null" rootEntity="false"
                processor="FileListEntityProcessor"
                baseDir="c:/temp/docs" fileName=".*\.(doc)|(pdf)|(docx)"
                onError="skip"
                recursive="true">
            <field column="file" name="id" />
            <field column="fileAbsolutePath" name="path" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />

            <entity
                    name="documentImport"
                    processor="TikaEntityProcessor"
                    url="${files.fileAbsolutePath}"
                    format="text">
                <field column="file" name="fileName"/>
                <field column="Author" name="author" meta="true"/>
                <field column="title" name="title" meta="true"/>
                <field column="text" name="text"/>
            </entity>
        </entity>
    </document>
</dataConfig>
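
A side note on the fileName pattern in this config, separate from the metadata question: in the regular expression .*\.(doc)|(pdf)|(docx) the alternation binds more loosely than the rest, so it reads as ".*\.(doc)" OR "(pdf)" OR "(docx)". The sketch below uses Python's re module only to illustrate that precedence (java.util.regex groups alternation the same way). It assumes the pattern is applied as a substring search; the sample file names and the grouped, end-anchored alternative are illustrative, not taken from the original post.

```python
import re

# Pattern exactly as written in the config: three top-level alternatives.
loose = re.compile(r".*\.(doc)|(pdf)|(docx)")

# Assumed alternative: group the extensions and anchor the end of the name.
strict = re.compile(r".*\.(doc|pdf|docx)$")

# Hypothetical file names, chosen to show the difference.
names = ["report.pdf", "notes.docx", "pdf-howto.txt", "letter.doc", "image.png"]


def matches(pattern, name):
    # Substring search: the pattern may match anywhere in the name.
    return pattern.search(name) is not None


loose_hits = [n for n in names if matches(loose, n)]
strict_hits = [n for n in names if matches(strict, n)]

print(loose_hits)   # includes "pdf-howto.txt", which merely contains "pdf"
print(strict_hits)  # only names ending in .doc, .pdf or .docx
```

If the file list ever picks up unexpected files, the grouped form is the first thing to try.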

 

 

--
Gian Maria Ricci
Mobile: +39 320 0136949


Naska Osmani | 25 May 2013 09:06

Indexing Solr, Multiple Doc Types. Production of Multiple Values for UniqueKey Field Using TemplateTransformer


Hello,

I want to index multiple tables of a database into a single Solr index.
I used TemplateTransformer to concatenate a prefix (the id of the table
or entity) with the uniqueKey uid so that entities don't overwrite
each other. The documents don't get indexed, and I get this error message:

org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: uid=[A_1, dc1999fcf12df900]
     at org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:91)
     at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:464)
     at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:346)
     at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
     at org.apache.solr.update.processor.SignatureUpdateProcessorFactory$SignatureUpdateProcessor.processAdd(SignatureUpdateProcessorFactory.java:194)
     at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:70)
     at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:235)
     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:500)
     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
     at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
     at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
     at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
     at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
     at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)

Here is part of my schema.xml:
<fields>
     <field name="aid" type="string" multiValued="true"/>
     <field name="aname" type="string" indexed="true" stored="true"
omitNorms="true" termVectors="true" multiValued="true"/>
     <field name="acountry" type="string" indexed="true" stored="true"
omitNorms="true" termVectors="true" multivalued="true" />

     <field name="bid" type="string" multiValued="true"/>
     <field name="bname" type="string" indexed="true" stored="true"
omitNorms="true" termVectors="true" multiValued="true"/>
     <field name="bcountry" type="string" indexed="true" stored="true"
omitNorms="true" termVectors="true" multivalued="true" />

     <field name="uid" type="string"/>

     <field name="doc_type" type="string"/>

     <field name="allText" type="text_general" indexed="true"
stored="true" multiValued="true" omitNorms="true" termVectors="true" />
</fields>
<uniqueKey>uid</uniqueKey>

And my data-config.xml:
<document name="doc">
         <entity name="atest" pk="id" transformer="TemplateTransformer"
          query="SELECT id, name, country FROM atests">
             <field column="uid" name="uid" template="A_${atest.id}"/>
             <field column="doc_type" template="ATEST"/>
             <field column="id" name="aid"/>
             <field column="name" name="aname"/>
             <field column="country" name="acountry"/>
         </entity>
         <entity name="btest" pk="id" transformer="TemplateTransformer"
                 query="SELECT id, name, country FROM btests">
             <field column="uid" template="B_${btest.id}"/>
             <field column="doc_type" template="BTEST"/>
             <field column="id" name="bid"/>
             <field column="name" name="bname"/>
             <field column="country" name="bcountry"/>
         </entity>
</document>

I have tried setting multiValued to true or false, or removing it, in the
aid and bid fields, but this didn't solve the issue.

Thanks in advance

Kevin Osborn | 24 May 2013 21:12

load balancing internal Solr on Azure

We are looking to install SolrCloud on Azure. We want it to be an internal
service. For some applications that use SolrJ, we can use ZooKeeper. But
for other applications that don't talk to Azure, we will need to go through
a load balancer to distribute traffic among the Solr instances (VMs, IaaS).

The problem is that Azure as far as I am aware does not have a load
balancer for internal services. Internal endpoints are not load balanced.

This is obviously not a problem specific to Solr, but I was hoping that
other people might have some good ideas for addressing this issue. Thanks.

-- 
*KEVIN OSBORN*
LEAD SOFTWARE ENGINEER
CNET Content Solutions
OFFICE 949.399.8714
CELL 949.310.4677      SKYPE osbornk
5 Park Plaza, Suite 600, Irvine, CA 92614
srinalluri | 24 May 2013 19:14

HTTP Status 503 - Server is shutting down

Hi,

I am unable to set up Solr 4. I am getting this error: HTTP Status 503 -
Server is shutting down. I don't see anything in the Tomcat logs.

conf/Catalina/localhost/solr4new.xml:

<?xml version="1.0" encoding="utf-8"?>
<Context docBase="/apps/solr1/solr4new/solr.war" debug="0"
crossContext="true">
  <Environment name="solr/home" type="java.lang.String"
value="/apps/solr1/solr4new" override="true"/>
</Context>

/apps/solr1/solr4new/:
|-- bin
|-- collection1
|   |-- conf
|   |   |-- lang
|   |   |-- velocity
|   |   `-- xslt
|   `-- data
`-- logs

--
View this message in context: http://lucene.472066.n3.nabble.com/HTTP-Status-503-Server-is-shutting-down-tp4065958.html
Sent from the Solr - User mailing list archive at Nabble.com.

atuldj.jadhav | 24 May 2013 18:25

Solr java.io.FileNotFoundException

Hi Team,

I need your help with a critical issue I am facing.
I end up losing my segments file.

I frequently get the File Not Found exception below:
../data/index/segments_c (No such file or directory) at
org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1103)

The segment name keeps changing. This happened twice in the last week, and
I am still debugging the cause.

For your information, here is a more detailed stack trace:

java.lang.RuntimeException: java.io.FileNotFoundException: /apps/web/jboss/DL/data/index/segments_c (No such file or directory)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1103)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:587)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:463)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:316)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:207)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:130)
    at com.legg.lmsolr.utils.MDLSolrIndex.execute(MDLSolrIndex.java:227)
    at com.legg.lmsolr.utils.MDLSolrIndex.main(MDLSolrIndex.java:156)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:58)
Caused by: java.io.FileNotFoundException: /apps/web/jboss/DL/mylm/data/index/segments_c (No such file or directory)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
    at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput$Descriptor.<init>(SimpleFSDirectory.java:70)
    at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.<init>(SimpleFSDirectory.java:97)
    at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.<init>(NIOFSDirectory.java:92)
    at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:79)
    at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:345)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:265)
    at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:79)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:754)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:75)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:462)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:405)
    at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:38)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1092)
    ... 12 more

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-java-io-FileNotFoundException-tp4065949.html

blmak | 24 May 2013 17:10

multivalue location_rpt field not indexing with JSON format

Hi
I am trying to index multivalued lat/long values in the location_rpt field
from a JSON file, and I get the following error when I attempt to index
the file:

{"responseHeader":{"status":400,"QTime":5},"error":{"msg":"ERROR:
[doc=054ac6377d6ca4ad387f73b063000910] Error adding field
'location'='[33.448448009999999897, -111.988400740000003]'
msg=null","code":400}}

So, I have added the following to my schema.xml:

<field name="location"  type="location_rpt"  indexed="true" stored="true" 
multiValued="true" /> 

<fieldType name="location_rpt"
class="solr.SpatialRecursivePrefixTreeFieldType" geo="true" units="degrees"
distErrPct="0.025" maxDistErr="0.000009" />

And here is a truncated example of the JSON I am trying to index:

[{"id":"054ac6377d6ca4ad387f73b063000910","keywords":["time", "trouble",
"exactly"],"description":"a anno is an anno is an anno",
"location":[[33.448448009999999897,-111.988400740000003],[33.448448009999999897,-111.988400740000003],[33.448448009999999897,-111.988400740000003],[33.448448009999999897,-111.988400740000003],[33.448448009999999897,-111.988400740000003],[33.448448009999999897,-111.988400740000003],[33.448448009999999897,-111.988400740000003],[33.448448009999999897,-111.988400740000003],[33.448448009999999897,-111.988400740000003],
[40.732202530000002128,-73.925320569999996678],[33.448448009999999897,-111.988400740000003],[33.448448009999999897,-111.988400740000003],[40.732202530000002128,-73.925320569999996678],[33.448448009999999897,-111.988400740000003],[33.448448009999999897,-111.988400740000003],[33.448448009999999897,-111.988400740000003],[33.448448009999999897,-111.988400740000003],[33.448448009999999897,-111.988400740000003],[40.732202530000002128,-73.925320569999996678],[33.448448009999999897,-111.988400740000003]]},
...]

I am running solr-4.3.0. Any ideas or directions for getting this to
work? From the documentation, it appears that the spatial field
functionality should work with Solr 4.3.0. Is there a library/extension
that I am missing? Does multivalue just mean one lat/one lng?

Thanks for any assistance.
-barbra 
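
One thing that may be worth checking here: the error shows a nested array ('[33.44..., -111.98...]') arriving as a single field value. The sketch below assumes the field wants each point as one "lat,lon" string, which is the point syntax described in the Solr spatial documentation, so that a multivalued field becomes a flat list of such strings. The helper name to_lat_lon_strings and the sample document are made up for illustration.

```python
import json


def to_lat_lon_strings(points):
    """Turn [[lat, lon], ...] into ["lat,lon", ...] (one string per point)."""
    return [f"{lat!r},{lon!r}" for lat, lon in points]


# Hypothetical document with the same shape as the JSON in the post.
doc = {
    "id": "example-1",
    "location": [[40.5, -73.25], [33.25, -111.5]],
}

# Flatten the nested arrays before sending the document for indexing.
doc["location"] = to_lat_lon_strings(doc["location"])
payload = json.dumps([doc])
print(payload)
```

With this shape, "multivalued" means several point strings in one field, not one array holding a lat and a lng.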

--
View this message in context: http://lucene.472066.n3.nabble.com/multivalue-location-rpt-field-not-indexing-with-JSON-format-tp4065935.html

jerome.dupont | 24 May 2013 16:11

error while indexing huge filesystem with data import handler and FileListEntityProcessor


Hello,

We are trying to use the data import handler, in particular on a collection
that contains many files (one XML file per document).

Our configuration works for a small number of files, but the data import
fails with an OutOfMemoryError when we run it on 10M files (spread over
several directories).

This is the content of our config.xml:

			<entity name="noticebib"
					datasource="null"
					processor="FileListEntityProcessor"
					fileName="^.*\.xml$" recursive="true"
					baseDir="${noticesBIB.basedir}"
					rootEntity="false"
				>

				<entity  name="processorDocument"
					processor="XPathEntityProcessor"
					url="${noticebib.fileAbsolutePath}"
					xsl="xslt/mnb/IXM_MNb.xsl"
					forEach="/record"
					transformer="fr.bnf.solr.BnfDateTransformer"
				>
				<all my mapping>

When we try it on a directory that contains 10 subdirectories, each
containing 1000 subdirectories, each of which contains 1000 XML files (so
10M files in total), the indexing process no longer works.

We get a java.lang.OutOfMemoryError (even with 512 MB and 1 GB of memory):
ERROR 2013-05-24 15:26:25,733 http-9145-2 org.apache.solr.handler.dataimport.DataImporter  (96) - Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassCastException: java.lang.OutOfMemoryError cannot be cast to java.lang.Exception
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:266)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
        at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)

Monitoring the JVM with VisualVM, I saw that most of the time is spent in
the method FileListEntityProcessor.accept (called by getFolderFiles), so I
assume that the error occurs while the list of files to be indexed is
being filled.

Basically, the list of files to index is built by getFolderFiles, which is
itself called on the first call to nextRow(). Indexing itself starts only
after that.
org/apache/solr/handler/dataimport/FileListEntityProcessor.java
  private void getFolderFiles(File dir, final List<Map<String, Object>> fileDetails) {

I tracked down the variable fileDetails, which contains the list of my XML
files. It contains 611345 entries (for approximately 500 MB of memory).
And I have 10M XML files (more or less), which is why I think the listing
had not finished yet.
To get the entire list, I guess I need somewhere between 5 and 10 GB for my
process.

So I have several questions:
- Is it possible to have several FileListEntityProcessor entities attached
to a single XPathEntityProcessor in data-config.xml? That way I could run
the import in ten passes, one per first-level directory.
- Is there a roadmap to optimize this method, for example by not listing
all the files up front, but every 1000 documents, for instance?
- Or to store the file list in a temporary file in order to save some
memory?

Regards,
-----------------------------------------------
Jérôme Dupont
-----------------------------------------------

Exhibition: Jean de Gonet, bookbinder - through 21 July 2013 - BnF - François-Mitterrand / Galerie
François 1er
Jean de Gonet will sign the exhibition catalogue on Saturday 25 May from 4:30 pm to 6 pm at the entrance to
the exhibition. Please consider the environment before printing.

Saikat Kanjilal | 24 May 2013 16:25

Keeping a rolling window of indexes around solr

Hello Solr community folks,
I am doing some investigative work on how to roll and manage indexes in our Solr configuration. So far
I've come up with an architecture that separates a set of masters, which are focused on writes and get
replicated periodically, from a set of slave shards strictly focused on reads. In addition, for each
master index the design includes partial purges, which are performed on each of the slave shards as well as
on the master to keep the data current. However, the architecture seems a bit more complex than I'd like, with a
lot of moving pieces. I was wondering whether anyone has designed an architecture around a
"conveyor belt" or rolling window of indexes covering n days of data, and whether there are best practices for
this. One thing I was considering is keeping a conveyor-belt list of the slave shards, rotating
them as needed, and periodically dropping the master and temporarily promoting its backup to master.

Anyway, I would love to hear thoughts and similar use cases from the community.

Regards

ramrrajesh | 24 May 2013 06:17

Nested Facets and distributed shard system.

We are facing an issue with a nested facets use case. The data is indexed
across multiple shards and is searched by several Tomcat instances.

Use case:
- The user wants to navigate the results category by category.
- E.g.:
    Country: United States (20)
        State: NJ (20)
            City: City A (10)
            City: City B (10)
        State: New York (20)

Problem:
      We were trying to implement this nested navigation, and later found
that pivot faceting (facet.pivot) is not compatible with a distributed
shard system.

Is there an alternative way to achieve this use case without using pivot
facets?
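
One commonly described alternative is to encode the hierarchy into a single multivalued field as depth-prefixed paths and then drill down with ordinary field faceting plus a prefix filter. Plain field facets are merged across shards, so this may avoid the pivot limitation. The sketch below only illustrates the token scheme in Python; the function names, the path separator, and the in-memory counting that stands in for a facet request are all assumptions, not Solr API.

```python
from collections import Counter


def path_tokens(levels):
    """['US', 'NJ', 'City A'] -> ['0/US', '1/US/NJ', '2/US/NJ/City A']."""
    return [
        f"{depth}/{'/'.join(levels[:depth + 1])}"
        for depth in range(len(levels))
    ]


def facet_counts(docs, prefix):
    """Count tokens starting with `prefix` (stand-in for a prefix-filtered facet)."""
    counts = Counter()
    for tokens in docs:
        counts.update(t for t in tokens if t.startswith(prefix))
    return dict(counts)


# Hypothetical documents: each is the token list stored in the path field.
docs = [
    path_tokens(["United States", "NJ", "City A"]),
    path_tokens(["United States", "NJ", "City B"]),
    path_tokens(["United States", "New York"]),
]

# States under "United States": ask for depth-1 tokens below that country.
states = facet_counts(docs, "1/United States/")
print(states)
```

Each click in the navigation then becomes one more prefix-filtered facet request at the next depth.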

--
View this message in context: http://lucene.472066.n3.nabble.com/Nested-Facets-and-distributed-shard-system-tp4065847.html

Daniel Collins | 24 May 2013 10:07

zk disconnects and failure to retry?

Had a scenario on a dev system here that has me confused.

We have a simple Solr cloud (dev) system running 4.3, 4 shards, running on
2 machines (2 instances per machine), 2 ZKs (external) and no replicas (or
1 replica depending on your definition, we only have 1 instance of each
shard!)

Yes, we have no backups, and we only have 2 ZKs, which is bad, but it's a
dev system, so not mission critical.

What I saw last night was that various shards disconnected from ZK (still
trying to work out why that was in itself), and some reconnected, some
didn't.  The ones that failed eventually had this error:

2013-05-23 14:27:38,876 ERROR [main-EventThread]
o.a.s.c.c.DefaultConnectionStrategy [SolrException.java:119] Reconnect to
ZooKeeper failed:java.lang.RuntimeException:
java.util.concurrent.TimeoutException: Could not connect to ZooKeeper
xxx1:11600,xxx2:11600 within 30000 ms

2013-05-23 14:27:38,877 INFO [main-EventThread]
o.a.s.c.c.DefaultConnectionStrategy [DefaultConnectionStrategy.java:51]
Reconnect to ZooKeeper failed
2013-05-23 14:27:38,877 INFO [main-EventThread] o.a.s.c.c.ConnectionManager
[ConnectionManager.java:130] Connected:false
2013-05-23 14:27:38,877 INFO [main-EventThread] o.a.z.ClientCnxn
[ClientCnxn.java:509] EventThread shut down

So my question is: why don't they keep retrying?  Yes, I could increase the
timeout, but that feels like the wrong action.  If the core has failed to
connect to ZK, shouldn't it keep trying to re-enter the cloud? Why does it
"give up"?  From that point onwards, those cores just give errors during
updates:

2013-05-23 14:30:39,605 ERROR [qtp21465667-1439] o.a.s.c.SolrCore
[SolrException.java:108] org.apache.solr.common.SolrException: Cannot talk
to ZooKeeper - Updates are disabled.
    at
org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:999)

Now I understand the reason for the errors, but I'm surprised it didn't try
to fix itself.  I eventually bounced the core and it reconnected, but why
does it need a manual fix?
