Allison, Timothy B. | 5 Jun 03:31 2015

RE: Memory issues with PDF parser

1) Thank you for the color…I apparently needed it…and sorry for missing that earlier. I think profiling is the only option, although you might want to write a simple directory crawler/test harness and see whether you have the same issues with pure PDFBox. If you go to that extent, you might as well also try PDFBox 2.0/trunk, which is quite a bit different under the hood. As a side note, when I do a batch run against govdocs1 (250k PDFs, 10 threads, -Xmx5g), I’m OK with the exception of a handful of files. So I’d be interested in learning what’s different about your PDFs and where the memory load is. How much heap do you have available to your JVM? Is there anything else in memory in your JVM or on the same box? I’m hoping that your DB isn’t in shared memory?

2) Text. Right. If a PDF document has, say, an MS Word file attached, do you want to get the text from that document as well? If so, use the AutoDetectParser, and make sure to register it under Parser.class in the ParseContext so that embedded documents are parsed too.
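The directory crawler/test harness suggested in point 1 could look roughly like the sketch below. It is only an illustration under stated assumptions: the class name `PdfBatchHarness`, the `Result` type, and the `extract` callback are hypothetical, and the callback stands in for whichever call is being compared (pure PDFBox, Tika's PDFParser, and so on). It uses only the JDK, and relies on Java 8 APIs, which are newer than the Java 6 runtime visible in the stack traces.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical batch harness: walks a directory tree and runs one extractor per PDF. */
final class PdfBatchHarness {

    /** Outcome for one file. */
    static final class Result {
        final Path file;
        final int chars;           // extracted length, or -1 if extraction failed
        final Throwable failure;   // null if extraction succeeded
        final long usedHeapBytes;  // rough heap in use right after this file

        Result(Path file, int chars, Throwable failure, long usedHeapBytes) {
            this.file = file;
            this.chars = chars;
            this.failure = failure;
            this.usedHeapBytes = usedHeapBytes;
        }
    }

    /** Runs {@code extract} on every *.pdf under {@code root}, one file at a time. */
    static List<Result> run(Path root, Function<Path, String> extract) {
        List<Path> pdfs;
        try (Stream<Path> walk = Files.walk(root)) {
            pdfs = walk.filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        Runtime rt = Runtime.getRuntime();
        List<Result> results = new ArrayList<>();
        for (Path pdf : pdfs) {
            int chars = -1;
            Throwable failure = null;
            try {
                String text = extract.apply(pdf);
                chars = (text == null) ? 0 : text.length();
            } catch (Throwable t) {
                // Throwable, not Exception: an OutOfMemoryError on one file
                // should be recorded rather than ending the whole run.
                failure = t;
            }
            results.add(new Result(pdf, chars, failure,
                    rt.totalMemory() - rt.freeMemory()));
        }
        return results;
    }
}
```

A call such as `PdfBatchHarness.run(Paths.get("/path/to/pdfs"), p -> myExtract(p))`, where `myExtract` is a hypothetical wrapper around the parser under test, returns one record per file. Comparing the failures and the heap figures between a PDFBox-only extractor and a Tika extractor should show whether the same files misbehave in both.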

 

 

From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapathy-LCZ+dAjAFeSJBmgMBNJ/Uw@public.gmane.org]
Sent: Thursday, June 04, 2015 4:36 PM
To: Allison, Timothy B.
Cc: user-p0YNe5MqqUkPKjDvHGQMeg@public.gmane.org; Sauparna Sarkar
Subject: RE: Memory issues with PDF parser

 

Hi Timothy,

 

1.) The interesting thing is that if we parse the PDFs individually, there is no problem. Only when we process many PDFs in bulk do some of the files hit the issue. I tried the pure Tika app command line on the same file that failed with the IOException, and it was able to extract it fine.

2.) Other question: Are you sure that you want to avoid parsing attachments? -- I don’t understand your question. Do you mean attachments within the PDF, such as images, links, etc.?

Our objective is very simple. We just want to extract the text content from the PDF. We don’t want images, graphs, etc. If you know of some simpler alternative, that’s fine too.

 

Given below is the stack trace after using Tika 1.9-SNAPSHOT. It’s throwing “WrappedIOException”. I have highlighted the key exceptions. You can also see the “java.lang.OutOfMemoryError: Java heap space” below. The size of this PDF file was 4400 KB.

 

[Server:research-etl-server] 15:48:51,716 INFO  [stdout] (Thread-0 (HornetQ-client-global-threads-1087933741)) Exception in updating docbody for report ==> RPT_749557

[Server:research-etl-server] 15:48:55,408 SEVERE [class com.fitch.researchapi.dao.ResearchReportMDAO] (Thread-0 (HornetQ-client-global-threads-1087933741)) null: org.apache.pdfbox.exceptions.WrappedIOException

[Server:research-etl-server]       at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:278) [tika-server-1.9-SNAPSHOT.jar:1.9-SNAPSHOT]

[Server:research-etl-server]       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1239) [tika-server-1.9-SNAPSHOT.jar:1.9-SNAPSHOT]

[Server:research-etl-server]       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:125) [tika-parsers-1.9-SNAPSHOT.jar:1.9-SNAPSHOT]

[Server:research-etl-server]       at com.fitch.researchapi.dao.ResearchReportMDAO.updateDocBody(ResearchReportMDAO.java:890) [RESEARCH_API-1.0.27-SNAPSHOT.jar:]

[Server:research-etl-server]       at com.fitch.researchapi.dao.ResearchReportMDAO.loadFile_NEW(ResearchReportMDAO.java:986) [RESEARCH_API-1.0.27-SNAPSHOT.jar:]

[Server:research-etl-server]       at com.fitch.researchapi.dao.ResearchReportMDAO.upsert_NEW(ResearchReportMDAO.java:679) [RESEARCH_API-1.0.27-SNAPSHOT.jar:]

[Server:research-etl-server]       at com.fitch.research.ejb.ResearchReportManagerBean.processResearchReport(ResearchReportManagerBean.java:70) [research-ejb.jar:]

[Server:research-etl-server]       at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source) [:1.6.0_29]

[Server:research-etl-server]       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) [rt.jar:1.6.0_29]

[Server:research-etl-server]       at java.lang.reflect.Method.invoke(Method.java:597) [rt.jar:1.6.0_29]

[Server:research-etl-server]       at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72) [jboss-as-ee-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36) [jboss-as-ee-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.jpa.interceptor.SBInvocationInterceptor.processInvocation(SBInvocationInterceptor.java:47)

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InitialInterceptor.processInvocation(InitialInterceptor.java:21) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ee.component.interceptors.ComponentDispatcherInterceptor.processInvocation(ComponentDispatcherInterceptor.java:53) [jboss-as-ee-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ejb3.component.pool.PooledInstanceInterceptor.processInvocation(PooledInstanceInterceptor.java:51) [jboss-as-ejb3-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInCallerTx(CMTTxInterceptor.java:202) [jboss-as-ejb3-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ejb3.tx.CMTTxInterceptor.required(CMTTxInterceptor.java:306) [jboss-as-ejb3-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:190) [jboss-as-ejb3-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41) [jboss-as-ejb3-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59) [jboss-as-ejb3-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50) [jboss-as-ee-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ejb3.component.interceptors.AdditionalSetupInterceptor.processInvocation(AdditionalSetupInterceptor.java:32) [jboss-as-ejb3-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45) [jboss-as-ee-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165) [jboss-as-ee-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:173) [jboss-as-ee-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ee.component.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:72) [jboss-as-ee-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at com.fitch.research.ejb.ResearchReportManagerBeanLocal$$$view4.processResearchReport(Unknown Source) [research-ejb.jar:]

[Server:research-etl-server]       at com.fitch.research.ejb.mdb.ResearchQueueManagerMDB.onMessage(ResearchQueueManagerMDB.java:150) [research-ejb.jar:]

[Server:research-etl-server]       at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source) [:1.6.0_29]

[Server:research-etl-server]       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) [rt.jar:1.6.0_29]

[Server:research-etl-server]       at java.lang.reflect.Method.invoke(Method.java:597) [rt.jar:1.6.0_29]

[Server:research-etl-server]       at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72) [jboss-as-ee-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36) [jboss-as-ee-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InitialInterceptor.processInvocation(InitialInterceptor.java:21) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ee.component.interceptors.ComponentDispatcherInterceptor.processInvocation(ComponentDispatcherInterceptor.java:53) [jboss-as-ee-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ejb3.component.pool.PooledInstanceInterceptor.processInvocation(PooledInstanceInterceptor.java:51) [jboss-as-ejb3-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInCallerTx(CMTTxInterceptor.java:202) [jboss-as-ejb3-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ejb3.tx.CMTTxInterceptor.required(CMTTxInterceptor.java:306) [jboss-as-ejb3-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:190) [jboss-as-ejb3-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41) [jboss-as-ejb3-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59) [jboss-as-ejb3-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50) [jboss-as-ee-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ejb3.component.interceptors.AdditionalSetupInterceptor.processInvocation(AdditionalSetupInterceptor.java:43) [jboss-as-ejb3-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ejb3.component.messagedriven.MessageDrivenComponentDescription$5$1.processInvocation(MessageDrivenComponentDescription.java:184) [jboss-as-ejb3-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45) [jboss-as-ee-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165) [jboss-as-ee-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:173) [jboss-as-ee-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation-1.1.1.Final.jar:1.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ee.component.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:72) [jboss-as-ee-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at javax.jms.MessageListener$$$view6.onMessage(Unknown Source) [jboss-jms-api_1.1_spec-1.0.0.Final.jar:1.0.0.Final]

[Server:research-etl-server]       at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source) [:1.6.0_29]

[Server:research-etl-server]       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) [rt.jar:1.6.0_29]

[Server:research-etl-server]       at java.lang.reflect.Method.invoke(Method.java:597) [rt.jar:1.6.0_29]

[Server:research-etl-server]       at org.jboss.as.ejb3.inflow.MessageEndpointInvocationHandler.doInvoke(MessageEndpointInvocationHandler.java:140) [jboss-as-ejb3-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at org.jboss.as.ejb3.inflow.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:73) [jboss-as-ejb3-7.1.1.Final.jar:7.1.1.Final]

[Server:research-etl-server]       at $Proxy57.onMessage(Unknown Source)

      at org.hornetq.ra.inflow.HornetQMessageHandler.onMessage(HornetQMessageHandler.java:278)

[Server:research-etl-server]       at org.hornetq.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:983)

[Server:research-etl-server]       at org.hornetq.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:48)

[Server:research-etl-server]       at org.hornetq.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1113)

[Server:research-etl-server]       at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:100)

[Server:research-etl-server]       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [rt.jar:1.6.0_29]

[Server:research-etl-server]       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [rt.jar:1.6.0_29]

[Server:research-etl-server]       at java.lang.Thread.run(Thread.java:662) [rt.jar:1.6.0_29]

[Server:research-etl-server] Caused by: java.lang.OutOfMemoryError: Java heap space

[Server:research-etl-server]

 

 

Thanks,

Mouthgalya Ganapathy

Product Development Team

From: Allison, Timothy B. [mailto:tallison-AZamIotjMK3YtjvyW6yDsg@public.gmane.org]
Sent: Thursday, June 04, 2015 2:37 PM
To: Mouthgalya Ganapathy
Cc: user-p0YNe5MqqUkPKjDvHGQMeg@public.gmane.org; Sauparna Sarkar
Subject: RE: Memory issues with PDF parser

 

You will get the same exception. If you run the pure Tika app command line on a triggering file, does it at least show you the “caused by” clause that might give more information?

 

Other question: Are you sure that you want to avoid parsing attachments? 

 

 

From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapathy-LCZ+dAjAFeSJBmgMBNJ/Uw@public.gmane.org]
Sent: Thursday, June 04, 2015 2:55 PM
To: Allison, Timothy B.
Cc: user-p0YNe5MqqUkPKjDvHGQMeg@public.gmane.org; Sauparna Sarkar
Subject: RE: Memory issues with PDF parser

 

Thanks for the update Timothy,

I see that Tika 1.9-SNAPSHOT is available in the Maven repo. I am going to try that and will use TikaInputStream. I will report back with the results.

 

Given below is the IOException that I get when I use AutoDetectParser to extract PDF contents. I had used Tika 1.6 and PDFBox 1.8.9. I am guessing I will get the same or a similar exception when I run it with 1.9-SNAPSHOT.

 

1:27:53,921 WARN  [org.hornetq.core.client.impl.ClientSessionImpl] (Thread-4 (HornetQ-client-global-threads-248507153)) resetting session after failure

[Server:research-etl-server] 21:29:16,314 INFO  [stdout] (Thread-12 (HornetQ-client-global-threads-248507153)) Exception in updating docbody for report ==> RPT_720610

[Server:research-etl-server] 21:29:23,817 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@29fe5969

[Server:research-etl-server] 21:29:23,818 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:250)

[Server:research-etl-server] 21:29:23,818 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)

[Server:research-etl-server] 21:29:23,820 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)

[Server:research-etl-server] 21:29:23,820 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at com.fitch.researchapi.dao.ResearchReportMDAO.updateDocBody(ResearchReportMDAO.java:888)

[Server:research-etl-server] 21:29:23,820 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at com.fitch.researchapi.dao.ResearchReportMDAO.loadFile_NEW(ResearchReportMDAO.java:983)

[Server:research-etl-server] 21:29:23,821 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at com.fitch.researchapi.dao.ResearchReportMDAO.upsert_NEW(ResearchReportMDAO.java:678)

[Server:research-etl-server] 21:29:23,821 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at com.fitch.research.ejb.ResearchReportManagerBean.processResearchReport(ResearchReportManagerBean.java:70)

[Server:research-etl-server] 21:29:23,822 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)

[Server:research-etl-server] 21:29:23,822 WARN  [org.hornetq.core.server.impl.ServerSessionImpl] (hornetq-failure-check-thread) Cleared up resources for session dc692df4-0a50-11e5-8aa3-005056900299

[Server:research-etl-server] 21:29:23,822 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

[Server:research-etl-server] 21:29:23,823 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at java.lang.reflect.Method.invoke(Method.java:597)

[Server:research-etl-server] 21:29:23,823 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72)

[Server:research-etl-server] 21:29:23,823 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

[Server:research-etl-server] 21:29:23,824 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53)

[Server:research-etl-server] 21:29:23,824 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36)

[Server:research-etl-server] 21:29:23,824 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

 

 

 

Thanks,

Mouthgalya Ganapathy

Product Development Team

From: Allison, Timothy B. [mailto:tallison-AZamIotjMK3YtjvyW6yDsg@public.gmane.org]
Sent: Thursday, June 04, 2015 12:50 PM
To: Mouthgalya Ganapathy
Cc: user-p0YNe5MqqUkPKjDvHGQMeg@public.gmane.org; Sauparna Sarkar
Subject: RE: Memory issues with PDF parser

 

1) Right, the NPE is caused by the exception returning null when we call getMessage(). In TIKA-1605, we modified all code in the project to check for a null return from getMessage(). So, in the “fixed” version, you’ll still get your good old IOException. I can’t tell from your stack trace what caused the IOException.

2) Yes, regular builds of 1.9’s app (and other modules) are available via Jenkins here: https://builds.apache.org/view/Tika/job/tika-trunk-jdk1.7/org.apache.tika$tika-app/

3)      Ok, makes sense.
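To illustrate the null check described in point 1, here is a minimal sketch in plain Java. It is not the actual TIKA-1605 patch: the class and method names are hypothetical, and the only detail taken from this thread is the message substring quoted from the PDFParser catch block.

```java
import java.io.IOException;

/** Hypothetical helper illustrating a null-safe check on Throwable.getMessage(). */
final class NullSafeMessageCheck {

    // Substring quoted from the PDFParser catch block earlier in the thread.
    static final String CRYPTO_MARKER = "Error (CryptographyException)";

    /** Returns true only when a message is present and contains the marker. */
    static boolean looksLikeBadPassword(IOException e) {
        String msg = e.getMessage(); // may legitimately be null
        return msg != null && msg.contains(CRYPTO_MARKER);
    }
}
```

With a guard like this, an IOException that carries no message falls through to the normal error handling instead of triggering a NullPointerException inside the catch block, so the original exception is what the caller sees.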

 

For kicks, you may want to change the way you open the file to:

is = TikaInputStream.get(file);

or maybe:

is = TikaInputStream.get(file, metadata);

 

And you’ll want to surround the closing of the input stream with a try/catch block, or use IOUtils.closeQuietly.
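A minimal sketch of the close-inside-try/catch idea, using only the JDK. The `QuietCloser` class is a hypothetical local helper that imitates the general behaviour of a "close quietly" utility; it is not the commons-io implementation, and its boolean return value is an addition for illustration.

```java
import java.io.Closeable;
import java.io.IOException;

/** Hypothetical helper: closes a resource without letting close() throw. */
final class QuietCloser {

    /** Returns true if there was nothing to close or close() succeeded, false if it threw. */
    static boolean closeQuietly(Closeable c) {
        if (c == null) {
            return true;
        }
        try {
            c.close();
            return true;
        } catch (IOException e) {
            // Swallowed on purpose so that a failing close() in a finally
            // block cannot replace the exception thrown by the parse itself.
            return false;
        }
    }
}
```

In the posted code, the `finally` block would then call `QuietCloser.closeQuietly(is)` instead of `is.close()`, so the method no longer depends on `close()` succeeding.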

 

Finally, are you able to share the particular file that caused the IOException?

From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapathy-LCZ+dAjAFeSJBmgMBNJ/Uw@public.gmane.org]
Sent: Thursday, June 04, 2015 10:20 AM
To: Allison, Timothy B.; tallison-1oDqGaOF3Lkdnm+yROfE0A@public.gmane.org
Cc: user-p0YNe5MqqUkPKjDvHGQMeg@public.gmane.org; Sauparna Sarkar
Subject: RE: Memory issues with PDF parser

 

Hi Timothy,

Thanks for the prompt reply.

 

1.) Wouldn’t fixing the null pointer exception just cause the IOException to be thrown instead? I saw that the null pointer exception was thrown inside the catch block for the IOException. Is there a known root cause for the IOException?

Is that also fixed?

 

I am including the code that threw the null pointer exception in Tika 1.8.

 

Exception:

10:53:12,218 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) java.lang.NullPointerException

10:53:12,219 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)

 

Code in the pdf parser:

catch (IOException e) {
    //nonseq parser throws IOException for bad password
    //At the Tika level, we want the same exception to be thrown
    if (e.getMessage().contains("Error (CryptographyException)")) {
        metadata.set("pdf:encrypted", Boolean.toString(true));
        throw new EncryptedDocumentException(e);
    }

 

2.) Do you have a snapshot or beta version of Tika 1.9 that I could try with our PDF corpus? It would also help with your developer testing.

3.) For the inline images, we have just kept the defaults (which is to skip them, as you mentioned). I have not done any memory profiling so far. I will try that as well.

 

 

Thanks,

MG

 

From: Allison, Timothy B. [mailto:tallison-AZamIotjMK3YtjvyW6yDsg@public.gmane.org]
Sent: Thursday, June 04, 2015 7:19 AM
To: Mouthgalya Ganapathy; tallison-1oDqGaOF3Lkdnm+yROfE0A@public.gmane.org
Cc: user-p0YNe5MqqUkPKjDvHGQMeg@public.gmane.org
Subject: RE: Memory issues with PDF parser

 

Hi Mouthgalya,

  We fixed that NPE in https://issues.apache.org/jira/browse/TIKA-1605, and the fix will be available in Tika 1.9, which should be out within a week.

  As for memory issues, we worked around a memory leak in PDFBox with static caching of fonts for Tika 1.7 (may have been 1.8), but there may be others.  One potential memory hog is the processing of inline images within PDFs…have you configured Tika to pull those out (default is to skip them)?  Other than that, I’d recommend dropping a note to the PDFBox users list to get help in diagnosing memory consumption with PDFBox.  Have you tried any memory profiling?

 

          Best,

 

                    Tim

 

From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapathy-LCZ+dAjAFeSJBmgMBNJ/Uw@public.gmane.org]
Sent: Wednesday, June 03, 2015 3:25 PM
To: tallison-1oDqGaOF3Lkdnm+yROfE0A@public.gmane.org
Subject: Memory issues with PDF parser

 

Hi all,

I am trying to use Apache Tika 1.8 to extract content from PDFs, using the code below. It works well for a few files, but if I read many files I see an out-of-memory exception.

I also see a null pointer exception in the PDF parser. I think the null pointer exception is caused by the out-of-memory condition.

Any suggestions?

 

Tika version:

  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-server</artifactId>
    <version>1.8</version>
  </dependency>

 

I am running it as part of a J2EE app in JBoss AS 7.1.

 

Code:-

 

//Parse the pdf content using Apache Tika
InputStream is = null;
try {
    is = new BufferedInputStream(new FileInputStream(input));
    //Disable write limit.
    contenthandler = new BodyContentHandler(-1);
    metadata = new Metadata();
    pdfparser = new PDFParser();
    context = new ParseContext();
    pdfparser.parse(is, contenthandler, metadata, context);
    docBody = contenthandler.toString();
    //System.out.println(contenthandler.toString());
}
catch (Exception e) {
    System.out.println("Exception in updating docbody for report ==> " + report.getDocID());
    if (is == null)
        System.out.println("The input stream is a null object");
    e.printStackTrace();
    logger.log(Level.SEVERE, e.getMessage(), e);
}
finally {
    if (is != null) is.close();
    contenthandler = null;
    metadata = null;
    pdfparser = null;
    context = null;
}

 

 

Exception:-

I am just including the null pointer exception in the parser below.

 

10:53:11,696 INFO  [stdout] (Thread-11 (HornetQ-client-global-threads-1619682129)) Exception in updating docbody for report ==> RPT_764268

10:53:12,218 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) java.lang.NullPointerException

10:53:12,219 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)

10:53:12,219 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.researchapi.dao.ResearchReportMDAO.updateDocBody(ResearchReportMDAO.java:881)

10:53:12,219 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.researchapi.dao.ResearchReportMDAO.loadFile_NEW(ResearchReportMDAO.java:965)

10:53:12,220 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.researchapi.dao.ResearchReportMDAO.upsert_NEW(ResearchReportMDAO.java:676)

10:53:12,220 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.research.ejb.ResearchReportManagerBean.processResearchReport(ResearchReportManagerBean.java:70)

10:53:12,221 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)

10:53:12,221 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

10:53:12,222 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at java.lang.reflect.Method.invoke(Method.java:597)

10:53:12,222 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72)

10:53:12,223 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,223 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53)

10:53:12,223 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36)

10:53:12,224 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,224 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.jpa.interceptor.SBInvocationInterceptor.processInvocation(SBInvocationInterceptor.java:47)

10:53:12,225 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,225 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InitialInterceptor.processInvocation(InitialInterceptor.java:21)

10:53:12,226 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,226 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)

10:53:12,227 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.interceptors.ComponentDispatcherInterceptor.processInvocation(ComponentDispatcherInterceptor.java:53)

10:53:12,227 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,228 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.pool.PooledInstanceInterceptor.processInvocation(PooledInstanceInterceptor.java:51)

10:53:12,228 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,229 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInCallerTx(CMTTxInterceptor.java:202)

10:53:12,229 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.required(CMTTxInterceptor.java:306)

10:53:12,229 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:190)

10:53:12,230 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,230 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41)

10:53:12,231 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,231 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59)

10:53:12,231 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,232 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50)

10:53:12,232 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,233 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.AdditionalSetupInterceptor.processInvocation(AdditionalSetupInterceptor.java:32)

10:53:12,233 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,233 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45)

10:53:12,234 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,234 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)

10:53:12,235 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165)

10:53:12,235 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:173)

10:53:12,235 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,236 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)

10:53:12,236 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:72)

10:53:12,236 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.research.ejb.ResearchReportManagerBeanLocal$$$view4.processResearchReport(Unknown Source)

10:53:12,868 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.research.ejb.mdb.ResearchQueueManagerMDB.onMessage(ResearchQueueManagerMDB.java:150)

10:53:12,868 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)

10:53:12,869 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

10:53:12,869 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at java.lang.reflect.Method.invoke(Method.java:597)

10:53:12,870 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72)

10:53:12,870 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,871 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53)

10:53:12,871 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36)

10:53:12,872 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,872 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InitialInterceptor.processInvocation(InitialInterceptor.java:21)

10:53:12,872 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,873 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)

10:53:12,873 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.interceptors.ComponentDispatcherInterceptor.processInvocation(ComponentDispatcherInterceptor.java:53)

10:53:12,874 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,874 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.pool.PooledInstanceInterceptor.processInvocation(PooledInstanceInterceptor.java:51)

10:53:12,874 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,875 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInCallerTx(CMTTxInterceptor.java:202)

10:53:12,875 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.required(CMTTxInterceptor.java:306)

10:53:12,876 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:190)

10:53:12,876 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,876 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41)

10:53:12,877 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,877 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59)

10:53:12,878 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,878 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50)

10:53:12,878 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,879 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.AdditionalSetupInterceptor.processInvocation(AdditionalSetupInterceptor.java:43)

10:53:12,879 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,880 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.messagedriven.MessageDrivenComponentDescription$5$1.processInvocation(MessageDrivenComponentDescription.java:184)

10:53:12,880 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,881 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45)

10:53:12,881 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,881 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)

10:53:12,882 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165)

10:53:12,883 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:173)

10:53:12,883 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

 

Thanks,

MG

Product Development Team

 

 

 


______________________________________________________________________
Confidentiality Notice: The information contained in this e-mail and any attachment(s) is confidential and for the use of the addressee(s) only. If you are not the intended recipient of this e-mail, do not duplicate or redistribute it by any means. Please delete this e-mail and any attachment(s) and notify us immediately. Unauthorized use, reliance, disclosure or copying of the contents of this e-mail and any attachment(s), or any similar action, is strictly prohibited. Fitch Ratings reserves the right, to the extent permitted by applicable law, to retain, monitor and intercept e-mail messages both to and from its systems.

This e-mail has been scanned by the MessageLabs Email Security System. For more information, please visit
http://www.messagelabs.com/email.
______________________________________________________________________


______________________________________________________________________
This email has been scanned by the Symantec Email Security.cloud service.
For more information please visit
http://www.symanteccloud.com
______________________________________________________________________


Allison, Timothy B. | 4 Jun 21:37 2015
RE: Memory issues with PDF parser

You will get the same exception.  If you run the pure Tika app commandline on a triggering file, does it at least show you the “caused by” clause that might give more information?
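If the command line output does not show the "caused by" clause, the innermost exception can also be pulled out in code. The following is a minimal sketch using only the standard Throwable.getCause() chain; the class name RootCause and the stand-in exceptions in main are hypothetical and not part of Tika.

```java
import java.io.IOException;

// Hypothetical helper, not part of Tika: follows getCause() links to the innermost exception.
final class RootCause {

    /** Returns the deepest cause of t, or t itself if it has no cause. */
    static Throwable of(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in for a wrapped exception; a real run would catch it from parse().
        Exception wrapped = new RuntimeException(
                "TIKA-198: Illegal IOException", new IOException("underlying failure"));
        Throwable root = of(wrapped);
        System.err.println("Root cause: " + root.getClass().getName() + ": " + root.getMessage());
    }
}
```

Logging RootCause.of(e) next to the report ID in the existing catch block would show which underlying failure is hiding behind the wrapper.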

 

Other question: Are you sure that you want to avoid parsing attachments? 

 

 

From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapathy-LCZ+dAjAFeSJBmgMBNJ/Uw@public.gmane.org]
Sent: Thursday, June 04, 2015 2:55 PM
To: Allison, Timothy B.
Cc: user-p0YNe5MqqUkPKjDvHGQMeg@public.gmane.org; Sauparna Sarkar
Subject: RE: Memory issues with PDF parser

 

Thanks for the update Timothy,

I see that Tika 1.9-SNAPSHOT is available in the Maven repo. I am going to try that and will use TikaInputStream. I will update you with the results.

 

Given below is the IO exception that I get when I use AutoDetectParser to extract PDF contents. I used Tika 1.6 and PDFBox 1.8.9. I am guessing I will get the same or a similar exception when I run it with 1.9-SNAPSHOT.

 

1:27:53,921 WARN  [org.hornetq.core.client.impl.ClientSessionImpl] (Thread-4 (HornetQ-client-global-threads-248507153)) resetting session after failure

[Server:research-etl-server] 21:29:16,314 INFO  [stdout] (Thread-12 (HornetQ-client-global-threads-248507153)) Exception in updating docbody for report ==> RPT_720610

[Server:research-etl-server] 21:29:23,817 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@29fe5969

[Server:research-etl-server] 21:29:23,818 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:250)

[Server:research-etl-server] 21:29:23,818 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)

[Server:research-etl-server] 21:29:23,820 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)

[Server:research-etl-server] 21:29:23,820 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at com.fitch.researchapi.dao.ResearchReportMDAO.updateDocBody(ResearchReportMDAO.java:888)

[Server:research-etl-server] 21:29:23,820 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at com.fitch.researchapi.dao.ResearchReportMDAO.loadFile_NEW(ResearchReportMDAO.java:983)

[Server:research-etl-server] 21:29:23,821 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at com.fitch.researchapi.dao.ResearchReportMDAO.upsert_NEW(ResearchReportMDAO.java:678)

[Server:research-etl-server] 21:29:23,821 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at com.fitch.research.ejb.ResearchReportManagerBean.processResearchReport(ResearchReportManagerBean.java:70)

[Server:research-etl-server] 21:29:23,822 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)

[Server:research-etl-server] 21:29:23,822 WARN  [org.hornetq.core.server.impl.ServerSessionImpl] (hornetq-failure-check-thread) Cleared up resources for session dc692df4-0a50-11e5-8aa3-005056900299

[Server:research-etl-server] 21:29:23,822 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

[Server:research-etl-server] 21:29:23,823 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at java.lang.reflect.Method.invoke(Method.java:597)

[Server:research-etl-server] 21:29:23,823 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72)

[Server:research-etl-server] 21:29:23,823 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

[Server:research-etl-server] 21:29:23,824 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53)

[Server:research-etl-server] 21:29:23,824 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36)

[Server:research-etl-server] 21:29:23,824 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

 

 

 

Thanks,

Mouthgalya Ganapathy

Product Development Team

From: Allison, Timothy B. [mailto:tallison-AZamIotjMK3YtjvyW6yDsg@public.gmane.org]
Sent: Thursday, June 04, 2015 12:50 PM
To: Mouthgalya Ganapathy
Cc: user-p0YNe5MqqUkPKjDvHGQMeg@public.gmane.org; Sauparna Sarkar
Subject: RE: Memory issues with PDF parser

 

1)      Right, the NPE is thrown because the IOException’s getMessage() returns null.  In TIKA-1605, we modified all code in the project to check for null returned by getMessage().  So, in the “fixed” version, you’ll still get your good old IOException.  I can’t tell from your stacktrace what caused the IOException.

2)      Yes, regular builds of 1.9’s app (and other modules) are available via Jenkins here: https://builds.apache.org/view/Tika/job/tika-trunk-jdk1.7/org.apache.tika$tika-app/

3)      Ok, makes sense.
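
On point 1, the kind of null check described there can be sketched in plain Java. This is an illustration only: the class and method names are hypothetical, and it is not the actual TIKA-1605 patch.

```java
import java.io.IOException;

// Hypothetical stand-alone sketch; the names are illustrative, not Tika's actual code.
final class NullSafeMessage {

    /** True only if the exception has a message and that message contains the marker. */
    static boolean messageContains(Throwable t, String marker) {
        String message = t.getMessage(); // may legitimately be null
        return message != null && message.contains(marker);
    }

    public static void main(String[] args) {
        String marker = "Error (CryptographyException)";
        IOException noMessage = new IOException();        // getMessage() returns null
        IOException badPassword = new IOException(marker);
        System.out.println(messageContains(noMessage, marker));
        System.out.println(messageContains(badPassword, marker));
    }
}
```

With a guard like this, an IOException without a message no longer triggers an NPE inside the catch block; it simply falls through to whatever handling follows, which is why the original IOException still surfaces.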

 

For kicks, you may want to change opening the file to:

is = TikaInputStream.get(file)

or maybe:

is = TikaInputStream.get(file, metadata)

 

And you’ll want to surround the closing of the InputStream in a try/catch block, or use IOUtils.closeQuietly.
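
As a rough sketch of that closing pattern using only the JDK: the closeQuietly helper below is a hypothetical stand-in that mimics the idea behind IOUtils.closeQuietly, and the ByteArrayInputStream stands in for the real file stream.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper that mimics the idea behind IOUtils.closeQuietly.
final class QuietClose {

    /** Closes c if it is non-null; returns false if close() threw, so the caller can log it. */
    static boolean closeQuietly(Closeable c) {
        if (c == null) {
            return true;
        }
        try {
            c.close();
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        InputStream is = null;
        try {
            is = new ByteArrayInputStream(new byte[] {1, 2, 3}); // stand-in for the file stream
            is.read();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // An IOException from close() is swallowed here, so it cannot replace
            // an exception already thrown from the try block.
            closeQuietly(is);
        }
    }
}
```

The point of the pattern is that a failure while closing cannot hide the parse failure you actually care about.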

 

Finally, are you able to share the particular file that caused the IOException?

From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapathy-LCZ+dAjAFeSJBmgMBNJ/Uw@public.gmane.org]
Sent: Thursday, June 04, 2015 10:20 AM
To: Allison, Timothy B.; tallison-1oDqGaOF3Lkdnm+yROfE0A@public.gmane.org
Cc: user-p0YNe5MqqUkPKjDvHGQMeg@public.gmane.org; Sauparna Sarkar
Subject: RE: Memory issues with PDF parser

 

Hi Timothy,

Thanks for the prompt reply.

 

1.)    Wouldn’t fixing the null pointer exception just cause the IO exception to be thrown instead? I saw that the null pointer exception was thrown inside the catch block for the IO exception. Is there a known root cause for the IO exception?

Is that also fixed?

 

I am including the code that threw the null pointer exception in Tika 1.8.

 

Exception:

10:53:12,218 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) java.lang.NullPointerException

10:53:12,219 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)

 

Code in the pdf parser:

catch (IOException e) {
    //nonseq parser throws IOException for bad password
    //At the Tika level, we want the same exception to be thrown
    if (e.getMessage().contains("Error (CryptographyException)")) {
        metadata.set("pdf:encrypted", Boolean.toString(true));
        throw new EncryptedDocumentException(e);
    }

 

2.)    Do you have a snapshot or beta version of Tika 1.9 that I could try with our PDF corpus? It would also help with your developer testing.

3.)    For the inline images, we have just kept the defaults (which is to skip them, as you mentioned). I have not done any memory profiling so far. I will try that as well.

 

 

Thanks,

MG

 

From: Allison, Timothy B. [mailto:tallison-AZamIotjMK3YtjvyW6yDsg@public.gmane.org]
Sent: Thursday, June 04, 2015 7:19 AM
To: Mouthgalya Ganapathy; tallison-1oDqGaOF3Lkdnm+yROfE0A@public.gmane.org
Cc: user-p0YNe5MqqUkPKjDvHGQMeg@public.gmane.org
Subject: RE: Memory issues with PDF parser

 

Hi Mouthgalya,

  We fixed that NPE in https://issues.apache.org/jira/browse/TIKA-1605, and the fix will be available in Tika 1.9, which should be out within a week.

  As for memory issues, we worked around a memory leak in PDFBox with static caching of fonts for Tika 1.7 (may have been 1.8), but there may be others.  One potential memory hog is the processing of inline images within PDFs…have you configured Tika to pull those out (default is to skip them)?  Other than that, I’d recommend dropping a note to the PDFBox users list to get help in diagnosing memory consumption with PDFBox.  Have you tried any memory profiling?
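
Short of a full profiler, one coarse way to see where the heap grows is to log usage after each file. This is only a sketch built on the standard Runtime API; the HeapLog class and its labels are hypothetical, and the numbers are approximate because they depend on when the garbage collector last ran.

```java
// Hypothetical helper: coarse heap figures from the standard Runtime API.
final class HeapLog {

    /** Heap currently in use, in bytes (total allocated to the JVM minus free). */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** One-line summary suitable for logging after each parsed file. */
    static String summary(String label) {
        long mb = 1024L * 1024L;
        return label + ": used=" + (usedBytes() / mb) + "MB, max="
                + (Runtime.getRuntime().maxMemory() / mb) + "MB";
    }

    public static void main(String[] args) {
        System.out.println(summary("before"));
        byte[] block = new byte[8 * 1024 * 1024]; // stand-in for per-document work
        System.out.println(summary("after (" + block.length + " bytes allocated)"));
    }
}
```

If the used figure climbs steadily across documents and never drops back, that is a hint worth following up with a heap dump or a real profiler.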

 

          Best,

 

                    Tim

 

From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapathy-LCZ+dAjAFeSJBmgMBNJ/Uw@public.gmane.org]
Sent: Wednesday, June 03, 2015 3:25 PM
To: tallison-1oDqGaOF3Lkdnm+yROfE0A@public.gmane.org
Subject: Memory issues with PDF parser

 

Hi all,

I am trying to use Apache Tika 1.8 to extract content from PDFs, using the code below. It works well for a few files, but if I read many files, I see an out-of-memory exception.

I also see a null pointer exception in the PDF parser. I think the null pointer exception is caused by the memory exception.

Any suggestions?

 

Tika version:

  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-server</artifactId>
    <version>1.8</version>
  </dependency>

 

I am running it as a part of J2EE APP in JBoss 1.7

 

Code:-

 

//Parse the pdf content using Apache Tika
InputStream is = null;
try {
    is = new BufferedInputStream(new FileInputStream(input));
    //Disable write limit.
    contenthandler = new BodyContentHandler(-1);
    metadata = new Metadata();
    pdfparser = new PDFParser();
    context = new ParseContext();
    pdfparser.parse(is, contenthandler, metadata, context);
    docBody = contenthandler.toString();
    //System.out.println(contenthandler.toString());
}
catch (Exception e) {
    System.out.println("Exception in updating docbody for report ==> " + report.getDocID());
    if (is == null)
        System.out.println("The input stream is a null object");
    e.printStackTrace();
    logger.log(Level.SEVERE, e.getMessage(), e);
}
finally {
    if (is != null) is.close();
    contenthandler = null;
    metadata = null;
    pdfparser = null;
    context = null;
}

 

 

Exception:-

I am just including the null pointer exception in the parser below.

 

10:53:11,696 INFO  [stdout] (Thread-11 (HornetQ-client-global-threads-1619682129)) Exception in updating docbody for report ==> RPT_764268

10:53:12,218 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) java.lang.NullPointerException

10:53:12,219 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)

10:53:12,219 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.researchapi.dao.ResearchReportMDAO.updateDocBody(ResearchReportMDAO.java:881)

10:53:12,219 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.researchapi.dao.ResearchReportMDAO.loadFile_NEW(ResearchReportMDAO.java:965)

10:53:12,220 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.researchapi.dao.ResearchReportMDAO.upsert_NEW(ResearchReportMDAO.java:676)

10:53:12,220 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.research.ejb.ResearchReportManagerBean.processResearchReport(ResearchReportManagerBean.java:70)

10:53:12,221 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)

10:53:12,221 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

10:53:12,222 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at java.lang.reflect.Method.invoke(Method.java:597)

10:53:12,222 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72)

10:53:12,223 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,223 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53)

10:53:12,223 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36)

10:53:12,224 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,224 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.jpa.interceptor.SBInvocationInterceptor.processInvocation(SBInvocationInterceptor.java:47)

10:53:12,225 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,225 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InitialInterceptor.processInvocation(InitialInterceptor.java:21)

10:53:12,226 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,226 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)

10:53:12,227 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.interceptors.ComponentDispatcherInterceptor.processInvocation(ComponentDispatcherInterceptor.java:53)

10:53:12,227 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,228 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.pool.PooledInstanceInterceptor.processInvocation(PooledInstanceInterceptor.java:51)

10:53:12,228 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,229 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInCallerTx(CMTTxInterceptor.java:202)

10:53:12,229 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.required(CMTTxInterceptor.java:306)

10:53:12,229 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:190)

10:53:12,230 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,230 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41)

10:53:12,231 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,231 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59)

10:53:12,231 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,232 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50)

10:53:12,232 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,233 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.AdditionalSetupInterceptor.processInvocation(AdditionalSetupInterceptor.java:32)

10:53:12,233 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,233 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45)

10:53:12,234 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,234 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)

10:53:12,235 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165)

10:53:12,235 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:173)

10:53:12,235 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,236 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)

10:53:12,236 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:72)

10:53:12,236 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.research.ejb.ResearchReportManagerBeanLocal$$$view4.processResearchReport(Unknown Source)

10:53:12,868 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.research.ejb.mdb.ResearchQueueManagerMDB.onMessage(ResearchQueueManagerMDB.java:150)

10:53:12,868 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)

10:53:12,869 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

10:53:12,869 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at java.lang.reflect.Method.invoke(Method.java:597)

10:53:12,870 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72)

10:53:12,870 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,871 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53)

10:53:12,871 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36)

10:53:12,872 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,872 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InitialInterceptor.processInvocation(InitialInterceptor.java:21)

10:53:12,872 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,873 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)

10:53:12,873 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.interceptors.ComponentDispatcherInterceptor.processInvocation(ComponentDispatcherInterceptor.java:53)

10:53:12,874 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,874 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.pool.PooledInstanceInterceptor.processInvocation(PooledInstanceInterceptor.java:51)

10:53:12,874 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,875 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInCallerTx(CMTTxInterceptor.java:202)

10:53:12,875 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.required(CMTTxInterceptor.java:306)

10:53:12,876 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:190)

10:53:12,876 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,876 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41)

10:53:12,877 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,877 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59)

10:53:12,878 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,878 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50)

10:53:12,878 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,879 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.AdditionalSetupInterceptor.processInvocation(AdditionalSetupInterceptor.java:43)

10:53:12,879 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,880 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.messagedriven.MessageDrivenComponentDescription$5$1.processInvocation(MessageDrivenComponentDescription.java:184)

10:53:12,880 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,881 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45)

10:53:12,881 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,881 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)

10:53:12,882 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165)

10:53:12,883 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:173)

10:53:12,883 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

 

Thanks,

MG

Product Development Team

 

 

 


______________________________________________________________________
Confidentiality Notice: The information contained in this e-mail and any attachment(s) is confidential and for the use of the addressee(s) only. If you are not the intended recipient of this e-mail, do not duplicate or redistribute it by any means. Please delete this e-mail and any attachment(s) and notify us immediately. Unauthorized use, reliance, disclosure or copying of the contents of this e-mail and any attachment(s), or any similar action, is strictly prohibited. Fitch Ratings reserves the right, to the extent permitted by applicable law, to retain, monitor and intercept e-mail messages both to and from its systems.

This e-mail has been scanned by the MessageLabs Email Security System. For more information, please visit
http://www.messagelabs.com/email.
______________________________________________________________________


______________________________________________________________________
This email has been scanned by the Symantec Email Security.cloud service.
For more information please visit
http://www.symanteccloud.com
______________________________________________________________________



Allison, Timothy B. | 4 Jun 19:49 2015

RE: Memory issues with PDF parser

1)      Right, the NPE is caused by the IOException's getMessage() returning null.  In TIKA-1605, we modified all code in the project to check for null returned by getMessage().  So, in the “fixed” version, you’ll still get your good old IOException.  I can’t tell from your stacktrace what caused the IOException.

2)      Yes, regular builds of the 1.9 app (and other modules) are available via Jenkins here: https://builds.apache.org/view/Tika/job/tika-trunk-jdk1.7/org.apache.tika$tika-app/

3)      Ok, makes sense.
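
To illustrate point 1: an exception constructed without a message returns null from getMessage(), so calling contains() on the result throws a NullPointerException. The sketch below shows a null-safe version of the check. It is not the actual TIKA-1605 patch; the class and method names are hypothetical, and only the marker string is taken from the PDFParser snippet quoted in this thread.

```java
import java.io.IOException;

class NullSafeMessageCheck {

    // Marker text copied from the PDFParser catch block quoted in this thread.
    static final String BAD_PASSWORD_MARKER = "Error (CryptographyException)";

    // Hypothetical helper: test the message for null before inspecting it,
    // because getMessage() returns null when no message was supplied.
    static boolean isBadPassword(IOException e) {
        String message = e.getMessage();
        return message != null && message.contains(BAD_PASSWORD_MARKER);
    }

    public static void main(String[] args) {
        IOException noMessage = new IOException();
        // The unguarded form, noMessage.getMessage().contains(...), would throw an NPE here.
        System.out.println("message is null: " + (noMessage.getMessage() == null));
        System.out.println("bad password:    " + isBadPassword(noMessage));
    }
}
```

With a guard like this the original IOException still has to be handled or rethrown by the caller; the check only stops the NullPointerException from hiding it.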

 

For kicks, you may want to change opening the file to:

is = TikaInputStream.get(file)

or maybe:

is = TikaInputStream.get(file, metadata)

 

And you’ll want to surround your closing of the IS in a try/catch block.  Or use IOUtils.closeQuietly.
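
A minimal sketch of the close handling, using only the JDK. The closeQuietly method here is a hypothetical stand-in written for illustration, not the library implementation, and the ByteArrayInputStream stands in for the real file stream. The point is that an IOException thrown by close() inside a finally block would otherwise propagate out of the method and can mask whatever went wrong during parsing.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;

class QuietClose {

    // Hypothetical helper: close the resource if it was opened and
    // swallow any IOException so it cannot escape a finally block.
    static void closeQuietly(Closeable c) {
        if (c == null) {
            return;
        }
        try {
            c.close();
        } catch (IOException e) {
            // Deliberately ignored; log here if the failure matters to you.
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream is = null;
        try {
            // Stand-in for opening the PDF file.
            is = new ByteArrayInputStream(new byte[] {1, 2, 3});
            while (is.read() != -1) {
                // Stand-in for the parse call.
            }
        } finally {
            closeQuietly(is);
        }
        System.out.println("stream closed without propagating close() errors");
    }
}
```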

 

Finally, are you able to share the particular file that caused the IOException?

From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapathy-LCZ+dAjAFeSJBmgMBNJ/Uw@public.gmane.org]
Sent: Thursday, June 04, 2015 10:20 AM
To: Allison, Timothy B.; tallison-1oDqGaOF3Lkdnm+yROfE0A@public.gmane.org
Cc: user-p0YNe5MqqUkPKjDvHGQMeg@public.gmane.org; Sauparna Sarkar
Subject: RE: Memory issues with PDF parser

 

Hi Timothy,

Thanks for the prompt reply.

 

1.)    Wouldn’t fixing the NullPointerException just cause the IOException to be thrown instead? I saw that the NullPointerException was thrown inside the catch block for the IOException. Is there a known root cause for the IOException?

Is that also fixed?

 

I am including the code that threw the NullPointerException in Tika 1.8.

 

Exception:

10:53:12,218 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) java.lang.NullPointerException

10:53:12,219 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)

 

Code in the pdf parser:

catch (IOException e) {
            //nonseq parser throws IOException for bad password
            //At the Tika level, we want the same exception to be thrown
            if (e.getMessage().contains("Error (CryptographyException)")) {
                metadata.set("pdf:encrypted", Boolean.toString(true));
                throw new EncryptedDocumentException(e);
            }

 

2.)    Do you have a snapshot or beta version of Tika 1.9 that I could try with our PDF corpus? It would also help with your developer testing.

3.)    For the inline images, we have left the defaults in place (which skip them, as you mentioned). I have not done any memory profiling yet; I will try that as well.

 

 

Thanks,

MG

 

From: Allison, Timothy B. [mailto:tallison-AZamIotjMK3YtjvyW6yDsg@public.gmane.org]
Sent: Thursday, June 04, 2015 7:19 AM
To: Mouthgalya Ganapathy; tallison-1oDqGaOF3Lkdnm+yROfE0A@public.gmane.org
Cc: user-p0YNe5MqqUkPKjDvHGQMeg@public.gmane.org
Subject: RE: Memory issues with PDF parser

 

Hi Mouthgalya,

  We fixed that NPE in https://issues.apache.org/jira/browse/TIKA-1605, and the fix will be available in Tika 1.9, which should be out within a week.

  As for memory issues, we worked around a memory leak in PDFBox with static caching of fonts for Tika 1.7 (may have been 1.8), but there may be others.  One potential memory hog is the processing of inline images within PDFs…have you configured Tika to pull those out (default is to skip them)?  Other than that, I’d recommend dropping a note to the PDFBox users list to get help in diagnosing memory consumption with PDFBox.  Have you tried any memory profiling?
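
As a crude first step before attaching a real profiler, heap usage can be logged between documents in a batch to see whether it climbs steadily or spikes on particular files. The sketch below uses only the JDK's java.lang.management API; the class name, method names, and labels are hypothetical, and it is no substitute for a heap dump or a profiler.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.Locale;

class HeapLogger {

    private static final long MB = 1024L * 1024L;

    // Bytes of heap currently in use, as reported by the JVM.
    static long usedHeapBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    // One log line per call; getMax() is -1 when the maximum is undefined.
    static String describeHeap(String label) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return String.format(Locale.ROOT, "%s: used=%d MB, committed=%d MB, max=%d MB",
                label, heap.getUsed() / MB, heap.getCommitted() / MB,
                heap.getMax() < 0 ? -1 : heap.getMax() / MB);
    }

    public static void main(String[] args) {
        // Hypothetical document IDs; in a real run, log after each parse call.
        String[] docs = {"doc-1", "doc-2", "doc-3"};
        for (String doc : docs) {
            System.out.println(describeHeap("after " + doc));
        }
    }
}
```

If the "used" figure keeps growing across documents even after garbage collection, that points to something being retained between parses rather than to one unusually large file.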

 

          Best,

 

                    Tim

 

From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapathy-LCZ+dAjAFeSJBmgMBNJ/Uw@public.gmane.org]
Sent: Wednesday, June 03, 2015 3:25 PM
To: tallison-1oDqGaOF3Lkdnm+yROfE0A@public.gmane.org
Subject: Memory issues with PDF parser

 

Hi all,

I am trying to use Apache Tika 1.8 to extract content from PDFs. The code below does the extraction. It works well for a few files, but when I read many files I get an out-of-memory exception.

I also see a NullPointerException in the PDF parser. I think the NullPointerException is caused by the memory problem.

Any suggestions?

 

Tika version:

        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-server</artifactId>
            <version>1.8</version>
        </dependency>

 

I am running it as part of a J2EE app in JBoss 1.7.

 

Code:-

 

            //Parse the pdf content using Apache Tika
            InputStream is = null;
            try {
                is = new BufferedInputStream(new FileInputStream(input));
                //Disable write limit.
                contenthandler = new BodyContentHandler(-1);
                metadata = new Metadata();
                pdfparser = new PDFParser();
                context = new ParseContext();
                pdfparser.parse(is, contenthandler, metadata, context);
                docBody = contenthandler.toString();
                //System.out.println(contenthandler.toString());
            }
            catch (Exception e) {
                System.out.println("Exception in updating docbody for report ==> " + report.getDocID());
                if (is == null)
                    System.out.println("The input stream is a null object");
                e.printStackTrace();
                logger.log(Level.SEVERE, e.getMessage(), e);
            }
            finally {
                if (is != null) is.close();
                contenthandler = null;
                metadata = null;
                pdfparser = null;
                context = null;
            }

 

 

Exception:-

I am just including the null pointer exception in the parser below.

 

10:53:11,696 INFO  [stdout] (Thread-11 (HornetQ-client-global-threads-1619682129)) Exception in updating docbody for report ==> RPT_764268

10:53:12,218 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) java.lang.NullPointerException

10:53:12,219 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)

10:53:12,219 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.researchapi.dao.ResearchReportMDAO.updateDocBody(ResearchReportMDAO.java:881)

10:53:12,219 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.researchapi.dao.ResearchReportMDAO.loadFile_NEW(ResearchReportMDAO.java:965)

10:53:12,220 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.researchapi.dao.ResearchReportMDAO.upsert_NEW(ResearchReportMDAO.java:676)

10:53:12,220 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.research.ejb.ResearchReportManagerBean.processResearchReport(ResearchReportManagerBean.java:70)

10:53:12,221 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)

10:53:12,221 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

10:53:12,222 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at java.lang.reflect.Method.invoke(Method.java:597)

10:53:12,222 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72)

10:53:12,223 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,223 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53)

10:53:12,223 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36)

10:53:12,224 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,224 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.jpa.interceptor.SBInvocationInterceptor.processInvocation(SBInvocationInterceptor.java:47)

10:53:12,225 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,225 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InitialInterceptor.processInvocation(InitialInterceptor.java:21)

10:53:12,226 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,226 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)

10:53:12,227 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.interceptors.ComponentDispatcherInterceptor.processInvocation(ComponentDispatcherInterceptor.java:53)

10:53:12,227 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,228 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.pool.PooledInstanceInterceptor.processInvocation(PooledInstanceInterceptor.java:51)

10:53:12,228 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,229 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInCallerTx(CMTTxInterceptor.java:202)

10:53:12,229 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.required(CMTTxInterceptor.java:306)

10:53:12,229 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:190)

10:53:12,230 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,230 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41)

10:53:12,231 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,231 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59)

10:53:12,231 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,232 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50)

10:53:12,232 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,233 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.AdditionalSetupInterceptor.processInvocation(AdditionalSetupInterceptor.java:32)

10:53:12,233 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,233 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45)

10:53:12,234 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,234 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)

10:53:12,235 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165)

10:53:12,235 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:173)

10:53:12,235 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,236 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)

10:53:12,236 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:72)

10:53:12,236 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.research.ejb.ResearchReportManagerBeanLocal$$$view4.processResearchReport(Unknown Source)

10:53:12,868 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.research.ejb.mdb.ResearchQueueManagerMDB.onMessage(ResearchQueueManagerMDB.java:150)

10:53:12,868 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)

10:53:12,869 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

10:53:12,869 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at java.lang.reflect.Method.invoke(Method.java:597)

10:53:12,870 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72)

10:53:12,870 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,871 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53)

10:53:12,871 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36)

10:53:12,872 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,872 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InitialInterceptor.processInvocation(InitialInterceptor.java:21)

10:53:12,872 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,873 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)

10:53:12,873 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.interceptors.ComponentDispatcherInterceptor.processInvocation(ComponentDispatcherInterceptor.java:53)

10:53:12,874 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,874 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.pool.PooledInstanceInterceptor.processInvocation(PooledInstanceInterceptor.java:51)

10:53:12,874 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,875 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInCallerTx(CMTTxInterceptor.java:202)

10:53:12,875 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.required(CMTTxInterceptor.java:306)

10:53:12,876 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:190)

10:53:12,876 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,876 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41)

10:53:12,877 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,877 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59)

10:53:12,878 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,878 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50)

10:53:12,878 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,879 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.AdditionalSetupInterceptor.processInvocation(AdditionalSetupInterceptor.java:43)

10:53:12,879 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,880 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.messagedriven.MessageDrivenComponentDescription$5$1.processInvocation(MessageDrivenComponentDescription.java:184)

10:53:12,880 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,881 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45)

10:53:12,881 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,881 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)

10:53:12,882 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165)

10:53:12,883 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:173)

10:53:12,883 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

 

Thanks,

MG

Product Development Team

 

 

 


______________________________________________________________________
Confidentiality Notice: The information contained in this e-mail and any attachment(s) is confidential and for the use of the addressee(s) only. If you are not the intended recipient of this e-mail, do not duplicate or redistribute it by any means. Please delete this e-mail and any attachment(s) and notify us immediately. Unauthorized use, reliance, disclosure or copying of the contents of this e-mail and any attachment(s), or any similar action, is strictly prohibited. Fitch Ratings reserves the right, to the extent permitted by applicable law, to retain, monitor and intercept e-mail messages both to and from its systems.

This e-mail has been scanned by the MessageLabs Email Security System. For more information, please visit
http://www.messagelabs.com/email.
______________________________________________________________________


______________________________________________________________________
This email has been scanned by the Symantec Email Security.cloud service.
For more information please visit
http://www.symanteccloud.com
______________________________________________________________________


Allison, Timothy B. | 4 Jun 14:18 2015

RE: Memory issues with PDF parser

Hi Mouthgalya,

  We fixed that NPE in https://issues.apache.org/jira/browse/TIKA-1605, and the fix will be available in Tika 1.9, which should be out within a week.

  As for memory issues, we worked around a memory leak in PDFBox with static caching of fonts in Tika 1.7 (it may have been 1.8), but there may be others. One potential memory hog is the processing of inline images within PDFs. Have you configured Tika to pull those out (the default is to skip them)? Other than that, I’d recommend dropping a note to the PDFBox users list to get help in diagnosing memory consumption with PDFBox. Have you tried any memory profiling?
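One low-tech way to start on the profiling question, before reaching for a full profiler, is to log heap use after every file in a batch so that the documents that drive memory up stand out. The sketch below uses only the JDK and assumes nothing about Tika: `HeapLoggingCrawler`, `crawl`, and the no-op per-file action are hypothetical names, and the real parse call would replace the placeholder. The heap figures are approximate, since they depend on when the garbage collector last ran.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical harness: runs one action per file and records heap use after each. */
public class HeapLoggingCrawler {

    /** Used heap in bytes, as reported by the running JVM (approximate). */
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /**
     * Walks dir, applies perFile to every regular file whose name ends in
     * suffix (case-insensitive), and returns one tab-separated log line per file.
     */
    public static List<String> crawl(Path dir, String suffix, Consumer<Path> perFile)
            throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(dir)) {
            files = walk.filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase().endsWith(suffix))
                    .sorted()
                    .collect(Collectors.toList());
        }
        List<String> log = new ArrayList<>();
        for (Path file : files) {
            String status = "ok";
            try {
                perFile.accept(file);
            } catch (RuntimeException e) {
                // Record the failure and keep going so one bad file does not end the run.
                status = e.getClass().getSimpleName();
            }
            long usedMb = usedHeapBytes() / (1024 * 1024);
            log.add(file + "\t" + status + "\t" + usedMb + " MB used");
        }
        return log;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        // Placeholder action: replace the empty lambda with the parse call under test.
        crawl(dir, ".pdf", p -> { }).forEach(System.out::println);
    }
}
```

Running the same directory once through this harness with the application's parse call and once with a plain PDFBox call would show whether the growth follows particular files or accumulates steadily across the batch.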

 

          Best,

 

                    Tim

 

From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapathy-LCZ+dAjAFeSJBmgMBNJ/Uw@public.gmane.org]
Sent: Wednesday, June 03, 2015 3:25 PM
To: tallison-1oDqGaOF3Lkdnm+yROfE0A@public.gmane.org
Subject: Memory issues with PDF parser

 

Hi all,

I am trying to use Apache Tika 1.8 to extract content from PDFs, using the code below. It works well for a few files, but when I read many files I get an out-of-memory exception.

I also see a NullPointerException in the PDF parser. I think the NullPointerException is a consequence of the out-of-memory condition.

Any suggestions?

 

Tika version:

  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-server</artifactId>
    <version>1.8</version>
  </dependency>

 

I am running it as part of a J2EE app in JBoss 1.7.

 

Code:-

 

// Parse the PDF content using Apache Tika.
// Needs: java.io.BufferedInputStream, java.io.FileInputStream, java.io.InputStream,
//        org.apache.tika.metadata.Metadata, org.apache.tika.parser.ParseContext,
//        org.apache.tika.parser.pdf.PDFParser, org.apache.tika.sax.BodyContentHandler
            InputStream is = null;
            try {
              is = new BufferedInputStream(new FileInputStream(input));
              // Disable the write limit (-1 means no cap on the extracted text).
              contenthandler = new BodyContentHandler(-1);
              metadata = new Metadata();
              pdfparser = new PDFParser();
              context = new ParseContext();
              pdfparser.parse(is, contenthandler, metadata, context);
              docBody = contenthandler.toString();
              //System.out.println(contenthandler.toString());
            }
            catch (Exception e) {
               System.out.println("Exception in updating docbody for report ==> " + report.getDocID());
               if (is == null)
                 System.out.println("The input stream is a null object");
               e.printStackTrace();
               logger.log(Level.SEVERE, e.getMessage(), e);
            }
            finally {
                // Always release the file handle, then drop references to the per-file objects.
                if (is != null) is.close();
                contenthandler = null;
                metadata = null;
                pdfparser = null;
                context = null;
            }

 

 

Exception:-

I am just including the null pointer exception in the parser below.

 

10:53:11,696 INFO  [stdout] (Thread-11 (HornetQ-client-global-threads-1619682129)) Exception in updating docbody for report ==> RPT_764268

10:53:12,218 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) java.lang.NullPointerException

10:53:12,219 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)

10:53:12,219 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.researchapi.dao.ResearchReportMDAO.updateDocBody(ResearchReportMDAO.java:881)

10:53:12,219 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.researchapi.dao.ResearchReportMDAO.loadFile_NEW(ResearchReportMDAO.java:965)

10:53:12,220 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.researchapi.dao.ResearchReportMDAO.upsert_NEW(ResearchReportMDAO.java:676)

10:53:12,220 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.research.ejb.ResearchReportManagerBean.processResearchReport(ResearchReportManagerBean.java:70)

10:53:12,221 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)

10:53:12,221 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

10:53:12,222 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at java.lang.reflect.Method.invoke(Method.java:597)

10:53:12,222 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72)

10:53:12,223 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,223 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53)

10:53:12,223 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36)

10:53:12,224 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,224 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.jpa.interceptor.SBInvocationInterceptor.processInvocation(SBInvocationInterceptor.java:47)

10:53:12,225 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,225 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InitialInterceptor.processInvocation(InitialInterceptor.java:21)

10:53:12,226 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,226 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)

10:53:12,227 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.interceptors.ComponentDispatcherInterceptor.processInvocation(ComponentDispatcherInterceptor.java:53)

10:53:12,227 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,228 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.pool.PooledInstanceInterceptor.processInvocation(PooledInstanceInterceptor.java:51)

10:53:12,228 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,229 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInCallerTx(CMTTxInterceptor.java:202)

10:53:12,229 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.required(CMTTxInterceptor.java:306)

10:53:12,229 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:190)

10:53:12,230 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,230 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41)

10:53:12,231 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,231 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59)

10:53:12,231 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,232 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50)

10:53:12,232 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,233 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.AdditionalSetupInterceptor.processInvocation(AdditionalSetupInterceptor.java:32)

10:53:12,233 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,233 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45)

10:53:12,234 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,234 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)

10:53:12,235 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165)

10:53:12,235 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:173)

10:53:12,235 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,236 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)

10:53:12,236 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:72)

10:53:12,236 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.research.ejb.ResearchReportManagerBeanLocal$$$view4.processResearchReport(Unknown Source)

10:53:12,868 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at com.fitch.research.ejb.mdb.ResearchQueueManagerMDB.onMessage(ResearchQueueManagerMDB.java:150)

10:53:12,868 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)

10:53:12,869 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

10:53:12,869 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at java.lang.reflect.Method.invoke(Method.java:597)

10:53:12,870 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72)

10:53:12,870 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,871 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53)

10:53:12,871 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36)

10:53:12,872 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,872 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InitialInterceptor.processInvocation(InitialInterceptor.java:21)

10:53:12,872 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,873 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)

10:53:12,873 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.interceptors.ComponentDispatcherInterceptor.processInvocation(ComponentDispatcherInterceptor.java:53)

10:53:12,874 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,874 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.pool.PooledInstanceInterceptor.processInvocation(PooledInstanceInterceptor.java:51)

10:53:12,874 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,875 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInCallerTx(CMTTxInterceptor.java:202)

10:53:12,875 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.required(CMTTxInterceptor.java:306)

10:53:12,876 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:190)

10:53:12,876 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,876 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41)

10:53:12,877 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,877 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59)

10:53:12,878 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,878 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50)

10:53:12,878 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,879 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.interceptors.AdditionalSetupInterceptor.processInvocation(AdditionalSetupInterceptor.java:43)

10:53:12,879 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,880 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ejb3.component.messagedriven.MessageDrivenComponentDescription$5$1.processInvocation(MessageDrivenComponentDescription.java:184)

10:53:12,880 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,881 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45)

10:53:12,881 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

10:53:12,881 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)

10:53:12,882 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165)

10:53:12,883 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:173)

10:53:12,883 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129))    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

 

Thanks,

MG

Product Development Team

 

 

 



Mark Kerzner | 4 Jun 07:06 2015

Tika parsing of emails

Hi,

Usually I just call new Tika().parse(myfile...), and Tika does all the work.

Is there anything special about *.eml files? How does Tika treat attachments? What would be a reference for me to read?

Thank you

--
Mark Kerzner, Managing Partner, Elephant Scale
Mobile: 713-724-2534, Skype: mark.kerzner1
To schedule a meeting with me: http://www.meetme.so/markkerzner

Mattmann, Chris A (3980 | 4 Jun 04:57 2015

dev@...

You can use dev@tika.apache.org and/or user@tika.apache.org.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann <at> nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++




-----Original Message-----
From: Mouthgalya Ganapathy <mouthgalya.ganapathy <at> fitchratings.com>
Date: Wednesday, June 3, 2015 at 7:53 PM
To: "dev-owner <at> tika.apache.org" <dev-owner <at> tika.apache.org>
Subject: RE: confirm subscribe to dev <at> tika.apache.org

>Is the dev <at> tika.apache.org the mailing list where users can send
>questions/concerns??
>
>Thanks,
>MG
>-----Original Message-----
>From: dev-help <at> tika.apache.org [mailto:dev-help <at> tika.apache.org]
>Sent: Wednesday, June 03, 2015 9:39 PM
>To: Mouthgalya Ganapathy
>Subject: confirm subscribe to dev <at> tika.apache.org
>
>Hi! This is the ezmlm program. I'm managing the dev <at> tika.apache.org
>mailing list.
>
>I'm working for my owner, who can be reached at dev-owner <at> tika.apache.org.
>
>To confirm that you would like
>
>   mouthgalya.ganapathy <at> fitchratings.com
>
>added to the dev mailing list, please send a short reply to this address:
>
>   
>dev-sc.1433385528.nflciooladhfpkphahjh-mouthgalya.ganapathy=fitchratings.c
>om <at> tika.apache.org
>
>Usually, this happens when you just hit the "reply" button.
>If this does not work, simply copy the address and paste it into the
>"To:" field of a new message.
>
>or click here:
>	mailto:dev-sc.1433385528.nflciooladhfpkphahjh-mouthgalya.ganapathy=fitchr
>atings.com <at> tika.apache.org
>
>This confirmation serves two purposes. First, it verifies that I am able
>to get mail through to you. Second, it protects you in case someone
>forges a subscription request in your name.
>
>Please note that ALL Apache dev- and user- mailing lists are publicly
>archived.  Do familiarize yourself with Apache's public archive policy at
>
>    http://www.apache.org/foundation/public-archives.html

>
>prior to subscribing and posting messages to dev <at> tika.apache.org.
>If you're not sure whether or not the policy applies to this mailing
>list, assume it does unless the list name contains the word "private" in
>it.
>
>Some mail programs are broken and cannot handle long addresses. If you
>cannot reply to this request, instead send a message to
><dev-request <at> tika.apache.org> and put the entire address listed above
>into the "Subject:" line.
>
>
>--- Administrative commands for the dev list ---
>
>I can handle administrative requests automatically. Please do not send
>them to the list address! Instead, send your message to the correct
>command address:
>
>To subscribe to the list, send a message to:
>   <dev-subscribe <at> tika.apache.org>
>
>To remove your address from the list, send a message to:
>   <dev-unsubscribe <at> tika.apache.org>
>
>Send mail to the following for info and FAQ for this list:
>   <dev-info <at> tika.apache.org>
>   <dev-faq <at> tika.apache.org>
>
>Similar addresses exist for the digest list:
>   <dev-digest-subscribe <at> tika.apache.org>
>   <dev-digest-unsubscribe <at> tika.apache.org>
>
>To get messages 123 through 145 (a maximum of 100 per request), mail:
>   <dev-get.123_145 <at> tika.apache.org>
>
>To get an index with subject and author for messages 123-456 , mail:
>   <dev-index.123_456 <at> tika.apache.org>
>
>They are always returned as sets of 100, max 2000 per request, so you'll
>actually get 100-499.
>
>To receive all messages with the same subject as message 12345, send a
>short message to:
>   <dev-thread.12345 <at> tika.apache.org>
>
>The messages should contain one line or word of text to avoid being
>treated as sp <at> m, but I will ignore their content.
>Only the ADDRESS you send to is important.
>
>You can start a subscription for an alternate address, for example
>"john <at> host.domain", just add a hyphen and your address (with '=' instead
>of ' <at> ') after the command word:
><dev-subscribe-john=host.domain <at> tika.apache.org>
>
>To stop subscription for this address, mail:
><dev-unsubscribe-john=host.domain <at> tika.apache.org>
>
>In both cases, I'll send a confirmation message to that address. When you
>receive it, simply reply to it to complete your subscription.
>
>If despite following these instructions, you do not get the desired
>results, please contact my owner at dev-owner <at> tika.apache.org. Please be
>patient, my owner is a lot slower than I am ;-)
>
>--- Enclosed is a copy of the request I received.
>
>Return-Path: <mouthgalya.ganapathy <at> fitchratings.com>
>Received: (qmail 80863 invoked by uid 99); 4 Jun 2015 02:38:48 -0000
>Received: from Unknown (HELO spamd4-us-west.apache.org) (209.188.14.142)
>    by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 04 Jun 2015 02:38:48
>+0000
>Received: from localhost (localhost [127.0.0.1])
>	by spamd4-us-west.apache.org (ASF Mail Server at
>spamd4-us-west.apache.org) with ESMTP id BC47DC0729
>	for <dev-subscribe <at> tika.apache.org>; Thu,  4 Jun 2015 02:38:47 +0000
>(UTC)
>X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org
>X-Spam-Flag: NO
>X-Spam-Score: 3.979
>X-Spam-Level: ***
>X-Spam-Status: No, score=3.979 tagged_above=-999 required=6.31
>	tests=[HTML_MESSAGE=3, KAM_LAZY_DOMAIN_SECURITY=1,
>	RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01,
>	SPF_HELO_PASS=-0.001] autolearn=disabled
>Received: from mx1-us-west.apache.org ([10.40.0.8])
>	by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port
>10024)
>	with ESMTP id fwmDAPQUpDaQ for <dev-subscribe <at> tika.apache.org>;
>	Thu,  4 Jun 2015 02:38:41 +0000 (UTC)
>Received: from mail1.bemta12.messagelabs.com
>(mail1.bemta12.messagelabs.com [216.82.251.16])
>	by mx1-us-west.apache.org (ASF Mail Server at mx1-us-west.apache.org)
>with ESMTPS id C023627607
>	for <dev-subscribe <at> tika.apache.org>; Thu,  4 Jun 2015 02:38:40 +0000
>(UTC)
>Received: from [216.82.249.211] by server-16.bemta-12.messagelabs.com id
>E8/4D-32232-A2ABF655; Thu, 04 Jun 2015 02:38:34 +0000
>X-Env-Sender: mouthgalya.ganapathy <at> fitchratings.com
>X-Msg-Ref: server-4.tower-53.messagelabs.com!1433385510!8617637!1
>X-Originating-IP: [208.78.2.70]
>X-StarScan-Received:
>X-StarScan-Version: 6.13.16; banners=fitchratings.com,-,-
>X-VirusChecked: Checked
>Received: (qmail 10598 invoked from network); 4 Jun 2015 02:38:34 -0000
>Received: from mx10.fitchratings.com (HELO mx10.fitchratings.com)
>(208.78.2.70)
>  by server-4.tower-53.messagelabs.com with RC4-SHA encrypted SMTP; 4 Jun
>2015 02:38:34 -0000
>Received: from exchange02.fitchratings.com (HELO
>exchangeNJ.fitchratings.com) ([172.20.52.152])
>  by mx10.fitchratings.com with ESMTP/TLS/AES256-SHA; 03 Jun 2015
>22:38:29 -0400
>Received: from Exchange01.fitch.corp (172.20.52.151) by
>Exchange02.fitch.corp
> (172.20.52.152) with Microsoft SMTP Server (TLS) id 15.0.913.22; Wed, 3
>Jun
> 2015 22:38:28 -0400
>Received: from Exchange01.fitch.corp ([::1]) by Exchange01.fitch.corp
> ([fe80::fc2c:4d09:de36:623a%18]) with mapi id 15.00.0913.011; Wed, 3 Jun
>2015
> 22:38:28 -0400
>From: Mouthgalya Ganapathy <mouthgalya.ganapathy <at> fitchratings.com>
>To: "dev-subscribe <at> tika.apache.org" <dev-subscribe <at> tika.apache.org>
>Subject: Subscribe
>Thread-Topic: Subscribe
>Thread-Index: AdCeb2TTfMLXCDj1REKiAx0QcvRPSw==
>Date: Thu, 4 Jun 2015 02:38:28 +0000
>Message-ID: <9743aab16295432ca83d7705146af18f <at> Exchange01.fitch.corp>
>Accept-Language: en-US
>Content-Language: en-US
>X-MS-Has-Attach:
>X-MS-TNEF-Correlator:
>x-originating-ip: [172.20.4.1]
>x-kse-antispam-interceptor-info: protection disabled
>x-kse-antivirus-interceptor-info: license violation
>Content-Type: multipart/alternative;
>	boundary="_000_9743aab16295432ca83d7705146af18fExchange01fitchcorp_"
>MIME-Version: 1.0
>
>

Andrea Asta | 3 Jun 10:08 2015

XPath support for attributes

Hi,
I've created a parser which extracts some sections from HTML documents.
I'm producing a new form of XHTML for the content handler, for example <section id="myID">SECTION CONTENT</section>. Which content handler is the right one for collecting all the sections? Will Tika's XPath implementation support expressions with @ATTRIBUTE and [@ATTRIBUTE='value']?

Thank you
Andrea
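
For reference, the two expression forms asked about look like this when evaluated with the JDK's own javax.xml.xpath, not with Tika's XPath classes. This is only a sketch: the class name SectionXPath and the sample markup are made up for illustration, and the sample is namespace-free, whereas XHTML emitted by a real parser normally carries the XHTML namespace. Whether Tika's own matcher accepts the predicate form is exactly the open question, so it is worth checking against the Tika Javadoc.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SectionXPath {

    /** Parses the given XML text into a DOM and returns the nodes matched by expr. */
    static NodeList select(String xml, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document, shaped like the <section> output described above.
        String xml = "<html><body>"
                + "<section id=\"myID\">SECTION CONTENT</section>"
                + "<section id=\"other\">OTHER CONTENT</section>"
                + "</body></html>";

        // [@ATTRIBUTE='value']: select one section by its id attribute.
        NodeList byId = select(xml, "//section[@id='myID']");
        System.out.println(byId.item(0).getTextContent());

        // @ATTRIBUTE: select the id attribute nodes of all sections.
        NodeList ids = select(xml, "//section/@id");
        for (int i = 0; i < ids.getLength(); i++) {
            System.out.println(ids.item(i).getNodeValue());
        }
    }
}
```

Buffering the handler output into a DOM like this trades memory for full XPath 1.0 support; for large documents a streaming handler would be the better fit.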

Mattmann, Chris A (3980 | 1 Jun 07:38 2015

[VOTE] Release Apache Tika 1.9 Candidate #1

Hi Folks,

A candidate for the Tika 1.9 release is available at:

  https://dist.apache.org/repos/dist/dev/tika/


The release candidate is a zip archive of the sources in:
  http://svn.apache.org/repos/asf/tika/tags/1.9-rc1/


The SHA1 checksum of the archive is
  2f0b0ca0981dc507b347047a56e6ade859a7ec05.
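
One way to check a downloaded archive against the checksum above is a small JDK-only program such as the following. It is a sketch, not part of the release process: the class name Sha1Check is made up, and the archive path and expected checksum are passed on the command line because the exact file name is not given here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1Check {

    /** Streams the input through SHA-1 and returns the digest as lowercase hex. */
    static String sha1Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: java Sha1Check <archive> <expected-sha1>");
            return;
        }
        // args[0]: local path of the downloaded archive; args[1]: checksum from the vote mail.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String actual = sha1Hex(in);
            System.out.println(actual.equalsIgnoreCase(args[1]) ? "OK" : "MISMATCH: " + actual);
        }
    }
}
```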

In addition, a staged maven repository is available here:
https://repository.apache.org/content/repositories/orgapachetika-1010/



Please vote on releasing this package as Apache Tika 1.9.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.9
[ ] -1 Do not release this package because…

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann <at> nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Satya Deep Maheshwari | 29 May 06:14 2015

Fwd: Does TikaInputStream take care of mark/reset?

Hi

I am wrapping an InputStream in a TikaInputStream using TikaInputStream.get(my_input_stream). I then pass this stream to a detector to detect the MIME type. Do I need to take care of mark and reset myself on this TikaInputStream instance?

Regards
Satya Deep
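
For context, the mark/reset pattern the question refers to can be shown with plain java.io classes. This is a sketch of the general technique only; MarkResetDemo and peek are made-up names, and nothing here calls Tika. My understanding is that Tika detectors are expected to mark and reset the stream themselves, so the caller should not need to, but that is worth confirming against the Detector Javadoc for the version in use.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MarkResetDemo {

    /** Reads up to limit leading bytes, then rewinds so the caller sees the stream unchanged. */
    static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);                       // remember the current position
        try {
            byte[] buf = new byte[limit];
            int total = 0;
            int n;
            while (total < limit && (n = in.read(buf, total, limit - total)) != -1) {
                total += n;
            }
            return Arrays.copyOf(buf, total);
        } finally {
            in.reset();                       // rewind to the marked position
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.4 rest of file".getBytes(StandardCharsets.US_ASCII);
        // BufferedInputStream adds mark/reset support to streams that lack it.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));

        byte[] header = peek(in, 5);
        System.out.println(new String(header, StandardCharsets.US_ASCII)); // prints "%PDF-"

        // The stream is back at its start, so a later parser reads from byte 0.
        System.out.println((char) in.read()); // prints "%"
    }
}
```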

Carter, Phillip Michael | 28 May 01:30 2015

Tika-Translate, Clojure, and a new Constructor

Tika-translate doesn't work in the iPReS Project when attempting to utilize the MicrosoftTranslator.  I have configured the project to use the same resources file structure as the Tika project, and I have verified that the client-id and client-secret match the account I have set up with Microsoft.

I have verified that this line of code is always called when making a new request, and that Tika itself is indeed being called into - but nothing is being translated.  My hypothesis is that Tika cannot recognize the config file I have, even though it matches the path in the project itself.  This can be verified by running the project.

I propose expanding the API for the Microsoft and Google Translators to accept configuration details as parameters.  This eases development for the iPReS project by allowing us to be responsible for our own configuration details, and allows the use of these translators in the REPL, allowing for more interactive development.

- Phillip

Halil Ibrahim Simsek | 24 May 11:26 2015

HTML5 Support

Hi,

I have applied to Google Summer of Code this year with the Apache Nutch project, to add support for the
HTML5 specification. As you know, Nutch uses NekoHTML (by default), TagSoup, and Tika for parsing HTML pages.
What I wonder is to what extent Tika supports the HTML5 specification, and which of its parsers
apply to it. Are there any relevant issues in the JIRA tracker? Kindly advise me. Any help will be
appreciated.

Thanks in advance

