Marshall Schor | 4 Jun 03:10

Some offset issues with Open Calais

I fiddled around a bit more with this, trying various things that 
actually connected to the service.

I finally figured out that if you send the string "xxx & yyy" to the 
service, it actually processes the string

<Document><Title>1212537108630-85FDAB4B-292518</Title><Date>2008-06-03</Date><Body>xxx &amp; yyy</Body></Document>

or something like that, and that the returned offsets are relative to this 
string.

Correcting the returned offsets so that they correspond to what you sent 
looks like it has 2 parts.  The first part, the prefix "<Document ...  
<Body>", is pretty easily accounted for.  The second part, expanding & to 
&amp;, requires more work.  Other characters are also converted, some 
strangely.  I've seen the usual:

<  converted to &lt;,   > converted to &gt;

The character " seemed to be converted to &amp;quot;
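
The two-part correction described above could be sketched roughly as follows. This is only an illustration, not anything provided by Open Calais: the class name CalaisOffsetMapper is made up, the escape table contains just the four conversions observed above, and the prefix length is passed in by the caller because the real prefix contains a generated title and date.

```java
import java.util.Arrays;

// Hypothetical helper (not part of any Open Calais client library).
// Assumes the service escapes only the four characters observed in this thread.
final class CalaisOffsetMapper {
    private final int prefixLength;   // length of "<Document>...<Body>", supplied by the caller
    private final int[] escapedStart; // escapedStart[i] = where original char i starts in the escaped body

    CalaisOffsetMapper(String original, int prefixLength) {
        this.prefixLength = prefixLength;
        this.escapedStart = new int[original.length() + 1];
        int pos = 0;
        for (int i = 0; i < original.length(); i++) {
            escapedStart[i] = pos;
            pos += escapedLength(original.charAt(i));
        }
        escapedStart[original.length()] = pos; // end-of-text position
    }

    // Length of the replacement the service appears to use for each character.
    static int escapedLength(char c) {
        switch (c) {
            case '&': return "&amp;".length();
            case '<': return "&lt;".length();
            case '>': return "&gt;".length();
            case '"': return "&amp;quot;".length(); // the odd double escaping seen above
            default:  return 1;
        }
    }

    // Maps an offset reported by the service to an offset in the text that was sent.
    // Offsets that fall inside an escape sequence map to the escaped character itself;
    // offsets past the end of the body map to the end of the original text.
    int toOriginal(int serviceOffset) {
        int bodyOffset = serviceOffset - prefixLength;
        if (bodyOffset < 0) {
            throw new IllegalArgumentException("offset lies inside the prefix");
        }
        int idx = Arrays.binarySearch(escapedStart, bodyOffset);
        return idx >= 0 ? idx : -idx - 2; // largest i with escapedStart[i] <= bodyOffset
    }
}
```

For "xxx & yyy" the first "y" sits at position 10 of the escaped body, and the mapper returns 6, its position in the string that was sent. If the service changes its escaping in the next release, only escapedLength would need to change.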

All this is apparently a "bug" - their forum includes a post saying the 
problem with the "&" will be fixed in the next release.

I've posted a reply to their forum asking about other characters besides 
the "&".

One final note: their API says that for the POST method, content sent 
using that method needs to be escaped.  I think that means the kind of 

Christoph Büscher | 4 Jun 13:54

Problem using Capabilities - OutputSofa

Hi,

I have ignored the analysis engine's "capabilities" section so far, but after I tried 
declaring an "outputSofa" for the first time, I ran into trouble using the 
analysis engine in a CPE.

I have an AE that takes webpages in HTML format as input and removes the 
HTML tags etc. The result is stored in a new CAS view named "plainTextView".
So far I hadn't declared any capabilities in the AE's descriptor, but now I tried 
this:

<capabilities>
       <capability>
         <inputs/>
         <outputs/>
         <outputSofas>
           <sofaName>plainTextView</sofaName>
         </outputSofas>
         <languagesSupported/>
       </capability>
</capabilities>

The AE's process() method usually accesses the default view of the JCas, does some 
processing and stores the result in the new view. The code goes something like this:

  // get the text from the default CAS view
  String originalText = jcas.getDocumentText();
  JCas plainTextView = null;

  // Extract plain text from original document

Eddie Epstein | 4 Jun 15:41

Re: Problem using Capabilities - OutputSofa

The CAS reference passed to the annotator process method changes when
Sofa capabilities are declared. See
http://incubator.apache.org/uima/downloads/releaseDocs/2.2.2-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html#ugr.tug.mvs.deciding_multi_view

After declaring an output Sofa, process gets the "base CAS". To get
the text from the "default" view, try

String originalText = jcas.getCas().getCurrentView().getDocumentText();

Eddie

PS looks like the JCas interface is missing the getCurrentView() method.

On 6/4/08, Christoph Büscher <christoph.buescher@...> wrote:
> Hi,
>
> I ignored the analysis engines "capabilities" section so far, but after I
> tried
> declaring an "outputSofa" for the first time, I ran into trouble using the
> analysis engine in a CPE.
>
> I have an AE that takes webpages in HTML format as input and removes the
> HTML-Tags etc... The result is stored in a new CAS view named
> "plainTextView".
> So far I didn't declare any capabilities in the AEs descriptor, but now I
> tried
> this:
>
> <capabilities>
>        <capability>

Richard Eckart | 4 Jun 16:35

Session variable potpourri in CPE

Hello there,

I have recently switched from my own home-cooked version of session 
variables to using the UIMAContext session. Actually, I am using the 
UIMAContextAdmin-rootContext for storing my session variables, as I 
need to set them in the CollectionReader and read them in the 
CASConsumer.

However, unless I set the casPoolSize to 1 I am having the problem  
that the CollectionReader already overwrites the session variable  
which the CASConsumer has not yet read.

Before, I had encoded my variables as an annotation within the CAS, 
which worked fine.

Is there any way to make use of the CAS pool AND of the UIMA session  
variables at the same time?

Richard Eckart

Technische Universität Darmstadt
Institute of Linguistics and Literary Studies
Department of English Linguistics

Hochschulstrasse 1
64289 Darmstadt
Germany

Thilo Goetz | 4 Jun 17:20

Re: Problem using Capabilities - OutputSofa

And here's some background if you're interested:
http://www.mail-archive.com/uima-dev@incubator.apache.org/msg00945.html

There's a lot of discussion before that message,
and a lot afterwards.

So we were mostly agreed that this was broken, but
couldn't agree on the proper fix and finally gave
up.  If we ever do a UIMA 3, we'll have the same
discussion all over :-)

--Thilo

Eddie Epstein wrote:
> The CAS reference passed to the annotator process method changes when
> Sofa capabilities are declared. See
> http://incubator.apache.org/uima/downloads/releaseDocs/2.2.2-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html#ugr.tug.mvs.deciding_multi_view
> 
> After declaring an output Sofa, process gets the "base CAS". To get
> the text from the "default" view, try
> 
> String originalText = jcas.getCas().getCurrentView().getDocumentText();
> 
> Eddie
> 
> PS looks like the JCas interface is missing the getCurrentView() method.
> 
> On 6/4/08, Christoph Büscher <christoph.buescher@...> wrote:
>> Hi,
>>

Christoph Büscher | 4 Jun 18:50

Re: Problem using Capabilities - OutputSofa

Hi,

Thanks for the information. Accessing the default view via the CAS interface 
seems to work. However, it seems a bit confusing that declaring an output view 
should affect the input view of an analysis engine. I will have a look at the 
Multi-View components section in the documentation.

Eddie Epstein wrote:
> The CAS reference passed to the annotator process method changes when
> Sofa capabilities are declared. See
> http://incubator.apache.org/uima/downloads/releaseDocs/2.2.2-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html#ugr.tug.mvs.deciding_multi_view
> 
> After declaring an output Sofa, process gets the "base CAS". To get
> the text from the "default" view, try
> 
> String originalText = jcas.getCas().getCurrentView().getDocumentText();
> 
> Eddie
> 
> PS looks like the JCas interface is missing the getCurrentView() method.
> 

-- 
--------------------------------
Christoph Büscher

Adam Lally | 4 Jun 22:30

Re: Session variable potpourri in CPE

Hi Richard,

On Wed, Jun 4, 2008 at 10:35 AM, Richard Eckart
<eckart@...> wrote:
> Hello there,
>
> I have recently switched from my own home-cooked version of session
> variables to using the UIMAContext
> session. Actually I am using the UIMAContextAdmin-rootContext for storing my
> session variables as I need them to set them in the CollectionReader and
> read them in the CASConsumer.
>

The Session support is actually not intended for sharing information
between components, and it's not really fully implemented anyway.  See
this email:
http://www.mail-archive.com/uima-dev@incubator.apache.org/msg04364.html.

> However, unless I set the casPoolSize to 1 I am having the problem that the
> CollectionReader already overwrites the session variable which the
> CASConsumer has not yet read.
>
> Before I had encoded my variables as an annotation within the CAS which
> worked fine.
>

It sounds like the values of these variables pertain to a particular
CAS (since you don't want the values to change until that CAS has been
fully processed).  If so, then storing them in the CAS was a fine
solution.  The CollectionReader and CAS Consumers run in separate

Richard Eckart | 4 Jun 23:39

OSGi-fied UIMA Maven artifact repository?

Hi folks,

is there a Maven repository containing OSGi-fied UIMA artifacts?

Richard Eckart

Technische Universität Darmstadt
Institute of Linguistics and Literary Studies
Department of English Linguistics

Hochschulstrasse 1
64289 Darmstadt
Germany

Richard Eckart | 5 Jun 02:42

Re: Session variable potpourri in CPE

Hi again,

> It sounds like the values of these variables pertain to a particular
> CAS (since you don't want the values to change until that CAS has been
> fully processed).  If so, then storing them in the CAS was a fine
> solution.  The CollectionReader and CAS Consumers run in separate
> threads, so the CollectionReader definitely may move on to a new CAS
> before the CAS Consumer processes the first one.

The problem is that I need to share data that is not a primitive type
which I could represent in the CAS (it's a rather complex Java object).

> If you wanted your Collection Reader and CAS Consumer to share data
> that was *not* related to a particular CAS, you could use UIMA's
> external resource mechanism to accomplish that.

I was originally searching for that (because it was mentioned at the
UIMA workshop at LREC), but I only found out how to use the session
variables. Do you have a pointer to documentation on how to use that
mechanism?

What I want is that wherever I put the Java object, it should be
automatically removed when a CAS is fully processed, no matter whether
processing was successful or not. The object is bound to a document but
not representable in the CAS. Before, I had an ID encoded in the CAS and
passed the object out-of-band via a static hashmap (CollectionReader
puts object there under the ID and CASConsumer
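
For illustration, the out-of-band approach mentioned above (an ID stored in the CAS, the object itself kept in a static map) might look roughly like the sketch below. It uses only the Java standard library and no UIMA API. The class name PerDocumentRegistry and its methods are hypothetical, and the sketch assumes the document ID travels in the CAS as a plain string. Because every entry is keyed by its own document ID, a CAS pool larger than 1 does not cause the reader to overwrite a value the consumer has not yet read.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical registry for objects that belong to one document but cannot
// be stored in the CAS. Not part of UIMA; names are made up for this sketch.
final class PerDocumentRegistry<T> {
    // Thread-safe because the reader and the consumer run in separate threads.
    private final ConcurrentMap<String, T> objects = new ConcurrentHashMap<>();

    // Called by the reader side before the CAS is handed to the pipeline.
    void put(String documentId, T value) {
        if (objects.putIfAbsent(documentId, value) != null) {
            throw new IllegalStateException("duplicate document id: " + documentId);
        }
    }

    // Called by the consumer side; removes the entry so it cannot leak.
    // Returns null if nothing is registered under this ID.
    T take(String documentId) {
        return objects.remove(documentId);
    }

    // Number of objects still waiting to be consumed.
    int size() {
        return objects.size();
    }
}
```

This does not by itself give the automatic cleanup asked for above: if a CAS fails before it reaches the consumer, something still has to call take() for that ID, so the failure path would need its own hook.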

Thilo Goetz | 5 Jun 07:59

Re: OSGi-fied UIMA Maven artifact repository?

Richard Eckart wrote:
> Hi folks,
> 
> is there a Maven repository containing OSGi-fied UIMA artifacts?
> 
> Richard Eckart
> 
> Technische Universität Darmstadt
> Institute of Linguistics and Literary Studies
> Department of English Linguistics
> 
> Hochschulstrasse 1
> 64289 Darmstadt
> Germany
> 
> 

Hi Richard,

the short answer is no.

The somewhat longer answer is that there were some efforts in
IBM Research to use OSGi as a PEAR replacement, i.e. as a packaging
format for UIMA components.  I haven't heard of this project in
a while, so I'm not sure if it's still active.

--Thilo

