Kirk True | 17 May 19:52

UIMA internals memory footprint

Hi all,

I have begun seeing heavy memory use when processing largish
documents through a UIMA pipeline. I wanted to make sure what I'm
seeing with regard to UIMA's internal memory use is on par with
expectations.

It looks like for either a 1,500,000-byte or a 15,000,000-byte document
with the same annotations (100,000 10-character annotations), we incur
a ~13 MB "overhead" for internal UIMA data structures. Is this in line
with expectations?

Details:

In the interest of narrowing down the issue, I made a very simple test
annotator which mimics what my annotators do. The annotator creates a
document of N bytes which is set in a view in the CAS, then it
transforms the bytes to an HTML string that is then set in a view in
the CAS. Next, for each view, the annotator creates 50,000 annotations.
Each annotation has two 5-character attributes. I profiled my
application using two profilers (JProbe and YourKit) and took heap
snapshots before and after processing was performed and saw similar
results.

I know there's a lot going on under the hood, so I'm trying to get an
idea of what kind of size factor I can expect for a given document
size. Right now, according to my calculations and verified by the
profiler, the expected memory usage for just my data (i.e. the two
views of the document and the strings making up the annotations) is:

[...]

Kirk True | 17 May 19:56

Removal of data from view?

Hi all,

Is there a way to remove data that was inserted into a view? 

In our application, one annotator loads the initial raw document bytes
and stores them in a CAS view. A later, downstream annotator transforms
those raw bytes into plain text and stores the result in the "main" CAS
view as the document text. At that point the initial raw document bytes
are of no further interest.

I've looked but don't see a clear way to remove that data. I tried the
reset/release methods, but they also deleted other data from the CAS
that is still needed.

Thoughts?

Thanks,
Kirk

Adam Lally | 17 May 20:58

Re: UIMA internals memory footprint

Kirk,

In this test are you running a CPE or just an AnalysisEngine?  If it
is a CPE do you know what your CAS Pool size is?

When a CAS is created it does allocate a large heap which is then
filled as you create annotations.  By default I believe this is
500,000 cells (2MB) per CAS, but this can be overridden (see
UIMAFramework.getDefaultPerformanceTuningProperties()).  So this can
definitely be one source of memory overhead.  As you saw it does not
grow with larger documents, it will only grow if you create enough
annotations to fill up the allocated space.
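
To put rough numbers on that, here is a back-of-the-envelope sketch. The
4-bytes-per-cell figure is only inferred from "500,000 cells (2MB)", and the
class and method names are made up for illustration; none of this is UIMA
code.

```java
// Rough estimate of the memory reserved up front for CAS heaps.
// Assumption: one heap cell is a 4-byte int, inferred from
// "500,000 cells (2MB)"; this is not taken from the UIMA source.
final class CasHeapEstimate {
    static final int BYTES_PER_CELL = 4;

    // Bytes reserved by a single CAS with the given initial heap size in cells.
    static long heapBytes(long cells) {
        return cells * BYTES_PER_CELL;
    }

    // Bytes reserved by a pool of CASes, e.g. the CAS pool of a CPE.
    static long pooledHeapBytes(long cells, int casPoolSize) {
        return heapBytes(cells) * casPoolSize;
    }

    public static void main(String[] args) {
        System.out.println("one CAS:   " + heapBytes(500_000) + " bytes");
        System.out.println("pool of 3: " + pooledHeapBytes(500_000, 3) + " bytes");
    }
}
```

With a CAS pool, the reserved space scales with the pool size, which is why
the pool size matters for a CPE.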

-Adam

On 5/17/07, Kirk True <kirk@...> wrote:
> Hi all,
>
> I have begun seeing heavy memory use when processing largish
> documents through a UIMA pipeline. I wanted to make sure what I'm
> seeing with regard to UIMA's internal memory use is on par with
> expectations.
>
> It looks like for either a 1,500,000-byte or a 15,000,000-byte document
> with the same annotations (100,000 10-character annotations), we incur
> a ~13 MB "overhead" for internal UIMA data structures. Is this in line
> with expectations?
>
> Details:
>
[...]

Kirk True | 18 May 01:28

Re: UIMA internals memory footprint

Hi Adam,

> Kirk,
> 
> In this test are you running a CPE or just an AnalysisEngine?  If it
> is a CPE do you know what your CAS Pool size is?

It's an AnalysisEngine.

> When a CAS is created it does allocate a large heap which is then
> filled as you create annotations.  By default I believe this is
> 500,000 cells (2MB) per CAS, but this can be overridden (see
> UIMAFramework.getDefaultPerformanceTuningProperties()).  So this can
> definitely be one source of memory overhead.  As you saw it does not
> grow with larger documents, it will only grow if you create enough
> annotations to fill up the allocated space.

I noticed that this is tweak-able and set it to something insanely
small (like 100). But, as you said, it grows as the number of
annotations grows. Since the parameter is under the umbrella of
performance, I'd assume that it would actually be better to
pre-allocate close to what we're going to use.

Thanks!
Kirk

> On 5/17/07, Kirk True <kirk@...> wrote:
> > Hi all,
> >
> > I have begun seeing heavy memory use when processing
[...]

Thilo Goetz | 18 May 09:55

Re: UIMA internals memory footprint

Kirk True wrote:
> Hi Adam,
> 
>> Kirk,
>>
>> In this test are you running a CPE or just an AnalysisEngine?  If it
>> is a CPE do you know what your CAS Pool size is?
> 
> It's an AnalysisEngine.
> 
>> When a CAS is created it does allocate a large heap which is then
>> filled as you create annotations.  By default I believe this is
>> 500,000 cells (2MB) per CAS, but this can be overridden (see
>> UIMAFramework.getDefaultPerformanceTuningProperties()).  So this can
>> definitely be one source of memory overhead.  As you saw it does not
>> grow with larger documents, it will only grow if you create enough
>> annotations to fill up the allocated space.
> 
> I noticed that this is tweak-able and set it to something insanely
> small (like 100). But, as you said, it grows as the number of
> annotations grows. Since the parameter is under the umbrella of
> performance, I'd assume that it would actually be better to
> pre-allocate close to what we're going to use.
[...]

Yes.

You can estimate data use on the heap as follows.  Each FS uses at least one
int for the type information, plus whatever features it has.  So a vanilla
annotation is 3 ints: one for the type, and one each for the start and end features,
[...]
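
The part of the estimate quoted so far (one int for the type plus one per
feature) can be turned into arithmetic. This is only a sketch: the class and
method names are hypothetical, and it assumes that every feature, including a
string-valued one, occupies exactly one 4-byte cell on the heap, with the
string data itself stored elsewhere and not counted here.

```java
// Sketch of the per-FS heap estimate: one cell for the type, one per feature.
// Assumptions (not confirmed by the thread): each cell is a 4-byte int, and a
// string feature costs one cell on this heap; string contents are not counted.
final class FsCellEstimate {
    static final int BYTES_PER_CELL = 4;

    // Cells used by one feature structure with the given number of features.
    static long cellsPerFs(int featureCount) {
        return 1L + featureCount;
    }

    // Total cells used by fsCount feature structures of the same shape.
    static long totalCells(long fsCount, int featureCount) {
        return fsCount * cellsPerFs(featureCount);
    }

    // Total bytes for those cells.
    static long totalBytes(long fsCount, int featureCount) {
        return totalCells(fsCount, featureCount) * BYTES_PER_CELL;
    }

    public static void main(String[] args) {
        // Vanilla annotation: start and end features only.
        System.out.println("cells per vanilla annotation: " + cellsPerFs(2));
        // 100,000 annotations with start, end and two extra attributes.
        System.out.println("cells: " + totalCells(100_000, 4));
        System.out.println("bytes: " + totalBytes(100_000, 4));
    }
}
```

Under those assumptions, 100,000 annotations with two extra attributes each
come to 500,000 cells, so the annotations alone would not explain a 13 MB
overhead.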

Ngan Nguyen | 18 May 17:58

Problem with subIterator of AnnotationIndex

The AnnotationIndex class documentation says that "Annotations are sorted in
increasing order of their start offset". However, when I use the subIterator
method of AnnotationIndex to get an iterator over the annotations (tokens)
inside an annotation (sentence), the annotations come back in a strange order.
See below, formatted as token_string(begin,end,part_of_speech):

He(0/2/PRP)
. (13,14,.)
running(6,13,VBG)
is (3,5,VBZ)

Can such a strange order occur in the AnnotationIndex, or is it a bug in my
program? I tried to find the cause but couldn't.

I find AnnotationIndex a bit inconvenient for programmers. Do you have any
effective strategy for dealing with annotations in the annotation pool?

Kirk True | 18 May 20:22

InternationalizedException doesn't look for resources in extension ClassLoader

Hi all,

Summary:

org.apache.uima.InternationalizedException's getLocalizedMessage
doesn't account for the use of an "extension" class path when loading
message resources. For applications that use extension class paths, the
resource is therefore not found.

Details:

I'm getting an exception when trying to create an
AnalysisEngineProcessException:

org.apache.uima.analysis_engine.AnalysisEngineProcessException:
    EXCEPTION MESSAGE LOCALIZATION FAILED: 
    java.util.MissingResourceException: Can't find bundle for base
    name TestAnnotatorRB, locale en_US

The code that generates this message is in InternationalizedException's
getLocalizedMessage method:

    public String getLocalizedMessage(Locale aLocale) {
        // check for null message
        if (getMessageKey() == null)
          return null;

        try {
          // locate the resource bundle for this exception's messages
          ResourceBundle bundle = 
[...]

Thilo Goetz | 18 May 20:31

Re: Problem with subIterator of AnnotationIndex

Ngan Nguyen wrote:
> The AnnotationIndex class documentation says that "Annotations are sorted in
> increasing order of their start offset". However, when I use the subIterator
> method of AnnotationIndex to get an iterator over the annotations (tokens)
> inside an annotation (sentence), the annotations come back in a strange
> order. See below, formatted as token_string(begin,end,part_of_speech):
> 
> He(0/2/PRP)
> . (13,14,.)
> running(6,13,VBG)
> is (3,5,VBZ)
> 
> Can such a strange order occur in the AnnotationIndex, or is it a bug in my
> program? I tried to find the cause but couldn't.
> 
> I find AnnotationIndex a bit inconvenient for programmers. Do you have any
> effective strategy for dealing with annotations in the annotation pool?
> 

That sounds like a bug.  Could you provide a test case (code)?  Or maybe an
XCAS of a document plus instructions on how to reproduce this?  Thanks.
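
Until the cause is found, one possible stopgap is to copy the tokens out of
the iterator into a list and sort them explicitly. The sketch below is plain
Java, not UIMA code: the Token class is a hypothetical stand-in for the real
annotation objects, and the ordering (begin ascending, longer span first on a
tie) mirrors what I understand the built-in annotation ordering to be.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Explicitly re-sorts tokens by offset after collecting them from an iterator.
final class TokenOrder {
    // Hypothetical stand-in for an annotation with begin/end offsets.
    static final class Token {
        final String text;
        final int begin;
        final int end;

        Token(String text, int begin, int end) {
            this.text = text;
            this.begin = begin;
            this.end = end;
        }
    }

    // Begin offset ascending; on equal begin, the larger end offset first.
    static final Comparator<Token> BY_OFFSET =
            Comparator.comparingInt((Token t) -> t.begin)
                    .thenComparing(Comparator.comparingInt((Token t) -> t.end).reversed());

    // Returns a sorted copy and leaves the input list untouched.
    static List<Token> sorted(List<Token> tokens) {
        List<Token> copy = new ArrayList<>(tokens);
        copy.sort(BY_OFFSET);
        return copy;
    }

    public static void main(String[] args) {
        // Same tokens and order as in the report above.
        List<Token> reported = Arrays.asList(
                new Token("He", 0, 2), new Token(".", 13, 14),
                new Token("running", 6, 13), new Token("is", 3, 5));
        for (Token t : sorted(reported)) {
            System.out.println(t.text + "(" + t.begin + "," + t.end + ")");
        }
    }
}
```

This costs an extra copy and sort per sentence, so it is a workaround rather
than a fix.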

--Thilo

Thilo Goetz | 18 May 20:34

Re: InternationalizedException doesn't look for resources in extension ClassLoader

Kirk True wrote:
> Hi all,
> 
> Summary:
> 
> org.apache.uima.InternationalizedException's getLocalizedMessage
> doesn't account for the use of an "extension" class path when loading
> message resources. For applications that use extension class paths, the
> resource is therefore not found.
[...]

Yes, that's a known issue.  There is one trick you can use: don't use
InternationalizedException directly, but inherit from it.  Bundle this
exception class with your pear.  Since message localization is done in
the exception class, the correct class loader will then be used and
your message bundle will be found.  It's a bit of a hack, but it works ;-)
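
The underlying mechanism can be shown with plain java.util.ResourceBundle.
This is only an illustration and not UIMA's code: BundleLookup, its methods
and the DemoAnnotatorRB bundle are made up for the example. The bundle file
is written to a temp directory that only an extra URLClassLoader can see,
which plays the role of the extension class path.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.MissingResourceException;
import java.util.ResourceBundle;

// Looks a message up through several class loaders in turn.
final class BundleLookup {
    // Returns the message, or null if no loader yields the bundle and key.
    static String lookup(String baseName, String key, Locale locale,
                         ClassLoader... loaders) {
        for (ClassLoader loader : loaders) {
            if (loader == null) {
                continue;
            }
            try {
                return ResourceBundle.getBundle(baseName, locale, loader).getString(key);
            } catch (MissingResourceException e) {
                // Bundle or key not visible to this loader; try the next one.
            }
        }
        return null;
    }

    // Demo: the bundle is visible to the extra loader, not to the default one.
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("ext-classpath");
            Files.write(dir.resolve("DemoAnnotatorRB.properties"),
                    "greeting=hello from the extension class path\n"
                            .getBytes(StandardCharsets.ISO_8859_1));
            try (URLClassLoader ext =
                         new URLClassLoader(new URL[] {dir.toUri().toURL()})) {
                return lookup("DemoAnnotatorRB", "greeting", Locale.US,
                        BundleLookup.class.getClassLoader(), ext);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The lookup through the default loader fails and the one through the extra
loader succeeds, which is the same effect the subclassing trick relies on.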

--Thilo

Kirk True | 18 May 20:43

Re: InternationalizedException doesn't look for resources in extension ClassLoader

Hi Thilo,

> Yes, that's a known issue.  There is one trick you can use: don't use
> InternationalizedException directly, but inherit from it.  Bundle this
> exception class with your pear.  Since message localization is done in
> the exception class, the correct class loader will then be used and
> your message bundle will be found.  It's a bit of a hack, but it works ;-)

Thanks for the fast reply!

Is there a bug report listed somewhere? I looked for it, but didn't
find anything. 

Thanks for the workaround. My concern is that, since we want to "host"
arbitrary annotators from third parties, *their* annotators won't
implement the trick and will thus fail :(

Thanks!!!

Kirk

