negrinv | 1 Dec 01:33 2006

Re: Attached proposed modifications to Lucene 2.0 to support Field.Store.Encrypted


Thank you Robert for your comments. I am inclined to agree with you, but I
would like to establish first of all if simplicity of implementation is the
overriding consideration. But before I dwell on that, let me say that I have
discovered that I am not a master of DIFF file creation with Eclipse. The
diff file attachment to my original posting is absurdly large and not
correct. I have therefore attached a zip file containing the complete source
code of the classes I modified. I leave it to others to extract the diffs
properly.
Back to the issue. So far the implementation has not been difficult,
considering that I knew nothing about Lucene internals before I started. The
reason is that Lucene is very well structured and the changes just fitted
nicely by adding some code in the right place with minimal changes to the
existing code. But I admit that the proposed implementation so far is not
complete and more work is required to overcome some of its restrictions.
While I like your idea, I believe that it imposes too coarse a granularity on
the encrypted data: all fields with all kinds of data will be encrypted,
including images and others which normally would be left alone, thus adding
to the performance penalty due to encryption. Many hardware devices and most
operating systems already provide directory or file system encryption;
therefore that level of encryption appears to me an unnecessary addition to
Lucene. Encryption at field level, however, is not provided by anything I
know of. The key, in my opinion, is to decide what is best from the end user's
point of view, but perhaps we need more discussion on this.
Victor

http://www.nabble.com/file/4390/LuceneEncryptionMods.zip
LuceneEncryptionMods.zip 

Robert Engels wrote:

negrinv | 1 Dec 07:45 2006

Re: Attached proposed modifications to Lucene 2.0 to support Field.Store.Encrypted


Luke, I should have mentioned in my earlier posting that what I am proposing
uses password-based encryption, where the password is NOT stored anywhere
within Lucene. I avoided on purpose making any reference to security (as
opposed to encryption) because I believe security to be the responsibility
of the end application, not of Lucene. Lucene, in my opinion, can only provide
encryption services. None of the encryption APIs themselves, whether written
by a third party or by Sun, can guarantee security either, which is why Lucene
cannot do so. What it can do is provide encryption of the data and
its index. Any application using these proposed API extensions will have to
work out the extent to which it can provide security within the context of
all the other APIs involved and the application requirements themselves.
I have to agree with you that at some stage Lucene will have to stop
providing new functionality or it will become unmaintainable. But has it
reached that stage yet?
Victor
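
Editorial illustration: the idea of password-based encryption of a single field value, with the password never written to storage, can be sketched as follows. This is not the code from the attached zip, which is not reproduced in this thread. The class name `FieldCrypto`, the salt/IV/ciphertext layout, the iteration count and the choice of PBKDF2 with AES-GCM are all assumptions made for the example, and it relies on JDK APIs (Java 8 or later) that are newer than Lucene 2.0.

```java
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: encrypts one field value with a
// key derived from a password. Only salt, IV and ciphertext are stored.
final class FieldCrypto {
    private static final int SALT_LEN = 16;   // bytes of random salt per value
    private static final int IV_LEN = 12;     // bytes of GCM nonce per value
    private static final int TAG_BITS = 128;  // GCM authentication tag size

    // Derive a 128-bit AES key from the password; the password itself is never stored.
    static SecretKey deriveKey(char[] password, byte[] salt) throws Exception {
        SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256");
        byte[] raw = factory.generateSecret(
                new PBEKeySpec(password, salt, 100_000, 128)).getEncoded();
        return new SecretKeySpec(raw, "AES");
    }

    // Returns salt || iv || ciphertext-with-tag, suitable for storing as a binary field value.
    static byte[] encrypt(char[] password, String fieldValue) throws Exception {
        SecureRandom random = new SecureRandom();
        byte[] salt = new byte[SALT_LEN];
        byte[] iv = new byte[IV_LEN];
        random.nextBytes(salt);
        random.nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, deriveKey(password, salt),
                new GCMParameterSpec(TAG_BITS, iv));
        byte[] ct = cipher.doFinal(fieldValue.getBytes(StandardCharsets.UTF_8));
        byte[] out = new byte[SALT_LEN + IV_LEN + ct.length];
        System.arraycopy(salt, 0, out, 0, SALT_LEN);
        System.arraycopy(iv, 0, out, SALT_LEN, IV_LEN);
        System.arraycopy(ct, 0, out, SALT_LEN + IV_LEN, ct.length);
        return out;
    }

    // Reverses encrypt(); throws AEADBadTagException if the password is wrong
    // or the stored bytes were modified.
    static String decrypt(char[] password, byte[] stored) throws Exception {
        byte[] salt = Arrays.copyOfRange(stored, 0, SALT_LEN);
        byte[] iv = Arrays.copyOfRange(stored, SALT_LEN, SALT_LEN + IV_LEN);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, deriveKey(password, salt),
                new GCMParameterSpec(TAG_BITS, iv));
        byte[] plain = cipher.doFinal(stored, SALT_LEN + IV_LEN,
                stored.length - SALT_LEN - IV_LEN);
        return new String(plain, StandardCharsets.UTF_8);
    }
}
```

A sketch like this only covers stored field values. As the rest of the thread points out, the indexed terms and other index structures would need separate treatment, and the application still has to decide how the password is obtained and protected.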

Luke Nezda wrote:
> 
> Victor-
> Your point is well taken that a comprehensive encryption strategy is not
> quite analogous to compression, because it involves more than a
> transformation of field values to a more compact form: it requires (at a
> minimum) that all data structures which comprise the index be encrypted
> too.  Maybe I spoke too soon.
> 
> However, after considering this more, I think the scheme would need to be

Nicolas Lalevée | 1 Dec 09:20 2006

Re: Attached proposed modifications to Lucene 2.0 to support Field.Store.Encrypted

On Friday 1 December 2006 01:33, negrinv wrote:
> Thank you Robert for your comments. I am inclined to agree with you, but I
> would like to establish first of all if simplicity of implementation is the
> overriding consideration. But before I dwell on that, let me say that I have
> discovered that I am not a master of DIFF file creation with Eclipse. The
> diff file attachment to my original posting is absurdly large and not
> correct. I have therefore attached a zip file containing the complete
> source code of the classes I modified. I leave it to others to extract the
> diffs properly.
> Back to the issue. So far the implementation has not been difficult,
> considering that I knew nothing about Lucene internals before I started.
> The reason is that Lucene is very well structured and the changes just
> fitted nicely by adding some code in the right place with minimal changes
> to the existing code. But I admit that the proposed implementation so far
> is not complete and more work is required to overcome some of its
> restrictions. While I like your idea, I believe that it imposes too coarse a
> granularity on the encrypted data: all fields with all kinds of data will
> be encrypted, including images and others which normally would be left
> alone, thus adding to the performance penalty due to encryption.

I don't agree with you here. In Lucene, you will encrypt the field data, the 
field names, and the tokens: I would say that this represents at least 2/3 of 
the index size. Then, with the implementation you suggest, I think (sorry, I 
didn't take the time to look at your patch) that Lucene data is decrypted 
every time it needs to be read. With an encrypted FS, your kernel will 
maintain a cache in RAM for you, so it won't hurt so much.
It needs some benchmarking to see which is actually the best, but I doubt that 
your solution will be faster.

Nicolas.
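
Editorial illustration: the caching argument above can be made concrete by counting decryptions rather than timing them. The sketch below is a toy model, not Lucene code and not a measurement of the proposed patch, and whether the patch really decrypts on every read is Nicolas's guess, not an established fact. The names `EncryptedStore` and `CachedReader` are hypothetical, and the in-memory cache only loosely stands in for the kernel page cache on an encrypted file system.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

final class DecryptCostDemo {

    /** Holds AES-CTR encrypted blocks and counts how often a block is decrypted. */
    static final class EncryptedStore {
        private final SecretKeySpec key;
        private final Map<Integer, byte[][]> blocks = new HashMap<>(); // id -> {iv, ciphertext}
        int decryptCalls = 0;

        EncryptedStore(byte[] rawKey) {
            key = new SecretKeySpec(rawKey, "AES");
        }

        void put(int id, byte[] plain) throws Exception {
            byte[] iv = new byte[16];
            new SecureRandom().nextBytes(iv); // fresh IV per block
            Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
            blocks.put(id, new byte[][] { iv, cipher.doFinal(plain) });
        }

        /** Decrypts the block on every call: the "decrypt each time" scenario. */
        byte[] read(int id) throws Exception {
            byte[][] block = blocks.get(id);
            Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(block[0]));
            decryptCalls++;
            return cipher.doFinal(block[1]);
        }
    }

    /** Keeps decrypted blocks in memory, so each block is decrypted at most once. */
    static final class CachedReader {
        private final EncryptedStore store;
        private final Map<Integer, byte[]> cache = new HashMap<>();

        CachedReader(EncryptedStore store) {
            this.store = store;
        }

        byte[] read(int id) throws Exception {
            byte[] plain = cache.get(id);
            if (plain == null) {
                plain = store.read(id);
                cache.put(id, plain);
            }
            return plain;
        }
    }
}
```

Reading the same block 100 times through `EncryptedStore.read` performs 100 decryptions, while going through `CachedReader` performs one. How much that matters in practice depends on the access pattern, which is why a benchmark is still needed.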

Doron Cohen (JIRA) | 1 Dec 10:14 2006

[jira] Created: (LUCENE-736) Sloppy Phrase Scoring Misbehavior

Sloppy Phrase Scoring Misbehavior
---------------------------------

                 Key: LUCENE-736
                 URL: http://issues.apache.org/jira/browse/LUCENE-736
             Project: Lucene - Java
          Issue Type: Bug
          Components: Search
            Reporter: Doron Cohen
         Assigned To: Doron Cohen
            Priority: Minor

This is an extension of https://issues.apache.org/jira/browse/LUCENE-697

In addition to the abnormalities Yonik pointed out in 697, there seem to be other issues with sloppy phrase
search and scoring.

1) A phrase with a repeated word would be detected in a document although it is not there.
E.g. document = A B D C E , query = "B C B" would not find this document (as expected), but query "B C B"~2 would
find it. 
I think that no matter how large the slop is, this document should not be a match.

2) A document containing both orders of a query, symmetrically, would score differently for the query and
for its reversed form.
E.g. document = A B C B A would score differently for queries "B C"~2 and "C B"~2, although it is symmetric to both.

I will attach test cases that show both these problems and the one reported by Yonik in 697. 
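
Editorial illustration of point 1: whatever the slop, a document can only match a phrase if it contains each term at least as many times as the phrase repeats it. The sketch below checks just that necessary condition on plain token lists. It is not Lucene's sloppy phrase scorer and not the fix attached to this issue. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RepeatedTermCheck {

    /**
     * Returns true if the document contains every phrase term at least as
     * often as the phrase repeats it. This is a necessary condition for any
     * sloppy match, independent of the slop value; it is not sufficient.
     */
    static boolean hasEnoughOccurrences(List<String> doc, List<String> phrase) {
        Map<String, Integer> available = new HashMap<>();
        for (String term : doc) {
            available.merge(term, 1, Integer::sum);
        }
        for (String term : phrase) {
            int left = available.getOrDefault(term, 0);
            if (left == 0) {
                return false; // the phrase needs one more occurrence than the document has
            }
            available.put(term, left - 1);
        }
        return true;
    }
}
```

For the document A B D C E and the phrase "B C B" the check fails, because the document has only one B, which is why no slop value should produce a match. For the document A B C B A the same phrase passes the check.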

-- 
This message is automatically generated by JIRA.

Doron Cohen (JIRA) | 1 Dec 10:20 2006

[jira] Commented: (LUCENE-697) Scorer.skipTo affects sloppyPhrase scoring

    [ http://issues.apache.org/jira/browse/LUCENE-697?page=comments#action_12454844 ] 

Doron Cohen commented on LUCENE-697:
------------------------------------

I went on to document the sloppy phrase scorer and the phrase scorer, so that the fix above could go in more comfortably.
However, while doing that I found that the scorer does not behave as I thought it should.
It seems to me that the problem with sloppy phrase scoring is broader than the skipTo() behavior, and that
the skipTo() behavior is actually just a side effect of it.
So I am creating a new issue to discuss sloppy phrase scoring behavior, which, if/when resolved, would
probably also resolve this one. 
See https://issues.apache.org/jira/browse/LUCENE-736

> Scorer.skipTo affects sloppyPhrase scoring
> ------------------------------------------
>
>                 Key: LUCENE-697
>                 URL: http://issues.apache.org/jira/browse/LUCENE-697
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.0.0
>            Reporter: Yonik Seeley
>         Assigned To: Doron Cohen
>         Attachments: sloppy_phrase_skipTo.patch, sloppy_phrase_skipTo.patch2
>
>
> If you mix skipTo() and next(), you get different scores than what is returned to a hit collector.


Doron Cohen (JIRA) | 1 Dec 10:26 2006

[jira] Updated: (LUCENE-736) Sloppy Phrase Scoring Misbehavior

     [ http://issues.apache.org/jira/browse/LUCENE-736?page=all ]

Doron Cohen updated LUCENE-736:
-------------------------------

    Attachment: sloppy_phrase_tests.patch.txt

sloppy_phrase_tests.patch.txt  contains:

- two test cases added in TestPhraseQuery. 
These new tests currently fail. 

- skipTo() behavior tests that were originally in issue 697. 
These too currently fail.

> Sloppy Phrase Scoring Misbehavior
> ---------------------------------
>
>                 Key: LUCENE-736
>                 URL: http://issues.apache.org/jira/browse/LUCENE-736
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>            Reporter: Doron Cohen
>         Assigned To: Doron Cohen
>            Priority: Minor
>         Attachments: sloppy_phrase_tests.patch.txt
>
>
> This is an extension of https://issues.apache.org/jira/browse/LUCENE-697

negrinv | 1 Dec 11:10 2006

Re: Attached proposed modifications to Lucene 2.0 to support Field.Store.Encrypted


Nicolas Lalevée-2 wrote:
> 
> On Friday 1 December 2006 01:33, negrinv wrote:
>> Thank you Robert for your comments. I am inclined to agree with you, but
>> I would like to establish first of all if simplicity of implementation is
>> the overriding consideration. But before I dwell on that, let me say that
>> I have discovered that I am not a master of DIFF file creation with
>> Eclipse. The diff file attachment to my original posting is absurdly
>> large and not correct. I have therefore attached a zip file containing
>> the complete source code of the classes I modified. I leave it to others
>> to extract the diffs properly.
>> Back to the issue. So far the implementation has not been difficult,
>> considering that I knew nothing about Lucene internals before I started.
>> The reason is that Lucene is very well structured and the changes just
>> fitted nicely by adding some code in the right place with minimal changes
>> to the existing code. But I admit that the proposed implementation so far
>> is not complete and more work is required to overcome some of its
>> restrictions. While I like your idea, I believe that it imposes too
>> coarse a granularity on the encrypted data: all fields with all kinds of
>> data will be encrypted, including images and others which normally would
>> be left alone, thus adding to the performance penalty due to encryption.
> 
> I don't agree with you here. In Lucene, you will encrypt the field data,
> the 

Doron Cohen (JIRA) | 1 Dec 11:26 2006

[jira] Updated: (LUCENE-736) Sloppy Phrase Scoring Misbehavior

     [ http://issues.apache.org/jira/browse/LUCENE-736?page=all ]

Doron Cohen updated LUCENE-736:
-------------------------------

    Attachment: sloppy_phrase_java.patch.txt
                perf-search-new.log
                perf-search-orig.log

The attached sloppy_phrase_java.patch.txt fixes the failing new tests. 
It also includes the fix for the skipTo() bug from issue 697.

The fix does not guarantee that document A B C B A would score "A B C"~4 and "C B A"~4 the same. 
It does that for "B C"~2 and "C B"~2.
This is because a general fix for that (at least the one that I devised) would be too expensive.
Although this is an interesting case, I'd like to think it is not an important one.

This fix comes with a performance cost: about 15% degradation in CPU activity of sloppy phrase scoring, as
the attached perf logs show.
Here is the summary of these tests:

       Operation           runCnt  recsPerRun  rec/s  elapsedSec
Orig:  SearchSameRdr_3000       4        3000  216.1       55.52
New:   SearchSameRdr_3000       4        3000  187.8       63.91
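
Editorial note: the "about 15%" figure can be checked from the two summary rows alone. No other measurements are assumed here; the symbols t (elapsedSec) and r (rec/s) are introduced only for this calculation.

```latex
\[
\frac{t_{\text{new}} - t_{\text{orig}}}{t_{\text{orig}}}
  = \frac{63.91 - 55.52}{55.52} \approx 0.151,
\qquad
\frac{r_{\text{orig}} - r_{\text{new}}}{r_{\text{orig}}}
  = \frac{216.1 - 187.8}{216.1} \approx 0.131
\]
```

So elapsed time rises by about 15.1%, while throughput falls by about 13.1%. The two percentages differ because throughput is the reciprocal of time (1/1.151 is roughly 0.869).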

I think that in a real life scenario - real index, real documents, real queries - this extra CPU will be overshadowed
by I/O, but I also believe we should refrain from slowing down search, so, unhappy with this degradation
(anyone would be :-), I will look for other ways to fix this - ideas are welcome.

The perf test was done using the task benchmark framework (see issue 675). The logs also show the queries that

Nicolas Lalevée | 1 Dec 11:49 2006

Re: Attached proposed modifications to Lucene 2.0 to support Field.Store.Encrypted

On Friday 1 December 2006 11:10, negrinv wrote:
> Nicolas Lalevée-2 wrote:
> > On Friday 1 December 2006 01:33, negrinv wrote:
> >> Thank you Robert for your comments. I am inclined to agree with you, but
> >> I would like to establish first of all if simplicity of implementation
> >> is the overriding consideration. But before I dwell on that, let me say
> >> that I have discovered that I am not a master of DIFF file creation with
> >> Eclipse. The diff file attachment to my original posting is absurdly
> >> large and not correct. I have therefore attached a zip file containing
> >> the complete source code of the classes I modified. I leave it to others
> >> to extract the diffs properly.
> >> Back to the issue. So far the implementation has not been difficult,
> >> considering that I knew nothing about Lucene internals before I started.
> >> The reason is that Lucene is very well structured and the changes just
> >> fitted nicely by adding some code in the right place with minimal
> >> changes to the existing code. But I admit that the proposed
> >> implementation so far is not complete and more work is required to
> >> overcome some of its restrictions. While I like your idea, I believe
> >> that it imposes too coarse a granularity on the encrypted data: all
> >> fields with all kinds of data will be encrypted, including images and
> >> others which normally would be left alone, thus adding to the
> >> performance penalty due to encryption.
> >
> > I don't agree with you here. In Lucene, you will encrypt the field data,
> > the
> > field names, and the tokens: I would say that this represents at least 2/3

Yonik Seeley (JIRA) | 1 Dec 18:06 2006

[jira] Commented: (LUCENE-736) Sloppy Phrase Scoring Misbehavior

    [ http://issues.apache.org/jira/browse/LUCENE-736?page=comments#action_12454955 ] 

Yonik Seeley commented on LUCENE-736:
-------------------------------------

Great investigations Doron!
Personally I'm more concerned with (1) than (2).  Was the fix for one issue more responsible for the
performance loss than the other?

> Sloppy Phrase Scoring Misbehavior
> ---------------------------------
>
>                 Key: LUCENE-736
>                 URL: http://issues.apache.org/jira/browse/LUCENE-736
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>            Reporter: Doron Cohen
>         Assigned To: Doron Cohen
>            Priority: Minor
>         Attachments: perf-search-new.log, perf-search-orig.log, sloppy_phrase_java.patch.txt, sloppy_phrase_tests.patch.txt
>
>
> This is an extension of https://issues.apache.org/jira/browse/LUCENE-697
> In addition to the abnormalities Yonik pointed out in 697, there seem to be other issues with sloppy phrase
> search and scoring.
> 1) A phrase with a repeated word would be detected in a document although it is not there.
> E.g. document = A B D C E , query = "B C B" would not find this document (as expected), but query "B C B"~2 would
> find it.
> I think that no matter how large the slop is, this document should not be a match.