[Pgbigm-hackers] Adding similarity() and similarity_op(), '%' to pg_bigm

Beena Emerson memis****@gmail*****
Thu Oct 3 15:48:26 JST 2013

On Thu, Oct 3, 2013 at 7:26 AM, Fujii Masao <masao****@gmail*****> wrote:

> On Mon, Sep 30, 2013 at 12:34 PM, Amit Langote <amitl****@gmail*****> wrote:
> > On Mon, Sep 30, 2013 at 12:22 PM, Fujii Masao <masao****@gmail*****> wrote:
> >> On Sat, Sep 28, 2013 at 6:16 PM, Beena Emerson <memis****@gmail*****> wrote:
> >>> Please find attached the updated patch to rebase against the new HEAD and
> >>> also implements the comments I had given before.
> I extracted the similarity function code from the patch as the separate one
> so that we can more easily review and commit it. I'd like to work on this
> first.
> After committing it, I'd like to work on the remaining similarity search
> code.
> Attached patch just implements pg_bigm version of similarity function.
> This is still a WIP. The description of the bigm_similarity function must be
> added to the documentation, and the regression test must be updated.

> While reviewing the similarity function, I found that there is one big problem
> in bigm_similarity(): it is case-sensitive, but the pg_trgm version of the
> similarity function is not. Please see the following example:
> =# select similarity('wow', 'WOW');
>  similarity
> ------------
>           1
> (1 row)
> =# select bigm_similarity('wow', 'WOW');
>  bigm_similarity
> -----------------
>                0
> (1 row)
> Should we implement the *case-insensitive* bigm_similarity()?

The pg_bigm code is case-sensitive; even show_bigm and show_trgm behave
differently:

=# SELECT show_bigm ('ABC');
 {" A",AB,BC,"C "}
(1 row)

=# SELECT show_trgm ('ABC');
 {"  a"," ab",abc,"bc "}
(1 row)
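
To make the difference concrete, here is a minimal sketch in C. It is not the
pg_bigm source: make_bigrams() and MAX_WORD are hypothetical names, and the
sketch handles only a single ASCII word (no multibyte characters, no sorting or
de-duplication). It pads the word with one space on each side, as the show_bigm
output above suggests, and optionally lowercases it first, as pg_trgm does.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_WORD 64   /* hypothetical limit for this sketch */

/*
 * Hypothetical helper, not part of pg_bigm: writes the bigrams of one ASCII
 * word into out[] (2 chars plus NUL per entry) and returns how many were
 * written. The caller must provide room for MAX_WORD + 1 entries.
 * If fold_case is nonzero, the word is lowercased before splitting.
 */
static size_t
make_bigrams(const char *word, int fold_case, char out[][3])
{
    char   buf[MAX_WORD + 3];
    size_t len = strlen(word);
    size_t i;

    if (len == 0 || len > MAX_WORD)
        return 0;

    /* one padding space on each side of the word */
    buf[0] = ' ';
    for (i = 0; i < len; i++)
        buf[i + 1] = fold_case ? (char) tolower((unsigned char) word[i])
                               : word[i];
    buf[len + 1] = ' ';
    buf[len + 2] = '\0';

    /* a padded word of len + 2 characters has len + 1 bigrams */
    for (i = 0; i < len + 1; i++)
    {
        out[i][0] = buf[i];
        out[i][1] = buf[i + 1];
        out[i][2] = '\0';
    }
    return len + 1;
}
```

Calling make_bigrams("ABC", 0, out) fills out with " A", "AB", "BC", "C ",
while make_bigrams("ABC", 1, out) gives " a", "ab", "bc", "c ".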

This is because of the difference between the code of generate_trgm and
generate_bigm:

generate_trgm (trgm_op.c, lines 213 - 228):
        while ((bword = find_word(eword, slen - (eword - str), &eword, &charlen)) != NULL)
#ifdef IGNORECASE
                bword = lowerstr_with_len(bword, eword - bword);
                bytelen = strlen(bword);
#else
                bytelen = eword - bword;
#endif

                memcpy(buf + LPADDING, bword, bytelen);

                buf[LPADDING + bytelen] = ' ';
                buf[LPADDING + bytelen + 1] = ' ';

generate_bigm (bigm_op.c, lines 247 - 253):
        while ((bword = find_word(eword, slen - (eword - str), &eword, &charlen)) != NULL)
                bytelen = eword - bword;
                memcpy(buf + LPADDING, bword, bytelen);

                buf[LPADDING + bytelen] = ' ';
                buf[LPADDING + bytelen + 1] = ' ';

Since the similarity function uses generate_bigm to extract the bigrams and
then compares them, the two similarity functions behave differently: pg_trgm
lowercases each word before splitting it, and pg_bigm does not.

So the way to make the similarity function case-insensitive would be to change
generate_bigm, not the similarity code itself. Note that this change would also
alter the output of show_bigm.
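
As a rough illustration of why the fix belongs in the bigram generation step,
the sketch below computes a similarity score from the bigram sets with and
without case folding. Everything here is an assumption for illustration:
bigram_set(), sketch_similarity() and MAX_WORD are hypothetical names, only a
single ASCII word is handled, and the score is a Jaccard-style ratio
(common / (n1 + n2 - common)), which may not be the formula bigm_similarity
actually uses.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORD 64   /* hypothetical limit for this sketch */

static int
cmp_u16(const void *a, const void *b)
{
    unsigned x = *(const unsigned short *) a;
    unsigned y = *(const unsigned short *) b;

    return (x > y) - (x < y);
}

/*
 * Hypothetical helper: fills out[] with the sorted, distinct bigrams of one
 * space-padded ASCII word (each bigram packed into 16 bits) and returns
 * their count. Case folding happens here, before the bigrams are built.
 */
static size_t
bigram_set(const char *word, int fold_case, unsigned short *out)
{
    unsigned char buf[MAX_WORD + 2];
    size_t len = strlen(word), i, n = 0;

    if (len == 0 || len > MAX_WORD)
        return 0;

    buf[0] = ' ';
    for (i = 0; i < len; i++)
        buf[i + 1] = fold_case ? (unsigned char) tolower((unsigned char) word[i])
                               : (unsigned char) word[i];
    buf[len + 1] = ' ';

    for (i = 0; i <= len; i++)
        out[i] = (unsigned short) ((buf[i] << 8) | buf[i + 1]);
    qsort(out, len + 1, sizeof(out[0]), cmp_u16);

    /* drop duplicates in place */
    for (i = 0; i < len + 1; i++)
        if (n == 0 || out[i] != out[n - 1])
            out[n++] = out[i];
    return n;
}

/* Jaccard-style score over the two bigram sets (an assumed formula). */
static double
sketch_similarity(const char *a, const char *b, int fold_case)
{
    unsigned short sa[MAX_WORD + 1], sb[MAX_WORD + 1];
    size_t na = bigram_set(a, fold_case, sa);
    size_t nb = bigram_set(b, fold_case, sb);
    size_t i = 0, j = 0, common = 0;

    if (na == 0 || nb == 0)
        return 0.0;

    /* count common elements of the two sorted sets */
    while (i < na && j < nb)
    {
        if (sa[i] == sb[j]) { common++; i++; j++; }
        else if (sa[i] < sb[j]) i++;
        else j++;
    }
    return (double) common / (double) (na + nb - common);
}
```

With fold_case = 0, sketch_similarity("wow", "WOW", 0) is 0 because no bigram
matches; with fold_case = 1 it is 1. The comparison loop is identical in both
cases, so only the generation step needs to change.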

Beena Emerson