Adding similarity() and similarity_op(), '%' to pg_bigm (Pgbigm-hackers) - pg_bigm

On Thu, Oct 3, 2013 at 7:26 AM, Fujii Masao <masao****@gmail*****> wrote:

> On Mon, Sep 30, 2013 at 12:34 PM, Amit Langote <amitl****@gmail*****> wrote:
> > On Mon, Sep 30, 2013 at 12:22 PM, Fujii Masao <masao****@gmail*****> wrote:
> >> On Sat, Sep 28, 2013 at 6:16 PM, Beena Emerson <memis****@gmail*****> wrote:
> >>> Please find attached the updated patch to rebase against the new HEAD and
> >>> also implements the comments I had given before.
>
> I extracted the similarity function code from the patch into a separate
> patch so that we can review and commit it more easily. I'd like to work on
> this first. After committing it, I'd like to work on the remaining
> similarity search code.
>
> The attached patch implements only the pg_bigm version of the similarity
> function. It is still WIP: the description of the bigm_similarity function
> must be added to the documentation, and the regression test must be updated.
>

> While reviewing the similarity function, I found one big problem in
> bigm_similarity(): it is case-sensitive, but the pg_trgm version of the
> similarity function is not. Please see the following example:
>
> =# select similarity('wow', 'WOW');
>  similarity
> ------------
>           1
> (1 row)
>
> =# select bigm_similarity('wow', 'WOW');
>  bigm_similarity
> -----------------
>                0
> (1 row)
>
> Should we implement the *case-insensitive* bigm_similarity()?
>
>
The pg_bigm code is case-sensitive throughout; even show_bigm and show_trgm
behave differently:

=# SELECT show_bigm ('ABC');
     show_bigm
-------------------
 {" A",AB,BC,"C "}
(1 row)

=# SELECT show_trgm ('ABC');
        show_trgm
-------------------------
 {"  a"," ab",abc,"bc "}
(1 row)

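This is because of a difference between the code of generate_trgm and
generate_bigm: generate_trgm lower-cases each word when IGNORECASE is
defined, while generate_bigm has no such step.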

fn generate_trgm: trgm_op.c ln 213 - 228
        while ((bword = find_word(eword, slen - (eword - str), &eword, &charlen)) != NULL)
        {
#ifdef IGNORECASE
                bword = lowerstr_with_len(bword, eword - bword);
                bytelen = strlen(bword);
#else
                bytelen = eword - bword;
#endif

                memcpy(buf + LPADDING, bword, bytelen);

#ifdef IGNORECASE
                pfree(bword);
#endif
                buf[LPADDING + bytelen] = ' ';
                buf[LPADDING + bytelen + 1] = ' ';
....

fn generate_bigm: bigm_op.c ln 247 - 253
        while ((bword = find_word(eword, slen - (eword - str), &eword, &charlen)) != NULL)
        {
                bytelen = eword - bword;
                memcpy(buf + LPADDING, bword, bytelen);

                buf[LPADDING + bytelen] = ' ';
                buf[LPADDING + bytelen + 1] = ' ';
....

Since the similarity function uses generate_bigm to extract the bigrams and
then compares them, it behaves differently from the pg_trgm version.

So the way to make the similarity function case-insensitive would be to change
generate_bigm rather than the similarity code itself. Note that this change
would also alter the output of show_bigm.
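
To illustrate why the fix belongs in bigram generation, here is a minimal standalone sketch. It is not the pg_bigm code: the toy_* names, the MAX_WORD limit, the ignore_case flag and the ASCII-only handling are all assumptions made for the example, and the "shared / union" formula is assumed rather than taken from the patch.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define MAX_WORD 64             /* toy limit, not a pg_bigm constant */

/* One bigram is two bytes; this toy handles single-byte (ASCII) text only. */
typedef struct { char c[2]; } toy_bigm;

/*
 * Pad the word with one blank on each side, optionally lower-case it,
 * and collect its distinct bigrams into out[] (room for MAX_WORD + 1).
 * Returns the number of bigrams stored.
 */
static int
toy_generate_bigm(const char *word, int ignore_case, toy_bigm *out)
{
    char   buf[MAX_WORD + 2];
    size_t len = strlen(word);
    int    n = 0;

    assert(len <= MAX_WORD);

    buf[0] = ' ';
    for (size_t i = 0; i < len; i++)
        buf[i + 1] = ignore_case ? (char) tolower((unsigned char) word[i])
                                 : word[i];
    buf[len + 1] = ' ';

    for (size_t i = 0; i + 1 < len + 2; i++)
    {
        int seen = 0;

        /* skip bigrams that were already collected */
        for (int j = 0; j < n; j++)
            if (out[j].c[0] == buf[i] && out[j].c[1] == buf[i + 1])
                seen = 1;
        if (!seen)
        {
            out[n].c[0] = buf[i];
            out[n].c[1] = buf[i + 1];
            n++;
        }
    }
    return n;
}

/* Similarity as shared bigrams divided by the size of the union (assumed). */
static float
toy_similarity(const char *a, const char *b, int ignore_case)
{
    toy_bigm ba[MAX_WORD + 1], bb[MAX_WORD + 1];
    int      na = toy_generate_bigm(a, ignore_case, ba);
    int      nb = toy_generate_bigm(b, ignore_case, bb);
    int      shared = 0;

    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++)
            if (ba[i].c[0] == bb[j].c[0] && ba[i].c[1] == bb[j].c[1])
                shared++;

    return (float) shared / (float) (na + nb - shared);
}
```

With ignore_case set to 0, toy_similarity("wow", "WOW", 0) finds no shared bigrams and returns 0, as in the bigm_similarity example above; with ignore_case set to 1 both words produce the same bigrams and the result is 1. The comparison code is identical in both cases; only the generation step differs.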

Beena Emerson