[Pgbigm-hackers] Adding similarity() and similarity_op(), '%' to pg_bigm

Back to archive index

Beena Emerson memis****@gmail*****
2013年 10月 3日 (木) 17:53:09 JST


On Thu, Oct 3, 2013 at 2:11 PM, Fujii Masao <masao****@gmail*****> wrote:

> On Thu, Oct 3, 2013 at 3:48 PM, Beena Emerson <memis****@gmail*****>
> wrote:
> > On Thu, Oct 3, 2013 at 7:26 AM, Fujii Masao <masao****@gmail*****>
> wrote:
> >>
> >> On Mon, Sep 30, 2013 at 12:34 PM, Amit Langote <amitl****@gmail*****
> >
> >> wrote:
> >> > On Mon, Sep 30, 2013 at 12:22 PM, Fujii Masao <masao****@gmail*****>
> >> > wrote:
> >> >> On Sat, Sep 28, 2013 at 6:16 PM, Beena Emerson
> >> >> <memis****@gmail*****> wrote:
> >> >>> Please find attached the updated patch to rebase against the new
> HEAD
> >> >>> and
> >> >>> also implements the comments I had given before.
> >>
> >> I extracted the similarity function code from the patch as the separate
> >> one
> >> so that we can more easily review and commit it. I'd like to work on
> this
> >> first.
> >> After committing it, I'd like to work on the remaining similarity search
> >> code.
> >>
> >> Attached patch just implements pg_bigm version of similarity function.
> >> This is in WIP yet. The description of bigm_similarity function must be
> >> added into the document. The regression test must be updated.
> >>
> >>
> >> While reviewing the similartiy function, I found that there is one big
> >> problem
> >> in bigm_similarity(). That is, bigm_similarity() is case-sensitive, but
> >> pg_trgm
> >> version of similarity function is not. Please see the following example:
> >>
> >> =# select similarity('wow', 'WOW');
> >>  similarity
> >> ------------
> >>           1
> >> (1 row)
> >>
> >> =# select bigm_similarity('wow', 'WOW');
> >>  bigm_similarity
> >> -----------------
> >>                0
> >> (1 row)
> >>
> >> Should we implement the *case-insensitive* bigm_similarity()?
> >>
> >
> > The pg_bigm code is case sensitive, even the show_bigm and show_trgm
> behave
> > differently.
> >
> > =# SELECT show_bigm ('ABC');
> >      show_bigm
> > -------------------
> >  {" A",AB,BC,"C "}
> > (1 row)
> >
> > =# SELECT show_trgm ('ABC');
> >         show_trgm
> > -------------------------
> >  {"  a"," ab",abc,"bc "}
> > (1 row)
> >
> > This is because of the difference in code of generate_trgm and
> > generate_bigm.
> >
> > fn generate_trgm: trgm_op.c ln 213 - 228
> >  while ((bword = find_word(eword, slen - (eword - str), &eword,
> &charlen))
> > != NULL)
> >         {
> > #ifdef IGNORECASE
> >                 bword = lowerstr_with_len(bword, eword - bword);
> >                 bytelen = strlen(bword);
> > #else
> >                 bytelen = eword - bword;
> > #endif
> >
> >                 memcpy(buf + LPADDING, bword, bytelen);
> >
> > #ifdef IGNORECASE
> >                 pfree(bword);
> > #endif
> >                 buf[LPADDING + bytelen] = ' ';
> >                 buf[LPADDING + bytelen + 1] = ' ';
> > ....
> >
> > fn generate_bigm: bigm_op.c ln 247 - 253
> >  while ((bword = find_word(eword, slen - (eword - str), &eword,
> &charlen))
> > != NULL)
> >         {
> >                 bytelen = eword - bword;
> >                 memcpy(buf + LPADDING, bword, bytelen);
> >
> >                 buf[LPADDING + bytelen] = ' ';
> >                 buf[LPADDING + bytelen + 1] = ' ';
> > ....
> >
> > Since similarity function uses this generate_bigm to get the bigrams and
> > then compare it, there is a difference in behavior.
> >
> > So the way to make similarity function case-insensitive would be to
> change
> > generate_bigm and not the similarity code itself. Also, the change will
> make
> > the show_bigm function behave differently.
>
> Yes, generate_bigm would need to be updated to make bigm_similarity
> case-sensitive.
>
> *From a user point of view*, bigm_similarity() and upcoming similarity
> search should be case-sensitive? If yes, we should change generate_bigm,
> but its change must not affect the behavior of the full-text search at all.
>
> Or we should just implement both case-sensitive and -insensitive
> bigm_similarity() and similarity search?
>
> Thought?


 I am not sure I understood what you are trying to say.

Are you suggesting that bigm similarity functions should be case
insensitive and the current functions and search method should remain case
sensitive?

Beena Emerson
-------------- next part --------------
HTML$B$NE:IU%U%!%$%k$rJ]4I$7$^$7$?(B...
Télécharger 



Pgbigm-hackers メーリングリストの案内
Back to archive index