[OPEN-ILS-GENERAL] Re: Special diacritics

Dan Scott dan at coffeecode.net
Tue May 25 22:22:07 EDT 2010


On 25 May 2010 16:05, Hardy, Elaine <ehardy at georgialibraries.org> wrote:
> James,
>
> Yes. PINES has experienced this and has opened a helpdesk ticket with this
> specific author and Yrsa Sigurðardóttir . While not really an o, many sites
> and databases will retrieve searches like Hoeg for Høeg or Sigurdardottir
> for Sigurðardóttir. We were told it would take development for those
> searches to return the correct authors in Evergreen.

This is a tough one, but I think I can point you in roughly the right
direction in the 1.6 code base. Search query normalization occurs in
Open-ILS/src/perlmods/OpenILS/Application/Storage/FTS.pm in the
naco_normalize() function, which attempts to follow the NACO
normalization rules for authority records. Indexing normalization
occurs in the public.naco_normalize database function (which is
basically a copy of the FTS.pm version).

It looks, though, like what you might want is to use the Perl module
Text::Unidecode
(http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm)
to always generate an ASCII transliterated version of the incoming
text. For example:

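#!/usr/bin/perl
use strict;
use warnings;
use utf8;             # this source file contains UTF-8 string literals
use Text::Unidecode;

# unidecode() returns a plain-ASCII transliteration of its argument
print unidecode('Peter Høeg Yrsa Sigurðardóttir') . "\n";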

---

will print:

Peter Hoeg Yrsa Sigurdardottir

If you do this (essentially, replace the contents of naco_normalize
with unidecode()), you'll want to apply it in both places: when you
index the record itself and when you normalize the search query (so
that searching for 'Høeg' gets turned into a search for 'Hoeg' and
will match the indexed content).

Unless, that is, you also keep a non-normalized copy of the word in
the index so that you can provide more specific searches, or boosts
for exact matches. That gets a bit more complex!
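
To make the last two paragraphs concrete, here is a rough sketch of
the idea, not Evergreen code: a toy in-memory index that runs the same
normalizer over indexed text and over the query, and keeps the
original word around so exact matches can be ranked first. The names
normalize_term(), index_record() and search(), the lowercasing step,
and the scoring scheme are all made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;             # this source file contains UTF-8 string literals
use Text::Unidecode;

# Hypothetical helper shared by the indexing side and the search side.
sub normalize_term {
    my ($text) = @_;
    return lc(unidecode($text));
}

# Toy index: normalized word => { record id => original word }.
my %index;

sub index_record {
    my ($id, $text) = @_;
    for my $word (split ' ', $text) {
        $index{ normalize_term($word) }{$id} = $word;
    }
    return;
}

# Return matching record ids; records whose original (non-normalized)
# word equals the query sort ahead of transliteration-only matches.
sub search {
    my ($query) = @_;
    my $hits = $index{ normalize_term($query) } || {};
    my %score;
    for my $id (keys %$hits) {
        $score{$id} = lc($hits->{$id}) eq lc($query) ? 2 : 1;
    }
    return sort { $score{$b} <=> $score{$a} || $a <=> $b } keys %score;
}

index_record(1, 'Peter Hoeg');
index_record(2, 'Peter Høeg');
index_record(3, 'Yrsa Sigurðardóttir');

print join(',', search('Høeg')), "\n";            # 2,1
print join(',', search('Hoeg')), "\n";            # 1,2
print join(',', search('Sigurdardottir')), "\n";  # 3
```

Both spellings find both Høeg records because the query and the index
go through the same function; the stored original is only used to
decide the order.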

Once you put a modified version of the naco_normalize() functions in
place, you would then have to identify which records contain the
characters of interest ("SELECT id FROM biblio.record_entry WHERE marc
LIKE '%ø%';" as a horribly crude example) and subject them to
reindexing.

The in-database ingest approach in trunk that Mike has put together
offers some more intriguing possibilities for customization. If I
recall correctly (rusty brain), in trunk you can turn specific
normalizations on or off for a given kind of field, so you could
potentially add a "unidecode" normalizer, then turn off naco_normalize
and turn on unidecode normalization if that's your preference.
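
As a rough illustration of that per-field idea (this is not the trunk
schema or API, which I have not reproduced here), a normalizer
registry could look something like the following. The field names,
the 'lowercase' stand-in normalizer, and normalize_for_field() are
all hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;             # this source file contains UTF-8 string literals
use Text::Unidecode;

# Hypothetical registry of named normalizers. 'lowercase' is only a
# stand-in for other normalizers a site might have.
my %normalizers = (
    unidecode => sub { unidecode($_[0]) },
    lowercase => sub { lc($_[0]) },
);

# Hypothetical per-field configuration: which normalizers run, in
# order. Removing a name from a list "turns off" that normalizer.
my %field_normalizers = (
    author  => [ 'unidecode', 'lowercase' ],
    keyword => [ 'lowercase' ],
);

sub normalize_for_field {
    my ($field, $text) = @_;
    for my $name (@{ $field_normalizers{$field} || [] }) {
        $text = $normalizers{$name}->($text);
    }
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print normalize_for_field('author',  'Peter Høeg'), "\n";  # peter hoeg
print normalize_for_field('keyword', 'Peter Høeg'), "\n";  # peter høeg
```

The same configuration would need to drive both indexing and query
normalization for a given field, for the reasons above.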

