[OPEN-ILS-DEV] UTF-8 vs UTF-16 (was: PATCH & RFC: Providing i18n support in OpenILS database schema (diacritics))

Dan Scott denials at gmail.com
Mon Jun 4 15:48:13 EDT 2007


On 04/06/07, Wilkening, Chris <Chris.Wilkening at brodart.com> wrote:
> Does the UTF-8 encoding support diacritics? We've run into problems with
> that and generally go with UTF-16, which has, so far, allowed us to
> maintain diacritics in database records. We generally run into that
> problem with Spanish records but, if my high-school-French-classes
> memory isn't faulty, French (as well as most of the Romance languages)
> has diacritics to one degree or another.

Hi Chris:

MARC21 supports only the MARC8 and UTF-8 encodings, so we have to choose
one or the other. UTF-8 is an encoding of the Unicode standard
character set, which (as of Unicode Standard 5.0) covers more than
99,000 characters, including the accented letters used in Spanish and
French. Unlike MARC8, UTF-8 is widely used in other applications, so I
think we're pretty safe with UTF-8.
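
To illustrate that UTF-8 and UTF-16 are just two encodings of the same
Unicode characters, here is a minimal sketch in Python. It is not taken
from the OpenILS code; the sample words and the round_trips helper are
made up for this example. The accented letters are written as escapes so
the precomposed code points are unambiguous.

```python
# Hypothetical sample words (Spanish and French), chosen only for illustration.
# \u00f1 is n with tilde, \u00e7 is c with cedilla, \u00e9 is e with acute.
SAMPLES = ["a\u00f1o", "ni\u00f1o", "fran\u00e7ais", "\u00e9t\u00e9"]


def round_trips(text, encoding):
    """Return True if text is unchanged after encoding and decoding."""
    return text.encode(encoding).decode(encoding) == text


def byte_lengths(text):
    """Return the encoded size of text as (UTF-8 bytes, UTF-16 bytes)."""
    # "utf-16-le" is used so that no byte order mark is counted.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


if __name__ == "__main__":
    for word in SAMPLES:
        utf8_len, utf16_len = byte_lengths(word)
        print(word, utf8_len, utf16_len,
              round_trips(word, "utf-8"), round_trips(word, "utf-16"))
```

Both encodings should round-trip every sample; only the byte sequences
(and their lengths) differ.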

There's a good overview of Unicode and its various encodings at
http://www.unicode.org/standard/principles.html

I'm not sure why you've run into problems with UTF-8 that are resolved
by UTF-16 -- sounds interesting!

-- 
Dan Scott
Laurentian University

