Perl 5 version 16.1 documentation

Unicode::Collate::Locale

NAME

Unicode::Collate::Locale - Linguistic tailoring for DUCET via Unicode::Collate

SYNOPSIS

  use Unicode::Collate::Locale;

  #construct
  $Collator = Unicode::Collate::Locale->
      new(locale => $locale_name, %tailoring);

  #sort
  @sorted = $Collator->sort(@not_sorted);

  #compare
  $result = $Collator->cmp($a, $b); # returns 1, 0, or -1.

Note: Strings in @not_sorted, $a and $b are interpreted according to Perl's Unicode support. See perlunicode, perluniintro, perlunitut, perlunifaq, utf8. If your strings are not yet in that form, either decode them beforehand or use the preprocess parameter (cf. Unicode::Collate).
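As an illustration of decoding before collation, here is a minimal sketch. It assumes the input arrives as UTF-8 encoded octets (the sample words and the choice of the sv locale are hypothetical) and uses the core Encode module to turn them into character strings first.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Collate::Locale;

# Hypothetical input: UTF-8 encoded octets, e.g. read from a file in raw mode.
# "\xC3\xB6l" is the UTF-8 encoding of "öl" (o with diaeresis, then l).
my @octets = ("\xC3\xB6l", "zebra", "apa");

# Decode the octets into Perl character strings before collating.
my @chars = map { decode('UTF-8', $_) } @octets;

my $collator = Unicode::Collate::Locale->new(locale => 'sv');
my @sorted   = $collator->sort(@chars);

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @sorted;
```

With the Swedish tailoring, the word starting with U+00F6 is expected to sort after "zebra". Sorting the undecoded octets instead would compare the two bytes 0xC3 and 0xB6 as if they were separate characters.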

DESCRIPTION

This module provides linguistic tailoring for DUCET (the Default Unicode Collation Element Table), building on Unicode::Collate.

Constructor

The new method returns a collator object.

The constructor takes a hash of parameters, which may include the special key locale. Its value (case-insensitive) is a two- or three-letter Unicode base language code. For example, Unicode::Collate::Locale->new(locale => 'FR') returns a collator tailored for French.

$locale_name may be suffixed with a four-letter Unicode script code, a Unicode region code, or a Unicode language variant code. These codes are case-insensitive and are separated with '_' or '-'. For example, en_US means English in the USA, az_Cyrl means Azerbaijani in the Cyrillic script, and es_ES_traditional means Spanish in Spain (traditional).

If $locale_name is not available, fallback is selected in the following order:

  1. language with a variant code
  2. language with a script code
  3. language with a region code
  4. language
  5. default
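The fallback can be observed with getlocale (described under Methods). The following sketch uses locale names picked only for illustration and records which tailoring each one resolves to.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Locale names chosen for illustration; getlocale reports what was selected.
my %selected;
for my $name (qw(es_ES_traditional da-DK en_US)) {
    my $collator = Unicode::Collate::Locale->new(locale => $name);
    $selected{$name} = $collator->getlocale;
    printf "%-20s => %s\n", $name, $selected{$name};
}
```

Here es_ES_traditional should fall back to the variant-only tailoring es__traditional, da-DK to the plain language da, and en_US to default, since English follows the default rules.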

Tailoring tags provided by Unicode::Collate are allowed as long as they are not used for locale support. In particular, the table tag can never be tailored, since it is reserved for DUCET.

For example, the following creates a collator for French that ignores diacritics and case differences (i.e. level 1), with reversed case ordering and no normalization.

  Unicode::Collate::Locale->new(
      level => 1,
      locale => 'fr',
      upper_before_lower => 1,
      normalization => undef
  )
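A short usage sketch for such a collator follows. The sample strings are hypothetical, and eq is the comparison method inherited from Unicode::Collate. Accented letters are written as escapes so the example does not depend on the source file encoding.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my $collator = Unicode::Collate::Locale->new(
    level              => 1,
    locale             => 'fr',
    upper_before_lower => 1,
    normalization      => undef,
);

# At level 1 only base letters are compared, so accents and case are ignored.
# "C\x{F4}t\x{E9}" is "Cote" with a circumflex on the o and an acute on the e.
my $same      = $collator->eq("cote", "C\x{F4}t\x{E9}");
my $different = $collator->eq("cote", "cotte");

print "cote vs accented, capitalized form: ", ($same ? "equal" : "different"), "\n";
print "cote vs cotte: ", ($different ? "equal" : "different"), "\n";
```

The first pair should compare as equal and the second as different, because only the second pair differs in its base letters.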

Passing a tailoring to new() that overrides a behavior already tailored by the locale is disallowed.

  Unicode::Collate::Locale->new(
      locale => 'da',
      upper_before_lower => 0, # causes error as reserved by 'da'
  )

However, change(), inherited from Unicode::Collate, allows a tailoring that is reserved by the locale. Examples:

  new(locale => 'ca')->change(backwards => undef)
  new(locale => 'da')->change(upper_before_lower => 0)
  new(locale => 'ja')->change(overrideCJK => undef)
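The difference between the two approaches can be sketched as follows. This is an illustration under the assumption that new() reports the reserved tailoring by dying, so the call is wrapped in eval. The variable names are hypothetical.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Passing a tailoring reserved by 'da' to new() is expected to fail.
my $rejected = eval {
    Unicode::Collate::Locale->new(locale => 'da', upper_before_lower => 0);
};
my $error = $@;
print "new() failed: $error" if $error;

# The same tailoring applied afterwards through change().
my $danish = Unicode::Collate::Locale->new(locale => 'da');
my $before = $danish->cmp('a', 'A');   # Danish default: uppercase first
$danish->change(upper_before_lower => 0);
my $after  = $danish->cmp('a', 'A');   # now lowercase first

print "cmp('a', 'A'): before=$before after=$after\n";
```

Because change() bypasses the check made by new(), it is the caller's responsibility to make sure the resulting combination still makes sense for the language.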

Methods

Unicode::Collate::Locale is a subclass of Unicode::Collate, and all methods other than new are inherited from Unicode::Collate.

Here is a list of additional methods:

  • $Collator->getlocale

    Returns the language code that was accepted and is actually used for collation. If linguistic tailoring is not provided for the language code you passed (intentionally for some languages, or because the implementation is incomplete), this method returns the string 'default', meaning no special tailoring.

  • $Collator->locale_version

    (Since Unicode::Collate::Locale 0.87) Returns the version number (perhaps /\d\.\d\d/) of the locale, which is the version of the corresponding Locale/*.pl file.

    Note: The Locale/*.pl file that a collator uses is identified by the combination of the return values of getlocale and locale_version.
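As a small sketch of combining the two methods to identify the tailoring in use, the following assumes Unicode::Collate::Locale 0.87 or later. The requested locale name is only an example.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my $collator = Unicode::Collate::Locale->new(locale => 'de_DE_phonebook');

my $locale  = $collator->getlocale;        # tailoring actually selected
my $version = $collator->locale_version;   # version of that tailoring

printf "tailoring: %s, version: %s\n",
    $locale, defined $version ? $version : '(none)';
```

Logging both values alongside sorted output can help when results need to be reproduced with a different installation of the module.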

A list of tailorable locales

  locale name       description
  --------------------------------------------------------------
  af                Afrikaans
  ar                Arabic
  as                Assamese
  az                Azerbaijani (Azeri)
  be                Belarusian
  bg                Bulgarian
  bn                Bengali
  bs                Bosnian
  ca                Catalan
  cs                Czech
  cy                Welsh
  da                Danish
  de__phonebook     German (umlaut as 'ae', 'oe', 'ue')
  eo                Esperanto
  es                Spanish
  es__traditional   Spanish ('ch' and 'll' as a grapheme)
  et                Estonian
  fa                Persian
  fi                Finnish (v and w are primary equal)
  fi__phonebook     Finnish (v and w as separate characters)
  fil               Filipino
  fo                Faroese
  fr                French
  gu                Gujarati
  ha                Hausa
  haw               Hawaiian
  hi                Hindi
  hr                Croatian
  hu                Hungarian
  hy                Armenian
  ig                Igbo
  is                Icelandic
  ja                Japanese [1]
  kk                Kazakh
  kl                Kalaallisut
  kn                Kannada
  ko                Korean [2]
  kok               Konkani
  ln                Lingala
  lt                Lithuanian
  lv                Latvian
  mk                Macedonian
  ml                Malayalam
  mr                Marathi
  mt                Maltese
  nb                Norwegian Bokmal
  nn                Norwegian Nynorsk
  nso               Northern Sotho
  om                Oromo
  or                Oriya
  pa                Punjabi
  pl                Polish
  ro                Romanian
  ru                Russian
  sa                Sanskrit
  se                Northern Sami
  si                Sinhala
  si__dictionary    Sinhala (U+0DA5 = U+0DA2,0DCA,0DA4)
  sk                Slovak
  sl                Slovenian
  sq                Albanian
  sr                Serbian
  sr_Latn           Serbian in Latin (tailored as Croatian)
  sv                Swedish (v and w are primary equal)
  sv__reformed      Swedish (v and w as separate characters)
  ta                Tamil
  te                Telugu
  th                Thai
  tn                Tswana
  to                Tonga
  tr                Turkish
  uk                Ukrainian
  ur                Urdu
  vi                Vietnamese
  wae               Walser
  wo                Wolof
  yo                Yoruba
  zh                Chinese
  zh__big5han       Chinese (ideographs: big5 order)
  zh__gb2312han     Chinese (ideographs: GB-2312 order)
  zh__pinyin        Chinese (ideographs: pinyin order) [3]
  zh__stroke        Chinese (ideographs: stroke order) [3]
  --------------------------------------------------------------

Locales according to the default UCA rules include chr (Cherokee), de (German), en (English), ga (Irish), id (Indonesian), it (Italian), ka (Georgian), ms (Malay), nl (Dutch), pt (Portuguese), st (Southern Sotho), sw (Swahili), xh (Xhosa), zu (Zulu).
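To see the effect of choosing a variant from the list of tailorable locales, the following sketch sorts the same words with es and with es__traditional. The word list is hypothetical and was chosen so that the 'ch' grapheme matters.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my @words = qw(cuna chico dama);

# Standard Spanish: 'ch' is just 'c' followed by 'h'.
my @standard = Unicode::Collate::Locale->new(locale => 'es')->sort(@words);

# Traditional Spanish: 'ch' is a single grapheme sorted after 'c'.
my @traditional =
    Unicode::Collate::Locale->new(locale => 'es__traditional')->sort(@words);

print "es:              @standard\n";
print "es__traditional: @traditional\n";
```

With es the order should be chico, cuna, dama. With es__traditional, "chico" is expected to move after "cuna", because every word starting with plain 'c' sorts before the 'ch' grapheme.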

Note

[1] ja: Ideographs are sorted in JIS X 0208 order. Fullwidth and halfwidth forms are identical to their normal forms. The difference between hiragana and katakana is at the 4th level; comparing at that level also requires (variable => 'Non-ignorable'), and katakana_before_hiragana then has no effect.

[2] ko: Plenty of ideographs are sorted by their reading. Such an ideograph is primary (level 1) equal to, and secondary (level 2) greater than, the corresponding hangul syllable.

[3] zh__pinyin and zh__stroke: these implement alt='short', in which a smaller number of ideographs are tailored.

INSTALL

Installation of Unicode::Collate::Locale requires Collate/Locale.pm, Collate/Locale/*.pm, Collate/CJK/*.pm and Collate/allkeys.txt. On building, Unicode::Collate::Locale doesn't require any of data/*.txt, gendata/*, and mklocale. Tests for Unicode::Collate::Locale are named t/loc_*.t.

CAVEAT

  • tailoring is not maximum

    Even if a certain letter is tailored, its equivalents are not always tailored along with it. For example, even though W is tailored, fullwidth W (U+FF37), W with acute (U+1E82), etc. are not tailored. The result may depend on whether source strings are normalized or not, and whether they are decomposed or composed. Thus (normalization => undef) is less preferred.

AUTHOR

The Unicode::Collate::Locale module for perl was written by SADAHIRO Tomoyuki, <SADAHIRO@cpan.org>. This module is Copyright(C) 2004-2012, SADAHIRO Tomoyuki. Japan. All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO