asked Nov 5, 2013 at 6:33

Sorting Vietnamese utf8 index with make-rules in xindy package

I am using xindy to build an index that contains Vietnamese words. It works fine except for the ordering of the words. The accepted "standard" alphabetical order is the following, which differs from xindy's current default order:

a à ả ã á ạ ă ằ ẳ ẵ ắ ặ â ầ ẩ ẫ ấ ậ b c d đ e è ẻ ẽ é ẹ ê ề ể ễ ế ệ f g h i ì ỉ ĩ í ị j k l m n o ò ỏ õ ó ọ ô ồ ổ ỗ ố ộ ơ ờ ở ỡ ớ ợ p q r s t u ù ủ ũ ú ụ ư ừ ử ữ ứ ự v w x y ỳ ỷ ỹ ý ỵ z

I edited vietnamese/utf8.pl.in (attached below) and tested with that version. However, some words are still sorted in an unexpected way. The expected order should be, for example:

Hiên
Hiền
Hiển
Hiễn
Hiến
Hiện

But the index is not sorted in that order: the last two entries, numbers 5 and 6 (Hiến and Hiện), now appear in positions 3 and 4. The same happens with the words Lan, Làn, Lản, Lãn, Lán, Lạn (note the tonal diacritics).
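
To make the expected order concrete, here is a minimal Python sketch of a two-level comparison. It is not part of xindy or make-rules and only illustrates the target result. It assumes that words are compared first by their base letters (a, ă, â, b, ...) and that tone marks are used only as a tie-break, in the order no tone, grave, hook above, tilde, acute, dot below. The names `BASE_ORDER`, `TONE_MARKS`, `split_char` and `sort_key` are hypothetical.

```python
import unicodedata

# Base letters in the expected alphabetical order (assumption: the tone
# marks do not create separate letters, only ă â đ ê ô ơ ư do).
BASE_ORDER = unicodedata.normalize("NFC", "aăâbcdđeêfghijklmnoôơpqrstuưvwxyz")

# Tone marks in the expected order: none, grave, hook above, tilde,
# acute, dot below (as combining characters).
TONE_MARKS = ["", "\u0300", "\u0309", "\u0303", "\u0301", "\u0323"]


def split_char(ch):
    """Split one character into (base letter, tone index)."""
    tone = ""
    rest = []
    for c in unicodedata.normalize("NFD", ch.lower()):
        if c in TONE_MARKS[1:]:
            tone = c          # remember the tone mark
        else:
            rest.append(c)    # base letter plus breve/circumflex/horn
    base = unicodedata.normalize("NFC", "".join(rest)) or ch
    return base, TONE_MARKS.index(tone)


def sort_key(word):
    """Key: all base letters first, tone marks only as a tie-break."""
    letters, tones = [], []
    for ch in unicodedata.normalize("NFC", word):
        base, tone = split_char(ch)
        idx = BASE_ORDER.find(base)
        # Characters outside the alphabet sort after all letters.
        letters.append((0, idx) if idx >= 0 else (1, base))
        tones.append(tone)
    return (tuple(letters), tuple(tones))


if __name__ == "__main__":
    words = ["Hiện", "Hiến", "Hiễn", "Hiển", "Hiền", "Hiên"]
    print(sorted(words, key=sort_key))
```

Sorting the six example words with this key gives the order listed above; the open question is how to get make-rules to produce the same result.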

Anyone?

Here is the revised make-rules perl script for Vietnamese:

#!/usr/bin/perl

$language = "Vietnamese";
$prefix = "vi";
$script = "latin";

$alphabet = [
  ['A', ['a','A'],['à','À'],['ả','Ả'],['ã','Ã'],['á','Á'],['ạ','Ạ']],
  ['Ă', ['ă','Ă'],['ằ','Ằ'],['ẳ','Ẳ'],['ẵ','Ẵ'],['ắ','Ắ'],['ặ','Ặ']],
  ['Â', ['â','Â'],['ầ','Ầ'],['ẩ','Ẩ'],['ẫ','Ẫ'],['ấ','Ấ'],['ậ','Ậ']],
  [], # a with ogonek (polish)
  ['B', ['b','B']],
  [], # b with hook (hausa)
  ['C', ['c','C']],
  [], # ch (spanish/traditonal)
  [], # cs (hungarian)
  [], # c with caron (many)
  [], # c with acute (croatian, lower sorbian, polish)
  [], # c with circumflex (esperanto)
  [], # c with cedilla (albanian, kurdish, turkish)
  ['D', ['d','D']],
  [], # dh (albanian)
  [], # dz (hungarian)
  [], # dzs (hungarian)
  [], # d+z with caron (croatian)
  [], # d+z with acute (upper sorbian)
  [], # d with caron (slovak/large)
  ['Đ', ['đ','Đ']],
  [], # d with hook (hausa)
  [], # eth (icelandic)
  ['E', ['e','E'],['è','È'],['ẻ','Ẻ'],['ẽ','Ẽ'],['é','É'],['ẹ','Ẹ']],
  [], # e with caron (lower/upper sorbian)
  ['Ê', ['ê','Ê'],['ề','Ề'],['ể','Ể'],['ễ','Ễ'],['ế','Ế'],['ệ','Ệ']],
  [], # e with diaeresis (albanian)
  [], # e with ogonek (polish)
  ['F', ['f','F']],
  ['G', ['g','G']],
  [], # gj (albanian)
  [], # gy (hungarian)
  [], # g with circumflex (esperanto)
  [], # g with breve (turkish)
  [], # g with cedilla/comma (latvian)
  [], # postpalatal fricative (gypsy/northrussian)
  ['H', ['h','H']],
  [], # h with circumflex (esperanto)
  [], # ch (many)
  [], # dotless i (turkish)
  ['I', ['i','I'],['ì','Ì'],['ỉ','Ỉ'],['ĩ','Ĩ'],['í','Í'],['ị','Ị']],
  [], # i with inverted breve below (gypsy/northrussian)
  [], # i with circumflex (kurdish, romanian)
  [], # i with diaeresis (gypsy/northrussian)
  ['J', ['j','J']],
  [], # j with circumflex (esperanto)
  ['K', ['k','K']],
  [], # kh (gypsy/northrussian)
  [], # k with cedilla/comma (latvian)
  [], # k with hook (hausa)
  [], # x (gypsy/northrussian)
  [], # l with stroke (lower/upper sorbian)
  ['L', ['l','L']],
  [], # lj (croatian)
  [], # ll (albanian, spanish/traditonal)
  [], # ly (hungarian)
  [], # l with cedilla/comma (latvian)
  [], # l with stroke (polish)
  [], # l with caron (slovak/large)
  ['M', ['m','M']],
  ['N', ['n','N']],
  [], # nj (albanian, croatian)
  [], # ny (hungarian)
  [], # n with caron (slovak/large)
  [], # n with acute (lower/upper sorbian, polish)
  [], # n with tilde (spanish/modern, spanish/traditional)
  [], # n with cedilla/comma (latvian)
  ['O', ['o','O'],['ò','Ò'],['ỏ','Ỏ'],['õ','Õ'],['ó','Ó'],['ọ','Ọ']],
  [], # o with acute (polish, upper sorbian)
  ['Ô', ['ô','Ô'],['ồ','Ồ'],['ổ','Ổ'],['ỗ','Ỗ'],['ố','Ố'],['ộ','Ộ']],
  ['Ơ', ['ơ','Ơ'],['ờ','Ờ'],['ở','Ở'],['ỡ','Ỡ'],['ớ','Ớ'],['ợ','Ợ']],
  [], # o with diaeresis (hungarian, turkish)
  ['P', ['p','P']],
  [], # ph (gypsy/northrussian)
  ['Q', ['q','Q']],
  ['R', ['r','R']],
  [], # rr (albanian)
  [], # r with caron (czech, slovak/large, upper sorbian)
  [], # r with acute (lower sorbian)
  [], # r with cedilla/comma (latvian)
  ['S', ['s','S']],
  [], # sh (albanian)
  [], # sz (hungarian)
  [], # s with caron (many)
  [], # s with acute (lower sorbian, polish)
  [], # s with circumflex (esperanto)
  [], # s with comma below (romanian)
  [], # s with cedilla (kurdish, turkish)
  [], # z (estonian)
  [], # z with caron (estonian)
  ['T', ['t','T']],
  [], # th (albanian)
  [], # ty (hungarian)
  [], # t with caron (slovak/large)
  [], # t with comma below (romanian)
  [], # c with acute (upper sorbian) @@
  ['U', ['u','U'],['ù','Ù'],['ủ','Ủ'],['ũ','Ũ'],['ú','Ú'],['ụ','Ụ']],
  [], # u with breve (esperanto)
  [], # u with circumflex (kurdish)
  ['Ư', ['ư','Ư'],['ừ','Ừ'],['ử','Ử'],['ữ','Ữ'],['ứ','Ứ'],['ự','Ự']],
  [], # u with diaeresis (hungarian, turkish)
  ['V', ['v','V']],
  ['W', ['w','W']],
  [], # o with tilde (estonian)
  [], # a with diaeresis (estonian)
  [], # o with diaeresis (estonian)
  [], # u with diaeresis (estonian)
  ['X', ['x','X']],
  [], # xh (albanian)
  ['Y', ['y','Y'],['ỳ','Ỳ'],['ỷ','Ỷ'],['ỹ','Ỹ'],['ý','Ý'],['ỵ','Ỵ']],
  [], # y preceded by apostrophe (hausa)
  [], # yogh (english)
  ['Z', ['z','Z']],
  [], # zh (albanian)
  [], # zs (hungarian)
  [], # z with caron (many)
  [], # z with acute (lower sorbian, polish)
  [], # z with dot above (polish)
  [], # thorn (icelandic)
  [], # wynn (english)
  [], # ligature ae (danish, icelandic, norwegian)
  [], # o with stroke (danish, norwegian)
  [], # a with ring above (danish, norwegian, swedish)
  [], # a with diaeresis (finnish, swedish)
  [], # o with diaeresis (finnish, swedish)
  [], # a with ring above (icelandic)
];

$sortcase = 'Aa';
#$sortcase = 'aA';

$ligatures = [
];

@special = ('?', '!', '.', 'letters', '-', '\'');

do 'make-rules.pl';

user39400