1<html>
2
3
4
5<head>
6
7<meta NAME="GENERATOR" CONTENT="Microsoft FrontPage 4.0">
8
9<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
10
11<link REL="stylesheet" HREF="http://www.unicode.org/unicode.css" TYPE="text/css">
12
13<title>UnicodeData File Format</title>
14
15</head>
16
17
18
19<body>
20
21
22
23<h1>UnicodeData File Format<br>
24Version 3.0.0</h1>
25
26
27
28<table BORDER="1" CELLSPACING="2" CELLPADDING="0" HEIGHT="87" WIDTH="100%">
29
30 <tr>
31
32 <td VALIGN="TOP" width="144">Revision</td>
33
34 <td VALIGN="TOP">3.0.0</td>
35
36 </tr>
37
38 <tr>
39
40 <td VALIGN="TOP" width="144">Authors</td>
41
42 <td VALIGN="TOP">Mark Davis and Ken Whistler</td>
43
44 </tr>
45
46 <tr>
47
48 <td VALIGN="TOP" width="144">Date</td>
49
50 <td VALIGN="TOP">1999-09-12</td>
51
52 </tr>
53
54 <tr>
55
56 <td VALIGN="TOP" width="144">This Version</td>
57
58 <td VALIGN="TOP"><a href="ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html">ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html</a></td>
59
60 </tr>
61
62 <tr>
63
64 <td VALIGN="TOP" width="144">Previous Version</td>
65
66 <td VALIGN="TOP">n/a</td>
67
68 </tr>
69
70 <tr>
71
72 <td VALIGN="TOP" width="144">Latest Version</td>
73
74 <td VALIGN="TOP"><a href="ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html">ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html</a></td>
75
76 </tr>
77
78</table>
79
80
81
82<p align="center">Copyright © 1995-1999 Unicode, Inc. All Rights reserved.<br>
83
<i>For more information, including Disclaimer and Limitations, see <a HREF="UnicodeCharacterDatabase-3.0.0.html">UnicodeCharacterDatabase-3.0.0.html</a> </i></p>
85
86
87
88<p>This document describes the format of the UnicodeData.txt file, which is one of the
89
90files in the Unicode Character Database. The document is divided into the following
91
92sections:
93
94
95
96<ul>
97
98 <li><a HREF="#Field Formats">Field Formats</a> <ul>
99
100 <li><a HREF="#General Category">General Category</a> </li>
101
102 <li><a HREF="#Bidirectional Category">Bidirectional Category</a> </li>
103
104 <li><a HREF="#Character Decomposition">Character Decomposition Mapping</a> </li>
105
106 <li><a HREF="#Canonical Combining Classes">Canonical Combining Classes</a> </li>
107
108 <li><a HREF="#Decompositions and Normalization">Decompositions and Normalization</a> </li>
109
110 <li><a HREF="#Case Mappings">Case Mappings</a> </li>
111
112 </ul>
113
114 </li>
115
116 <li><a HREF="#Property Invariants">Property Invariants</a> </li>
117
118 <li><a HREF="#Modification History">Modification History</a> </li>
119
120</ul>
121
122
123
124<p><b>Warning: </b>the information in this file does not completely describe the use and
125
126interpretation of Unicode character properties and behavior. It must be used in
127
128conjunction with the data in the other files in the Unicode Character Database, and relies
129
130on the notation and definitions supplied in <i><a href="http://www.unicode.org/unicode/standard/versions/Unicode3.0.html"> The Unicode
131Standard</a></i>. All chapter references
132
133are to Version 3.0 of the standard.</p>
134
135
136
137<h2><a NAME="Field Formats"></a>Field Formats</h2>
138
139
140
141<p>The file consists of lines containing fields terminated by semicolons. Each line
142
143represents the data for one encoded character in the Unicode Standard. Every encoded
144
145character has a data entry, with the exception of certain special ranges, as detailed
146
147below.
148
149
150
151<ul>
152
 <li>There are seven special ranges of characters that are represented only by their start and
154
155 end characters, since the properties in the file are uniform, except for code values
156
157 (which are all sequential and assigned). </li>
158
159 <li>The names of CJK ideograph characters and the names and decompositions of Hangul
160
161 syllable characters are algorithmically derivable. (See the Unicode Standard and <a
162
163 HREF="http://www.unicode.org/unicode/reports/tr15/">Unicode Technical Report #15</a> for
164
165 more information). </li>
166
167 <li>Surrogate code values and private use characters have no names. </li>
168
 <li>The Private Use characters outside of the BMP (U+F0000..U+FFFFD, U+100000..U+10FFFD) are
170
171 not listed. These correspond to surrogate pairs where the first surrogate is in the High
172
173 Surrogate Private Use section. </li>
174
175</ul>
176
177
178
179<p>The exact ranges represented by start and end characters are:
180
181
182
183<ul>
184
185 <li>CJK Ideographs Extension A (U+3400 - U+4DB5) </li>
186
187 <li>CJK Ideographs (U+4E00 - U+9FA5) </li>
188
189 <li>Hangul Syllables (U+AC00 - U+D7A3) </li>
190
191 <li>Non-Private Use High Surrogates (U+D800 - U+DB7F) </li>
192
193 <li>Private Use High Surrogates (U+DB80 - U+DBFF) </li>
194
195 <li>Low Surrogates (U+DC00 - U+DFFF) </li>
196
197 <li>The Private Use Area (U+E000 - U+F8FF) </li>
198
199</ul>
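<p>As a rough illustration of how a reader of the file might recover these ranges, the
sketch below pairs up start and end records. It assumes that the two endpoint records carry
names ending in ", First" and ", Last" inside angle brackets; that naming convention is an
assumption about the data file and is not stated in this document. The function name
<tt>find_ranges</tt> and the sample records are illustrative only.</p>

```python
# Sketch: collect (start, end, label) ranges from First/Last record pairs.
# ASSUMPTION: range endpoints are named "<Label, First>" and "<Label, Last>".

def find_ranges(lines):
    """Return a list of (start, end, label) tuples for First/Last pairs."""
    ranges = []
    pending = None  # (start_code, label) of an open "First" record
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) < 2:
            continue  # not a data record
        code, name = int(fields[0], 16), fields[1]
        if name.startswith("<") and name.endswith(", First>"):
            pending = (code, name[1:-len(", First>")])
        elif name.startswith("<") and name.endswith(", Last>") and pending:
            ranges.append((pending[0], code, pending[1]))
            pending = None
    return ranges

# Illustrative sample records (hand-written for this sketch).
sample = [
    "3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;",
    "4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;",
    "AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;",
    "D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;",
]
RANGES = find_ranges(sample)
```

<p>Every code value between a start and end record then shares the properties given on
those two records.</p>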
200
201
202
203<p>The following table describes the format and meaning of each field in a data entry in
204
205the UnicodeData file. Fields which contain normative information are so indicated.</p>
206
207
208
209<table BORDER="1" CELLSPACING="2" CELLPADDING="2">
210
211 <tr>
212
213 <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Field</th>
214
215 <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Name</th>
216
217 <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Status</th>
218
219 <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Explanation</th>
220
221 </tr>
222
223 <tr>
224
225 <th VALIGN="top">0</th>
226
227 <td VALIGN="top">Code value</td>
228
229 <td VALIGN="top">normative</td>
230
231 <td VALIGN="top">Code value in 4-digit hexadecimal format.</td>
232
233 </tr>
234
235 <tr>
236
237 <th VALIGN="top">1</th>
238
239 <td VALIGN="top">Character name</td>
240
241 <td VALIGN="top">normative</td>
242
243 <td VALIGN="top">These names match exactly the names published in Chapter 14 of the
244
245 Unicode Standard, Version 3.0.</td>
246
247 </tr>
248
249 <tr>
250
251 <th VALIGN="top">2</th>
252
253 <td VALIGN="top"><a HREF="#General Category">General Category</a> </td>
254
255 <td VALIGN="top">normative / informative<br>
256
257 (see below)</td>
258
259 <td VALIGN="top">This is a useful breakdown into various &quot;character types&quot; which
260
261 can be used as a default categorization in implementations. See below for a brief
262
263 explanation.</td>
264
265 </tr>
266
267 <tr>
268
269 <th VALIGN="top">3</th>
270
271 <td VALIGN="top"><a HREF="#Canonical Combining Classes">Canonical Combining Classes</a> </td>
272
273 <td VALIGN="top">normative</td>
274
275 <td VALIGN="top">The classes used for the Canonical Ordering Algorithm in the Unicode
276
277 Standard. These classes are also printed in Chapter 4 of the Unicode Standard.</td>
278
279 </tr>
280
281 <tr>
282
283 <th VALIGN="top">4</th>
284
285 <td VALIGN="top"><a HREF="#Bidirectional Category">Bidirectional Category</a> </td>
286
287 <td VALIGN="top">normative</td>
288
289 <td VALIGN="top">See the list below for an explanation of the abbreviations used in this
290
291 field. These are the categories required by the Bidirectional Behavior Algorithm in the
292
293 Unicode Standard. These categories are summarized in Chapter 3 of the Unicode Standard.</td>
294
295 </tr>
296
297 <tr>
298
299 <th VALIGN="top">5</th>
300
301 <td VALIGN="top"><a HREF="#Character Decomposition">Character Decomposition
302 Mapping</a></td>
303
304 <td VALIGN="top">normative</td>
305
306 <td VALIGN="top">In the Unicode Standard, not all of the mappings are full (maximal)
307
308 decompositions. Recursive application of look-up for decompositions will, in all cases,
309
310 lead to a maximal decomposition. The decomposition mappings match exactly the
311
312 decomposition mappings published with the character names in the Unicode Standard.</td>
313
314 </tr>
315
316 <tr>
317
318 <th VALIGN="top">6</th>
319
320 <td VALIGN="top">Decimal digit value</td>
321
322 <td VALIGN="top">normative</td>
323
324 <td VALIGN="top">This is a numeric field. If the character has the decimal digit property,
325
326 as specified in Chapter 4 of the Unicode Standard, the value of that digit is represented
327
 with an integer value in this field.</td>
329
330 </tr>
331
332 <tr>
333
334 <th VALIGN="top">7</th>
335
336 <td VALIGN="top">Digit value</td>
337
338 <td VALIGN="top">normative</td>
339
340 <td VALIGN="top">This is a numeric field. If the character represents a digit, not
341
342 necessarily a decimal digit, the value is here. This covers digits which do not form
343
 decimal radix forms, such as the compatibility superscript digits.</td>
345
346 </tr>
347
348 <tr>
349
350 <th VALIGN="top">8</th>
351
352 <td VALIGN="top">Numeric value</td>
353
354 <td VALIGN="top">normative</td>
355
356 <td VALIGN="top">This is a numeric field. If the character has the numeric property, as
357
358 specified in Chapter 4 of the Unicode Standard, the value of that character is represented
359
360 with an integer or rational number in this field. This includes fractions as, e.g.,
361
 &quot;1/5&quot; for U+2155 VULGAR FRACTION ONE FIFTH. Also included are numerical values
363
364 for compatibility characters such as circled numbers.</td>
365
366 </tr>
367
368 <tr>
369
 <th VALIGN="top">9</th>
371
372 <td VALIGN="top">Mirrored</td>
373
374 <td VALIGN="top">normative</td>
375
376 <td VALIGN="top">If the character has been identified as a &quot;mirrored&quot; character
377
378 in bidirectional text, this field has the value &quot;Y&quot;; otherwise &quot;N&quot;.
379
380 The list of mirrored characters is also printed in Chapter 4 of the Unicode Standard.</td>
381
382 </tr>
383
384 <tr>
385
386 <th VALIGN="top">10</th>
387
388 <td VALIGN="top">Unicode 1.0 Name</td>
389
390 <td VALIGN="top">informative</td>
391
392 <td VALIGN="top">This is the old name as published in Unicode 1.0. This name is only
393
394 provided when it is significantly different from the Unicode 3.0 name for the character.</td>
395
396 </tr>
397
398 <tr>
399
400 <th VALIGN="top">11</th>
401
402 <td VALIGN="top">10646 comment field</td>
403
404 <td VALIGN="top">informative</td>
405
 <td VALIGN="top">This is the ISO 10646 comment field. It is in parentheses in the 10646
407
408 names list.</td>
409
410 </tr>
411
412 <tr>
413
414 <th VALIGN="top">12</th>
415
416 <td VALIGN="top"><a HREF="#Case Mappings">Uppercase Mapping</a></td>
417
418 <td VALIGN="top">informative</td>
419
420 <td VALIGN="top">Upper case equivalent mapping. If a character is part of an alphabet with
421
422 case distinctions, and has an upper case equivalent, then the upper case equivalent is in
423
424 this field. See the explanation below on case distinctions. These mappings are always
425
426 one-to-one, not one-to-many or many-to-one. This field is informative.</td>
427
428 </tr>
429
430 <tr>
431
432 <th VALIGN="top">13</th>
433
434 <td VALIGN="top"><a HREF="#Case Mappings">Lowercase Mapping</a></td>
435
436 <td VALIGN="top">informative</td>
437
438 <td VALIGN="top">Similar to Uppercase mapping</td>
439
440 </tr>
441
442 <tr>
443
444 <th VALIGN="top">14</th>
445
446 <td VALIGN="top"><a HREF="#Case Mappings">Titlecase Mapping</a></td>
447
448 <td VALIGN="top">informative</td>
449
450 <td VALIGN="top">Similar to Uppercase mapping</td>
451
452 </tr>
453
454</table>
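<p>The field layout in the table above can be read with a few lines of code. The following
is a minimal sketch, not a reference parser: it assumes each record splits into exactly 15
semicolon-separated fields (numbered 0 to 14 as in the table), and the names
<tt>Record</tt> and <tt>parse_record</tt> are invented for illustration. The sample record
is written out by hand to show the format.</p>

```python
from fractions import Fraction
from typing import NamedTuple, Optional

class Record(NamedTuple):
    code: int                       # field 0
    name: str                       # field 1
    category: str                   # field 2
    combining_class: int            # field 3
    bidi: str                       # field 4
    decomposition: str              # field 5 (raw text, tag included)
    decimal_digit: Optional[int]    # field 6
    digit: Optional[int]            # field 7
    numeric: Optional[Fraction]     # field 8, integer or rational such as 1/5
    mirrored: bool                  # field 9, "Y" or "N"
    unicode1_name: str              # field 10
    comment: str                    # field 11
    upper: Optional[int]            # field 12
    lower: Optional[int]            # field 13
    title: Optional[int]            # field 14

def _hex(text):
    return int(text, 16) if text else None

def _int(text):
    return int(text) if text else None

def parse_record(line):
    """Parse one data line; assumes exactly 15 fields."""
    f = line.rstrip("\n").split(";")
    if len(f) != 15:
        raise ValueError("expected 15 fields, got %d" % len(f))
    return Record(
        code=int(f[0], 16), name=f[1], category=f[2],
        combining_class=int(f[3]), bidi=f[4], decomposition=f[5],
        decimal_digit=_int(f[6]), digit=_int(f[7]),
        numeric=Fraction(f[8]) if f[8] else None,
        mirrored=(f[9] == "Y"), unicode1_name=f[10], comment=f[11],
        upper=_hex(f[12]), lower=_hex(f[13]), title=_hex(f[14]),
    )

# Illustrative sample record for U+2155.
SAMPLE = ("2155;VULGAR FRACTION ONE FIFTH;No;0;ON;"
          "<fraction> 0031 2044 0035;;;1/5;N;FRACTION ONE FIFTH;;;;")
rec = parse_record(SAMPLE)
```

<p>Keeping the numeric value as a rational avoids rounding values such as 1/5; empty
fields are mapped to <tt>None</tt> rather than to zero so that "absent" and "zero" stay
distinct.</p>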
455
456
457
458<h3><a NAME="General Category"></a>General Category</h3>
459
460
461
462<p>The values in this field are abbreviations for the following. Some of the values are
463
464normative, and some are informative. For more information, see the Unicode Standard.</p>
465
466
467
468<p><b>Note:</b> the standard does not assign information to control characters (except for
469
470certain cases in the Bidirectional Algorithm). Implementations will generally also assign
471
472categories to certain control characters, notably CR and LF, according to platform
473
474conventions.</p>
475
476
477
478<h4>Normative Categories</h4>
479
480
481
482<table BORDER="0" CELLSPACING="2" CELLPADDING="0">
483
484 <tr>
485
486 <th><p ALIGN="LEFT">Abbr.</th>
487
488 <th><p ALIGN="LEFT">Description</th>
489
490 </tr>
491
492 <tr>
493
494 <td ALIGN="CENTER">Lu</td>
495
496 <td>Letter, Uppercase</td>
497
498 </tr>
499
500 <tr>
501
502 <td ALIGN="CENTER">Ll</td>
503
504 <td>Letter, Lowercase</td>
505
506 </tr>
507
508 <tr>
509
510 <td ALIGN="CENTER">Lt</td>
511
512 <td>Letter, Titlecase</td>
513
514 </tr>
515
516 <tr>
517
518 <td ALIGN="CENTER">Mn</td>
519
520 <td>Mark, Non-Spacing</td>
521
522 </tr>
523
524 <tr>
525
526 <td ALIGN="CENTER">Mc</td>
527
528 <td>Mark, Spacing Combining</td>
529
530 </tr>
531
532 <tr>
533
534 <td ALIGN="CENTER">Me</td>
535
536 <td>Mark, Enclosing</td>
537
538 </tr>
539
540 <tr>
541
542 <td ALIGN="CENTER">Nd</td>
543
544 <td>Number, Decimal Digit</td>
545
546 </tr>
547
548 <tr>
549
550 <td ALIGN="CENTER">Nl</td>
551
552 <td>Number, Letter</td>
553
554 </tr>
555
556 <tr>
557
558 <td ALIGN="CENTER">No</td>
559
560 <td>Number, Other</td>
561
562 </tr>
563
564 <tr>
565
566 <td ALIGN="CENTER">Zs</td>
567
568 <td>Separator, Space</td>
569
570 </tr>
571
572 <tr>
573
574 <td ALIGN="CENTER">Zl</td>
575
576 <td>Separator, Line</td>
577
578 </tr>
579
580 <tr>
581
582 <td ALIGN="CENTER">Zp</td>
583
584 <td>Separator, Paragraph</td>
585
586 </tr>
587
588 <tr>
589
590 <td ALIGN="CENTER">Cc</td>
591
592 <td>Other, Control</td>
593
594 </tr>
595
596 <tr>
597
598 <td ALIGN="CENTER">Cf</td>
599
600 <td>Other, Format</td>
601
602 </tr>
603
604 <tr>
605
606 <td ALIGN="CENTER">Cs</td>
607
608 <td>Other, Surrogate</td>
609
610 </tr>
611
612 <tr>
613
614 <td ALIGN="CENTER">Co</td>
615
616 <td>Other, Private Use</td>
617
618 </tr>
619
620 <tr>
621
622 <td ALIGN="CENTER">Cn</td>
623
624 <td>Other, Not Assigned (no characters in the file have this property)</td>
625
626 </tr>
627
628</table>
629
630
631
632<h4>Informative Categories</h4>
633
634
635
636<table BORDER="0" CELLSPACING="2" CELLPADDING="0">
637
638 <tr>
639
640 <th><p ALIGN="LEFT">Abbr.</th>
641
642 <th><p ALIGN="LEFT">Description</th>
643
644 </tr>
645
646 <tr>
647
648 <td ALIGN="CENTER">Lm</td>
649
650 <td>Letter, Modifier</td>
651
652 </tr>
653
654 <tr>
655
656 <td ALIGN="CENTER">Lo</td>
657
658 <td>Letter, Other</td>
659
660 </tr>
661
662 <tr>
663
664 <td ALIGN="CENTER">Pc</td>
665
666 <td>Punctuation, Connector</td>
667
668 </tr>
669
670 <tr>
671
672 <td ALIGN="CENTER">Pd</td>
673
674 <td>Punctuation, Dash</td>
675
676 </tr>
677
678 <tr>
679
680 <td ALIGN="CENTER">Ps</td>
681
682 <td>Punctuation, Open</td>
683
684 </tr>
685
686 <tr>
687
688 <td ALIGN="CENTER">Pe</td>
689
690 <td>Punctuation, Close</td>
691
692 </tr>
693
694 <tr>
695
696 <td ALIGN="CENTER">Pi</td>
697
698 <td>Punctuation, Initial quote (may behave like Ps or Pe depending on usage)</td>
699
700 </tr>
701
702 <tr>
703
704 <td ALIGN="CENTER">Pf</td>
705
706 <td>Punctuation, Final quote (may behave like Ps or Pe depending on usage)</td>
707
708 </tr>
709
710 <tr>
711
712 <td ALIGN="CENTER">Po</td>
713
714 <td>Punctuation, Other</td>
715
716 </tr>
717
718 <tr>
719
720 <td ALIGN="CENTER">Sm</td>
721
722 <td>Symbol, Math</td>
723
724 </tr>
725
726 <tr>
727
728 <td ALIGN="CENTER">Sc</td>
729
730 <td>Symbol, Currency</td>
731
732 </tr>
733
734 <tr>
735
736 <td ALIGN="CENTER">Sk</td>
737
738 <td>Symbol, Modifier</td>
739
740 </tr>
741
742 <tr>
743
744 <td ALIGN="CENTER">So</td>
745
746 <td>Symbol, Other</td>
747
748 </tr>
749
750</table>
751
752
753
754<h3><a NAME="Bidirectional Category"></a>Bidirectional Category</h3>
755
756
757
758<p>Please refer to Chapter 3 for an explanation of the algorithm for Bidirectional
759
760Behavior and an explanation of the significance of these categories. An up-to-date version
761
762can be found on <a HREF="http://www.unicode.org/unicode/reports/tr9/">Unicode Technical
763
764Report #9: The Bidirectional Algorithm</a>. These values are normative.</p>
765
766
767
768<table BORDER="0" CELLPADDING="2">
769
770 <tr>
771
772 <th VALIGN="TOP" ALIGN="LEFT"><p ALIGN="LEFT">Type</th>
773
774 <th VALIGN="TOP" ALIGN="LEFT"><p ALIGN="LEFT">Description</th>
775
776 </tr>
777
778 <tr>
779
780 <td VALIGN="TOP"><b>L</b></td>
781
782 <td VALIGN="TOP">Left-to-Right</td>
783
784 </tr>
785
786 <tr>
787
788 <td VALIGN="TOP"><b>LRE</b></td>
789
790 <td VALIGN="TOP">Left-to-Right Embedding</td>
791
792 </tr>
793
794 <tr>
795
796 <td VALIGN="TOP"><b>LRO</b></td>
797
798 <td VALIGN="TOP">Left-to-Right Override</td>
799
800 </tr>
801
802 <tr>
803
804 <td VALIGN="TOP"><b>R</b></td>
805
806 <td VALIGN="TOP">Right-to-Left</td>
807
808 </tr>
809
810 <tr>
811
812 <td VALIGN="TOP"><b>AL</b></td>
813
814 <td VALIGN="TOP">Right-to-Left Arabic</td>
815
816 </tr>
817
818 <tr>
819
820 <td VALIGN="TOP"><b>RLE</b></td>
821
822 <td VALIGN="TOP">Right-to-Left Embedding</td>
823
824 </tr>
825
826 <tr>
827
828 <td VALIGN="TOP"><b>RLO</b></td>
829
830 <td VALIGN="TOP">Right-to-Left Override</td>
831
832 </tr>
833
834 <tr>
835
836 <td VALIGN="TOP"><b>PDF</b></td>
837
838 <td VALIGN="TOP">Pop Directional Format</td>
839
840 </tr>
841
842 <tr>
843
844 <td VALIGN="TOP"><b>EN</b></td>
845
846 <td VALIGN="TOP">European Number</td>
847
848 </tr>
849
850 <tr>
851
852 <td VALIGN="TOP"><b>ES</b></td>
853
854 <td VALIGN="TOP">European Number Separator</td>
855
856 </tr>
857
858 <tr>
859
860 <td VALIGN="TOP"><b>ET</b></td>
861
862 <td VALIGN="TOP">European Number Terminator</td>
863
864 </tr>
865
866 <tr>
867
868 <td VALIGN="TOP"><b>AN</b></td>
869
870 <td VALIGN="TOP">Arabic Number</td>
871
872 </tr>
873
874 <tr>
875
876 <td VALIGN="TOP"><b>CS</b></td>
877
878 <td VALIGN="TOP">Common Number Separator</td>
879
880 </tr>
881
882 <tr>
883
884 <td VALIGN="TOP"><b>NSM</b></td>
885
886 <td VALIGN="TOP">Non-Spacing Mark</td>
887
888 </tr>
889
890 <tr>
891
892 <td VALIGN="TOP"><b>BN</b></td>
893
894 <td VALIGN="TOP">Boundary Neutral</td>
895
896 </tr>
897
898 <tr>
899
900 <td VALIGN="TOP"><b>B</b></td>
901
902 <td VALIGN="TOP">Paragraph Separator</td>
903
904 </tr>
905
906 <tr>
907
908 <td VALIGN="TOP"><b>S</b></td>
909
910 <td VALIGN="TOP">Segment Separator</td>
911
912 </tr>
913
914 <tr>
915
916 <td VALIGN="TOP"><b>WS</b></td>
917
918 <td VALIGN="TOP">Whitespace</td>
919
920 </tr>
921
922 <tr>
923
924 <td VALIGN="TOP"><b>ON</b></td>
925
926 <td VALIGN="TOP">Other Neutrals</td>
927
928 </tr>
929
930</table>
931
932
933
934<h3><a NAME="Character Decomposition"></a>Character Decomposition Mapping</h3>
935
936
937
938<p>The decomposition is a normative property of a character. The tags supplied with
939
940certain decomposition mappings generally indicate formatting information. Where no such
941
942tag is given, the mapping is designated as canonical. Conversely, the presence of a
943
944formatting tag also indicates that the mapping is a compatibility mapping and not a
945
946canonical mapping. In the absence of other formatting information in a compatibility
947
948mapping, the tag is used to distinguish it from canonical mappings.</p>
949
950
951
952<p>In some instances a canonical mapping or a compatibility mapping may consist of a
953
954single character. For a canonical mapping, this indicates that the character is a
955
956canonical equivalent of another single character. For a compatibility mapping, this
957
958indicates that the character is a compatibility equivalent of another single character.
959
960The compatibility formatting tags used are:</p>
961
962
963
964<table BORDER="0" CELLSPACING="2" CELLPADDING="0">
965
966 <tr>
967
968 <th>Tag</th>
969
970 <th><p ALIGN="LEFT">Description</th>
971
972 </tr>
973
974 <tr>
975
976 <td ALIGN="CENTER">&lt;font&gt;&nbsp;&nbsp;</td>
977
978 <td>A font variant (e.g. a blackletter form).</td>
979
980 </tr>
981
982 <tr>
983
984 <td ALIGN="CENTER">&lt;noBreak&gt;&nbsp;&nbsp;</td>
985
986 <td>A no-break version of a space or hyphen.</td>
987
988 </tr>
989
990 <tr>
991
992 <td ALIGN="CENTER">&lt;initial&gt;&nbsp;&nbsp;</td>
993
994 <td>An initial presentation form (Arabic).</td>
995
996 </tr>
997
998 <tr>
999
1000 <td ALIGN="CENTER">&lt;medial&gt;&nbsp;&nbsp;</td>
1001
1002 <td>A medial presentation form (Arabic).</td>
1003
1004 </tr>
1005
1006 <tr>
1007
1008 <td ALIGN="CENTER">&lt;final&gt;&nbsp;&nbsp;</td>
1009
1010 <td>A final presentation form (Arabic).</td>
1011
1012 </tr>
1013
1014 <tr>
1015
1016 <td ALIGN="CENTER">&lt;isolated&gt;&nbsp;&nbsp;</td>
1017
1018 <td>An isolated presentation form (Arabic).</td>
1019
1020 </tr>
1021
1022 <tr>
1023
1024 <td ALIGN="CENTER">&lt;circle&gt;&nbsp;&nbsp;</td>
1025
1026 <td>An encircled form.</td>
1027
1028 </tr>
1029
1030 <tr>
1031
1032 <td ALIGN="CENTER">&lt;super&gt;&nbsp;&nbsp;</td>
1033
1034 <td>A superscript form.</td>
1035
1036 </tr>
1037
1038 <tr>
1039
1040 <td ALIGN="CENTER">&lt;sub&gt;&nbsp;&nbsp;</td>
1041
1042 <td>A subscript form.</td>
1043
1044 </tr>
1045
1046 <tr>
1047
1048 <td ALIGN="CENTER">&lt;vertical&gt;&nbsp;&nbsp;</td>
1049
1050 <td>A vertical layout presentation form.</td>
1051
1052 </tr>
1053
1054 <tr>
1055
1056 <td ALIGN="CENTER">&lt;wide&gt;&nbsp;&nbsp;</td>
1057
1058 <td>A wide (or zenkaku) compatibility character.</td>
1059
1060 </tr>
1061
1062 <tr>
1063
1064 <td ALIGN="CENTER">&lt;narrow&gt;&nbsp;&nbsp;</td>
1065
1066 <td>A narrow (or hankaku) compatibility character.</td>
1067
1068 </tr>
1069
1070 <tr>
1071
1072 <td ALIGN="CENTER">&lt;small&gt;&nbsp;&nbsp;</td>
1073
1074 <td>A small variant form (CNS compatibility).</td>
1075
1076 </tr>
1077
1078 <tr>
1079
1080 <td ALIGN="CENTER">&lt;square&gt;&nbsp;&nbsp;</td>
1081
1082 <td>A CJK squared font variant.</td>
1083
1084 </tr>
1085
1086 <tr>
1087
1088 <td ALIGN="CENTER">&lt;fraction&gt;&nbsp;&nbsp;</td>
1089
1090 <td>A vulgar fraction form.</td>
1091
1092 </tr>
1093
1094 <tr>
1095
1096 <td ALIGN="CENTER">&lt;compat&gt;&nbsp;&nbsp;</td>
1097
1098 <td>Otherwise unspecified compatibility character.</td>
1099
1100 </tr>
1101
1102</table>
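<p>To make the distinction between tagged and untagged mappings concrete, here is a small
sketch that splits the decomposition field into its optional tag and its code values. It
assumes the tag, when present, is the first space-separated token and is enclosed in angle
brackets; <tt>parse_decomposition</tt> is a hypothetical helper name.</p>

```python
def parse_decomposition(field):
    """Split field 5 into (tag, code_points).

    tag is None for a canonical mapping and the tag name (without the
    angle brackets) for a compatibility mapping.
    """
    if not field:
        return None, []          # character has no decomposition mapping
    parts = field.split()
    tag = None
    if parts[0].startswith("<") and parts[0].endswith(">"):
        tag = parts[0][1:-1]
        parts = parts[1:]
    return tag, [int(p, 16) for p in parts]

canonical = parse_decomposition("0041 030A")
compat = parse_decomposition("<fraction> 0031 2044 0035")
```

<p>A caller can then treat <tt>tag is None</tt> as the test for a canonical mapping.</p>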
1103
1104
1105
1106<p><b>Reminder: </b>There is a difference between decomposition and decomposition mapping.
1107
1108The decomposition mappings are defined in the UnicodeData, while the decomposition (also
1109
1110termed &quot;full decomposition&quot;) is defined in Chapter 3 to use those mappings
1111<i>
1112
1113recursively.</i>
1114
1115
1116
1117<ul>
1118
1119 <li>The canonical decomposition is formed by recursively applying the canonical mappings,
1120
1121 then applying the canonical reordering algorithm. </li>
1122
1123 <li>The compatibility decomposition is formed by recursively applying the canonical <em>and</em>
1124
1125 compatibility mappings, then applying the canonical reordering algorithm. </li>
1126
1127</ul>
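<p>The recursive step described in the two items above might be sketched as follows. This
covers only the recursive application of mappings; canonical reordering is a separate step
and is not performed here. The mapping table is a hand-written excerpt in the form
(tag, code values), and <tt>full_decomposition</tt> is an illustrative name.</p>

```python
# Hand-written excerpt: code value -> (tag or None, list of code values).
MAPPINGS = {
    0x01FA: (None, [0x00C5, 0x0301]),   # A WITH RING ABOVE AND ACUTE
    0x00C5: (None, [0x0041, 0x030A]),   # A WITH RING ABOVE
    0x00B2: ("super", [0x0032]),        # SUPERSCRIPT TWO
}

def full_decomposition(cp, mappings, compatibility=False):
    """Recursively expand cp; reordering is NOT applied here."""
    entry = mappings.get(cp)
    if entry is None:
        return [cp]                      # no mapping: character stands as is
    tag, targets = entry
    if tag is not None and not compatibility:
        return [cp]                      # tagged mapping is skipped in canonical mode
    out = []
    for target in targets:
        out.extend(full_decomposition(target, mappings, compatibility))
    return out

canon = full_decomposition(0x01FA, MAPPINGS)
compat = full_decomposition(0x00B2, MAPPINGS, compatibility=True)
```

<p>With <tt>compatibility=False</tt> only untagged mappings are followed; with
<tt>compatibility=True</tt> both kinds are followed, matching the second item above.</p>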
1128
1129
1130
1131<h3><a NAME="Canonical Combining Classes"></a>Canonical Combining Classes</h3>
1132
1133
1134
1135<table BORDER="0" CELLSPACING="2" CELLPADDING="0">
1136
1137 <tr>
1138
1139 <th><p ALIGN="LEFT">Value</th>
1140
1141 <th><p ALIGN="LEFT">Description</th>
1142
1143 </tr>
1144
1145 <tr>
1146
1147 <td ALIGN="RIGHT">0:</td>
1148
1149 <td>Spacing, split, enclosing, reordrant, and Tibetan subjoined</td>
1150
1151 </tr>
1152
1153 <tr>
1154
1155 <td ALIGN="RIGHT">1:</td>
1156
1157 <td>Overlays and interior</td>
1158
1159 </tr>
1160
1161 <tr>
1162
1163 <td ALIGN="RIGHT">7:</td>
1164
1165 <td>Nuktas</td>
1166
1167 </tr>
1168
1169 <tr>
1170
1171 <td ALIGN="RIGHT">8:</td>
1172
1173 <td>Hiragana/Katakana voicing marks</td>
1174
1175 </tr>
1176
1177 <tr>
1178
1179 <td ALIGN="RIGHT">9:</td>
1180
1181 <td>Viramas</td>
1182
1183 </tr>
1184
1185 <tr>
1186
1187 <td ALIGN="RIGHT">10:</td>
1188
1189 <td>Start of fixed position classes</td>
1190
1191 </tr>
1192
1193 <tr>
1194
1195 <td ALIGN="RIGHT">199:</td>
1196
1197 <td>End of fixed position classes</td>
1198
1199 </tr>
1200
1201 <tr>
1202
1203 <td ALIGN="RIGHT">200:</td>
1204
1205 <td>Below left attached</td>
1206
1207 </tr>
1208
1209 <tr>
1210
1211 <td ALIGN="RIGHT">202:</td>
1212
1213 <td>Below attached</td>
1214
1215 </tr>
1216
1217 <tr>
1218
1219 <td ALIGN="RIGHT">204:</td>
1220
1221 <td>Below right attached</td>
1222
1223 </tr>
1224
1225 <tr>
1226
1227 <td ALIGN="RIGHT">208:</td>
1228
1229 <td>Left attached (reordrant around single base character)</td>
1230
1231 </tr>
1232
1233 <tr>
1234
1235 <td ALIGN="RIGHT">210:</td>
1236
1237 <td>Right attached</td>
1238
1239 </tr>
1240
1241 <tr>
1242
1243 <td ALIGN="RIGHT">212:</td>
1244
1245 <td>Above left attached</td>
1246
1247 </tr>
1248
1249 <tr>
1250
1251 <td ALIGN="RIGHT">214:</td>
1252
1253 <td>Above attached</td>
1254
1255 </tr>
1256
1257 <tr>
1258
1259 <td ALIGN="RIGHT">216:</td>
1260
1261 <td>Above right attached</td>
1262
1263 </tr>
1264
1265 <tr>
1266
1267 <td ALIGN="RIGHT">218:</td>
1268
1269 <td>Below left</td>
1270
1271 </tr>
1272
1273 <tr>
1274
1275 <td ALIGN="RIGHT">220:</td>
1276
1277 <td>Below</td>
1278
1279 </tr>
1280
1281 <tr>
1282
1283 <td ALIGN="RIGHT">222:</td>
1284
1285 <td>Below right</td>
1286
1287 </tr>
1288
1289 <tr>
1290
1291 <td ALIGN="RIGHT">224:</td>
1292
1293 <td>Left (reordrant around single base character)</td>
1294
1295 </tr>
1296
1297 <tr>
1298
1299 <td ALIGN="RIGHT">226:</td>
1300
1301 <td>Right</td>
1302
1303 </tr>
1304
1305 <tr>
1306
1307 <td ALIGN="RIGHT">228:</td>
1308
1309 <td>Above left</td>
1310
1311 </tr>
1312
1313 <tr>
1314
1315 <td ALIGN="RIGHT">230:</td>
1316
1317 <td>Above</td>
1318
1319 </tr>
1320
1321 <tr>
1322
1323 <td ALIGN="RIGHT">232:</td>
1324
1325 <td>Above right</td>
1326
1327 </tr>
1328
1329 <tr>
1330
1331 <td ALIGN="RIGHT">233:</td>
1332
1333 <td>Double below</td>
1334
1335 </tr>
1336
1337 <tr>
1338
1339 <td ALIGN="RIGHT">234:</td>
1340
1341 <td>Double above</td>
1342
1343 </tr>
1344
1345 <tr>
1346
1347 <td ALIGN="RIGHT">240:</td>
1348
1349 <td>Below (iota subscript)</td>
1350
1351 </tr>
1352
1353</table>
1354
1355
1356
1357<p><strong>Note: </strong>some of the combining classes in this list do not currently have
1358
1359members but are specified here for completeness.</p>
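<p>The combining classes drive canonical reordering. As a hedged sketch of that idea (the
authoritative definition is in Chapter 3 of the standard), each run of characters with
non-zero class can be stably sorted by class, while characters of class 0 stay in place.
The class table below is a tiny hand-written excerpt and <tt>canonical_reorder</tt> is an
illustrative name.</p>

```python
# Hand-written excerpt of combining classes; anything missing is class 0.
CCC = {
    0x0301: 230,   # COMBINING ACUTE ACCENT (above)
    0x0323: 220,   # COMBINING DOT BELOW (below)
}

def canonical_reorder(code_points, ccc):
    """Stably sort each run of non-zero-class characters by class."""
    out, run = [], []
    for cp in code_points:
        if ccc.get(cp, 0) == 0:
            # A class-0 character ends the current run of combining marks.
            out.extend(sorted(run, key=lambda c: ccc[c]))
            run = []
            out.append(cp)
        else:
            run.append(cp)
    out.extend(sorted(run, key=lambda c: ccc[c]))
    return out

reordered = canonical_reorder([0x0061, 0x0301, 0x0323], CCC)
```

<p>Python's <tt>sorted</tt> is stable, so marks of equal class keep their original
relative order, which is what the reordering requires.</p>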
1360
1361
1362
1363<h3><a NAME="Decompositions and Normalization"></a>Decompositions and Normalization</h3>
1364
1365
1366
1367<p>Decomposition is specified in Chapter 3. <a href="http://www.unicode.org/unicode/reports/tr15/"><i>Unicode Technical Report #15:
1368
1369Normalization Forms</i></a> specifies the interaction between decomposition and normalization. The
1370
1371most up-to-date version is found on <a HREF="http://www.unicode.org/unicode/reports/tr15/">http://www.unicode.org/unicode/reports/tr15/</a>.
1372
1373That report specifies how the decompositions defined in UnicodeData.txt are used to derive
1374
1375normalized forms of Unicode text.</p>
1376
1377
1378
1379<p>Note that as of the 2.1.9 update of the Unicode Character Database, the decompositions
1380
1381in the UnicodeData.txt file can be used to recursively derive the full decomposition in
1382
1383canonical order, without the need to separately apply canonical reordering. However,
1384
1385canonical reordering of combining character sequences must still be applied in
1386
1387decomposition when normalizing source text which contains any combining marks.</p>
1388
1389
1390
1391<h3><a NAME="Case Mappings"></a>Case Mappings</h3>
1392
1393
1394
1395<p>The case mapping is an informative, default mapping. Case itself, on the other hand,
1396
1397has normative status. Thus, for example, 0041 LATIN CAPITAL LETTER A is normatively
1398
uppercase, but its lowercase mapping to 0061 LATIN SMALL LETTER A is informative. The
1400
1401reason for this is that case can be considered to be an inherent property of a particular
1402
1403character (and is usually, but not always, derivable from the presence of the terms
1404
1405&quot;CAPITAL&quot; or &quot;SMALL&quot; in the character name), but case mappings between
1406
1407characters are occasionally influenced by local conventions. For example, certain
1408
1409languages, such as Turkish, German, French, or Greek may have small deviations from the
1410
1411default mappings listed in UnicodeData.</p>
1412
1413
1414
1415<p>In addition to uppercase and lowercase, because of the inclusion of certain composite
1416
1417characters for compatibility, such as 01F1 LATIN CAPITAL LETTER DZ, there is a third case,
1418
1419called <i>titlecase</i>, which is used where the first letter of a word is to be
1420
1421capitalized (e.g. UPPERCASE, Titlecase, lowercase). An example of such a titlecase letter
1422
1423is 01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z.</p>
1424
1425
1426
1427<p>The uppercase, titlecase and lowercase fields are only included for characters that
1428
1429have a single corresponding character of that type. Composite characters (such as
1430
1431&quot;339D SQUARE CM&quot;) that do not have a single corresponding character of that type
1432
1433can be cased by decomposition.</p>
1434
1435
1436
1437<p>For compatibility with existing parsers, UnicodeData only contains case mappings for
1438
1439characters where they are one-to-one mappings; it also omits information about
1440
1441context-sensitive case mappings. Information about these special cases can be found in a
1442
1443separate data file, SpecialCasing.txt,
1444
1445which has been added starting with the 2.1.8 update to the Unicode data files.
1446
1447SpecialCasing.txt contains additional informative case mappings that are either not
1448
1449one-to-one or which are context-sensitive.</p>
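<p>A minimal sketch of applying the one-to-one mappings to titlecase a word is shown
below. The two tables are small hand-written excerpts of the lowercase and titlecase
fields, the function name <tt>titlecase_word</tt> is invented for illustration, and none
of the special cases covered by SpecialCasing.txt are handled.</p>

```python
# Hand-written excerpts of the one-to-one mappings (code value -> code value).
LOWER = {0x0041: 0x0061, 0x01F1: 0x01F3, 0x01F2: 0x01F3}
TITLE = {0x0061: 0x0041, 0x01F1: 0x01F2, 0x01F3: 0x01F2}

def titlecase_word(code_points):
    """Titlecase the first character and lowercase the rest.

    Characters without an entry map to themselves.
    """
    if not code_points:
        return []
    first = TITLE.get(code_points[0], code_points[0])
    rest = [LOWER.get(cp, cp) for cp in code_points[1:]]
    return [first] + rest

# U+01F1 (DZ) followed by U+0041 (A)
word = titlecase_word([0x01F1, 0x0041])
```

<p>Here the digraph maps to its titlecase form 01F2 rather than to the uppercase form
01F1, which is the situation the titlecase field exists for.</p>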
1450
1451
1452
1453<h2><a NAME="Property Invariants"></a>Property Invariants</h2>
1454
1455
1456
1457<p>Values in UnicodeData.txt are subject to correction as errors are found; however, some
1458
1459characteristics of the categories themselves can be considered invariants. Applications
1460
1461may wish to take these invariants into account when choosing how to implement character
1462
1463properties. The following is a partial list of known invariants for the Unicode Character
1464
1465Database.</p>
1466
1467
1468
1469<h4>Database Fields</h4>
1470
1471
1472
1473<ul>
1474
1475 <li>The number of fields in UnicodeData.txt is fixed. </li>
1476
1477 <li>The order of the fields is also fixed. <ul>
1478
1479 <li>Any additional information about character properties to be added in the future will
1480
1481 appear in separate data tables, rather than being added on to the existing table or by
1482
1483 subdivision or reinterpretation of existing fields. </li>
1484
1485 </ul>
1486
1487 </li>
1488
1489</ul>
1490
1491
1492
1493<h4>General Category</h4>
1494
1495
1496
1497<ul>
1498
1499 <li>There will never be more than 32 General Category values. <ul>
1500
1501 <li>It is very unlikely that the Unicode Technical Committee will subdivide the General
1502
1503 Category partition any further, since that can cause implementations to misbehave. Because
1504
1505 the General Category is limited to 32 values, 5 bits can be used to represent the
1506
1507 information, and a 32-bit integer can be used as a bitmask to represent arbitrary sets of
1508
1509 categories. </li>
1510
1511 </ul>
1512
1513 </li>
1514
1515</ul>
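<p>The bitmask representation mentioned above could look like the following sketch. The
ordering of the abbreviations (and therefore the bit assigned to each) is an arbitrary
choice made for this example, not something defined by the standard; <tt>mask</tt> and
<tt>in_set</tt> are illustrative names.</p>

```python
# The 30 abbreviations listed in this document; bit positions are arbitrary.
CATEGORIES = [
    "Lu", "Ll", "Lt", "Mn", "Mc", "Me", "Nd", "Nl", "No", "Zs",
    "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn", "Lm", "Lo", "Pc",
    "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc", "Sk", "So",
]
INDEX = {abbr: i for i, abbr in enumerate(CATEGORIES)}

def mask(*abbrs):
    """Build a bitmask with one bit set per named category."""
    m = 0
    for abbr in abbrs:
        m |= 1 << INDEX[abbr]
    return m

def in_set(category, category_mask):
    """Test whether a category belongs to the set encoded by the mask."""
    return bool((category_mask >> INDEX[category]) & 1)

LETTER = mask("Lu", "Ll", "Lt", "Lm", "Lo")
```

<p>Because every index is below 32, each category fits in 5 bits and any set of
categories fits in a 32-bit integer, so a membership test is a single shift and mask.</p>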
1516
1517
1518
1519<h4>Combining Classes</h4>
1520
1521
1522
1523<ul>
1524
1525 <li>Combining classes are limited to the values 0 to 255. <ul>
1526
1527 <li>In practice, there are far fewer than 256 values used. Implementations may take
1528
1529 advantage of this fact for compression, since only the ordering of the non-zero values
1530
1531 matters for the Canonical Reordering Algorithm. It is possible for up to 256 values to be
1532
1533 used in the future; however, UTC decisions in the future may restrict the number of values
1534
1535 to 128, since this has implementation advantages. [Signed bytes can be used without
1536
1537 widening to ints in Java, for example.] </li>
1538
1539 </ul>
1540
1541 </li>
1542
1543 <li>All characters other than those of General Category M* have the combining class 0. <ul>
1544
1545 <li>Currently, all characters other than those of General Category Mn have the value 0.
1546
1547 However, some characters of General Category Me or Mc may be given non-zero values in the
1548
1549 future. </li>
1550
1551 <li>The precise values above the value 0 are not invariant--only the relative ordering is
1552
1553 considered normative. For example, it is not guaranteed in future versions that the class
1554
1555 of U+05B4 will be precisely 14. </li>
1556
1557 </ul>
1558
1559 </li>
1560
1561</ul>
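<p>Since only the relative ordering of the non-zero classes matters, an implementation can replace the raw values with small dense ranks. The sketch below illustrates this under stated assumptions: it reorders by stable-sorting each run of non-zero-class characters, which gives the same result as exchanging adjacent out-of-order pairs, and the helper names (<tt>canonical_reorder</tt>, <tt>dense_rank_table</tt>, <tt>compressed_ccc</tt>) are invented for this example. Class values come from Python's <tt>unicodedata</tt>, which tracks a later Unicode version.</p>

```python
import unicodedata


def canonical_reorder(text, ccc=unicodedata.combining):
    """Stable-sort each run of characters with non-zero class by class."""
    out, run = [], []
    for ch in text:
        if ccc(ch) == 0:
            # A class-0 character ends the current run of combining marks.
            out.extend(sorted(run, key=ccc))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc))
    return "".join(out)


def dense_rank_table(chars):
    """Map each combining class used by chars to a small rank, keeping order."""
    classes = sorted({unicodedata.combining(c) for c in chars} - {0})
    rank = {0: 0}
    rank.update({cls: i + 1 for i, cls in enumerate(classes)})
    return rank


def compressed_ccc(rank):
    """Return a lookup function that yields ranks instead of raw classes."""
    return lambda ch: rank[unicodedata.combining(ch)]


text = "a\u0301\u0323"  # acute (class 230) followed by dot below (class 220)
rank = dense_rank_table("\u0301\u0323\u05B4")
raw_result = canonical_reorder(text)
ranked_result = canonical_reorder(text, compressed_ccc(rank))
print(raw_result == ranked_result)
```

<p>Both calls place the dot below before the acute, because the rank table preserves the order of the original classes.</p>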
1562
1563
1564
1565<h4>Case</h4>
1566
1567
1568
1569<ul>
1570
1571 <li>Characters of type Lu, Lt, or Ll are called <i>cased</i>. All characters with an Upper,
1572
1573 Lower, or Titlecase mapping are cased characters. <ul>
1574
1575 <li>However, characters with the General Categories of Lu, Ll, or Lt may not always have
1576
1577 case mappings, and case mappings may vary by locale. (See
1578
1579 ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt). </li>
1580
1581 </ul>
1582
1583 </li>
1584
1585</ul>
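<p>The caveat above, that a cased character need not have a case mapping, can be checked with a short sketch. The helpers <tt>is_cased</tt> and <tt>has_case_mapping</tt> are names invented for this example, and the results depend on the Unicode version bundled with the Python interpreter, which is later than 3.0.0.</p>

```python
import unicodedata

CASED_CATEGORIES = {"Lu", "Ll", "Lt"}


def is_cased(ch):
    """A character is cased if its General Category is Lu, Ll, or Lt."""
    return unicodedata.category(ch) in CASED_CATEGORIES


def has_case_mapping(ch):
    """True if upper-, lower-, or titlecasing changes the character."""
    return ch.upper() != ch or ch.lower() != ch or ch.title() != ch


kra = "\u0138"  # LATIN SMALL LETTER KRA: category Ll, but no case mappings
print(is_cased(kra), has_case_mapping(kra))
print(is_cased("A"), has_case_mapping("A"))
```

<p>Python's string methods apply the locale-independent full case mappings, so locale-specific rules from SpecialCasing.txt are outside the scope of this sketch.</p>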
1586
1587
1588
1589<h4>Canonical Decomposition</h4>
1590
1591
1592
1593<ul>
1594
1595 <li>Canonical mappings are always in canonical order. </li>
1596
 <li>When a canonical mapping consists of a pair of characters, only the first of the pair may decompose further. </li>
1598
1599 <li>Canonical decompositions are &quot;transparent&quot; to other character data: <ul>
1600
 <li><tt>BIDI(a) = BIDI(principal(canonicalDecomposition(a)))</tt> </li>
 <li><tt>Category(a) = Category(principal(canonicalDecomposition(a)))</tt> </li>
 <li><tt>CombiningClass(a) = CombiningClass(principal(canonicalDecomposition(a)))</tt><br>
1606
1607 where principal(a) is the first character not of type Mn, or the first character if all
1608
1609 characters are of type Mn. </li>
1610
1611 </ul>
1612
1613 </li>
1614
1615 <li>However, because there are sometimes missing case pairs, and because of some legacy
1616
1617 characters, it is only generally true that: <ul>
1618
1619 <li><tt>upper(canonicalDecomposition(a)) = canonicalDecomposition(upper(a))</tt> </li>
1620
1621 <li><tt>lower(canonicalDecomposition(a)) = canonicalDecomposition(lower(a))</tt> </li>
1622
1623 <li><tt>title(canonicalDecomposition(a)) = canonicalDecomposition(title(a))</tt> </li>
1624
1625 </ul>
1626
1627 </li>
1628
1629</ul>
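<p>The transparency relations above can be spot-checked for individual characters. This is a minimal sketch, not a validation of the whole database: the function names are invented here, the data comes from Python's <tt>unicodedata</tt> (a later Unicode version than 3.0.0), and only a few characters are examined because the relations are not guaranteed to hold for every character in later versions.</p>

```python
import unicodedata


def canonical_decomposition(ch):
    """Return the one-level canonical mapping of ch, or ch itself if none."""
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        # Empty means no mapping; a leading <tag> marks a compatibility mapping.
        return ch
    return "".join(chr(int(cp, 16)) for cp in mapping.split())


def principal(text):
    """First character not of type Mn, or the first character if all are Mn."""
    for ch in text:
        if unicodedata.category(ch) != "Mn":
            return ch
    return text[0]


def is_transparent(ch):
    """Check the BIDI, Category, and CombiningClass relations for ch."""
    p = principal(canonical_decomposition(ch))
    return (
        unicodedata.bidirectional(ch) == unicodedata.bidirectional(p)
        and unicodedata.category(ch) == unicodedata.category(p)
        and unicodedata.combining(ch) == unicodedata.combining(p)
    )


def case_commutes(ch):
    """Check upper(canonicalDecomposition(a)) = canonicalDecomposition(upper(a))."""
    return canonical_decomposition(ch).upper() == canonical_decomposition(ch.upper())


for ch in ("\u00C5", "\u0344"):  # A WITH RING ABOVE; COMBINING GREEK DIALYTIKA TONOS
    print(f"U+{ord(ch):04X}", is_transparent(ch))
print(case_commutes("\u00E9"))  # LATIN SMALL LETTER E WITH ACUTE
```

<p>U+0344 exercises the second branch of <tt>principal</tt>, since both characters of its decomposition are of type Mn.</p>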
1630
1631
1632
1633<h2><a NAME="Modification History"></a>Modification History</h2>
1634
1635
1636
1637<p>This section provides a summary of the changes between update versions of the Unicode
1638
1639Standard.</p>
1640
1641
1642
1643<h3><a href="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 3.0.0"> Unicode 3.0.0</a></h3>
1644
1645
1646
1647<p>Modifications made for Version 3.0.0 of UnicodeData.txt include many new characters and
1648
a number of property changes. These are summarized in Appendix D of <em>The Unicode
1650
1651Standard, Version 3.0.</em></p>
1652
1653
1654
1655<h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.9">Unicode 2.1.9</a> </h3>
1656
1657
1658
1659<p>Modifications made for Version 2.1.9 of UnicodeData.txt include:
1660
1661
1662
1663<ul>
1664
1665 <li>Corrected combining class for U+05AE HEBREW ACCENT ZINOR. </li>
1666
 <li>Corrected combining class for U+20E1 COMBINING LEFT RIGHT ARROW ABOVE. </li>
1668
1669 <li>Corrected combining class for U+0F35 and U+0F37 to 220. </li>
1670
1671 <li>Corrected combining class for U+0F71 to 129. </li>
1672
1673 <li>Added a decomposition for U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR. </li>
1674
 <li>Added decompositions for several Greek symbol letters: U+03D0..U+03D2, U+03D5,
1676
1677 U+03D6, U+03F0..U+03F2. </li>
1678
 <li>Removed decompositions from the conjoining jamo block: U+1100..U+11F8. </li>
1680
1681 <li>Changes to decomposition mappings for some Tibetan vowels for consistency in
1682
1683 normalization. (U+0F71, U+0F73, U+0F77, U+0F79, U+0F81) </li>
1684
1685 <li>Updated the decomposition mappings for several Vietnamese characters with two diacritics
1686
1687 (U+1EAC, U+1EAD, U+1EB6, U+1EB7, U+1EC6, U+1EC7, U+1ED8, U+1ED9), so that the recursive
1688
1689 decomposition can be generated directly in canonically reordered form (not a normative
1690
1691 change). </li>
1692
1693 <li>Updated the decomposition mappings for several Arabic compatibility characters involving
1694
1695 shadda (U+FC5E..U+FC62, U+FCF2..U+FCF4), and two Latin characters (U+1E1C, U+1E1D), so
1696
1697 that the decompositions are generated directly in canonically reordered form (not a
1698
1699 normative change). </li>
1700
1701 <li>Changed BIDI category for: U+00A0 NO-BREAK SPACE, U+2007 FIGURE SPACE, U+2028 LINE
1702
1703 SEPARATOR. </li>
1704
1705 <li>Changed BIDI category for extenders of General Category Lm: U+3005, U+3021..U+3035,
1706
1707 U+FF9E, U+FF9F. </li>
1708
1709 <li>Changed General Category and BIDI category for the Greek numeral signs: U+0374, U+0375. </li>
1710
1711 <li>Corrected General Category for U+FFE8 HALFWIDTH FORMS LIGHT VERTICAL. </li>
1712
1713 <li>Added Unicode 1.0 names for many Tibetan characters (informative). </li>
1714
1715</ul>
1716
1717
1718
1719<h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.8">Unicode 2.1.8</a> </h3>
1720
1721
1722
1723<p>Modifications made for Version 2.1.8 of UnicodeData.txt include:
1724
1725
1726
1727<ul>
1728
1729 <li>Added combining class 240 for U+0345 COMBINING GREEK YPOGEGRAMMENI so that
1730
1731 decompositions involving iota subscript are derivable directly in canonically reordered
1732
1733 form; this also has a bearing on simplification of casing of polytonic Greek. </li>
1734
1735 <li>Changes in decompositions related to Greek tonos. These result from the clarification
1736
1737 that monotonic Greek &quot;tonos&quot; should be equated with U+0301 COMBINING ACUTE,
1738
 rather than with U+030D COMBINING VERTICAL LINE ABOVE. (This affects all characters in the
 Greek block involving &quot;tonos&quot; and some polytonic Greek characters in the 1FXX
 block.) </li>
1744
1745 <li>Changed decompositions involving dialytika tonos. (U+0390, U+03B0) </li>
1746
1747 <li>Changed ternary decompositions to binary. (U+0CCB, U+FB2C, U+FB2D) These changes
1748
1749 simplify normalization. </li>
1750
1751 <li>Removed canonical decomposition for Latin Candrabindu. (U+0310) </li>
1752
1753 <li>Corrected error in canonical decomposition for U+1FF4. </li>
1754
1755 <li>Added compatibility decompositions to clarify collation tables. (U+2100, U+2101, U+2105,
1756
1757 U+2106, U+1E9A) </li>
1758
 <li>A series of general category changes to assist the convergence of the Unicode definition
 of identifier with ISO TR 10176: <ul>
1762
1763 <li>So &gt; Lo: U+0950, U+0AD0, U+0F00, U+0F88..U+0F8B </li>
1764
1765 <li>Po &gt; Lo: U+0E2F, U+0EAF, U+3006 </li>
1766
1767 <li>Lm &gt; Sk: U+309B, U+309C </li>
1768
1769 <li>Po &gt; Pc: U+30FB, U+FF65 </li>
1770
1771 <li>Ps/Pe &gt; Mn: U+0F3E, U+0F3F </li>
1772
1773 </ul>
1774
1775 </li>
1776
1777 <li>A series of bidi property changes for consistency. <ul>
1778
1779 <li>L &gt; ET: U+09F2, U+09F3 </li>
1780
1781 <li>ON &gt; L: U+3007 </li>
1782
1783 <li>L &gt; ON: U+0F3A..U+0F3D, U+037E, U+0387 </li>
1784
1785 </ul>
1786
1787 </li>
1788
 <li>Added case mapping: U+01A6 &lt;-&gt; U+0280 </li>
1790
1791 <li>Updated symmetric swapping value for guillemets: U+00AB, U+00BB, U+2039, U+203A. </li>
1792
1793 <li>Changes to combining class values. Most Indic fixed position class non-spacing marks
1794
1795 were changed to combining class 0. This fixes some inconsistencies in how canonical
1796
1797 reordering would apply to Indic scripts, including Tibetan. Indic interacting top/bottom
1798
1799 fixed position classes were merged into single (non-zero) classes as part of this change.
1800
1801 Tibetan subjoined consonants are changed from combining class 6 to combining class 0. Thai
1802
1803 pinthu (U+0E3A) moved to combining class 9. Moved two Devanagari stress marks into generic
1804
1805 above and below combining classes (U+0951, U+0952). </li>
1806
1807 <li>Corrected placement of semicolon near symmetric swapping field. (U+FA0E, etc., scattered
1808
1809 positions to U+FA29) </li>
1810
1811</ul>
1812
1813
1814
1815<h3>Version 2.1.7</h3>
1816
1817
1818
1819<p><i>This version was for internal change tracking only, and never publicly released.</i></p>
1820
1821
1822
1823<h3>Version 2.1.6</h3>
1824
1825
1826
1827<p><i>This version was for internal change tracking only, and never publicly released.</i></p>
1828
1829
1830
1831<h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.5">Unicode 2.1.5</a> </h3>
1832
1833
1834
1835<p>Modifications made for Version 2.1.5 of UnicodeData.txt include:
1836
1837
1838
1839<ul>
1840
1841 <li>Changed decomposition for U+FF9E and U+FF9F so that correct collation weighting will
1842
1843 automatically result from the canonical equivalences. </li>
1844
1845 <li>Removed canonical decompositions for U+04D4, U+04D5, U+04D8, U+04D9, U+04E0, U+04E1,
1846
1847 U+04E8, U+04E9 (the implication being that no canonical equivalence is claimed between
1848
1849 these 8 characters and similar Latin letters), and updated 4 canonical decompositions for
1850
1851 U+04DB, U+04DC, U+04EA, U+04EB to reflect the implied difference in the base character. </li>
1852
 <li>Added the Pi and Pf categories and assigned the relevant quotation marks to those
1854
1855 categories, based on the Unicode Technical Corrigendum on Quotation Characters. </li>
1856
 <li>Updated many bidi properties, following the advice of the ad hoc committee on bidi,
1858
1859 and to make the bidi properties of compatibility characters more consistent. </li>
1860
1861 <li>Changed category of several Tibetan characters: U+0F3E, U+0F3F, U+0F88..U+0F8B to make
1862
1863 them non-combining, reflecting the combined opinion of Tibetan experts. </li>
1864
1865 <li>Added case mapping for U+03F2. </li>
1866
1867 <li>Corrected case mapping for U+0275. </li>
1868
 <li>Added titlecase mappings for U+03D0, U+03D1, U+03D5, U+03D6, U+03F0..U+03F2. </li>
1870
1871 <li>Corrected compatibility label for U+2121. </li>
1872
 <li>Added specific entries for all the CJK compatibility ideographs, U+F900..U+FA2D, so the
1874
1875 canonical decomposition for each (the URO character it is equivalent to) can be carried in
1876
1877 the database. </li>
1878
1879</ul>
1880
1881
1882
1883<h3>Version 2.1.4</h3>
1884
1885
1886
1887<p><i>This version was for internal change tracking only, and never publicly released.</i></p>
1888
1889
1890
1891<h3>Version 2.1.3</h3>
1892
1893
1894
1895<p><i>This version was for internal change tracking only, and never publicly released.</i></p>
1896
1897
1898
1899<h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.2">Unicode 2.1.2</a> </h3>
1900
1901
1902
1903<p>Modifications made in updating UnicodeData.txt to Version 2.1.2 for the Unicode
1904
1905Standard, Version 2.1 (from Version 2.0) include:
1906
1907
1908
1909<ul>
1910
1911 <li>Added two characters (U+20AC and U+FFFC). </li>
1912
1913 <li>Amended bidi properties for U+0026, U+002E, U+0040, U+2007. </li>
1914
1915 <li>Corrected case mappings for U+018E, U+019F, U+01DD, U+0258, U+0275, U+03C2, U+1E9B. </li>
1916
1917 <li>Changed combining order class for U+0F71. </li>
1918
1919 <li>Corrected canonical decompositions for U+0F73, U+1FBE. </li>
1920
1921 <li>Changed decomposition for U+FB1F from compatibility to canonical. </li>
1922
1923 <li>Added compatibility decompositions for U+FBE8, U+FBE9, U+FBF9..U+FBFB. </li>
1924
1925 <li>Corrected compatibility decompositions for U+2469, U+246A, U+3358. </li>
1926
1927</ul>
1928
1929
1930
1931<h3>Version 2.1.1</h3>
1932
1933
1934
1935<p><i>This version was for internal change tracking only, and never publicly released.</i></p>
1936
1937
1938
1939<h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.0.0">Unicode 2.0.0</a> </h3>
1940
1941
1942
1943<p>The modifications made in updating UnicodeData.txt for the Unicode
1944
1945Standard, Version 2.0 include:
1946
1947
1948
1949<ul>
1950
1951 <li>Fixed decompositions with TONOS to use correct NSM: 030D. </li>
1952
 <li>Removed old Hangul Syllables; mappings to the new characters are in a separate table. </li>
1954
1955 <li>Marked compatibility decompositions with additional tags. </li>
1956
1957 <li>Changed old tag names for clarity. </li>
1958
1959 <li>Revision of decompositions to use first-level decomposition, instead of maximal
1960
1961 decomposition. </li>
1962
1963 <li>Correction of all known errors in decompositions from earlier versions. </li>
1964
1965 <li>Added control code names (as old Unicode names). </li>
1966
1967 <li>Added Hangul Jamo decompositions. </li>
1968
1969 <li>Added Number category to match properties list in book. </li>
1970
1971 <li>Fixed categories of Koranic Arabic marks. </li>
1972
1973 <li>Fixed categories of precomposed characters to match decomposition where possible. </li>
1974
1975 <li>Added Hebrew cantillation marks and the Tibetan script. </li>
1976
1977 <li>Added place holders for ranges such as CJK Ideographic Area and the Private Use Area. </li>
1978
1979 <li>Added categories Me, Sk, Pc, Nl, Cs, Cf, and rectified a number of mistakes in the
1980
1981 database. </li>
1982
1983</ul>
1984
1985</body>
1986
1987</html>
1988