Commit | Line | Data |
---|---|---|
505afebf JH |
1 | <html> |
2 | ||
3 | ||
4 | ||
5 | <head> | |
6 | ||
7 | <meta NAME="GENERATOR" CONTENT="Microsoft FrontPage 4.0"> | |
8 | ||
9 | <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8"> | |
10 | ||
11 | <link REL="stylesheet" HREF="http://www.unicode.org/unicode.css" TYPE="text/css"> | |
12 | ||
13 | <title>UnicodeData File Format</title> | |
14 | ||
15 | </head> | |
16 | ||
17 | ||
18 | ||
19 | <body> | |
20 | ||
21 | ||
22 | ||
23 | <h1>UnicodeData File Format<br> | |
24 | Version 3.0.0</h1> | |
25 | ||
26 | ||
27 | ||
28 | <table BORDER="1" CELLSPACING="2" CELLPADDING="0" HEIGHT="87" WIDTH="100%"> | |
29 | ||
30 | <tr> | |
31 | ||
32 | <td VALIGN="TOP" width="144">Revision</td> | |
33 | ||
34 | <td VALIGN="TOP">3.0.0</td> | |
35 | ||
36 | </tr> | |
37 | ||
38 | <tr> | |
39 | ||
40 | <td VALIGN="TOP" width="144">Authors</td> | |
41 | ||
42 | <td VALIGN="TOP">Mark Davis and Ken Whistler</td> | |
43 | ||
44 | </tr> | |
45 | ||
46 | <tr> | |
47 | ||
48 | <td VALIGN="TOP" width="144">Date</td> | |
49 | ||
50 | <td VALIGN="TOP">1999-09-12</td> | |
51 | ||
52 | </tr> | |
53 | ||
54 | <tr> | |
55 | ||
56 | <td VALIGN="TOP" width="144">This Version</td> | |
57 | ||
58 | <td VALIGN="TOP"><a href="ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html">ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html</a></td> | |
59 | ||
60 | </tr> | |
61 | ||
62 | <tr> | |
63 | ||
64 | <td VALIGN="TOP" width="144">Previous Version</td> | |
65 | ||
66 | <td VALIGN="TOP">n/a</td> | |
67 | ||
68 | </tr> | |
69 | ||
70 | <tr> | |
71 | ||
72 | <td VALIGN="TOP" width="144">Latest Version</td> | |
73 | ||
74 | <td VALIGN="TOP"><a href="ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html">ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html</a></td> | |
75 | ||
76 | </tr> | |
77 | ||
78 | </table> | |
79 | ||
80 | ||
81 | ||
82 | <p align="center">Copyright © 1995-1999 Unicode, Inc. All Rights reserved.<br> | |
83 | ||
84 | <i>For more information, including Disclaimer and Limitations, see <a HREF="UnicodeCharacterDatabase-3.0.0.html">UnicodeCharacterDatabase-3.0.0.html</a> </i></p> | |
85 | ||
86 | ||
87 | ||
88 | <p>This document describes the format of the UnicodeData.txt file, which is one of the | |
89 | ||
90 | files in the Unicode Character Database. The document is divided into the following | |
91 | ||
92 | sections: | |
93 | ||
94 | ||
95 | ||
96 | <ul> | |
97 | ||
98 | <li><a HREF="#Field Formats">Field Formats</a> <ul> | |
99 | ||
100 | <li><a HREF="#General Category">General Category</a> </li> | |
101 | ||
102 | <li><a HREF="#Bidirectional Category">Bidirectional Category</a> </li> | |
103 | ||
104 | <li><a HREF="#Character Decomposition">Character Decomposition Mapping</a> </li> | |
105 | ||
106 | <li><a HREF="#Canonical Combining Classes">Canonical Combining Classes</a> </li> | |
107 | ||
108 | <li><a HREF="#Decompositions and Normalization">Decompositions and Normalization</a> </li> | |
109 | ||
110 | <li><a HREF="#Case Mappings">Case Mappings</a> </li> | |
111 | ||
112 | </ul> | |
113 | ||
114 | </li> | |
115 | ||
116 | <li><a HREF="#Property Invariants">Property Invariants</a> </li> | |
117 | ||
118 | <li><a HREF="#Modification History">Modification History</a> </li> | |
119 | ||
120 | </ul> | |
121 | ||
122 | ||
123 | ||
124 | <p><b>Warning: </b>the information in this file does not completely describe the use and | |
125 | ||
126 | interpretation of Unicode character properties and behavior. It must be used in | |
127 | ||
128 | conjunction with the data in the other files in the Unicode Character Database, and relies | |
129 | ||
130 | on the notation and definitions supplied in <i><a href="http://www.unicode.org/unicode/standard/versions/Unicode3.0.html"> The Unicode | |
131 | Standard</a></i>. All chapter references | |
132 | ||
133 | are to Version 3.0 of the standard.</p> | |
134 | ||
135 | ||
136 | ||
137 | <h2><a NAME="Field Formats"></a>Field Formats</h2> | |
138 | ||
139 | ||
140 | ||
141 | <p>The file consists of lines containing fields terminated by semicolons. Each line | |
142 | ||
143 | represents the data for one encoded character in the Unicode Standard. Every encoded | |
144 | ||
145 | character has a data entry, with the exception of certain special ranges, as detailed | |
146 | ||
147 | below. | |
148 | ||
149 | ||
150 | ||
151 | <ul> | |
152 | ||
153 | <li>There are seven special ranges of characters that are represented only by their start and | |
154 | ||
155 | end characters, since the properties in the file are uniform, except for code values | |
156 | ||
157 | (which are all sequential and assigned). </li> | |
158 | ||
159 | <li>The names of CJK ideograph characters and the names and decompositions of Hangul | |
160 | ||
161 | syllable characters are algorithmically derivable. (See the Unicode Standard and <a | |
162 | ||
163 | HREF="http://www.unicode.org/unicode/reports/tr15/">Unicode Technical Report #15</a> for | |
164 | ||
165 | more information.) </li> | |
166 | ||
167 | <li>Surrogate code values and private use characters have no names. </li> | |
168 | ||
169 | <li>The Private Use characters outside of the BMP (U+F0000..U+FFFFD, U+100000..U+10FFFD) are | |
170 | ||
171 | not listed. These correspond to surrogate pairs where the first surrogate is in the High | |
172 | ||
173 | Surrogate Private Use section. </li> | |
174 | ||
175 | </ul> | |
176 | ||
177 | ||
178 | ||
179 | <p>The exact ranges represented by start and end characters are: | |
180 | ||
181 | ||
182 | ||
183 | <ul> | |
184 | ||
185 | <li>CJK Ideographs Extension A (U+3400 - U+4DB5) </li> | |
186 | ||
187 | <li>CJK Ideographs (U+4E00 - U+9FA5) </li> | |
188 | ||
189 | <li>Hangul Syllables (U+AC00 - U+D7A3) </li> | |
190 | ||
191 | <li>Non-Private Use High Surrogates (U+D800 - U+DB7F) </li> | |
192 | ||
193 | <li>Private Use High Surrogates (U+DB80 - U+DBFF) </li> | |
194 | ||
195 | <li>Low Surrogates (U+DC00 - U+DFFF) </li> | |
196 | ||
197 | <li>The Private Use Area (U+E000 - U+F8FF) </li> | |
198 | ||
199 | </ul> | |
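<p>The range convention above can be handled while reading the file. The following is a minimal Python sketch, not a definitive reader. It assumes that the start and end entries of a range carry names ending in ", First&gt;" and ", Last&gt;" (an assumption about the data file that this document does not state), and the function name <code>iter_entries</code> and the sample lines are illustrative only.</p>

```python
from typing import Iterable, Iterator, List, Tuple


def iter_entries(lines: Iterable[str]) -> Iterator[Tuple[int, int, List[str]]]:
    """Yield (first_code, last_code, fields) for every entry.

    Ordinary entries yield first_code == last_code. A start/end pair is
    merged into one tuple covering the whole range, using the fields of
    the start entry (the properties are uniform across the range).
    """
    pending = None  # (code, fields) of a range start waiting for its end
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        fields = line.split(";")
        code = int(fields[0], 16)
        name = fields[1]
        if name.endswith(", First>"):  # assumed marker for a range start
            pending = (code, fields)
        elif name.endswith(", Last>"):  # assumed marker for a range end
            if pending is None:
                raise ValueError("range end without a start: " + line)
            first, first_fields = pending
            pending = None
            yield first, code, first_fields
        else:
            yield code, code, fields


# Illustrative sample lines, written to follow the format described here.
SAMPLE = [
    "3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;",
    "4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
]

entries = list(iter_entries(SAMPLE))
```

<p>With this approach a lookup for any code value inside a range can reuse the properties of the start entry, while the name still has to be derived algorithmically.</p>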
200 | ||
201 | ||
202 | ||
203 | <p>The following table describes the format and meaning of each field in a data entry in | |
204 | ||
205 | the UnicodeData file. Fields which contain normative information are so indicated.</p> | |
206 | ||
207 | ||
208 | ||
209 | <table BORDER="1" CELLSPACING="2" CELLPADDING="2"> | |
210 | ||
211 | <tr> | |
212 | ||
213 | <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Field</th> | |
214 | ||
215 | <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Name</th> | |
216 | ||
217 | <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Status</th> | |
218 | ||
219 | <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Explanation</th> | |
220 | ||
221 | </tr> | |
222 | ||
223 | <tr> | |
224 | ||
225 | <th VALIGN="top">0</th> | |
226 | ||
227 | <td VALIGN="top">Code value</td> | |
228 | ||
229 | <td VALIGN="top">normative</td> | |
230 | ||
231 | <td VALIGN="top">Code value in 4-digit hexadecimal format.</td> | |
232 | ||
233 | </tr> | |
234 | ||
235 | <tr> | |
236 | ||
237 | <th VALIGN="top">1</th> | |
238 | ||
239 | <td VALIGN="top">Character name</td> | |
240 | ||
241 | <td VALIGN="top">normative</td> | |
242 | ||
243 | <td VALIGN="top">These names match exactly the names published in Chapter 14 of the | |
244 | ||
245 | Unicode Standard, Version 3.0.</td> | |
246 | ||
247 | </tr> | |
248 | ||
249 | <tr> | |
250 | ||
251 | <th VALIGN="top">2</th> | |
252 | ||
253 | <td VALIGN="top"><a HREF="#General Category">General Category</a> </td> | |
254 | ||
255 | <td VALIGN="top">normative / informative<br> | |
256 | ||
257 | (see below)</td> | |
258 | ||
259 | <td VALIGN="top">This is a useful breakdown into various "character types" which | |
260 | ||
261 | can be used as a default categorization in implementations. See below for a brief | |
262 | ||
263 | explanation.</td> | |
264 | ||
265 | </tr> | |
266 | ||
267 | <tr> | |
268 | ||
269 | <th VALIGN="top">3</th> | |
270 | ||
271 | <td VALIGN="top"><a HREF="#Canonical Combining Classes">Canonical Combining Classes</a> </td> | |
272 | ||
273 | <td VALIGN="top">normative</td> | |
274 | ||
275 | <td VALIGN="top">The classes used for the Canonical Ordering Algorithm in the Unicode | |
276 | ||
277 | Standard. These classes are also printed in Chapter 4 of the Unicode Standard.</td> | |
278 | ||
279 | </tr> | |
280 | ||
281 | <tr> | |
282 | ||
283 | <th VALIGN="top">4</th> | |
284 | ||
285 | <td VALIGN="top"><a HREF="#Bidirectional Category">Bidirectional Category</a> </td> | |
286 | ||
287 | <td VALIGN="top">normative</td> | |
288 | ||
289 | <td VALIGN="top">See the list below for an explanation of the abbreviations used in this | |
290 | ||
291 | field. These are the categories required by the Bidirectional Behavior Algorithm in the | |
292 | ||
293 | Unicode Standard. These categories are summarized in Chapter 3 of the Unicode Standard.</td> | |
294 | ||
295 | </tr> | |
296 | ||
297 | <tr> | |
298 | ||
299 | <th VALIGN="top">5</th> | |
300 | ||
301 | <td VALIGN="top"><a HREF="#Character Decomposition">Character Decomposition | |
302 | Mapping</a></td> | |
303 | ||
304 | <td VALIGN="top">normative</td> | |
305 | ||
306 | <td VALIGN="top">In the Unicode Standard, not all of the mappings are full (maximal) | |
307 | ||
308 | decompositions. Recursive application of look-up for decompositions will, in all cases, | |
309 | ||
310 | lead to a maximal decomposition. The decomposition mappings match exactly the | |
311 | ||
312 | decomposition mappings published with the character names in the Unicode Standard.</td> | |
313 | ||
314 | </tr> | |
315 | ||
316 | <tr> | |
317 | ||
318 | <th VALIGN="top">6</th> | |
319 | ||
320 | <td VALIGN="top">Decimal digit value</td> | |
321 | ||
322 | <td VALIGN="top">normative</td> | |
323 | ||
324 | <td VALIGN="top">This is a numeric field. If the character has the decimal digit property, | |
325 | ||
326 | as specified in Chapter 4 of the Unicode Standard, the value of that digit is represented | |
327 | ||
328 | with an integer value in this field.</td> | |
329 | ||
330 | </tr> | |
331 | ||
332 | <tr> | |
333 | ||
334 | <th VALIGN="top">7</th> | |
335 | ||
336 | <td VALIGN="top">Digit value</td> | |
337 | ||
338 | <td VALIGN="top">normative</td> | |
339 | ||
340 | <td VALIGN="top">This is a numeric field. If the character represents a digit, not | |
341 | ||
342 | necessarily a decimal digit, the value is here. This covers digits which do not form | |
343 | ||
344 | decimal radix forms, such as the compatibility superscript digits.</td> | |
345 | ||
346 | </tr> | |
347 | ||
348 | <tr> | |
349 | ||
350 | <th VALIGN="top">8</th> | |
351 | ||
352 | <td VALIGN="top">Numeric value</td> | |
353 | ||
354 | <td VALIGN="top">normative</td> | |
355 | ||
356 | <td VALIGN="top">This is a numeric field. If the character has the numeric property, as | |
357 | ||
358 | specified in Chapter 4 of the Unicode Standard, the value of that character is represented | |
359 | ||
360 | with an integer or rational number in this field. This includes fractions as, e.g., | |
361 | ||
362 | "1/5" for U+2155 VULGAR FRACTION ONE FIFTH. Also included are numerical values | |
363 | ||
364 | for compatibility characters such as circled numbers.</td> | |
365 | ||
366 | </tr> | |
367 | ||
368 | <tr> | |
369 | ||
370 | <th VALIGN="top">9</th> | |
371 | ||
372 | <td VALIGN="top">Mirrored</td> | |
373 | ||
374 | <td VALIGN="top">normative</td> | |
375 | ||
376 | <td VALIGN="top">If the character has been identified as a "mirrored" character | |
377 | ||
378 | in bidirectional text, this field has the value "Y"; otherwise "N". | |
379 | ||
380 | The list of mirrored characters is also printed in Chapter 4 of the Unicode Standard.</td> | |
381 | ||
382 | </tr> | |
383 | ||
384 | <tr> | |
385 | ||
386 | <th VALIGN="top">10</th> | |
387 | ||
388 | <td VALIGN="top">Unicode 1.0 Name</td> | |
389 | ||
390 | <td VALIGN="top">informative</td> | |
391 | ||
392 | <td VALIGN="top">This is the old name as published in Unicode 1.0. This name is only | |
393 | ||
394 | provided when it is significantly different from the Unicode 3.0 name for the character.</td> | |
395 | ||
396 | </tr> | |
397 | ||
398 | <tr> | |
399 | ||
400 | <th VALIGN="top">11</th> | |
401 | ||
402 | <td VALIGN="top">10646 comment field</td> | |
403 | ||
404 | <td VALIGN="top">informative</td> | |
405 | ||
406 | <td VALIGN="top">This is the ISO 10646 comment field. It is in parentheses in the 10646 | |
407 | ||
408 | names list.</td> | |
409 | ||
410 | </tr> | |
411 | ||
412 | <tr> | |
413 | ||
414 | <th VALIGN="top">12</th> | |
415 | ||
416 | <td VALIGN="top"><a HREF="#Case Mappings">Uppercase Mapping</a></td> | |
417 | ||
418 | <td VALIGN="top">informative</td> | |
419 | ||
420 | <td VALIGN="top">Upper case equivalent mapping. If a character is part of an alphabet with | |
421 | ||
422 | case distinctions, and has an upper case equivalent, then the upper case equivalent is in | |
423 | ||
424 | this field. See the explanation below on case distinctions. These mappings are always | |
425 | ||
426 | one-to-one, not one-to-many or many-to-one. This field is informative.</td> | |
427 | ||
428 | </tr> | |
429 | ||
430 | <tr> | |
431 | ||
432 | <th VALIGN="top">13</th> | |
433 | ||
434 | <td VALIGN="top"><a HREF="#Case Mappings">Lowercase Mapping</a></td> | |
435 | ||
436 | <td VALIGN="top">informative</td> | |
437 | ||
438 | <td VALIGN="top">Similar to Uppercase mapping</td> | |
439 | ||
440 | </tr> | |
441 | ||
442 | <tr> | |
443 | ||
444 | <th VALIGN="top">14</th> | |
445 | ||
446 | <td VALIGN="top"><a HREF="#Case Mappings">Titlecase Mapping</a></td> | |
447 | ||
448 | <td VALIGN="top">informative</td> | |
449 | ||
450 | <td VALIGN="top">Similar to Uppercase mapping</td> | |
451 | ||
452 | </tr> | |
453 | ||
454 | </table> | |
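<p>As an illustration of the field layout in the table above, the following Python sketch splits one data entry into its fifteen fields (0 through 14). The names <code>Record</code> and <code>parse_line</code> are hypothetical, the sample lines are illustrative, and tolerating an extra terminating semicolon is an assumption made for robustness rather than something this document specifies.</p>

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Optional


@dataclass
class Record:
    code: int                      # field 0
    name: str                      # field 1
    general_category: str          # field 2
    combining_class: int           # field 3
    bidi_category: str             # field 4
    decomposition: str             # field 5 (raw text, possibly empty)
    decimal_digit: Optional[int]   # field 6
    digit: Optional[int]           # field 7
    numeric: Optional[Fraction]    # field 8 (integer or rational)
    mirrored: bool                 # field 9 ("Y" or "N")
    unicode_1_name: str            # field 10
    iso_comment: str               # field 11
    upper: Optional[int]           # field 12
    lower: Optional[int]           # field 13
    title: Optional[int]           # field 14


def _hex(text: str) -> Optional[int]:
    return int(text, 16) if text else None


def _int(text: str) -> Optional[int]:
    return int(text) if text else None


def parse_line(line: str) -> Record:
    parts = line.rstrip("\r\n").split(";")
    if len(parts) == 16 and parts[15] == "":
        parts.pop()  # assumption: accept a semicolon after the last field
    if len(parts) != 15:
        raise ValueError("expected 15 fields, got %d" % len(parts))
    return Record(
        code=int(parts[0], 16),
        name=parts[1],
        general_category=parts[2],
        combining_class=int(parts[3]),
        bidi_category=parts[4],
        decomposition=parts[5],
        decimal_digit=_int(parts[6]),
        digit=_int(parts[7]),
        numeric=Fraction(parts[8]) if parts[8] else None,
        mirrored=(parts[9] == "Y"),
        unicode_1_name=parts[10],
        iso_comment=parts[11],
        upper=_hex(parts[12]),
        lower=_hex(parts[13]),
        title=_hex(parts[14]),
    )


letter = parse_line("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
fifth = parse_line(
    "2155;VULGAR FRACTION ONE FIFTH;No;0;ON;<fraction> 0031 2044 0035"
    ";;;1/5;N;FRACTION ONE FIFTH;;;;"
)
```

<p>Using <code>Fraction</code> for field 8 keeps values such as "1/5" exact instead of converting them to floating point.</p>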
455 | ||
456 | ||
457 | ||
458 | <h3><a NAME="General Category"></a>General Category</h3> | |
459 | ||
460 | ||
461 | ||
462 | <p>The values in this field are abbreviations for the following. Some of the values are | |
463 | ||
464 | normative, and some are informative. For more information, see the Unicode Standard.</p> | |
465 | ||
466 | ||
467 | ||
468 | <p><b>Note:</b> the standard does not assign information to control characters (except for | |
469 | ||
470 | certain cases in the Bidirectional Algorithm). Implementations will generally also assign | |
471 | ||
472 | categories to certain control characters, notably CR and LF, according to platform | |
473 | ||
474 | conventions.</p> | |
475 | ||
476 | ||
477 | ||
478 | <h4>Normative Categories</h4> | |
479 | ||
480 | ||
481 | ||
482 | <table BORDER="0" CELLSPACING="2" CELLPADDING="0"> | |
483 | ||
484 | <tr> | |
485 | ||
486 | <th><p ALIGN="LEFT">Abbr.</th> | |
487 | ||
488 | <th><p ALIGN="LEFT">Description</th> | |
489 | ||
490 | </tr> | |
491 | ||
492 | <tr> | |
493 | ||
494 | <td ALIGN="CENTER">Lu</td> | |
495 | ||
496 | <td>Letter, Uppercase</td> | |
497 | ||
498 | </tr> | |
499 | ||
500 | <tr> | |
501 | ||
502 | <td ALIGN="CENTER">Ll</td> | |
503 | ||
504 | <td>Letter, Lowercase</td> | |
505 | ||
506 | </tr> | |
507 | ||
508 | <tr> | |
509 | ||
510 | <td ALIGN="CENTER">Lt</td> | |
511 | ||
512 | <td>Letter, Titlecase</td> | |
513 | ||
514 | </tr> | |
515 | ||
516 | <tr> | |
517 | ||
518 | <td ALIGN="CENTER">Mn</td> | |
519 | ||
520 | <td>Mark, Non-Spacing</td> | |
521 | ||
522 | </tr> | |
523 | ||
524 | <tr> | |
525 | ||
526 | <td ALIGN="CENTER">Mc</td> | |
527 | ||
528 | <td>Mark, Spacing Combining</td> | |
529 | ||
530 | </tr> | |
531 | ||
532 | <tr> | |
533 | ||
534 | <td ALIGN="CENTER">Me</td> | |
535 | ||
536 | <td>Mark, Enclosing</td> | |
537 | ||
538 | </tr> | |
539 | ||
540 | <tr> | |
541 | ||
542 | <td ALIGN="CENTER">Nd</td> | |
543 | ||
544 | <td>Number, Decimal Digit</td> | |
545 | ||
546 | </tr> | |
547 | ||
548 | <tr> | |
549 | ||
550 | <td ALIGN="CENTER">Nl</td> | |
551 | ||
552 | <td>Number, Letter</td> | |
553 | ||
554 | </tr> | |
555 | ||
556 | <tr> | |
557 | ||
558 | <td ALIGN="CENTER">No</td> | |
559 | ||
560 | <td>Number, Other</td> | |
561 | ||
562 | </tr> | |
563 | ||
564 | <tr> | |
565 | ||
566 | <td ALIGN="CENTER">Zs</td> | |
567 | ||
568 | <td>Separator, Space</td> | |
569 | ||
570 | </tr> | |
571 | ||
572 | <tr> | |
573 | ||
574 | <td ALIGN="CENTER">Zl</td> | |
575 | ||
576 | <td>Separator, Line</td> | |
577 | ||
578 | </tr> | |
579 | ||
580 | <tr> | |
581 | ||
582 | <td ALIGN="CENTER">Zp</td> | |
583 | ||
584 | <td>Separator, Paragraph</td> | |
585 | ||
586 | </tr> | |
587 | ||
588 | <tr> | |
589 | ||
590 | <td ALIGN="CENTER">Cc</td> | |
591 | ||
592 | <td>Other, Control</td> | |
593 | ||
594 | </tr> | |
595 | ||
596 | <tr> | |
597 | ||
598 | <td ALIGN="CENTER">Cf</td> | |
599 | ||
600 | <td>Other, Format</td> | |
601 | ||
602 | </tr> | |
603 | ||
604 | <tr> | |
605 | ||
606 | <td ALIGN="CENTER">Cs</td> | |
607 | ||
608 | <td>Other, Surrogate</td> | |
609 | ||
610 | </tr> | |
611 | ||
612 | <tr> | |
613 | ||
614 | <td ALIGN="CENTER">Co</td> | |
615 | ||
616 | <td>Other, Private Use</td> | |
617 | ||
618 | </tr> | |
619 | ||
620 | <tr> | |
621 | ||
622 | <td ALIGN="CENTER">Cn</td> | |
623 | ||
624 | <td>Other, Not Assigned (no characters in the file have this property)</td> | |
625 | ||
626 | </tr> | |
627 | ||
628 | </table> | |
629 | ||
630 | ||
631 | ||
632 | <h4>Informative Categories</h4> | |
633 | ||
634 | ||
635 | ||
636 | <table BORDER="0" CELLSPACING="2" CELLPADDING="0"> | |
637 | ||
638 | <tr> | |
639 | ||
640 | <th><p ALIGN="LEFT">Abbr.</th> | |
641 | ||
642 | <th><p ALIGN="LEFT">Description</th> | |
643 | ||
644 | </tr> | |
645 | ||
646 | <tr> | |
647 | ||
648 | <td ALIGN="CENTER">Lm</td> | |
649 | ||
650 | <td>Letter, Modifier</td> | |
651 | ||
652 | </tr> | |
653 | ||
654 | <tr> | |
655 | ||
656 | <td ALIGN="CENTER">Lo</td> | |
657 | ||
658 | <td>Letter, Other</td> | |
659 | ||
660 | </tr> | |
661 | ||
662 | <tr> | |
663 | ||
664 | <td ALIGN="CENTER">Pc</td> | |
665 | ||
666 | <td>Punctuation, Connector</td> | |
667 | ||
668 | </tr> | |
669 | ||
670 | <tr> | |
671 | ||
672 | <td ALIGN="CENTER">Pd</td> | |
673 | ||
674 | <td>Punctuation, Dash</td> | |
675 | ||
676 | </tr> | |
677 | ||
678 | <tr> | |
679 | ||
680 | <td ALIGN="CENTER">Ps</td> | |
681 | ||
682 | <td>Punctuation, Open</td> | |
683 | ||
684 | </tr> | |
685 | ||
686 | <tr> | |
687 | ||
688 | <td ALIGN="CENTER">Pe</td> | |
689 | ||
690 | <td>Punctuation, Close</td> | |
691 | ||
692 | </tr> | |
693 | ||
694 | <tr> | |
695 | ||
696 | <td ALIGN="CENTER">Pi</td> | |
697 | ||
698 | <td>Punctuation, Initial quote (may behave like Ps or Pe depending on usage)</td> | |
699 | ||
700 | </tr> | |
701 | ||
702 | <tr> | |
703 | ||
704 | <td ALIGN="CENTER">Pf</td> | |
705 | ||
706 | <td>Punctuation, Final quote (may behave like Ps or Pe depending on usage)</td> | |
707 | ||
708 | </tr> | |
709 | ||
710 | <tr> | |
711 | ||
712 | <td ALIGN="CENTER">Po</td> | |
713 | ||
714 | <td>Punctuation, Other</td> | |
715 | ||
716 | </tr> | |
717 | ||
718 | <tr> | |
719 | ||
720 | <td ALIGN="CENTER">Sm</td> | |
721 | ||
722 | <td>Symbol, Math</td> | |
723 | ||
724 | </tr> | |
725 | ||
726 | <tr> | |
727 | ||
728 | <td ALIGN="CENTER">Sc</td> | |
729 | ||
730 | <td>Symbol, Currency</td> | |
731 | ||
732 | </tr> | |
733 | ||
734 | <tr> | |
735 | ||
736 | <td ALIGN="CENTER">Sk</td> | |
737 | ||
738 | <td>Symbol, Modifier</td> | |
739 | ||
740 | </tr> | |
741 | ||
742 | <tr> | |
743 | ||
744 | <td ALIGN="CENTER">So</td> | |
745 | ||
746 | <td>Symbol, Other</td> | |
747 | ||
748 | </tr> | |
749 | ||
750 | </table> | |
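<p>The two tables above can be captured in a small lookup. This Python sketch is only an illustration: the names <code>NORMATIVE</code>, <code>INFORMATIVE</code>, <code>MAJOR_CLASS</code> and <code>describe_category</code> are hypothetical, and reading the first letter of an abbreviation as its major class is a convenience inferred from the descriptions in the tables.</p>

```python
# Abbreviations copied from the normative and informative tables.
NORMATIVE = frozenset(
    "Lu Ll Lt Mn Mc Me Nd Nl No Zs Zl Zp Cc Cf Cs Co Cn".split()
)
INFORMATIVE = frozenset(
    "Lm Lo Pc Pd Ps Pe Pi Pf Po Sm Sc Sk So".split()
)

# First letter of the abbreviation, as used in the descriptions.
MAJOR_CLASS = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "Z": "Separator",
    "C": "Other",
    "P": "Punctuation",
    "S": "Symbol",
}


def describe_category(abbr: str):
    """Return (major class, status) for a General Category abbreviation."""
    if abbr in NORMATIVE:
        status = "normative"
    elif abbr in INFORMATIVE:
        status = "informative"
    else:
        raise ValueError("unknown General Category: " + abbr)
    return MAJOR_CLASS[abbr[0]], status
```

<p>For example, <code>describe_category("Lu")</code> yields the pair <code>("Letter", "normative")</code>.</p>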
751 | ||
752 | ||
753 | ||
754 | <h3><a NAME="Bidirectional Category"></a>Bidirectional Category</h3> | |
755 | ||
756 | ||
757 | ||
758 | <p>Please refer to Chapter 3 for an explanation of the algorithm for Bidirectional | |
759 | ||
760 | Behavior and an explanation of the significance of these categories. An up-to-date version | |
761 | ||
762 | can be found on <a HREF="http://www.unicode.org/unicode/reports/tr9/">Unicode Technical | |
763 | ||
764 | Report #9: The Bidirectional Algorithm</a>. These values are normative.</p> | |
765 | ||
766 | ||
767 | ||
768 | <table BORDER="0" CELLPADDING="2"> | |
769 | ||
770 | <tr> | |
771 | ||
772 | <th VALIGN="TOP" ALIGN="LEFT"><p ALIGN="LEFT">Type</th> | |
773 | ||
774 | <th VALIGN="TOP" ALIGN="LEFT"><p ALIGN="LEFT">Description</th> | |
775 | ||
776 | </tr> | |
777 | ||
778 | <tr> | |
779 | ||
780 | <td VALIGN="TOP"><b>L</b></td> | |
781 | ||
782 | <td VALIGN="TOP">Left-to-Right</td> | |
783 | ||
784 | </tr> | |
785 | ||
786 | <tr> | |
787 | ||
788 | <td VALIGN="TOP"><b>LRE</b></td> | |
789 | ||
790 | <td VALIGN="TOP">Left-to-Right Embedding</td> | |
791 | ||
792 | </tr> | |
793 | ||
794 | <tr> | |
795 | ||
796 | <td VALIGN="TOP"><b>LRO</b></td> | |
797 | ||
798 | <td VALIGN="TOP">Left-to-Right Override</td> | |
799 | ||
800 | </tr> | |
801 | ||
802 | <tr> | |
803 | ||
804 | <td VALIGN="TOP"><b>R</b></td> | |
805 | ||
806 | <td VALIGN="TOP">Right-to-Left</td> | |
807 | ||
808 | </tr> | |
809 | ||
810 | <tr> | |
811 | ||
812 | <td VALIGN="TOP"><b>AL</b></td> | |
813 | ||
814 | <td VALIGN="TOP">Right-to-Left Arabic</td> | |
815 | ||
816 | </tr> | |
817 | ||
818 | <tr> | |
819 | ||
820 | <td VALIGN="TOP"><b>RLE</b></td> | |
821 | ||
822 | <td VALIGN="TOP">Right-to-Left Embedding</td> | |
823 | ||
824 | </tr> | |
825 | ||
826 | <tr> | |
827 | ||
828 | <td VALIGN="TOP"><b>RLO</b></td> | |
829 | ||
830 | <td VALIGN="TOP">Right-to-Left Override</td> | |
831 | ||
832 | </tr> | |
833 | ||
834 | <tr> | |
835 | ||
836 | <td VALIGN="TOP"><b>PDF</b></td> | |
837 | ||
838 | <td VALIGN="TOP">Pop Directional Format</td> | |
839 | ||
840 | </tr> | |
841 | ||
842 | <tr> | |
843 | ||
844 | <td VALIGN="TOP"><b>EN</b></td> | |
845 | ||
846 | <td VALIGN="TOP">European Number</td> | |
847 | ||
848 | </tr> | |
849 | ||
850 | <tr> | |
851 | ||
852 | <td VALIGN="TOP"><b>ES</b></td> | |
853 | ||
854 | <td VALIGN="TOP">European Number Separator</td> | |
855 | ||
856 | </tr> | |
857 | ||
858 | <tr> | |
859 | ||
860 | <td VALIGN="TOP"><b>ET</b></td> | |
861 | ||
862 | <td VALIGN="TOP">European Number Terminator</td> | |
863 | ||
864 | </tr> | |
865 | ||
866 | <tr> | |
867 | ||
868 | <td VALIGN="TOP"><b>AN</b></td> | |
869 | ||
870 | <td VALIGN="TOP">Arabic Number</td> | |
871 | ||
872 | </tr> | |
873 | ||
874 | <tr> | |
875 | ||
876 | <td VALIGN="TOP"><b>CS</b></td> | |
877 | ||
878 | <td VALIGN="TOP">Common Number Separator</td> | |
879 | ||
880 | </tr> | |
881 | ||
882 | <tr> | |
883 | ||
884 | <td VALIGN="TOP"><b>NSM</b></td> | |
885 | ||
886 | <td VALIGN="TOP">Non-Spacing Mark</td> | |
887 | ||
888 | </tr> | |
889 | ||
890 | <tr> | |
891 | ||
892 | <td VALIGN="TOP"><b>BN</b></td> | |
893 | ||
894 | <td VALIGN="TOP">Boundary Neutral</td> | |
895 | ||
896 | </tr> | |
897 | ||
898 | <tr> | |
899 | ||
900 | <td VALIGN="TOP"><b>B</b></td> | |
901 | ||
902 | <td VALIGN="TOP">Paragraph Separator</td> | |
903 | ||
904 | </tr> | |
905 | ||
906 | <tr> | |
907 | ||
908 | <td VALIGN="TOP"><b>S</b></td> | |
909 | ||
910 | <td VALIGN="TOP">Segment Separator</td> | |
911 | ||
912 | </tr> | |
913 | ||
914 | <tr> | |
915 | ||
916 | <td VALIGN="TOP"><b>WS</b></td> | |
917 | ||
918 | <td VALIGN="TOP">Whitespace</td> | |
919 | ||
920 | </tr> | |
921 | ||
922 | <tr> | |
923 | ||
924 | <td VALIGN="TOP"><b>ON</b></td> | |
925 | ||
926 | <td VALIGN="TOP">Other Neutrals</td> | |
927 | ||
928 | </tr> | |
929 | ||
930 | </table> | |
931 | ||
932 | ||
933 | ||
934 | <h3><a NAME="Character Decomposition"></a>Character Decomposition Mapping</h3> | |
935 | ||
936 | ||
937 | ||
938 | <p>The decomposition is a normative property of a character. The tags supplied with | |
939 | ||
940 | certain decomposition mappings generally indicate formatting information. Where no such | |
941 | ||
942 | tag is given, the mapping is designated as canonical. Conversely, the presence of a | |
943 | ||
944 | formatting tag also indicates that the mapping is a compatibility mapping and not a | |
945 | ||
946 | canonical mapping. In the absence of other formatting information in a compatibility | |
947 | ||
948 | mapping, the tag is used to distinguish it from canonical mappings.</p> | |
949 | ||
950 | ||
951 | ||
952 | <p>In some instances a canonical mapping or a compatibility mapping may consist of a | |
953 | ||
954 | single character. For a canonical mapping, this indicates that the character is a | |
955 | ||
956 | canonical equivalent of another single character. For a compatibility mapping, this | |
957 | ||
958 | indicates that the character is a compatibility equivalent of another single character. | |
959 | ||
960 | The compatibility formatting tags used are:</p> | |
961 | ||
962 | ||
963 | ||
964 | <table BORDER="0" CELLSPACING="2" CELLPADDING="0"> | |
965 | ||
966 | <tr> | |
967 | ||
968 | <th>Tag</th> | |
969 | ||
970 | <th><p ALIGN="LEFT">Description</th> | |
971 | ||
972 | </tr> | |
973 | ||
974 | <tr> | |
975 | ||
976 | <td ALIGN="CENTER">&lt;font&gt; </td> | |
977 | ||
978 | <td>A font variant (e.g. a blackletter form).</td> | |
979 | ||
980 | </tr> | |
981 | ||
982 | <tr> | |
983 | ||
984 | <td ALIGN="CENTER">&lt;noBreak&gt; </td> | |
985 | ||
986 | <td>A no-break version of a space or hyphen.</td> | |
987 | ||
988 | </tr> | |
989 | ||
990 | <tr> | |
991 | ||
992 | <td ALIGN="CENTER">&lt;initial&gt; </td> | |
993 | ||
994 | <td>An initial presentation form (Arabic).</td> | |
995 | ||
996 | </tr> | |
997 | ||
998 | <tr> | |
999 | ||
1000 | <td ALIGN="CENTER">&lt;medial&gt; </td> | |
1001 | ||
1002 | <td>A medial presentation form (Arabic).</td> | |
1003 | ||
1004 | </tr> | |
1005 | ||
1006 | <tr> | |
1007 | ||
1008 | <td ALIGN="CENTER">&lt;final&gt; </td> | |
1009 | ||
1010 | <td>A final presentation form (Arabic).</td> | |
1011 | ||
1012 | </tr> | |
1013 | ||
1014 | <tr> | |
1015 | ||
1016 | <td ALIGN="CENTER">&lt;isolated&gt; </td> | |
1017 | ||
1018 | <td>An isolated presentation form (Arabic).</td> | |
1019 | ||
1020 | </tr> | |
1021 | ||
1022 | <tr> | |
1023 | ||
1024 | <td ALIGN="CENTER">&lt;circle&gt; </td> | |
1025 | ||
1026 | <td>An encircled form.</td> | |
1027 | ||
1028 | </tr> | |
1029 | ||
1030 | <tr> | |
1031 | ||
1032 | <td ALIGN="CENTER">&lt;super&gt; </td> | |
1033 | ||
1034 | <td>A superscript form.</td> | |
1035 | ||
1036 | </tr> | |
1037 | ||
1038 | <tr> | |
1039 | ||
1040 | <td ALIGN="CENTER">&lt;sub&gt; </td> | |
1041 | ||
1042 | <td>A subscript form.</td> | |
1043 | ||
1044 | </tr> | |
1045 | ||
1046 | <tr> | |
1047 | ||
1048 | <td ALIGN="CENTER">&lt;vertical&gt; </td> | |
1049 | ||
1050 | <td>A vertical layout presentation form.</td> | |
1051 | ||
1052 | </tr> | |
1053 | ||
1054 | <tr> | |
1055 | ||
1056 | <td ALIGN="CENTER">&lt;wide&gt; </td> | |
1057 | ||
1058 | <td>A wide (or zenkaku) compatibility character.</td> | |
1059 | ||
1060 | </tr> | |
1061 | ||
1062 | <tr> | |
1063 | ||
1064 | <td ALIGN="CENTER">&lt;narrow&gt; </td> | |
1065 | ||
1066 | <td>A narrow (or hankaku) compatibility character.</td> | |
1067 | ||
1068 | </tr> | |
1069 | ||
1070 | <tr> | |
1071 | ||
1072 | <td ALIGN="CENTER">&lt;small&gt; </td> | |
1073 | ||
1074 | <td>A small variant form (CNS compatibility).</td> | |
1075 | ||
1076 | </tr> | |
1077 | ||
1078 | <tr> | |
1079 | ||
1080 | <td ALIGN="CENTER">&lt;square&gt; </td> | |
1081 | ||
1082 | <td>A CJK squared font variant.</td> | |
1083 | ||
1084 | </tr> | |
1085 | ||
1086 | <tr> | |
1087 | ||
1088 | <td ALIGN="CENTER">&lt;fraction&gt; </td> | |
1089 | ||
1090 | <td>A vulgar fraction form.</td> | |
1091 | ||
1092 | </tr> | |
1093 | ||
1094 | <tr> | |
1095 | ||
1096 | <td ALIGN="CENTER">&lt;compat&gt; </td> | |
1097 | ||
1098 | <td>Otherwise unspecified compatibility character.</td> | |
1099 | ||
1100 | </tr> | |
1101 | ||
1102 | </table> | |
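<p>The distinction between tagged (compatibility) and untagged (canonical) mappings can be made mechanically. The following Python sketch assumes that field 5 holds an optional tag in angle brackets followed by space-separated hexadecimal code values; the function name <code>parse_decomposition</code> is hypothetical.</p>

```python
import re
from typing import List, Optional, Tuple

# An optional leading tag such as "<super>" or "<noBreak>".
_TAG = re.compile(r"^<([A-Za-z]+)>\s*")


def parse_decomposition(field: str) -> Tuple[Optional[str], List[int]]:
    """Split a decomposition mapping into (tag, code values).

    tag is None for a canonical mapping and the tag name (without the
    angle brackets) for a compatibility mapping. An empty field means
    the character has no decomposition mapping.
    """
    if not field:
        return None, []
    tag = None
    match = _TAG.match(field)
    if match:
        tag = match.group(1)
        field = field[match.end():]
    return tag, [int(value, 16) for value in field.split()]
```

<p>A caller can then treat any result whose tag is not <code>None</code> as a compatibility mapping, regardless of which tag it is.</p>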
1103 | ||
1104 | ||
1105 | ||
1106 | <p><b>Reminder: </b>There is a difference between decomposition and decomposition mapping. | |
1107 | ||
1108 | The decomposition mappings are defined in the UnicodeData file, while the decomposition (also | |
1109 | ||
1110 | termed "full decomposition") is defined in Chapter 3 to use those mappings | |
1111 | <i> | |
1112 | ||
1113 | recursively.</i> | |
1114 | ||
1115 | ||
1116 | ||
1117 | <ul> | |
1118 | ||
1119 | <li>The canonical decomposition is formed by recursively applying the canonical mappings, | |
1120 | ||
1121 | then applying the canonical reordering algorithm. </li> | |
1122 | ||
1123 | <li>The compatibility decomposition is formed by recursively applying the canonical <em>and</em> | |
1124 | ||
1125 | compatibility mappings, then applying the canonical reordering algorithm. </li> | |
1126 | ||
1127 | </ul> | |
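<p>The recursive step described in the two items above can be sketched in Python as follows. This is an illustration under stated assumptions: <code>SAMPLE_MAPPINGS</code> is a tiny hand-picked table of (tag, code values) pairs, not the real data, the name <code>decompose</code> is hypothetical, and the canonical reordering step is deliberately left out.</p>

```python
from typing import Dict, List, Optional, Tuple

Mapping = Tuple[Optional[str], List[int]]

# Illustrative subset: tag is None for canonical mappings.
SAMPLE_MAPPINGS: Dict[int, Mapping] = {
    0x01D5: (None, [0x00DC, 0x0304]),  # U with diaeresis and macron
    0x00DC: (None, [0x0055, 0x0308]),  # U with diaeresis
    0x00B2: ("super", [0x0032]),       # superscript two
}


def decompose(code: int,
              mappings: Dict[int, Mapping],
              compatibility: bool = False) -> List[int]:
    """Apply decomposition mappings recursively to one code value.

    With compatibility=False only untagged (canonical) mappings are
    followed; with compatibility=True tagged mappings are followed too.
    Reordering by combining class is NOT performed here.
    """
    entry = mappings.get(code)
    if entry is None:
        return [code]
    tag, targets = entry
    if tag is not None and not compatibility:
        return [code]
    result: List[int] = []
    for target in targets:
        result.extend(decompose(target, mappings, compatibility))
    return result
```

<p>In this sample, U+01D5 first maps to U+00DC U+0304, and the second pass expands U+00DC, which is why a single look-up is not enough to reach the full decomposition.</p>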
1128 | ||
1129 | ||
1130 | ||
1131 | <h3><a NAME="Canonical Combining Classes"></a>Canonical Combining Classes</h3> | |
1132 | ||
1133 | ||
1134 | ||
1135 | <table BORDER="0" CELLSPACING="2" CELLPADDING="0"> | |
1136 | ||
1137 | <tr> | |
1138 | ||
1139 | <th><p ALIGN="LEFT">Value</th> | |
1140 | ||
1141 | <th><p ALIGN="LEFT">Description</th> | |
1142 | ||
1143 | </tr> | |
1144 | ||
1145 | <tr> | |
1146 | ||
1147 | <td ALIGN="RIGHT">0:</td> | |
1148 | ||
1149 | <td>Spacing, split, enclosing, reordrant, and Tibetan subjoined</td> | |
1150 | ||
1151 | </tr> | |
1152 | ||
1153 | <tr> | |
1154 | ||
1155 | <td ALIGN="RIGHT">1:</td> | |
1156 | ||
1157 | <td>Overlays and interior</td> | |
1158 | ||
1159 | </tr> | |
1160 | ||
1161 | <tr> | |
1162 | ||
1163 | <td ALIGN="RIGHT">7:</td> | |
1164 | ||
1165 | <td>Nuktas</td> | |
1166 | ||
1167 | </tr> | |
1168 | ||
1169 | <tr> | |
1170 | ||
1171 | <td ALIGN="RIGHT">8:</td> | |
1172 | ||
1173 | <td>Hiragana/Katakana voicing marks</td> | |
1174 | ||
1175 | </tr> | |
1176 | ||
1177 | <tr> | |
1178 | ||
1179 | <td ALIGN="RIGHT">9:</td> | |
1180 | ||
1181 | <td>Viramas</td> | |
1182 | ||
1183 | </tr> | |
1184 | ||
1185 | <tr> | |
1186 | ||
1187 | <td ALIGN="RIGHT">10:</td> | |
1188 | ||
1189 | <td>Start of fixed position classes</td> | |
1190 | ||
1191 | </tr> | |
1192 | ||
1193 | <tr> | |
1194 | ||
1195 | <td ALIGN="RIGHT">199:</td> | |
1196 | ||
1197 | <td>End of fixed position classes</td> | |
1198 | ||
1199 | </tr> | |
1200 | ||
1201 | <tr> | |
1202 | ||
1203 | <td ALIGN="RIGHT">200:</td> | |
1204 | ||
1205 | <td>Below left attached</td> | |
1206 | ||
1207 | </tr> | |
1208 | ||
1209 | <tr> | |
1210 | ||
1211 | <td ALIGN="RIGHT">202:</td> | |
1212 | ||
1213 | <td>Below attached</td> | |
1214 | ||
1215 | </tr> | |
1216 | ||
1217 | <tr> | |
1218 | ||
1219 | <td ALIGN="RIGHT">204:</td> | |
1220 | ||
1221 | <td>Below right attached</td> | |
1222 | ||
1223 | </tr> | |
1224 | ||
1225 | <tr> | |
1226 | ||
1227 | <td ALIGN="RIGHT">208:</td> | |
1228 | ||
1229 | <td>Left attached (reordrant around single base character)</td> | |
1230 | ||
1231 | </tr> | |
1232 | ||
1233 | <tr> | |
1234 | ||
1235 | <td ALIGN="RIGHT">210:</td> | |
1236 | ||
1237 | <td>Right attached</td> | |
1238 | ||
1239 | </tr> | |
1240 | ||
1241 | <tr> | |
1242 | ||
1243 | <td ALIGN="RIGHT">212:</td> | |
1244 | ||
1245 | <td>Above left attached</td> | |
1246 | ||
1247 | </tr> | |
1248 | ||
1249 | <tr> | |
1250 | ||
1251 | <td ALIGN="RIGHT">214:</td> | |
1252 | ||
1253 | <td>Above attached</td> | |
1254 | ||
1255 | </tr> | |
1256 | ||
1257 | <tr> | |
1258 | ||
1259 | <td ALIGN="RIGHT">216:</td> | |
1260 | ||
1261 | <td>Above right attached</td> | |
1262 | ||
1263 | </tr> | |
1264 | ||
1265 | <tr> | |
1266 | ||
1267 | <td ALIGN="RIGHT">218:</td> | |
1268 | ||
1269 | <td>Below left</td> | |
1270 | ||
1271 | </tr> | |
1272 | ||
1273 | <tr> | |
1274 | ||
1275 | <td ALIGN="RIGHT">220:</td> | |
1276 | ||
1277 | <td>Below</td> | |
1278 | ||
1279 | </tr> | |
1280 | ||
1281 | <tr> | |
1282 | ||
1283 | <td ALIGN="RIGHT">222:</td> | |
1284 | ||
1285 | <td>Below right</td> | |
1286 | ||
1287 | </tr> | |
1288 | ||
1289 | <tr> | |
1290 | ||
1291 | <td ALIGN="RIGHT">224:</td> | |
1292 | ||
1293 | <td>Left (reordrant around single base character)</td> | |
1294 | ||
1295 | </tr> | |
1296 | ||
1297 | <tr> | |
1298 | ||
1299 | <td ALIGN="RIGHT">226:</td> | |
1300 | ||
1301 | <td>Right</td> | |
1302 | ||
1303 | </tr> | |
1304 | ||
1305 | <tr> | |
1306 | ||
1307 | <td ALIGN="RIGHT">228:</td> | |
1308 | ||
1309 | <td>Above left</td> | |
1310 | ||
1311 | </tr> | |
1312 | ||
1313 | <tr> | |
1314 | ||
1315 | <td ALIGN="RIGHT">230:</td> | |
1316 | ||
1317 | <td>Above</td> | |
1318 | ||
1319 | </tr> | |
1320 | ||
1321 | <tr> | |
1322 | ||
1323 | <td ALIGN="RIGHT">232:</td> | |
1324 | ||
1325 | <td>Above right</td> | |
1326 | ||
1327 | </tr> | |
1328 | ||
1329 | <tr> | |
1330 | ||
1331 | <td ALIGN="RIGHT">233:</td> | |
1332 | ||
1333 | <td>Double below</td> | |
1334 | ||
1335 | </tr> | |
1336 | ||
1337 | <tr> | |
1338 | ||
1339 | <td ALIGN="RIGHT">234:</td> | |
1340 | ||
1341 | <td>Double above</td> | |
1342 | ||
1343 | </tr> | |
1344 | ||
1345 | <tr> | |
1346 | ||
1347 | <td ALIGN="RIGHT">240:</td> | |
1348 | ||
1349 | <td>Below (iota subscript)</td> | |
1350 | ||
1351 | </tr> | |
1352 | ||
1353 | </table> | |
1354 | ||
1355 | ||
1356 | ||
1357 | <p><strong>Note: </strong>some of the combining classes in this list do not currently have | |
1358 | ||
1359 | members but are specified here for completeness.</p> | |
1360 | ||
1361 | ||
1362 | ||
1363 | <h3><a NAME="Decompositions and Normalization"></a>Decompositions and Normalization</h3> | |
1364 | ||
1365 | ||
1366 | ||
1367 | <p>Decomposition is specified in Chapter 3. <a href="http://www.unicode.org/unicode/reports/tr15/"><i>Unicode Technical Report #15: | |
1368 | ||
1369 | Normalization Forms</i></a> specifies the interaction between decomposition and normalization. The | |
1370 | ||
1371 | most up-to-date version is found on <a HREF="http://www.unicode.org/unicode/reports/tr15/">http://www.unicode.org/unicode/reports/tr15/</a>. | |
1372 | ||
1373 | That report specifies how the decompositions defined in UnicodeData.txt are used to derive | |
1374 | ||
1375 | normalized forms of Unicode text.</p> | |
1376 | ||
1377 | ||
1378 | ||
1379 | <p>Note that as of the 2.1.9 update of the Unicode Character Database, the decompositions | |
1380 | ||
1381 | in the UnicodeData.txt file can be used to recursively derive the full decomposition in | |
1382 | ||
1383 | canonical order, without the need to separately apply canonical reordering. However, | |
1384 | ||
1385 | canonical reordering of combining character sequences must still be applied in | |
1386 | ||
1387 | decomposition when normalizing source text which contains any combining marks.</p> | |
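<p>The following is a minimal sketch of the two steps described above, not a reference implementation. It assumes Python's standard <tt>unicodedata</tt> module as a stand-in for a parsed UnicodeData.txt (so it reflects whatever Unicode version that module ships with), it ignores the algorithmic decomposition of Hangul syllables, and the function names are illustrative only.</p>

```python
import unicodedata


def full_canonical_decomposition(ch):
    """Recursively expand the canonical decomposition mapping of one character."""
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings carry a leading tag such as "<compat>"; skip them.
    if not mapping or mapping.startswith("<"):
        return ch
    return "".join(
        full_canonical_decomposition(chr(int(cp, 16))) for cp in mapping.split()
    )


def canonical_reorder(text):
    """Stable-sort every run of non-zero combining classes by class value."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A class-0 character ends the current run of combining marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


def decompose(text):
    """Decompose each character, then reorder marks that came from the source."""
    expanded = "".join(full_canonical_decomposition(c) for c in text)
    return canonical_reorder(expanded)
```

<p>The recursive step alone already yields canonical order for a single precomposed character such as U+1EAD; the reordering step matters only when the source text itself contains combining marks in non-canonical order.</p>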
1388 | ||
1389 | ||
1390 | ||
1391 | <h3><a NAME="Case Mappings"></a>Case Mappings</h3> | |
1392 | ||
1393 | ||
1394 | ||
1395 | <p>The case mapping is an informative, default mapping. Case itself, on the other hand, | |
1396 | ||
1397 | has normative status. Thus, for example, 0041 LATIN CAPITAL LETTER A is normatively | |
1398 | ||
1399 | uppercase, but its lowercase mapping to 0061 LATIN SMALL LETTER A is informative. The | |
1400 | ||
1401 | reason for this is that case can be considered to be an inherent property of a particular | |
1402 | ||
1403 | character (and is usually, but not always, derivable from the presence of the terms | |
1404 | ||
1405 | "CAPITAL" or "SMALL" in the character name), but case mappings between | |
1406 | ||
1407 | characters are occasionally influenced by local conventions. For example, certain | |
1408 | ||
1409 | languages, such as Turkish, German, French, or Greek, may have small deviations from the | |
1410 | ||
1411 | default mappings listed in UnicodeData.</p> | |
1412 | ||
1413 | ||
1414 | ||
1415 | <p>In addition to uppercase and lowercase, because of the inclusion of certain composite | |
1416 | ||
1417 | characters for compatibility, such as 01F1 LATIN CAPITAL LETTER DZ, there is a third case, | |
1418 | ||
1419 | called <i>titlecase</i>, which is used where the first letter of a word is to be | |
1420 | ||
1421 | capitalized (e.g. UPPERCASE, Titlecase, lowercase). An example of such a titlecase letter | |
1422 | ||
1423 | is 01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z.</p> | |
1424 | ||
1425 | ||
1426 | ||
1427 | <p>The uppercase, titlecase and lowercase fields are only included for characters that | |
1428 | ||
1429 | have a single corresponding character of that type. Composite characters (such as | |
1430 | ||
1431 | "339D SQUARE CM") that do not have a single corresponding character of that type | |
1432 | ||
1433 | can be cased by decomposition.</p> | |
1434 | ||
1435 | ||
1436 | ||
1437 | <p>For compatibility with existing parsers, UnicodeData only contains case mappings for | |
1438 | ||
1439 | characters where they are one-to-one mappings; it also omits information about | |
1440 | ||
1441 | context-sensitive case mappings. Information about these special cases can be found in a | |
1442 | ||
1443 | separate data file, SpecialCasing.txt, | |
1444 | ||
1445 | which has been added starting with the 2.1.8 update to the Unicode data files. | |
1446 | ||
1447 | SpecialCasing.txt contains additional informative case mappings that are either not | |
1448 | ||
1449 | one-to-one or which are context-sensitive.</p> | |
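<p>As an illustration of the one-to-one case fields, the sketch below reads the uppercase, lowercase and titlecase mappings from a single semicolon-separated record. It assumes 15 fields per record with the three case mappings in the last three positions; the two sample records are abbreviated for illustration (the Unicode 1.0 name field is left empty) and the helper name is hypothetical.</p>

```python
# Illustrative records modeled on the UnicodeData.txt layout (assumed, abbreviated).
SAMPLE_A = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
SAMPLE_DZ = (
    "01F2;LATIN CAPITAL LETTER D WITH SMALL LETTER Z;Lt;0;L;"
    "<compat> 0044 007A;;;;N;;;01F1;01F3;01F2"
)


def case_fields(line):
    """Return the code point, category and simple case mappings of one record."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != 15:
        raise ValueError("expected 15 fields, got %d" % len(fields))

    def code_point(value):
        # An empty field means there is no single-character mapping.
        return int(value, 16) if value else None

    return {
        "code": int(fields[0], 16),
        "category": fields[2],
        "upper": code_point(fields[12]),
        "lower": code_point(fields[13]),
        "title": code_point(fields[14]),
    }
```

<p>An empty case field does not mean the character is caseless; mappings that are not one-to-one or that depend on context are carried in SpecialCasing.txt, which this sketch does not read.</p>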
1450 | ||
1451 | ||
1452 | ||
1453 | <h2><a NAME="Property Invariants"></a>Property Invariants</h2> | |
1454 | ||
1455 | ||
1456 | ||
1457 | <p>Values in UnicodeData.txt are subject to correction as errors are found; however, some | |
1458 | ||
1459 | characteristics of the categories themselves can be considered invariants. Applications | |
1460 | ||
1461 | may wish to take these invariants into account when choosing how to implement character | |
1462 | ||
1463 | properties. The following is a partial list of known invariants for the Unicode Character | |
1464 | ||
1465 | Database.</p> | |
1466 | ||
1467 | ||
1468 | ||
1469 | <h4>Database Fields</h4> | |
1470 | ||
1471 | ||
1472 | ||
1473 | <ul> | |
1474 | ||
1475 | <li>The number of fields in UnicodeData.txt is fixed. </li> | |
1476 | ||
1477 | <li>The order of the fields is also fixed. <ul> | |
1478 | ||
1479 | <li>Any additional information about character properties to be added in the future will | |
1480 | ||
1481 | appear in separate data tables, rather than being added on to the existing table or by | |
1482 | ||
1483 | subdivision or reinterpretation of existing fields. </li> | |
1484 | ||
1485 | </ul> | |
1486 | ||
1487 | </li> | |
1488 | ||
1489 | </ul> | |
1490 | ||
1491 | ||
1492 | ||
1493 | <h4>General Category</h4> | |
1494 | ||
1495 | ||
1496 | ||
1497 | <ul> | |
1498 | ||
1499 | <li>There will never be more than 32 General Category values. <ul> | |
1500 | ||
1501 | <li>It is very unlikely that the Unicode Technical Committee will subdivide the General | |
1502 | ||
1503 | Category partition any further, since that can cause implementations to misbehave. Because | |
1504 | ||
1505 | the General Category is limited to 32 values, 5 bits can be used to represent the | |
1506 | ||
1507 | information, and a 32-bit integer can be used as a bitmask to represent arbitrary sets of | |
1508 | ||
1509 | categories. </li> | |
1510 | ||
1511 | </ul> | |
1512 | ||
1513 | </li> | |
1514 | ||
1515 | </ul> | |
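<p>The bitmask representation mentioned above might look like the following sketch. The ordering of categories within the mask is an arbitrary choice made here, and Python's <tt>unicodedata</tt> module is assumed as the source of General Category values.</p>

```python
import unicodedata

# One bit per General Category value; the bit order is an arbitrary assumption.
CATEGORIES = [
    "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Nd", "Nl",
    "No", "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc",
    "Sk", "So", "Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn",
]
BIT = {name: 1 << index for index, name in enumerate(CATEGORIES)}


def mask_of(*names):
    """Combine several categories into one integer that fits in 32 bits."""
    mask = 0
    for name in names:
        mask |= BIT[name]
    return mask


LETTERS = mask_of("Lu", "Ll", "Lt", "Lm", "Lo")
CASED = mask_of("Lu", "Ll", "Lt")


def in_set(ch, mask):
    """Test membership of a character's category in a category set."""
    return bool(BIT[unicodedata.category(ch)] & mask)
```

<p>Set membership then costs one table lookup and one bitwise AND, for example <tt>in_set("A", LETTERS)</tt>.</p>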
1516 | ||
1517 | ||
1518 | ||
1519 | <h4>Combining Classes</h4> | |
1520 | ||
1521 | ||
1522 | ||
1523 | <ul> | |
1524 | ||
1525 | <li>Combining classes are limited to the values 0 to 255. <ul> | |
1526 | ||
1527 | <li>In practice, there are far fewer than 256 values used. Implementations may take | |
1528 | ||
1529 | advantage of this fact for compression, since only the ordering of the non-zero values | |
1530 | ||
1531 | matters for the Canonical Reordering Algorithm. It is possible for up to 256 values to be | |
1532 | ||
1533 | used in the future; however, UTC decisions in the future may restrict the number of values | |
1534 | ||
1535 | to 128, since this has implementation advantages. [Signed bytes can be used without | |
1536 | ||
1537 | widening to ints in Java, for example.] </li> | |
1538 | ||
1539 | </ul> | |
1540 | ||
1541 | </li> | |
1542 | ||
1543 | <li>All characters other than those of General Category M* have the combining class 0. <ul> | |
1544 | ||
1545 | <li>Currently, all characters other than those of General Category Mn have the value 0. | |
1546 | ||
1547 | However, some characters of General Category Me or Mc may be given non-zero values in the | |
1548 | ||
1549 | future. </li> | |
1550 | ||
1551 | <li>The precise values above the value 0 are not invariant--only the relative ordering is | |
1552 | ||
1553 | considered normative. For example, it is not guaranteed in future versions that the class | |
1554 | ||
1555 | of U+05B4 will be precisely 14. </li> | |
1556 | ||
1557 | </ul> | |
1558 | ||
1559 | </li> | |
1560 | ||
1561 | </ul> | |
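<p>Because only the relative ordering of the non-zero values matters for canonical reordering, an implementation could store dense ranks in place of the published class values. The sketch below shows one way to build such a table; the function name and the sample values are illustrative assumptions.</p>

```python
def build_rank_table(used_classes):
    """Map each combining class in use to a small rank that preserves order."""
    ranks = {0: 0}  # class 0 keeps its special meaning
    nonzero = sorted(c for c in set(used_classes) if c != 0)
    for rank, combining_class in enumerate(nonzero, start=1):
        if not 0 < combining_class <= 255:
            raise ValueError("combining class out of range: %r" % combining_class)
        ranks[combining_class] = rank
    return ranks


# A few classes from the table above, in no particular order.
RANKS = build_rank_table([0, 230, 220, 9, 240, 230])
```

<p>Sorting marks by rank gives the same result as sorting by the original class values, since the mapping is strictly increasing on the non-zero classes. A table built this way has to be regenerated for each version of the data, because the precise values are not invariant.</p>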
1562 | ||
1563 | ||
1564 | ||
1565 | <h4>Case</h4> | |
1566 | ||
1567 | ||
1568 | ||
1569 | <ul> | |
1570 | ||
1571 | <li>Characters of type Lu, Lt, or Ll are called <i>cased</i>. All characters with an Upper, | |
1572 | ||
1573 | Lower, or Titlecase mapping are cased characters. <ul> | |
1574 | ||
1575 | <li>However, characters with the General Categories of Lu, Ll, or Lt may not always have | |
1576 | ||
1577 | case mappings, and case mappings may vary by locale. (See | |
1578 | ||
1579 | ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt). </li> | |
1580 | ||
1581 | </ul> | |
1582 | ||
1583 | </li> | |
1584 | ||
1585 | </ul> | |
1586 | ||
1587 | ||
1588 | ||
1589 | <h4>Canonical Decomposition</h4> | |
1590 | ||
1591 | ||
1592 | ||
1593 | <ul> | |
1594 | ||
1595 | <li>Canonical mappings are always in canonical order. </li> | |
1596 | ||
1597 | <li>Canonical mappings have only the first of a pair possibly further decomposing. </li> | |
1598 | ||
1599 | <li>Canonical decompositions are "transparent" to other character data: <ul> | |
1600 | ||
1601 | <li><tt>BIDI(a) = BIDI(principal(canonicalDecomposition(a)))</tt> </li> | |
1602 | ||
1603 | <li><tt>Category(a) = Category(principal(canonicalDecomposition(a)))</tt> </li> | |
1604 | ||
1605 | <li><tt>CombiningClass(a) = CombiningClass(principal(canonicalDecomposition(a)))</tt><br> | |
1606 | ||
1607 | where principal(a) is the first character not of type Mn, or the first character if all | |
1608 | ||
1609 | characters are of type Mn. </li> | |
1610 | ||
1611 | </ul> | |
1612 | ||
1613 | </li> | |
1614 | ||
1615 | <li>However, because there are sometimes missing case pairs, and because of some legacy | |
1616 | ||
1617 | characters, it is only generally true that: <ul> | |
1618 | ||
1619 | <li><tt>upper(canonicalDecomposition(a)) = canonicalDecomposition(upper(a))</tt> </li> | |
1620 | ||
1621 | <li><tt>lower(canonicalDecomposition(a)) = canonicalDecomposition(lower(a))</tt> </li> | |
1622 | ||
1623 | <li><tt>title(canonicalDecomposition(a)) = canonicalDecomposition(title(a))</tt> </li> | |
1624 | ||
1625 | </ul> | |
1626 | ||
1627 | </li> | |
1628 | ||
1629 | </ul> | |
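<p>The transparency invariants above can be spot-checked with a short sketch such as the following. It assumes Python's <tt>unicodedata</tt> module as the data source, so it tests the version of the database bundled with the interpreter and not 3.0.0 specifically; all function names are hypothetical.</p>

```python
import unicodedata


def canonical_decomposition(ch):
    """Return the one-level canonical mapping of ch, or ch itself if none."""
    mapping = unicodedata.decomposition(ch)
    # Tagged mappings such as "<compat> ..." are compatibility decompositions.
    if not mapping or mapping.startswith("<"):
        return ch
    return "".join(chr(int(cp, 16)) for cp in mapping.split())


def principal(sequence):
    """First character not of type Mn, or the first character if all are Mn."""
    for ch in sequence:
        if unicodedata.category(ch) != "Mn":
            return ch
    return sequence[0]


def is_transparent(ch):
    """Check the BIDI, Category and CombiningClass invariants for one character."""
    p = principal(canonical_decomposition(ch))
    return (
        unicodedata.bidirectional(ch) == unicodedata.bidirectional(p)
        and unicodedata.category(ch) == unicodedata.category(p)
        and unicodedata.combining(ch) == unicodedata.combining(p)
    )
```

<p>Running <tt>is_transparent</tt> over every character with a canonical mapping is one way to look for records that depart from these invariants.</p>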
1630 | ||
1631 | ||
1632 | ||
1633 | <h2><a NAME="Modification History"></a>Modification History</h2> | |
1634 | ||
1635 | ||
1636 | ||
1637 | <p>This section provides a summary of the changes between update versions of the Unicode | |
1638 | ||
1639 | Standard.</p> | |
1640 | ||
1641 | ||
1642 | ||
1643 | <h3><a href="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 3.0.0"> Unicode 3.0.0</a></h3> | |
1644 | ||
1645 | ||
1646 | ||
1647 | <p>Modifications made for Version 3.0.0 of UnicodeData.txt include many new characters and | |
1648 | ||
1649 | a number of property changes. These are summarized in Appendix D of <em>The Unicode | |
1650 | ||
1651 | Standard, Version 3.0.</em></p> | |
1652 | ||
1653 | ||
1654 | ||
1655 | <h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.9">Unicode 2.1.9</a> </h3> | |
1656 | ||
1657 | ||
1658 | ||
1659 | <p>Modifications made for Version 2.1.9 of UnicodeData.txt include: | |
1660 | ||
1661 | ||
1662 | ||
1663 | <ul> | |
1664 | ||
1665 | <li>Corrected combining class for U+05AE HEBREW ACCENT ZINOR. </li> | |
1666 | ||
1667 | <li>Corrected combining class for U+20E1 COMBINING LEFT RIGHT ARROW ABOVE. </li> | |
1668 | ||
1669 | <li>Corrected combining class for U+0F35 and U+0F37 to 220. </li> | |
1670 | ||
1671 | <li>Corrected combining class for U+0F71 to 129. </li> | |
1672 | ||
1673 | <li>Added a decomposition for U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR. </li> | |
1674 | ||
1675 | <li>Added decompositions for several Greek symbol letters: U+03D0..U+03D2, U+03D5, | |
1676 | ||
1677 | U+03D6, U+03F0..U+03F2. </li> | |
1678 | ||
1679 | <li>Removed decompositions from the conjoining jamo block: U+1100..U+11F8. </li> | |
1680 | ||
1681 | <li>Changes to decomposition mappings for some Tibetan vowels for consistency in | |
1682 | ||
1683 | normalization. (U+0F71, U+0F73, U+0F77, U+0F79, U+0F81) </li> | |
1684 | ||
1685 | <li>Updated the decomposition mappings for several Vietnamese characters with two diacritics | |
1686 | ||
1687 | (U+1EAC, U+1EAD, U+1EB6, U+1EB7, U+1EC6, U+1EC7, U+1ED8, U+1ED9), so that the recursive | |
1688 | ||
1689 | decomposition can be generated directly in canonically reordered form (not a normative | |
1690 | ||
1691 | change). </li> | |
1692 | ||
1693 | <li>Updated the decomposition mappings for several Arabic compatibility characters involving | |
1694 | ||
1695 | shadda (U+FC5E..U+FC62, U+FCF2..U+FCF4), and two Latin characters (U+1E1C, U+1E1D), so | |
1696 | ||
1697 | that the decompositions are generated directly in canonically reordered form (not a | |
1698 | ||
1699 | normative change). </li> | |
1700 | ||
1701 | <li>Changed BIDI category for: U+00A0 NO-BREAK SPACE, U+2007 FIGURE SPACE, U+2028 LINE | |
1702 | ||
1703 | SEPARATOR. </li> | |
1704 | ||
1705 | <li>Changed BIDI category for extenders of General Category Lm: U+3005, U+3021..U+3035, | |
1706 | ||
1707 | U+FF9E, U+FF9F. </li> | |
1708 | ||
1709 | <li>Changed General Category and BIDI category for the Greek numeral signs: U+0374, U+0375. </li> | |
1710 | ||
1711 | <li>Corrected General Category for U+FFE8 HALFWIDTH FORMS LIGHT VERTICAL. </li> | |
1712 | ||
1713 | <li>Added Unicode 1.0 names for many Tibetan characters (informative). </li> | |
1714 | ||
1715 | </ul> | |
1716 | ||
1717 | ||
1718 | ||
1719 | <h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.8">Unicode 2.1.8</a> </h3> | |
1720 | ||
1721 | ||
1722 | ||
1723 | <p>Modifications made for Version 2.1.8 of UnicodeData.txt include: | |
1724 | ||
1725 | ||
1726 | ||
1727 | <ul> | |
1728 | ||
1729 | <li>Added combining class 240 for U+0345 COMBINING GREEK YPOGEGRAMMENI so that | |
1730 | ||
1731 | decompositions involving iota subscript are derivable directly in canonically reordered | |
1732 | ||
1733 | form; this also has a bearing on simplification of casing of polytonic Greek. </li> | |
1734 | ||
1735 | <li>Changes in decompositions related to Greek tonos. These result from the clarification | |
1736 | ||
1737 | that monotonic Greek "tonos" should be equated with U+0301 COMBINING ACUTE, | |
1738 | ||
1739 | rather than with U+030D COMBINING VERTICAL LINE ABOVE. (All Greek characters in the Greek | |
1740 | ||
1741 | block involving "tonos"; some Greek characters in the polytonic Greek in the | |
1742 | ||
1743 | 1FXX block.) </li> | |
1744 | ||
1745 | <li>Changed decompositions involving dialytika tonos. (U+0390, U+03B0) </li> | |
1746 | ||
1747 | <li>Changed ternary decompositions to binary. (U+0CCB, U+FB2C, U+FB2D) These changes | |
1748 | ||
1749 | simplify normalization. </li> | |
1750 | ||
1751 | <li>Removed canonical decomposition for Latin Candrabindu. (U+0310) </li> | |
1752 | ||
1753 | <li>Corrected error in canonical decomposition for U+1FF4. </li> | |
1754 | ||
1755 | <li>Added compatibility decompositions to clarify collation tables. (U+2100, U+2101, U+2105, | |
1756 | ||
1757 | U+2106, U+1E9A) </li> | |
1758 | ||
1759 | <li>A series of general category changes to assist the convergence of the Unicode definition | |
1760 | ||
1761 | of identifier with ISO TR 10176: <ul> | |
1762 | ||
1763 | <li>So > Lo: U+0950, U+0AD0, U+0F00, U+0F88..U+0F8B </li> | |
1764 | ||
1765 | <li>Po > Lo: U+0E2F, U+0EAF, U+3006 </li> | |
1766 | ||
1767 | <li>Lm > Sk: U+309B, U+309C </li> | |
1768 | ||
1769 | <li>Po > Pc: U+30FB, U+FF65 </li> | |
1770 | ||
1771 | <li>Ps/Pe > Mn: U+0F3E, U+0F3F </li> | |
1772 | ||
1773 | </ul> | |
1774 | ||
1775 | </li> | |
1776 | ||
1777 | <li>A series of bidi property changes for consistency. <ul> | |
1778 | ||
1779 | <li>L > ET: U+09F2, U+09F3 </li> | |
1780 | ||
1781 | <li>ON > L: U+3007 </li> | |
1782 | ||
1783 | <li>L > ON: U+0F3A..U+0F3D, U+037E, U+0387 </li> | |
1784 | ||
1785 | </ul> | |
1786 | ||
1787 | </li> | |
1788 | ||
1789 | <li>Added case mapping: U+01A6 <-> U+0280 </li> | |
1790 | ||
1791 | <li>Updated symmetric swapping value for guillemets: U+00AB, U+00BB, U+2039, U+203A. </li> | |
1792 | ||
1793 | <li>Changes to combining class values. Most Indic fixed position class non-spacing marks | |
1794 | ||
1795 | were changed to combining class 0. This fixes some inconsistencies in how canonical | |
1796 | ||
1797 | reordering would apply to Indic scripts, including Tibetan. Indic interacting top/bottom | |
1798 | ||
1799 | fixed position classes were merged into single (non-zero) classes as part of this change. | |
1800 | ||
1801 | Tibetan subjoined consonants are changed from combining class 6 to combining class 0. Thai | |
1802 | ||
1803 | pinthu (U+0E3A) moved to combining class 9. Moved two Devanagari stress marks into generic | |
1804 | ||
1805 | above and below combining classes (U+0951, U+0952). </li> | |
1806 | ||
1807 | <li>Corrected placement of semicolon near symmetric swapping field. (U+FA0E, etc., scattered | |
1808 | ||
1809 | positions to U+FA29) </li> | |
1810 | ||
1811 | </ul> | |
1812 | ||
1813 | ||
1814 | ||
1815 | <h3>Version 2.1.7</h3> | |
1816 | ||
1817 | ||
1818 | ||
1819 | <p><i>This version was for internal change tracking only, and never publicly released.</i></p> | |
1820 | ||
1821 | ||
1822 | ||
1823 | <h3>Version 2.1.6</h3> | |
1824 | ||
1825 | ||
1826 | ||
1827 | <p><i>This version was for internal change tracking only, and never publicly released.</i></p> | |
1828 | ||
1829 | ||
1830 | ||
1831 | <h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.5">Unicode 2.1.5</a> </h3> | |
1832 | ||
1833 | ||
1834 | ||
1835 | <p>Modifications made for Version 2.1.5 of UnicodeData.txt include: | |
1836 | ||
1837 | ||
1838 | ||
1839 | <ul> | |
1840 | ||
1841 | <li>Changed decomposition for U+FF9E and U+FF9F so that correct collation weighting will | |
1842 | ||
1843 | automatically result from the canonical equivalences. </li> | |
1844 | ||
1845 | <li>Removed canonical decompositions for U+04D4, U+04D5, U+04D8, U+04D9, U+04E0, U+04E1, | |
1846 | ||
1847 | U+04E8, U+04E9 (the implication being that no canonical equivalence is claimed between | |
1848 | ||
1849 | these 8 characters and similar Latin letters), and updated 4 canonical decompositions for | |
1850 | ||
1851 | U+04DB, U+04DC, U+04EA, U+04EB to reflect the implied difference in the base character. </li> | |
1852 | ||
1853 | <li>Added Pi, and Pf categories and assigned the relevant quotation marks to those | |
1854 | ||
1855 | categories, based on the Unicode Technical Corrigendum on Quotation Characters. </li> | |
1856 | ||
1857 | <li>Updated many bidi properties, following the advice of the ad hoc committee on bidi, | |
1858 | ||
1859 | and to make the bidi properties of compatibility characters more consistent. </li> | |
1860 | ||
1861 | <li>Changed category of several Tibetan characters: U+0F3E, U+0F3F, U+0F88..U+0F8B to make | |
1862 | ||
1863 | them non-combining, reflecting the combined opinion of Tibetan experts. </li> | |
1864 | ||
1865 | <li>Added case mapping for U+03F2. </li> | |
1866 | ||
1867 | <li>Corrected case mapping for U+0275. </li> | |
1868 | ||
1869 | <li>Added titlecase mappings for U+03D0, U+03D1, U+03D5, U+03D6, U+03F0.. U+03F2. </li> | |
1870 | ||
1871 | <li>Corrected compatibility label for U+2121. </li> | |
1872 | ||
1873 | <li>Added specific entries for all the CJK compatibility ideographs, U+F900..U+FA2D, so the | |
1874 | ||
1875 | canonical decomposition for each (the URO character it is equivalent to) can be carried in | |
1876 | ||
1877 | the database. </li> | |
1878 | ||
1879 | </ul> | |
1880 | ||
1881 | ||
1882 | ||
1883 | <h3>Version 2.1.4</h3> | |
1884 | ||
1885 | ||
1886 | ||
1887 | <p><i>This version was for internal change tracking only, and never publicly released.</i></p> | |
1888 | ||
1889 | ||
1890 | ||
1891 | <h3>Version 2.1.3</h3> | |
1892 | ||
1893 | ||
1894 | ||
1895 | <p><i>This version was for internal change tracking only, and never publicly released.</i></p> | |
1896 | ||
1897 | ||
1898 | ||
1899 | <h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.2">Unicode 2.1.2</a> </h3> | |
1900 | ||
1901 | ||
1902 | ||
1903 | <p>Modifications made in updating UnicodeData.txt to Version 2.1.2 for the Unicode | |
1904 | ||
1905 | Standard, Version 2.1 (from Version 2.0) include: | |
1906 | ||
1907 | ||
1908 | ||
1909 | <ul> | |
1910 | ||
1911 | <li>Added two characters (U+20AC and U+FFFC). </li> | |
1912 | ||
1913 | <li>Amended bidi properties for U+0026, U+002E, U+0040, U+2007. </li> | |
1914 | ||
1915 | <li>Corrected case mappings for U+018E, U+019F, U+01DD, U+0258, U+0275, U+03C2, U+1E9B. </li> | |
1916 | ||
1917 | <li>Changed combining order class for U+0F71. </li> | |
1918 | ||
1919 | <li>Corrected canonical decompositions for U+0F73, U+1FBE. </li> | |
1920 | ||
1921 | <li>Changed decomposition for U+FB1F from compatibility to canonical. </li> | |
1922 | ||
1923 | <li>Added compatibility decompositions for U+FBE8, U+FBE9, U+FBF9..U+FBFB. </li> | |
1924 | ||
1925 | <li>Corrected compatibility decompositions for U+2469, U+246A, U+3358. </li> | |
1926 | ||
1927 | </ul> | |
1928 | ||
1929 | ||
1930 | ||
1931 | <h3>Version 2.1.1</h3> | |
1932 | ||
1933 | ||
1934 | ||
1935 | <p><i>This version was for internal change tracking only, and never publicly released.</i></p> | |
1936 | ||
1937 | ||
1938 | ||
1939 | <h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.0.0">Unicode 2.0.0</a> </h3> | |
1940 | ||
1941 | ||
1942 | ||
1943 | <p>The modifications made in updating UnicodeData.txt for the Unicode | |
1944 | ||
1945 | Standard, Version 2.0 include: | |
1946 | ||
1947 | ||
1948 | ||
1949 | <ul> | |
1950 | ||
1951 | <li>Fixed decompositions with TONOS to use correct NSM: 030D. </li> | |
1952 | ||
1953 | <li>Removed old Hangul Syllables; mappings to new characters are in a separate table. </li> | |
1954 | ||
1955 | <li>Marked compatibility decompositions with additional tags. </li> | |
1956 | ||
1957 | <li>Changed old tag names for clarity. </li> | |
1958 | ||
1959 | <li>Revision of decompositions to use first-level decomposition, instead of maximal | |
1960 | ||
1961 | decomposition. </li> | |
1962 | ||
1963 | <li>Correction of all known errors in decompositions from earlier versions. </li> | |
1964 | ||
1965 | <li>Added control code names (as old Unicode names). </li> | |
1966 | ||
1967 | <li>Added Hangul Jamo decompositions. </li> | |
1968 | ||
1969 | <li>Added Number category to match properties list in book. </li> | |
1970 | ||
1971 | <li>Fixed categories of Koranic Arabic marks. </li> | |
1972 | ||
1973 | <li>Fixed categories of precomposed characters to match decomposition where possible. </li> | |
1974 | ||
1975 | <li>Added Hebrew cantillation marks and the Tibetan script. </li> | |
1976 | ||
1977 | <li>Added place holders for ranges such as CJK Ideographic Area and the Private Use Area. </li> | |
1978 | ||
1979 | <li>Added categories Me, Sk, Pc, Nl, Cs, Cf, and rectified a number of mistakes in the | |
1980 | ||
1981 | database. </li> | |
1982 | ||
1983 | </ul> | |
1984 | ||
1985 | </body> | |
1986 | ||
1987 | </html> | |
1988 |