Commit | Line | Data |
---|---|---|
1911be83 | 1 | The *.txt files were copied from |
8836d2a5 | 2 | |
99870f4d | 3 | ftp://www.unicode.org/Public/UNIDATA |
b6922eda | 4 | |
99870f4d | 5 | with subdirectories 'extracted' and 'auxiliary' |
61131c94 | 6 | |
99870f4d | 7 | The Unihan files were not included due to space considerations. Also NOT |
37e2e78e KW |
8 | included were any *.html files. It is possible to add the Unihan files, and |
9 | edit mktables (see instructions near its beginning) to look at them. | |
99870f4d KW |
10 | |
11 | The file 'version' should exist and contain a single line with the Unicode version, | |
12 | like: | |
13 | 5.2.0 | |
61131c94 KW |
14 | |
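The requirement on 'version' can be checked mechanically. The following is only a sketch: check_version_file is a hypothetical helper, not something shipped with Perl, and it assumes the version always has the three-part form shown in the 5.2.0 example.

```shell
# Hypothetical helper, not part of the Perl source tree.
# Assumption: the version looks like MAJOR.MINOR.UPDATE, as in "5.2.0".
check_version_file() {
    file=${1:-version}
    # The file must exist and be a regular file.
    [ -f "$file" ] || return 1
    # grep -c '' counts lines, including a final line with no newline.
    [ "$(grep -c '' "$file")" -eq 1 ] || return 1
    # The single line must be three dot-separated numbers.
    grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$' "$file"
}
```

Calling check_version_file with no argument checks 'version' in the current directory; a non-zero return status means the file is missing, has more than one line, or does not look like a version number.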
15 | To be 8.3 filesystem friendly, the names of some of the input files have been | |
37e2e78e KW |
16 | changed from the values that are in the Unicode DB. Not all of the Test files |
17 | are currently used, so some may not be present and the corresponding mv's can fail. The | |
18 | .html Test files are not touched. | |
61131c94 KW |
19 | |
20 | mv PropertyValueAliases.txt PropValueAliases.txt | |
21 | mv NamedSequencesProv.txt NamedSqProv.txt | |
22 | mv DerivedAge.txt DAge.txt | |
23 | mv DerivedCoreProperties.txt DCoreProperties.txt | |
24 | mv DerivedNormalizationProps.txt DNormalizationProps.txt | |
25 | mv extracted/DerivedBidiClass.txt extracted/DBidiClass.txt | |
26 | mv extracted/DerivedBinaryProperties.txt extracted/DBinaryProperties.txt | |
27 | mv extracted/DerivedCombiningClass.txt extracted/DCombiningClass.txt | |
28 | mv extracted/DerivedDecompositionType.txt extracted/DDecompositionType.txt | |
29 | mv extracted/DerivedEastAsianWidth.txt extracted/DEastAsianWidth.txt | |
30 | mv extracted/DerivedGeneralCategory.txt extracted/DGeneralCategory.txt | |
31 | mv extracted/DerivedJoiningGroup.txt extracted/DJoinGroup.txt | |
32 | mv extracted/DerivedJoiningType.txt extracted/DJoinType.txt | |
33 | mv extracted/DerivedLineBreak.txt extracted/DLineBreak.txt | |
34 | mv extracted/DerivedNumericType.txt extracted/DNumType.txt | |
35 | mv extracted/DerivedNumericValues.txt extracted/DNumValues.txt | |
36 | ||
37e2e78e KW |
37 | mv auxiliary/GraphemeBreakTest.txt auxiliary/GCBTest.txt |
38 | mv auxiliary/LineBreakTest.txt auxiliary/LBTest.txt | |
39 | mv auxiliary/SentenceBreakTest.txt auxiliary/SBTest.txt | |
40 | mv auxiliary/WordBreakTest.txt auxiliary/WBTest.txt | |
41 | ||
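Because some of the Test files may be absent, the mv's for them can be guarded so that a missing file is skipped rather than reported as a failure. This is a sketch only: rename_if_present is a hypothetical helper introduced here for illustration, and the file names are the ones listed above.

```shell
# Hypothetical helper: rename a file only if it is actually present.
rename_if_present() {
    if [ -e "$1" ]; then
        mv "$1" "$2"
    else
        # Report the skip on stderr; a missing file is not treated as an error.
        echo "skipping $1 (not present)" >&2
    fi
}

rename_if_present auxiliary/GraphemeBreakTest.txt auxiliary/GCBTest.txt
rename_if_present auxiliary/LineBreakTest.txt     auxiliary/LBTest.txt
rename_if_present auxiliary/SentenceBreakTest.txt auxiliary/SBTest.txt
rename_if_present auxiliary/WordBreakTest.txt     auxiliary/WBTest.txt
```

The same helper could wrap the other renames, although for files that mktables requires a loud failure from a plain mv may be preferable.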
99870f4d KW |
42 | If you have the Unihan database (5.2 and above), you should also do the |
43 | following: | |
61131c94 | 44 | |
99870f4d KW |
45 | mv Unihan_DictionaryIndices.txt UnihanIndicesDictionary.txt |
46 | mv Unihan_DictionaryLikeData.txt UnihanDataDictionaryLike.txt | |
47 | mv Unihan_IRGSources.txt UnihanIRGSources.txt | |
48 | mv Unihan_NumericValues.txt UnihanNumericValues.txt | |
49 | mv Unihan_OtherMappings.txt UnihanOtherMappings.txt | |
50 | mv Unihan_RadicalStrokeCounts.txt UnihanRadicalStrokeCounts.txt | |
51 | mv Unihan_Readings.txt UnihanReadings.txt | |
52 | mv Unihan_Variants.txt UnihanVariants.txt | |
53 | ||
37e2e78e KW |
54 | If you download everything, the names of files that are not used by mktables |
55 | are not changed by the above, and will not work correctly as-is on 8.3 | |
56 | filesystems. | |
99870f4d KW |
57 | |
58 | mktables is used to generate the tables used by the rest of Perl. It will warn | |
59 | you about any *.txt files in the directory substructure that it doesn't know | |
60 | about. You should remove any files so identified, or edit mktables to add them to | |
61 | its lists to process. You can run | |
62 | ||
63 | mktables -globlist | |
64 | ||
65 | to have it try to process these tables generically. | |
66 | ||
0fa75b59 JH |
67 | FOR PUMPKINS |
68 | ||
99870f4d KW |
69 | The files are inter-related. If you take the latest UnicodeData.txt, for |
70 | example, but leave the older versions of other files, there can be subtle | |
37e2e78e KW |
71 | problems. So get everything available from Unicode, and delete those which |
72 | aren't needed. | |
99870f4d KW |
73 | |
74 | When moving to a new version of Unicode, you need to update 'version' by hand: | |
75 | ||
76 | p4 edit version | |
77 | ... | |
78 | ||
79 | You should look in the Unicode release notes (which are probably towards the | |
80 | bottom of http://www.unicode.org/reports/tr44/) to see if any properties have | |
81 | newly been moved to be Obsolete, Deprecated, or Stabilized. The full names for | |
82 | these should be added to the respective lists near the beginning of mktables, | |
83 | using an 'if' to add them for just this Unicode version going forward, so that | |
84 | mktables can continue to be used for earlier Unicode versions. | |
85 | ||
86 | When putting out a new Perl release, consider whether any of the Deprecated | |
87 | properties should be moved to Suppressed. | |
b6922eda | 88 | |
272d2fcc KW |
89 | perlrecharclass.pod has a list of all the characters that are white space, |
90 | which needs to be updated if there are changes. A quick way to check if there | |
91 | have been changes would be to see if the number of such characters listed in | |
92 | perluniprops.pod (generated by running mktables) for the property | |
93 | \p{White_Space} is no longer 26. Further investigation would then be necessary | |
94 | to classify each new character as horizontal or vertical. | |
95 | ||
37e2e78e KW |
96 | The code in regexec.c for the \X match construct is intimately tied to the |
97 | regular expression in UAX #29 (http://www.unicode.org/reports/tr29/). You | |
98 | should see if it has changed, and if so regexec.c should be modified. The | |
99 | current one is: | |
100 | ( CRLF | |
101 | | Prepend* ( Hangul-syllable | !Control ) | |
102 | ( Grapheme_Extend | Spacing_Mark)* | |
103 | | . ) | |
104 | ||
105 | mktables has many checks to warn you if there are unexpected or novel things | |
106 | that it doesn't know how to handle. | |
0fa75b59 | 107 | |
37e2e78e | 108 | Finally: |
0fa75b59 JH |
109 | |
110 | p4 submit | |
8836d2a5 JH |
111 | |
112 | -- | |
99870f4d | 113 | jhi@iki.fi; updated by nick@ccl4.org, public@khwilliamson.com |