The *.txt files were copied from

        ftp://www.unicode.org/Public/UNIDATA

along with its subdirectories 'extracted' and 'auxiliary'.

The Unihan files were not included due to space considerations.  Also NOT
included were any *.html files.  It is possible to add the Unihan files, and
edit mktables (see instructions near its beginning) to look at them.

The file 'version' should exist and be a single line with the Unicode version,
like:
5.2.0
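
A quick sanity check of the 'version' file can be sketched in shell.  This is
only an illustration: the function name check_version_file is made up here,
and the three-part numeric pattern is an assumption based on the example
above.

```shell
# Hypothetical helper (not part of the Perl distribution): succeed only if
# the given file is exactly one line of the form MAJOR.MINOR.UPDATE.
check_version_file() {
    file=$1
    [ -f "$file" ] || return 1
    # Exactly one newline-terminated line (wc -l counts newline characters).
    [ "$(wc -l < "$file" | tr -d ' ')" -eq 1 ] || return 1
    # The line must look like 5.2.0; this pattern is an assumption.
    grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$' "$file"
}
```

It could be called as
check_version_file version || echo "version file is malformed" >&2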

To be 8.3 filesystem friendly, the names of some of the input files have been
changed from the values that are in the Unicode DB.  Not all of the Test files
are currently used, so some may not be present and the corresponding mv's can
fail.  The .html Test files are not touched.

mv PropertyValueAliases.txt PropValueAliases.txt
mv NamedSequencesProv.txt NamedSqProv.txt
mv NormalizationTest.txt NormTest.txt
mv DerivedAge.txt DAge.txt
mv DerivedCoreProperties.txt DCoreProperties.txt
mv DerivedNormalizationProps.txt DNormalizationProps.txt
mv extracted/DerivedBidiClass.txt extracted/DBidiClass.txt
mv extracted/DerivedBinaryProperties.txt extracted/DBinaryProperties.txt
mv extracted/DerivedCombiningClass.txt extracted/DCombiningClass.txt
mv extracted/DerivedDecompositionType.txt extracted/DDecompositionType.txt
mv extracted/DerivedEastAsianWidth.txt extracted/DEastAsianWidth.txt
mv extracted/DerivedGeneralCategory.txt extracted/DGeneralCategory.txt
mv extracted/DerivedJoiningGroup.txt extracted/DJoinGroup.txt
mv extracted/DerivedJoiningType.txt extracted/DJoinType.txt
mv extracted/DerivedLineBreak.txt extracted/DLineBreak.txt
mv extracted/DerivedNumericType.txt extracted/DNumType.txt
mv extracted/DerivedNumericValues.txt extracted/DNumValues.txt

mv auxiliary/GraphemeBreakTest.txt auxiliary/GCBTest.txt
mv auxiliary/LineBreakTest.txt auxiliary/LBTest.txt
mv auxiliary/SentenceBreakTest.txt auxiliary/SBTest.txt
mv auxiliary/WordBreakTest.txt auxiliary/WBTest.txt
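
Because some of these files may be absent, the renames can be wrapped so that
a missing source is skipped rather than reported as an mv error.  The
following is a minimal sketch; rename_if_present is a hypothetical helper,
not something shipped with Perl.

```shell
# Hypothetical helper: rename a file only if the source exists, so that
# optional files that were not downloaded do not produce mv errors.
rename_if_present() {
    src=$1
    dst=$2
    if [ -e "$src" ]; then
        mv -- "$src" "$dst"
    else
        echo "skipping $src (not present)" >&2
    fi
}

# Example calls, using two of the renames listed above:
# rename_if_present NormalizationTest.txt NormTest.txt
# rename_if_present auxiliary/WordBreakTest.txt auxiliary/WBTest.txt
```

A skipped file returns success, so the helper can be used in a script that
stops on the first real error.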

If you have the Unihan database (5.2 and above), you should also do the
following:

mv Unihan_DictionaryIndices.txt UnihanIndicesDictionary.txt
mv Unihan_DictionaryLikeData.txt UnihanDataDictionaryLike.txt
mv Unihan_IRGSources.txt UnihanIRGSources.txt
mv Unihan_NumericValues.txt UnihanNumericValues.txt
mv Unihan_OtherMappings.txt UnihanOtherMappings.txt
mv Unihan_RadicalStrokeCounts.txt UnihanRadicalStrokeCounts.txt
mv Unihan_Readings.txt UnihanReadings.txt
mv Unihan_Variants.txt UnihanVariants.txt

If you download everything, the files that mktables does not use are not
renamed by the above, and their names will not work correctly as-is on 8.3
filesystems.
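
One way to spot names that could cause trouble is to look for files in the
same directory whose names coincide once the base name is cut to 8 characters
and the extension to 3, ignoring case.  That reading of "8.3 friendly" is an
assumption made for this sketch (real 8.3 filesystems may apply other rules),
and the function name is hypothetical.

```shell
# Hypothetical helper: print each truncated (8.3-style, upper-cased) name
# that two or more files in the given directory would share.
# Assumption: truncation keeps the first 8 characters of the base name and
# the first 3 of the extension; actual 8.3 filesystems may behave differently.
short_name_collisions() {
    dir=${1:-.}
    for path in "$dir"/*.*; do
        [ -e "$path" ] || continue      # glob matched nothing
        name=${path##*/}                # strip the directory part
        base=${name%.*}                 # everything before the last dot
        ext=${name##*.}                 # everything after the last dot
        printf '%.8s.%.3s\n' "$base" "$ext"
    done | tr '[:lower:]' '[:upper:]' | sort | uniq -d
}
```

Under that assumption, DerivedBidiClass.txt and DerivedBinaryProperties.txt
in one directory both reduce to DERIVEDB.TXT, while the renamed DBidiClass.txt
and DBinaryProperties.txt do not collide.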

mktables is used to generate the tables used by the rest of Perl.  It will warn
you about any *.txt files in the directory substructure that it doesn't know
about.  You should remove any so-identified, or edit mktables to add them to
its lists to process.  You can run

    mktables -globlist

to have it try to process these tables generically.

FOR PUMPKINS

The files are inter-related.  If you take the latest UnicodeData.txt, for
example, but leave the older versions of other files, there can be subtle
problems.  So get everything available from Unicode, and delete those which
aren't needed.

When moving to a new version of Unicode, you need to update 'version' by hand:

        p4 edit version
        ...

You should look in the Unicode release notes (which are probably towards the
bottom of http://www.unicode.org/reports/tr44/) to see if any properties have
newly been moved to be Obsolete, Deprecated, or Stabilized.  The full names for
these should be added to the respective lists near the beginning of mktables,
using an 'if' to add them for just this Unicode version going forward, so that
mktables can continue to be used for earlier Unicode versions.

When putting out a new Perl release, consider whether any of the Deprecated
properties should be moved to Suppressed.

perlrecharclass.pod has a list of all the characters that are white space,
which needs to be updated if there are changes.  A quick way to check if there
have been changes would be to see if the number of such characters listed in
perluniprops.pod (generated by running mktables) for the property
\p{White_Space} is no longer 26.  Further investigation would then be necessary
to classify the new characters as horizontal or vertical.

The code in regexec.c for the \X match construct is intimately tied to the
regular expression in UAX #29 (http://www.unicode.org/reports/tr29/).  You
should see if it has changed, and if so regexec.c should be modified.  The
current one is:
( CRLF
| Prepend* ( Hangul-syllable | !Control )
  ( Grapheme_Extend | Spacing_Mark )*
| . )

mktables has many checks to warn you if there are unexpected or novel things
that it doesn't know how to handle.

perl.pod should be changed so that it gives the new name (which includes the
Unicode release number) for perluniprops.pod.

Module::CoreList should be changed to include the new release.

Also, you should regenerate l1_char_class_tab.h by running

    perl regen/mk_L_charclass.pl

and regenerate charclass_invlists.h by running

    perl regen/mk_invlists.pl

Finally:

        p4 submit

-- 
jhi@iki.fi; updated by nick@ccl4.org, public@khwilliamson.com