# Perl should compile and reasonably run any version of Unicode. That doesn't
# mean that the test suite will run without showing errors. A few of the
# very Unicode-specific test files have been modified to account for different
# versions, but most have not. For example, some tests use characters that
# aren't encoded in all Unicode versions; others have hard-coded the General
# Categories that were correct at the time the test was written. Perl itself
# will not compile under Unicode releases prior to 3.0 without a simple change to
# Unicode::Normalize. mktables contains instructions for this, as well as other
# hints for using older Unicode versions.

# The *.txt files were copied from

#     ftp://www.unicode.org/Public/UNIDATA

# (which always points to the latest version) with subdirectories 'extracted' and
# 'auxiliary'. Older versions are located under Public with an appropriate name.

# The Unihan files were not included due to space considerations. Also NOT
# included were any *.html files. It is possible to add the Unihan files, and
# edit mktables (see instructions near its beginning) to look at them.

# The file named 'version' should exist and be a single line with the Unicode
# version, like:
#     5.2.0
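#
# As a hedged illustration (not part of the original instructions), the
# 'version' file could be sanity-checked with a short shell sketch like the
# one below. The helper name check_version_file and the three-part
# digits-and-dots pattern are assumptions made for this example.

```shell
# Hypothetical helper: succeed only if FILE is exactly one newline-terminated
# line that looks like a Unicode version such as 5.2.0.
check_version_file() {
    file=${1:-version}
    [ -f "$file" ] || return 1
    # Exactly one line ...
    [ "$(wc -l < "$file" | tr -d ' ')" -eq 1 ] || return 1
    # ... of the form N.N.N (assumed pattern)
    grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$' "$file"
}
```

# With no argument the helper looks at ./version, so it could be run from
# the unicore directory as:  check_version_file || echo "fix 'version'"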

# To be 8.3 filesystem friendly, the names of some of the input files have been
# changed from the values that are in the Unicode DB. Not all of the Test
# files are currently used, so some may not be present and the corresponding
# mv's can fail. The .html Test files are not touched.

mv PropertyValueAliases.txt PropValueAliases.txt
mv NamedSequencesProv.txt NamedSqProv.txt
mv NormalizationTest.txt NormTest.txt
mv DerivedAge.txt DAge.txt
mv DerivedCoreProperties.txt DCoreProperties.txt
mv DerivedNormalizationProps.txt DNormalizationProps.txt

# Some early releases don't have the extracted directory, and hence these files
# should be moved to it.
mkdir extracted 2>/dev/null
mv DerivedBidiClass.txt DerivedBinaryProperties.txt extracted 2>/dev/null
mv DerivedCombiningClass.txt DerivedDecompositionType.txt extracted 2>/dev/null
mv DerivedEastAsianWidth.txt DerivedGeneralCategory.txt extracted 2>/dev/null
mv DerivedJoiningGroup.txt DerivedJoiningType.txt extracted 2>/dev/null
mv DerivedLineBreak.txt DerivedNumericType.txt DerivedNumericValues.txt extracted 2>/dev/null

mv extracted/DerivedBidiClass.txt extracted/DBidiClass.txt
mv extracted/DerivedBinaryProperties.txt extracted/DBinaryProperties.txt
mv extracted/DerivedCombiningClass.txt extracted/DCombiningClass.txt
mv extracted/DerivedDecompositionType.txt extracted/DDecompositionType.txt
mv extracted/DerivedEastAsianWidth.txt extracted/DEastAsianWidth.txt
mv extracted/DerivedGeneralCategory.txt extracted/DGeneralCategory.txt
mv extracted/DerivedJoiningGroup.txt extracted/DJoinGroup.txt
mv extracted/DerivedJoiningType.txt extracted/DJoinType.txt
mv extracted/DerivedLineBreak.txt extracted/DLineBreak.txt
mv extracted/DerivedNumericType.txt extracted/DNumType.txt
mv extracted/DerivedNumericValues.txt extracted/DNumValues.txt

mv auxiliary/GraphemeBreakTest.txt auxiliary/GCBTest.txt
mv auxiliary/LineBreakTest.txt auxiliary/LBTest.txt
mv auxiliary/SentenceBreakTest.txt auxiliary/SBTest.txt
mv auxiliary/WordBreakTest.txt auxiliary/WBTest.txt
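
# Since some of the files renamed above may be absent, a guarded rename is one
# way to avoid the failing mv's. This is only a sketch: rename_if_present is a
# hypothetical helper, not something mktables or this README provides.

```shell
# Hypothetical helper: rename SRC to DST only when SRC exists; when it does
# not, do nothing and still report success.
rename_if_present() {
    if [ -e "$1" ]; then
        mv -- "$1" "$2"
    fi
}
```

# It would be used in place of a plain mv, for example:
#     rename_if_present NormalizationTest.txt NormTest.txt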

# If you have the Unihan database (5.2 and above), you should also do the
# following:

mv Unihan_DictionaryIndices.txt UnihanIndicesDictionary.txt
mv Unihan_DictionaryLikeData.txt UnihanDataDictionaryLike.txt
mv Unihan_IRGSources.txt UnihanIRGSources.txt
mv Unihan_NumericValues.txt UnihanNumericValues.txt
mv Unihan_OtherMappings.txt UnihanOtherMappings.txt
mv Unihan_RadicalStrokeCounts.txt UnihanRadicalStrokeCounts.txt
mv Unihan_Readings.txt UnihanReadings.txt
mv Unihan_Variants.txt UnihanVariants.txt

# If you download everything, the names of files that are not used by mktables
# are not changed by the above, and hence may not work correctly as-is on 8.3
# filesystems.

# mktables is used to generate the tables used by the rest of Perl. It will
# warn you about any *.txt files in the directory substructure that it doesn't
# know about. You should remove any so-identified, or edit mktables to add
# them to its lists to process. You can run
#
#     mktables -globlist
#
# to have it try to process these tables generically.
#
# FOR PUMPKINS
#
# The files are inter-related. If you take the latest UnicodeData.txt, for
# example, but leave the older versions of other files, there can be subtle
# problems. So get everything available from Unicode, and delete those which
# aren't needed.
#
# When moving to a new version of Unicode, you need to update 'version' by hand:
#
#     p4 edit version
#     ...
#
# You should look in the Unicode release notes (which are probably towards the
# bottom of http://www.unicode.org/reports/tr44/) to see if any properties have
# newly been moved to be Obsolete, Deprecated, or Stabilized. The full names
# for these should be added to the respective lists near the beginning of
# mktables, using an 'if' to add them for just this Unicode version going
# forward, so that mktables can continue to be used for earlier Unicode
# versions.
#
# When putting out a new Perl release, consider whether any of the Deprecated
# properties should be moved to Suppressed.
#
# perlrecharclass.pod has a list of all the characters that are white space,
# which needs to be updated if there are changes. A quick way to check if
# there have been changes would be to see if the number of such characters
# listed in perluniprops.pod (generated by running mktables) for the property
# \p{White_Space} is no longer 26. Further investigation would then be
# necessary to classify the new characters as horizontal and vertical.
#
# The code in regexec.c for the \X match construct is intimately tied to the
# regular expression in UAX #29 (http://www.unicode.org/reports/tr29/). You
# should see if it has changed, and if so regexec.c should be modified. The
# current one is
#     ( CRLF
#     | Prepend* ( Hangul-syllable | !Control )
#       ( Grapheme_Extend | Spacing_Mark)*
#     | . )
#
# mktables has many checks to warn you if there are unexpected or novel things
# that it doesn't know how to handle.
#
# Module::CoreList should be changed to include the new release.
#
# Also, you should regenerate l1_char_class_tab.h by
#
#     perl regen/mk_L_charclass.pl
#
# and regenerate charclass_invlists.h by
#
#     perl regen/mk_invlists.pl
#
# Finally:
#
#     p4 submit
#
# --
# jhi@iki.fi; updated by nick@ccl4.org, public@khwilliamson.com