# Perl should compile and run reasonably well under any version of Unicode.
# That doesn't mean that the test suite will run without showing errors.  A few
# of the very Unicode-specific test files have been modified to account for
# different versions, but most have not.  For example, some tests use
# characters that aren't encoded in all Unicode versions; others have
# hard-coded the General Categories that were correct at the time the test was
# written.  Perl itself will not compile under Unicode releases prior to 3.0
# without a simple change to Unicode::Normalize.  mktables contains
# instructions for this, as well as other hints for using older Unicode
# versions.

# The *.txt files were copied from

# 	ftp://www.unicode.org/Public/UNIDATA

# (which always points to the latest version) with subdirectories 'extracted' and
# 'auxiliary'.  Older versions are located under Public with an appropriate name.

# The Unihan files were not included due to space considerations.  Also NOT
# included were any *.html files.  It is possible to add the Unihan files, and
# edit mktables (see instructions near its beginning) to look at them.

# The file named 'version' should exist and be a single line with the Unicode
# version, like:
# 5.2.0
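
# As a minimal sketch of that requirement, the shell function below checks
# that a file holds exactly one line of the form N.N.N.  The function name
# check_version_file is hypothetical and not part of the Perl source tree, and
# the three-component pattern is an assumption based on the 5.2.0 example.

```shell
# Hypothetical helper: succeed only if the file is a single line that looks
# like a dotted Unicode version such as 5.2.0.
check_version_file() {
    file=${1:-version}
    [ -f "$file" ] || return 1
    # An empty pattern matches every line, so grep -c '' counts the lines.
    [ "$(grep -c '' "$file")" -eq 1 ] || return 1
    # Assumed format: three dot-separated decimal numbers and nothing else.
    grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$' "$file"
}
```

# It could be used as: check_version_file version || echo "check 'version'"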

# To be 8.3 filesystem friendly, the names of some of the input files have been
# changed from the names used in the Unicode DB.  Not all of the Test files
# are currently used, so some may not be present, and the corresponding mv's
# can fail.  The .html Test files are not touched.

mv PropertyValueAliases.txt PropValueAliases.txt
mv NamedSequencesProv.txt NamedSqProv.txt
mv NormalizationTest.txt NormTest.txt
mv DerivedAge.txt DAge.txt
mv DerivedCoreProperties.txt DCoreProperties.txt
mv DerivedNormalizationProps.txt DNormalizationProps.txt

# Some early releases don't have the extracted directory, and hence these files
# should be moved to it.
mkdir extracted 2>/dev/null
mv DerivedBidiClass.txt DerivedBinaryProperties.txt extracted 2>/dev/null
mv DerivedCombiningClass.txt DerivedDecompositionType.txt extracted 2>/dev/null
mv DerivedEastAsianWidth.txt DerivedGeneralCategory.txt extracted 2>/dev/null
mv DerivedJoiningGroup.txt DerivedJoiningType.txt extracted 2>/dev/null
mv DerivedLineBreak.txt DerivedNumericType.txt DerivedNumericValues.txt extracted 2>/dev/null

mv extracted/DerivedBidiClass.txt extracted/DBidiClass.txt
mv extracted/DerivedBinaryProperties.txt extracted/DBinaryProperties.txt
mv extracted/DerivedCombiningClass.txt extracted/DCombiningClass.txt
mv extracted/DerivedDecompositionType.txt extracted/DDecompositionType.txt
mv extracted/DerivedEastAsianWidth.txt extracted/DEastAsianWidth.txt
mv extracted/DerivedGeneralCategory.txt extracted/DGeneralCategory.txt
mv extracted/DerivedJoiningGroup.txt extracted/DJoinGroup.txt
mv extracted/DerivedJoiningType.txt extracted/DJoinType.txt
mv extracted/DerivedLineBreak.txt extracted/DLineBreak.txt
mv extracted/DerivedNumericType.txt extracted/DNumType.txt
mv extracted/DerivedNumericValues.txt extracted/DNumValues.txt

mv auxiliary/GraphemeBreakTest.txt auxiliary/GCBTest.txt
mv auxiliary/LineBreakTest.txt auxiliary/LBTest.txt
mv auxiliary/SentenceBreakTest.txt auxiliary/SBTest.txt
mv auxiliary/WordBreakTest.txt auxiliary/WBTest.txt
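
# Because some of the files renamed above may be absent, the mv's can fail.
# One way to avoid the noise, sketched here under the assumption that a
# missing input file is acceptable, is a small wrapper that renames a file
# only when it exists.  The name mv_if_present is hypothetical.

```shell
# Hypothetical helper: rename $1 to $2 only when $1 is present.
mv_if_present() {
    if [ -e "$1" ]; then
        mv -- "$1" "$2"
    fi
}
```

# For example: mv_if_present NormalizationTest.txt NormTest.txt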

# If you have the Unihan database (5.2 and above), you should also do the
# following:

mv Unihan_DictionaryIndices.txt UnihanIndicesDictionary.txt
mv Unihan_DictionaryLikeData.txt UnihanDataDictionaryLike.txt
mv Unihan_IRGSources.txt UnihanIRGSources.txt
mv Unihan_NumericValues.txt UnihanNumericValues.txt
mv Unihan_OtherMappings.txt UnihanOtherMappings.txt
mv Unihan_RadicalStrokeCounts.txt UnihanRadicalStrokeCounts.txt
mv Unihan_Readings.txt UnihanReadings.txt
mv Unihan_Variants.txt UnihanVariants.txt

# If you download everything, the names of files that are not used by mktables
# are not changed by the above, and hence may not work correctly as-is on 8.3
# filesystems.
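
# To spot such files, the sketch below reports names that would collide once
# truncated to 8.3 form.  It assumes that "8.3 friendly" means unique, ignoring
# case, in the first 8 characters of the base name plus the first 3 of the
# extension; the function name clashes_8_3 is hypothetical.  Directory parts
# are stripped, so pass the files of one directory at a time.

```shell
# Hypothetical helper: print each truncated 8.3 key that is shared by more
# than one of the file names given as arguments.
clashes_8_3() {
    for path in "$@"; do
        name=${path##*/}          # drop any leading directories
        base=${name%.*}           # part before the last dot
        ext=${name##*.}           # part after the last dot
        if [ "$base" = "$name" ]; then
            ext=                  # no dot in the name, so no extension
        fi
        printf '%s.%s\n' "$(printf %.8s "$base")" "$(printf %.3s "$ext")"
    done | tr '[:lower:]' '[:upper:]' | sort | uniq -d
}
```

# For example: clashes_8_3 *.txt; clashes_8_3 extracted/*.txt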

# mktables is used to generate the tables used by the rest of Perl.  It will
# warn you about any *.txt files in the directory substructure that it doesn't
# know about.  You should remove any so-identified, or edit mktables to add
# them to its lists to process.  You can run
#
#    mktables -globlist
#
# to have it try to process these tables generically.
#
# FOR PUMPKINS
#
# The files are inter-related.  If you take the latest UnicodeData.txt, for
# example, but leave the older versions of other files, there can be subtle
# problems.  So get everything available from Unicode, and delete those which
# aren't needed.
#
# When moving to a new version of Unicode, you need to update 'version' by hand:
#
#	p4 edit version
# 	...
#
# You should look in the Unicode release notes (which are probably towards the
# bottom of http://www.unicode.org/reports/tr44/) to see if any properties have
# newly been moved to be Obsolete, Deprecated, or Stabilized.  The full names
# for these should be added to the respective lists near the beginning of
# mktables, using an 'if' to add them for just this Unicode version going
# forward, so that mktables can continue to be used for earlier Unicode
# versions.
#
# When putting out a new Perl release, consider whether any of the Deprecated
# properties should be moved to Suppressed.
#
# perlrecharclass.pod has a list of all the characters that are white space,
# which needs to be updated if there are changes.  A quick way to check if
# there have been changes would be to see if the number of such characters
# listed in perluniprops.pod (generated by running mktables) for the property
# \p{White_Space} is no longer 26.  Further investigation would then be
# necessary to classify each new character as horizontal or vertical.
#
# The code in regexec.c for the \X match construct is intimately tied to the
# regular expression in UAX #29 (http://www.unicode.org/reports/tr29/).  You
# should see if it has changed, and if so regexec.c should be modified.  The
# current one is
# ( CRLF
# | Prepend* ( Hangul-syllable | !Control )
#   ( Grapheme_Extend | Spacing_Mark)*
# | . )
#
# mktables has many checks to warn you if there are unexpected or novel things
# that it doesn't know how to handle.
#
# Module::CoreList should be changed to include the new release.
#
# Also, you should regenerate l1_char_class_tab.h by running
#
# perl regen/mk_L_charclass.pl
#
# and regenerate charclass_invlists.h by running
#
# perl regen/mk_invlists.pl
#
# Finally:
#
# 	p4 submit
#
# --
# jhi@iki.fi; updated by nick@ccl4.org, public@khwilliamson.com