Use Markus Kuhn's UTF-8 Decode Stress Tester at

   http://www.cl.cam.ac.uk/~mgk25/ucs/examples/

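A minimal sketch of how such a stress test could be driven, written in Python purely for illustration (it exercises Python's strict UTF-8 codec, not Encode). The helper name find_malformed_lines and the sample bytes are assumptions; in practice the input would be the bytes of the downloaded test file.

```python
def find_malformed_lines(data: bytes) -> list:
    """Return the 1-based numbers of lines that a strict UTF-8 decode rejects."""
    bad = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            bad.append(number)
    return bad


# Small stand-in for the test file (hypothetical content):
# line 2 holds an overlong "/" (C0 AF), line 3 a lone continuation byte.
sample = b"plain ASCII\noverlong: \xc0\xaf\nstray: \x80\nvalid: \xc3\xa9\n"
print(find_malformed_lines(sample))  # [2, 3]
```

Reporting line numbers rather than stopping at the first failure makes it easy to compare a decoder's verdicts against the expectations written into the test file.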
Markus:
>
> What exactly is malformed UTF-8 data here?
>
> Obviously at least everything listed in section R.7 of ISO 10646-1/Amd.2.
>
> Does it also cover overlong UTF-8 sequences, i.e. any string
> containing any of the five bit sequences
>
>     1100000x,
>     11100000 100xxxxx,
>     11110000 1000xxxx,
>     11111000 10000xxx,
>     11111100 100000xx
>
> Does it also cover UTF-8 encoded code positions U+D800 to U+DFFF (UTF-16
> surrogates) as well as U+FFFE (anti-BOM) and U+FFFF, all of which must
> not occur in proper UTF-8 and UTF-32 data according to the standard
> (see note 3 in section R.4 of UCS)?
>
> It might be useful, if the spec were clearer here.
>
> References:
>
>  - ISO/IEC 10646-1:1993(E), Amd. 2,
>    http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
>
>  - http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
>

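The two questions above can be made concrete with a short sketch. It is written in Python for illustration only and is not how Encode implements anything; the names OVERLONG_PATTERNS, has_overlong and is_forbidden_code_position are assumptions. The first function looks for the five bit sequences exactly as listed, the second tests the code positions named in the quote.

```python
# Each entry: (mask, value) for the lead byte, then (mask, value) for the
# second byte, or None when the lead byte alone identifies the pattern.
OVERLONG_PATTERNS = [
    ((0xFE, 0xC0), None),          # 1100000x
    ((0xFF, 0xE0), (0xE0, 0x80)),  # 11100000 100xxxxx
    ((0xFF, 0xF0), (0xF0, 0x80)),  # 11110000 1000xxxx
    ((0xFF, 0xF8), (0xF8, 0x80)),  # 11111000 10000xxx
    ((0xFF, 0xFC), (0xFC, 0x80)),  # 11111100 100000xx
]


def has_overlong(data: bytes) -> bool:
    """True if data contains any of the five overlong bit sequences."""
    for i, byte in enumerate(data):
        for (mask1, value1), second in OVERLONG_PATTERNS:
            if (byte & mask1) != value1:
                continue
            if second is None:
                return True
            mask2, value2 = second
            if i + 1 < len(data) and (data[i + 1] & mask2) == value2:
                return True
    return False


def is_forbidden_code_position(cp: int) -> bool:
    """True for U+D800..U+DFFF (surrogates), U+FFFE and U+FFFF."""
    return 0xD800 <= cp <= 0xDFFF or cp in (0xFFFE, 0xFFFF)


print(has_overlong(b"\xc0\xaf"))           # True: overlong "/"
print(has_overlong("\u0800".encode()))     # False: E0 A0 80 is the shortest form
print(is_forbidden_code_position(0xD800))  # True
```

Scanning for the patterns byte by byte is safe on otherwise well-formed input, because in shortest-form UTF-8 the lead bytes E0 and F0 are never followed by a byte matching the listed second-byte patterns.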
Markus:
>
> It is commonly considered to be good practice to reject at least
> overlong UTF-8 sequences, otherwise one permits multiple encodings for
> characters, which makes pattern matching far more difficult in
> applications where strings are processed in both coded and decoded form.
> It has been argued that this could easily lead to security
> vulnerabilities. See
>
>    http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
>    http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt (section 4)
>    ftp://sunsite.doc.ic.ac.uk/packages/rfc/rfc2279.txt (section 6)
>
> for a brief discussion.
>
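
The pattern-matching concern can be illustrated with a small sketch, again in Python and purely hypothetical. lenient_decode is a deliberately unsafe decoder invented for this example (it handles only one- and two-byte sequences and does not reject overlong forms); it is not a description of any real decoder. A check performed on the coded bytes sees no "/", yet the decoded string contains one.

```python
def lenient_decode(data: bytes) -> str:
    """Deliberately unsafe: accepts 2-byte sequences without an overlong check."""
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:
            out.append(chr(byte))
            i += 1
        elif (byte & 0xE0) == 0xC0 and i + 1 < len(data) \
                and (data[i + 1] & 0xC0) == 0x80:
            out.append(chr(((byte & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("unsupported or malformed byte at offset %d" % i)
    return "".join(out)


path = b"..\xc0\xaf..\xc0\xafetc"    # "/" hidden as the overlong pair C0 AF
print(b"/" in path)                  # False: the coded form passes a byte filter
decoded = lenient_decode(path)
print(decoded)                       # ../../etc

try:
    path.decode("utf-8")             # Python's strict codec rejects C0 AF
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True
print(strict_rejects)                # True
```

Rejecting the overlong form at decode time, as the strict decode does here, keeps the coded and decoded views of a string in agreement.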