Commit | Line | Data |
---|---|---|
2c674647 JH |
1 | Use Markus Kuhn's UTF-8 Decode Stress Tester at |
2 | ||
3 | http://www.cl.cam.ac.uk/~mgk25/ucs/examples/ | |
4 | ||
5 | Markus: | |
6 | > | |
7 | > What exactly is malformed UTF-8 data here? | |
8 | > | |
9 | > Obviously at least everything listed in section R.7 of ISO 10646-1/Amd.2. | |
10 | > | |
11 | > Does it also cover overlong UTF-8 sequences, i.e. any string | |
12 | > containing any of the following five bit patterns? | |
13 | > | |
14 | > 1100000x, | |
15 | > 11100000 100xxxxx, | |
16 | > 11110000 1000xxxx, | |
17 | > 11111000 10000xxx, | |
18 | > 11111100 100000xx | |
19 | > | |
20 | > Does it also cover UTF-8 encoded code positions U+D800 to U+DFFF (UTF-16 | |
21 | > surrogates) as well as U+FFFE (anti-BOM) and U+FFFF, all of which must | |
22 | > not occur in proper UTF-8 and UTF-32 data according to the standard | |
23 | > (see note 3 in section R.4 of UCS)? | |
24 | > | |
25 | > It might be useful if the spec were clearer here. | |
26 | > | |
27 | > References: | |
28 | > | |
29 | > - ISO/IEC 10646-1:1993(E), Amd. 2, | |
30 | > http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html | |
31 | > | |
32 | > - http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 | |
33 | > | |
34 | ||
35 | Markus: | |
36 | > | |
37 | > It is commonly considered to be good practice to reject at least | |
38 | > overlong UTF-8 sequences; otherwise one permits multiple encodings for | |
39 | > characters, which makes pattern matching far more difficult in | |
40 | > applications where strings are processed in both coded and decoded form. | |
41 | > It has been argued that this could easily lead to security | |
42 | > vulnerabilities. See | |
43 | > | |
44 | > http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 | |
45 | > http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt (section 4) | |
46 | > ftp://sunsite.doc.ic.ac.uk/packages/rfc/rfc2279.txt (section 6) | |
47 | > | |
48 | > for a brief discussion. | |
49 | > |
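
The checks Markus describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the decoder this note belongs to: the names `has_overlong` and `is_forbidden_code_point` are hypothetical, and the code tests only the five overlong bit patterns and the code positions listed in the quoted messages (U+D800 to U+DFFF, U+FFFE, U+FFFF). Scanning every byte position is safe here because none of the lead bytes involved can also be a continuation byte (10xxxxxx).

```python
# Hypothetical helper names; not taken from any existing library.

# Lead byte -> mask applied to the following byte. The masked second byte
# must equal 0x80 for the pair to match one of the multi-byte patterns.
_OVERLONG_PAIRS = {
    0xE0: 0xE0,  # 11100000 100xxxxx
    0xF0: 0xF0,  # 11110000 1000xxxx
    0xF8: 0xF8,  # 11111000 10000xxx
    0xFC: 0xFC,  # 11111100 100000xx
}


def has_overlong(data: bytes) -> bool:
    """Return True if data contains any of the five overlong bit patterns."""
    for i, byte in enumerate(data):
        if byte & 0xFE == 0xC0:  # 1100000x (0xC0 or 0xC1)
            return True
        mask = _OVERLONG_PAIRS.get(byte)
        if mask is not None and i + 1 < len(data):
            if data[i + 1] & mask == 0x80:
                return True
    return False


def is_forbidden_code_point(cp: int) -> bool:
    """Return True for UTF-16 surrogates, U+FFFE and U+FFFF."""
    return 0xD800 <= cp <= 0xDFFF or cp in (0xFFFE, 0xFFFF)


if __name__ == "__main__":
    print(has_overlong(b"\xc0\xaf"))         # overlong encoding of "/"
    print(has_overlong("/".encode("utf-8")))  # shortest-form encoding
    print(is_forbidden_code_point(0xD800))
```

A real decoder would apply these checks while decoding rather than as a separate pass, and would also need to reject the malformed sequences of section R.7 (stray continuation bytes, truncated sequences), which this sketch does not attempt.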