
Use Markus Kuhn's UTF-8 Decode Stress Tester at

	http://www.cl.cam.ac.uk/~mgk25/ucs/examples/

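One way to run a decoder against the stress test is to feed it the file
line by line and record which lines it rejects. The sketch below is only
an illustration: it uses Python's built-in strict UTF-8 decoder as a
stand-in for the decoder under test, and the helper name malformed_lines
is hypothetical. Python's strict decoder rejects overlong forms and
encoded surrogates but accepts U+FFFE and U+FFFF, so it is not a complete
reference.

```python
def malformed_lines(data: bytes) -> list[int]:
    """Return the 1-based numbers of lines that fail strict UTF-8 decoding."""
    bad = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            bad.append(number)
    return bad


# Tiny stand-in for the stress-test file: line 2 holds an overlong "/"
# (C0 AF), line 3 holds a well-formed euro sign (E2 82 AC).
sample = b"plain ASCII\n\xc0\xaf\n\xe2\x82\xac\n"
print(malformed_lines(sample))  # [2]
```

For the real test, the bytes of a local copy of UTF-8-test.txt would be
passed in place of sample.
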
Markus:
> 
> What exactly is malformed UTF-8 data here?
> 
> Obviously at least everything listed in section R.7 of ISO 10646-1/Amd.2.
> 
> Does it also cover overlong UTF-8 sequences, i.e. any string
> containing any of the following five bit sequences?
> 
>   1100000x,
>   11100000 100xxxxx,
>   11110000 1000xxxx,
>   11111000 10000xxx,
>   11111100 100000xx
> 
> Does it also cover UTF-8 encoded code positions U+D800 to U+DFFF (UTF-16
> surrogates) as well as U+FFFE (anti-BOM) and U+FFFF, all of which must
> not occur in proper UTF-8 and UTF-32 data according to the standard
> (see note 3 in section R.4 of UCS)?
> 
> It would be useful if the spec were clearer here.
> 
> References:
> 
>   - ISO/IEC 10646-1:1993(E), Amd. 2,            
>     http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
> 
>   - http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
> 

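The two questions above can be turned into byte-level checks. The
following is a minimal sketch, not the Encode implementation: the function
names find_overlong and find_forbidden_code_positions are hypothetical,
and the second one only looks for the three-byte encodings of U+D800 to
U+DFFF, U+FFFE and U+FFFF.

```python
# Lead byte -> mask applied to the following byte; the pair is overlong
# when the masked second byte equals 0x80.
OVERLONG_SECOND_BYTE_MASK = {
    0xE0: 0xE0,  # 11100000 100xxxxx
    0xF0: 0xF0,  # 11110000 1000xxxx
    0xF8: 0xF8,  # 11111000 10000xxx
    0xFC: 0xFC,  # 11111100 100000xx
}


def find_overlong(data: bytes) -> list[int]:
    """Return byte offsets where one of the five overlong patterns starts."""
    hits = []
    for i, byte in enumerate(data):
        if byte & 0xFE == 0xC0:  # 1100000x (bytes C0 and C1)
            hits.append(i)
            continue
        mask = OVERLONG_SECOND_BYTE_MASK.get(byte)
        if mask is not None and i + 1 < len(data) and data[i + 1] & mask == 0x80:
            hits.append(i)
    return hits


def find_forbidden_code_positions(data: bytes) -> list[int]:
    """Return offsets of encoded surrogates, U+FFFE and U+FFFF."""
    hits = []
    for i in range(len(data) - 2):
        b0, b1, b2 = data[i], data[i + 1], data[i + 2]
        if b0 == 0xED and b1 & 0xE0 == 0xA0 and b2 & 0xC0 == 0x80:
            hits.append(i)  # ED A0..BF xx encodes U+D800..U+DFFF
        elif b0 == 0xEF and b1 == 0xBF and b2 in (0xBE, 0xBF):
            hits.append(i)  # EF BF BE is U+FFFE, EF BF BF is U+FFFF
    return hits


print(find_overlong(b"a\xe0\x80\xaf"))                # [1]
print(find_forbidden_code_positions(b"\xed\xa0\x80"))  # [0]
```

Scanning at every offset is safe here because none of the lead bytes in
the patterns can appear as a continuation byte in well-formed UTF-8.
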
Markus:
> 
> It is commonly considered good practice to reject at least
> overlong UTF-8 sequences; otherwise one permits multiple encodings for
> the same character, which makes pattern matching far more difficult in
> applications where strings are processed in both coded and decoded form.
> It has been argued that this could easily lead to security
> vulnerabilities. See
>
>   http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
>   http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt   (section 4)
>   ftp://sunsite.doc.ic.ac.uk/packages/rfc/rfc2279.txt          (section 6)
>
> for a brief discussion.
>
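
The pattern-matching problem described above can be illustrated with a
decoder that does not reject overlong forms. The sketch below is a
deliberately unsafe toy, written only to show the failure mode; the name
lenient_decode and the sample request bytes are hypothetical. A check for
"/" on the encoded bytes finds nothing, yet the decoded string contains
it, because C0 AF is an overlong encoding of U+002F.

```python
def lenient_decode(data: bytes) -> str:
    """Decode UTF-8 WITHOUT rejecting overlong forms (deliberately unsafe)."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            length, value = 1, lead
        elif lead & 0xE0 == 0xC0:
            length, value = 2, lead & 0x1F
        elif lead & 0xF0 == 0xE0:
            length, value = 3, lead & 0x0F
        elif lead & 0xF8 == 0xF0:
            length, value = 4, lead & 0x07
        else:
            raise ValueError(f"bad lead byte at offset {i}")
        tail = data[i + 1:i + length]
        if len(tail) != length - 1 or any(b & 0xC0 != 0x80 for b in tail):
            raise ValueError(f"bad continuation at offset {i}")
        for b in tail:
            # Each continuation byte contributes its low six bits.
            value = (value << 6) | (b & 0x3F)
        out.append(chr(value))
        i += length
    return "".join(out)


request = b"..\xc0\xaf..\xc0\xafetc"
print(b"/" in request)                  # False: filter on encoded form passes
print("/" in lenient_decode(request))   # True: decoded form contains "/"
```

A strict decoder avoids the mismatch by refusing the input outright;
Python's built-in request.decode("utf-8") raises UnicodeDecodeError for
these bytes.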