busybox/docs/unicode.txt

	Unicode support in busybox

There are several scenarios where we need to handle unicode
correctly.

	Shell input

We want to correctly handle input of unicode characters.
There are several problems with it. Just handling input
as sequence of bytes would break any editing. This was fixed
and now lineedit operates on the array of wchar_t's.
But we also need to handle the following problematic moments:

* It is unreasonable to expect that output device supports
  _any_ unicode chars. Perhaps we need to avoid printing
  those chars which are not supported by output device.
  Examples: chars which are not present in the font,
  chars which are not assigned in unicode,
  combining chars (especially trying to combine bad pairs:
  a_chinese_symbol + "combining grave accent" = ??!)

* We need to account for the fact that unicode chars have
  different widths: 0 for combining chars, 1 for usual,
  2 for ideograms (are there 3+ wide chars?).

* Bidirectional handling. If user wants to echo a phrase
  in Hebrew, he types: echo "srettel werbeH"

	Editors (vi, ed)

This case is a bit similar to "shell input", but unlike shell,
editors may encounder many more unexpected unicode sequences
(try to load a random binary file...), and they need to preserve
them, unlike shell which can afford to drop bogus input.

	more, less

Need to correctly display any input file. Ideally, with
ASCII/unicode/filtered_unicode option or keyboard switch.
Note: need to handle tabs and backspaces specially
(bksp is for manpage compat).

	cut, fold, watch

May need ability to cut unicode string to specified number of wchars
and/or to specified screen width. Need to handle tabs specially.

	sed, awk, grep

Handle unicode-aware regexp match

	ls (multi-column display)

ls will fail to line up columnar output if it will not account
for character widths (and maybe filter out some of them, see
above). OTOH, non-columnar views (ls -1, ls -l, ls | car)
should NOT filter out bad unicode (but need to filter out
control chars (coreutils does that). Note that unlike more/less,
tabs and backspaces need not special handling.

	top, ps

Need to perform filtering similar to ls.

	Filename display (in error messages and elsewhere)

Need to perform filtering similar to ls.


TODO: write an email to Asmus Freytag (asmus@unicode.org),
author of http://unicode.org/reports/tr11/
add unicode.txt Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> 2010-02-01 20:28:08 +05:30			`Unicode support in busybox`

			`There are several scenarios where we need to handle unicode`
			`correctly.`

			`Shell input`

			`We want to correctly handle input of unicode characters.`
			`There are several problems with it. Just handling input`
			`as sequence of bytes would break any editing. This was fixed`
			`and now lineedit operates on the array of wchar_t's.`
			`But we also need to handle the following problematic moments:`

			`* It is unreasonable to expect that output device supports`
			`_any_ unicode chars. Perhaps we need to avoid printing`
			`those chars which are not supported by output device.`
			`Examples: chars which are not present in the font,`
			`chars which are not assigned in unicode,`
			`combining chars (especially trying to combine bad pairs:`
			`a_chinese_symbol + "combining grave accent" = ??!)`

			`* We need to account for the fact that unicode chars have`
			`different widths: 0 for combining chars, 1 for usual,`
			`2 for ideograms (are there 3+ wide chars?).`

			`* Bidirectional handling. If user wants to echo a phrase`
			`in Hebrew, he types: echo "srettel werbeH"`

docs/unicode.txt: added more TODOs Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> 2010-02-02 03:05:30 +05:30			`Editors (vi, ed)`
add unicode.txt Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> 2010-02-01 20:28:08 +05:30
			`This case is a bit similar to "shell input", but unlike shell,`
			`editors may encounder many more unexpected unicode sequences`
docs/unicode.txt: added more TODOs Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> 2010-02-02 03:05:30 +05:30			`(try to load a random binary file...), and they need to preserve`
add unicode.txt Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> 2010-02-01 20:28:08 +05:30			`them, unlike shell which can afford to drop bogus input.`

			`more, less`

docs/unicode.txt: added more TODOs Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> 2010-02-02 03:05:30 +05:30			`Need to correctly display any input file. Ideally, with`
			`ASCII/unicode/filtered_unicode option or keyboard switch.`
			`Note: need to handle tabs and backspaces specially`
			`(bksp is for manpage compat).`

			`cut, fold, watch`

			`May need ability to cut unicode string to specified number of wchars`
			`and/or to specified screen width. Need to handle tabs specially.`

			`sed, awk, grep`

			`Handle unicode-aware regexp match`
add unicode.txt Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> 2010-02-01 20:28:08 +05:30
			`ls (multi-column display)`

docs/unicode.txt: added more TODOs Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> 2010-02-02 03:05:30 +05:30			`ls will fail to line up columnar output if it will not account`
			`for character widths (and maybe filter out some of them, see`
			`above). OTOH, non-columnar views (ls -1, ls -l, ls \| car)`
			`should NOT filter out bad unicode (but need to filter out`
			`control chars (coreutils does that). Note that unlike more/less,`
			`tabs and backspaces need not special handling.`
add unicode.txt Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> 2010-02-01 20:28:08 +05:30
			`top, ps`

docs/unicode.txt: added more TODOs Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> 2010-02-02 03:05:30 +05:30			`Need to perform filtering similar to ls.`
add unicode.txt Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> 2010-02-01 20:28:08 +05:30
			`Filename display (in error messages and elsewhere)`

docs/unicode.txt: added more TODOs Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> 2010-02-02 03:05:30 +05:30			`Need to perform filtering similar to ls.`
add unicode.txt Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> 2010-02-01 20:28:08 +05:30

			`TODO: write an email to Asmus Freytag (asmus@unicode.org),`
			`author of http://unicode.org/reports/tr11/`