2010-02-01 20:28:08 +05:30
|
|
|
Unicode support in busybox
|
|
|
|
|
|
|
|
There are several scenarios where we need to handle unicode
|
|
|
|
correctly.
|
|
|
|
|
|
|
|
Shell input
|
|
|
|
|
|
|
|
We want to correctly handle input of unicode characters.
|
|
|
|
There are several problems with it. Just handling input
|
|
|
|
as sequence of bytes would break any editing. This was fixed
|
|
|
|
and now lineedit operates on the array of wchar_t's.
|
|
|
|
But we also need to handle the following problematic moments:
|
|
|
|
|
|
|
|
* It is unreasonable to expect that output device supports
|
|
|
|
_any_ unicode chars. Perhaps we need to avoid printing
|
|
|
|
those chars which are not supported by output device.
|
|
|
|
Examples: chars which are not present in the font,
|
|
|
|
chars which are not assigned in unicode,
|
|
|
|
combining chars (especially trying to combine bad pairs:
|
|
|
|
a_chinese_symbol + "combining grave accent" = ??!)
|
|
|
|
|
|
|
|
* We need to account for the fact that unicode chars have
|
|
|
|
different widths: 0 for combining chars, 1 for usual,
|
|
|
|
2 for ideograms (are there 3+ wide chars?).
|
|
|
|
|
|
|
|
* Bidirectional handling. If user wants to echo a phrase
|
|
|
|
in Hebrew, he types: echo "srettel werbeH"
|
|
|
|
|
2010-02-02 03:05:30 +05:30
|
|
|
Editors (vi, ed)
|
2010-02-01 20:28:08 +05:30
|
|
|
|
|
|
|
This case is a bit similar to "shell input", but unlike shell,
|
|
|
|
editors may encounder many more unexpected unicode sequences
|
2010-02-02 03:05:30 +05:30
|
|
|
(try to load a random binary file...), and they need to preserve
|
2010-02-01 20:28:08 +05:30
|
|
|
them, unlike shell which can afford to drop bogus input.
|
|
|
|
|
|
|
|
more, less
|
|
|
|
|
2010-02-02 03:05:30 +05:30
|
|
|
Need to correctly display any input file. Ideally, with
|
|
|
|
ASCII/unicode/filtered_unicode option or keyboard switch.
|
|
|
|
Note: need to handle tabs and backspaces specially
|
|
|
|
(bksp is for manpage compat).
|
|
|
|
|
|
|
|
cut, fold, watch
|
|
|
|
|
|
|
|
May need ability to cut unicode string to specified number of wchars
|
|
|
|
and/or to specified screen width. Need to handle tabs specially.
|
|
|
|
|
|
|
|
sed, awk, grep
|
|
|
|
|
|
|
|
Handle unicode-aware regexp match
|
2010-02-01 20:28:08 +05:30
|
|
|
|
|
|
|
ls (multi-column display)
|
|
|
|
|
2010-02-02 03:05:30 +05:30
|
|
|
ls will fail to line up columnar output if it will not account
|
|
|
|
for character widths (and maybe filter out some of them, see
|
|
|
|
above). OTOH, non-columnar views (ls -1, ls -l, ls | car)
|
|
|
|
should NOT filter out bad unicode (but need to filter out
|
|
|
|
control chars (coreutils does that). Note that unlike more/less,
|
|
|
|
tabs and backspaces need not special handling.
|
2010-02-01 20:28:08 +05:30
|
|
|
|
|
|
|
top, ps
|
|
|
|
|
2010-02-02 03:05:30 +05:30
|
|
|
Need to perform filtering similar to ls.
|
2010-02-01 20:28:08 +05:30
|
|
|
|
|
|
|
Filename display (in error messages and elsewhere)
|
|
|
|
|
2010-02-02 03:05:30 +05:30
|
|
|
Need to perform filtering similar to ls.
|
2010-02-01 20:28:08 +05:30
|
|
|
|
|
|
|
|
|
|
|
TODO: write an email to Asmus Freytag (asmus@unicode.org),
|
|
|
|
author of http://unicode.org/reports/tr11/
|