- NAME
- encoding — Work with encodings
- SYNOPSIS
- INTRODUCTION
- DESCRIPTION
- encoding convertfrom ?encoding? data
- encoding convertfrom ?-profile profile? ?-failindex var? encoding data
- encoding convertto ?encoding? data
- encoding convertto ?-profile profile? ?-failindex var? encoding data
- encoding dirs ?directoryList?
- encoding names
- encoding profiles
- encoding system ?encoding?
- PROFILES
- strict
- tcl8
- replace
- EXAMPLES
- SEE ALSO
- KEYWORDS
encoding — Work with encodings
encoding operation ?arg arg ...?
In Tcl every string is composed of Unicode values. Text may be encoded into an
encoding such as cp1252, iso8859-1, Shift-JIS, utf-8, utf-16, etc. Not every
Unicode value is encodable in every encoding, and some encodings can encode
values that are not available in Unicode.
Even though Unicode is for encoding the written texts of human languages, any
sequence of bytes can be represented using the first 256 Unicode values (0
through 255). iso8859-1 is an encoding for a subset of Unicode in which each
byte is a Unicode value of 255 or less. Thus, any sequence of bytes can be
considered to be a Unicode string encoded in iso8859-1. To work with binary
data in Tcl, decode it from iso8859-1 when reading it in, and encode it into
iso8859-1 when writing it out, ensuring that each character in the string has
a value of 255 or less. Decoding such a string does nothing, and encoding such
a string also does nothing.
For example, the following is true:
set text {In Tcl binary data is treated as Unicode text and it just works.}
set encoded [encoding convertto iso8859-1 $text]
expr {$text eq $encoded}; #-> 1
The following is also true:
set decoded [encoding convertfrom iso8859-1 $text]
expr {$text eq $decoded}; #-> 1
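The same holds for byte values above 127. The following is a minimal sketch, assuming a Tcl version that supports the cu format of binary scan (8.5 or later); the byte values and variable names are arbitrary and chosen only for illustration.

```tcl
# Four arbitrary bytes, including values above 127.
set bytes [binary format H* 004180ff]

# Decoding from iso8859-1 maps each byte to the Unicode value of the same number.
set decoded [encoding convertfrom iso8859-1 $bytes]

# Encoding back into iso8859-1 yields the original bytes.
set encoded [encoding convertto iso8859-1 $decoded]

# Inspect the result as unsigned 8-bit integers.
binary scan $encoded cu* values
puts $values                      ;# -> 0 65 128 255
puts [expr {$bytes eq $encoded}]  ;# -> 1
```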
Performs one of the following encoding operations:
- encoding convertfrom ?encoding? data
-
- encoding convertfrom ?-profile profile? ?-failindex var? encoding data
-
Decodes data encoded in encoding. If encoding is not
specified the current system encoding is used.
-profile determines how invalid data for the encoding are handled. See
the PROFILES section below for details. Returns an error if decoding
fails. However, if -failindex is given, returns the result of the
conversion up to the point of termination, and stores in var the index of
the character that could not be converted. If no errors are encountered the
entire result of the conversion is returned and the value -1 is stored in
var.
- encoding convertto ?encoding? data
-
- encoding convertto ?-profile profile? ?-failindex var? encoding data
-
Converts data to encoding. If encoding is not given, the
current system encoding is used.
See encoding convertfrom for the meaning of -profile and -failindex.
- encoding dirs ?directoryList?
-
Sets the search path for *.enc encoding data files to the list of
directories given by directoryList. If directoryList is not given,
returns the current list of directories that make up the search path. It is
not an error for an item in directoryList to not refer to a readable,
searchable directory.
- encoding names
-
Returns a list of the names of available encodings.
The encodings “utf-8” and “iso8859-1” are guaranteed to be present in the
list.
- encoding profiles
-
Returns a list of names of available encoding profiles. See PROFILES
below.
- encoding system ?encoding?
-
Sets the system encoding to encoding. If encoding is not given,
returns the current system encoding. The system encoding is used to pass
strings to system calls.
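As an illustrative sketch of the commands above, the following checks that utf-8 is available, encodes one character as utf-8, inspects the resulting bytes, and decodes them again. It assumes Tcl 8.6 or later for binary encode hex; the variable names are chosen here for illustration only.

```tcl
# utf-8 is guaranteed to be in the list of available encodings.
set available [expr {"utf-8" in [encoding names]}]

# Encode HIRAGANA LETTER HA (U+306F) as utf-8 and show the bytes in hex.
set bytes [encoding convertto utf-8 \u306F]
set hex [binary encode hex $bytes]

# Decoding the bytes recovers the original character.
set text [encoding convertfrom utf-8 $bytes]
set same [expr {$text eq "\u306F"}]

puts [list $available $hex $same]   ;# -> 1 e381af 1
```

Naming the encoding explicitly, as done here, keeps the result independent of the current system encoding.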
Each profile is a distinct strategy for dealing with invalid data for an
encoding.
The following profiles are currently implemented.
- strict
-
The default profile. The operation fails when invalid data for the encoding
are encountered.
- tcl8
-
Provides for behaviour identical to that of Tcl 8.6: When
decoding, for encodings other than utf-8, each invalid byte is interpreted
as the Unicode value given by that one byte. For example, the byte 0x80, which
is invalid in the ASCII encoding, would be mapped to the Unicode value U+0080.
For utf-8, each invalid byte that is a valid CP1252 character is
interpreted as the Unicode value for that character, while each byte that is
not is treated as the Unicode value given by that one byte. For example, byte
0x80 is defined by CP1252 and is therefore mapped to its Unicode equivalent
U+20AC, while byte 0x81, which is not defined by CP1252, is mapped to U+0081. As
an additional special case, the sequence 0xC0 0x80 is mapped to U+0000.
When encoding, each character that cannot be represented in the encoding is
replaced by an encoding-dependent character, usually the question mark ?.
- replace
-
When decoding, invalid bytes are replaced by U+FFFD, the Unicode REPLACEMENT
CHARACTER.
When encoding, Unicode values that cannot be represented in the target encoding
are replaced by an encoding-specific fallback character: U+FFFD REPLACEMENT
CHARACTER for UTF targets, and generally the question mark ? for other encodings.
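The decoding behaviour described above can be compared side by side. The following is a sketch only, assuming a Tcl version that supports the -profile option; tryProfiles is a hypothetical helper written for this illustration and is not part of Tcl. The expected result follows from the profile descriptions above for the single byte 0x80, which is invalid in utf-8.

```tcl
# Hypothetical helper: decode data with each profile and report, per profile,
# either the word "error" or the list of resulting Unicode values.
proc tryProfiles {enc data} {
    set report [dict create]
    foreach profile {strict tcl8 replace} {
        if {[catch {encoding convertfrom -profile $profile $enc $data} result]} {
            dict set report $profile error
        } else {
            set codes [lmap c [split $result {}] {format U+%04X [scan $c %c]}]
            dict set report $profile $codes
        }
    }
    return $report
}

puts [tryProfiles utf-8 \x80]
# -> strict error tcl8 U+20AC replace U+FFFD
```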
These examples use the utility proc below that prints the Unicode value for
each character in a string.
proc codepoints s {
    join [lmap c [split $s {}] {
        string cat U+ [format %.6X [scan $c %c]]
    }]
}
Example 1: Convert from euc-jp:
% codepoints [encoding convertfrom euc-jp \xA4\xCF]
U+00306F
The result is the Unicode value “\u306F”, which is the Hiragana letter HA.
Example 2: Error handling based on profiles:
The letter A is Unicode character U+0041 and the byte "\x80" is invalid
in ASCII encoding.
% codepoints [encoding convertfrom -profile tcl8 ascii A\x80]
U+000041 U+000080
% codepoints [encoding convertfrom -profile replace ascii A\x80]
U+000041 U+00FFFD
% codepoints [encoding convertfrom -profile strict ascii A\x80]
unexpected byte sequence starting at index 1: '\x80'
Example 3: Get partial data and the error location:
% codepoints [encoding convertfrom -failindex idx ascii AB\x80]
U+000041 U+000042
% set idx
2
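Building on the -failindex behaviour shown in Example 3, the following sketch decodes the valid parts of the input and skips each byte that cannot be converted. decodeSkippingInvalid is a hypothetical helper written for this illustration, not a Tcl command, and the sketch assumes a Tcl version that supports -profile and -failindex.

```tcl
# Hypothetical helper: decode data, dropping every byte that is invalid
# in the given encoding.
proc decodeSkippingInvalid {enc data} {
    set out {}
    while {1} {
        # Decode up to the first invalid byte; idx receives its index,
        # or -1 if the remaining data was decoded completely.
        append out [encoding convertfrom -profile strict -failindex idx $enc $data]
        if {$idx < 0} {
            break
        }
        # Skip the offending byte and continue with the rest.
        set data [string range $data [expr {$idx + 1}] end]
    }
    return $out
}

puts [decodeSkippingInvalid ascii A\x80B\x81C]   ;# -> ABC
```

Dropping bytes silently loses information, so a real program may prefer to record the reported indices or to use the replace profile instead.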
Example 4: Encode a character that is not representable in ISO8859-1:
% encoding convertto -profile tcl8 iso8859-1 A\u0141
A?
% encoding convertto -profile strict iso8859-1 A\u0141
unexpected character at index 1: 'U+000141'
% encoding convertto -failindex idx iso8859-1 A\u0141
A
% set idx
1
Tcl_GetEncoding, fconfigure
encoding, unicode
Copyright © 1998 Scriptics Corporation.
Copyright © 2023 Nathan Coulter