Transcode the input text from a source encoding to a destination encoding.
The input is a string tensor of any shape. The output is a string tensor of the same shape containing the transcoded strings. Output strings are always valid unicode. If the input contains invalid encoding positions, the errors attribute sets the policy for how to deal with them. Under the default error-handling policy (replace), invalid formatting is substituted in the output by the replacement_char. If the errors policy is ignore, any invalid encoding positions in the input are skipped and not included in the output. If it is set to strict, any invalid formatting results in an InvalidArgument error.
This operation can be used with output_encoding = input_encoding to enforce correct formatting for inputs even if they are already in the desired encoding.
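The three error policies can be illustrated with a small standalone sketch. It is not TensorFlow's implementation: the names ErrorPolicy, ValidSequenceLength and SanitizeUtf8 are hypothetical, the input and output are both assumed to be UTF-8 (the output_encoding = input_encoding case), and each offending byte is assumed to count as one invalid position.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// All names below are hypothetical; this is not TensorFlow's implementation.
enum class ErrorPolicy { kStrict, kReplace, kIgnore };

// Returns the byte length (1 to 4) of the well-formed UTF-8 sequence that
// starts at s[i], or 0 if the bytes at that position are malformed.
std::size_t ValidSequenceLength(const std::string& s, std::size_t i) {
  assert(i < s.size());
  const unsigned char lead = static_cast<unsigned char>(s[i]);
  if (lead < 0x80) return 1;  // ASCII
  std::size_t len = 0;
  std::uint32_t cp = 0;
  if (lead >= 0xC2 && lead <= 0xDF) {
    len = 2;
    cp = lead & 0x1Fu;
  } else if (lead >= 0xE0 && lead <= 0xEF) {
    len = 3;
    cp = lead & 0x0Fu;
  } else if (lead >= 0xF0 && lead <= 0xF4) {
    len = 4;
    cp = lead & 0x07u;
  } else {
    return 0;  // stray continuation byte or invalid lead byte
  }
  if (i + len > s.size()) return 0;  // truncated sequence
  for (std::size_t k = 1; k < len; ++k) {
    const unsigned char b = static_cast<unsigned char>(s[i + k]);
    if ((b & 0xC0) != 0x80) return 0;  // not a continuation byte
    cp = (cp << 6) | (b & 0x3Fu);
  }
  // Reject overlong forms, surrogates, and values above U+10FFFF.
  if (len == 3 && cp < 0x800) return 0;
  if (len == 4 && cp < 0x10000) return 0;
  if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
  if (cp > 0x10FFFF) return 0;
  return len;
}

// Copies valid sequences through unchanged and applies `policy` to the rest.
// `replacement` is the UTF-8 encoding of the replacement character
// (U+FFFD by default).
std::string SanitizeUtf8(const std::string& in, ErrorPolicy policy,
                         const std::string& replacement = "\xEF\xBF\xBD") {
  std::string out;
  std::size_t i = 0;
  while (i < in.size()) {
    const std::size_t len = ValidSequenceLength(in, i);
    if (len > 0) {
      out.append(in, i, len);
      i += len;
      continue;
    }
    switch (policy) {
      case ErrorPolicy::kStrict:
        throw std::invalid_argument("invalid UTF-8 at byte " +
                                    std::to_string(i));
      case ErrorPolicy::kReplace:
        out += replacement;
        break;
      case ErrorPolicy::kIgnore:
        break;  // drop the byte, emit nothing
    }
    ++i;  // assumption: one offending byte is one invalid position
  }
  return out;
}
```

With the input bytes 'a', 0xFF, 'z', the replace policy yields 'a', U+FFFD, 'z'; the ignore policy yields "az"; and the strict policy throws, which stands in here for the InvalidArgument error.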
If the input is prefixed by a Byte Order Mark needed to determine encoding (e.g. if the encoding is UTF-16 and the BOM indicates big-endian), then that BOM will be consumed and not emitted into the output. If the input encoding is marked with an explicit endianness (e.g. UTF-16-BE), then the BOM is interpreted as a zero-width no-break space (U+FEFF) and is preserved in the output. This is also always the case for UTF-8.
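The BOM rule for UTF-16 can be sketched as follows. This is a standalone illustration, not the code TensorFlow runs: DecodeUtf16Units is a hypothetical helper, it only splits bytes into 16-bit code units, and it assumes big-endian when the encoding is plain "UTF-16" and no BOM is present.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: splits raw UTF-16 bytes into 16-bit code units,
// applying the BOM rule described above.
std::vector<std::uint16_t> DecodeUtf16Units(const std::string& bytes,
                                            const std::string& encoding) {
  bool big_endian = true;  // assumption: default when no BOM is present
  std::size_t start = 0;
  if (encoding == "UTF-16") {
    // No explicit endianness: a leading BOM is metadata. Use it to pick the
    // byte order, then skip it so it never reaches the output.
    if (bytes.size() >= 2) {
      const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
      const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
      if (b0 == 0xFE && b1 == 0xFF) {
        start = 2;
      } else if (b0 == 0xFF && b1 == 0xFE) {
        big_endian = false;
        start = 2;
      }
    }
  } else if (encoding == "UTF-16-LE") {
    big_endian = false;  // explicit endianness: keep any leading U+FEFF
  } else if (encoding != "UTF-16-BE") {
    throw std::invalid_argument("unsupported encoding: " + encoding);
  }

  std::vector<std::uint16_t> units;
  // A trailing odd byte is ignored in this sketch.
  for (std::size_t i = start; i + 1 < bytes.size(); i += 2) {
    const std::uint16_t first = static_cast<unsigned char>(bytes[i]);
    const std::uint16_t second = static_cast<unsigned char>(bytes[i + 1]);
    units.push_back(big_endian
                        ? static_cast<std::uint16_t>((first << 8) | second)
                        : static_cast<std::uint16_t>((second << 8) | first));
  }
  return units;
}
```

For the bytes FE FF 4F 60, decoding as "UTF-16" gives the single unit 0x4F60, while decoding as "UTF-16-BE" gives 0xFEFF followed by 0x4F60, so the BOM survives as an ordinary codepoint.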
The end result is that if the input is marked with an explicit endianness, the transcoding is faithful to all codepoints in the source. If it is not marked with an explicit endianness, the BOM is not considered part of the string itself but as metadata, and so is not preserved in the output.
Arguments:
input_encoding: Text encoding of the input strings, for example "UTF-16", "US ASCII", "UTF-8".
output_encoding: The unicode encoding to use in the output. Must be one of "UTF-8", "UTF-16-BE", "UTF-32-BE". Multi-byte encodings will be big-endian.
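Optional attributes (see Attrs):
errors: Error handling policy when invalid formatting is found in the input. A value of 'strict' will cause the operation to produce an InvalidArgument error on any invalid input formatting. A value of 'replace' (the default) will cause the operation to replace any invalid formatting in the input with the replacement_char codepoint. A value of 'ignore' will cause the operation to skip any invalid formatting in the input and produce no corresponding output character.
replacement_char: The replacement character codepoint to be used in place of any invalid formatting in the input when errors='replace'. Any valid unicode codepoint may be used. The default value is the unicode replacement character, U+FFFD (decimal 65533).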
Note that for UTF-8, passing a replacement character expressible in 1 byte, such as ' ', will preserve string alignment to the source since invalid bytes will be replaced with a 1-byte replacement. For UTF-16-BE and UTF-16-LE, any 1 or 2 byte replacement character will preserve byte alignment to the source.
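The alignment note comes down to how many bytes the replacement codepoint occupies once encoded. The sketch below makes that concrete for UTF-8; EncodeUtf8 and PreservesUtf8Alignment are hypothetical helper names, and the check assumes each invalid position is a single byte, as in the note above.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: encodes one Unicode codepoint as UTF-8.
std::string EncodeUtf8(std::uint32_t cp) {
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
    throw std::invalid_argument("not a valid Unicode scalar value");
  }
  std::string out;
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}

// True when substituting `replacement` for a single invalid byte leaves every
// later byte at the same offset (assumption: one invalid byte per position).
bool PreservesUtf8Alignment(std::uint32_t replacement) {
  return EncodeUtf8(replacement).size() == 1;
}
```

A space (U+0020) encodes to one byte and keeps offsets intact, whereas the default U+FFFD encodes to the three bytes EF BF BD, so each replaced byte lengthens the output by two bytes.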
replace_control_characters: Whether to replace the C0 control characters (00-1F) with the replacement_char. Default is false.
Output: A string tensor containing unicode text encoded using output_encoding.
|Constructors and Destructors|
UnicodeTranscode( const ::tensorflow::Scope & scope, ::tensorflow::Input input, StringPiece input_encoding, StringPiece output_encoding )
UnicodeTranscode( const ::tensorflow::Scope & scope, ::tensorflow::Input input, StringPiece input_encoding, StringPiece output_encoding, const UnicodeTranscode::Attrs & attrs )
|Public functions|
::tensorflow::Node * node() const
|Public static functions|
Optional attribute setters for UnicodeTranscode.
Attrs Errors( StringPiece x )
Attrs ReplaceControlCharacters( bool x )
Attrs ReplacementChar( int64 x )
© 2020 The TensorFlow Authors. All rights reserved.
Licensed under the Creative Commons Attribution License 3.0.
Code samples licensed under the Apache 2.0 License.