Decodes each string into a sequence of code points with start offsets.
tf.strings.unicode_decode_with_offsets(
input, input_encoding, errors='replace', replacement_char=65533,
replace_control_characters=False, name=None
)
This op is similar to tf.strings.unicode_decode(...), but it also returns the start byte offset of each character within its string. These offsets can be used to align the decoded characters with the original byte sequence.
Returns a tuple (codepoints, start_offsets) where:
- codepoints[i1...iN, j] is the Unicode codepoint for the jth character in input[i1...iN], when decoded using input_encoding.
- start_offsets[i1...iN, j] is the start byte offset for the jth character in input[i1...iN], when decoded using input_encoding.

| Args | |
|---|---|
input | An N dimensional potentially ragged string tensor with shape [D1...DN]. N must be statically known. |
input_encoding | String name for the unicode encoding that should be used to decode each string. |
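errors | Specifies the response when an input string can't be converted using the indicated encoding. One of: 'strict' (raise an exception for any illegal substrings), 'replace' (replace illegal substrings with replacement_char), or 'ignore' (skip illegal substrings). |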
replacement_char | The replacement codepoint to be used in place of invalid substrings in input when errors='replace'; and in place of C0 control characters in input when replace_control_characters=True. |
replace_control_characters | Whether to replace the C0 control characters (U+0000 - U+001F) with the replacement_char. |
name | A name for the operation (optional). |
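
The error-handling arguments can be illustrated without TensorFlow. The sketch below is a pure-Python analogy, not the op's implementation: it relies on Python's own codec error handlers, which share the names 'strict', 'replace', and 'ignore', and on a hypothetical helper `replace_c0_controls` that mimics replace_control_characters=True. The TensorFlow op may group runs of invalid bytes into replacement characters differently from Python, so only a single invalid byte is used here.

```python
REPLACEMENT_CHAR = 65533  # U+FFFD, the default value of replacement_char

bad = b"a\xffb"  # the byte 0xFF can never appear in valid UTF-8

# Analogue of errors='replace': the invalid byte becomes the replacement character.
replaced = [ord(ch) for ch in bad.decode("utf-8", errors="replace")]

# Analogue of errors='ignore': the invalid byte is skipped.
ignored = [ord(ch) for ch in bad.decode("utf-8", errors="ignore")]

# Analogue of errors='strict': decoding raises an exception.
try:
    bad.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True


def replace_c0_controls(codepoints, replacement_char=REPLACEMENT_CHAR):
    """Hypothetical helper: replace C0 control characters (U+0000 - U+001F)."""
    return [replacement_char if cp <= 0x1F else cp for cp in codepoints]


print(replaced)                            # [97, 65533, 98]
print(ignored)                             # [97, 98]
print(strict_failed)                       # True
print(replace_c0_controls([72, 9, 105]))   # [72, 65533, 105]
```
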
| Returns | |
|---|---|
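A tuple of N+1 dimensional tensors (codepoints, start_offsets). The returned tensors are tf.Tensors if input is a scalar, or tf.RaggedTensors otherwise. codepoints is an int32 tensor with shape [D1...DN, (num_chars)]; start_offsets is an int64 tensor with shape [D1...DN, (num_chars)]. |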

>>> import tensorflow as tf
>>> tf.compat.v1.enable_eager_execution()  # to_list() requires eager execution in TF 1.x
>>> input = [s.encode('utf8') for s in (u'G\xf6\xf6dnight', u'\U0001f60a')]
>>> result = tf.strings.unicode_decode_with_offsets(input, 'UTF-8')
>>> result[0].to_list() # codepoints
[[71, 246, 246, 100, 110, 105, 103, 104, 116], [128522]]
>>> result[1].to_list() # offsets
[[0, 1, 3, 5, 6, 7, 8, 9, 10], [0]]
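
The offsets in the example above can be checked without TensorFlow. The following is a minimal pure-Python sketch that assumes UTF-8 input; `decode_with_offsets` and `split_bytes` are hypothetical helpers written for illustration and are not part of the TensorFlow API. The sketch recomputes the code points and start byte offsets for the same two strings, then uses each offset, together with the next offset or the string's byte length, to slice the original bytes into one chunk per character.

```python
def decode_with_offsets(data):
    """Hypothetical helper: (codepoints, start_offsets) for one UTF-8 byte string."""
    codepoints, start_offsets = [], []
    offset = 0
    for ch in data.decode("utf-8"):
        codepoints.append(ord(ch))
        start_offsets.append(offset)
        # Advance by the number of bytes this character occupies in UTF-8.
        offset += len(ch.encode("utf-8"))
    return codepoints, start_offsets


def split_bytes(data, start_offsets):
    """Hypothetical helper: slice the original bytes into one chunk per character."""
    ends = start_offsets[1:] + [len(data)]
    return [data[start:end] for start, end in zip(start_offsets, ends)]


strings = [s.encode("utf-8") for s in ("G\xf6\xf6dnight", "\U0001f60a")]
decoded = [decode_with_offsets(s) for s in strings]
codepoints = [cp for cp, _ in decoded]
offsets = [off for _, off in decoded]
chunks = [split_bytes(s, off) for s, off in zip(strings, offsets)]

print(codepoints)     # [[71, 246, 246, 100, 110, 105, 103, 104, 116], [128522]]
print(offsets)        # [[0, 1, 3, 5, 6, 7, 8, 9, 10], [0]]
print(chunks[0][:3])  # [b'G', b'\xc3\xb6', b'\xc3\xb6']
```

Each 'ö' occupies two bytes in UTF-8, which is why the offsets jump from 1 to 3 to 5, and the emoji occupies all four bytes of the second string.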
© 2020 The TensorFlow Authors. All rights reserved.
Licensed under the Creative Commons Attribution License 3.0.
Code samples licensed under the Apache 2.0 License.
https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/strings/unicode_decode_with_offsets