public class UTF8
extends java.lang.Object
Decoding of UTF-8 is based on a presentation by Bob Steagall at CppCon2018 (see https://github.com/BobSteagall/CppCon2018). It uses a Deterministic Finite Automaton (DFA) to recognize and decode multi-byte code points.
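
To make the idea of decoding with a state machine concrete, here is a minimal sketch. It is not this class's implementation and does not reproduce the table-driven DFA from the cited presentation. It is a hand-written automaton whose state is the number of continuation bytes still expected plus the allowed range for the next byte. The class name Utf8StateMachineSketch and the method decode are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: a small hand-written state machine,
// not the table-driven DFA used by the UTF8 class documented here.
final class Utf8StateMachineSketch {
    private Utf8StateMachineSketch() {}

    /** Returns the number of UTF-16 code units written to out, or -1 on any error. */
    static int decode(byte[] in, char[] out) {
        int needed = 0;                  // continuation bytes still expected (the state)
        int lower = 0x80, upper = 0xBF;  // allowed range for the next continuation byte
        int codePoint = 0;               // code point accumulated so far
        int written = 0;                 // UTF-16 code units written to out

        for (byte raw : in) {
            int b = raw & 0xFF;
            if (needed == 0) {
                if (b <= 0x7F) {
                    codePoint = b;                       // single-byte (ASCII)
                } else if (b >= 0xC2 && b <= 0xDF) {
                    needed = 1;                          // 0xC0 and 0xC1 would be overlong
                    codePoint = b & 0x1F;
                } else if (b >= 0xE0 && b <= 0xEF) {
                    if (b == 0xE0) lower = 0xA0;         // rejects overlong 3-byte forms
                    if (b == 0xED) upper = 0x9F;         // rejects encoded surrogates
                    needed = 2;
                    codePoint = b & 0x0F;
                } else if (b >= 0xF0 && b <= 0xF4) {
                    if (b == 0xF0) lower = 0x90;         // rejects overlong 4-byte forms
                    if (b == 0xF4) upper = 0x8F;         // rejects values above U+10FFFF
                    needed = 3;
                    codePoint = b & 0x07;
                } else {
                    return -1;                           // invalid lead byte
                }
                if (needed != 0) continue;               // wait for continuation bytes
            } else {
                if (b < lower || b > upper) return -1;   // invalid continuation byte
                lower = 0x80;
                upper = 0xBF;
                codePoint = (codePoint << 6) | (b & 0x3F);
                if (--needed != 0) continue;
            }
            // A complete code point is available: emit one or two UTF-16 code units.
            if (codePoint < 0x10000) {
                if (written + 1 > out.length) return -1; // output too small
                out[written++] = (char) codePoint;
            } else {
                if (written + 2 > out.length) return -1; // output too small
                out[written++] = Character.highSurrogate(codePoint);
                out[written++] = Character.lowSurrogate(codePoint);
            }
        }
        return needed == 0 ? written : -1;               // truncated input is an error
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00E9".getBytes(StandardCharsets.UTF_8);
        char[] utf16 = new char[utf8.length];
        int n = decode(utf8, utf16);
        System.out.println(n < 0 ? "invalid UTF-8" : new String(utf16, 0, n));
    }
}
```

The sketch also shows why an output array as long as the input always suffices: one-, two- and three-byte sequences each yield one UTF-16 code unit, and a four-byte sequence yields two.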
Constructor and Description |
---|
UTF8() |
Modifier and Type | Method and Description |
---|---|
static int | transcodeToUTF16(byte[] utf8, char[] utf16): Transcode a UTF-8 encoding into a UTF-16 representation. |
static int | transcodeToUTF16(byte[] utf8, int utf8Off, int utf8Length, char[] utf16): Transcode a UTF-8 encoding into a UTF-16 representation. |
public static int transcodeToUTF16(byte[] utf8, char[] utf16)
Transcode a UTF-8 encoding into a UTF-16 representation. The utf16 array should be at least as long as the input utf8 array to handle arbitrary inputs. The number of output UTF-16 code units is returned, or -1 if any errors are encountered (in which case an arbitrary amount of data may have been written into the output array). Errors that will be detected are malformed UTF-8, including incomplete, truncated or "overlong" encodings, and unmappable code points. In particular, no unmatched surrogates will be produced. An error will also result if utf16 is found to be too small to store the complete output.
Parameters:
utf8 - A non-null array containing a well-formed UTF-8 encoding.
utf16 - A non-null array, at least as long as the utf8 array in order to ensure the output will fit.
Returns:
The number of UTF-16 code units written to utf16 (beginning from index 0), or else -1 if the input was either malformed or encoded any unmappable characters, or if utf16 is too small.
public static int transcodeToUTF16(byte[] utf8, int utf8Off, int utf8Length, char[] utf16)
Transcode a UTF-8 encoding into a UTF-16 representation. The utf16 array should be at least as long as the input length taken from utf8 (utf8Length) to handle arbitrary inputs. The number of output UTF-16 code units is returned, or -1 if any errors are encountered (in which case an arbitrary amount of data may have been written into the output array). Errors that will be detected are malformed UTF-8, including incomplete, truncated or "overlong" encodings, and unmappable code points. In particular, no unmatched surrogates will be produced. An error will also result if utf16 is found to be too small to store the complete output.
Parameters:
utf8 - A non-null array containing a well-formed UTF-8 encoding.
utf8Off - Start position in the array for the well-formed encoding.
utf8Length - Length in bytes of the well-formed encoding.
utf16 - A non-null array, at least as long as the utf8 array in order to ensure the output will fit.
Returns:
The number of UTF-16 code units written to utf16 (beginning from index 0), or else -1 if the input was either malformed or encoded any unmappable characters, or if utf16 is too small.
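
The calling pattern for the offset-and-length form can be sketched without the library on the classpath. The example below is an assumption-laden stand-in: Utf8ContractSketch and its transcode method are hypothetical names, and the body uses the JDK's CharsetDecoder in strict mode rather than this class's DFA. It only mirrors the documented contract: return the number of UTF-16 code units written from index 0, or -1 on malformed input or insufficient output space.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in built on the JDK decoder; it is not the UTF8 class itself.
final class Utf8ContractSketch {
    private Utf8ContractSketch() {}

    /** Mirrors the documented contract: units written to utf16, or -1 on any error. */
    static int transcode(byte[] utf8, int utf8Off, int utf8Length, char[] utf16) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(utf8, utf8Off, utf8Length);
        CharBuffer out = CharBuffer.wrap(utf16);

        // endOfInput = true so that a truncated trailing sequence is reported as malformed.
        CoderResult result = decoder.decode(in, out, true);
        if (!result.isUnderflow()) {
            return -1; // malformed input, unmappable input, or output overflow
        }
        result = decoder.flush(out);
        if (!result.isUnderflow()) {
            return -1;
        }
        return out.position(); // code units written, starting at index 0
    }

    public static void main(String[] args) {
        // A payload embedded in a larger buffer: one framing byte on each side.
        byte[] payload = "na\u00EFve".getBytes(StandardCharsets.UTF_8);
        byte[] frame = new byte[payload.length + 2];
        System.arraycopy(payload, 0, frame, 1, payload.length);

        // Size the output by the input length, as the documentation recommends.
        char[] utf16 = new char[payload.length];
        int n = transcode(frame, 1, payload.length, utf16);
        if (n < 0) {
            System.out.println("input rejected; contents of utf16 are unspecified");
        } else {
            System.out.println(new String(utf16, 0, n));
        }
    }
}
```

Because the output may have been partially written when -1 is returned, a caller should treat the contents of utf16 as unspecified after a failure and use only the first n elements after a success.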