Welcome to Tony's Blog: UTF4.0 defines U+FFFF Java String

Monday, September 25, 2006

UTF4.0 defines U+FFFF Java String

Counting char Units

// Inside a method: four ASCII letters, one CJK character, one Gothic letter
String testString = "abcd\u5B66\uD800\uDF30";
int charCount = testString.length(); // counts 16-bit char values
System.out.printf("char count: %d\n", charCount);

The length method counts the number of char values in a String object. The sample code prints this:

char count: 7

Counting Character Units

When Unicode version 4.0 defined a significant number of new characters above U+FFFF, the 16-bit char type could no longer represent all characters. Starting with the Java 2 Platform, Standard Edition 5.0 (J2SE 5.0), the Java platform supports these new Unicode characters as pairs of 16-bit char values called surrogate pairs.

This special use of 16-bit units is called UTF-16, and the Java Platform uses UTF-16 to represent Unicode characters. The char type is now a UTF-16 code unit, not necessarily a complete Unicode character (code point).
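Because a char may be only half of a character, code that walks a string one char at a time can split a surrogate pair. The following is a minimal sketch of stepping through a string by code point instead. The class name CodePointWalk and the helper codePoints are made up for illustration; String.codePointAt and Character.charCount are the J2SE 5.0 methods.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointWalk {
    // Hypothetical helper: collects every code point in s,
    // advancing by one or two char values as needed.
    static List<Integer> codePoints(String s) {
        List<Integer> result = new ArrayList<Integer>();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);     // reads a surrogate pair as one value
            result.add(cp);
            i += Character.charCount(cp);  // 1 below U+10000, otherwise 2
        }
        return result;
    }

    public static void main(String[] args) {
        String testString = "abcd\u5B66\uD800\uDF30";
        for (int cp : codePoints(testString)) {
            System.out.printf("U+%04X uses %d char value(s)%n",
                    cp, Character.charCount(cp));
        }
    }
}
```

For the test string, the loop visits six code points even though length() reports seven char values.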

String testString = "abcd\u5B66\uD800\uDF30";
int charCount = testString.length();
// Count code points (whole characters) across the entire string
int characterCount = testString.codePointCount(0, charCount);
System.out.printf("character count: %d\n", characterCount);

This example prints this:

character count: 6

The Japanese character has Unicode code point U+5B66, which has the same hexadecimal char value \u5B66. The Gothic letter's code point is U+10330. In UTF-16, the Gothic letter is the surrogate pair \uD800\uDF30.
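To see where the pair for U+10330 comes from, here is a small sketch of the UTF-16 arithmetic: subtract 0x10000, then put the top ten bits of the remainder on 0xD800 and the bottom ten bits on 0xDC00. The class SurrogateMath and the method toSurrogates are hypothetical names used only for this illustration; in real code Character.toChars does the same job.

```java
public class SurrogateMath {
    // Hypothetical helper: splits a code point above U+FFFF
    // into its UTF-16 high and low surrogates.
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;                // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] manual = toSurrogates(0x10330);
        char[] library = Character.toChars(0x10330);
        System.out.printf("manual:  %04X %04X%n", (int) manual[0], (int) manual[1]);
        System.out.printf("library: %04X %04X%n", (int) library[0], (int) library[1]);
    }
}
```

For U+10330 the offset is 0x330, so the high surrogate is 0xD800 + 0 and the low surrogate is 0xDC00 + 0x330 = 0xDF30. Both lines print D800 DF30.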

Counting Bytes

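// Needs: import java.io.UnsupportedEncodingException;
byte[] utf8 = null;
int byteCount = 0;
try {
    // Encode the same test string as UTF-8 and count the bytes
    utf8 = testString.getBytes("UTF-8");
    byteCount = utf8.length;
} catch (UnsupportedEncodingException ex) {
    // Unlikely: every Java platform is required to support UTF-8
    ex.printStackTrace();
}
System.out.printf("UTF-8 Byte Count: %d\n", byteCount);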

The target character set determines how many bytes are generated. The UTF-8 encoding transforms a single Unicode code point into one to four 8-bit code units (bytes). The characters a, b, c, and d require a total of only four bytes. The Japanese character turns into three bytes. The Gothic letter takes four bytes. The total, 4 + 3 + 4, is shown here:

UTF-8 Byte Count: 11
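Since the byte count depends on the target character set, it can help to encode the same string two ways and compare. This is only a sketch: the class ByteCounts and the helper byteCount are invented names, and it assumes Java 7 or later for java.nio.charset.StandardCharsets, which also removes the need for the try/catch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteCounts {
    // Hypothetical helper: number of bytes s occupies in the given charset.
    static int byteCount(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String testString = "abcd\u5B66\uD800\uDF30";
        System.out.printf("UTF-8:    %d bytes%n",
                byteCount(testString, StandardCharsets.UTF_8));
        System.out.printf("UTF-16BE: %d bytes%n",
                byteCount(testString, StandardCharsets.UTF_16BE));
    }
}
```

With the test string this reports 11 bytes for UTF-8 and 14 for UTF-16BE: seven char values at two bytes each, with no byte order mark.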

Tony
