Counting char Units
private String testString = "abcd\u5B66\uD800\uDF30";
int charCount = testString.length();
System.out.printf("char count: %d\n", charCount);
The length method counts the number of char values in a String object. The sample code prints this:
char count: 7
Counting Character Units
When Unicode version 4.0 defined a significant number of new characters above U+FFFF, the 16-bit char type could no longer represent all characters. Starting with the Java 2 Platform, Standard Edition 5.0 (J2SE 5.0), the Java platform began to support the new Unicode characters as pairs of 16-bit char values called a surrogate pair.
This special use of 16-bit units is called UTF-16, and the Java Platform uses UTF-16 to represent Unicode characters. The char type is now a UTF-16 code unit, not necessarily a complete Unicode character (code point).
private String testString = "abcd\u5B66\uD800\uDF30";
int charCount = testString.length();
int characterCount = testString.codePointCount(0, charCount);
System.out.printf("character count: %d\n", characterCount);
This example prints this:
character count: 6 The Japanese character has Unicode code point U+5B66, which has the same hexadecimal char value \u5B66. The Gothic letter's code point is U+10330. In UTF-16, the Gothic letter is the surrogate pair \uD800\uDF30.
Counting Bytes
byte[] utf8 = null;
int byteCount = 0;
try {
utf8 = str.getBytes("UTF-8");
byteCount = utf8.length;
} catch (UnsupportedEncodingException ex) {
ex.printStackTrace();
}
System.out.printf("UTF-8 Byte Count: %d\n", byteCount);
The target character set determines how many bytes are generated. The UTF-8 encoding transforms a single Unicode code point into one to four 8-bit code units (a byte). The characters a, b, c, and d require a total of only four bytes. The Japanese character turns into three bytes. The Gothic letter takes four bytes. The total result is shown here:
UTF-8 Byte Count: 11
Tony
No comments:
Post a Comment