So I came across the following code on Stack Overflow for calculating the length of a UTF-8 string in C a few days ago:
int my_strlen_utf8_c(char *s) {
    int i = 0, j = 0;
    while (s[i]) {
        /* count every byte that is not a continuation byte (10XX XXXX) */
        if ((s[i] & 0xc0) != 0x80) j++;
        i++;
    }
    return j;
}
Here is a short explanation I wrote for myself (and decided to share) about this function. UTF-8 is a variable-width encoding, which means the length of each character is encoded in the character itself. For 1-byte ASCII characters the high-order bit must be 0 and the following 7 bits describe the character. The bit formats for UTF-8 sequences are the following:
1 byte: 0XXX XXXX -> 7 bits of data
2 bytes: 110X XXXX 10XX XXXX -> 11 bits of data
3 bytes: 1110 XXXX 10XX XXXX 10XX XXXX -> 16 bits of data
4 bytes: 1111 0XXX 10XX XXXX 10XX XXXX 10XX XXXX -> 21 bits of data
The conditional used in the function to identify a byte as the start of a character is: (byte & 0xC0) != 0x80 (the parentheses matter, since != binds tighter than & in C). Let's take the character 'á', which is encoded as:
character | hex | binary
Γ‘ | C3,A1 | 1100 0011, 1010 0001
Let's apply the first byte (0xC3) in the conditional described above:
0xC3 & 0xC0 != 0x80
1100 0011 & 1100 0000 != 1000 0000
1100 0000 != 1000 0000
TRUE
The function identifies the byte as the start of a new character. Let's apply the second byte (0xA1):
0xA1 & 0xC0 != 0x80
1010 0001 & 1100 0000 != 1000 0000
1000 0000 != 1000 0000
FALSE
It is not counted as a new character, which is correct: C3 and A1 together, although two bytes, make up one character, so the counter is incremented only once, giving the correct length of 1.