So I came across the following code on Stack Overflow for calculating the length of a UTF-8 string in C a few days ago:
int my_strlen_utf8_c(char *s) {
    int i = 0, j = 0;
    while (s[i]) {
        /* count every byte that is not a continuation byte (10XX XXXX) */
        if ((s[i] & 0xc0) != 0x80) j++;
        i++;
    }
    return j;
}
Here is a short explanation I wrote for myself (and decided to share) about this function. UTF-8 is a variable-width encoding, which means the length of each character is encoded in the character itself. For 1-byte ASCII characters the high-order bit must be 0 and the following 7 bits describe the character. The bit formats for UTF-8 sequences are the following:
1 byte: 0XXX XXXX -> 7 bits of data
2 bytes: 110X XXXX 10XX XXXX -> 11 bits of data
3 bytes: 1110 XXXX 10XX XXXX 10XX XXXX -> 16 bits of data
4 bytes: 1111 0XXX 10XX XXXX 10XX XXXX 10XX XXXX -> 21 bits of data
The conditional used in the function to identify a byte as the start of a character is: (byte & 0xC0) != 0x80 (the parentheses matter, since != binds tighter than & in C). Let's take the character 'á', which is encoded as:
character | hex | binary
Γ‘ | C3,A1 | 1100 0011, 1010 0001
Let's apply the first byte (0xC3) in the conditional described above:
0xC3 & 0xC0 != 0x80
1100 0011 & 1100 0000 != 1000 0000
1100 0000 != 1000 0000
TRUE
The function identifies the byte as the start of a new character. Let's apply the second byte (0xA1):
0xA1 & 0xC0 != 0x80
1010 0001 & 1100 0000 != 1000 0000
1000 0000 != 1000 0000
FALSE
It is not counted as a new character, which is correct: C3 and A1 together, although two bytes, make up one character, so the counter is incremented only once, giving the correct length of 1.