WTF is UTF-8?


Intro

Have you ever wondered how computers can display all the different languages and symbols used in the world? From the websites you browse to the apps you use, UTF-8 is the invisible backbone that powers multilingual communication in the digital realm. For most of my programming life I kinda ignored it, even after seeing it literally everywhere.

A Code-Length Conundrum in Go

I had been delving a bit deeper into the Go world recently when I stumbled upon something interesting.

fmt.Println(len("日本語")) //What do you think is the length here?

Well, I don't know about you, but my initial guess was 3. The correct length? 9. NINE. Now, coming from a Python/JS background, I thought one character would mean adding 1 to the length.

To understand this better, I read somewhere that a rune in Go is what a character is in other languages. Now what the heck is a rune? For that we need to travel back a bit, so that we really understand how important this issue is, and how lucky we are to be living in an era where these problems have already been solved by some super smart people.

Unicode Code Points: Abstract Representation

Think about how character sets are represented in memory. A computer at its core only understands bits, so we need to encode the information we use in our languages into a form that can be represented using those bits. This is what lets the computer process and interpret text, enabling tasks like sending emails or displaying documents. It was easy earlier when we were only thinking about English letters, as ASCII is enough to cover them all. ASCII gives every character a number from 0 to 127 (the printable ones start at 32), which fits in 7 bits. Computers use 8-bit bytes, right? Great, so we even have 1 bit to spare.

However, as computers became more widely used and communication spanned languages and cultures, the limitations of ASCII became apparent. Different countries and languages had their own unique characters and symbols, and different encoding schemes were developed to represent them. This led to a situation where the same text could be displayed differently on different systems, depending on the underlying character set encoding. The absence of a universally accepted encoding standard made international communication challenging. If you sent an email to a friend in another country, the characters might not be displayed correctly on their system due to encoding differences.
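To make that last problem concrete, here is a small sketch of the same two bytes being read under two different encodings. The helper decodeLatin1 is a name I made up for this illustration; it is not a standard library function, and it leans on the fact that Latin-1 byte values line up with the first 256 Unicode code points.

```go
package main

import "fmt"

// decodeLatin1 (hypothetical helper) interprets each byte as one Latin-1
// (ISO-8859-1) character. Latin-1 byte values coincide with the first 256
// Unicode code points, so converting each byte to a rune is enough here.
func decodeLatin1(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	// These two bytes are the UTF-8 encoding of U+00E9 (e with acute accent).
	data := []byte{0xC3, 0xA9}

	fmt.Println(string(data))       // bytes passed through as-is; a UTF-8 terminal shows one accented letter
	fmt.Println(decodeLatin1(data)) // read as Latin-1: two unrelated characters (classic mojibake)
}
```

Same bytes, different text. Without knowing the encoding, the receiver is just guessing.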

UTF-8

This is why we needed to agree on a common representation, which is where Unicode helped us. To represent characters, Unicode uses something known as code points (hoping you haven't forgotten about runes). Unicode code points are abstract representations of characters. For example, A is represented like this: U+0041. Unicode aims to have a code point for practically every character in every writing system. So great! We have found a way to represent all characters as part of a common representation.

But wait, this is not how the character is actually stored in memory. That's where encoding comes into the picture. Encoding schemes like UTF-8 translate these code points into concrete byte sequences, stored using 8-bit bytes. In UTF-8, every code point from 0 to 127 is stored in a single byte, exactly as in ASCII. Only code points 128 and above are stored using 2, 3, or 4 bytes. (The original design allowed up to 6 bytes, but the current standard caps it at 4.) It is a variable-length encoding scheme and very efficient, as it only uses extra bytes when they are needed, so don't worry, it won't blow up your memory.

Remember seeing <meta charset="UTF-8"> in your HTML? It was there so that the browser knows how to interpret the text. So text always goes hand in hand with an encoding scheme. If you don't tell me the encoding scheme, the bytes don't make any sense to me. Luckily most of it is UTF-8 these days, so we rarely have to care about it.
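Here is a minimal sketch of the variable-length part, using the standard unicode/utf8 package. The encode helper is my own name for this example; utf8.EncodeRune, utf8.RuneLen and utf8.UTFMax are the real library pieces.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode (hypothetical helper) returns the UTF-8 bytes for one code point.
func encode(r rune) []byte {
	buf := make([]byte, utf8.UTFMax) // UTFMax is 4, the longest UTF-8 encoding
	n := utf8.EncodeRune(buf, r)     // writes the encoding, returns bytes used
	return buf[:n]
}

func main() {
	// One sample code point for each encoded length, 1 through 4 bytes.
	for _, r := range []rune{'A', 'ß', '日', '😀'} {
		fmt.Printf("%c  U+%04X  % x  (%d bytes)\n", r, r, encode(r), utf8.RuneLen(r))
	}
}
```

The ASCII letter costs one byte, and the byte count only grows as the code point gets larger.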

Runes in Go

Now coming back to how Go treats strings, which is basically what led me to read more about all this stuff. A string in Go is a read-only slice of bytes. From Rob Pike's blog on strings:

It’s important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes.

When those bytes do hold UTF-8 text, each character in them corresponds to a code point, and a code point is what Go calls a rune. From the same blog:

“Code point” is a bit of a mouthful, so Go introduces a shorter term for the concept: rune. The term appears in the libraries and source code, and means exactly the same as “code point”, with one interesting addition. The Go language defines the word rune as an alias for the type int32, so programs can be clear when an integer value represents a code point. Moreover, what you might think of as a character constant is called a rune constant in Go.
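A quick way to watch runes in action is a for range loop over a string, which decodes the UTF-8 one rune at a time. This is a sketch; runeOffsets is a helper name I made up for the example.

```go
package main

import "fmt"

// runeOffsets (hypothetical helper) returns the byte offset at which each
// rune of s starts.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s { // ranging over a string steps rune by rune, not byte by byte
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "日本語"
	for i, r := range s {
		// i is a byte offset, r is a rune (an int32 holding the code point).
		fmt.Printf("byte offset %d: %c (U+%04X)\n", i, r, r)
	}
	fmt.Println(runeOffsets(s)) // [0 3 6]
}
```

Notice the offsets jump by 3 each time: the loop index counts bytes, while the loop value is a whole rune.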

Cracking the Length Code in Go

Now coming back to this:

fmt.Println(len("日本語"))

Hopefully now it makes sense why the length here was 9 and not 3. len returns the number of bytes, and the string is stored as 9 bytes of UTF-8. Each of these three runes takes up three bytes, so 9 bytes means 3 runes. Of course, we can get the number of runes too, using the function RuneCountInString provided by the unicode/utf8 package. It takes a string and returns the number of runes in it. It never returns an error: any invalid byte is simply counted as a single rune.

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    fmt.Println(utf8.RuneCountInString("日本語")) // outputs 3
}
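If you want to put bytes and runes side by side, something like the following sketch works. The counts helper is hypothetical, just a convenience for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts (hypothetical helper) returns the byte length and the rune count of s.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "日本語"
	fmt.Println(counts(s)) // 9 3

	// Indexing a string gives a single byte, not a character.
	fmt.Println(s[0]) // 230, i.e. 0xE6, the first byte of the first rune

	// Converting to []rune decodes the whole string, so indexing is per rune.
	runes := []rune(s)
	fmt.Println(len(runes), string(runes[0])) // 3 日
}
```

Converting to []rune allocates a new slice, so for just counting, RuneCountInString is the lighter choice.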

Hope you enjoyed the article!
These brilliant resources were the ones I used to brush up my understanding. I highly suggest checking them out :)

Joel Spolsky's blog on Unicode and Character sets
Rob Pike's blog on strings
The Go class by Matt Holiday