ASCII
American Standard Code for Information Interchange.
In 1963:
* 95 printable characters
capital letters, lowercase letters, digits and some simple symbols
* 33 control characters.
escape, del, or backspace ...
Its limitation is that it is too small and supports only English.
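The counts above can be checked with a minimal Go sketch that classifies the 128 seven-bit code points using the standard unicode package. The helper name countASCII is made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode"
)

// countASCII walks the 128 code points of 7-bit ASCII and counts
// how many are printable and how many are control characters.
func countASCII() (printable, control int) {
	for r := rune(0); r < 128; r++ {
		switch {
		case unicode.IsControl(r):
			control++
		case unicode.IsPrint(r): // includes the ASCII space
			printable++
		}
	}
	return printable, control
}

func main() {
	p, c := countASCII()
	fmt.Println(p, c) // 95 33
}
```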
Nowadays:
* 8 bits (extended) -> 256 characters
* Extended ASCII was invented for Latin letters; it encodes additional symbols such as mathematical notation, graphical elements, and common accented characters.
Unicode
Background:
Chinese and other non-English languages have thousands of characters, which cannot be encoded in 8 bits.
As a result, each country invented its own multi-byte encoding system, and these were all mutually incompatible. Text written in one encoding and read in another came out scrambled!
Then a universal encoding scheme was invented to unify them all.
It is a superset of standard ASCII.
UTF-32
A character is stored in 4 bytes (32 bits), so it is a fixed-width encoding.
The biggest problem with UTF-32 is that a 1 KB file written in ASCII (English only) requires 4 KB in UTF-32 without carrying any additional information.
The disadvantages are as follows:
1. The data transformation time
2. The huge waste of memory
UTF-16
UTF-16, unlike UTF-32, is not a fixed-width encoding.
A character takes either two bytes (16 bits) or four bytes (32 bits).
The advantage is that a file written in ASCII (English only) takes up only twice as much space, rather than four times.
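To see the two widths in practice, here is a small Go sketch based on the standard unicode/utf16 package. The helper utf16Bytes is a hypothetical name; it counts the 16-bit code units produced for one character and multiplies by two.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf16Bytes returns how many bytes r occupies in UTF-16.
func utf16Bytes(r rune) int {
	// Encode yields one 16-bit unit, or two (a surrogate pair)
	// for code points above U+FFFF.
	return 2 * len(utf16.Encode([]rune{r}))
}

func main() {
	for _, r := range []rune{'A', '╳', '😀'} {
		fmt.Printf("%c U+%04X -> %d bytes\n", r, r, utf16Bytes(r))
	}
}
```

Under these assumptions, 'A' and '╳' take two bytes each, while '😀' (U+1F600) takes four.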
UTF-8
In contrast to UTF-32 and UTF-16, UTF-8 uses only 8 bits per character when the file is written in ASCII.
It is not a fixed-width encoding: the original design ranges from 1 byte to 6 bytes per character, although the current standard (RFC 3629) allows at most 4.
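The space trade-off between the three encodings can be compared with a short Go sketch. It relies on the fact that Go strings hold UTF-8 bytes; encodedSizes is a name invented for this example.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// encodedSizes returns the number of bytes s occupies
// in UTF-8, UTF-16 and UTF-32.
func encodedSizes(s string) (u8, u16, u32 int) {
	runes := []rune(s)
	u8 = len(s)                        // Go strings hold UTF-8 bytes
	u16 = 2 * len(utf16.Encode(runes)) // 2 bytes per 16-bit code unit
	u32 = 4 * len(runes)               // 4 bytes per code point
	return u8, u16, u32
}

func main() {
	fmt.Println(encodedSizes("hello")) // 5 10 20
	fmt.Println(encodedSizes("😀"))     // 4 4 4
}
```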
How can we figure out how many bytes a character occupies?
1. Leading byte: the number of 1 bits before the first 0 gives the total number of bytes in the sequence.
2. Continuation byte: every following byte begins with 10 (10XXXXXX).
One byte (7 bits: 0 - 127)
0XXXXXXX
7 bits can represent standard ASCII.
Two Bytes (11 bits: 128 - 2047):
110XXXXX 10XXXXXX
The leading byte starts with two 1s before the first 0, which indicates a total of two bytes.
Three Bytes (16 bits: 2048 - 65535):
1110XXXX 10XXXXXX 10XXXXXX
The leading byte starts with three 1s before the first 0, which indicates a total of three bytes.
Four Bytes (21 bits: 65536 - 1114111, the Unicode maximum):
11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
The leading byte starts with four 1s before the first 0, which indicates a total of four bytes.
This is already sufficient for all characters in all languages.
Five Bytes: (26 bits)
111110XX 10XXXXXX 10XXXXXX 10XXXXXX 10XXXXXX
The leading byte starts with five 1s before the first 0, which indicates a total of five bytes.
Six Bytes: (31 bits)
1111110X 10XXXXXX 10XXXXXX 10XXXXXX 10XXXXXX 10XXXXXX
The leading byte starts with six 1s before the first 0, which indicates a total of six bytes.
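The "count the leading 1s" rule can be sketched in Go as follows. This follows the 1-to-6-byte patterns listed above rather than the stricter rules of modern UTF-8 validation, and seqLen is a hypothetical helper, not a standard library function.

```go
package main

import (
	"fmt"
	"math/bits"
)

// seqLen reports how many bytes the sequence starting with lead occupies,
// following the 1-to-6-byte patterns described above.
// It returns 0 for a continuation byte (10xxxxxx) and for 0xFE or 0xFF.
func seqLen(lead byte) int {
	// Inverting the byte turns its leading 1s into leading 0s,
	// which LeadingZeros8 can then count.
	ones := bits.LeadingZeros8(^lead)
	switch {
	case ones == 0:
		return 1 // 0xxxxxxx: a single ASCII byte
	case ones >= 2 && ones <= 6:
		return ones
	default:
		return 0
	}
}

func main() {
	for _, b := range []byte{0x41, 0xC2, 0xE2, 0xF0, 0xA1} {
		fmt.Printf("%08b -> %d\n", b, seqLen(b))
	}
}
```

For these five sample bytes the results are 1, 2, 3, 4 and 0; the last one, 10100001, is a continuation byte.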
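Why do we need continuation bytes?
When handling streaming data, the first byte we receive may fall in the middle of a character. Because continuation bytes are always marked with the prefix 10, we can recognize them and simply drop them until the next leading byte arrives.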
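A minimal sketch of this resynchronization in Go, assuming the stream arrives as a byte slice that may start in the middle of a character. utf8.RuneStart is a standard library function; skipPartial is a name chosen for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// skipPartial drops any continuation bytes at the start of chunk,
// so that decoding resumes at the next character boundary.
func skipPartial(chunk []byte) []byte {
	for len(chunk) > 0 && !utf8.RuneStart(chunk[0]) {
		chunk = chunk[1:]
	}
	return chunk
}

func main() {
	data := []byte("╳A╳") // e2 95 b3 41 e2 95 b3
	chunk := data[1:]     // starts mid-character: 95 b3 41 ...
	fmt.Printf("%s\n", skipPartial(chunk)) // A╳
}
```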
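Why can't the UTF-8 pattern be extended to seven bytes?
11111110 (0xFE) -> This byte value is already used by Unicode: together with 0xFF it forms the UTF-16 byte order mark (FE FF or FF FE), so UTF-8 never uses it.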
Bytes, Rune, and Strings
A string is simply a sequence of bytes.
"hey"
is equivalent to
[]byte{104, 101, 121}
As a result, string and []byte can be converted to each other.
[]byte("hey")
string([]byte{104, 101, 121})
Also, string literals are automatically encoded in UTF-8.
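One detail worth illustrating is that the conversion copies the bytes, so changing the []byte does not affect the original string. The sketch below uses a made-up helper, mutateCopy, and assumes a non-empty input.

```go
package main

import "fmt"

// mutateCopy converts s to []byte, overwrites the first byte of the copy,
// and returns the untouched original next to the modified copy.
// It assumes s is not empty.
func mutateCopy(s string) (orig, changed string) {
	b := []byte(s) // copies the bytes of s
	b[0] = 'H'     // changes only the copy
	return s, string(b)
}

func main() {
	fmt.Println(mutateCopy("hey")) // hey Hey

	// A non-ASCII literal shows the UTF-8 encoding: one character, two bytes.
	fmt.Println([]byte("¡")) // [194 161]
}
```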
Rune:
What is rune?
var myInt int
myInt = 'A'
fmt.Printf("%c -> %[1]d\n", myInt)
var myInt8 int8
myInt8 = 'A'
fmt.Printf("%c -> %[1]d\n", myInt8)
myRune := 'A'
fmt.Printf("%c -> %[1]d\n", myRune)
A -> 65
A -> 65
A -> 65
'A' is a character, or rune literal (an untyped constant).
Whatever you call it, it can be assigned to a variable of any numeric type, as long as the type is wide enough to hold the value.
The number printed, 65, is called the "Unicode code point" of 'A'.
We can use the fmt verb %c to print the character and %x to print its hexadecimal value.
Because the default type of a rune literal is rune, the type of myRune in this example is rune.
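The default-type rule can be observed with the %T verb. In this small sketch the helper typeNames is hypothetical; note that rune prints as int32 and byte as uint8 because both are aliases.

```go
package main

import "fmt"

// typeNames shows the types that the same literal 'A' takes on
// depending on the variable it is assigned to.
func typeNames() string {
	r := 'A'            // no explicit type: defaults to rune (int32)
	var b byte = 'A'    // the same literal as a byte (uint8)
	var f float64 = 'A' // and as a float64
	return fmt.Sprintf("%T %T %T", r, b, f)
}

func main() {
	fmt.Println(typeNames())                // int32 uint8 float64
	fmt.Printf("%c %#x %U\n", 'A', 'A', 'A') // A 0x41 U+0041
}
```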
To learn more, consider the following example:
Literal    Dec        Hex        UTF-8 encoded string
*****************************************************************
¡          161        a1         c2 a1
* How to calculate the UTF-8 encoded string of '¡'?
Since UTF-8 uses two bytes to encode 161, we can use the patterns listed above to build the placeholder 110xxxxx 10xxxxxx first. The binary form of 161, padded to 11 bits, is 0b00010100001, and we use it to fill the placeholder.
UTF-8
=> 110xxxxx 10xxxxxx
=> 11000010 10100001
=> c2 a1
Then the utf-8 encoded string of 161 is c2 a1.
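The same calculation can be written with bit operations and compared against the standard library. This is only a sketch for the two-byte case; encode2 is a name invented here, while utf8.AppendRune is the standard function (available since Go 1.18).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode2 encodes a code point in the range 128..2047 as two UTF-8 bytes
// by filling the template 110xxxxx 10xxxxxx.
func encode2(r rune) [2]byte {
	return [2]byte{
		0b11000000 | byte(r>>6),       // upper 5 bits after the 110 prefix
		0b10000000 | byte(r&0b111111), // lower 6 bits after the 10 prefix
	}
}

func main() {
	manual := encode2('¡') // code point 161
	std := utf8.AppendRune(nil, '¡')
	fmt.Printf("% x\n", manual[:]) // c2 a1
	fmt.Printf("% x\n", std)       // c2 a1
}
```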
Rune Type
Ex:
fmt.Printf("%-10s %-10s %-10s %-30s\n%s\n", "Literal", "Dec",
"Hex", "utf-8 encoded string", strings.Repeat("*", 70))
t1 := 'ý'
fmt.Printf("%-10c %-10[1]d %-10[1]x % x\n", t1, string(t1))
t2 := 253
fmt.Printf("%-10c %-10[1]d %-10[1]x % -30x\n", t2, string(rune(t2)))
Result:
Literal    Dec        Hex        utf-8 encoded string
**********************************************************************
ý          253        fd         c3 bd
ý          253        fd         c3 bd
More on string
1. String values are read-only byte slices.
name := "╳╳╳"
name[0] = 'A'
fmt.Println(name)
cannot assign to name[0]
2. len(string) counts the number of bytes, not runes.
name := "╳╳╳"
fmt.Println(len(name))
9
3. We can convert the string to []rune and use len() to get the number of runes, as below.
name := "╳╳╳"
runeSlice := []rune(name)
fmt.Println(len(runeSlice))
3
4. However, the conversion is costly because it allocates a new slice. Instead, we can use utf8.RuneCountInString() to count the runes.
name := "╳╳╳"
fmt.Println(utf8.RuneCountInString(name))
3
Why use string instead of []rune?
Based on the examples above, we might conclude that []rune is preferable to string.
The main difference is that a string value is usually encoded in UTF-8, which is more space-efficient because each rune takes only 1 to 4 bytes.
In contrast, because the rune type is an alias for int32, every rune in a []rune takes the same 4 bytes, even for ASCII (inefficient).
Decode utf-8 encoded string
Using for range loop:
name := "╳╳╳"
fmt.Printf("%c = % -12[1]d = % -12[1]x\n", '╳')
for i, r := range name {
fmt.Printf("name[%d] = % -12x = %-6q != % x\n",
i, string(r), r, name[i])
}
╳ =  9587        =  2573
name[0] = e2 95 b3     = '╳'    !=  e2
name[3] = e2 95 b3     = '╳'    !=  e2
name[6] = e2 95 b3     = '╳'    !=  e2
We can see that the index i takes the values 0, 3, 6 rather than the 0, 1, 2 one might expect.
Furthermore, traversing a string by index, as in name[i], is error-prone because a rune may span several bytes in UTF-8 (three bytes for '╳'), so name[i] yields only a single byte.
The reason is that the for range loop steps over the string's runes rather than its bytes.
On each iteration, the index is the starting byte index of the rune.
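What the for range loop does can be approximated by hand with utf8.DecodeRuneInString, which returns the rune at the start of a string together with its width in bytes. The helper runeOffsets below is hypothetical and only collects the starting indexes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeOffsets returns the byte index at which each rune of s starts.
func runeOffsets(s string) []int {
	var offsets []int
	for i := 0; i < len(s); {
		_, size := utf8.DecodeRuneInString(s[i:])
		offsets = append(offsets, i)
		i += size // advance by the width of the decoded rune
	}
	return offsets
}

func main() {
	fmt.Println(runeOffsets("╳╳╳")) // [0 3 6]
	fmt.Println(runeOffsets("a¡╳")) // [0 1 3]
}
```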
Why are string values read-only?
Strings are simply read-only byte slices. (Refer to this doc)
They need no capacity because they are read-only (they cannot grow).
s1 := "Hello"
s2 := "Hello"
fmt.Println((*reflect.StringHeader)(unsafe.Pointer(&s1)).Data)
fmt.Println((*reflect.StringHeader)(unsafe.Pointer(&s2)).Data)
4985534
4985534
According to the result above, both variables hold the same string and share the same backing array.