Monday, April 26, 2021

Golang - Rune and String

ASCII



American Standard Code for Information Interchange.


In 1963:


    * 7 bits (standard) -> 128

    * 95 printable characters

        capital letters, lowercase letters, digits and some simple symbols

    * 33 control characters.

        escape, del, or backspace ...

    The limitation is that it is too small and only support English.

Nowadays:


    * 8 bits (extended) -> 256

    * The extended ASCII version is invented for Latin letters, encoding additional symbols such as mathematical notation, graphical elements, and common accented characters.


Unicode


Background:



Because there are thousands of characters in Chinese or other non-English languages that cannot be encoded in 8 bits. Scrambled text!

As a result, each country intended multiple-byte encoding systems that were all mutually incompatible.

Then a universal encoding scheme was invented to control them all.

It is the super set of standard ASCII.  


UTF-32



A character is stored in 4 bytes (32 bits). As a result, it is a fixed width encoding technique.

The biggest difficulty with using UTF-32 is that if a file (1KB) is written in ASCII (English only), it will require 4KB memory space for UTF-32 without adding further information.

The disadvantages are as follows:

    1. The data transformation time

    2. The huge waste of memory
 


UTF-16



UTF-16, unlike UTF-32, is not a fixed width encoding system.

It can represent two bytes (16 bits) or four bytes (32 bits).

The advantage is that if the file is written in ASCII (English only), it will only take up two times as much memory.


UTF-8



In contrast to UTF-32 and UTF-16, UTF-8 can only use 8 bits when the file is written in ASCII.

It is not a fixed width technique, with a range of 1 byte to 6 bytes.

How can we figure out how many bytes we need to represent?

    1. Leading byte (using a count of 1 before 0 to get the total number of bytes encoded)

    2. Continuation byte (beginning with 10XXXXXX)  


One byte (7 bits: 0 - 127)


  0XXXXXXX


7 bits can represent standard ASCII. 


Two Bytes (11 bits: 128 - 2047):


110XXXXX 10XXXXXX


It is 2 because the number of 1 preceding 0 indicates that there are a total of two bytes.



Three Bytes (16 bits: 2048 - 65535 ):


  1110XXXX 10XXXXXX 10XXXXXX

 
It is 3 because the number of 1 preceding 0 indicates that there are a total of three bytes.


Four Bytes (21 bits):


  11110XXX 10XXXXXX 10XXXXXX 10XXXXXX


It is 4 because the number of 1 preceding 0 indicates that there are a total of four bytes.

It is already sufficient for all characters in all languages. 


Five Bytes: (26 bits)

            
111110XX 10XXXXXX 10XXXXXX 10XXXXXX 10XXXXXX


It is 5 because the number of 1 preceding 0 indicates that there are a total of five bytes.


Six Bytes: (31 bits)


  1111110X 10XXXXXX 10XXXXXX 10XXXXXX 10XXXXXX 10XXXXXX

 
It is 6 because the number of 1 preceding 0 indicates that there are a total of six bytes.



Why we need continuation byte?

We may simply drop the continuation byte if it is the first data we get because we may handle the streaming data.
 

Why utf-8 pattern cannot used for Seven Bytes?

11111110 -> It is the code point defined by Unicode already.


Bytes, Rune, and Strings



A string is simply a collection of bytes. 



  "hey"

    equals to

[]byte{104101121}

 

As a result, string and []byte are interchangeable.


  []byte("hey")

or

  string([]byte{104101121})


Also, string literal are automatically encoded in UTF-8.


Rune:



What is rune?


  var myInt int
  myInt = 'A'
  fmt.Printf("%c -> %[1]d\n", myInt)

  var myInt8 int8
  myInt8 = 'A'
  fmt.Printf("%c -> %[1]d\n", myInt8)

  myRune := 'A'
  fmt.Printf("%c -> %[1]d\n", myRune)



  A -> 65
  A -> 65
  A -> 65


'A' is a character or a Rune Literal (typeless numbers).

Whatever you call it, it can be assigned to a variable of any numeric type (if there are enough bytes to describe it).

A "Unicode Code Point" is also the name given to the printing number 65.

We can use fmt verb %c to represent a character and %x to represent a hexadecimal value.
Because the default type of Rune Literal is Rune Type, the type of 'myRune' is Rune Type in this example.

To learn more, consider the following example:

        
Literal    Dec        Hex        UTF-8 encoded string
  *****************************************************************
  ¡          161        a1         c2 a1



* How to calculate the UTF-8 encoded string of '¡'?

Since UTF-8 uses two bytes to display 161, then we can use the formula we listed above to build a placeholder 110xxxxx 10xxxxxx first. Also the binary of 161 is 0b0010100001, and we can use it to fill the placeholder.

    UTF-8 
             => 110xxxxx 10xxxxxx
             => 11000010 10100001
             => c2 a1

Then the utf-8 encoded string of 161 is c2 a1.


Rune Type



Ex:

  fmt.Printf("%-10s %-10s %-10s %-30s\n%s\n""Literal""Dec"
"Hex""utf-8 encoded string", strings.Repeat("*"70))

  t1 := 'ý'
  fmt.Printf("%-10c %-10[1]d %-10[1]x % x\n", t1, string(t1))

  t2 := 253
  fmt.Printf("%-10c %-10[1]d %-10[1]x % -30x\n", t2, string(rune(t2)))

Result:
            
Literal    Dec        Hex        utf-8 encoded string
  ******************************************************************
  ý          253        fd         c3 bd
  ý          253        fd         c3 bd



More on string



1. String values are read-only byte slices.


  name := "╳╳╳"

  name[0] = 'A'
  fmt.Println(name)


            
cannot assign to name[0]



2. len(string) will count the number of bytes, not rune.


  name := "╳╳╳"
  fmt.Println(len(name))


            
9



3. We can convert string to []rune and use len() to get the number of rune like below.


  name := "╳╳╳"
  runeSlice := []rune(name)
  fmt.Println(len(runeSlice))


            
3


4. However, conversion is costly. Instead, we can use utf8.RuneCountInString() to count the number of rune.


  name := "╳╳╳"
  fmt.Println(utf8.RuneCountInString(name))


            
3



Why using string instead of []rune?



Based on past experiments, we may conclude that []rune is preferable to string.

The main difference is that a string value is usually encoded in utf-8, which is more efficient because each rune needs 1 to 4 bytes.

However, because the rune type is an alias for int32, each rune in []rune has the same length: 4 bytes (inefficient).


Decode utf-8 encoded string



Using for range loop:

    
name := "╳╳╳"
  fmt.Printf("%c  =  % -12[1]d = % -12[1]x\n"'╳')
  for ir := range name {
    fmt.Printf("name[%d]  =  % -12x =  %-6q != % x\n"
i, string(r), r, name[i])
  }


            
╳ =>  9587 => 0x 2573
  name[0]  =  e2 95 b3     =  '╳'    !=  e2
  name[3]  =  e2 95 b3     =  '╳'    !=  e2
  name[6]  =  e2 95 b3     =  '╳'    !=  e2


We can see that index i does not have the predicted value of 0, 1, 2.

Furthermore, using an index to traverse a string such as name[i] is error-prone because each rune requires separate bytes in utf-8 encoding. (need 3 bytes)

The reason for this is that "for range loop" jumps over strng runes rather than bytes.

Each index returns the rune's starting index.


Why string values are read-only?

Strings are simply read-only bytes slices. (Refer to this doc)
There is no need for a capacity because they are read-only (they cannot be grown).


  s1 := "Hello"
  s2 := "Hello"
  fmt.Println((*reflect.StringHeader)(unsafe.Pointer(&s1)).Data)
  fmt.Println((*reflect.StringHeader)(unsafe.Pointer(&s2)).Data)


        
4985534
  4985534


According to the result of the previous example, both variables with the same string have the same backing array.

Tuesday, April 20, 2021

Golang - Slice

Before we get into Slice, let's recap the Array.

    
var arr [2]int
    

It cannot expand or shrink (fix length) since the length is part of its type (compile-time) when declared.

And because Go is a static type, it cannot be changed at runtime.

Because the length is a part of the type, you cannot assign one array to another with a different length.


Slice




  var slice []int


1. At compile time, 'Slice' does not have a set length. As a result, it can grow or shrink in real time.


2. Before assigning a value, the slice's "Zero Value" is nil (it has no elements).

Consider the following example to compare the results of Array and Slice.


  var arr [2]int
  fmt.Printf("arr: %#v\n", arr)

  var slice []int
  fmt.Printf("slice: %#v\n", slice)


Result:
// 'Go' will give each element a 'Zero Value' based on its type. (int -> 0)
  arr: [2]int{0, 0}

  // 'Go' will set the value 'nil' to indicate that this 'slice' has not yet been initialized.
  slice: []int(nil)



3. Elements cannot be assigned to a 'nil' slice.


  var slice []int
  slice[0] = 1


Error:
                
runtime error: index out of range [0] with length 0



4. You can only assign one slice to another with a different length if their elements are of the same type.

Array:

  arr1 := [2]int{12}
  arr2 := [3]int{123}
  arr1 = arr2
  fmt.Printf("arr:: %#v\n", arr1)


Error:

// Because length is a part of array type, you cannot assign one array to another if their lengths differ
  cannot use arr2 (type [3]int) as type [2]int in assignment


Slice

  slice1 := []int{12}
  slice2 := []int{123}
  slice1 = slice2
  fmt.Printf("slice:: %#v\n", slice1)


Result:
                
// If the element type is the same, assigning one slice to another is not a problem,
//even if their lengths differ.
  slice:: []int{1, 2, 3}



5. Only nil (no other slice) can be compared to a slice.


  slice1 := []int{12}
  slice2 := []int{123}

  if slice1 == slice2 {
    fmt.Println("match")
  }


Error:

  invalid operation: slice1 == slice2 (slice can only be compared to nil)



Growing



Using the pre-built method 'append' (no need to import from other packages)

It will add an element to an existing slice and return a new slice. The current slice will be unaffected.

As a result, if you did not assign the new slice (from the 'append' method) to a variable, the data will be lost, and the errors listed below will appear.


  nums := []int{123}
  fmt.Printf("%#v\n", nums)

  append(nums, 4)
  fmt.Printf("%#v\n", nums)


Error:

  append(nums, 4) evaluated but not used


You can also append a single element or slices.


  nums := []int{123}
  fmt.Printf("%#v\n", nums)

  nums = append(nums, 56)
  fmt.Printf("%#v\n", nums)

  nums2 := []int{78}
  nums = append(nums, nums2...)
  fmt.Printf("%#v\n", nums)


Result:

  []int{1, 2, 3}
  []int{1, 2, 3, 5, 6}
  []int{1, 2, 3, 5, 6, 7, 8}



Slicing



Any value that can be sliced is considered sliceable.

Sliceable objects include arrays, slices, and strings.

Format:

  newSlice := sliceable[start:stop]


More Details:

* start: slice it from this 'index'

* stop: slice it up to this 'element position' or (index+1)

* The sliced pieces of the original sliceable value are returned by slicing.


  num := []int{012345678}

  fmt.Println(num)
  fmt.Println(num[:])

  fmt.Println(num[0:1])
  fmt.Println(num[0])

  fmt.Println(num[:3])
  fmt.Println(num[3:])


More Details:

* num equals to num[:]
* num[0:1] (slicing expression) will return the slice value even it has single value
* num[0] (index expression) will return the element value directly


Exception



Appending to a sliced slice will override the original slice.


Slice Internals - Backing array



Do you think a slice can store its elements?

NO is the answer.

A slice does not store any elements directly.

Consider the following example:


  nums := []int{123}


The slice literal ([]int 1, 2, 3) creates a hidden array (also known as the Backing array) and returns the memory location of the hidden array to the slice variable for referencing.

So the backing array keeps the true elements, and the slice variables point to it.

The backing array can be shared by several slice variables because the elements are stored independently.

A slice is a window to its underlying array. To put it another way, a slice formed with a slice expression uses the same backing array as the original slice.

As an example below, nums, nums2, nums3, and nums4 all refer to the same backend array but with different windows.


  nums := []int{123}
  nums2 := nums[:]
  nums3 := nums[1:1]
  nums4 := nums[:2]


Because they are seeking for the same backing array, changing one of the slices above will have an effect on all of them.


  nums := []int{123}
  nums1 := nums[0:2]
  nums1[0] = 4

  fmt.Println(nums)
  fmt.Println(nums1)


Result:

  [4 2 3]
  [4 2]


Slicing is inexpensive since they reuse the same backing array rather than producing a new one.

Indexing a slice is also quick because the backing array is continuous in memory.


Slice Internals - Slice Header



What does a slice value contain if it does not save any elements?
It must know the memory address of its backup array.

How does Go do it?
To explain its backing array, there is a little data structure called Slice Header.

Slice Header contains:

1. pointer

    It refers to the baking array's memory address.

2. length

    It tells how many elements must be found for this slice value.

    A slice cannot index elements that are longer than its length (without re-slicing).


  nums := []int{123}
  nums2 := nums[0:1]
fmt.Println(nums2[1])


Error
                
runtime error: index out of range [1] with length 1


3. capacity

    It keeps track of how many elements (or how many spaces) the backing array has.

The slice size is fixed.

It uses 24 bytes (1 int64 (if you pc based on 64bit) = 8 bytes * 3 fields)

Which results in a low-cost slice operation:

1. Slicing: this function generates a new slice header.
2. Assigning a slice to another slice: just the slice header was transferred.


Example of passing array and slice to a function



Array 
func main() {
    nums := [3]int{123}
    modifyArray(nums)
    fmt.Printf("Main: %#v (%p)\n", nums, &nums)
}

func modifyArray(nums [3]int) {
    nums[0] = 4
    fmt.Printf("modifyArray: %#v (%p)\n", nums, &nums)
}


Result:
            
modifyArray: [3]int{4, 2, 3} (0xc000016180)
  Main: [3]int{1, 2, 3} (0xc000016160)


Analysis: When you pass an array to a function, Go will copy it into the function's new local variables.
As you can see, the memory locations differ between nums in main and nums in modifyArray func.
As a result, the change has no effect on the nums in main.


Slice: 

func main() {
    nums := []int{123}
    modifyArray(nums)
    fmt.Printf("Main: %#v (%p) (%+v)\n", nums, &nums, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums)))
}

func modifyArray(nums []int) {
    nums[0] = 4
    fmt.Printf("modifyArray: %#v (%p) (%+v)\n", nums, &nums, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums)))
}


Result:

modifyArray: []int{4, 2, 3} (0xc00000c080
(&{Data:824633811296 Len:3 Cap:3})
  Main: []int{4, 2, 3} (0xc00000c060
(&{Data:824633811296 Len:3 Cap:3})


When you provide a slice to a function, Go makes a new copy of the slice header and passes it to the function.
As a result of the logs, we can see that the memory address of the nums variables differs, but the content of the slice header remains the same.
Because Go does not make a duplicate of the underlying array, any changes made under modifyArray will influence the nums in main because they share the same backing array.


nil slice



A nil slice lacks a backing array but has a slice header.

    
var nums []int


Slice Header:

    Pointer     0
    Length      0
    Capacity    0


Slice Internals - Capacity



The length of a backing array and where a slice begins determine capacity.
It can influence how far a slice can be extended.

The len and cap of a slice generated by slice literal will be the same.


  nums := []int{123}
  fmt.Printf("nums: %#v (%+v)\n", nums, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums)))


Result:
            
nums: []int{1, 2, 3} (&{Data:824633811296 Len:3 Cap:3})


If we slice it to an empty slice, its length becomes 0.


  nums2 := nums[0:0]
  fmt.Printf("nums: %#v (%+v)\n", nums2, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums2)))


Result:
            
nums: []int{} (&{Data:824633811296 Len:0 Cap:3})


The capacity allows you to extend the slice.

    
nums3 := nums2[0:3]
  fmt.Printf("nums: %#v (%+v)\n", nums3, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums3)))


Result:
            
nums: []int{1, 2, 3} (&{Data:824633811296 Len:3 Cap:3})


You can also utilize the cap feature.


  nums4 := nums2[0:cap(nums2)]
  fmt.Printf("nums: %#v (%+v)\n", nums4, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums4)))


Result:
            
nums: []int{1, 2, 3} (&{Data:824633811296 Len:3 Cap:3})


When you lose the initial slice header, you cannot extend back.

An empty slice (but not a nil slice) will not be assigned to a new backing array; they all share the same slice header:

    
nums5 := []int{}
  fmt.Printf("nums: %#v (%p) (%+v)\n", nums5, &nums5, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums5)))

  nums6 := []string{}
  fmt.Printf("nums: %#v (%p) (%+v)\n", nums6, &nums6, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums6)))


Result:
            
nums: []int{} (0xc00000c160) (&{Data:5757824 Len:0 Cap:0})
  nums: []string{} (0xc00000c1a0) (&{Data:5757824 Len:0 Cap:0})



Backing array for append function



If there is enough room for adding while appending a slice, Go will not create a new backing array.

However, if Go requires extra space for appending, it has its own way to enhance efficiency because allocating memory is costly.

Consider the following example:

    
nums := []int{1}
  fmt.Printf("nums: %#v (%+v)\n", nums, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums)))

  nums = append(nums, 2)
  fmt.Printf("nums: %#v (%+v)\n", nums, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums)))

  nums = append(nums, 3)
  fmt.Printf("nums: %#v (%+v)\n", nums, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums)))

  nums = append(nums, 4)
  fmt.Printf("nums: %#v (%+v)\n", nums, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums)))

  nums = append(nums, 5)
  fmt.Printf("nums: %#v (%+v)\n", nums, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums)))


Result:
        
nums: []int{1} (&{Data:824633802896 Len:1 Cap:1})
  nums: []int{1, 2} (&{Data:824633802992 Len:2 Cap:2})
  nums: []int{1, 2, 3} (&{Data:824633811392 Len:3 Cap:4})
  nums: []int{1, 2, 3, 4} (&{Data:824633811392 Len:4 Cap:4})
  nums: []int{1, 2, 3, 4, 5} (&{Data:824633852288 Len:5 Cap:8})


When num 4 is appended to a slice, the slice header pointer memory remains the same because there is still enough space.

The new slice headers are formed by attaching numbers 2, 3, and 5. You can also see that the cap has been raised to a power of 2.

If the cap exceeds 1024, the increasing rate is reduced to 25%.


All together



Format:

  [low:high:cap]

 
You may have come into this problem before.

    
nums1 := []int{123}
  nums2 := append(nums1[:1], 4)

  fmt.Println(nums1)
  fmt.Println(nums2)


Result:
            
[1 4 3]
  [1 4]


Why is the first result [1 4 3] rather than [1 2 3]?

To avoid affecting the raw nums1 slice, we can use 'cap' to limit the capacity of the returned slice header.

Capacity position can be thought of as the element position in the backing array, and it should be more than or equal to stop position.

Consider the following example.

    
nums1 := []int{123}
  nums2 := append(nums1[:1:1], 4)

  fmt.Println(nums1)
  fmt.Println(nums2)


Result:
            
[1 2 3]
  [1 4]


Because we have limited the capacity to one and there is no more space in this slice to append '4'.

Go will generate a new backing array that is not the same as nums1's.


Pre-allocate the backing array



If there is no more space for adding, Go will generate a new backing array, but this is time-consuming.

    
nums := []int{}
  fmt.Printf("%#v  (%+v)\n", nums, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums)))

  nums = append(nums, 1)
  fmt.Printf("%#v  (%+v)\n", nums, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums)))

  nums = append(nums, 2)
  fmt.Printf("%#v  (%+v)\n", nums, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums)))

  nums = append(nums, 3)
  fmt.Printf("%#v  (%+v)\n", nums, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums)))

  nums = append(nums, 4)
  fmt.Printf("%#v  (%+v)\n", nums, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums)))

  nums = append(nums, 5)
  fmt.Printf("%#v  (%+v)\n", nums, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums)))


Result:
            
[]int{}                 (&{Data:5753728 Len:0 Cap:0})
  []int{1}                (&{Data:824633802952 Len:1 Cap:1})
  []int{1, 2}             (&{Data:824633803024 Len:2 Cap:2})
  []int{1, 2, 3}          (&{Data:824633811456 Len:3 Cap:4})
  []int{1, 2, 3, 4}       (&{Data:824633811456 Len:4 Cap:4})
  []int{1, 2, 3, 4, 5}    (&{Data:824633844032 Len:5 Cap:8})


As a result, a few of new backing arrays are formed.

The 'make' function in Go allows you to preallocate a backing array (zero value depending on the type) with a particular length and capacity.
It is a built-in function that does not need to be imported from another package.

    
nums := make([]int05)
  fmt.Printf("%#v  (%+v)\n", nums, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums)))

  nums = append(nums, 1)
  fmt.Printf("%#v  (%+v)\n", nums, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums)))

  nums = append(nums, 2)
  fmt.Printf("%#v  (%+v)\n", nums, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums)))

  nums = append(nums, 3)
  fmt.Printf("%#v  (%+v)\n", nums, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums)))

  nums = append(nums, 4)
  fmt.Printf("%#v  (%+v)\n", nums, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums)))

  nums = append(nums, 5)
  fmt.Printf("%#v  (%+v)\n", nums, 
(*reflect.SliceHeader)(unsafe.Pointer(&nums)))


Result:
            
[]int{}                 (&{Data:824633852048 Len:0 Cap:5})
  []int{1}                (&{Data:824633852048 Len:1 Cap:5})
  []int{1, 2}             (&{Data:824633852048 Len:2 Cap:5})
  []int{1, 2, 3}          (&{Data:824633852048 Len:3 Cap:5})
  []int{1, 2, 3, 4}       (&{Data:824633852048 Len:4 Cap:5})
  []int{1, 2, 3, 4, 5}    (&{Data:824633852048 Len:5 Cap:5})


We can use make() to improve the performance of our code.


Copy slice without a loop



Format:

  copy(desitination_slice, source_slice)


It should be noted that the elements will be copied dependent on the length of the smallest slice.

    
nums1 := []int{12}
  nums2 := []int{321}

  n := copy(nums1, nums2)
  fmt.Println(nums1)
  fmt.Println(nums2)
  fmt.Println(n)


Result:
            
[3 2]
  [3 2 1]
  2