Go slices ≠ dynamic arrays

Original article: https://appliedgo.net/slices/

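Go's slices are cleverly designed. They provide the look and feel of truly dynamic arrays while being well optimized for performance. However, if you are not aware of how slices work internally, you can run into trouble.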

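Background: I recently came across several more discussions about seemingly illogical behavior of slice operations. I think it is time to explain slice internals and the mechanics of slice operations, especially append() and bytes.Split().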

Go slices

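The concept of slices in Go is an elegant one: a slice represents a flexible-length, array-like data type while keeping memory allocation fully under the programmer's control.
Many other languages have no direct equivalent, so people new to Go often find the behavior of slice operations quite confusing (as did I). Exploring the inner workings of slices removes much, if not all, of that confusion, so let's start with the basics: what is a slice, and how does it work?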

A slice is just a view on an array

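In Go, arrays have a fixed length. The length is part of the array's type, so [10]int and [20]int are not just two integer arrays of different lengths; they are entirely different types!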

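Slices add a dynamic layer on top of arrays. Creating a slice from an array neither allocates new memory nor copies any data; a slice is nothing but a window onto some part of the array. Technically, a slice can be seen as a struct containing a pointer to the array element where the slice starts, plus two integers describing the slice's length and capacity.
This means that typical slice operations are cheap: creating a slice, extending it (as far as the capacity permits), and moving it back and forth along the array only require changing the pointer and at most the two integers. The actual data does not move.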
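The pointer, length, and capacity bookkeeping described above can be observed with len() and cap(). The following is a minimal sketch; the function name resliceDemo is made up for illustration and is not part of the original article.

```go
package main

import "fmt"

// resliceDemo (a hypothetical helper) takes a window on a 10-element array,
// then grows and shifts it. It returns len and cap after each step, plus the
// array element that a write through the final slice lands on.
func resliceDemo() (steps [3][2]int, arr3 int) {
	var arr [10]int

	s := arr[2:5] // window on arr[2..4]; capacity runs to the end of arr
	steps[0] = [2]int{len(s), cap(s)}

	s = s[:6] // grow within capacity: only the length changes
	steps[1] = [2]int{len(s), cap(s)}

	s = s[1:] // slide the start forward: pointer, length, and capacity change
	steps[2] = [2]int{len(s), cap(s)}

	s[0] = 42 // s now starts at arr[3], so this writes into the array itself
	return steps, arr[3]
}

func main() {
	steps, v := resliceDemo()
	fmt.Println(steps, v) // [[3 8] [6 8] [5 7]] 42
}
```

None of these steps allocates or copies; each one only produces a new slice header over the same array.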

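This also means that two slices created from the same array can overlap, and that after assigning one slice to another slice variable, both refer to the same memory cells. Changing an element through one of them changes the same element in the other. To get a true copy of a slice, create a new slice and fill it with the built-in copy() function.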
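A short sketch of the difference between assigning a slice and copying it. The helper name shareAndCopy is an assumption chosen for this example.

```go
package main

import "fmt"

// shareAndCopy (a hypothetical helper) writes through the original slice and
// returns what an alias and a true copy see afterwards.
func shareAndCopy() (alias, independent int) {
	orig := []int{1, 2, 3}

	a := orig // assignment copies only the slice header; the array is shared

	c := make([]int, len(orig)) // destination needs enough length for copy()
	copy(c, orig)               // copies the elements into c's own array

	orig[0] = 99
	return a[0], c[0]
}

func main() {
	alias, independent := shareAndCopy()
	fmt.Println(alias, independent) // 99 1
}
```

Note that copy() copies only min(len(dst), len(src)) elements, so the destination must be created with sufficient length, not just capacity.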

All of this behavior rests on simple and consistent mechanisms. Problems arise only when you are not aware of them.

Some slice functions work in place

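Since slices are just dynamic windows on static arrays, it makes sense that many slice operations happen in place.
For example, bytes.Split() takes a slice and a separator, splits the slice at each separator, and returns a slice of []byte slices.
But: all the slices returned by Split() still point into the original array. This may surprise readers who know similar split functions from other languages, which typically allocate new memory and copy the data (at the expense of runtime efficiency).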
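The sharing can be checked directly, as in the sketch below. The helper names splitShares and splitThenCopy are made up for this example; the append([]byte(nil), ...) idiom is one common way to take a private copy of a byte slice.

```go
package main

import (
	"bytes"
	"fmt"
)

// splitShares (hypothetical) writes through the first part returned by
// bytes.Split and returns the input as it looks afterwards.
func splitShares() string {
	input := []byte("a,b,c")
	parts := bytes.Split(input, []byte(","))
	parts[0][0] = '*' // parts[0] is a subslice of input
	return string(input)
}

// splitThenCopy (hypothetical) takes a private copy of the first part before
// modifying it, so the input stays untouched.
func splitThenCopy() (input, first string) {
	in := []byte("a,b,c")
	parts := bytes.Split(in, []byte(","))
	own := append([]byte(nil), parts[0]...) // new array holding a copy of parts[0]
	own[0] = '*'
	return string(in), string(own)
}

func main() {
	fmt.Println(splitShares()) // *,b,c
	in, first := splitThenCopy()
	fmt.Println(in, first) // a,b,c *
}
```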

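Code that ignores the fact that the results of Split() still point to the original data can corrupt that data in a way that neither the compiler nor the runtime can flag as an error.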

Another surprise can occur when bytes.Split() and append() are combined. But first, let's look at append() on its own.

append() adds convenience and a little "magic"

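append() adds new elements to the end of a slice, thereby extending it. append() has two convenience features:

  1. It can append to a nil slice, which brings the slice into existence at the moment of appending.
  2. If the remaining capacity cannot hold the new elements, append() automatically allocates a new array and copies the old content over.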

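The second feature in particular can cause confusion: after an append(), sometimes the original array has been modified, and sometimes a new array has been created while the original stays unchanged. If the original array is referenced elsewhere in the code, some of those references may then point to stale data.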
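Both features, and both outcomes of the second one, fit into a few lines. This is a sketch only; appendOutcomes is a hypothetical name.

```go
package main

import "fmt"

// appendOutcomes (hypothetical) shows appending to a nil slice, then an append
// that fits into the existing array and one that does not.
func appendOutcomes() (fromNil []int, aFirst, cFirst int) {
	var empty []int            // nil slice: no array behind it yet
	fromNil = append(empty, 1) // append allocates an array for it

	a := make([]int, 2, 3) // length 2, one spare slot
	b := append(a, 10)     // fits: b shares a's array
	c := append(b, 20)     // does not fit: c gets a new array with copied content

	b[0] = 7 // visible through a (same array), not through c (separate array)
	return fromNil, a[0], c[0]
}

func main() {
	fromNil, aFirst, cFirst := appendOutcomes()
	fmt.Println(fromNil, aFirst, cFirst) // [1] 7 0
}
```

Here c is the reference holding stale data: it was copied before the write through b and no longer follows changes to the original array.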

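It would be easy to call this behavior "random", but it is in fact fully deterministic. An observer who knows the slice's length and capacity and the number of elements to append can easily tell whether append() needs to allocate a new array.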

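Combining bytes.Split() with append() can also produce unexpected results. At the time of writing, the slices returned by bytes.Split() had their cap() extending to the end of the underlying array. Appending to the first returned slice then writes directly into that array and overwrites the slices that follow. (Go 1.10 changed bytes.Split() so that each returned subslice has a capacity equal to its length.)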

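If bytes.Split() returned every slice with its capacity set to its length, append() could not affect the following slices: any append to one of them would immediately allocate a new array to hold the data that exceeds the slice's current capacity.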
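Capacity can be limited by hand with a full slice expression, s[low:high:max]. The sketch below uses plain slice expressions instead of bytes.Split() so that the result does not depend on the Go version (since Go 1.10, bytes.Split() itself returns subslices whose capacity equals their length). The function names are made up for illustration.

```go
package main

import "fmt"

// appendUncapped (hypothetical) appends to a subslice whose capacity extends
// to the end of the array, and returns the array content afterwards.
func appendUncapped() string {
	data := []byte("a,b,c")
	first := data[0:1] // len 1, cap 5
	first = append(first, 'X', 'Y')
	_ = first
	return string(data) // the append wrote into data[1] and data[2]
}

// appendCapped (hypothetical) does the same with a capacity-limited subslice.
func appendCapped() (data, first string) {
	d := []byte("a,b,c")
	f := d[0:1:1] // len 1, cap 1: any append must allocate
	f = append(f, 'X', 'Y')
	return string(d), string(f)
}

func main() {
	fmt.Println(appendUncapped()) // aXY,c
	d, f := appendCapped()
	fmt.Println(d, f) // a,b,c aXY
}
```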

A few examples

https://github.com/AppliedGo/slices

/*




+++
title = "Go slices are not dynamic arrays"
description = "Go slices are based on a smart concept that does not like being ignored"
author = "Christoph Berger"
email = "[email protected]"
date = "2017-08-03"
draft = "false"
domains = ["Patterns and Paradigms"]
tags = ["slice", "append", "split", "memory management", "gotcha"]
categories = ["Background"]
+++

Go's slices are cleverly designed. They provide the look-and-feel of truly dynamic arrays while being optimized for performance. However, not being aware of the slice mechanisms can bring you into trouble.



*Background: Just recently I observed a few discussions--again--about seemingly inconsistent behavior of slice operations. I take this as an opportunity to talk a bit about slice internals and the mechanics around slice operations, especially `append()` and `bytes.Split()`.*


## Go's slices

The concept of slices in Go is really a clever one. A slice represents a flexible-length array-like data type while providing full control over memory allocations.

This concept is not seen in other languages, and so people new to Go often consider the behavior of slice operations as quite confusing. (Believe me, it happened to me as well.) Looking at the inner workings of slices removes much (if not all) of the confusion, so let's first have a look at the basics: What are slices, and how do they work?


## A slice is just a view on an array

In Go, arrays have a fixed size. The size is even part of the definition of an array, so the two arrays `[10]int` and `[20]int` are not just two `int` arrays of different size but are in fact different types.

Slices add a dynamic layer on top of arrays. Creating a slice from an array  neither allocates new memory nor copies anything. A slice is nothing but a "window" to some part of the array. Technically, a slice can be seen as a struct with a pointer to the array element where the slice starts, and two ints describing length and capacity.

This means that typical slice manipulations are cheap. Creating a slice, expanding it (as far as the available capacity permits), moving it back and forth on the underlying array--all that requires nothing more than changing the pointer value and/or one or both of the two int values. The data location does not change.

HYPE[slice basics](slices01.html)

*Fig.1: Slices are just "windows" to an array (click the buttons to see the operations)*

This also means that two slices created from the same array can overlap, and after assigning a slice to a new slice variable, both variables now share the same memory cells. Changing one item in one of the slices also changes the same item in the other slice. If you want to create a true copy of a slice, create a new slice and use the built-in function `copy()`.

All of this is based on simple and consistent mechanisms. The problems arise when not being aware of these mechanisms.


## Some slice functions work in place

Since slices are just efficient "dynamic windows" on static arrays, it does make sense that most slice manipulations also happen in place.

As an example, `bytes.Split()` takes a slice and a separator, splits the slice by the separator, and returns a slice of byte slices.

But: All the byte slices returned by `Split()` still point to the same underlying array as the original slice. This may come as a surprise to many who know similar split functions from other languages that rely on allocate-and-copy semantics (at the expense of efficiency at runtime).

HYPE[split](slices02.html)

*Fig. 2: `bytes.Split()` is an in-place operation*

Code that ignores the fact that the result of `Split()` still points to the original data may cause data corruption in a way that neither the compiler nor the runtime can detect as being wrong.

Another unexpected behavior can happen when combining `bytes.Split()` and `append()` - but first, let's have a look at `append()` alone.


## append() adds convenience--and some "magic"

`append()` adds new elements to the end of a slice, thus expanding the slice. `append()` has two convenience features:
* First, it can append to a `nil` slice, making it spring into existence in the moment of appending.
*  Second, if the remaining capacity is not sufficient for appending new values, `append()` automatically takes care of allocating a new array and copying the old content over.

Especially the second one can cause confusion, because after an `append()`, sometimes the original array has been changed, and sometimes a new array has been created, and the original one stays the same. If the original array was referenced by different parts of the code, one reference then may point to stale data.

HYPE[split](slices03.html)

*Fig. 3: The two outcomes of `append()`*

This behavior could be easily characterized as "random", although the behavior is in fact quite deterministic. An observer who always knows the values of slice length, capacity, and the number of items to append can trivially determine whether `append()` needs to allocate a new array.

In combination with `bytes.Split()`, `append()` can also create unexpected results. At the time of writing, the slices that `bytes.Split()` returns have their `cap()` set to the end of the underlying array. Now when `append()`ing to the first of the returned slices, the slice grows within the same underlying array, overwriting subsequent slices. (Go 1.10 changed `bytes.Split()` so that each returned subslice has a capacity equal to its length.)

HYPE[split and append](slices04.html)

*Fig. 4: After splitting (see fig. 2), append to the first returned slice*

If `bytes.Split()` returned all slices with their capacity set to their length, `append()` would not be able to overwrite subsequent slices, as it would immediately allocate a new array, to be able to extend beyond the slice's current capacity.

## A few demos

The code below demonstrates the discussed `Split()` and `append()` scenarios. It also shows how to achieve "always copy" semantics when appending.
*/

//
package main

import (
    "bytes"
    "fmt"
)

// Split the byte slice `a` at each comma, then update one of the split slices.
func splitDemo() {
    fmt.Println("Split demo")
    // bytes.Split splits in place.
    a := []byte("a,b,c")
    b := bytes.Split(a, []byte(","))
    fmt.Printf("a before changing b[0][0]: %q\n", a)

    // `b`'s byte slices use `a`'s underlying array. Changing `b[0][0]` also changes `a`.
    b[0][0] = byte('*')
    fmt.Printf("a after changing b[0][0]:  %q\n", a)

    // Appending to slice `b[0]` can write into slices `b[1]` and even `b[2]`, as `b[0]`'s capacity extends until the end of the underlying array that all slices share.
    fmt.Printf("b[1] before appending to b[0]: %q\n", b[1])
    b[0] = append(b[0], 'd', 'e', 'f')
    fmt.Printf("b[1] after appending to b[0]:  %q\n", b[1])
    fmt.Printf("a after appending to b[0]: %q\n", a)
}

// Append numbers to a slice; first, within capacity, then beyond capacity.
func appendDemo() {
    fmt.Println("\nAppend demo")
    s1 := make([]int, 2, 4)
    s1[0] = 1
    s1[1] = 2
    fmt.Printf("Initial address and value: %p: %[1]v\n", s1)
    s1 = append(s1, 3, 4)
    // Note the same address as before.
    fmt.Printf("After first append:        %p: %[1]v\n", s1)
    s1 = append(s1, 5)
    // Note the changed address. Append allocated a new, larger array.
    fmt.Printf("After second append:       %p: %[1]v\n", s1)
}

// How to get "always copy" semantics: simply `copy()` the slice before appending. Ensure the target slice is large enough for the subsequent `append()`, or else `append()` might again allocate a new array.
func alwaysCopy() {
    fmt.Println("\nAppend and always copy")
    s1 := []int{1, 2, 3, 4}
    fmt.Printf("s1: %p: %[1]v\n", s1)
    // Create a new slice with sufficient len (for copying) and cap (for appending - to avoid allocating and copying twice).
    s2 := make([]int, 4, 8)
    // Destination is always the first parameter, analogous to Fprintf, http.HandleFunc, etc.
    copy(s2, s1)
    // Note the different addresses of s1 and s2 in the output.
    fmt.Printf("s2: %p: %[1]v\n", s2)
    s2 = append(s2, 5, 6, 7, 8)
    // s2 has enough capacity so that append() does not allocate again.
    fmt.Printf("s2: %p: %[1]v\n", s2)
}

func main() {
    splitDemo()
    appendDemo()
    alwaysCopy()
}

/*

Output:

Split demo
a before changing b[0][0]: "a,b,c"
a after changing b[0][0]:  "*,b,c"
b[1] before appending to b[0]: "b"
b[1] after appending to b[0]:  "e"
a after appending to b[0]: "*defc"

Append demo
Initial address and value: 0xc42000a340: [1 2]
After first append: 0xc42000a340: [1 2 3 4]
After second append: 0xc420012380: [1 2 3 4 5]

Append and always copy
s1: 0xc42000a3c0: [1 2 3 4]
s2: 0xc4200123c0: [1 2 3 4]
s2: 0xc4200123c0: [1 2 3 4 5 6 7 8]



## How to get and run the code

Step 1: `go get` the code. Note the `-d` flag that prevents auto-installing
the binary into `$GOPATH/bin`.

    go get -d github.com/appliedgo/slices

Step 2: `cd` to the source code directory.

    cd $GOPATH/src/github.com/appliedgo/slices

Step 3. Run the code.

    go run slices.go


## Takeaways

### Remember that append() may or may not allocate a new slice.

In many cases, this is absolutely ok, as a single slice does not care if it gets relocated. Only when two or more slices interact can the behavior of `append()` lead to unexpected results.

**To avoid ambiguous results, use the correct techniques to ensure the desired outcome:**

* If you absolutely want to avoid allocation and copying, use a large underlying array, re-slice your slice, and strictly avoid `append()`.

* If you absolutely need copy semantics, create a destination slice of sufficient size, and use the built-in `copy()` function.


### Never assume exclusive ownership of a slice that you did not create.

Any function that returns a slice may return a *shared* slice. `bytes.Split` splits a slice in place, and `append()` returns a slice header that might still point to the array of the slice it received.

Hence if you receive a slice from a function, keep in mind that other code may still modify that slice. Again, `copy()` is your friend.


### Read the docs.

Functions that create or return copies of slices usually mention this in their documentation:

*"...returns a copy of..."*

*"...returns a new byte slice..."*

Whereas the documentation of in-place operations often talks about *"slicing"* or *"subslices"*, which indicates that no allocation takes place and the returned data may still be accessed by other code.


**Happy coding!**

- - -

Update 2017-08-05:

* Fixed fig. 3 to correctly show that len(t) == 3
* Added new case: Split then append

- - -

*/

Summary

1 append() may or may not allocate a new underlying array

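In many cases this does not matter, because a single, independent slice does not care whether it has been relocated. Only when two or more slices interact can the behavior of append() lead to unexpected results.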

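To avoid ambiguous results, use the appropriate technique to ensure the outcome you want:

  1. If you want to avoid reallocation and copying, use a sufficiently large underlying array, re-slice your slice, and strictly avoid append().

  2. If you definitely need copy semantics, create a destination slice of sufficient size and use the built-in copy() function to copy the slice.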
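The first technique can be sketched as a helper that stores a value by re-slicing and refuses to grow instead of calling append(). The name push and its signature are assumptions made for this example, not an API from the article.

```go
package main

import "fmt"

// push (hypothetical) stores v in the next free slot of buf by re-slicing.
// It never allocates: when buf is full, it returns buf unchanged and false.
func push(buf []int, v int) ([]int, bool) {
	if len(buf) == cap(buf) {
		return buf, false
	}
	buf = buf[:len(buf)+1] // extend the window by one element
	buf[len(buf)-1] = v
	return buf, true
}

func main() {
	var backing [4]int // the one and only array; sized up front
	buf := backing[:0] // empty window at the start of the array

	for v := 1; v <= 5; v++ {
		var ok bool
		buf, ok = push(buf, v)
		fmt.Println(v, ok) // ok is false for v == 5
	}
	fmt.Println(backing) // [1 2 3 4]
}
```

Because the data never leaves the backing array, every other slice on that array keeps seeing the current content.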

2 Never assume exclusive ownership of a slice that you did not create

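Any function that returns a slice may return a shared slice. bytes.Split() splits the original slice in place, and the slice returned by append() may still point to the array of the slice it was given. So if you receive a slice from a function, keep in mind that other code may still modify it. Use copy() where appropriate.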

3 Read the docs

Functions that create or return copies of slices usually say so in their documentation:
"...returns a copy of..."
"...returns a new byte slice..."

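The documentation of in-place operations, by contrast, often talks about "slicing" or "subslices", which indicates that no allocation takes place and that the returned data may still be accessed by other code.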

Happy coding!
