Lesson 35: A Thorough Walkthrough of the TimSort Source Code Used by Sort Shuffle in Spark 2.1.X

TimSort in the Sort Shuffle of Spark 2.1.X:

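         1. Since Spark 1.6.x, the default core shuffle has been Sort Shuffle. You may have the impression that Sort Shuffle sorts the data, but that impression is misleading. For example, if you write the simplest WordCount program, why is the output not sorted by default? So, after the move from Hash Shuffle to Sort Shuffle, how exactly is the sorting implemented?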

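         2. TimSort is a sorting algorithm that balances several concerns fairly well. When the data to be sorted consists of many separate chunks that are each already in order, TimSort performs very well. It is therefore worth studying in detail how TimSort is implemented.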

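         To recap, we traced the code from Sorter.scala into TimSort: ExternalSorter has to sort, and by default it sorts by partition ID. Sorting by partition ID does not mean sorting the data itself. Sorter.scala is where TimSort is used. Reading the TimSort source, you will find that it looks somewhat like MergeSort but differs substantially in practice, and we can get a first feel for how it sorts. MergeSort splits the data into many small pieces and then merges the small pieces step by step into larger ones. TimSort can be regarded as an improved MergeSort.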
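The point that sorting by partition ID does not sort the records themselves can be illustrated with a small sketch. This is not Spark code: the class name, the "partitionId:key" string encoding and the helper methods are assumptions made up for this illustration. It relies only on List.sort, which is documented to be stable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative only: records are encoded as "partitionId:key" strings.
class PartitionSortDemo {

    // Hypothetical helper: extracts the partition ID from "partitionId:key".
    static int partitionOf(String record) {
        return Integer.parseInt(record.substring(0, record.indexOf(':')));
    }

    // Sorts by partition ID only; the comparator never looks at the key.
    static List<String> sortByPartitionOnly(List<String> records) {
        List<String> out = new ArrayList<>(records);
        out.sort(Comparator.comparingInt((String r) -> partitionOf(r)));
        return out;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
            "1:spark", "0:hello", "1:hadoop", "0:world", "0:apple");
        // Records end up grouped by partition, but the keys inside each
        // partition keep their arrival order, so they are not sorted.
        System.out.println(sortByPartitionOnly(records));
    }
}
```

Inside partition 0 the keys stay in the order hello, world, apple, which is why a plain WordCount does not come out sorted by word.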

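TimSort optimizes MergeSort into a stable, adaptive, iterative sort. TimSort itself runs locally, in memory, inside each task; it is the shuffle around it that is distributed. Within that shuffle it gives a considerable efficiency gain.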

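         MergeSort starts from pieces of length 1 and forms its merge units mechanically as it merges. TimSort instead looks for segments that are already consecutively ascending, and reverses a segment (a run) that is strictly descending; there is a dedicated routine for finding runs. A run can be thought of as a block of data that is already in order. If a natural run is shorter than the minimum run length (minRun), TimSort extends it with binary insertion sort, which is a local optimization. In MergeSort the merge order is fixed, whereas in TimSort it is adaptive: conditions on the run lengths decide which runs are merged and when. TimSort is used in many places, for example in Android.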

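         TimSort.java is in the org.apache.spark.util.collection package, which also contains a test class, TestTimSort, that for example creates test arrays. A tip for reading TimSort.java: start with the sort method, then connect the other members and methods to it. The sort method in TimSort.java is as follows: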

    public void sort(Buffer a, int lo, int hi, Comparator<? super K> c) {
      assert c != null;

      int nRemaining  = hi - lo;
      if (nRemaining < 2)
        return;  // Arrays of size 0 and 1 are always sorted

      // If array is small, do a "mini-TimSort" with no merges
      if (nRemaining < MIN_MERGE) {
        int initRunLen = countRunAndMakeAscending(a, lo, hi, c);
        binarySort(a, lo, hi, lo + initRunLen, c);
        return;
      }

      /**
       * March over the array once, left to right, finding natural runs,
       * extending short natural runs to minRun elements, and merging runs
       * to maintain stack invariant.
       */
      SortState sortState = new SortState(a, c, hi - lo);
      int minRun = minRunLength(nRemaining);
      do {
        // Identify next run
        int runLen = countRunAndMakeAscending(a, lo, hi, c);

        // If run is short, extend to min(minRun, nRemaining)
        if (runLen < minRun) {
          int force = nRemaining <= minRun ? nRemaining : minRun;
          binarySort(a, lo, lo + force, lo + runLen, c);
          runLen = force;
        }

        // Push run onto pending-run stack, and maybe merge
        sortState.pushRun(lo, runLen);
        sortState.mergeCollapse();

        // Advance to find next run
        lo += runLen;
        nRemaining -= runLen;
      } while (nRemaining != 0);

      // Merge all remaining runs to complete sort
      assert lo == hi;
      sortState.mergeForceCollapse();
      assert sortState.stackSize == 1;
    }

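         In the TimSort.java code:

•  nRemaining is the length of the part of the array that has not been sorted yet. This is purely an array-level view: TimSort works on an in-memory buffer inside a single task.

•  If nRemaining is less than 2, the data is already sorted and the method returns immediately. Arrays of size 0 or 1 are always sorted, so nothing needs to be done.

•  If nRemaining is less than MIN_MERGE, a "mini-TimSort" is performed, that is, no merging is used. countRunAndMakeAscending computes the length of the initial ascending run, and binarySort (binary insertion sort, a basic sorting method) then sorts the rest.

•  Next comes SortState. SortState builds the stack of pending runs and maintains the state of the ongoing sort.

•  minRunLength returns the minimum run length.

•  The do-while loop first gets the length of the next ascending run. If runLen is less than minRun, binarySort extends the run by binary insertion. sortState.pushRun(lo, runLen) pushes the run that was just identified onto the stack. sortState.mergeCollapse() may merge runs, depending on conditions it checks internally. lo += runLen advances to the start of the next run.

•  After the loop ends, all remaining runs are merged to complete the sort.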

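    From the point of view of the source implementation, the first key line in TimSort.java is int initRunLen = countRunAndMakeAscending(a, lo, hi, c);. Looking at countRunAndMakeAscending: it first finds the end of the run in a while loop, reverses the run if it is descending, and finally returns the length of the run.

countRunAndMakeAscending returns the length of the run beginning at the specified position and reverses the run if it is descending. A run is the longest ascending sequence with a[lo] <= a[lo + 1] <= a[lo + 2] <= ... or the longest descending sequence with a[lo] > a[lo + 1] > a[lo + 2] > ... For a stable merge sort the strict definition of "descending" is necessary, so that a descending sequence can safely be reversed without violating stability.

•  @param a: the array in which a run is to be counted and possibly reversed.

•  @param lo: index of the first element in the run.

•  @param hi: index after the last element that may be contained in the run. Requires {@code lo < hi}.

•  @param c: the comparator used for the sort.

•  @return: the length of the run beginning at the specified position.

The code of countRunAndMakeAscending is as follows: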

    private int countRunAndMakeAscending(Buffer a, int lo, int hi, Comparator<? super K> c) {
      assert lo < hi;
      int runHi = lo + 1;
      if (runHi == hi)
        return 1;

      K key0 = s.newKey();
      K key1 = s.newKey();

      // Find end of run, and reverse range if descending
      if (c.compare(s.getKey(a, runHi++, key0), s.getKey(a, lo, key1)) < 0) { // Descending
        while (runHi < hi && c.compare(s.getKey(a, runHi, key0), s.getKey(a, runHi - 1, key1)) < 0)
          runHi++;
        reverseRange(a, lo, runHi);
      } else {                              // Ascending
        while (runHi < hi && c.compare(s.getKey(a, runHi, key0), s.getKey(a, runHi - 1, key1)) >= 0)
          runHi++;
      }

      return runHi - lo;
    }
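To make the run detection concrete, here is a minimal sketch of the same logic on a plain int[] instead of Spark's Buffer and SortDataFormat abstraction. The class name RunDemo is made up for this illustration, and comparisons use the natural int order rather than a Comparator.

```java
import java.util.Arrays;

// Simplified sketch on int[]; not the Spark implementation.
class RunDemo {

    // Returns the length of the run starting at lo; a strictly descending
    // run is reversed in place so that it becomes ascending.
    static int countRunAndMakeAscending(int[] a, int lo, int hi) {
        assert lo < hi;
        int runHi = lo + 1;
        if (runHi == hi)
            return 1;
        if (a[runHi++] < a[lo]) {                       // descending
            while (runHi < hi && a[runHi] < a[runHi - 1])
                runHi++;
            reverseRange(a, lo, runHi);
        } else {                                        // ascending
            while (runHi < hi && a[runHi] >= a[runHi - 1])
                runHi++;
        }
        return runHi - lo;
    }

    // Reverses a[lo, hi) in place.
    static void reverseRange(int[] a, int lo, int hi) {
        hi--;
        while (lo < hi) {
            int t = a[lo];
            a[lo++] = a[hi];
            a[hi--] = t;
        }
    }

    public static void main(String[] args) {
        int[] a = {5, 4, 3, 8, 1};
        int len = countRunAndMakeAscending(a, 0, a.length);
        System.out.println(len + " " + Arrays.toString(a)); // 3 [3, 4, 5, 8, 1]
    }
}
```

The descending prefix 5, 4, 3 is detected as a run of length 3 and reversed. Because the descending test is strict, equal elements never end up inside a reversed run, which is what keeps the sort stable.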

 


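Back in the sort method of TimSort.java, let us look at binarySort. It sorts the specified portion of the specified array using binary insertion sort, which is the best method for sorting a small number of elements. It requires O(n log n) comparisons, but O(n^2) data movement in the worst case. If the initial part of the specified range is already sorted, the method can take advantage of it: it assumes that the elements from index {@code lo}, inclusive, to {@code start}, exclusive, are already sorted.

•  @param a: the array in which a range is to be sorted.

•  @param lo: index of the first element in the range to be sorted.

•  @param hi: index after the last element in the range to be sorted.

•  @param start: index of the first element in the range that is not already known to be sorted ({@code lo <= start <= hi}).

•  @param c: the comparator used for the sort.

The code of binarySort in TimSort.java is as follows: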

    private void binarySort(Buffer a, int lo, int hi, int start, Comparator<? super K> c) {
      assert lo <= start && start <= hi;
      if (start == lo)
        start++;

      K key0 = s.newKey();
      K key1 = s.newKey();

      Buffer pivotStore = s.allocate(1);
      for ( ; start < hi; start++) {
        s.copyElement(a, start, pivotStore, 0);
        K pivot = s.getKey(pivotStore, 0, key0);

        // Set left (and right) to the index where a[start] (pivot) belongs
        int left = lo;
        int right = start;
        assert left <= right;
        /*
         * Invariants:
         *   pivot >= all in [lo, left).
         *   pivot <  all in [right, start).
         */
        while (left < right) {
          int mid = (left + right) >>> 1;
          if (c.compare(pivot, s.getKey(a, mid, key1)) < 0)
            right = mid;
          else
            left = mid + 1;
        }
        assert left == right;

        /*
         * The invariants still hold: pivot >= all in [lo, left) and
         * pivot < all in [left, start), so pivot belongs at left.  Note
         * that if there are elements equal to pivot, left points to the
         * first slot after them -- that's why this sort is stable.
         * Slide elements over to make room for pivot.
         */
        int n = start - left;  // The number of elements to move
        // Switch is just an optimization for arraycopy in default case
        switch (n) {
          case 2:  s.copyElement(a, left + 1, a, left + 2);
          case 1:  s.copyElement(a, left, a, left + 1);
            break;
          default: s.copyRange(a, left, a, left + 1, n);
        }
        s.copyElement(pivotStore, 0, a, left);
      }
    }
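The same binary insertion idea can be sketched on a plain int[]. This is a simplified illustration, not the Spark code: the class name is made up, the natural int order replaces the Comparator, and System.arraycopy stands in for the copyElement/copyRange calls.

```java
import java.util.Arrays;

// Simplified sketch on int[]; not the Spark implementation.
class BinaryInsertionSortDemo {

    // Sorts a[lo, hi), assuming a[lo, start) is already sorted.
    static void binarySort(int[] a, int lo, int hi, int start) {
        assert lo <= start && start <= hi;
        if (start == lo)
            start++;
        for (; start < hi; start++) {
            int pivot = a[start];
            // Binary search for the insertion point of pivot in a[lo, start).
            int left = lo;
            int right = start;
            while (left < right) {
                int mid = (left + right) >>> 1;
                if (pivot < a[mid])
                    right = mid;
                else
                    left = mid + 1;   // equal elements stay to the left: stable
            }
            // Slide a[left, start) one slot to the right and drop pivot in.
            System.arraycopy(a, left, a, left + 1, start - left);
            a[left] = pivot;
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 4, 7, 3, 9, 2};   // a[0, 3) is already sorted
        binarySort(a, 0, a.length, 3);
        System.out.println(Arrays.toString(a)); // [1, 2, 3, 4, 7, 9]
    }
}
```

The search costs O(log n) comparisons per element, while the slide is what makes the worst-case data movement O(n^2).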

 

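Back in the sort method of TimSort.java, there is a call to minRunLength, which returns the minimum run length. Inside minRunLength is a while loop that runs as long as n is greater than or equal to MIN_MERGE, which is 32 (2 to the 5th power), and performs simple bit shifts. The code of minRunLength is: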

    private int minRunLength(int n) {
      assert n >= 0;
      int r = 0;      // Becomes 1 if any 1 bits are shifted off
      while (n >= MIN_MERGE) {
        r |= (n & 1);
        n >>= 1;
      }
      return n + r;
    }
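A few worked values make the bit shifting easier to follow. The sketch below copies the loop into a standalone class (the class name is made up) and assumes MIN_MERGE = 32, as stated above.

```java
// Standalone copy of the minRunLength loop for experimentation.
class MinRunDemo {
    static final int MIN_MERGE = 32;   // value taken from the text above

    static int minRunLength(int n) {
        assert n >= 0;
        int r = 0;                     // becomes 1 if any 1 bit is shifted off
        while (n >= MIN_MERGE) {
            r |= (n & 1);
            n >>= 1;
        }
        return n + r;
    }

    public static void main(String[] args) {
        int[] sizes = {31, 32, 64, 65, 100, 1000};
        for (int n : sizes)
            System.out.println(n + " -> " + minRunLength(n));
        // 31 -> 31, 32 -> 16, 64 -> 16, 65 -> 17, 100 -> 25, 1000 -> 32
    }
}
```

For n below 32 the value is returned unchanged. For an exact power of two such as 64 the result is 16, and in general the result is chosen so that n divided by minRun is a power of two or slightly less than one, which keeps the later merges balanced.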

 

Back in the sort method of TimSort.java, sortState.pushRun(lo, runLen) pushes the run onto the stack of pending runs:

    private void pushRun(int runBase, int runLen) {
      this.runBase[stackSize] = runBase;
      this.runLen[stackSize] = runLen;
      stackSize++;
    }

 

Back in the sort method of TimSort.java, the next line is a key one: sortState.mergeCollapse();. The source of mergeCollapse is:

    private void mergeCollapse() {
      while (stackSize > 1) {
        int n = stackSize - 2;
        if ( (n >= 1 && runLen[n-1] <= runLen[n] + runLen[n+1])
          || (n >= 2 && runLen[n-2] <= runLen[n] + runLen[n-1])) {
          if (runLen[n - 1] < runLen[n + 1])
            n--;
        } else if (runLen[n] > runLen[n + 1]) {
          break; // Invariant is established
        }
        mergeAt(n);
      }
    }
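The effect of mergeCollapse is easiest to see by tracking run lengths only. The sketch below is a simulation, not Spark code: the class name and the pushAll helper are made up, a List<Integer> stands in for the runLen array, and "merging" two runs simply adds their lengths.

```java
import java.util.ArrayList;
import java.util.List;

// Simulation of the pending-run stack using run lengths only.
class RunStackDemo {

    // Pushes each run length in turn and applies the collapse rule after each push.
    static List<Integer> pushAll(int... lens) {
        List<Integer> runLen = new ArrayList<>();
        for (int len : lens) {
            runLen.add(len);
            mergeCollapse(runLen);
        }
        return runLen;
    }

    // Same decision logic as mergeCollapse above, on a list of lengths.
    static void mergeCollapse(List<Integer> runLen) {
        while (runLen.size() > 1) {
            int n = runLen.size() - 2;
            if ((n >= 1 && runLen.get(n - 1) <= runLen.get(n) + runLen.get(n + 1))
                || (n >= 2 && runLen.get(n - 2) <= runLen.get(n) + runLen.get(n - 1))) {
                if (runLen.get(n - 1) < runLen.get(n + 1))
                    n--;
            } else if (runLen.get(n) > runLen.get(n + 1)) {
                break; // invariant is established
            }
            mergeAt(runLen, n);
        }
    }

    // Merging runs i and i+1 only adds their lengths in this simulation.
    static void mergeAt(List<Integer> runLen, int i) {
        runLen.set(i, runLen.get(i) + runLen.get(i + 1));
        runLen.remove(i + 1);
    }

    public static void main(String[] args) {
        System.out.println(pushAll(100, 40, 20)); // [100, 40, 20]
        System.out.println(pushAll(30, 20, 10));  // [60]
        System.out.println(pushAll(50, 20, 60));  // [70, 60]
    }
}
```

With lengths 100, 40, 20 each run is longer than the sum of the two above it, so nothing is merged. With 30, 20, 10 the invariant fails (30 <= 20 + 10) and the stack collapses into a single run of 60. With 50, 20, 60 the run below the middle is shorter than the top run, so the two lower runs are merged first.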

 

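OpenJDK's implementation of mergeCollapse reportedly had a bug in which the ordering of the runs pushed onto the stack could go wrong. Spark's version has been thoroughly tested, and its mergeCollapse does not have this bug. The key call inside it is mergeAt. In mergeAt, runLen[i] records the combined length of the two runs being merged; if i is the third entry from the top of the stack, the top run is slid down one slot. gallopRight finds where the first element of run2 belongs in run1; the elements of run1 before that position can be ignored because they are already in place. gallopLeft then finds where the last element of run1 belongs in run2; the elements of run2 after that position can likewise be ignored. The code of mergeAt is: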

    private void mergeAt(int i) {
      assert stackSize >= 2;
      assert i >= 0;
      assert i == stackSize - 2 || i == stackSize - 3;

      int base1 = runBase[i];
      int len1 = runLen[i];
      int base2 = runBase[i + 1];
      int len2 = runLen[i + 1];
      assert len1 > 0 && len2 > 0;
      assert base1 + len1 == base2;

      /*
       * Record the length of the combined runs; if i is the 3rd-last
       * run now, also slide over the last run (which isn't involved
       * in this merge).  The current run (i+1) goes away in any case.
       */
      runLen[i] = len1 + len2;
      if (i == stackSize - 3) {
        runBase[i + 1] = runBase[i + 2];
        runLen[i + 1] = runLen[i + 2];
      }
      stackSize--;

      K key0 = s.newKey();

      /*
       * Find where the first element of run2 goes in run1. Prior elements
       * in run1 can be ignored (because they're already in place).
       */
      int k = gallopRight(s.getKey(a, base2, key0), a, base1, len1, 0, c);
      assert k >= 0;
      base1 += k;
      len1 -= k;
      if (len1 == 0)
        return;

      /*
       * Find where the last element of run1 goes in run2. Subsequent elements
       * in run2 can be ignored (because they're already in place).
       */
      len2 = gallopLeft(s.getKey(a, base1 + len1 - 1, key0), a, base2, len2, len2 - 1, c);
      assert len2 >= 0;
      if (len2 == 0)
        return;

      // Merge remaining runs, using tmp array with min(len1, len2) elements
      if (len1 <= len2)
        mergeLo(base1, len1, base2, len2);
      else
        mergeHi(base1, len1, base2, len2);
    }

 

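The gallopRight method used here is like gallopLeft, except that if the range contains elements equal to key, gallopRight returns the index after the rightmost equal element.

•  @param key: the key whose insertion point to search for.

•  @param a: the array in which to search.

•  @param base: the index of the first element in the range.

•  @param len: the length of the range; must be > 0.

•  @param hint: the index at which to begin the search, 0 <= hint < n. The closer hint is to the result, the faster this method will run.

•  @param c: the comparator used to order the range, and to search.

•  @return: the int k, 0 <= k <= n, such that a[b + k - 1] <= key < a[b + k].

 The code of gallopRight is as follows: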

    private int gallopRight(K key, Buffer a, int base, int len, int hint, Comparator<? super K> c) {
      assert len > 0 && hint >= 0 && hint < len;

      int ofs = 1;
      int lastOfs = 0;
      K key1 = s.newKey();

      if (c.compare(key, s.getKey(a, base + hint, key1)) < 0) {
        // Gallop left until a[b+hint - ofs] <= key < a[b+hint - lastOfs]
        int maxOfs = hint + 1;
        while (ofs < maxOfs && c.compare(key, s.getKey(a, base + hint - ofs, key1)) < 0) {
          lastOfs = ofs;
          ofs = (ofs << 1) + 1;
          if (ofs <= 0)   // int overflow
            ofs = maxOfs;
        }
        if (ofs > maxOfs)
          ofs = maxOfs;

        // Make offsets relative to b
        int tmp = lastOfs;
        lastOfs = hint - ofs;
        ofs = hint - tmp;
      } else { // a[b + hint] <= key
        // Gallop right until a[b+hint + lastOfs] <= key < a[b+hint + ofs]
        int maxOfs = len - hint;
        while (ofs < maxOfs && c.compare(key, s.getKey(a, base + hint + ofs, key1)) >= 0) {
          lastOfs = ofs;
          ofs = (ofs << 1) + 1;
          if (ofs <= 0)   // int overflow
            ofs = maxOfs;
        }
        if (ofs > maxOfs)
          ofs = maxOfs;

        // Make offsets relative to b
        lastOfs += hint;
        ofs += hint;
      }
      assert -1 <= lastOfs && lastOfs < ofs && ofs <= len;

      /*
       * Now a[b + lastOfs] <= key < a[b + ofs], so key belongs somewhere to
       * the right of lastOfs but no farther right than ofs.  Do a binary
       * search, with invariant a[b + lastOfs - 1] <= key < a[b + ofs].
       */
      lastOfs++;
      while (lastOfs < ofs) {
        int m = lastOfs + ((ofs - lastOfs) >>> 1);

        if (c.compare(key, s.getKey(a, base + m, key1)) < 0)
          ofs = m;          // key < a[b + m]
        else
          lastOfs = m + 1;  // a[b + m] <= key
      }
      assert lastOfs == ofs;    // so a[b + ofs - 1] <= key < a[b + ofs]
      return ofs;
    }

 

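The code above also uses gallopLeft, which locates the position at which to insert the specified key into the specified sorted range; if the range contains elements equal to key, it returns the index of the leftmost equal element.

•  @param key: the key whose insertion point to search for.

•  @param a: the array in which to search.

•  @param base: the index of the first element in the range.

•  @param len: the length of the range; must be > 0.

•  @param hint: the index at which to begin the search, 0 <= hint < n. The closer hint is to the result, the faster this method will run.

•  @param c: the comparator used to order the range, and to search.

•  @return: the int k, 0 <= k <= n, such that a[b + k - 1] < key <= a[b + k], pretending that a[b - 1] is minus infinity and a[b + n] is infinity. In other words, key belongs at index b + k: the first k elements of a should precede key, and the last n - k should follow it.

The code of gallopLeft is as follows: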

    private int gallopLeft(K key, Buffer a, int base, int len, int hint, Comparator<? super K> c) {
      assert len > 0 && hint >= 0 && hint < len;
      int lastOfs = 0;
      int ofs = 1;
      K key0 = s.newKey();

      if (c.compare(key, s.getKey(a, base + hint, key0)) > 0) {
        // Gallop right until a[base+hint+lastOfs] < key <= a[base+hint+ofs]
        int maxOfs = len - hint;
        while (ofs < maxOfs && c.compare(key, s.getKey(a, base + hint + ofs, key0)) > 0) {
          lastOfs = ofs;
          ofs = (ofs << 1) + 1;
          if (ofs <= 0)   // int overflow
            ofs = maxOfs;
        }
        if (ofs > maxOfs)
          ofs = maxOfs;

        // Make offsets relative to base
        lastOfs += hint;
        ofs += hint;
      } else { // key <= a[base + hint]
        // Gallop left until a[base+hint-ofs] < key <= a[base+hint-lastOfs]
        final int maxOfs = hint + 1;
        while (ofs < maxOfs && c.compare(key, s.getKey(a, base + hint - ofs, key0)) <= 0) {
          lastOfs = ofs;
          ofs = (ofs << 1) + 1;
          if (ofs <= 0)   // int overflow
            ofs = maxOfs;
        }
        if (ofs > maxOfs)
          ofs = maxOfs;

        // Make offsets relative to base
        int tmp = lastOfs;
        lastOfs = hint - ofs;
        ofs = hint - tmp;
      }
      assert -1 <= lastOfs && lastOfs < ofs && ofs <= len;

      /*
       * Now a[base+lastOfs] < key <= a[base+ofs], so key belongs somewhere
       * to the right of lastOfs but no farther right than ofs.  Do a binary
       * search, with invariant a[base + lastOfs - 1] < key <= a[base + ofs].
       */
      lastOfs++;
      while (lastOfs < ofs) {
        int m = lastOfs + ((ofs - lastOfs) >>> 1);

        if (c.compare(key, s.getKey(a, base + m, key0)) > 0)
          lastOfs = m + 1;  // a[base + m] < key
        else
          ofs = m;          // key <= a[base + m]
      }
      assert lastOfs == ofs;    // so a[base + ofs - 1] < key <= a[base + ofs]
      return ofs;
    }
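Galloping is an exponential search followed by a binary search. The sketch below shows both variants on a plain int[] and then uses them the way mergeAt does, to trim the parts of two adjacent runs that are already in place. It is a simplification rather than the Spark code: the class name is made up, the natural int order replaces the Comparator, and the hint is fixed at 0, so only the "gallop right" branch is needed. A different hint changes the speed of the search, not its result.

```java
// Simplified galloping on int[] with hint fixed at 0; not the Spark implementation.
class GallopDemo {

    // Returns k such that a[base + k - 1] <= key < a[base + k].
    static int gallopRight(int key, int[] a, int base, int len) {
        if (key < a[base])
            return 0;
        int lastOfs = 0;
        int ofs = 1;
        while (ofs < len && key >= a[base + ofs]) {   // offsets 1, 3, 7, 15, ...
            lastOfs = ofs;
            ofs = (ofs << 1) + 1;
            if (ofs <= 0)                             // int overflow
                ofs = len;
        }
        if (ofs > len)
            ofs = len;
        lastOfs++;
        while (lastOfs < ofs) {                       // binary search in the gap
            int m = lastOfs + ((ofs - lastOfs) >>> 1);
            if (key < a[base + m])
                ofs = m;
            else
                lastOfs = m + 1;
        }
        return ofs;
    }

    // Returns k such that a[base + k - 1] < key <= a[base + k].
    static int gallopLeft(int key, int[] a, int base, int len) {
        if (key <= a[base])
            return 0;
        int lastOfs = 0;
        int ofs = 1;
        while (ofs < len && key > a[base + ofs]) {
            lastOfs = ofs;
            ofs = (ofs << 1) + 1;
            if (ofs <= 0)                             // int overflow
                ofs = len;
        }
        if (ofs > len)
            ofs = len;
        lastOfs++;
        while (lastOfs < ofs) {
            int m = lastOfs + ((ofs - lastOfs) >>> 1);
            if (key > a[base + m])
                lastOfs = m + 1;
            else
                ofs = m;
        }
        return ofs;
    }

    public static void main(String[] args) {
        // run1 = {1, 2, 6, 8} at base 0, run2 = {5, 7, 9, 10} at base 4
        int[] a = {1, 2, 6, 8, 5, 7, 9, 10};
        int k = gallopRight(a[4], a, 0, 4);            // run1 elements already in place
        int base1 = k, len1 = 4 - k;
        int len2 = gallopLeft(a[base1 + len1 - 1], a, 4, 4);
        System.out.println(k + " " + len2);            // 2 2
    }
}
```

In this example the first two elements of run1 (1, 2) and the last two of run2 (9, 10) are already in their final positions, so only {6, 8} and {5, 7} still have to be merged.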

 

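Back in the mergeAt method of TimSort.java, the next key call is mergeLo(base1, len1, base2, len2);. mergeLo merges two adjacent runs in place, in a stable fashion. The first element of the first run must be greater than the first element of the second run (a[base1] > a[base2]), and the last element of the first run (a[base1 + len1-1]) must be greater than all elements of the second run. For performance, this method should be called only when len1 <= len2; its twin, mergeHi, should be called when len1 >= len2. (Either method may be called if len1 == len2.)

•  @param base1: index of the first element in the first run to be merged.

•  @param len1: length of the first run to be merged (must be > 0).

•  @param base2: index of the first element in the second run to be merged (must be aBase + aLen, that is, base1 + len1).

•  @param len2: length of the second run to be merged (must be > 0).

The code of mergeLo is as follows: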

    private void mergeLo(int base1, int len1, int base2, int len2) {
      assert len1 > 0 && len2 > 0 && base1 + len1 == base2;

      // Copy first run into temp array
      Buffer a = this.a; // For performance
      Buffer tmp = ensureCapacity(len1);
      s.copyRange(a, base1, tmp, 0, len1);

      int cursor1 = 0;       // Indexes into tmp array
      int cursor2 = base2;   // Indexes into a
      int dest = base1;      // Indexes into a

      // Move first element of second run and deal with degenerate cases
      s.copyElement(a, cursor2++, a, dest++);
      if (--len2 == 0) {
        s.copyRange(tmp, cursor1, a, dest, len1);
        return;
      }
      if (len1 == 1) {
        s.copyRange(a, cursor2, a, dest, len2);
        s.copyElement(tmp, cursor1, a, dest + len2); // Last elt of run 1 to end of merge
        return;
      }

      K key0 = s.newKey();
      K key1 = s.newKey();

      Comparator<? super K> c = this.c;  // Use local variable for performance
      int minGallop = this.minGallop;    //  "    "       "     "      "
      outer:
      while (true) {
        int count1 = 0; // Number of times in a row that first run won
        int count2 = 0; // Number of times in a row that second run won

        /*
         * Do the straightforward thing until (if ever) one run starts
         * winning consistently.
         */
        do {
          assert len1 > 1 && len2 > 0;
          if (c.compare(s.getKey(a, cursor2, key0), s.getKey(tmp, cursor1, key1)) < 0) {
            s.copyElement(a, cursor2++, a, dest++);
            count2++;
            count1 = 0;
            if (--len2 == 0)
              break outer;
          } else {
            s.copyElement(tmp, cursor1++, a, dest++);
            count1++;
            count2 = 0;
            if (--len1 == 1)
              break outer;
          }
        } while ((count1 | count2) < minGallop);

        /*
         * One run is winning so consistently that galloping may be a
         * huge win. So try that, and continue galloping until (if ever)
         * neither run appears to be winning consistently anymore.
         */
        do {
          assert len1 > 1 && len2 > 0;
          count1 = gallopRight(s.getKey(a, cursor2, key0), tmp, cursor1, len1, 0, c);
          if (count1 != 0) {
            s.copyRange(tmp, cursor1, a, dest, count1);
            dest += count1;
            cursor1 += count1;
            len1 -= count1;
            if (len1 <= 1) // len1 == 1 || len1 == 0
              break outer;
          }
          s.copyElement(a, cursor2++, a, dest++);
          if (--len2 == 0)
            break outer;

          count2 = gallopLeft(s.getKey(tmp, cursor1, key0), a, cursor2, len2, 0, c);
          if (count2 != 0) {
            s.copyRange(a, cursor2, a, dest, count2);
            dest += count2;
            cursor2 += count2;
            len2 -= count2;
            if (len2 == 0)
              break outer;
          }
          s.copyElement(tmp, cursor1++, a, dest++);
          if (--len1 == 1)
            break outer;
          minGallop--;
        } while (count1 >= MIN_GALLOP | count2 >= MIN_GALLOP);
        if (minGallop < 0)
          minGallop = 0;
        minGallop += 2;  // Penalize for leaving gallop mode
      }  // End of "outer" loop
      this.minGallop = minGallop < 1 ? 1 : minGallop;  // Write back to field

      if (len1 == 1) {
        assert len2 > 0;
        s.copyRange(a, cursor2, a, dest, len2);
        s.copyElement(tmp, cursor1, a, dest + len2); // Last elt of run 1 to end of merge
      } else if (len1 == 0) {
        throw new IllegalArgumentException(
            "Comparison method violates its general contract!");
      } else {
        assert len2 == 0;
        assert len1 > 1;
        s.copyRange(tmp, cursor1, a, dest, len1);
      }
    }
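Stripped of gallop mode and the degenerate-case shortcuts, the core of mergeLo is a merge that copies only the first (shorter) run into a temporary array and writes the result back in place. The sketch below shows that core on a plain int[]; it is a simplified illustration with a made-up class name, not the Spark code, and it deliberately leaves out galloping and the minGallop bookkeeping.

```java
import java.util.Arrays;

// Simplified in-place merge of two adjacent sorted runs; no gallop mode.
class MergeLoDemo {

    // Merges a[base1, base1 + len1) with a[base2, base2 + len2), base2 == base1 + len1.
    static void mergeLo(int[] a, int base1, int len1, int base2, int len2) {
        assert len1 > 0 && len2 > 0 && base1 + len1 == base2;
        int[] tmp = Arrays.copyOfRange(a, base1, base1 + len1); // copy of run 1
        int cursor1 = 0;            // indexes into tmp
        int cursor2 = base2;        // indexes into a (run 2)
        int dest = base1;           // indexes into a (output)
        int end2 = base2 + len2;
        while (cursor1 < len1 && cursor2 < end2) {
            // Strict "<" means that on ties the element of run 1 goes first: stable.
            if (a[cursor2] < tmp[cursor1])
                a[dest++] = a[cursor2++];
            else
                a[dest++] = tmp[cursor1++];
        }
        // Whatever is left of run 1 goes to the end; leftovers of run 2 are already in place.
        while (cursor1 < len1)
            a[dest++] = tmp[cursor1++];
    }

    public static void main(String[] args) {
        int[] a = {2, 6, 8, 1, 5, 7, 9};
        mergeLo(a, 0, 3, 3, 4);
        System.out.println(Arrays.toString(a)); // [1, 2, 5, 6, 7, 8, 9]
    }
}
```

Writing to dest never overwrites an unread element of run 2, because dest can never overtake cursor2 while elements of run 1 are still waiting in tmp. Copying the shorter run is why mergeLo is preferred when len1 <= len2: the temporary array needs only min(len1, len2) slots.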

 

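Back in the sort method of TimSort.java, there is one more key line: sortState.mergeForceCollapse();. It merges all runs on the stack until only one remains. This method is called once, to complete the sort.

The mergeForceCollapse method is as follows: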

    private void mergeForceCollapse() {
      while (stackSize > 1) {
        int n = stackSize - 2;
        if (n > 0 && runLen[n - 1] < runLen[n + 1])
          n--;
        mergeAt(n);
      }
    }

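Summary: TimSort first identifies consecutively ascending segments (runs) and merges those. When a run is shorter than minRun, it is extended using binary insertion sort. Compared with MergeSort, where the merge order is fixed in advance, TimSort is more flexible. If the third run from the top of the stack is shorter than the run on top, the second and third runs are merged first, which determines the segments to be merged. If the head of run1 and the tail of run2 contain parts that do not need merging, TimSort trims them off, using galloping followed by binary search to find where the part that actually needs merging starts and ends. When a run has length 1, some additional optimizations apply.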


This lesson is fairly difficult; a general understanding of TimSort is enough.

