Basic Data Structures and Algorithms in the Linux kernel

http://cstheory.stackexchange.com/questions/19759/core-algorithms-deployed/19773#19773


Algorithms that are the main driver behind a system are, in my opinion, easier to find in non-algorithms courses for the same reason theorems with immediate applications are easier to find in applied mathematics rather than pure mathematics courses. It is rare for a practical problem to have the exact structure of the abstract problem in a lecture. To be argumentative, I see no reason why fashionable algorithms course material such as Strassen's multiplication, the AKS primality test, or the Moser-Tardos algorithm is relevant for low-level practical problems of implementing a video database, an optimizing compiler, an operating system, a network congestion control system or any other system. The value of these courses is learning that there are intricate ways to exploit the structure of a problem to find efficient solutions. Advanced algorithms courses are also where one meets simple algorithms whose analysis is non-trivial. For this reason, I would not dismiss simple randomized algorithms or PageRank.

I think you can choose any large piece of software and find basic and advanced algorithms implemented in it. As a case study, I've done this for the Linux kernel, and shown a few examples from Chromium.

Basic Data Structures and Algorithms in the Linux kernel

Links are to the source code on github.

  1. Linked list, doubly linked list, lock-free linked list.
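The kernel's `struct list_head` embeds a circular doubly linked list node directly in the containing structure. A minimal standalone sketch of the same idea (the names `node`, `list_init`, etc. are illustrative, not the kernel API):

```c
#include <assert.h>
#include <stddef.h>

/* Circular doubly linked list in the style of the kernel's struct
 * list_head. The head is a sentinel node; an empty list points to itself. */
struct node {
    struct node *prev, *next;
};

static void list_init(struct node *head)
{
    head->prev = head->next = head;
}

static void list_add_tail(struct node *head, struct node *n)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

static void list_del(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

static int list_len(const struct node *head)
{
    int len = 0;
    for (const struct node *p = head->next; p != head; p = p->next)
        len++;
    return len;
}
```

Because the sentinel makes the list circular, insertion and removal need no NULL checks, which is the main reason the kernel favors this layout.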
  2. B+ Trees with comments telling you what you can't find in the textbooks.

    A relatively simple B+Tree implementation. I have written it as a learning exercise to understand how B+Trees work. Turned out to be useful as well.

    ...

    A trick was used that is not commonly found in textbooks. The lowest values are to the right, not to the left. All used slots within a node are on the left, all unused slots contain NUL values. Most operations simply loop once over all slots and terminate on the first NUL.
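The slot convention described in that comment can be sketched as follows; this is an illustrative fragment in the spirit of `lib/btree.c`, not the kernel code itself:

```c
#include <assert.h>

/* Sketch of the lib/btree.c slot convention: keys are stored in
 * DECREASING order, all used slots are packed to the left, unused
 * slots hold 0 (NUL), so a lookup loops until the first zero key. */
#define SLOTS 8

struct bnode {
    unsigned long key[SLOTS]; /* key[i] == 0 marks the end of used slots */
};

/* Return the index of the first key <= the search key, or -1 if none. */
static int bnode_find(const struct bnode *n, unsigned long key)
{
    for (int i = 0; i < SLOTS; i++) {
        if (n->key[i] == 0)   /* first NUL slot: no more entries */
            break;
        if (n->key[i] <= key) /* keys decrease left to right */
            return i;
    }
    return -1;
}
```

The payoff of the convention is visible here: one loop, one termination condition, no separate slot count to maintain.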

  3. Priority sorted lists used for mutexes, drivers, etc.

  4. Red-Black trees are used for scheduling, virtual memory management, to track file descriptors and directory entries, etc.
  5. Interval trees
  6. Radix trees are used for memory management, NFS-related lookups and networking-related functionality.

    A common use of the radix tree is to store pointers to struct pages;

  7. Priority heap, which is literally a textbook implementation, used in the control group system.

    Simple insertion-only static-sized priority heap containing pointers, based on CLR, chapter 7
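An insertion-only, fixed-size max-heap of pointers in the spirit of that comment; the comparator callback and all names here are mine, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>

/* Insertion-only, static-sized max-heap of pointers (CLR-style sift-up). */
#define HEAP_CAP 16

struct pheap {
    void *ptrs[HEAP_CAP];
    int size;
    int (*gt)(void *a, void *b); /* nonzero if a has higher priority */
};

/* Returns 0 on success, -1 if the heap is full. */
static int pheap_insert(struct pheap *h, void *p)
{
    if (h->size == HEAP_CAP)
        return -1;
    int i = h->size++;
    h->ptrs[i] = p;
    /* Sift up: swap with the parent while the new element is greater. */
    while (i > 0) {
        int parent = (i - 1) / 2;
        if (!h->gt(h->ptrs[i], h->ptrs[parent]))
            break;
        void *tmp = h->ptrs[i];
        h->ptrs[i] = h->ptrs[parent];
        h->ptrs[parent] = tmp;
        i = parent;
    }
    return 0;
}

/* Example comparator: treat the stored pointers as pointers to int. */
static int int_gt(void *a, void *b)
{
    return *(int *)a > *(int *)b;
}
```

After any sequence of inserts, the highest-priority element sits at `ptrs[0]`.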

  8. Hash functions, with a reference to Knuth and to a paper.

    Knuth recommends primes in approximately golden ratio to the maximum integer representable by a machine word for multiplicative hashing. Chuck Lever verified the effectiveness of this technique:

    http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf

    These primes are chosen to be bit-sparse; that is, operations on them can use shifts and additions instead of multiplications, for machines where multiplications are slow.
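The scheme boils down to multiplying by a prime near 2^32/φ and keeping the top bits. A minimal sketch; the constant 0x9e370001 is the bit-sparse 32-bit prime the kernel historically used in `include/linux/hash.h`:

```c
#include <assert.h>
#include <stdint.h>

/* Multiplicative hashing with a golden-ratio prime: 0x9e370001 =
 * 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1, which is bit-sparse. */
#define GOLDEN_RATIO_PRIME_32 0x9e370001UL

static uint32_t hash_32(uint32_t val, unsigned int bits)
{
    /* The high bits of the product are the best mixed, so keep those. */
    return (uint32_t)(val * GOLDEN_RATIO_PRIME_32) >> (32 - bits);
}
```

Taking the high `bits` of the product, rather than the low ones, is what makes nearby inputs land in distant buckets.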

  9. Some parts of the code, such as this driver, implement their own hash function.

    hash function using a Rotating Hash algorithm

    Knuth, D. The Art of Computer Programming, Volume 3: Sorting and Searching, Chapter 6.4. Addison Wesley, 1973
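A rotating hash of the kind that comment cites rotates the accumulator and XORs in each byte. A standalone sketch (not the driver's exact code):

```c
#include <stdint.h>
#include <stddef.h>

/* Classic rotating hash (cf. Knuth vol. 3, sec. 6.4): rotate the
 * accumulator by 4 bits, then XOR in the next byte of the key. */
static uint32_t rotating_hash(const unsigned char *key, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++)
        h = (h << 4) ^ (h >> 28) ^ key[i];
    return h;
}
```

It is cheap (two shifts and two XORs per byte) and adequate for short keys, which is why small drivers reach for it rather than a stronger mixer.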

  10. Hash tables used to implement inodes, file system integrity checks etc.
  11. Bit arrays, which are used for dealing with flags, interrupts, etc. and are featured in Knuth Vol. 4.
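The flag manipulation in item 11 reduces to set/test/clear operations on an array of words; a minimal non-atomic sketch in the style of the kernel's bitmap helpers (the trailing underscores mark these as my illustrative names, not the kernel's atomic `set_bit` family):

```c
#include <assert.h>

/* Bit array over an array of unsigned longs, word index nr/BITS_PER_LONG,
 * bit position nr%BITS_PER_LONG. Non-atomic sketch. */
#define BITS_PER_LONG (8 * (int)sizeof(unsigned long))

static void set_bit_(int nr, unsigned long *addr)
{
    addr[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

static void clear_bit_(int nr, unsigned long *addr)
{
    addr[nr / BITS_PER_LONG] &= ~(1UL << (nr % BITS_PER_LONG));
}

static int test_bit_(int nr, const unsigned long *addr)
{
    return (int)((addr[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1);
}
```

The kernel versions add atomicity and memory-ordering guarantees on top of exactly this word-and-bit arithmetic.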

  12. Semaphores and spin locks

  13. Binary search is used for interrupt handling, register cache lookup, etc.
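The register-cache and interrupt-handling uses in item 13 are the textbook algorithm over a sorted array; a sketch, not kernel code:

```c
#include <stddef.h>

/* Textbook binary search over a sorted int array.
 * Returns the index of key, or -1 if absent. */
static int bsearch_int(const int *a, size_t n, int key)
{
    size_t lo = 0, hi = n; /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2; /* avoids overflow of lo + hi */
        if (a[mid] == key)
            return (int)mid;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}
```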

  14. Binary search with B-trees

  15. Depth first search and variant used in directory configuration.

    Performs a modified depth-first walk of the namespace tree, starting (and ending) at the node specified by start_handle. The callback function is called whenever a node that matches the type parameter is found. If the callback function returns a non-zero value, the search is terminated immediately and this value is returned to the caller.
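The early-exit callback pattern from that comment can be sketched as a recursive walk; the tree type, fixed child count, and callback here are all illustrative assumptions, not the ACPI code:

```c
#include <stddef.h>

/* Depth-first walk with an early-exit callback: if cb returns a
 * non-zero value, the walk stops and that value propagates up. */
struct tnode {
    int value;
    struct tnode *child[4]; /* NULL entries mark unused child slots */
};

static int walk(struct tnode *n, int (*cb)(struct tnode *))
{
    int ret = cb(n);
    if (ret)
        return ret; /* terminate immediately, as the comment describes */
    for (int i = 0; i < 4 && n->child[i]; i++) {
        ret = walk(n->child[i], cb);
        if (ret)
            return ret;
    }
    return 0;
}

/* Example callback: stop as soon as a node with a negative value appears. */
static int find_negative(struct tnode *n)
{
    return n->value < 0 ? n->value : 0;
}
```

Returning the callback's value through every level of recursion is what lets the caller both stop the search and learn why it stopped.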

  16. Breadth first search is used to check correctness of locking at runtime.

  17. Merge sort on linked lists is used for garbage collection, file system management, etc.
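Merge sort suits linked lists because merging needs no random access. A sketch of the technique on a singly linked list (the kernel's `lib/list_sort.c` works on its doubly linked `list_head` and is bottom-up; the node type and names here are mine):

```c
#include <stddef.h>

/* Top-down merge sort on a singly linked list of ints. */
struct snode {
    int val;
    struct snode *next;
};

static struct snode *merge(struct snode *a, struct snode *b)
{
    struct snode head, *tail = &head; /* dummy head simplifies splicing */
    while (a && b) {
        if (a->val <= b->val) {
            tail->next = a;
            a = a->next;
        } else {
            tail->next = b;
            b = b->next;
        }
        tail = tail->next;
    }
    tail->next = a ? a : b;
    return head.next;
}

static struct snode *msort(struct snode *list)
{
    if (!list || !list->next)
        return list;
    /* Split in half with slow/fast pointers. */
    struct snode *slow = list, *fast = list->next;
    while (fast && fast->next) {
        slow = slow->next;
        fast = fast->next->next;
    }
    struct snode *half = slow->next;
    slow->next = NULL;
    return merge(msort(list), msort(half));
}
```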

  18. Bubble sort is amazingly implemented too, in a driver library.

  19. Knuth-Morris-Pratt string matching,

    Implements a linear-time string-matching algorithm due to Knuth, Morris, and Pratt [1]. Their algorithm avoids the explicit computation of the transition function DELTA altogether. Its matching time is O(n), for n being length(text), using just an auxiliary function PI[1..m], for m being length(pattern), precomputed from the pattern in time O(m). The array PI allows the transition function DELTA to be computed efficiently "on the fly" as needed. Roughly speaking, for any state "q" = 0,1,...,m and any character "a" in SIGMA, the value PI["q"] contains the information that is independent of "a" and is needed to compute DELTA("q", "a") [2]. Since the array PI has only m entries, whereas DELTA has O(m|SIGMA|) entries, we save a factor of |SIGMA| in the preprocessing time by computing PI rather than DELTA.

    [1] Cormen, Leiserson, Rivest, Stein. Introduction to Algorithms, 2nd Edition, MIT Press

    [2] See finite automata theory
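The PI array and the O(n) scan from that description look like this in a standalone sketch (this is the CLRS formulation, not `lib/ts_kmp.c` itself; the fixed-size PI buffer is a simplification):

```c
#include <string.h>
#include <stddef.h>

/* Compute the KMP prefix function PI for the pattern:
 * pi[q] = length of the longest proper prefix of pat[0..q]
 * that is also a suffix of it. */
static void compute_prefix(const char *pat, size_t m, size_t *pi)
{
    size_t k = 0;
    pi[0] = 0;
    for (size_t q = 1; q < m; q++) {
        while (k > 0 && pat[k] != pat[q])
            k = pi[k - 1];
        if (pat[k] == pat[q])
            k++;
        pi[q] = k;
    }
}

/* Return the index of the first occurrence of pat in text, or -1.
 * O(m) preprocessing + O(n) matching; no transition table DELTA needed. */
static int kmp_find(const char *text, const char *pat)
{
    size_t n = strlen(text), m = strlen(pat);
    size_t pi[64]; /* sketch assumes patterns of at most 64 chars */
    size_t q = 0;  /* current state: number of pattern chars matched */
    if (m == 0 || m > 64)
        return -1;
    compute_prefix(pat, m, pi);
    for (size_t i = 0; i < n; i++) {
        while (q > 0 && pat[q] != text[i])
            q = pi[q - 1]; /* fall back instead of rescanning the text */
        if (pat[q] == text[i])
            q++;
        if (q == m)
            return (int)(i - m + 1);
    }
    return -1;
}
```

The `while` loop that resets `q` via `pi` is exactly the "on the fly" computation of DELTA the comment refers to: the text index `i` never moves backwards.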

  20. Boyer-Moore pattern matching with references and recommendations for when to prefer the alternative.

    Implements Boyer-Moore string matching algorithm:

    [1] A Fast String Searching Algorithm, R.S. Boyer and J.S. Moore. Communications of the Association for Computing Machinery, 20(10), 1977, pp. 762-772. http://www.cs.utexas.edu/users/moore/publications/fstrpos.pdf

    [2] Handbook of Exact String Matching Algorithms, Thierry Lecroq, 2004 http://www-igm.univ-mlv.fr/~lecroq/string/string.pdf

    Note: Since Boyer-Moore (BM) performs searches for matchings from right to left, it's still possible that a matching could be spread over multiple blocks, in which case this algorithm won't find any match.

    If you need to ensure that such a thing never happens, use the Knuth-Morris-Pratt (KMP) implementation instead. In conclusion, choose the proper string search algorithm depending on your setting.

    Say you're using the textsearch infrastructure for filtering, NIDS, or any similar security-focused purpose; then go with KMP. Otherwise, if you really care about performance, say you're classifying packets to apply Quality of Service (QoS) policies, and you don't mind possible matches spread over multiple fragments, then go with BM.
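The right-to-left scan that makes BM fast (and causes the multi-block caveat above) is easiest to see with the bad-character rule alone; this simplified variant is usually called Boyer-Moore-Horspool, whereas `lib/ts_bm.c` also applies the good-suffix rule:

```c
#include <string.h>
#include <stddef.h>

/* Boyer-Moore-Horspool: compare the pattern right to left; on a mismatch,
 * shift by how far the text character under the pattern's LAST position
 * is from the end of the pattern. Returns first match index or -1. */
static int bm_find(const char *text, const char *pat)
{
    size_t n = strlen(text), m = strlen(pat);
    size_t shift[256];
    if (m == 0 || m > n)
        return -1;
    /* Characters absent from the pattern allow a full-length shift. */
    for (size_t c = 0; c < 256; c++)
        shift[c] = m;
    for (size_t i = 0; i + 1 < m; i++)
        shift[(unsigned char)pat[i]] = m - 1 - i;
    for (size_t i = 0; i + m <= n; ) {
        size_t j = m;
        while (j > 0 && pat[j - 1] == text[i + j - 1])
            j--; /* scan right to left */
        if (j == 0)
            return (int)i;
        i += shift[(unsigned char)text[i + m - 1]];
    }
    return -1;
}
```

The long shifts on mismatches are why BM often beats KMP on ordinary data, and the right-to-left comparison is precisely why a match straddling a block boundary can be skipped when the text arrives in fragments.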

