A practical introduction to the Count-Min Sketch
A practical introduction to the
Count-Min Sketch
Hannes Korte
Algorithms & Data Challenges Berlin Meetup
Mar 19, 2013
Imagine:
- You want to count elements.
- You don't need exact results.
Simple solution in Java:
Construct a map from elements to counts.
Map<String, Integer> counter = new HashMap<String, Integer>();
You probably want to use libraries like Guava or FastUtil for convenience and better performance.
Problem:
- The number of distinct elements might be very large.
- For realtime applications you need runtime guarantees.
Solution: The Count-Min Sketch
- The trick: Don't store the distinct elements, but just the counters.
- Create an integer array of length x initially filled with 0s.
- Each incoming element gets mapped to a number between 0 and x.
- The corresponding counter in the array gets incremented.
- To query an element's count, simply return the integer value at it's position.
You are completely right:
There will be collisions!
The Count-Min Sketch
- Use multiple arrays with different hash functions to compute the index.
- When queried, return the minimum of the numbers the arrays. → Count-Min Sketch
You are completely right:
There will still be collisions!
...but less.
Some properties
- Only over-estimates, never under-estimates the true count.
- Has a constant memory and time consumption independent of the number of elements.
- The relative error may be high for low-frequent elements.
Demo sketch:
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
Extension: The Conservative Update
- Instead of incrementing every counter, only increment the ones which need an update.
- Experiments show a strong error reduction using this heuristic.
Example counting unigrams and bigrams in text documents.
Example counting unigrams and bigrams in text documents.
Usage
- General:
- Every counting problem on large data sets, where errors are acceptable.
- Counting elements in data streams.
(e.g. trend detection on twitter)
- Mine:
- Feature selection for large data sets using pointwise mutual information.
- All kinds of feature statistics like word co-occurences, IDF, etc.
Feature Selection Example
Input: Company websites grouped by industry sector
Output: Relevant terms for each sector based on PMI
Label 01 Landwirtschaft und Jagd
#docs: 14
001. 0,0177864 pflanzen
002. 0,0144118 garten
003. 0,0129956 ernte
004. 0,0113797 gewächshausfläche
005. 0,0113797 dwarf
006. 0,0113797 spätreifende
007. 0,0113797 breiti
008. 0,0113797 kellereiartikel
009. 0,0108276 landschaftspflege
010. 0,0102964 cultivated
011. 0,0102964 reintönige
012. 0,0102964 traubensorte
013. 0,0102964 jahrgängen
014. 0,0102964 schwimmteich
015. 0,0102964 imposanter
016. 0,0102964 innenbegrünung
017. 0,0102964 pflanzenpflege
018. 0,0101218 grünanlagen
019. 0,0101218 blüten
020. 0,0101218 ernten
Label 01.12 Gartenbau
#docs: 6
001. 0,0143696 pflanzen
002. 0,0137515 gewächshausfläche
003. 0,0126628 schwimmteich
004. 0,0126628 pflanzenpflege
005. 0,0123234 gärtner
006. 0,0118892 schwimmteiche
007. 0,0107863 gärtnern
008. 0,0103628 topfpflanzen
009. 0,0099942 gartens
010. 0,0099942 schnittblumen
011. 0,0098929 gärten
012. 0,0096679 baumschule
013. 0,0096679 floristik
014. 0,0093751 eden
015. 0,0091096 gartencenter
016. 0,0091096 stauden
017. 0,0088669 blüten
018. 0,0084301 portrait
019. 0,0080621 ablehnen
020. 0,0080621 edith
Label 15 Ernährungsgewerbe
#docs: 20
001. 0,0167030 leckere
002. 0,0163800 zutaten
003. 0,0157303 geschmack
004. 0,0152514 rezepte
005. 0,0149070 spezialitäten
006. 0,0141082 wurst
007. 0,0138228 aprikosen
008. 0,0138228 hefe
009. 0,0134308 inhaltsstoffe
010. 0,0132752 schinken
011. 0,0132752 fettsäuren
012. 0,0132752 mandeln
013. 0,0131221 getraenke
014. 0,0131221 schlachtung
015. 0,0131221 walnüsse
016. 0,0131221 durstlöscher
017. 0,0131221 glasflasche
018. 0,0131221 geschmacklich
019. 0,0131221 maracuja
020. 0,0123445 vitaminen
Label 15.13 Fleischverarbeitung
#docs: 5
001. 0,0186327 schlachtung
002. 0,0179152 wurst
003. 0,0163253 wurstwaren
004. 0,0142521 lammfleisch
005. 0,0142521 qualitätsfleisch
006. 0,0138747 partyservice
007. 0,0131627 aufschnitt
008. 0,0131627 küchenfertige
009. 0,0131627 hackfleisch
010. 0,0131627 artgerechte
011. 0,0125670 fleisch
012. 0,0123884 züchten
013. 0,0123884 vieh
014. 0,0123884 mariniert
015. 0,0117828 kühlkette
016. 0,0112842 geschmacksverstärker
017. 0,0112842 dlg
018. 0,0112842 aufzucht
019. 0,0107535 zubereitung
020. 0,0104908 fleischerei
References
- https://sites.google.com/site/countminsketch/
A great resource of implementations and publications.
- An Improved Data Stream Summary: The Count-Min Sketch and Its Applications
Graham Cormode, and S. Muthukrishnan. LATIN 2004: Theoretical Informatics (2004). (The original paper).
http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
- Sketching Techniques for Large Scale NLP
Amit Goyal, Jagadeesh Jagarlamudi, Hal Daume III, and Suresh Venkatasubramanian. Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop (2010).
http://www.umiacs.umd.edu/~jags/pdfs/LSNLPsketchWAC10.pdf