count sketch








A practical introduction to the Count-Min Sketch


























	

A practical introduction to the

Count-Min Sketch

Hannes Korte
Algorithms & Data Challenges Berlin Meetup
Mar 19, 2013

Imagine:

  • You want to count elements.
  • You don't need exact results.

Simple solution in Java:

Construct a map from elements to counts.

 Map<String, Integer> counter = new HashMap<String, Integer>(); 

You probably want to use libraries like Guava or FastUtil for convenience and better performance.

Problem:

  • The number of distinct elements might be very large.
  • For realtime applications you need runtime guarantees.

Solution: The Count-Min Sketch

  • The trick: Don't store the distinct elements, but just the counters.
  • Create an integer array of length x initially filled with 0s.
  • Each incoming element gets mapped to a number between 0 and x.
  • The corresponding counter in the array gets incremented.
  • To query an element's count, simply return the integer value at it's position.

You are completely right:
There will be collisions!

The Count-Min Sketch

  • Use multiple arrays with different hash functions to compute the index.
  • When queried, return the minimum of the numbers the arrays. → Count-Min Sketch

You are completely right:
There will still be collisions!

...but less.

Some properties

  • Only over-estimates, never under-estimates the true count.
  • Has a constant memory and time consumption independent of the number of elements.
  • The relative error may be high for low-frequent elements.

Demo sketch:

h0 0000000
h1 0000000
h2 0000000
add a new element:

Extension: The Conservative Update

  • Instead of incrementing every counter, only increment the ones which need an update.
  • Experiments show a strong error reduction using this heuristic.

Example counting unigrams and bigrams in text documents.

Example counting unigrams and bigrams in text documents.

Usage

  • General:
    • Every counting problem on large data sets, where errors are acceptable.
    • Counting elements in data streams.
      (e.g. trend detection on twitter)
  • Mine:
    • Feature selection for large data sets using pointwise mutual information.
    • All kinds of feature statistics like word co-occurences, IDF, etc.

Feature Selection Example

Input: Company websites grouped by industry sector

Output: Relevant terms for each sector based on PMI

Label 01         Landwirtschaft und Jagd
#docs: 14
001.     0,0177864  pflanzen
002.     0,0144118  garten
003.     0,0129956  ernte
004.     0,0113797  gewächshausfläche
005.     0,0113797  dwarf
006.     0,0113797  spätreifende
007.     0,0113797  breiti
008.     0,0113797  kellereiartikel
009.     0,0108276  landschaftspflege
010.     0,0102964  cultivated
011.     0,0102964  reintönige
012.     0,0102964  traubensorte
013.     0,0102964  jahrgängen
014.     0,0102964  schwimmteich
015.     0,0102964  imposanter
016.     0,0102964  innenbegrünung
017.     0,0102964  pflanzenpflege
018.     0,0101218  grünanlagen
019.     0,0101218  blüten
020.     0,0101218  ernten

Label 01.12      Gartenbau
#docs: 6
001.     0,0143696  pflanzen
002.     0,0137515  gewächshausfläche
003.     0,0126628  schwimmteich
004.     0,0126628  pflanzenpflege
005.     0,0123234  gärtner
006.     0,0118892  schwimmteiche
007.     0,0107863  gärtnern
008.     0,0103628  topfpflanzen
009.     0,0099942  gartens
010.     0,0099942  schnittblumen
011.     0,0098929  gärten
012.     0,0096679  baumschule
013.     0,0096679  floristik
014.     0,0093751  eden
015.     0,0091096  gartencenter
016.     0,0091096  stauden
017.     0,0088669  blüten
018.     0,0084301  portrait
019.     0,0080621  ablehnen
020.     0,0080621  edith

Label 15         Ernährungsgewerbe
#docs: 20
001.     0,0167030  leckere
002.     0,0163800  zutaten
003.     0,0157303  geschmack
004.     0,0152514  rezepte
005.     0,0149070  spezialitäten
006.     0,0141082  wurst
007.     0,0138228  aprikosen
008.     0,0138228  hefe
009.     0,0134308  inhaltsstoffe
010.     0,0132752  schinken
011.     0,0132752  fettsäuren
012.     0,0132752  mandeln
013.     0,0131221  getraenke
014.     0,0131221  schlachtung
015.     0,0131221  walnüsse
016.     0,0131221  durstlöscher
017.     0,0131221  glasflasche
018.     0,0131221  geschmacklich
019.     0,0131221  maracuja
020.     0,0123445  vitaminen

Label 15.13      Fleischverarbeitung
#docs: 5
001.     0,0186327  schlachtung
002.     0,0179152  wurst
003.     0,0163253  wurstwaren
004.     0,0142521  lammfleisch
005.     0,0142521  qualitätsfleisch
006.     0,0138747  partyservice
007.     0,0131627  aufschnitt
008.     0,0131627  küchenfertige
009.     0,0131627  hackfleisch
010.     0,0131627  artgerechte
011.     0,0125670  fleisch
012.     0,0123884  züchten
013.     0,0123884  vieh
014.     0,0123884  mariniert
015.     0,0117828  kühlkette
016.     0,0112842  geschmacksverstärker
017.     0,0112842  dlg
018.     0,0112842  aufzucht
019.     0,0107535  zubereitung
020.     0,0104908  fleischerei

References

你可能感兴趣的:(PHP)