Horje
Count-Min Sketch in Python

Count-Min Sketch is a probabilistic data structure which approximates the frequency of items in a stream of data. It uses little memory while handling massive amounts of data and producing approximations of the answers. In this post, we’ll explore the idea behind the Count-Min Sketch, how it’s implemented in Python, and discuss its uses and drawbacks.

What is Count-Min Sketch?

Count-Min is a probabilistic data structure used to count unique items in a large stream of data. It is used to find an approximate frequency of the events on the streaming data.

The idea behind Count-Min Sketch is to use hash functions and a two-dimensional array (or matrix) to efficiently store the frequency of items. The array is made up of several rows and columns, where a bucket is represented by a column and a hash function by a row. The hash functions identify the locations in the array to increment or get counts when updating or querying the frequency of entries.

Key Operations in Count-Min Sketch:

Initialization: Set the number of rows and columns that you want in the Count-Min Sketch.

Update: To increase an element’s count, hash it through each hash function and update the array’s associated buckets.

Query: Find the lowest count across the related buckets after hashing an element with each hash algorithm to determine its estimated frequency.

Implementation of Count-Min Sketch in Python:

Below is the implementation of Count-Min Sketch in Python:

Python
import hashlib

class CountMinSketch:
    def __init__(self, rows, cols):
        self.rows = rows
        self.cols = cols
        self.count_matrix = [[0] * cols for _ in range(rows)]
        self.hash_functions = [hashlib.md5, hashlib.sha1, hashlib.sha256] 

    def update(self, element):
        for i, hash_func in enumerate(self.hash_functions):
            hash_value = int(hash_func(element.encode()).hexdigest(), 16)
            bucket_index = hash_value % self.cols
            self.count_matrix[i][bucket_index] += 1

    def query(self, element):
        min_count = float('inf')
        for i, hash_func in enumerate(self.hash_functions):
            hash_value = int(hash_func(element.encode()).hexdigest(), 16)
            bucket_index = hash_value % self.cols
            min_count = min(min_count, self.count_matrix[i][bucket_index])
        return min_count

# Example usage
cms = CountMinSketch(rows=3, cols=10)
data_stream = ["apple", "banana", "apple", "orange", "apple", "banana", "banana"]
for element in data_stream:
    cms.update(element)
print("Frequency of 'apple':", cms.query("apple"))

Output
Frequency of 'apple': 3

Count-Min Sketch is a data structure that provides accurate approximations of element frequencies in massive data streams. Python developers may efficiently address a range of frequency estimation issues by using Count-Min Sketch provided they have a thorough knowledge of its concepts, uses, and limits.




Reffered: https://www.geeksforgeeks.org


DSA

Related
LMNs-Data Structures LMNs-Data Structures
Reverse the string whenever digit is encountered Reverse the string whenever digit is encountered
Maximizing GCD Sums Maximizing GCD Sums
Boyer Moore Algorithm in Python Boyer Moore Algorithm in Python
Matching the Alphanumerical Pattern in Matrix I Matching the Alphanumerical Pattern in Matrix I

Type:
Geek
Category:
Coding
Sub Category:
Tutorial
Uploaded by:
Admin
Views:
19