Question

The input is an endless stream of English words or phrases (we refer to them as tokens).

Output the top N most frequent tokens we have seen so far (across all the tokens we have seen!).

Analysis

We will discuss the following details of implementation and optimization.

  1. String into Integer
  2. Data Storage
  3. Process Incoming Streams
  4. Save result

1. String into Integer

This is a nice trick that improves efficiency a lot.

Though there are almost infinitely many possible words on the Internet, after accumulating a large set of words, the chance of encountering a new word becomes lower and lower.

We have already found 4 million different words and assigned a unique ID to each. This is important, because sorting and comparisons on integers are much, much faster than on strings.
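As a rough illustration, the mapping can be a simple in-memory dictionary that hands out the next free integer to each new token (a Python sketch; the class and method names are mine, not part of the original design):

```python
# Minimal sketch of a token-to-ID dictionary (names are illustrative).
class TokenDictionary:
    def __init__(self):
        self._ids = {}      # token string -> integer ID
        self._next_id = 0

    def get_or_assign(self, token: str) -> int:
        """Return the existing ID for a token, assigning a new one if unseen."""
        if token not in self._ids:
            self._ids[token] = self._next_id
            self._next_id += 1
        return self._ids[token]

# All later counting, sorting, and merging works on the integer IDs.
d = TokenDictionary()
print(d.get_or_assign("hello"))   # 0
print(d.get_or_assign("world"))   # 1
print(d.get_or_assign("hello"))   # 0 (same token, same ID)
```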

2. Data Storage

The system keeps archived data for every token, basically pairs of (Token, Frequency).

However, the table that stores the data would be so huge that we have to partition it physically. One partition scheme is based on the n-gram size of the token: a single word is a 1-gram, a two-word phrase is a 2-gram, and so on.

Of course, we can also partition the data by hash value. For more on n-grams, read [Design] Terminology: n-gram.
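Both schemes boil down to a small routing function (a sketch; the partition count is an arbitrary assumption, and crc32 stands in for whatever stable hash the system actually uses):

```python
import zlib

NUM_HASH_PARTITIONS = 64   # assumed; tune to data volume

def partition_by_ngram(token: str) -> int:
    """Route a token to a partition by its n-gram size (word count)."""
    return len(token.split())            # 1 for "apple", 2 for "new york", ...

def partition_by_hash(token: str) -> int:
    """Route a token to a bucket using a stable hash (crc32 rather than
    Python's randomized hash(), so the mapping survives restarts)."""
    return zlib.crc32(token.encode("utf-8")) % NUM_HASH_PARTITIONS
```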

3. Process Incoming Streams

The system absorbs incoming sentences until memory is fully utilized (yes, we need a MemoryManager). After taking a batch of sentences and storing them in memory, the system pauses and starts tokenizing each sentence into words and phrases. Each token (word or phrase) is counted. Once counting finishes, the counts are sorted by token ID and flushed to a disk file, freeing memory for the next batch.

This data-processing logic runs as one process under the MemoryManager. The next part is another process running concurrently.
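As a rough sketch of one such batch cycle (assuming naive whitespace tokenization and the TokenDictionary sketch from section 1; the function name and flush format are illustrative):

```python
import collections

def process_batch(sentences, token_dict, run_path):
    """Count one in-memory batch, then flush a run sorted by token ID.

    Sketch only: tokenization here is plain whitespace splitting; the real
    system would also extract phrases and be driven by the MemoryManager.
    """
    counts = collections.Counter()
    for sentence in sentences:
        for word in sentence.split():
            counts[token_dict.get_or_assign(word)] += 1
    # Flush counts sorted by integer token ID, so the later merge step can
    # treat each file as a sorted run.
    with open(run_path, "w") as f:
        for token_id in sorted(counts):
            f.write(f"{token_id}\t{counts[token_id]}\n")
```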

4. Save result

Meanwhile, another process is activated whenever it finds a disk file generated by the system, and starts merging it. Since each disk file is sorted by token ID, merging proceeds much like the merge step of merge sort.
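A minimal sketch of that merge pass, assuming the tab-separated (ID, frequency) run format from the previous sketch:

```python
import heapq

def merge_runs(run_paths, merged_path):
    """Merge runs sorted by token ID, summing frequencies for equal IDs.

    heapq.merge consumes each sorted file in order, exactly like the merge
    step of merge sort.
    """
    def read_run(path):
        with open(path) as f:
            for line in f:
                token_id, freq = line.split("\t")
                yield int(token_id), int(freq)

    with open(merged_path, "w") as out:
        current_id, total = None, 0
        for token_id, freq in heapq.merge(*(read_run(p) for p in run_paths)):
            if token_id != current_id:
                if current_id is not None:
                    out.write(f"{current_id}\t{total}\n")
                current_id, total = token_id, 0
            total += freq
        if current_id is not None:
            out.write(f"{current_id}\t{total}\n")
```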

There are a few more steps afterwards, but they are straightforward. I have listed the basic steps for processing a large stream of incoming data (as strings) and finding the top N keywords.
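One of those remaining steps is presumably the final top-N selection over the merged counts; a single pass with a size-N min-heap suffices (a sketch; `id_to_token` is an assumed reverse mapping of the dictionary from section 1):

```python
import heapq

def top_n(merged_path, n, id_to_token):
    """Scan the merged (ID, frequency) file once, keeping the N most
    frequent tokens in a size-N min-heap."""
    heap = []   # min-heap of (frequency, token_id)
    with open(merged_path) as f:
        for line in f:
            token_id, freq = (int(x) for x in line.split("\t"))
            if len(heap) < n:
                heapq.heappush(heap, (freq, token_id))
            elif freq > heap[0][0]:
                heapq.heapreplace(heap, (freq, token_id))
    # Map IDs back to strings, most frequent first.
    return [(id_to_token[tid], freq)
            for freq, tid in sorted(heap, reverse=True)]
```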

I suggest you read the previous [Design] Big Data - Top k Frequency posts before reading this one.