[NineChap 9] Big Data, System Design and Resume

2014/06/29

Resume

Do not write anything unrelated to CS.
Do not make it too long: 1 or 2 pages is fine, up to 3 pages for a senior engineer.
Do not list a low GPA.
Never ever write “proficient in anything”

Big Data

The most classic question is “Frequent items” (refer to July’s blog).

Find top k hot queries in a daily access log of Google.

Variations:

k = 1 vs k = 100000 (related to the majority numbers problem)
low RAM vs sufficient RAM
single machine vs multiple machines
accurate vs inaccurate

Sufficient RAM

HashTable + Heap (min-heap)
Time O(n * logk), Space O(n)
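
The HashTable + Heap approach can be sketched roughly as below. This is only a minimal illustration, assuming the whole log fits in memory as a list of query strings; the function name `top_k_queries` is made up for this example.

```python
import heapq
from collections import Counter


def top_k_queries(queries, k):
    """Return the k most frequent queries as (query, count), most frequent first."""
    if k <= 0:
        return []
    counts = Counter(queries)      # hash table: query -> frequency, O(n) space
    heap = []                      # min-heap of (count, query), never larger than k
    for query, count in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, query))
        elif count > heap[0][0]:
            # The new query beats the weakest of the current top k.
            heapq.heapreplace(heap, (count, query))
    return [(query, count) for count, query in sorted(heap, reverse=True)]


if __name__ == "__main__":
    log = ["a", "b", "a", "c", "a", "b"]
    print(top_k_queries(log, 2))
```

The heap holds at most k entries, so each push or replace costs O(log k), which is where the O(n * logk) bound comes from.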

Low RAM

Split the log into 1000 (i.e. log size / memory size) files by hash(query) % 1000
Use HashTable + Heap to get the top k for each file
Collect the 1000 top-k lists and get the global top k
This method requires a lot of disk reads and writes, so it is still slow.
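
A rough sketch of the split-then-merge steps above, assuming the log is a text file with one query per line. The function name `top_k_low_ram`, the bucket file names, and the use of CRC32 as the hash are assumptions made for this illustration. The default bucket count is kept small (64 instead of 1000) because this sketch keeps every bucket file open at once and many systems limit open file descriptors.

```python
import heapq
import os
import tempfile
import zlib
from collections import Counter


def top_k_low_ram(log_path, k, num_buckets=64):
    """Top k (query, count) pairs from a log file, holding one bucket in RAM at a time."""
    if k <= 0:
        return []
    by_count = lambda item: item[1]
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"bucket_{i}.txt") for i in range(num_buckets)]
        buckets = [open(p, "w", encoding="utf-8") for p in paths]
        try:
            # Pass 1: every occurrence of a query lands in the same bucket file.
            with open(log_path, encoding="utf-8") as log:
                for line in log:
                    query = line.rstrip("\n")
                    if query:
                        idx = zlib.crc32(query.encode("utf-8")) % num_buckets
                        buckets[idx].write(query + "\n")
        finally:
            for f in buckets:
                f.close()
        # Pass 2: exact top k per bucket, collected as global candidates.
        candidates = []
        for path in paths:
            with open(path, encoding="utf-8") as f:
                counts = Counter(line.rstrip("\n") for line in f)
            candidates.extend(heapq.nlargest(k, counts.items(), key=by_count))
    # Merge: the global top k must be among the per-bucket top k lists.
    return heapq.nlargest(k, candidates, key=by_count)
```

Because a query never spans two buckets, the per-bucket counts are exact and the merge step only has to compare at most num_buckets * k candidates.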

Inaccurate (reduce memory from O(n) to O(k))

Hash Count (only need to know this one)
Limit the size of the HashMap. The bigger the RAM, the more accurate the result.
Space Saving
Lossy Counting
Sticky Sampling
Count Sketch
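
The notes only say to limit the size of the HashMap, so the following is one possible reading of Hash Count rather than a definitive version: a fixed-size array of counters indexed by hash(query) % table_size, plus a small candidate set of at most k queries. The names `approx_top_k` and `table_size` and the choice of CRC32 are assumptions for this sketch. Colliding queries share a counter, so counts can only be overestimated.

```python
import zlib


def approx_top_k(queries, k, table_size=1024):
    """Approximate top k as (query, estimated_count) using a fixed-size counter table."""
    if k <= 0:
        return []
    counts = [0] * table_size   # memory is fixed no matter how many distinct queries exist
    top = {}                    # candidate query -> estimated count, at most k entries
    for query in queries:
        slot = zlib.crc32(query.encode("utf-8")) % table_size
        counts[slot] += 1
        estimate = counts[slot]  # may include hits from colliding queries
        if query in top or len(top) < k:
            top[query] = estimate
        else:
            weakest = min(top, key=top.get)   # O(k) scan; a heap would also work
            if estimate > top[weakest]:
                del top[weakest]
                top[query] = estimate
    return sorted(top.items(), key=lambda item: -item[1])
```

A larger `table_size` means fewer collisions, which matches the remark that more RAM gives a more accurate result.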

Bloom Filter

Regular bloom filter - use 4 linearly independent hash functions
Counting bloom filter - supports delete
A better DS than HashMap, but can lose some accuracy
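
A minimal counting bloom filter sketch, to show how delete can be supported. The class name, the default sizes, and deriving the 4 hash functions from salted SHA-256 are assumptions made for this illustration and are not taken from the notes.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter with small counters instead of bits, so items can be removed."""

    def __init__(self, size=1024, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _slots(self, item):
        # One slot per hash function; the salt i makes the functions differ.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for slot in self._slots(item):
            self.counters[slot] += 1

    def remove(self, item):
        # Only decrement when the item appears present, to avoid negative counters.
        if item in self:
            for slot in self._slots(item):
                self.counters[slot] -= 1

    def __contains__(self, item):
        # False means definitely absent; True may be a false positive.
        return all(self.counters[slot] > 0 for slot in self._slots(item))
```

A regular bloom filter is the same idea with single bits in place of counters, which is why it cannot undo an insert.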

Trie

Bitmap

Find all unique queries - use bitmap to store 3 types of states
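
One way to read the 3 states is: not seen, seen once, seen more than once, which fits in 2 bits per item. The sketch below assumes each query has already been mapped to an integer id in [0, n); that mapping, the state names, and the class and function names are assumptions for illustration only.

```python
class TwoBitMap:
    """Bitmap that stores a 2-bit state for each id in [0, n)."""

    UNSEEN, ONCE, MANY = 0, 1, 2

    def __init__(self, n):
        self.bits = bytearray((n * 2 + 7) // 8)   # 4 ids per byte

    def get(self, i):
        byte, shift = divmod(i * 2, 8)
        return (self.bits[byte] >> shift) & 0b11

    def set(self, i, state):
        byte, shift = divmod(i * 2, 8)
        cleared = self.bits[byte] & ~(0b11 << shift) & 0xFF   # zero the 2-bit field
        self.bits[byte] = cleared | (state << shift)


def unique_ids(ids, n):
    """Return the ids that occur exactly once, in increasing order."""
    bitmap = TwoBitMap(n)
    for i in ids:
        state = bitmap.get(i)
        if state == TwoBitMap.UNSEEN:
            bitmap.set(i, TwoBitMap.ONCE)
        elif state == TwoBitMap.ONCE:
            bitmap.set(i, TwoBitMap.MANY)
    return [i for i in range(n) if bitmap.get(i) == TwoBitMap.ONCE]
```

With 2 bits per id, one billion possible ids need about 250 MB, far less than a hash table of counts.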

System Design

Design a short url system

Cache

to store hot urls

Load Balance

Too many clicks in a short time

Storage balance

Hash the url, then store it on an individual machine chosen by the hash value

Scalability?

Consistent Hash

Nodes: can increase the # of machines that store information

Migration process
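
A small consistent hash ring, sketched to show why adding a machine only migrates part of the data. The class name, the use of 100 virtual points per node, and SHA-256 as the ring hash are assumptions for this example; hash collisions between ring points are ignored for simplicity.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps keys to nodes; adding or removing a node only remaps nearby keys."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas   # virtual points per node, to even out the load
        self._points = []          # sorted positions on the ring
        self._owner = {}           # position -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add_node(self, node):
        for i in range(self.replicas):
            pos = self._hash(f"{node}#{i}")
            if pos in self._owner:
                continue           # extremely unlikely collision: skip this point
            bisect.insort(self._points, pos)
            self._owner[pos] = node

    def remove_node(self, node):
        for i in range(self.replicas):
            pos = self._hash(f"{node}#{i}")
            if self._owner.get(pos) == node:
                del self._owner[pos]
                self._points.pop(bisect.bisect_left(self._points, pos))

    def get_node(self, key):
        if not self._points:
            raise LookupError("ring has no nodes")
        # First point clockwise from the key's position, wrapping around the ring.
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

When a node is added, the only keys that move are the ones now owned by the new node, so the migration process touches a fraction of the stored urls instead of all of them.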

Router

check which machine responds to my query

light-weight calculations

what if the router is down?

Locale

if a url is frequently accessed from China, put its storage in Beijing

Need-to-know Design patterns

Singleton
Factory
Master-slave (esp. for relational DB)

MapReduce: Simplified Data Processing on Large Clusters

The Google File System

BigTable: A Distributed Storage System for Structured Data