n-gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech.
The items can be phonemes, syllables, letters, words or base pairs, depending on the application. The n-grams are typically collected from a text or speech corpus.
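As a minimal sketch of the definition above, the following Python snippet collects n-grams by sliding a window of size n over a sequence. The function name `ngrams` and the sample sentence are illustrative choices, not part of any particular library, and splitting on whitespace is a simplifying assumption about tokenization.

```python
from collections import Counter


def ngrams(items, n):
    """Return every contiguous run of n items from `items`, each as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L has L - n + 1 windows of size n (none if L < n).
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Word-level items: naive whitespace tokenization (an assumption for brevity).
words = "to be or not to be".split()
bigrams = ngrams(words, 2)

# Counting the n-grams gives the kind of frequency data a corpus would yield.
bigram_counts = Counter(bigrams)

# Letter-level items: the same function works on a plain string.
letter_trigrams = ngrams("text", 3)
```

The same function works for any item type that can be put in a sequence, so phonemes or base pairs could be handled by passing a list of those symbols instead of words.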
Example
| frequency | word1 | word2 | word3 |
|-----------|-------|-------|-------|
| 1419 | much | the | same |
| 461 | much | more | likely |
| 432 | much | better | than |
| 266 | much | more | difficult |
| 235 | much | of | the |
| 226 | much | more | than |
Downloadable n-gram sets for English
- Google n-grams, based on the web as of 2006.
- COCA n-grams, based on the Corpus of Contemporary American English [COCA], which contains 450 million words from 1990 to 2012.
With n-gram data (2-, 3-, 4-, and 5-word sequences with their frequencies), we can run queries offline, such as finding the most frequent words that follow a given word or phrase.
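One kind of offline query can be sketched as follows, using the six 3-gram counts from the example table as the only data. The helper names `continuations` and `relative_frequencies` are hypothetical, and the in-memory list stands in for a downloaded n-gram file whose actual format is not described here. The resulting shares are computed over the listed rows only, so they are not estimates for English as a whole.

```python
from collections import Counter

# The six 3-grams from the example table: (frequency, word1, word2, word3).
TRIGRAMS = [
    (1419, "much", "the", "same"),
    (461, "much", "more", "likely"),
    (432, "much", "better", "than"),
    (266, "much", "more", "difficult"),
    (235, "much", "of", "the"),
    (226, "much", "more", "than"),
]


def continuations(rows, *prefix):
    """Return (next_word, summed_frequency) pairs for n-grams that start
    with `prefix`, ordered from most to least frequent."""
    k = len(prefix)
    totals = Counter()
    for freq, *words in rows:
        # Keep rows that begin with the prefix and have a word after it.
        if tuple(words[:k]) == prefix and len(words) > k:
            totals[words[k]] += freq
    return totals.most_common()


def relative_frequencies(rows, *prefix):
    """Share of each continuation among the matching rows only."""
    hits = continuations(rows, *prefix)
    total = sum(freq for _, freq in hits)
    return {word: freq / total for word, freq in hits}


after_much_more = continuations(TRIGRAMS, "much", "more")
after_much = continuations(TRIGRAMS, "much")
shares = relative_frequencies(TRIGRAMS, "much", "more")
```

Here `after_much_more` ranks "likely" ahead of "difficult" and "than", matching the order of those rows in the table, while `after_much` sums the three "much more" rows into a single entry for "more".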