Vietnamese Text Corpus

This site is open for test purposes only. Not all functions work yet, and corpus content will change.

About the SEAlang Library Vietnamese Text Corpus

This monolingual corpus consists of Vietnamese texts published on the Internet, sampled here for research and educational purposes. We are using a combination of newspaper, literary, and Wikipedia texts.

- context searches show how the search target appears in context, taking both leading and trailing collocates (or neighboring words) into account. This search returns a merged list of leading and trailing collocates.

- collocate searches are better for focusing on the search target's immediate neighbor. This search returns separate lists of leading and trailing collocates.

- merged view allows for fast switching between collocate and context views. Try the brief view first: downloaded pages may be very large, and a slow browser may fall behind in displaying the detailed view. The Go! button invokes the brief view.

- raw contexts show the search word in context without any attempt at analysis or sanity-checking (the local segmentation that helps ensure that a real word has been found).
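The difference between the merged and the separate collocate lists described above can be sketched in a few lines of Python. This is only an illustration, not the site's implementation: the function name collocate_counts, the toy sentence, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import Counter

def collocate_counts(tokens, target):
    """Count the words immediately before and after each hit of `target`."""
    leading, trailing = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        if i > 0:
            leading[tokens[i - 1]] += 1       # neighbor to the left of the hit
        if i + 1 < len(tokens):
            trailing[tokens[i + 1]] += 1      # neighbor to the right of the hit
    return leading, trailing

# Toy text split on whitespace (an assumption; the site's own
# segmentation is not described on this page).
tokens = "tôi học tiếng Việt và anh học tiếng Anh".split()

# Collocate-style result: two separate lists.
leading, trailing = collocate_counts(tokens, "học")

# Context-style result: one merged list of leading and trailing collocates.
merged = leading + trailing
```

Here leading and trailing correspond to the separate lists of a collocate search, while merged combines both sides in the way a context search is described as doing.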

Usage tips

Because the underlying text corpus may be quite large (more than 50 million characters in this implementation), results may be taken from a random sample of hits. For common words, this means that sample contexts and exact collocate frequencies will vary from run to run.
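To see why sampled results vary from run to run, consider the following minimal sketch. How the site actually draws its sample is not documented here, so the function sample_hits, its max_hits parameter, and the use of uniform random sampling are assumptions for illustration only.

```python
import random

def sample_hits(hit_positions, max_hits, seed=None):
    """Return at most `max_hits` hit positions, chosen uniformly at random."""
    rng = random.Random(seed)  # seed=None draws a different sample on each run
    if len(hit_positions) <= max_hits:
        return list(hit_positions)            # rare word: every hit is kept
    return sorted(rng.sample(hit_positions, max_hits))

# Hypothetical hit positions for a common word.
all_hits = list(range(0, 10000, 10))          # 1000 hits
shown = sample_hits(all_hits, max_hits=50)    # differs from run to run
fixed = sample_hits(all_hits, max_hits=50, seed=1)  # repeatable with a seed
```

For a rare word every hit is returned, so results are stable. For a common word only a subset is used, so contexts and collocate counts computed from it will differ slightly each time unless the sample is fixed.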

Clicking a word or collocate starts a new search: yellow searches for contexts, and black searches for collocates.

Look for continuing development of SEAlang Library Vietnamese resources.