About the SEAlang Library Khmer Text Corpus
This monolingual corpus consists of Khmer texts published on the Internet, sampled here for research and educational purposes.
- context searches show how the search target appears in context, taking both leading and trailing collocates (or neighboring words) into account. This search returns a merged list of leading and trailing collocates.
- collocate searches are better for focusing on the search target's immediate neighbor. This search returns separate lists of leading and trailing collocates.
- merged view allows for fast switching between collocate and context views. Try brief first - downloaded pages may be very large, and a slow browser may fall behind in displaying the detailed view. The Go! button invokes the brief view.
- raw contexts show the search word in context without any attempt at analysis or sanity-checking (local segmentation that helps ensure that a real word has been found).
- restrict collocates requires (or forbids) all collocates to have at least one sense with a particular part of speech or usage.
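The difference between separate and merged collocate lists can be illustrated with a minimal sketch. It assumes the text has already been split into a list of tokens, which is a simplification for illustration only; the function names `collocates` and `merged_collocates` are hypothetical and do not describe how the SEAlang Library itself is implemented.

```python
from collections import Counter

def collocates(tokens, target):
    """Count the immediate left and right neighbors of every occurrence of target.

    Returns two Counters: (leading, trailing), as in a collocate search.
    """
    leading, trailing = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        if i > 0:                      # a word precedes the hit
            leading[tokens[i - 1]] += 1
        if i + 1 < len(tokens):        # a word follows the hit
            trailing[tokens[i + 1]] += 1
    return leading, trailing

def merged_collocates(tokens, target):
    """Combine leading and trailing counts into one list, as in a context search."""
    leading, trailing = collocates(tokens, target)
    return leading + trailing

# Placeholder tokens stand in for Khmer words.
tokens = "a b c b d a b".split()
leading, trailing = collocates(tokens, "b")
merged = merged_collocates(tokens, "b")
```

Here `leading` and `trailing` keep the two sides apart, while `merged` adds the counts together, so a word that appears on both sides of the target is ranked by its combined frequency.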
Segmentation in the SEAlang Library Corpus
Southeast Asian writing is normally broken into phrases, rather than individual words. As yet, it is not possible to automatically segment a sentence into words correctly all of the time. The problem is made even more difficult by the authentic texts we provide: by their nature, they contain many names, loanwords, and misspellings that will undo even the best segmentation algorithm.
Rather than trying to segment our corpus texts in advance, we
use peephole segmentation - we only try to segment the
search target's immediate neighbors, on the fly.
This will sometimes produce incorrect results.
However, it is far more robust, and returns much more potentially useful data.
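As a rough illustration of the peephole idea, the sketch below finds each occurrence of the search target in unsegmented text and segments only the word on either side of it. The dictionary lookup with greedy longest match is an assumption made for this example; the page does not say which method the SEAlang Library uses, and all names here (`peephole_hits`, `lexicon`, `max_len`) are hypothetical.

```python
def longest_word_after(text, pos, lexicon, max_len=12):
    """Return the longest lexicon entry that starts at pos, or None."""
    for length in range(min(max_len, len(text) - pos), 0, -1):
        candidate = text[pos:pos + length]
        if candidate in lexicon:
            return candidate
    return None

def longest_word_before(text, pos, lexicon, max_len=12):
    """Return the longest lexicon entry that ends just before pos, or None."""
    for length in range(min(max_len, pos), 0, -1):
        candidate = text[pos - length:pos]
        if candidate in lexicon:
            return candidate
    return None

def peephole_hits(text, target, lexicon):
    """For each occurrence of target, segment only its immediate neighbors.

    Returns a list of (leading, trailing) pairs; either side may be None
    when no lexicon entry matches (a name, loanword, or misspelling).
    """
    hits = []
    start = text.find(target)
    while start != -1:
        end = start + len(target)
        hits.append((longest_word_before(text, start, lexicon),
                     longest_word_after(text, end, lexicon)))
        start = text.find(target, start + 1)
    return hits

# Unspaced Latin text stands in for unsegmented Khmer.
lexicon = {"the", "cat", "sat", "on", "mat", "at"}
hits = peephole_hits("thecatsatonthemat", "sat", lexicon)
```

Because nothing outside the neighborhood of each hit is segmented, an unknown word elsewhere in the sentence cannot spoil the result; a failure is confined to the one hit where it happens.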
Because the underlying text corpus may be quite large (more than 50 million characters in this implementation), results may be taken from a random sample of hits. For common words, this means that sample contexts and exact collocate frequencies will vary from run to run.
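The effect of sampling can be sketched as follows. This is only an illustration under assumed names (`sample_hits`, `limit`, `seed`); the actual sample size and sampling method used by the SEAlang Library are not stated here.

```python
import random

def sample_hits(hits, limit, seed=None):
    """Return at most `limit` hits, chosen at random when there are more.

    With seed=None the selection differs from run to run, which is why
    sample contexts and collocate counts for common words are not stable.
    """
    if len(hits) <= limit:
        return list(hits)          # rare words: every hit is kept
    rng = random.Random(seed)      # a fixed seed makes the sample repeatable
    return rng.sample(hits, limit)

all_hits = list(range(1000))       # stand-ins for hit positions in the corpus
run_a = sample_hits(all_hits, 50)
run_b = sample_hits(all_hits, 50)  # very likely a different selection than run_a
```

Rare words fall under the limit and always return the same results; only frequent words are affected by the sampling.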
Clicking on a word/collocate with the mouse starts a new search:
searches for contexts, and
searches for collocates.
Look for continuing development of SEAlang Library Khmer resources throughout Winter/Spring, 2007.