SEAlang Thai Web Corpus

About the SEAlang Thai Web Corpus

The Thai Web Corpus is an experimental research tool built to:

- support research studying Thai-language change and growth,

- assist in Thai dictionary design and lexicographic research,

- evaluate new techniques in web-as-corpus design.

The Thai Web Corpus supports SEAlang's Thai in Transition research project, which is investigating Thai perceptions of and responses to language change. Questions include:

- do foreign loans have any discernible impact on Thai syntax or grammar?

- what semantic and phonological nativization processes serve to 'Thai-ize' loans?

- how effective is the Royal Institute's prescriptive orthography for slang and loans?

- are there any characteristic phonological processes that tend to generate new slang?

- does the distribution of slang characterize Thai Web space, or the slang itself?

- can dictionaries developed without text corpora hope to get it right?

- what semantic niches do foreign borrowings fill?

- what is the grammatical distribution of new and borrowed words?

- how widely used are suggested 'traditional Thai' alternatives?

A gigaword / terabyte Thai corpus
We estimate that by the end of 2007, roughly 100 million Thai-language Web pages had been indexed by the major search engines (based on counts of the extremely common word XXX). A conservative estimate of 1,000 Thai characters per page implies that this Web corpus includes some 100 billion characters, or more than 25 billion words.
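The arithmetic behind this estimate can be checked with a short sketch. The page and character counts are the ones quoted above; the average of four characters per word is an assumption inferred from the totals, not a figure stated in the text.

```python
# Back-of-the-envelope corpus size estimate, using the figures quoted above.
PAGES = 100_000_000       # estimated indexed Thai-language pages
CHARS_PER_PAGE = 1_000    # conservative per-page estimate from the text
CHARS_PER_WORD = 4        # ASSUMPTION: inferred from the totals, not stated

total_chars = PAGES * CHARS_PER_PAGE
total_words = total_chars // CHARS_PER_WORD
gigaword_multiples = total_words / 1_000_000_000

print(f"{total_chars:,} characters")
print(f"{total_words:,} words")
print(f"{gigaword_multiples:.0f} x a gigaword corpus")
```

A shorter average word length would push the word count above 25 billion, which is consistent with the "more than" in the estimate.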

Thus, we are already able to consult a gigaword corpus 25 times over, and can anticipate access to a terabyte corpus within a few years. Most written genres are extremely well represented, with the particular exception of literature. Perhaps unexpectedly, spoken Thai - the most dynamic area of language development - has extensive on-line coverage thanks to the great popularity of blogs and message boards. The Thai Web Corpus is an unstructured, more-is-better collection.

An alternative approach taken by the Thai National Corpus (TNC) specifies the domain, time, and original medium of corpus content. The TNC follows the general criteria set by the British National Corpus, and strictly limits the amount of Internet-sourced material to below 5%. The TNC is hand-segmented, and all editorial content is tagged following the TEI corpus-encoding standard. The initial corpus of 80 million words honors the 80th birthday of Thailand's King Bhumibol. As a balanced, fixed reference corpus, the TNC will play a key role in Thai-language research.

Linked resources
Research resources based on recent publications are linked to this page. They include:

Dictionary of New Words. Royal Institute, Bangkok: 2007.

Foreign Words That Can be Replaced with Thai Words. Royal Institute, Bangkok: 2006.

Slang Words. Jintana Phutthameta, MA thesis, Srinakharinwirote University, Bangkok: 2003.

Dictionary of Slang Words. Department of Curriculum and Instruction Development, Ministry of Education, Bangkok: 2000.

Research considerations
The Thai Web Corpus has several inherent behavioral characteristics that must be taken into account when using it as a formal research tool. They include:

reproducibility: The Web is constantly in flux, and results may change on each query.

balance: As Web space has matured, it is increasingly used as a repository for all conceivable printed materials. Excess coverage of any genre can be managed, but Thai Web space appears to have a relatively limited store of purely literary content.

segmentation artifacts: Although Google and Yahoo apparently attempt to segment Thai text (so that returned words always represent true word boundaries), their segmentation algorithms are imperfect and unpublished.

order artifacts: Search engines attempt to return pages of greatest interest first. What effects, if any, this may have on the 'authenticity' of extracted example contexts is unknown.

count artifacts: Reported page counts almost always vary by about 10% between the first and subsequent pages, and may vary by an order of magnitude even for a single search engine (observed for Google queries that originate in Thailand vs. in the US).

Features
The Web Corpus provides several kinds of searching and sampling capability.

Search returns results that are similar to those of Google or Yahoo searches, but which can be restricted or sorted in useful ways.

Sample extracts and returns the immediate contexts of the search terms.

Analyze performs an extensive collocate analysis, using capabilities provided by the SEAlang Library corpus tools.

Predict looks ahead to suggest likely phrases.
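As an illustration of what the Sample feature returns, the following is a minimal keyword-in-context sketch. It is not the Web Corpus implementation: the function name `sample_contexts` and the `width` parameter are hypothetical, and the sketch works on raw character offsets in a text already in hand rather than on live search results.

```python
def sample_contexts(text, term, width=20):
    """Return (left, term, right) triples for each occurrence of term.

    Hypothetical helper: contexts are cut by character offset, so no
    word segmentation is required.
    """
    if not term:
        return []  # an empty search term has no meaningful context
    contexts = []
    start = text.find(term)
    while start != -1:
        end = start + len(term)
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        contexts.append((left, term, right))
        # Continue searching after the current match.
        start = text.find(term, end)
    return contexts


example = "the web corpus is a web tool"
for left, term, right in sample_contexts(example, "web", width=4):
    print(f"{left!r} [{term}] {right!r}")
```

Cutting contexts by character count rather than by word sidesteps the segmentation question for Thai, at the cost of contexts that may begin or end mid-word.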

Innovations
The Web Corpus introduces several novel ideas, including:

Query randomization divides results into disjoint sets. This has a number of applications, including churning the results provided by search and sample queries, and forcing the search engines beyond the customary 1,000 item limit.

Search without is another forcing technique, in which common collocates are intentionally excluded in order to produce a richer sample of less-common items.

Web balancing uses a series of searches in different domains (for example, blogs, news media, etc.) to provide a more diverse set of results.

Context-sensitive predictive completion relies on the fact that Thai, Khmer, Lao, and Burmese have disjoint Unicode spaces to provide an appropriate action for either English or L2 query entry.
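The script-detection step that context-sensitive completion relies on can be sketched as follows. The block ranges are those of the Unicode Standard; the function name `detect_script`, and the choice to return `None` for input containing none of these scripts (for example, English), are assumptions made for illustration rather than a description of the Web Corpus code.

```python
# Unicode block ranges for the four scripts (from the Unicode Standard).
SCRIPT_RANGES = {
    "Thai": (0x0E00, 0x0E7F),
    "Lao": (0x0E80, 0x0EFF),
    "Burmese": (0x1000, 0x109F),  # Unicode block name: Myanmar
    "Khmer": (0x1780, 0x17FF),
}


def detect_script(query):
    """Return the name of the first listed script found in query.

    Hypothetical helper: returns None when no character falls in any
    of the ranges, e.g. for English or other Latin-script input.
    """
    for ch in query:
        code_point = ord(ch)
        for name, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                return name
    return None


print(detect_script("\u0e20\u0e32\u0e29\u0e32"))  # Thai-script input
print(detect_script("language"))                  # Latin-script input
```

Because the ranges do not overlap, a single character is enough to decide which completion behavior to apply.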