SEAlang Lab Data
The SEAlang Lab explores the potential of a variety of data resources.  We use Thai as a demonstration language because 'raw' resources are so readily available, but building facilities for Arabic, Chinese, Urdu, and other complex-script languages is our ultimate goal. 
Bitexts are our primary text resource because they are so versatile.  They are found in a variety of locations, including:
-- traditional readers prepared for first-year students, supplemented by purpose-built instructional texts, e.g. texts from the LangNet project.
-- translated texts, with the caveat that they must be reasonably close translations. 
-- newspaper stories, particularly those that have been translated for teaching purposes.
-- thesis abstracts, which are an excellent source of technical terminology, but can present serious quality problems.
-- dictionary examples are not used here (because we have alternatives), but they are often the only available source for LCTLs.
-- translated audio is also not available yet; we have transcribed (but not translated) audio samples.
    All texts have some special value.  While there is a tendency to perceive native-language source texts as more 'authentic' than translated texts, the latter can be more effective for teaching.  Translated text is often easier to follow, so the student spends more time on task and covers far more ground. 
    Not all of these texts were originally translated, and none of them had been sentence-aligned.  A considerable amount of work -- not always satisfying, as when texts must be rejected because of missing or reordered segments in the translation -- is required to prepare bitexts for this application.
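The rejection step described above can be partly automated.  The sketch below flags sentence pairs whose length ratio looks implausible, a rough proxy for missing or reordered segments; the function, threshold, and sample data are illustrative assumptions, not the actual SEAlang pipeline.

```python
# Sanity-check a candidate sentence alignment: a pair whose character-length
# ratio is far from 1:1 often signals a dropped or reordered segment.
# The max_ratio threshold of 3.0 is an assumed value for illustration.

def suspicious_pairs(source_sents, target_sents, max_ratio=3.0):
    """Return indices of sentence pairs whose length ratio looks implausible."""
    if len(source_sents) != len(target_sents):
        raise ValueError("segment counts differ; text needs manual review")
    flagged = []
    for i, (src, tgt) in enumerate(zip(source_sents, target_sents)):
        ratio = max(len(src), 1) / max(len(tgt), 1)
        if ratio > max_ratio or ratio < 1.0 / max_ratio:
            flagged.append(i)
    return flagged
```

A flagged pair is only a candidate for rejection; a human still decides whether the translation is genuinely loose or the segmentation is wrong.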

Vocabulary lists are surprisingly underdeveloped.  Our primary sources are:
-- textbook lists, extracted from existing textbooks.
-- shared lists, in particular, lists distributed in .b4u format as used by the BYKI reader from Transparent Language.
-- frequency-based wordlists developed at SEAlang.  We used both corpus- and WebRank-based word frequency lists to establish a rough idea of the n-thousand most commonly encountered words.  We then used this as a guide in building thematic, semantic, and situational wordlists.
-- academic wordlists, also developed at SEAlang.  We used Averil Coxhead's English AWL as our starting point. 
    Developing both frequency-based and academic wordlists has been difficult.  For example, the fact that Thai is not segmented into individual words means that even counting words is fraught with peril, and subject to over- and undercounts.  Another issue is the relative preponderance of longer Thai lexical constructs for expressing single-word English concepts.  Nevertheless, these are simply the Thai variations of roadblocks that will confront many LCTLs. 
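The word-counting problem can be made concrete with a small sketch.  Greedy longest-match segmentation against a lexicon is one common (and imperfect) way to tokenize unsegmented text such as Thai; the toy lexicon and Latin-alphabet string below are stand-ins chosen so the ambiguity is easy to see.

```python
# Greedy longest-match segmentation: repeatedly take the longest prefix
# found in the lexicon, falling back to a single character for unknowns.
# With lexicon {"ab", "abc", "cd", "d"}, "abcd" segments as ["abc", "d"],
# even though ["ab", "cd"] is equally valid -- which is exactly why word
# counts over unsegmented text are subject to over- and undercounts.

def segment(text, lexicon, max_len=None):
    """Split text by repeatedly taking the longest known prefix."""
    max_len = max_len or max(map(len, lexicon))
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens
```

Different lexicons, or a different matching strategy, yield different token counts over the same text, so any Thai frequency list carries the fingerprint of its segmenter.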
    The basic problem is that every individual organizing principle -- difficulty, theme, semantics, or situations -- eventually breaks down when one tries to group several thousand words into session-sized (about 20 items) packages.  Ultimately, though, it is the combination of coverage (the student can steadily advance through the lexicon) and packaging (there is some obvious method behind each vocabulary list's composition) that appears to be essential. 
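The simplest packaging scheme -- pure coverage -- is just chunking a frequency-ranked list into consecutive lessons.  The sketch below shows that baseline; the 20-item session size comes from the text, while everything else is an illustrative assumption (thematic and semantic grouping would require much more than this).

```python
# Chunk an ordered (e.g. frequency-ranked) wordlist into session-sized
# packages, so a learner advances steadily through the lexicon.

def package(words, session_size=20):
    """Split an ordered wordlist into consecutive session-sized lessons."""
    return [words[i:i + session_size]
            for i in range(0, len(words), session_size)]
```

This guarantees coverage but no coherence within a session; the hard part, as noted above, is imposing a theme or situation on each package without breaking the frequency ordering.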
Audio samples that have been transcribed and chopped up into sentence-sized examples are difficult to find.  We have been very fortunate in locating a large data set of Thai samples that was purpose-built to support speech-recognition and text-to-speech research.  Nevertheless, this still required a fair amount of work to be 'repurposed' for use here; e.g. filtering out pauses without making the audio seem choppy.  
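One way to filter pauses without producing choppy audio is to shorten, rather than remove, runs of near-silence.  The sketch below collapses long silent runs in a PCM sample stream to a fixed maximum; the threshold and run length are assumed values, and the real repurposing work involved much more than this.

```python
# Collapse long runs of near-silent samples to at most `keep` samples,
# leaving a short pause in place so transitions do not sound abrupt.
# threshold (amplitude) and keep (sample count) are illustrative defaults.

def collapse_pauses(samples, threshold=0.02, keep=100):
    """Shorten runs of near-silence to at most `keep` samples."""
    out, run = [], 0
    for s in samples:
        if abs(s) < threshold:
            run += 1
            if run <= keep:
                out.append(s)
        else:
            run = 0
            out.append(s)
    return out
```

Keeping a stub of each pause, instead of deleting it outright, is what avoids the choppiness mentioned above.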
On-line texts are the greatest potential resource, but can present the most problems in the short run. 
    Web pages that consist primarily of text content can usually take advantage of Lab facilities for segmentation, vocabulary extraction, and the like.  However, pages with large amounts of embedded images, Flash presentations, advertising links, and so on -- as well as pages that were badly coded to begin with -- will have problems. 
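The text-extraction step can be sketched with the standard library alone.  The example below pulls visible text from an HTML page while skipping script and style blocks; it is a bare-bones illustration, and the media-heavy or badly coded pages described above need far more cleanup than this.

```python
# Extract visible text from HTML, ignoring <script> and <style> content.
# Uses only Python's standard-library html.parser.

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skipping = 0  # depth inside skipped elements

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skipping += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skipping:
            self.skipping -= 1

    def handle_data(self, data):
        if not self.skipping and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

A permissive parser like this tolerates the badly coded pages mentioned above, but its output quality degrades with them, which is why informational pages work better than media pages.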
    At present, we've built in links to a number of media pages because when they work, they make for great examples.  However, the tool will generally work better with informational pages.