SEAlang Library: Features for Teachers
(Summer 2006)
Introduction
The SEAlang Library is a tool for teaching and research, as well as for student reference. The dictionary, corpus, and bitext resources can all produce materials for classroom use, testing, and study, and can help the instructor gain fresh insight into the kinds of problems that students face.
Coverage
The SEAlang Library will provide dictionaries, corpora, and (when available) bitext corpora for the national languages Thai, Burmese, Lao, Khmer, and Vietnamese, and dictionaries for Mon, Shan, and Karen.
Data contributions We are always eager to extend SEAlang
Library coverage, so please get in touch if you have additional dictionaries or
texts for any of these languages.
Other languages For 2006-2009, SEAlang is focusing on
Availability of features Library features will vary depending on underlying data resources. Our goal is to get existing texts on line as rapidly as possible, adding two languages per year, with one or two dictionaries per language. While we’re working with the best materials we could find, there are variations in quality. Still, each language makes the best use of available resources.
Design All of the SEAlang tools provide significant innovations in functionality, user-interface design, and the display of potentially very large amounts of data. SEA language-specific features have been incorporated whenever possible. Please let us know if any additional features would be helpful.
Extension The SEAlang Library
is designed to allow ongoing extension and updates as new materials become
available, and as you – the SEAlang user community – become interested in
improving existing resources.
Tools 1: Dictionary
Searching The SEAlang Library’s digital dictionaries are not simply electronic equivalents of traditional print texts. Rather, they allow many kinds of searches that printed books are not capable of providing. These include:
Approximation The exact details of
phonemic and orthographic approximation vary from language to language, but the
underlying principles are always the same:
Although we have a fair amount of insight into the best rule
set for each language, we consider this to be a preliminary implementation and
welcome user comments based on classroom experience.
Wild cards All Unix-like regular expressions (e.g. as found in perl, vi, or grep) are allowed. In particular:
^ match at beginning
$ match at end
. match any single character
.* match any sequence of characters
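The wild cards above behave the same way in most regular-expression engines. As a minimal sketch, assuming Python's re module and a small made-up headword list (neither is part of the SEAlang Library itself), the four patterns can be tried out like this:

```python
import re

# Hypothetical headword list, for illustration only.
HEADWORDS = ["house", "housing", "boathouse", "horse", "mouse"]

def search(pattern, words=HEADWORDS):
    """Return the words in which the regular expression matches anywhere."""
    rx = re.compile(pattern)
    return [w for w in words if rx.search(w)]

print(search("^hou"))     # ^ anchors the match at the beginning
print(search("house$"))   # $ anchors the match at the end
print(search("^.ouse$"))  # . stands for any single character
print(search("^h.*e$"))   # .* stands for any sequence of characters
```

Without ^ or $, a pattern may match anywhere inside a word, so anchoring is the way to ask for whole-word matches in this sketch.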
Match length Phonetic and orthographic matches can be restricted to a) whole words or compounds, b) headwords or words found within compounds, or c) syllables or longer. In some cases, we use dictionary data that does not distinguish between headwords and compounds, so this rule cannot always be applied properly.
Q & A
Why don’t all of the entries have part-of-speech / phonetic / antonym & synonym / classifier / etc. data?
We take the dictionary data as we find it. Our first goal has been to get useful data on line; expect cleanup and improvement over the next few years.
Why was there a question mark / square box in the phonetic?
Occasionally an oddball character (usually in the phonetic) wasn’t
cleaned up or converted to Unicode properly – we’re fixing these asap.
Why isn’t every compound word associated with the proper head?
When a head has etymologically distinct orthonyms, compounds have to be segmented and associated with the right head individually. This takes time, but we’re working on it (e.g. we’ve just disambiguated more than 13,000 Burmese compounds). Note that an appropriate head entry doesn’t always exist in the original dictionary, either.
Tools 2: Corpus
Monolingual text corpora have attracted considerable interest in the past decade for several reasons.
Native-speaker ability is not always helpful in anticipating the difficulties that students will encounter. For example, native speakers automatically filter out traditionally assigned literal meanings that conflict with common sense (do we really beg for a pardon in English, or ask for punishment in Thai?), but learners do not have this built-in radar. Corpus evidence encourages the teacher (and lexicographer) to account for such uses sensibly.
For teachers of less commonly taught languages, text corpora can play an important role in filling the gap left by a lack of suitable guides to grammar and syntax. The SEAlang Library corpus near feature, as well as the ability to restrict collocates to particular usage or parts of speech, are specifically designed to elicit larger-scale text phenomena. These include split constructions, modals (which may be restricted to preceding or following a verb), classifiers, class terms, and so on.
A corpus is also an excellent source of drill and test material. Corpus results can be cut and pasted, with difficult words removed or other changes made as needed, to create a stock of raw materials for cloze tests, rearrange-the-word drills, translation tests, etc. The ready availability of such material is particularly useful in real-world classroom environments, where it may be helpful to create multiple sets of roughly equivalent texts to serve as practice guides and makeup tests.
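As an illustration of how pasted corpus lines might be turned into cloze items, here is a minimal sketch in Python. The function name make_cloze and its parameters are hypothetical and not part of the SEAlang Library. The sketch also assumes space-delimited text (e.g. English or Vietnamese); scripts written without spaces between words, such as Thai, would need word segmentation first.

```python
import re

def make_cloze(sentence, targets, blank="____"):
    """Blank out each target word in the sentence; return (text, answers)."""
    if not targets:
        return sentence, []
    answers = []

    def blank_out(match):
        # Remember the removed word so an answer key can be produced.
        answers.append(match.group(0))
        return blank

    # Try longer targets first, and match whole words only.
    ordered = sorted(targets, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    return pattern.sub(blank_out, sentence), answers

text, key = make_cloze("The teacher gave the students a short test.",
                       ["teacher", "test"])
print(text)
print(key)
```

Running the same function over several corpus lines that contain the same target word is one way to build roughly equivalent practice and makeup sets.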
The SEAlang Library corpus tool provides these basic functions:
For example, here is a search for collocates. The items highlighted in blue are items that are already dictionary entries. This feature is particularly helpful for revealing items that should be compounds listed in the dictionary, but are not.
Here is a context search, showing both left- and right-hand matches (these can be retrieved separately as well; note the yellow ‘show leading …’ and ‘show trailing …’ tags above). The < or > means “the word came from this side”:

Two search targets can be provided:
The corpus itself does not require advance preparation of any kind (other than being in plain-text Unicode). Please contact us if you have a specialized corpus (e.g. transcribed speech) you would like to share.
Text corpora can be very large. While this is necessary for finding less common targets, subsampling is a more effective alternative for ordinary terms. In practice, most results are found using these three steps:
Thus, for common search targets the corpus tool will produce a different set of results each time. The gross distributions of word + collocate(s) will remain more or less the same, but the specific examples counted and returned will be different.
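The effect of subsampling on results can be illustrated with a minimal Python sketch. It is not the SEAlang implementation: the function sample_collocates, its parameters, and the whitespace tokenization are all assumptions made for illustration.

```python
import random
from collections import Counter

def sample_collocates(lines, target, sample_size, seed=None):
    """Count right-hand neighbors of target in a random sample of matching lines."""
    rng = random.Random(seed)
    # Keep only the lines that contain the target as a separate token.
    hits = [line for line in lines if target in line.split()]
    # Draw a random subsample instead of scanning every hit.
    sample = rng.sample(hits, min(sample_size, len(hits)))
    counts = Counter()
    for line in sample:
        words = line.split()
        for i, word in enumerate(words[:-1]):
            if word == target:
                counts[words[i + 1]] += 1
    return counts

corpus = ["the house is big", "a house is small",
          "my house was old", "no match here"]
print(sample_collocates(corpus, "house", sample_size=2))
```

With no seed, each call may draw different lines, so the individual examples change from run to run, while over a large corpus the relative frequencies of the collocates should stay roughly stable.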
Finally, the ability to specify and/or restrict collocate types is still being developed. For example, symbolic entries like N (number), C (classifier) and so on are reasonable candidates for implementation, as is the ability to require that collocates have particular POS or usage tags. Please contact us (preferably with a prepared list of items) if this sort of specification would be helpful.
Q & A
Why are there oddball characters (like question marks) at the beginning and end of each corpus line?
These are fractional parts of Unicode characters. We’ll be cleaning these up soon.
What does the underline mean?
An underline, as in “_house”, represents a space or newline.
The word you say is the left or right neighbor is obviously just part of a longer word. How come it was returned anyway?
Because perfect segmentation by computer is hard!
Tools 3: Bitext Corpus
Bitext Corpus Aligned bitexts are a traditional tool of
European language instruction, where bilingual literary texts have been widely available
for many decades. They are rarely used in
The SEAlang Library bitext tools provide these basic functions. Two search targets can be provided:
Searches can be in either, or both, a Southeast Asian L1 or
(usually)
These alternatives can be extremely helpful for finding
atypical translations or expressions.
As in the dictionary reverse search, the bitext corpus tools support derivational expansion of English search terms, so that house expands to house, houses, housing, and housed.
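A deliberately naive Python sketch shows what such an expansion could look like. The suffix rules and the function names expand and expansion_pattern are hypothetical; the SEAlang tools may well work differently, and this version ignores consonant doubling, words ending in -y, and irregular forms.

```python
import re

def expand(term):
    """Return naive English variants of term: plain, -s, -ing, and -ed forms."""
    if term.endswith("e"):
        # Drop the final e before -ing, and add only -d for the past form.
        return [term, term + "s", term[:-1] + "ing", term + "d"]
    return [term, term + "s", term + "ing", term + "ed"]

def expansion_pattern(term):
    """Compile a regular expression matching any expanded form as a whole word."""
    alternatives = "|".join(re.escape(form) for form in expand(term))
    return re.compile(r"\b(?:" + alternatives + r")\b")

print(expand("house"))
rx = expansion_pattern("house")
print(rx.findall("They housed the books in two houses near my house."))
```

Searching with the combined pattern retrieves sentences containing any of the expanded forms in a single pass.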
At present (Summer, 2006) only Thai-English bitexts are supported (but only because closely translated material is so difficult to find). Most of our source material was originally English, given a close translation into Thai by design. In our experience, translation from a Southeast Asian language tends to be problematic, and does not yield the orderly sentence-by-sentence alignment usually sought from bitext corpora in this application.
Q & A
Why isn’t the search word’s translation highlighted as well?
Because the bitexts are only aligned sentence by sentence. You’re welcome to pursue spotting the translation as a research project!
Other Research Applications
The SEAlang Library is meant to support research in SEA linguistics and language education. Please see the Programmers Guide for additional information on system features and implementation.