The Southeast Asian Languages Library
The Southeast Asian Languages Library – SEAlang Library, for short – is a
technically innovative plan to build core lexical resources for all Southeast
Asian languages, starting with the difficult scripts used by the five mainland
countries.
Broad support for the SEAlang Library reflects its importance to the Southeast
Asian Studies community. In preparing this proposal, the University of
Wisconsin-Madison Center for Southeast Asian Studies (CSEAS, host of the
Southeast Asian Studies Summer Institute, SEASSI) and its co-sponsor, the
Center for Research in Computational Linguistics (CRCL Inc., a US 501(c)(3)
nonprofit), have made concrete plans for cooperation with the Center for Khmer
Studies (CKS, Siem Reap), the École française d'Extrême-Orient (EFEO), the
Committee on Research Materials on Southeast Asia (CORMOSEA), the Coalition of
Teachers of Southeast Asian Languages (COTSEAL), and NGO-based ‘open source’
software projects in Burma, Laos, Thailand, Cambodia, and Vietnam.
The SEAlang Library will provide:
DICTIONARIES: we will prepare XML-metatagged digital bilingual
dictionaries, based on the best available print reference works – often
difficult to obtain from
TEXT CORPORA: we will build monolingual and aligned bitext corpora. Used to
study collocation and usage, and to support data-driven language learning,
these are necessary precursors to more advanced translation tools and to
monolingual and cross-language information retrieval tools. We will provide
substantial monolingual corpora (up to tens of millions of words) for each
majority language, along with the largest feasible aligned two-language
corpora (hundreds of thousands of words for Thai and Vietnamese, and fewer for
the others), drawn from both online resources and on-the-ground publishing
contacts.
SOFTWARE: we will build information access tools for Southeast
Asian scripts, including tools for segmentation and transliteration, conversion
between font encodings, text harvesting and indexing, and statistical analysis.
User applications, including the SEA-Search query builder, the SEA-Cat
Library of Congress Romanization / cataloging utility, the SEA-Read reader’s
helper, and the SEA-See text-as-image utility for scripts (like Khmer)
that are difficult to render in Unicode, will be linked to dictionaries, text
corpora, and transliteration engines to help fulfill the promise of regional
information access.
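
As a rough illustration of how an XML-tagged bilingual dictionary could be linked to a lookup tool, the sketch below parses a tiny Thai-English sample and builds a headword index. The element names (dictionary, entry, headword, pos, gloss), the sample entries, and the helper functions are assumptions made up for this example; they are not the SEAlang Library's actual markup or software.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: these tag names are illustrative assumptions,
# not the project's real dictionary schema.
SAMPLE = """
<dictionary source="th" target="en">
  <entry>
    <headword>น้ำ</headword>
    <pos>noun</pos>
    <gloss>water</gloss>
  </entry>
  <entry>
    <headword>ข้าว</headword>
    <pos>noun</pos>
    <gloss>rice</gloss>
  </entry>
</dictionary>
"""


def load_entries(xml_text):
    """Parse the XML text and map each headword to its list of glosses."""
    root = ET.fromstring(xml_text.strip())
    index = {}
    for entry in root.findall("entry"):
        headword = entry.findtext("headword")
        glosses = [g.text for g in entry.findall("gloss")]
        index.setdefault(headword, []).extend(glosses)
    return index


def lookup(index, word):
    """Return the glosses for a headword, or an empty list if it is absent."""
    return index.get(word, [])


index = load_entries(SAMPLE)
print(lookup(index, "น้ำ"))
```

A reader's helper or query builder could call a function like lookup on each segmented word; a real dictionary would need richer entries (senses, examples, pronunciation) than this sketch carries.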
The Southeast Asian Languages Library is a long-awaited addition to the
national digital infrastructure being built with the support of a variety of
U.S. Department of Education Title VI programs. It will enable:
pedagogy and new teaching, learning, and translation tools for less commonly
taught languages;
scholarly inquiry in linguistics, history, and lexicography/etymology;
scientific research in computational linguistics and cross-language
information retrieval; and
language reference that is all but unavailable to the 1.8 million Americans of
mainland Southeast Asian heritage who can typically speak – but not read, or
consult reference materials in – their heritage languages.