SEAcat Tools

About the SEAcat Tools
These tools have three goals:

- helping librarians romanize Southeast Asian texts for cataloging consistently and accurately;

- helping readers and scholars identify cataloged items, and track down texts both in-country and internationally;

- helping CRCL researchers assess the difficulty of converting existing romanized records back to their original SEA orthographies, and of identifying near-duplicate records.

This is a preliminary implementation, meant to help us see what kinds of tools and search functions will be helpful, and to assess the quality and availability of MARC records. It is not tied to any particular library database, although it will locate texts that have an ISBN or ISSN via the Open WorldCat and Libraries Australia sites. (Unfortunately, the Libraries Australia searches will not work properly if you have that site's jsessionid cookie set.) Thai texts can also be found via the Thai Union Catalog. SEAcat does not attempt to correct entry errors in the original MARC files, so some information may be tagged inaccurately.

Queries
Queries may be entered in local orthography, conventional IPA phonetic transcription, conventional syllable-by-syllable phonetics in local orthography, or any combination of the three. If you're unsure about spelling, enter the word one syllable at a time, and use the robust or robust & rough search.

Words are automatically translated into the ALA-LC (or Library of Congress) romanization, which is very widely used for cataloging non-Roman alphabet titles.

Word breaks
Although the writing systems of mainland Southeast Asia do not usually put spaces between words, LC romanization does. Unfortunately, even fluent speakers will disagree on where word breaks belong. This makes it difficult to catalog consistently, or to form accurate search queries. For example, khon Thai and Khonthai (Thai person or people) get 76 and 62 hits, respectively, in a preliminary survey of about 43,000 MARC records.

The SEAcat tools will automatically generate all possible search combinations: the query "a b c" will be sought as "a b c | ab c | a bc | abc". This can be used in three ways:

- check word breaks checks every combination against a very large corpus of existing catalog entries, so that past practice can be used as a guide for cataloging.

- robust search tries every combination in searching, so that both correct and incorrect queries can find both correct and incorrect catalog entries.

- robust & rough search also tries every combination. In addition, it will ignore MARC diacritics (which usually show vowel length). This lets us find records that have not been catalogued following ALA-LC guidelines.
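The three modes above can be illustrated with a minimal Python sketch. It is not the SEAcat implementation: the function names, the substring matching, the case folding, and the toy corpus are all assumptions made for illustration, and it assumes the romanized records are already Unicode text. A query of n syllables has n-1 gaps, each either kept as a space or closed up, giving 2^(n-1) variants. The "rough" option drops combining marks after Unicode decomposition, which is one plausible way to ignore romanization diacritics.

```python
import unicodedata
from itertools import product


def break_variants(query: str) -> list[str]:
    """Return every way of keeping or closing the spaces in a query."""
    syllables = query.split()
    if not syllables:
        return []
    variants = []
    # Each of the n-1 gaps is either a space or nothing.
    for gaps in product((" ", ""), repeat=len(syllables) - 1):
        text = syllables[0]
        for gap, syllable in zip(gaps, syllables[1:]):
            text += gap + syllable
        variants.append(text)
    return variants


def strip_diacritics(text: str) -> str:
    """Drop combining marks (macrons, marks under or beside vowels)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def count_hits(query: str, corpus: list[str], rough: bool = False) -> dict[str, int]:
    """Count corpus entries containing each word-break variant.

    Case folding and substring matching are assumptions of this sketch.
    """
    def norm(s: str) -> str:
        s = s.casefold()
        return strip_diacritics(s) if rough else s

    entries = [norm(entry) for entry in corpus]
    return {v: sum(norm(v) in e for e in entries) for v in break_variants(query)}


if __name__ == "__main__":
    print(break_variants("a b c"))  # ['a b c', 'a bc', 'ab c', 'abc']
    toy_corpus = ["Khon Thai", "Khonthai", "khon Thai"]  # made-up entries
    print(count_hits("khon thai", toy_corpus))
```

Counting hits per variant corresponds to check word breaks; treating a record as a match if any variant hits corresponds to robust search; passing rough=True approximates robust & rough search.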

Sources
Data for the SEAcat tools was extracted from MARC records obtained via dozens of Z39.50 server facilities, including those listed below. Data collection is ongoing.

Library of Congress	z3950.loc.gov:7090 (Voyager)
Cornell University	catalog.library.cornell.edu:7090 (Voyager)
National Library of Australia	catalogue.nla.gov.au:7090 (Voyager)
University of Wisconsin-Madison	z3950.library.wisc.edu:210 (Madison)
Yale University	prodorbis.library.yale.edu:7090 (Voyager)
Monash University Library	zconn.lib.monash.edu.au:7090 (Voyager)
National Library of Medicine	tegument.nlm.nih:7090 (Voyager)
Hawaii Voyager Consortium	uhmanoa.lib.hawaii.edu:7090 (Voyager)
Columbia University	clio-db.cc.columbia.edu:7090 (Voyager)
University of California Los Angeles	z3950.library.ucla.edu:7090 (Voyager)
Thailand Unified Catalog	202.28.18.229:1111 (default)

All raw MARC data was converted to MODS3 xml and xhtml, and various specialized text samples were built. The search tools in this implementation use the combined title and name corpora.

- the SEAcat title corpus was extracted from <title> and <subTitle> fields, including alternative, uniform, and abbreviated subentries.

- the SEAcat name corpus includes all <namePart> fields of all types.

- the SEAcat note corpus includes all <note> fields that appear to include characters used in romanization.

- the SEAcat miscellaneous corpus includes all <publisher> and <tableOfContents> fields that appear to include characters used in romanization.

- the SEAcat romanization corpus includes all of the above.
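As a rough illustration of how the title and name corpora might be pulled from MODS3 records, here is a short Python sketch using only the standard library. It is not the extraction code actually used for SEAcat: the function name, the returned dictionary layout, and the sample record are hypothetical. The only external fact relied on is the MODS version 3 namespace URI.

```python
import xml.etree.ElementTree as ET

MODS = "http://www.loc.gov/mods/v3"  # MODS version 3 namespace


def extract_corpora(mods_xml: str) -> dict[str, list[str]]:
    """Collect title/subTitle and namePart text from one MODS record."""
    root = ET.fromstring(mods_xml)

    def texts(tag: str) -> list[str]:
        # iter() walks the whole tree, so alternative, uniform, and
        # abbreviated titleInfo subentries are included as well.
        return [
            el.text.strip()
            for el in root.iter(f"{{{MODS}}}{tag}")
            if el.text and el.text.strip()
        ]

    return {
        "title": texts("title") + texts("subTitle"),
        "name": texts("namePart"),
    }


# A made-up record, for illustration only.
SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Khon Thai</title><subTitle>prawat</subTitle></titleInfo>
  <titleInfo type="alternative"><title>Khonthai</title></titleInfo>
  <name type="personal"><namePart>Example, Author</namePart></name>
</mods>"""

if __name__ == "__main__":
    print(extract_corpora(SAMPLE))
```

The note and miscellaneous corpora would need an additional filter for "characters used in romanization"; that test is not specified here, so the sketch leaves it out.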

Other problems
Consistency is a major problem. Two simple terms show the extent of the situation:

- เฉพาะ should have a short final vowel, but often appears as long.

- กรณี may have initial vowel "a" or "o̜".

Both alternates have substantial numbers of hits, even within the subset of pages that try to follow ALA-LC guidelines.

Look for continuing development of SEAlang Library resources.