SEAlang Library: Programmer’s Guide
(preliminary guide, Summer 2006)
The SEAlang Library is designed to provide both interactive web page interfaces and resources that programs can query directly. The basic system architecture is:
SEAlang Library data resources
    mostly flat XML files
    some pre-built hashes / in-memory filesystem
    ‘built for comfort, not for speed’
(which support)
the SEAlang Library query server:
    http://sealang.net
(which handles)
language-specific SEAlang Library web page user interfaces
    http://thai.sealang.net
    http://lao.sealang.net, etc.
(as well as)
program-generated queries
    http://sealang.net/dictionary?lang=thai&ipa=baan& …
    http://sealang.net/corpus?lang=thai&orth=บ้าน&...
Web services are provided by a dedicated Linux server (P4 3.0 GHz, with 2 GB RAM), using Apache / mod_perl. All CGI and server scripts are written in Perl.
Data resources are held as flat XML-tagged text files. Tagging generally follows the Text Encoding Initiative (TEI) P5 standard; variations, if any, do not violate its spirit. In some cases where server performance is an issue, data has been extracted, built into (Perl) associative arrays, and saved in a (Linux) in-memory filesystem. From the web server's point of view, we find that this provides practically the same performance as a persistent mod_perl solution, but is easier to maintain.
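The extraction step described above can be sketched as follows. This is a minimal illustration in Python rather than the project's Perl, and it makes several assumptions: the sample fragment is invented, the element names (`entry`, `form`, `orth`, `pron`, `sense`, `def`) are borrowed from the TEI dictionary module and may not match the actual SEAlang files, and `build_index` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-style fragment for illustration only; the real SEAlang
# data files may be tagged differently.
SAMPLE_XML = """
<body>
  <entry>
    <form><orth>บ้าน</orth><pron>baan</pron></form>
    <sense><def>house, home</def></sense>
  </entry>
  <entry>
    <form><orth>น้ำ</orth><pron>naam</pron></form>
    <sense><def>water</def></sense>
  </entry>
</body>
"""

def build_index(xml_text, key_field="orth"):
    """Build a dict (the analogue of a Perl hash) from key text to entry records.

    key_field is "orth" or "pron". Each key maps to a list of records so that
    entries sharing the same key are all kept.
    """
    index = {}
    root = ET.fromstring(xml_text.strip())
    for entry in root.iter("entry"):
        record = {
            "orth": entry.findtext("form/orth", default=""),
            "pron": entry.findtext("form/pron", default=""),
            "defs": [d.text or "" for d in entry.iter("def")],
        }
        index.setdefault(record[key_field], []).append(record)
    return index

if __name__ == "__main__":
    index = build_index(SAMPLE_XML)
    print(index["บ้าน"][0]["defs"])
```

Once built, such a structure could be serialized to a file on an in-memory filesystem so that each request only has to load it, which is the trade-off the paragraph above describes; that step is not shown here.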
We do not use any database software per se. Our primary goal is to build datasets that are comprehensible, easily maintained, readily extended, and open to access in unanticipated and innovative ways. Although we want fast performance for basic operations, scaling up (i.e. maintaining performance under load) isn’t really an issue. There’s just not that much data, and we don’t really expect heavy, simultaneous user demand.
Query servers respond to simple HTTP requests generated both by SEAlang Library web pages and by anonymous programs. Data is returned as XML-tagged text for display or further processing. Although we considered more elaborate protocols (e.g. SOAP), these appeared to offer little benefit, and very definite disadvantages in terms of complexity, for building a user community around our data. In contrast, the ‘RESTful’ approach of HTTP requests with query arguments (as attribute=value pairs) is easy to understand and work with.
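As a rough illustration of how little machinery this style of request needs, the sketch below builds and takes apart query URLs using only the Python standard library. The server name and the `lang`, `ipa`, and `orth` arguments are taken from the example URLs earlier in this guide; the helper names `build_query_url` and `split_query_url` are hypothetical, and encoding non-ASCII values as percent-encoded UTF-8 is an assumption about what the server expects.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

BASE_URL = "http://sealang.net"  # query server named in this guide

def build_query_url(service, **attributes):
    """Build a query URL from a service name and attribute=value pairs."""
    # urlencode percent-encodes non-ASCII values (e.g. Thai script) as UTF-8;
    # whether the server expects exactly this encoding is an assumption.
    return f"{BASE_URL}/{service}?{urlencode(attributes)}"

def split_query_url(url):
    """Recover (service, attributes) from a query URL built as above."""
    parts = urlsplit(url)
    return parts.path.lstrip("/"), dict(parse_qsl(parts.query))

if __name__ == "__main__":
    # prints http://sealang.net/dictionary?lang=thai&ipa=baan
    print(build_query_url("dictionary", lang="thai", ipa="baan"))
    print(split_query_url(build_query_url("corpus", lang="thai", orth="บ้าน")))
```

The sketch stops at constructing the request. Fetching the URL and parsing the returned XML would depend on the response format, which this guide does not yet specify.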
Query services will be formally defined and published over the life of the project. At present, our own web pages rely on a rough set of common-sense attributes, e.g.:
Attribute | Values
language | Thai, Lao, Khmer … (derived automatically for Unicode languages)
orthography | a query string in local orthography
phone | query string in conventional romanization
definition | query string for reverse search
service | dictionary, corpus, bitext
match | syllable, word, complete ... (part of the dictionary service)
return | gui, result, data … (i.e. the full GUI, the framed result page, other data)
and so on. We anticipate beginning to publicize these as soon as SEAlang has data for its first few scheduled languages.
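The note that the language attribute can be derived automatically for Unicode languages can be illustrated with a small sketch. The code point ranges below are the standard Unicode blocks for Thai, Lao, and Khmer; everything else is an assumption made for illustration, including the function name `guess_language`, the lowercase language labels, and the rule of returning the first script encountered. The guide does not say how SEAlang actually performs this step.

```python
# Unicode block ranges for three scripts. Mapping a script directly to a
# language value is a simplifying assumption for this sketch.
SCRIPT_RANGES = {
    "thai": (0x0E00, 0x0E7F),
    "lao": (0x0E80, 0x0EFF),
    "khmer": (0x1780, 0x17FF),
}

def guess_language(query):
    """Return the language of the first character found in a known script block.

    Returns None when no character falls in any listed block, for example
    when the query is a romanized string.
    """
    for ch in query:
        code_point = ord(ch)
        for language, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                return language
    return None

if __name__ == "__main__":
    print(guess_language("บ้าน"))
    print(guess_language("baan"))
```

A romanized query gives no script information, so in that case the language would still have to be supplied explicitly.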
Web page services provided by the SEAlang Library website play a dual role. First, they manage user queries, providing initial error and sanity checking, and massaging the queries into the forms required by the query servers. As part of this task, they also handle error interpretation and reporting. Second, they manage query server responses as necessary, either by building an XHTML wrapper that refers to an appropriate CSS file before including the returned XML data, or by repackaging the data in some other way.
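Both roles can be sketched in a few lines. The example below is an illustration under stated assumptions, not the site's actual Perl code: the function names `check_query` and `wrap_in_xhtml`, the specific checks, the error messages, and the stylesheet name are all hypothetical. The attribute names and service values come from the attribute table above.

```python
from xml.sax.saxutils import escape, quoteattr

# Service values listed in the attribute table of this guide.
KNOWN_SERVICES = {"dictionary", "corpus", "bitext"}

def check_query(attributes):
    """Return a list of error messages; an empty list means the query looks sane."""
    errors = []
    if attributes.get("service") not in KNOWN_SERVICES:
        errors.append("unknown or missing service")
    # Assumed rule: at least one of the three query-string attributes is set.
    if not any(attributes.get(k) for k in ("orthography", "phone", "definition")):
        errors.append("no query string given")
    return errors

def wrap_in_xhtml(xml_fragment, css_href, title="Query result"):
    """Embed returned XML in a minimal XHTML page that links a CSS file."""
    return (
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        f"<head><title>{escape(title)}</title>\n"
        f'<link rel="stylesheet" type="text/css" href={quoteattr(css_href)} />'
        "</head>\n"
        f"<body>\n{xml_fragment}\n</body>\n</html>\n"
    )

if __name__ == "__main__":
    query = {"service": "dictionary", "phone": "baan"}
    if not check_query(query):
        # "thai.css" is a placeholder stylesheet name.
        print(wrap_in_xhtml("<entry><orth>บ้าน</orth></entry>", "thai.css"))
```

Leaving the returned XML untouched and letting the stylesheet control its display keeps the wrapper independent of any one service's response format.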