When Svenska Akademien decided in 1884 to publish a dictionary modelled on the OED, nobody imagined the amount of work involved. The first volume was published in 1898. This year the part including TRIVSEL was published, and the full first edition will be finished in 2017.
The dictionary offers a comprehensive description of about 70,000 simplex words and 450,000 compounds and derivatives, with their pronunciation, morphology, spelling variants, etymology and nearly 600,000 senses, documented by about 1,350,000 citations from 23,000 sources covering the period from 1521 to 2005. The work is based on a collection of about 8 million slips. A concordance provided by Språkdata in Göteborg is today used as a supplement to the slips.
More than 10 editors-in-chief have, over the 123 years, left their own mark on the editorial rules behind the text. The text is not consistent throughout the 35 volumes, partly because of the long time span and the number of people involved, and partly because of changes in spelling during the 123 years of writing.
During the last 10 years new entries have been produced using a word processor. The preparatory work with the slips is done manually, and the whole process of creating the entries is structured into a large number of tasks involving a large number of people.
In 2007 Svenska Akademien asked Jens Erlandsen to develop an XML Schema together with the editorial staff in Lund. The purpose was to create a formal foundation for the future editorial work. The Schema should, as far as possible, stay close to the editorial rules embodied in the existing editorial manual of about 200 pages, while refining them, elaborating them and making them explicit.
First and foremost, the Schema should support the team of 20 editors in Lund in writing and editing the very complex structures; secondly, it should ease the training of new editors.
With 8 different typefaces, even writing the flat strings in a word processor is cumbersome. Since the Schema introduces more than 110 different information types (elements), with an average of about 8 content characters per element in running text, a major concern has been not to complicate the writing process further by introducing structures that are too heavy and too deep.
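The two figures that drive this concern, nesting depth and content characters per element, are easy to measure on a marked-up entry. The following is only a minimal sketch in Python: the sample fragment and its element names (`entry`, `head`, `lemma`, `pos`, `sense`, `def`, `cit`, `quote`, `year`) are invented for illustration and are not taken from the actual Schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry fragment; element names are assumptions made for this
# sketch, not the real information types of the dictionary Schema.
SAMPLE = (
    "<entry><head><lemma>trivsel</lemma><pos>n.</pos></head>"
    "<sense><def>well-being</def><cit><quote>god trivsel</quote>"
    "<year>1905</year></cit></sense></entry>"
)

def profile(root):
    """Return (max nesting depth, mean length of direct text per text-bearing element).

    Only the text directly inside an element is counted; text that follows
    a child element (the 'tail') is ignored in this simplified sketch.
    """
    lengths = []
    max_depth = 0
    stack = [(root, 1)]  # (element, depth), the root counts as depth 1
    while stack:
        node, depth = stack.pop()
        max_depth = max(max_depth, depth)
        text = (node.text or "").strip()
        if text:
            lengths.append(len(text))
        for child in node:
            stack.append((child, depth + 1))
    mean = sum(lengths) / len(lengths) if lengths else 0.0
    return max_depth, mean

depth, mean_chars = profile(ET.fromstring(SAMPLE))
print(depth, mean_chars)
```

Run over a set of converted entries, a profile of this kind would give a rough indication of how much structure an editor has to handle per character of text written.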
This aspect had to be balanced against other goals such as high-quality print, future electronic publishing, documentation and training and, last but not least, the definition of structures that work hand in hand with the lexicographic way of thinking and working.
The design process was carried out in several steps. The very first was to convey to the editorial team the full meaning and implications of the X in XML. During the next steps the editors were trained in recognizing information types and in making the rules for the structures explicit.
During the first steps only a few entries were used as examples. Towards the end of the design process, 2000 entries from the most recently published work were converted from typographical mark-up into XML as defined by the resulting Schema. This process was far from trivial: the input text was close to, but not 100%, correct; the Schema was not 100% finished and to some extent differed slightly from the document type analysed; and the conversion process itself was difficult to get right. But the process ended successfully, leading to improvements of the Schema and yielding a test bed for the future editing process.
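The general idea of converting typographical mark-up into XML can be sketched as a mapping from typeface runs to elements. This is a deliberately simplified illustration in Python, not the conversion actually used: the input representation (a list of typeface and text pairs), the `FACE_TO_ELEMENT` table and all element names are assumptions. Runs that cannot be mapped are collected rather than guessed at, which mirrors the point that neither the input nor the mapping can be expected to be fully correct.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from typeface to element name (an assumption for this
# sketch; in real text one typeface can stand for several information types).
FACE_TO_ELEMENT = {"bold": "lemma", "italic": "pos", "roman": "def"}

def runs_to_entry(runs):
    """Convert (typeface, text) runs into a flat <entry> element.

    Returns the element together with the runs that could not be mapped,
    so that they can be inspected by hand.
    """
    entry = ET.Element("entry")
    unmapped = []
    for face, text in runs:
        name = FACE_TO_ELEMENT.get(face)
        if name is None:
            unmapped.append((face, text))
            continue
        ET.SubElement(entry, name).text = text
    return entry, unmapped

entry, unmapped = runs_to_entry([
    ("bold", "trivsel"),
    ("italic", "n."),
    ("roman", "well-being"),
    ("smallcaps", "SE"),  # no mapping defined: reported, not converted
])
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
print(unmapped)
```

A real conversion would also have to build nested structures and validate the result against the Schema; both steps are left out here.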
By the end of the process we had reached a level of expertise at which we were ready to face two important realizations. First, we had to accept that not all of our structural constraints can be expressed in the XML Schema itself. Second, experience showed that the more restrictive we made the Schema, the more difficult it became to comprehend, which accentuates the problem of documentation and training.
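One common way to handle constraints that fall outside the Schema is to check them in a separate step after validation. The sketch below, in Python, illustrates this with an invented rule: citations within a sense must appear in chronological order, a value-dependent ordering that an XML Schema content model does not express. The rule, the element names `sense` and `cit` and the `year` attribute are assumptions made for illustration, not constraints taken from the editorial manual.

```python
import xml.etree.ElementTree as ET

def unordered_senses(entry):
    """Return the 1-based positions of <sense> elements whose citations
    are not in non-decreasing order of year.

    Assumes every <cit> carries an integer 'year' attribute (hypothetical).
    """
    bad = []
    for i, sense in enumerate(entry.iter("sense"), start=1):
        years = [int(cit.get("year")) for cit in sense.findall("cit")]
        if years != sorted(years):
            bad.append(i)
    return bad

# Hypothetical entry: the second sense has its citations out of order.
ENTRY = ET.fromstring(
    "<entry>"
    "<sense><cit year='1650'/><cit year='1720'/></sense>"
    "<sense><cit year='1890'/><cit year='1760'/></sense>"
    "</entry>"
)

violations = unordered_senses(ENTRY)
print(violations)
```

Keeping such rules in separate checks leaves the Schema itself smaller and easier to read, at the cost of maintaining the rules in two places.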
The next steps will look into what practical implications the tools used for the editing process and the workflow will have for the Schema, and whether a solution with several related Schemas could be an alternative.
The presentation will go through the design steps in more detail. In particular it will focus on the design decisions behind the Schema and discuss the mechanisms for supporting the editing process. In other words: are Schemas for editing XML different from other Schemas? Does it make sense to operate with several Schemas, and how are they interrelated and maintained? Finally, a few words will be said about the software tools we used during the design process.
Jens Erlandsen has since 2003 led the development team behind iLEX, an integrated XML editing and database system for dictionaries, encyclopaedias, legal text and technical documentation developed by EMP. In 1988 he founded TEXTware, which developed electronic dictionaries and encyclopaedias for leading international publishers such as Longman, Oxford University Press, MacMillan, Cambridge University Press, Bertelsmann and more. Before that, Jens taught computational linguistics at the University of Copenhagen and was marketing manager for a 70-person division developing expert systems.