JMdict
Japanese Multi-lingual Electronic Dictionary Project
Introduction
This document outlines the JMdict project, which set out
to extend the structure and content of
the EDICT Japanese-English Electronic Dictionary file to enable it to
contain additional information and provide an improved service to
users.
Project Goals
The project has several broad goals:
- to convert the EDICT file to a new dictionary structure which overcomes
the deficiencies in the basic EDICT structure.
(completed)
With regard to this goal, the particular structural and content aspects
addressed include, but are not limited to:
- the handling of orthographical variation (e.g. in kanji usage,
okurigana usage, readings) within the single entry;
- additional and more appropriately associated tagging of grammatical
and other information;
- provision for separation of different senses (polysemy) in the
translations;
- provision for the inclusion of translational equivalents from
several languages;
- provision for inclusion of examples of the usage of words;
- provision for cross-references to related entries.
- to publish the dictionary in a standard format which is accessible
by a wide range of software tools;
It is proposed that this goal be addressed by developing the structure
so that it can be released as an XML document, with an associated XML
DTD.
- to retain backward compatibility with the original EDICT structure
in order to enable legacy software systems to use later versions of the
EDICT files.
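The structural aspects listed under the first goal can be pictured as a small data model. The following is a minimal sketch only, not the project's actual internal structure: the class names, field names, tag value and language code are all hypothetical, chosen to mirror the list above (orthographic variants within one entry, tags attached to senses, separate senses, glosses in several languages, examples and cross-references).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sense:
    """One meaning of a word; all field names here are hypothetical."""
    # Grammatical and other tags are attached to the sense they describe.
    tags: List[str] = field(default_factory=list)
    # Translational equivalents, keyed by an assumed language code.
    glosses: Dict[str, List[str]] = field(default_factory=dict)
    # Usage examples and cross-references to related entries.
    examples: List[str] = field(default_factory=list)
    cross_refs: List[str] = field(default_factory=list)


@dataclass
class Entry:
    """A single entry holding every orthographic variant of one word."""
    kanji_forms: List[str] = field(default_factory=list)
    readings: List[str] = field(default_factory=list)
    senses: List[Sense] = field(default_factory=list)


# Illustrative data: three okurigana variants kept in one entry.
entry = Entry(
    kanji_forms=["取り扱い", "取扱い", "取扱"],
    readings=["とりあつかい"],
    senses=[Sense(tags=["n"], glosses={"eng": ["treatment", "handling"]})],
)
```

Keeping the variants in lists on a single entry, rather than in separate records, is what allows the tags and translations to be stated once per sense.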
Project Status
The following has been achieved to date (June 2003):
- a new structure has been developed for the EDICT file, which has
been called
JMdict
(Japanese Multi-lingual Dictionary). This structure has been described in
an XML Document Type Definition (DTD), which may be viewed
here.
(Note: this DTD is not quite up to date. The latest DTD is incorporated
into the distributed JMdict file.)
Samples of some EDICT
entries converted to XML in accordance with the DTD can be viewed
here.
- the EDICT file has been converted into a new structure which is
aligned with the XML DTD. While many of the EDICT entries converted
simply and automatically, a significant number of entries were variants
of each other which had to be identified and combined.
(Note that while this structure is aligned with the XML DTD, the XML
format is not being used internally at the moment.)
- utility software has been developed which converts the new file
structure back to the (old) EDICT format. All updates to the EDICT file
are now taking place via the new structure;
- utility software has also been developed which converts the JMdict
file in the new (internal) structure into the XML format for release;
- sets of translational equivalents in other languages are added to the
JMdict file when it is released. These are:
- entries from Ulrich Apel's
WaDokuJT
dictionary project.
- the French glosses from Jean-Marc Desperrier's translation of the
EDICT file
(Jean-Marc's page.)
- Oleg Volkov's EDICT-format Japanese-Russian dictionary file.
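The conversion from the new structure back to the old EDICT format can be sketched as follows. This is an illustration under stated assumptions, not the project's utility software: the element names (`entry`, `keb`, `reb`, `gloss`), the `xml:lang` attribute and the language codes in the sample are assumed here and should be checked against the DTD distributed with the file, and the function `to_edict_lines` is hypothetical. The output follows the commonly described EDICT line layout of `KANJI [KANA] /gloss/gloss/`.

```python
import xml.etree.ElementTree as ET
from typing import List

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Illustrative sample entry; element names and language codes are assumptions.
SAMPLE = """<entry>
  <k_ele><keb>取り扱い</keb></k_ele>
  <k_ele><keb>取扱い</keb></k_ele>
  <r_ele><reb>とりあつかい</reb></r_ele>
  <sense>
    <gloss xml:lang="eng">treatment</gloss>
    <gloss xml:lang="eng">handling</gloss>
    <gloss xml:lang="fre">traitement</gloss>
  </sense>
</entry>"""


def to_edict_lines(entry_xml: str, lang: str = "eng") -> List[str]:
    """Expand one merged entry into EDICT-style lines, one per kanji form."""
    entry = ET.fromstring(entry_xml)
    kanji = [k.text for k in entry.iter("keb")]
    readings = [r.text for r in entry.iter("reb")]
    # Keep only glosses in the requested language; untagged ones count as English.
    glosses = [g.text for g in entry.iter("gloss")
               if g.get(XML_LANG, "eng") == lang]
    body = "/" + "/".join(glosses) + "/"
    reading = readings[0]  # simplification: use the first reading only
    if not kanji:
        # Kana-only words have no bracketed reading in the EDICT layout.
        return [f"{reading} {body}"]
    return [f"{k} [{reading}] {body}" for k in kanji]


for line in to_edict_lines(SAMPLE):
    print(line)
```

One merged entry expands to one line per kanji variant, which is the reverse of the merging step described above and suggests why the old format needed near-duplicate entries.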
Feedback
Comments are sought from anyone interested in this project. In
particular, critical appraisal of the proposed structure, and
constructive suggestions for its improvement, will be most welcome.
Please feel free to send me
email
about this project.
Release Date
The first release of the XML-format
JMdict (UTF-8 Unicode) took place in May 1999.
There have been several releases since then, with the most recent in
October 2001. It is intended that JMdict releases take place at the same
time as major EDICT releases.
Mailing List
There is a small closed mailing list for people seriously involved in
JMdict. Email Jim if you want to be included.
Software
Some software is under development which uses JMdict:
The WWWJDIC dictionary server now uses an extended format for the main
dictionary entries, which draws from the JMdict files.
Permission for Use
The JMdict file is now located within the Electronic Dictionary Research
and Development Group at Monash University. Information about the Group
is
here,
including the terms under which the file can be used.
Jim Breen
June 2003