Editorial policy - EDRDG Wiki

JMdict/EDICT Editorial Policy and Guidelines

These guidelines are intended for people preparing new entries or amendments for the JMdict/EDICT files. Typically these entries or amendments will be made via the JMdictDB on-line database system.

Before Starting

Before proposing a new entry or an amendment, you should:

Note that it is not necessary to have an account and log in before proposing a new entry or a change to an existing entry. The login is really only for members of the Editorial Board. Changes can be proposed anonymously, but as explained below, we prefer that people identify themselves by name or nickname.

Dictionary Entry Fields

Kanji/Special-Character Forms

The Kanji section of the entry form contains the form of the Japanese word/phrase which contains kanji, special characters or letters from non-Japanese scripts (e.g. ＭＰ３プレーヤー). The word/phrase should be written in full-width characters (e.g. ＭＰ３プレーヤー, not MP3プレーヤー).
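The full-width rule above can be checked mechanically before submitting a form. The following is a minimal sketch, not part of JMdictDB; the function names are hypothetical, and it assumes that only ASCII letters and digits need converting (they sit at a fixed offset of 0xFEE0 from their full-width counterparts in Unicode).

```python
# Offset between an ASCII character and its full-width form,
# e.g. "M" (U+004D) -> "Ｍ" (U+FF2D).
FULLWIDTH_OFFSET = 0xFEE0


def has_halfwidth_alnum(text: str) -> bool:
    """Return True if the text contains any ASCII letter or digit."""
    return any(ch.isascii() and ch.isalnum() for ch in text)


def to_fullwidth(text: str) -> str:
    """Convert ASCII letters and digits to full-width; leave everything else as-is."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalnum():
            out.append(chr(ord(ch) + FULLWIDTH_OFFSET))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(to_fullwidth("MP3プレーヤー"))
```

Punctuation and spaces are deliberately left untouched here, since the guideline above only gives a letter-and-digit example.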

There may be more than one version of the word or phrase in this section. The usual reasons for having more than one version (also known as "surface forms" or "orthographical variants") are:

Where there are multiple forms of a word, enter them with the most commonly used form first, and then order them in decreasing frequency of use. Where the frequencies are equivalent, give preference to terms using 常用漢字. In general, irregular or incorrect forms, e.g. those tagged iK, io or ik, should be placed at the end of the surface form list, even if they are commonly used on WWW pages.

Synonyms should not be included here. Instead they should be entered as separate dictionary entries, and a cross-reference inserted to them.

Some other points to note:

A set of tags, e.g. iK or oK, can be applied to the words in this section. These should be used sparingly.

Readings

In this section enter either:

Readings associated with kanji should normally be in hiragana; the main exceptions being:

More than one reading can be entered where alternatives are possible. This can occur when

Where alternative readings are restricted to particular variants of the kanji form, specify this using the [restr=KKK] pattern after the reading. As in the Kanji section, place the more common reading(s) first.

外来語 (in katakana) are entered in this section. Do not enter them in the kanji section. Where a 外来語 is a transliteration of several source words, include versions with and without a separating "middle dot", e.g. "アームレスチェア;アームレス・チェア". Note that the JIS middle-dot must be used - there are other Unicode middle-dots that are not accepted.
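The middle-dot requirement is easy to get wrong because several Unicode characters look alike. The sketch below assumes that the "JIS middle-dot" is U+30FB (KATAKANA MIDDLE DOT); the list of look-alike characters is illustrative rather than exhaustive, and the function names are hypothetical.

```python
KATAKANA_MIDDLE_DOT = "\u30fb"  # assumed to be the accepted JIS middle dot

# Some visually similar characters that would not be accepted (illustrative list).
LOOKALIKE_DOTS = {
    "\u00b7": "MIDDLE DOT",
    "\uff65": "HALFWIDTH KATAKANA MIDDLE DOT",
    "\u2022": "BULLET",
    "\u2027": "HYPHENATION POINT",
    "\u22c5": "DOT OPERATOR",
}


def find_wrong_dots(text: str) -> list:
    """Return (index, character name) for each look-alike dot found in the text."""
    return [(i, LOOKALIKE_DOTS[ch]) for i, ch in enumerate(text) if ch in LOOKALIKE_DOTS]


def gairaigo_variants(parts: list) -> str:
    """Build the 'joined;dotted' pair of forms from the component words."""
    return "".join(parts) + ";" + KATAKANA_MIDDLE_DOT.join(parts)


if __name__ == "__main__":
    print(gairaigo_variants(["アームレス", "チェア"]))
    print(find_wrong_dots("アームレス\u00b7チェア"))
```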

If a 外来語 (e.g. ベースボール) means the same as a native Japanese word (e.g. 野球), do not include the 外来語 form as a reading of the kanji. Instead, create a separate entry and create cross-references between them. Similarly, if two kana-only words have the same meaning, do not place them in the same entry unless they are related, e.g. spelling or pronunciation variants.

If the kanji part contains katakana (e.g. 一眼レフ), use katakana in the Reading as well for the matching portion (いちがんレフ).

A set of tags, e.g. ik or ok, can be applied to the words in this section. These should be used sparingly.

Reading Field Simplification

Until mid-2021, if the kanji field of an entry included both kanji and katakana for part of a form, e.g. アカバナ科 and 赤花科, then the reading field typically had matching kana forms, in this case アカバナか and あかばなか, with restrictions to align the kanji/readings pairs. This was done to assist with the generation of the legacy EDICT format. This is no longer a major issue and it is now considered acceptable to have a single reading (i.e. あかばなか) in such cases.
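As an illustration of the simplification, readings that differ only in kana script can be collapsed to a single hiragana reading. This is a rough sketch with hypothetical function names; it relies on the fixed offset of 0x60 between the katakana and hiragana blocks in Unicode. It is only meant for cases like アカバナ科/赤花科, not for entries such as 一眼レフ where the katakana portion is kept in the reading.

```python
def kata_to_hira(text: str) -> str:
    """Convert katakana (ァ..ヶ) to hiragana; other characters are left unchanged."""
    out = []
    for ch in text:
        if "\u30a1" <= ch <= "\u30f6":
            out.append(chr(ord(ch) - 0x60))
        else:
            out.append(ch)
    return "".join(out)


def collapse_readings(readings: list) -> list:
    """Drop readings that duplicate an earlier one once both are in hiragana."""
    seen = []
    for reading in readings:
        hira = kata_to_hira(reading)
        if hira not in seen:
            seen.append(hira)
    return seen


if __name__ == "__main__":
    print(collapse_readings(["アカバナか", "あかばなか"]))
```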

Meanings

The Meanings section of the entry form is divided into senses, i.e. distinct meanings. These are indicated by a sense number: [1], [2], etc. Each sense can have a number of part of speech tags (POS), e.g. [n], [adj-i] and miscellaneous tags, e.g. [abbr] and [col].

The meanings consist of one or more short translations or explanations of the Japanese word or phrase.

General

(At present the "expl", "lit" and "fig" tags are only used in the database - they are not yet exported to JMdict or EDICT.)

Which Reference Is Best?

On occasion, references (see the list of dictionaries, etc. below) will differ as to the meanings of entries, and as to which senses are more important than others. Here are some suggestions for handling this:

Part-Of-Speech (POS) Issues

Word Source

If the word or term comes from another language, mark this at the beginning of the sense(s) to which it applies. The format is [lsrc=lng:], where lng is the three-letter code from the ISO 639-2:1998 "Codes for the representation of names of languages" standard, e.g.:

Don't do this for (i) common Sino-Japanese vocabulary; (ii) loan-words from English where the source word is among the translations; (iii) words/terms which are translations from other languages. If the word or term in the source language is identical to the translation, don't repeat it in the [lsrc=...] field. Note that where a loan-word from English was originally from another language, e.g. ベランダー/verandah, the usual practice is not to indicate a source language.

Non-English source languages are usually indicated in the major 国語辞典 such as Daijirin and Daijisen, and also in 外来語 dictionaries such as the Gakken カタカナ 新語辞典. In cases of disagreement or doubt, e.g. where a term may have come from either English or French, omit any source language marking.

Source words in languages that use a non-Latin script should be given in Latin transcription. Diacritical marks can be used. For the following languages, use these transcription systems:

The language markings apply both to loanwords (外来語), as with the examples above, and to transliterations (音写), typically the Buddhist terms taken from Sanskrit, which are not usually regarded as loanwords.

Note that where ISO 639 discriminates between historical forms of a language, e.g. "grc" for Classical Greek and "gre" for Modern Greek, the modern tag is to be used as the discrimination cannot easily be applied at the word level.

Cross-References

Cross-references can be made to other dictionary entries where this enhances the value of the entry to the typical dictionary user. Examples of such useful cross-references are:

where one entry is an abbreviation of another, e.g. 学割 and 学生割引 (see below).

where the words are commonly associated or contrasted, e.g. 先輩/後輩, 税別/税込み, etc.

where there is a derivational relationship between words that it is useful to highlight, e.g. between かっけー and 格好いい, or between オケる and 空オケ.

At present two classes of cross-reference are supported: a general "see" and an "ant" for antonyms.

Specify the cross-reference using the pattern [see=言葉] or [ant=何等] (see the detailed instructions). Where the reference is to a particular headword/reading combination, use the format: kanji・reading, e.g., [see=金本位・かねほんい]. Where the target word has a kanji form, that form should be used. For targets that are a particular sense of the target word, use the format [see=漢字[2]].
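To make the three cross-reference shapes concrete, here is a small parsing sketch. It is not the JMdictDB parser; the regular expression and the XRef type are assumptions made for illustration. One known limitation: a target that itself contains a middle dot would be mis-read as a kanji・reading pair.

```python
import re
from typing import NamedTuple, Optional


class XRef(NamedTuple):
    kind: str                # "see" or "ant"
    target: str              # kanji form (or kana form if there is no kanji)
    reading: Optional[str]   # present only for the kanji・reading format
    sense: Optional[int]     # present only when a specific sense is targeted


XREF_RE = re.compile(
    r"\[(?P<kind>see|ant)="
    r"(?P<target>[^\[\]・]+)"
    r"(?:・(?P<reading>[^\[\]・]+))?"
    r"(?:\[(?P<sense>\d+)\])?"
    r"\]"
)


def parse_xref(tag: str) -> Optional[XRef]:
    """Parse a [see=...] or [ant=...] tag; return None if it does not match."""
    m = XREF_RE.fullmatch(tag)
    if m is None:
        return None
    sense = m.group("sense")
    return XRef(m.group("kind"), m.group("target"), m.group("reading"),
                int(sense) if sense else None)


if __name__ == "__main__":
    for tag in ["[see=言葉]", "[see=金本位・かねほんい]", "[see=漢字[2]]", "[ant=何等]"]:
        print(parse_xref(tag))
```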

Please note that the "ant" (antonym) tag should only be used for genuine opposites. Words such as "short" and "tall" are antonyms; "short person" and "tall person" are not - use the regular "[see=...]" form for these. (For more information, see the excellent Wikipedia article on this.)

Avoid adding cross-references to words which simply mean the same (or the opposite), as this adds a lot of clutter to the entries without necessarily being helpful to users. There are related systems, such as the Japanese WordNet, which specifically provide details of large numbers of synonyms. Some systems such as WWWJDIC link to the Japanese WordNet as part of the entry display.

Abbreviations

Many Japanese terms are abbreviations of longer terms, for example 学割 is an abbreviation of 学生割引. When creating an entry for such an abbreviation:

Where useful, a cross-reference back from the full form to the abbreviation may also be added.

Romanized Japanese

Romanized forms of Japanese words may be used within meanings in the following situations:

The Hepburn romanization system, in particular the revised (aka modified) version, will be used. That page can be taken as a guide, with the key points being:

When a romanized Japanese term warrants further explanation, this should be added in an explanatory gloss, not in parentheses after the term. See the 式神 entry for an example of this.

Old and Rarely Used Terms

Several miscellaneous tags are available for indicating that terms are no longer in current use or are rarely used. They are:

The "hist" (historical) term does not fall into this category. It refers to a past event (e.g. battle, ceremony) or concept (e.g. an art-form common in the 18th century), but the term itself is still in current use.

Numbers with Units and Symbols

In general where a number is followed by a unit or symbol, the following spacing rules should be followed:

Date and Time Formats

For the sake of consistency, the same format should be used when recording specific dates. The preferred formats are:

For the dates of individual people, e.g. in the named-entity dictionary, use the YYYY.MM.DD format for the sake of brevity, e.g. "Yukio Mishima (1925.1.14-1970.11.25)".

Similarly, for specific times of the day, use the "2am" and "12:30pm" styles, both to be consistent and to use the minimum amount of space.
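The compact date and time styles can be produced with a few lines of formatting code. The sketch below uses hypothetical helper names and follows the examples given above: no zero-padding in YYYY.MM.DD dates (as in "1925.1.14"), and minutes shown only when non-zero. How midnight and noon should be written is not stated in these guidelines, so the "12am"/"12pm" output for those cases is an assumption.

```python
from datetime import date, time


def format_person_date(d: date) -> str:
    """Format a date as YYYY.MM.DD without zero-padding, e.g. 1925.1.14."""
    return f"{d.year}.{d.month}.{d.day}"


def format_lifespan(born: date, died: date) -> str:
    """Format a birth-death range, e.g. 1925.1.14-1970.11.25."""
    return f"{format_person_date(born)}-{format_person_date(died)}"


def format_time(t: time) -> str:
    """Format a time of day in the '2am' / '12:30pm' style."""
    suffix = "am" if t.hour < 12 else "pm"
    hour = t.hour % 12 or 12
    if t.minute == 0:
        return f"{hour}{suffix}"
    return f"{hour}:{t.minute:02d}{suffix}"


if __name__ == "__main__":
    print(f"Yukio Mishima ({format_lifespan(date(1925, 1, 14), date(1970, 11, 25))})")
    print(format_time(time(2, 0)), format_time(time(12, 30)))
```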

For classifying years in dates, use the secular BCE (Before Common Era) and CE (Common Era). In dates after 1000 CE the "CE" is usually omitted.

Capital Letters

Capital letters should generally be confined to proper nouns, e.g. specific countries, places, people, products, etc. Astronomical objects such as the Sun, Saturn, etc. will have capitals, but moonlight and sunshine will not.

Use of French, etc. Diacritics

(in progress)

References

This is where you indicate the sources for the entry or amendment. It helps establish its validity, enables editors to check out the accuracy, e.g. of the translation from a 国語辞典, and leaves a record for other people to know where the entry and translation came from.

The best references are to other dictionaries, and the more the better. Sometimes just the name of the dictionary will do, where the proposed entry is already an entry in the reference; however, if the entry in the dictionary is readily visible online, it is better to include the URL. Editors and regular contributors have developed a set of abbreviations and mnemonics for some of the popular sources:

Some of the above references are available via aggregator or reference WWW sites such as Goo, Weblio, Yahoo, etc. In such cases please make sure the reference URL is to the specific term on the site, and add the name of the actual dictionary being used for the reference (大辞林, 日国, etc.).

If the references include online resources such as a dictionary entry or a Wikipedia article, quote the relevant URL. Please note that a Japanese Wikipedia article by itself is not necessarily a good source for a dictionary entry. Some articles are simply translations from an English page and not evidence that a term is in use in Japanese. Sometimes an article only covers one aspect of a term's usage, and there are other senses which need to be covered. It is best to check the term in other sources and state that in the References section.

If the sources for the entry are other WWW-based documents, quote the URLs of at least one (preferably several), and use the Comments field to state your case for it being included.

As noted above, the Eijiro glossary should not be the sole source of references for a proposed entry, although it may be used as a supplementary reference for confirming meanings. This is because the glossary is a collection of Japanese-English pairs which have apparently been collected from translations. In a Japan Times article Daniel Morales described it as "a smorgasbord of reibun and definitions, some of which err on the side of slang, often delighting the expat community. For example, the entry for nyūbō (乳房, breasts) has no fewer than 51 English options, including the ever-so-mature “funbags.” And kyūryōbi (給料日, payday) lists “when the eagle flies” (an American tribute to governmental pay), among other more colorful renditions."

Comments

Use this field to enter any additional information you think will help the editors when they assess the entry or amendment. These comments are kept with the entry as a record of the discussions. The Comments field will also be used by editors when providing feedback.

Name/Email address

While it is not mandatory, it is best if you include your name. Editors get to know who are regular contributors of amendments and new entries, and it is easier to establish some rapport if the contributor is identified. Also, having an email address enables editors to contact a contributor directly if there is a question they wish to raise. Note that email addresses cannot be seen by people browsing the database; they are only visible to editors who have logged into the system.

Other Issues/Policies

Anonymous Submissions

There is no requirement for people submitting new entries or amendments to identify themselves. It is preferred, however, that people making regular contributions provide some identification, either their name or a pen-name, as it will add to the sense of community among the participants, and also enable the editors to take into account the quality of previous contributions when examining a proposal.

Character Codes

Although the database supporting the dictionary uses Unicode coding and can contain any character from that set, the distributed forms of the dictionary are more constrained, in particular:

The JMdict database is in Unicode and thus can contain any valid Unicode characters.

Care needs to be taken with the inclusion in the database of characters outside the JIS X 0208 and JIS X 0212 codesets as this has implications for the EDICT and EDICT2 versions of the data. In particular:

Kanji which lie outside the JIS X 0208 and JIS X 0212 codesets, e.g. the additional kanji in JIS X 0213, can be included in the database and will be in the JMdict distributions; however, they will not be propagated into the EDICT/EDICT2 distributions.
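A contributor can get a rough idea of whether a character will survive into the EDICT/EDICT2 files by trying to encode it. The sketch below is an approximation, not the test used by the distribution scripts: it assumes that Python's built-in euc_jp codec covers JIS X 0208 and JIS X 0212 (it also accepts ASCII and half-width katakana), and the function names are hypothetical.

```python
def in_edict_codesets(ch: str) -> bool:
    """Approximate check: True if the character can be encoded as EUC-JP."""
    try:
        ch.encode("euc_jp")
        return True
    except UnicodeEncodeError:
        return False


def chars_dropped_from_edict(text: str) -> list:
    """List the characters in the text that fail the approximate check."""
    return [ch for ch in text if not in_edict_codesets(ch)]


if __name__ == "__main__":
    # U+20B9F is a JIS X 0213 kanji outside the Basic Multilingual Plane.
    print(chars_dropped_from_edict("亜\U00020B9F"))
```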

Merging Entries/Two-out-of-three Rule

On occasions, two or more entries may be merged when there are grounds for assuming they are variants of each other. The basic principle that is applied is a "two-out-of-three" rule (first described in a paper in 2004). For the candidate entries, if at least two out of the (a) kanji-headword, (b) reading and (c) meaning fields are the same, the entries may be merged. Otherwise they must be separate entries. It is often not a simple decision, as there may be kanji-headwords which only apply to some of the readings.

Where the entries have multiple kanji parts or readings, this rule really applies only to the major/common forms. Mergers should not be carried out on the basis of a rare or archaic kanji form or reading. Common sense must apply.

Two entries with no kanji could be merged if they have the same meaning and the kana forms are related, e.g. are variants of each other, such as ダイアモンド and ダイヤモンド.
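The core of the two-out-of-three rule can be expressed as a simple count. The sketch below is only a first-pass filter under strong simplifying assumptions: each candidate is reduced to its major kanji form, reading and meaning, and "the same" is taken as exact string equality. The Candidate type and the placeholder values (K1, R1, M1, etc.) are hypothetical. Kana-only pairs are not handled, because judging whether kana forms are related variants needs an editor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    kanji: str    # major kanji headword, or "" if the entry has none
    reading: str  # major reading
    meaning: str  # meaning, reduced to a comparable label


def may_merge(a: Candidate, b: Candidate) -> bool:
    """Return True if at least two of kanji, reading and meaning are the same."""
    matches = sum([
        a.kanji != "" and a.kanji == b.kanji,  # empty kanji fields do not count
        a.reading == b.reading,
        a.meaning == b.meaning,
    ])
    return matches >= 2


if __name__ == "__main__":
    print(may_merge(Candidate("K1", "R1", "M1"), Candidate("K2", "R1", "M1")))
    print(may_merge(Candidate("K1", "R1", "M1"), Candidate("K2", "R2", "M1")))
```

A True result only means the pair is a candidate for merging; the points above about rare forms and common sense still apply.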

Is it worth including?

An important issue is whether a possible entry is worth including. This question primarily arises with expressions such as XXXのYYY/XXXがYYY/etc. or compound nouns/multi-word expressions. Clearly, we want to include entries that are useful and relevant, but we don't want to clutter the dictionary with things that are obvious. It is inevitably a value judgement and often leads to some debate between editors before a proposed entry is accepted or rejected. All dictionaries have to deal with this issue. It is worth reading the Wiktionary Criteria for inclusion as it discusses many of the issues in considerable detail. The following is a list of criteria being used by the editors to assess whether a proposed entry should be included. Generally passing one or more of these criteria is needed.

Loanword Variants

Many loanwords (外来語) in Japanese have multiple surface forms which reflect such things as alternative mappings from the source language, variant vowel lengths, etc. Examples include ダイヤモンド/ダイアモンド, コンピュータ/コンピューター and ヴァイオリン/バイオリン. In general, all variants that are in regular use should be included, ranked in order of use (an n-gram corpus can be used to determine this). Rarely-used variants can be omitted, or included with an "ik" (incorrect kana) tag.

Dividing Loanwords by Source

In some cases loanwords which have identical katakana representations derive from two or more source language words. For example, there is フォーク meaning "fork", and フォーク meaning "folk". Similarly, there is バイト meaning "byte" and バイト which is an abbreviation of アルバイト (part-time work). In such cases, the policy is to have a separate entry when the source-language words are different.

Search-only Forms

From August 2022 a number of surface forms are being included in the dictionary database purely for the purpose of enabling them to be used as search keys. The editors have identified a number of these forms which, although not considered appropriate for inclusion and display in dictionary entries, are used "in the wild" enough for them to be useful for looking up entries. This practice of having search-only forms can be seen in several online dictionaries; for example, the Kenkyusha site allows the 手古摺る entry to be found using 手こずる as a search key, although that form does not appear in the entry.

Examples of the types of forms that are in this category include:

These search-only forms are being added at the ends of the sets of surface forms and are being given the tags/attributes of "sK" for forms containing kanji and "sk" for kana-only forms.

It is strongly suggested that developers of dictionary apps and sites use these forms for searching purposes, but not show them as part of the full entry. The forms do not participate in the restriction structure and there is nothing to indicate whether they are irregular (e.g. 変換ミス) or just uncommon (e.g. itaiji). Displaying them alongside the non-sk/sK forms would be confusing and unhelpful for users. (This concealment approach is currently implemented in the WWWJDIC server, where a search for 三蜜 will retrieve the 3密/三密 entry but does not show 三蜜 as part of the entry.)
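For app and site developers, the suggested behaviour amounts to indexing every form but filtering the sK/sk forms out of the display. The following sketch uses a hypothetical in-memory data model (Form and Entry are not JMdict structures) to show the idea with the 三蜜 example.

```python
from dataclasses import dataclass, field

SEARCH_ONLY_TAGS = {"sK", "sk"}


@dataclass
class Form:
    text: str
    tags: set = field(default_factory=set)


@dataclass
class Entry:
    forms: list

    def matches(self, query: str) -> bool:
        """Search uses every form, including search-only ones."""
        return any(f.text == query for f in self.forms)

    def display_forms(self) -> list:
        """Display omits any form tagged sK or sk."""
        return [f.text for f in self.forms if not (f.tags & SEARCH_ONLY_TAGS)]


if __name__ == "__main__":
    entry = Entry([Form("3密"), Form("三密"), Form("三蜜", {"sK"})])
    print(entry.matches("三蜜"), entry.display_forms())
```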

Additional documentation of these forms can be found on the Kanji and Reading Information Fields page.

Proverbs/Kotowaza/Aphorisms/Sayings/etc.

In general, the dictionary is not the place for recording extended text passages, but there is scope for including short, pithy passages which are recognized as useful in Japanese. Tests that will be used by editors when assessing such passages for inclusion include whether they are clearly in common use in Japanese, and/or are included in one or more of the major 国語辞典.

With regard to quotations and proverbs, the following guidelines are suggested for the use of the tags:

Some entries consist of a term or passage based on or derived from part of a historical text. These should not be marked as [quote] unless they are an actual translation. Where appropriate a note can be included indicating the original text, e.g. "deriv. from 史記 passage".

Proper Names

In general, the JMdict/EDICT dictionary is not intended to include proper names as these are included in the companion ENAMDICT/JMnedict dictionary. It is common, however, for small numbers of high-profile proper names to be included in general dictionaries, and this is the case with JMdict. Proper names included in JMdict are primarily place names, with emphasis on the names of significant places within Japan, and on the Japanese names of countries and major cities. (The proper names in JMdict will be in ENAMDICT/JMnedict as well.)

The proper names considered appropriate for inclusion are:

The above covers most of the proper names in JMdict. Some other names have been included, e.g. major newspapers, and there is discussion as to whether they can be retained under a "grandfather" principle, or confined to ENAMDICT/JMnedict.

The tags such as "place", "work", "person", etc. which are used to classify named-entities in the JMnedict database may also be used for proper names in JMdict; however, they should only be used when the nature of the entry is not clear from the gloss itself. For example "バルセロナ (n) Barcelona (Spain)" does not need the addition of the "place" tag.

Since mid-2023 a selection of almost 7,000 entries from the JMnedict database has been included in the JMdict XML distribution. These entries are mainly from the company/organization/work categories. Care should be taken not to duplicate such name entries.

As with other transcriptions of Japanese terms, the modified Hepburn system will be used. In most cases macrons will be used for long vowels, the only exceptions being cities such as Tokyo, Osaka and Kobe which are commonly used in English without macrons.

Names of biological species

The rules we are using for biological species are:

Note that in Japanese a genus is always denoted by the use of 属/ぞく, as in:
ハギ属
ハギぞく
(n) Lespedeza (genus comprising the bush clovers)

These guidelines were developed originally by ReneMalenfant 21:05, 25 August 2009 (UTC) and revised by (most recently) JimBreen (talk) 00:21, 28 November 2017 (UTC)

Sensitive Terms

As in any language, there are words and terms in Japanese which need to be used with care and sensitivity, as they may be blunt, cause offence in some contexts, etc. In JMdict there is a "sens" tag which may be associated with one or more senses of an entry to indicate that the term should be used with a degree of caution. Determining which terms should be regarded as sensitive is quite difficult. In general the major Japanese-English and English-Japanese dictionaries do not attempt to indicate them, probably because they are usually compiled for Japanese users who do not need to be told this.

A useful reference is a list of problem terms (放送問題用語) based on a 1983 publication by NHK. That list, for example, includes virtually every term which includes 盲/めくら (blindness), so for 盲窓/めくら窓, it advises that "外見だけの窓" be used instead. Some of the prohibitions seem extreme; for example, 医者 is on the list, with the advice that 医師 or お医者さん be used instead, whereas foreign learners of Japanese are usually taught 医者 without any qualification. Note that the list is over 30 years old, and there are reports that it is not being followed completely now. The list is categorized according to whether terms are banned (×), have some reservations (△) or are uncertain (?), and the "×" tag is applied to 122 terms.

While there can be no hard and fast rules, it is suggested that people submitting or amending entries apply the following guidelines when considering whether the entry should include a "sens" tag.

Hyphens and Similar Characters

In the Unicode character set, there are nine characters that represent some sort of mid-line bar. Only three of these are to be used in the JMdict database. They are:

For details of the other characters see the JMdict issue on the topic.

Issues Forum

There is a JMdict Issues forum where matters such as structure, format, policies, tags, and other issues concerning dictionary content can be raised and discussed (currently hosted on GitHub). Do not use this forum to discuss specific entries, as these should be raised in the database itself.