Japanese language processing on a computer is more complicated than English language processing, because Japanese orthography involves four different writing systems - hiragana, katakana, kanji, and romaji - and uses many thousands of distinct characters.
A personal computer equipped with a kanji input method and dictionary program is a powerful aid in learning Japanese. The software needed is freely available and offers hope for the non-native student of the language who attempts to learn the large character sets during limited "free" time.
The intent of this article is to introduce NetBSD's Japanese language support to the English-speaking user. The approach will be to demonstrate a few common activities with commentary on the progression of ideas involved.
Note: Because this HTML document contains Japanese characters, some graphical browsers will display a backslash (\) herein as a Yen symbol. The backslash will, however, show itself properly if the HTML content is saved from the browser and viewed with other software, or selected with the mouse and pasted into a text window. Welcome to the world of multilingual text processing!
The first exercise involves the least amount of work. Simply visit a web site with Japanese content using Netscape or Mozilla. Try these two sites:
You will probably already be seeing Japanese text. If your browser did not recognize Japanese content, then you may see text that looks like $B(%B???? and so forth. Such improperly displayed text is affectionately known as mojibake, or "ghost characters". See What Is Mojibake? for more information.
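As a rough illustration of how mojibake arises, the following sketch encodes a two-kanji word and then deliberately misreads the bytes. It assumes Python 3 and its bundled codecs (euc_jp, iso2022_jp), which are not among the tools this article installs; the variable names are made up for the example.

```python
# Sketch: mojibake is what you see when bytes are read with the wrong encoding.
# Python 3 and its standard codecs are an assumption of this example.

word = "\u6f22\u5b57"  # the two kanji that spell the word "kanji"

# EUC-JP bytes misread as ISO 8859-1 show up as accented Latin characters.
euc_as_latin1 = word.encode("euc_jp").decode("latin-1")

# ISO-2022-JP is pure 7-bit. If the ESC bytes are dropped or ignored,
# what remains is displayed as ordinary ASCII punctuation and letters.
iso_as_ascii = word.encode("iso2022_jp").decode("ascii").replace("\x1b", "")

print(euc_as_latin1)
print(iso_as_ascii)
```

The second result has the same "$B...(B" shape as the mojibake sample quoted above: the dollar sign, the B, and the parenthesis are the printable remains of escape sequences.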
To eliminate most mojibake from Netscape's display, make the following sequence of selections, starting from the top menu:
View / Character Set / Japanese (Auto-Detect).
For Mozilla, the sequence, again starting from the top menu, is
View / Character Coding / Auto-Detect / Japanese.
Occasionally, Auto-Detect fails, and it is necessary to select Japanese (Shift_JIS) or Japanese (EUC-JP) character sets manually.
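To get a feel for why auto-detection can fail, here is a minimal trial-decoding sketch. It is not how Netscape or Mozilla detect encodings; it is a simplified stand-in that assumes Python 3 and its standard codecs, and the function name guess_japanese_encoding is hypothetical.

```python
# Sketch of encoding detection by trial decoding (an illustration only,
# not the algorithm used by any browser).

def guess_japanese_encoding(data):
    """Return the first candidate codec that decodes data without error."""
    candidates = ["euc_jp", "shift_jis"]
    # ISO-2022-JP is 7-bit and marked by escape sequences, so try it first
    # when an ESC byte is present; otherwise its bytes would also pass as
    # plain ASCII under EUC-JP.
    if b"\x1b" in data:
        candidates.insert(0, "iso2022_jp")
    for codec in candidates:
        try:
            data.decode(codec)
            return codec
        except UnicodeDecodeError:
            continue
    return "unknown"


if __name__ == "__main__":
    print(guess_japanese_encoding(bytes.fromhex("b4c1bbfa")))
    print(guess_japanese_encoding(bytes.fromhex("8abf8e9a")))
```

Some byte sequences are valid in more than one encoding, so a guess of this kind can be wrong, particularly for short texts. That is the situation in which a manual selection is needed.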
English-speaking computer users are familiar with US-ASCII, a seven-bit coded character set mapped into the lower half of the 0-255 range of values in an eight-bit byte. You can remind yourself of the ASCII characters and their codes at any time by doing
The ISO 8859 character sets provide several extensions, making use of the upper half of the range of byte values to represent alphabets of many different nations. For more about ISO 8859, see The ISO 8859 Alphabet Soup.
For Japanese language content, there are several standard character sets, of which the most important for basic communication are the various revisions of JIS X 0208. The latest version, JIS X 0208:1997, consists of 6355 kanji and 524 non-kanji characters; these numbers are slightly different in earlier versions. The non-kanji characters include hiragana and katakana syllabaries, and Latin, Greek, and Cyrillic alphabets. The kanji characters in JIS X 0208 include all characters in two official lists compiled by the Japanese government, namely Joyo (常用) or "daily use" kanji, and Jinmei-yo (人名用) or "personal name" kanji. A genealogy of JIS X 0208 and related character sets can be found at JIS Character Sets.
Characters in JIS X 0208 are arranged in 94 rows, or "ku" (区), each row having 94 cells, or "ten" (点). A character may thus be indicated by its kuten (区点) value, a pair of decimal numbers in the range 1-94. For example, the first kanji in JIS X 0208 is 亜, which has kuten value 16-01, sometimes written simply as 1601.
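The kuten pair maps to the two-byte JIS code by adding 0x20 to each number, which moves both values into the printable ASCII range 0x21-0x7E. A small sketch of that arithmetic follows; the function names are hypothetical, and Python 3 is assumed.

```python
# Sketch: converting between kuten values and two-byte JIS codes.

def kuten_to_jis(ku, ten):
    """Return the 16-bit JIS code for a row (ku) and cell (ten)."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must each be in the range 1-94")
    # Each half of the code is the kuten number offset by 0x20.
    return ((ku + 0x20) << 8) | (ten + 0x20)


def jis_to_kuten(jis):
    """Return (ku, ten) for a 16-bit JIS code."""
    return (jis >> 8) - 0x20, (jis & 0xFF) - 0x20


if __name__ == "__main__":
    # Kuten 16-01 is the first kanji in JIS X 0208.
    print(hex(kuten_to_jis(16, 1)))
```

For kuten 16-01 this gives 0x3021.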
In order to make use of this character set, you must have one or more fonts which support it. NetBSD releases - in fact most free operating systems shipped with X11 - include a few fonts for JIS X 0208:1983. Additional Japanese fonts may be found in the fonts/jisx* entries of the NetBSD pkgsrc tree. You can see which JIS X 0208 fonts are available on your computer with
xlsfonts "*jisx0208*"
and view them with a command such as
xfd -fn "*jisx0208*" &
With Netscape, you may get improved readability by selecting
Edit / Preferences / Appearance / Fonts / For the Encoding: Japanese (jis x0208-1983)
and then setting both Variable and Fixed Width Fonts to Fixed (Misc) at largest possible size, e.g. 13.0.
An encoding for a character set is a way of representing text using that character set as a sequence of byte values. Japanese text using JIS X 0208 characters is not stored using kuten values. Instead, characters are mapped to two-byte codes using one of three encodings: ISO-2022-JP, EUC-JP, and Shift-JIS. These encodings, and kuten numbers, are closely related; interconversion is possible among them with fairly simple rules. Details are available at CODING.INF.
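The conversion rules can be sketched in a few lines. The arithmetic below is the commonly documented mapping from a two-byte JIS code to EUC-JP and Shift-JIS for JIS X 0208 characters only; the function names are hypothetical, Python 3 is assumed, and half-width katakana and other character sets are not handled.

```python
# Sketch: deriving EUC-JP and Shift-JIS bytes from a two-byte JIS code.
# Covers JIS X 0208 characters only.

def jis_to_euc(jis):
    """EUC-JP sets the high bit of each JIS byte."""
    return bytes([(jis >> 8) | 0x80, (jis & 0xFF) | 0x80])


def jis_to_sjis(jis):
    """Shift-JIS packs two JIS rows into one lead byte."""
    j1, j2 = jis >> 8, jis & 0xFF
    # Lead byte: rows are paired, then shifted past the range 0xA0-0xDF,
    # which Shift-JIS reserves for single-byte katakana.
    s1 = ((j1 - 0x21) >> 1) + 0x81
    if s1 > 0x9F:
        s1 += 0x40
    if j1 & 1:
        # Odd row: trail bytes start at 0x40 and skip 0x7F.
        s2 = j2 + 0x1F
        if s2 >= 0x7F:
            s2 += 1
    else:
        # Even row: trail bytes start at 0x9F.
        s2 = j2 + 0x7E
    return bytes([s1, s2])


if __name__ == "__main__":
    for jis in (0x3441, 0x3B7A):
        print(hex(jis), jis_to_euc(jis).hex(), jis_to_sjis(jis).hex())
```

The two sample codes, 0x3441 and 0x3B7A, are the kanji of the word "kanji".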
Of the sample Japanese websites named above, NetBSD's website is encoded with ISO-2022-JP, while Nikkei's uses Shift-JIS. Roughly, ISO-2022-JP is used for data interchange (email and such), EUC-JP is more common for internal processing, and Shift-JIS is seen at Microsoft installations.
A file containing only the two kanji characters spelling the word "kanji" (漢字) has the following hexadecimal byte values for the three principal encodings:
EUC-JP:      b4 c1 bb fa
ISO-2022-JP: 1b 24 42 34 41 3b 7a 1b 28 42
Shift-JIS:   8a bf 8e 9a
Note that with ISO-2022-JP, escape sequences are used to select character sets: 1b 24 42 for JIS X 0208-1983, and 1b 28 42 for ASCII at the end of the line, and that all byte values have zero for the highest order bit. The two-byte code used by ISO-2022-JP for a JIS X character, once the necessary escape sequence has been entered, is called the JIS code.
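The role of the escape sequences can be made concrete with a small splitter that walks an ISO-2022-JP byte stream and reports which character set each run of bytes belongs to. This is a sketch only: it knows just the two escape sequences named above (RFC 1468 defines others), the function name split_iso2022jp is hypothetical, and Python 3 is assumed.

```python
# Sketch: splitting an ISO-2022-JP byte stream into runs by character set.
# Only the two escape sequences discussed in the text are recognized.

ESCAPES = {
    b"\x1b$B": "JIS X 0208-1983",  # 1b 24 42
    b"\x1b(B": "ASCII",            # 1b 28 42
}


def split_iso2022jp(data):
    """Return a list of (charset, payload) pairs; the stream starts in ASCII."""
    segments = []
    charset = "ASCII"
    start = 0
    i = 0
    while i < len(data):
        seq = data[i:i + 3]
        if seq in ESCAPES:
            if i > start:
                segments.append((charset, data[start:i]))
            charset = ESCAPES[seq]
            i += 3
            start = i
        else:
            i += 1
    if start < len(data):
        segments.append((charset, data[start:]))
    return segments


if __name__ == "__main__":
    sample = bytes.fromhex("1b2442" "34413b7a" "1b2842")
    print(split_iso2022jp(sample))
    print(all(b < 0x80 for b in sample))  # every byte has a zero high bit
```

The payload of the JIS X 0208 run is the pair of JIS codes 0x3441 and 0x3B7A, which an ASCII display would show as the four characters 4A;z.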
Although there isn't a man page for the Japanese encodings, you can use xfd with any of the JIS fonts to view JIS codes for the characters. Here's a screenshot from xfd, showing the first kanji page of JIS X 0208, just after selecting the character 亜 with the mouse. The JIS code for the selected character is seen to be 0x3021:
For exhaustive detail on Japanese character sets and encodings, see Ken Lunde's book, listed below in the references. Specific information relating to Internet message encoding is contained in RFC1468.
The next step is to use Japanese text in a terminal session with typical UNIX-style command line processing. Install the following NetBSD packages:
pkgsrc/japanese/kterm
pkgsrc/misc/lv or pkgsrc/japanese/ja-less
pkgsrc/www/w3m or pkgsrc/www/lynx
To get some Japanese text on your computer, take either of the web pages from the previous section and save locally in text (not html) format. If you try to view one of these files in an xterm, you won't see Japanese characters. Open up a kterm instead. You can look at the downloaded web content with cat, head, and tail. For paging through Japanese text files, use jless or lv instead of less.
Kterm sometimes complains when you start it with messages of the form "Couldn't set locale:...". It is safe to ignore these warnings.
To view Japanese content in local files or on the web in text mode, you can use w3m or lynx. Open a kterm using
kterm -km euc &
then visit a web page with
lynx -display_charset=euc-jp http://www.jp.netbsd.org
The "-km euc" option tells kterm to expect display data in EUC-JP encoding. Configured this way, kterm can display ISO-2022-JP as well. To make the above command line options the defaults, you can add this line to ~/.Xresources:
KTerm*kanjiMode: euc
and restart X11 or do
xrdb -m ~/.Xresources
W3m is usually able to guess encodings on the fly; command line options are available when an override is needed. If you're using lynx, you may want to add this line to /usr/pkg/share/lynx/lynx.cfg:
Although jless, lv, and lgrep (part of lv's package) allow you to search for Japanese strings, and w3m and lynx allow entering Japanese text into an HTML form, you don't have a way of typing Japanese characters into these programs yet.
Japanese text entry is usually done with two additional software layers:
The first applications to be examined have their own input methods, so for now, just install the cannaserver kanji server package and the multilingual version of the vi editor, from these NetBSD packages:
Although several conversion servers are available, the current discussion is limited to cannaserver. Once you have used a kanji server, you will be impressed with it not only as an input utility but also as a great learning tool for dealing with thousands of kanji characters.
You can start cannaserver as a non-root user just by typing
/usr/pkg/sbin/cannaserver
The cannaserver package has instructions for starting it at boot time, as well as an rc.d startup script.
You are now ready to enter Japanese text. Start a kterm and open a new file:
nvi-euc-jp jptest
English text is entered as usual. Japanese text is entered after pressing the canna conversion key, or "cannakey". The default canna key for the nvi binary is Ctrl-O; however, the scripts nvi-euc-jp, nvi-iso-2022-jp, and nvi-sjis set the canna key to Ctrl-\ (Ctrl-backslash).
Let's start with two lines, one line of hiragana saying "konnichiwa" (hello), and a second line of kanji saying "sekai" (world). To begin entering Japanese text, type "i" to enter vi insert mode as usual, then type Ctrl-backslash. A hiragana "a" should appear in the lower left corner to indicate you're in hiragana mode. Type "konnnichiha" - note the transliterated (not phonetic) spelling. As you type, you will see first Roman characters, then hiragana as syllables are recognized. Here's a screenshot, just after typing the third "n":
After typing in the full word こんにちは, press Enter to end the clause, then Enter again to end the line.
On the second line, type "sekai". You will be looking at the hiragana (phonetic) spelling for the word. Press the spacebar to begin kanji conversion. The hiragana just typed is replaced with kanji, and the indicator at bottom left changes from [あ] to [漢字] to indicate the change from hiragana entry to kanji conversion mode.
If you press the spacebar again, you will see a list of alternate conversions, and the indicator changes to [一覧] to indicate list mode. Probably, the first choice was the right one and you will want to go back to it. Use Ctrl-b and Ctrl-f to move left and right among the choices. Here is a screenshot, after positioning the cursor over the first choice:
Press Enter to keep the desired conversion, Enter again to end the clause, Escape, and ":x" Enter to exit the editor. You can now cat your first Japanese text file.
Often, the initial kanji conversion offered by cannaserver is correct, and it will not be necessary to press the spacebar a second time for list mode. In this case, pressing Enter after the first spacebar keeps the first conversion offered.
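The first stage of the walkthrough above, in which Roman characters turn into hiragana as syllables are recognized, can be imitated with a toy longest-match lookup. This is only a sketch of the idea, not cannaserver's algorithm: the table holds just the syllables needed for the two example words, the names KANA and to_hiragana are hypothetical, and Python 3 is assumed. Kanji conversion, which needs a dictionary, is not attempted.

```python
# Toy sketch of romaji-to-hiragana transliteration by longest match.
# The table covers only "konnnichiha" and "sekai".

KANA = {
    "ko": "\u3053",   # hiragana ko
    "nn": "\u3093",   # hiragana n (typed as "nn")
    "ni": "\u306b",   # hiragana ni
    "chi": "\u3061",  # hiragana chi
    "ha": "\u306f",   # hiragana ha
    "se": "\u305b",   # hiragana se
    "ka": "\u304b",   # hiragana ka
    "i": "\u3044",    # hiragana i
}


def to_hiragana(romaji):
    """Convert romaji to hiragana, leaving unknown letters unchanged."""
    out = []
    i = 0
    while i < len(romaji):
        # Prefer the longest syllable that matches at this position.
        for size in (3, 2, 1):
            chunk = romaji[i:i + size]
            if chunk in KANA:
                out.append(KANA[chunk])
                i += len(chunk)
                break
        else:
            # No syllable recognized yet: keep the Roman letter as typed.
            out.append(romaji[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(to_hiragana("konnnichiha"))
    print(to_hiragana("sekai"))
```

Note how the three n's in "konnnichiha" split into "nn" followed by "ni", which is why the transliterated spelling needs the extra letter.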
From vi input mode, Ctrl-backslash will toggle between Japanese input and ASCII. Other keystrokes navigate among the choices for kanji conversion. It is also possible to enter kanji based on hexadecimal JIS code.
Many kanji are made up of smaller building blocks, known as radicals, of which there are some 214 officially recognized. It is common, but not universal, for kanji to be a combination of one figure for pronunciation, and another, its radical (or primary radical) for meaning. Cannaserver supports input by radical.
A brief list of the most common cannaserver commands appears in Appendix A of this article. Like so many of the programs presented here, it is best to start with just a few basic commands, then learn about advanced modes of operation gradually. For more information about cannaserver, see Craig Oda's JLinux tutorial mentioned in the references below.
The next two examples demonstrate email with Japanese language support, first reading mail from a POP server with cue, then from an IMAP server with gnus. For both cases, you should already have a working outbound email setup using, for example, sendmail or postfix. Also, as usual when experimenting with new email software, you should work with a test account rather than your normal working login.
Cue is a very fast and light-weight email client. It stores messages in the same manner as the MH mail utilities, usually relying on the MH "inc" command for fetching mail from the local spool or POP server. To try out email with cue, install the following NetBSD packages:
Login as the test user and open a kterm window. Make sure the kterm is in kanji mode "euc" as described above, as cue is hardcoded to euc-jp.
In the test user's home directory, create three files as shown:
.mh_profile
Path: Mail
Editor: nvi-euc-jp
Inc: -noapop -host your-pop-server-name

.cuerc
send: sendmail -t -i
editor: nvi-euc-jp +/^$/ %s/%s
initial_folder: +inbox
initial_window_size: 1/6
%refile

.netrc
machine your-pop-server-name login test-user-login password xxxxxx

Limit permissions on ~/.netrc, and create three directories in the test user's home:
chmod 600 .netrc
mkdir Mail
mkdir Mail/inbox
mkdir Mail/drafts
You're now ready to read and compose email in Japanese. Send a message or two to the test account - preferably something with Japanese content such as the sample file created above with nvi. Run
cue
from the shell prompt. Type "i" to incorporate messages from the server into your inbox folder. Press the spacebar to view the first new message.
To send a message from within cue, type "w". As soon as you have entered "To:" and "Subject:" header contents for the message, you will enter an nvi edit window for the message body, where the usual nvi + cannaserver input method applies. After you have composed your message and exited nvi, cue places you in the +drafts folder with the cursor at your latest message. To send the message, type "c".
There is no current manual page for cue. Fortunately, there is ample online help. Type "h" while running cue to view help; spacebar scrolls forward through help and backspace scrolls backward.
There is a sample configuration file for cue at
Cue's internal help file is in
/usr/pkg/share/doc/cue/cue.hlp
starting with version 20010917nb1 of the NetBSD package.
Note that the MH mail package installed above supports Japanese language processing. MH is a mail client system for the true command line diehard, consisting of a number of separate commands to be run from the shell. Although the O'Reilly book on MH is out of print, the content lives on electronically, and is actively maintained, at MH & nmh: Email for Users & Programmers
For persons comfortable with Emacs and XEmacs, the popular Mew POP client offers Japanese language support. A NetBSD package is available at pkgsrc/mail/mew.
While not as fast as cue, the GNU Emacs Gnus module offers IMAP support with autosorting of messages, a common interface for reading mail and news, and integration with the Emacs editing environment. LEIM (Libraries of Emacs Input Methods) adds multilingual support to the system (including postscript printing - see below).
Note XEmacs also offers multilingual support - there is simply not space to cover both of the major Emacs variants in one article.
Version 21.1 of GNU Emacs and LEIM is somewhat nicer to work with than version 20.7, but at the time of this writing there is no NetBSD package for the newer version. Installation instructions are given in Appendix B of this article.
After installing GNU Emacs and LEIM, create a .emacs file in the test user's home directory containing the following line
(set-language-environment "Japanese")
and a .gnus file containing the following single line with the hostname of your IMAP server:
(setq gnus-select-method '(nnimap "your-imap-server-name"))
Send an email to the test user, again, preferably something with Japanese content.
Start emacs with
emacs &
When the editor has started, type
M-x gnus
Enter the password at the prompt, and answer "n" when asked about storing the password for the session. When initial login to the IMAP server is complete, you should see a status message, "Checking new news...done". To bring the folder of new messages into view, type "jINBOX", Enter, "Sl1" and Enter.
You will now be able to view messages by pressing the spacebar, compose new messages with "m", reply to received messages with "r", and so forth. Exit the gnus module with "q". While composing a message, Ctrl-backslash will toggle Japanese text input, and spacebar during Japanese input will begin kanji conversion. Note the input method is not exactly the same as with nvi-m17n. For example, the ん character is entered using "n'" rather than "nn". A complete list of syllable inputs can be seen by checking help for the Emacs variable "quail-japanese-transliteration-rules".
Useful commands are also available under the menu system, under "Options / Mule (Multilingual Environment)".
Gnus is a huge program with hundreds of commands and options. However, you can get by quite well starting with a few basic commands, then adding others gradually as you find use for them. Excellent documentation is available both within Emacs (do C-h i and read the "Gnus" node) and at Gnus Network User Services.
If you have access to a news server, you may want to view postings in sci.lang.japan. The quickest way to do this is to start from the gnus *Group* buffer, then
Another Japanese-enabled Emacs extension with IMAP support is Wanderlust (NetBSD package pkgsrc/mail/wl). Persons looking for more GUI interaction in an IMAP client with Japanese support may want to look at Sylpheed (NetBSD package pkgsrc/mail/sylpheed). Both Wanderlust and Sylpheed, like cue, support MH format.
The applications discussed so far have their own input methods. It is also possible to use an external program providing the input method. One such program is kinput2, which is used in the dictionary lookup examples that follow.
First, install the package
Then, add the following to your ~/.Xresources (the character after "override" is a backslash):
KTerm*VT100.Translations: #override \
        Shift<Key>space: begin-conversion(_JAPANESE_CONVERSION)
Kinput2*conversionEngine: canna
restart X11 or do
xrdb -m ~/.Xresources
then start kinput2
/usr/X11R6/bin/kinput2 &
and open a kterm. For programs that support the kinput2 method, you can now toggle Japanese text entry with Shift-space.
(There's a slight chance cannaserver will offer something else as its first choice, as it will reorder alternatives based on previous selections.) Press Enter to keep the first choice offered, Shift-space to end Japanese input, and finally Enter to run the command.
Note that if you're running nvi-euc-jp in a kterm, and kinput2 is available, then you can use either the internal input method (Ctrl-backslash), or kinput2 (Shift-space) to enter Japanese text.
Note that kinput2 is not needed for simple cutting and pasting. For example, if you visit a Japanese-language website from lynx in another kterm window, you can paste words and phrases into the dictionary textarea.
A kterm connected to kinput2 and cannaserver allows you to enter Japanese characters into search strings for jless, lv, and lgrep. Simply use Shift-space to toggle Japanese text entry.
Select the mirror site closest to you. At the mirror site, select the upper left item in the array, "[tw.jpg]". Tab down to the big textarea under the prompt "Key or paste Japanese text in the box below". Press Shift-space to begin phonetic Japanese input, and enter "inu". Use kanji conversion and select the 犬 character. Here's a screenshot, just after selecting the desired character, showing the kinput2 window of kanji alternatives superimposed on the kterm window:
Close the input method and follow the link "Begin Translation". What common pet is an "inu"?
The previous example was somewhat artificial, in that you rarely know the pronunciation of a kanji until after you look it up. Usually, kanji are captured from another window and pasted into the dictionary.
pkgsrc/japanese/xjdic
then copy the dotfile to your home directory:
cp /usr/pkg/share/doc/xjdic/.xjdicrc $HOME
Invoke xjdic from a kterm window. At the main prompt, "XJDIC [1:edict] SEARCH KEY: ", type Shift-space, "arigatou" (ありがとう), Enter, Shift-space, and Enter again - note in this example you're looking up a word with hiragana. You should see that the English translation is "thank you". Press "n" to end the current dictionary lookup, and enter "you are welcome" at the main XJDIC prompt. You will see kanji and hiragana for the corresponding Japanese phrase. Exit xjdic by pressing Ctrl-D.
All techniques in this section require a print system that supports postscript (or ghostscript) printing.
First, if you have a Japanese-capable postscript printer, you can print with a2ps-j, which is found in NetBSD package
Usually, though, a non-native Japanese speaker will be using printers for which Japanese fonts are not already installed. The following sections give alternatives for this situation.
pkgsrc/graphics/xv
Bring up some Japanese text in a window, any sort of window. In a kterm (or xterm) window, run xv, right-click for the command menu, and use the "Grab" command to take a screenshot. You can then use xv to print the screenshot to any postscript printer as you would any graphic file.
Another approach that is about as gruesome as doing screen dumps is to visit a conversion server on the Internet with Netscape or Mozilla. After entering a URL for the text to be converted into the web form, you will see all the Japanese characters in your text filled in as graphic images, one by one. The result can of course be printed from your browser to any postscript-enabled printer.
Now suppose you have installed cnprint, and have a euc-jp-encoded file, "/tmp/x.euc", and you want to print it. (It is easy enough to create such a file with nvi, configured as above.) Issue the following command:
cd ~/cnprint
./cnprint -w -euc -o=/tmp/out.ps /tmp/x.euc
The resulting file can be printed on any postscript-capable printer.
Download the jiskan24 bdf font, for example here, and uncompress into directory /usr/pkg/share/fonts/bdf. In your ~/.emacs file, put the lines
(setq ps-multibyte-buffer 'bdf-font-except-latin)
(setq bdf-directory-list (list "/usr/pkg/share/fonts/bdf"))
You can now use the usual ps-print options from the command buffer to print text containing Japanese characters. If you're reading email with gnus and the cursor is in the Summary window, you can print the current message with "A P".
Further information on multilingual printing from GNU Emacs can be found by looking at the source; see /usr/pkg/share/emacs/21.1/lisp/ps-mule.el.
Certainly there is no shortage of related topics to explore. Here are a few suggestions.
Here's a canna input quick reference. The portion of text that may be replaced during conversion is called the "current clause"; the vertical markers that indicate its beginning and end are the "fence".
Note that several of the commands (Ctrl-n, Ctrl-p, Ctrl-f, Ctrl-b, Ctrl-g) are reminiscent of Emacs. As is often the case, it is beneficial to have a working familiarity with native editing commands for both the major editors, vi and Emacs.
First, install packages for prerequisite libraries:
pkgsrc/x11/Xaw3d
pkgsrc/graphics/xpm
pkgsrc/graphics/jpeg
pkgsrc/graphics/tiff
pkgsrc/graphics/libungif
pkgsrc/graphics/png
Download emacs-21.1.tar.gz and leim-21.1.tar.gz from your nearest ftp.gnu.org mirror and extract first emacs, then leim, into the same directory. The top of the source tree will be a path ending in emacs-21.1. Here's one way to do it:
mkdir ~/build-emacs
cd ~/build-emacs
tar -xzf .../emacs-21.1.tar.gz
tar -xzf .../leim-21.1.tar.gz
Create a subdirectory for compiling the editor and configure with the following options from this directory:
mkdir obj
cd obj
../emacs-21.1/configure --with-pop --with-x --with-ipv6 \
    --prefix=/usr/pkg \
    --x-includes=/usr/X11R6/include:/usr/pkg/include \
    --x-libraries=/usr/X11R6/lib:/usr/pkg/lib \
    --with-xpm --with-jpeg --with-tiff --with-gif --with-png \
    --srcdir=../emacs-21.1
At the end of its run, the configure script should summarize results as follows:
What operating system and machine description files should Emacs use?
        `s/netbsd.h' and `m/intel386.h'
What compiler should emacs be built with?               gcc -g -O2
Should Emacs use the GNU version of malloc?             yes
Should Emacs use a relocating allocator for buffers?    yes
Should Emacs use mmap(2) for buffer allocation?         no
What window system should Emacs use?                    x11
What toolkit should Emacs use?                          LUCID
Where do we find X Windows header files?                /usr/X11R6/include:/usr/pkg/include
Where do we find X Windows libraries?                   /usr/X11R6/lib:/usr/pkg/lib
Does Emacs use -lXaw3d?                                 yes
Does Emacs use -lXpm?                                   yes
Does Emacs use -ljpeg?                                  yes
Does Emacs use -ltiff?                                  yes
Does Emacs use -lungif?                                 yes
Does Emacs use -lpng?                                   yes
Does Emacs use X toolkit scroll bars?                   yes
After configuration is complete, do
make
su
make install
If a cnprint package is available at The NetBSD Packages Collection, you should install it instead of using the instructions below.
Start by creating a work area for the build.
cd
mkdir cnprint
cd cnprint
Go to CAI's Software Page. Select the "Download CNPRINT" link, and from there download all files for the latest version of cnprint. The links are labeled
release note
cnprint330b.c
ttfb330b.c
cnprint330b.hlp
cnprint33.cmd
cnprint.afl
helvet.dat
You will also need kanji hbf fonts. At present, these are obtained by downloading this archive.
Extract the font files and place kanji48.hbf and kanji48.bin in the cnprint work directory. Create file "cnprint.cmd" in the cnprint directory, containing the single line: