SKF(1) SKF(1)
NAME
skf - simple Kanji Filter (v1.93)
SYNOPSIS
skf [-AEIJKNQRSXZabdehjknqrsuvxz] [ long_format_options ]
[infiles..]
DESCRIPTION
skf is yet another i18n-capable kanji filter, designed
for reading various CJK-coded files on the Net. It converts
input kanji texts or streams into a character stream in the
designated codeset and writes the result to standard output.
skf is designed to be a versatile filter for reading
documents in various code sets, and it has no fancy features
that are not directly related to code conversion.
Like nkf, skf automatically recognizes the input file code
when it is an ISO-2022 compliant code, and also detects
EUC-variant codes if the input file is Japanese text
without X0201 kana. skf 1.9x can read various ISO-2022
compliant charsets, including JIS Kanji code (X0208, X0212
and X0213), EUC encodings (euc-jp (with X-0213 support),
euc-cn, euc-kr and euc-tw), ISO European Latin sets
(ISO-8859-1 to 11, 13/14/15/16), BS 4730, NF Z 62-010 and
X0201 kana with ESC-(-I, SS0, Locking shift. skf also
supports some non-ISO-2022 compliant sets, including
Microsoft Shift-JIS code, KOI-8-R/U, GB2312 (HZ), big5,
VISCII (rfc1456, including VIQR), the Unicode standard
(UCS2/UTF-16, UTF7 and UTF8), some MS codesets (cp1250
etc.) and some other vendor-specific codes (KEIS83, JEF
etc.).
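The automatic recognition of ISO-2022 input described above
relies on escape sequences that designate character sets. The
following Python sketch is not skf's implementation; it only
illustrates the idea by scanning a byte stream for a few
well-known designation sequences. The table DESIGNATIONS and
the function find_designations are hypothetical names
introduced for this illustration.

```python
# Escape sequences that designate a character set in ISO-2022-JP style
# streams. This is a small illustrative subset, not skf's internal table.
DESIGNATIONS = {
    b"\x1b(B": "ASCII",
    b"\x1b(J": "JIS X 0201 Roman",
    b"\x1b(I": "JIS X 0201 Katakana",
    b"\x1b$@": "JIS X 0208-1978",
    b"\x1b$B": "JIS X 0208-1983",
    b"\x1b$(D": "JIS X 0212-1990",
}


def find_designations(data: bytes) -> list[str]:
    """Return the names of known designation sequences, in stream order."""
    found = []
    pos = data.find(b"\x1b")
    while pos != -1:
        for seq, name in DESIGNATIONS.items():
            if data.startswith(seq, pos):
                found.append(name)
                break
        pos = data.find(b"\x1b", pos + 1)
    return found


if __name__ == "__main__":
    # Python's own iso2022_jp codec is used here only to build sample input.
    sample = "日本語 text".encode("iso2022_jp")
    print(find_designations(sample))
```

A stream with no escape sequences yields an empty list, which
is why text without designations (for example EUC or
Shift-JIS) needs a different, statistical kind of detection.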
Supported output codesets include X-0208/X-0212/X-0213
JIS, X-0201 JIS, ASCII, Microsoft Shift-JIS, EUC-
jp/-kr/-cn, HZ, iso-2022-jp/kr, big5, VISCII and Unicode.
skf also provides some basic decoding features for some
common encodings (MIME, Punycode and URI codepoint).
Unlike nkf, skf is designed to convert input code into
some kind of human-readable form under a local environment
(i.e. codeset), and has several extra conversion features
like GNU recode. Such conversions include Windows/Macin-
tosh specific code swap and old-new jis glyph change,
html-format/TeX format conversion and variant unifica-
tions.
If file names are given, skf reads the files and writes the
converted stream to stdout. If no file names are given,
input is taken from stdin and output goes to stdout. Options
are taken from the environment variables SKFENV and skfenv
and from the command line, in this order. Environment
variables are not used when skf is running as a privileged
user. skf does not use locale-related environment variables
for conversion, but error message output is controlled by
the given locale settings.
OPTIONS
skf-1.9 is written from scratch and inherits no code from
nkf. However, skf is intended to be a drop-in replacement
for nkf (v1.4) and supports a similar set of the commonly
used nkf options.
skf 1.9x recognizes the following options. All are off by
default unless explicitly specified.
buffering control
-b use buffered output. This is default.
-u use unbuffered output. This option impairs the code
detection feature.
Input/Output codeset options
--ic=input_code_set
Specify input_code_set as the input codeset. Possible
candidates are shown below.
--oc=output_code_set
Specify output_code_set as the output codeset. Possible
candidates are shown below. The default codeset in the
distribution package is euc-jp, but this depends on
compile options. Default codeset is shown by
Supported codeset
skf recognizes the following codesets as input/output
codesets. Codeset names are case insensitive. Note that
ISO-2022 escape-based input codesets (registered with IANA)
are recognized automatically, and for this reason some
codesets are treated as the same when specified as input.
In the in column, o means the named codeset can be specified
as input and x means it cannot. The out column is the same,
but for output.
in  out  name              description
o   o    iso8859-1         ascii + iso-8859-1 (latin-1)
o   o    iso8859-2         ascii + iso-8859-2 (latin-2)
o   o    iso8859-3         ascii + iso-8859-3 (latin-3)
o   o    iso8859-4         ascii + iso-8859-4 (latin-4)
o   o    iso8859-5         ascii + iso-8859-5 (Cyrillic)
o   o    iso8859-6         ascii + iso-8859-6 (Arabic)
o   o    iso8859-7         ascii + iso-8859-7 (Greek)
o   o    iso8859-8         ascii + iso-8859-8 (Hebrew)
o   o    iso8859-9         ascii + iso-8859-9 (latin-5)
o   o    iso8859-10        ascii + iso-8859-10 (latin-6)
o   o    iso8859-11        ascii + iso-8859-11 (Thai)
o   o    iso8859-13        ascii + iso-8859-13 (Baltic Rim)
o   o    iso8859-14        ascii + iso-8859-14 (Celtic)
o   o    iso8859-15        ascii + iso-8859-15 (Latin-9)
o   o    iso8859-16        ascii + iso-8859-16
o   o    koi-8r            koi-8r (Russian)
o   o    cp1251            Cyrillic latin MS cp1251
o   o    jis               iso-2022-jp (rfc1468 7bit JIS)
o   o    jis-x0213         iso-2022-jp-3 (JIS X-0213(2000))
o   o    jis-x0213-strict  iso-2022-jp-3-strict
o   o    jis-x0213-2004    iso-2022-jp-2004 (JIS X-0213(2004))
o   o    oldjis            iso-2022-jp-1978 (JIS X-0208(1978))
o   o    euc-jp            EUC-encoded JIS X-0208(1997)
o   o    euc-x0213         EUC-encoded JIS X-0213(2000)
o   o    euc-jis-2004      EUC-encoded JIS X-0213(2004)
o   o    euc-kr            EUC-encoded KS X-1001 Korean
o   o    euc7-kr           7bit EUC-encoded KS X-1001 Korean
o   o    johab             KS X-1001-johab Korean
o   o    euc-cn            EUC-encoded GB2312 chinese
o   o    euc7-cn           7bit EUC-encoded GB2312 chinese
o   o    hz                HZ-encoded GB2312 chinese
o   o    euc-tw            EUC-encoded CNS 11643 chinese
o   o    gb12345           EUC-encoded GB12345 chinese
o   o    gbk               GB2312 Extension (cp936)
o   o    big5              BIG5 (with Eten extension + EURO)
o   o    big5-cp950        BIG5 (Microsoft cp950 + EURO)
o   o    sjis              Shift-jis (Microsoft cp943)
o   o    sjis-x0213        Shift-jis-encoded JIS X-0213(2000)
o   o    sjis-x0213-2004   Shift-jis-encoded JIS X-0213(2004)
o   x    sjis-cellular     Shift-jis-encoded JIS X-0208
                           with NTT Docomo, Vodafone phone glyph
o   o    cp932             Shift-jis-encoded MS cp932
o   o    viscii            VISCII (rfc1456) Vietnamese
o   o    viqr              VISCII (rfc1456-VIQR) Vietnamese
o   o    keis              Hitachi KEIS83/90
o   x    jef               Fujitsu JEF (basic support only)
o   o    ucs2              Unicode(TM) UCS-2/UTF-32LE
o   o    utf7              Unicode(TM) UTF-7
o   o    utf8              Unicode(TM) UTF-8
o   x    transparent       Transparent mode (see below)
Codeset explanations
iso-8859-*
a.k.a. latin*. When specified as output, G0 = GL is
ascii and G1 = GR is iso-8859-*. 8bit encoding is
used.
iso-2022-jp, jis
Encoding is iso-2022-jp-2 (RFC1554). G0 = GL is JIS
x0201 roman, G1 = GR is JIS x0201 kana, G2 is
iso-8859-1 and G3 is JIS x0212 Supplementary Kanji.
jis-x0213
Encoding is iso-2022-jp-3. For output, G0 = GL is JIS
x0201 roman, G1 = GR is JIS x0201 kana, G2 is
iso-8859-1 and G3 is JIS x0213 plane2 Kanji.
jis-x0213-strict
Encoding is subset of iso-2022-jp-3-strict (uses
Plane 1 only). For output, G0 = GL is JIS x0201
roman, G1 = GR is JIS x0201 kana, G2 is iso-8859-1
and G3 is not set. Output code as JIS x0208 when-
ever possible. JIS X-0213 input is automatically
recognized.
jis-x0213-2004
Encoding is iso-2022-jp-2004. For output, G0
= GL is JIS x0201 roman, G1 = GR is JIS x0201 kana,
G2 is iso-8859-1 and G3 is JIS x0213 plane2 Kanji.
oldjis Encoding is iso-2022-jp (JIS X-0208(1978)). G0 = GL
is JIS x0201 roman, G1 = GR is JIS x0201 kana, G2
is iso-8859-1 and G3 is JIS x0212 Supplementary
Kanji.
euc-jp, euc
Encoding is 8-bit EUC using JIS X0208(1997) charac-
ter set. G0 = GL is ascii, G1 = GR is JIS x0208,
G2 is JIS x0201 kana and G3 is JIS x0212 Supplemen-
tary Kanji.
euc-x0213
Encoding is 8-bit EUC-based JIS X0213(2000). G0 =
GL is ascii, G1 = GR is X0213 plane 1, G2 is
iso-8859-1 and G3 is JIS x0213 plane2 Kanji.
euc-jis-2004
Encoding is 8-bit EUC-based JIS X0213(2004). G0 =
GL is ascii, G1 = GR is X0213(2004) plane 1, G2 is
iso-8859-1 and G3 is JIS x0213 plane2 Kanji.
euc-kr Encoding is 8-bit EUC using KS X-1001 Wansung
character set. G0 = GL is KS X1003, G1 = GR is KS
X1001, G2 and G3 are not set.
euc7-kr, iso-2022-kr
Encoding is iso-2022-kr (rfc1557): 7-bit EUC using
KS X-1001 Wansung character set. G0 = GL is KS
X1003, G1 is KS X1001, G2 and G3 are not set.
euc-cn Encoding is 8-bit EUC using GB 2312 character set.
G0 = GL is GB1988, G1 = GR is GB2312, G2 and G3 are
not set.
euc7-cn
Encoding is 7-bit EUC using GB 2312 character set.
G0 = GL is GB1988, G1 is GB2312, G2 and G3 are not
set.
hz Encoding is HZ encoded (rfc1842) GB 2312 character
set. G0 = GL is GB1988, G1 = GR is GB2312, G2 and
G3 are not set.
euc-tw Encoding is EUC encoded CNS11643 Plane1/2. Subset
of iso-2022-cn. G0 = GL is ascii, G1 = GR is
CNS11643 plane 1, G2 is CNS11643 plane 2 and G3 is
not set.
gb12345
Encoding is 8-bit EUC using GB 12345 (GBF) character
set. G0 = GL is GB1988, G1 = GR is GB12345, G2
and G3 are not set.
gbk Encoding is GBK (a.k.a. cp936). G0 = GL is GB1988
and G1 = GR is GBK. G2 and G3 are not set.
big5 Encoding is Big5 with ETen extension. Includes Euro
mapping. Uses ascii as latin part.
big5-cp950
Encoding is Big5 (cp950) character set. Uses ascii
as latin part.
VISCII (experimental)
Vietnamese VISCII (rfc1456). Not TCVN-5712.
VIQR (experimental)
Vietnamese VISCII with VIQR encoding (rfc1456).
sjis Encoding is Shift-encoded JIS X0208(1997) character
set. Note this is not cp932. Uses JIS x-0201 latin
as latin(GL) part.
sjis-x0213
Encoding is Microsoft JIS using JIS X0213(2000)
character set.
sjis-x0213-2004
Encoding is Microsoft JIS using JIS X0213(2004)
character set. 10 newly defined characters are added,
but the Unicode mapping is the same as JIS X0213(2000).
Uses JIS x-0201 latin as latin(GL) part.
sjis-cellular (experimental)
Encoding is Shift-encoded JIS X0208(1997) character
set with NTT Docomo/Vodafone cellular phone glyph
mapping.
cp932 Encoding is Microsoft SJIS cp932 with NEC/IBM gaiji
area. Uses JIS x-0201 latin as latin(GL) part.
johab Encoding is KS X1001(Johab). Uses KS X1003 latin as
latin(GL) part.
ucs2 Encoding is Unicode UTF-16 (v4.0). Input/Output
default byte-endian is little, and input byte order
mark is recognized. Output includes endian mark by
default unless --disable-endian-mark is specified.
Output range is within UTF-32 with surrogate pair
unless --limit-to-ucs2 is specified.
utf8 Encoding is UTF-8 encoded Unicode (v4.0). Output
doesn't include byte order mark unless
--enable-endian-mark is specified. Output range is
within UTF-32 unless --limit-to-ucs2 is specified.
utf7 Encoding is UTF-7 encoded Unicode (v4.0). Output
range is limited to UTF-16, and values above U+10000
are regarded as undefined.
keis (experimental)
Encoding is Hitachi KEIS83/90. Output range is lim-
ited to EBCDIK and JIS X-0208 area.
jef (experimental)
Encoding is Fujitsu JEF. Only basic part is sup-
ported.
koi8r Russian KOI-8R code.
cp1250 Central European latin MS cp1250 code.
cp1251 Eastern European cyrillic MS cp1251 code.
transparent
Transparent mode. Various code control features,
including folding and line end code conversion, are
ignored.
Shortcuts
-n -j same as --oc=jis
-s -x same as --oc=sjis
-a -e same as --oc=euc-jp
-q same as --oc=ucs2
-z same as --oc=utf8
-y same as --oc=utf7
-k same as --oc=keis
-A, -E same as --ic=euc-jp. Assume input code set is EUC-
JP.
-N same as --ic=jis. Assume input code set is
iso-2022-jp.
-S, -X same as --ic=sjis. Assume input code set is
Microsoft JIS.
-Q same as --ic=ucs2.
-Y same as --ic=utf7.
-Z same as --ic=utf8.
-K same as --ic=keis.
ISO-2022 Specific controls
These options replace G0-G3 with the assigned character set,
after the planes have been set up according to the specified
input codeset.
--set-g0=`charset name'
Predefine specified code set to plane 0 (G0). Also
set to GL at initial state.
--set-g1=`charset name'
Predefine specified code set to right plane (G1).
Also set to GR at initial state.
--set-g2=`charset name'
Predefine specified code set to right plane (G2).
--set-g3=`charset name'
Predefine specified code set to right plane (G3).
The supported charset names are as follows. 'o' means the
charset can be assigned to the plane; 'x' means it cannot.
g0  g1  g2  g3  codeset name  description
o   o   o   o   ascii         ANSI X3.4 ASCII
o   o   o   o   x0201         JIS X 0201 (latin part)
x   o   o   o   iso8859-1     ISO 8859-1 latin
x   o   o   o   iso8859-2     ISO 8859-2 latin
x   o   o   o   iso8859-3     ISO 8859-3 latin
x   o   o   o   iso8859-4     ISO 8859-4 latin
x   o   o   o   iso8859-5     ISO 8859-5 Cyrillic
x   o   o   o   iso8859-6     ISO 8859-6 Arabic
x   o   o   o   iso8859-7     ISO 8859-7 Greek-latin
x   o   o   o   iso8859-8     ISO 8859-8 Hebrew
x   o   o   o   iso8859-9     ISO 8859-9 latin
x   o   o   o   iso8859-10    ISO 8859-10 latin
x   o   o   o   iso8859-11    ISO 8859-11 Thai
x   o   o   o   iso8859-13    ISO 8859-13 latin
x   o   o   o   iso8859-14    ISO 8859-14 latin
x   o   o   o   iso8859-15    ISO 8859-15 latin
x   o   o   o   iso8859-16    ISO 8859-16 latin
x   o   o   o   tcvn5712      TCVN 5712 (Vietnamese)
x   o   o   o   ecma113       ECMA 113 Cyrillic
o   o   o   o   x0212         JIS X-0212(1990)
o   o   o   o   x0208         JIS X-0208(1990)
o   o   o   o   x0213         JIS X-0213 Plane 1(2000)
o   o   o   o   x0213-2       JIS X-0213 Plane 2(2000)
o   o   o   o   x0213n        JIS X-0213 Plane 1(2004)
o   o   o   o   gb2312        Simplified Chinese GB2312
o   o   o   o   gb1988        Chinese GB1988(latin)
o   o   o   o   gb12345       Traditional Chinese GB12345
o   o   o   o   ksx1003       Korean KS X 1003(latin)
o   o   o   o   ksx1001       Korean KS X 1001
x   o   o   o   koi8-r        Cyrillic KOI-8R
x   o   o   o   koi8-u        Ukrainian Cyrillic KOI-8U
o   o   o   o   cns11643      Traditional Chinese CNS11643
x   o   o   o   viscii-r      RFC1456 VISCII (right plane)
o   o   o   o   viscii-l      RFC1456 VISCII (left plane)
o   o   o   o   vni           Vietnamese VNI
x   o   o   o   cp437         Microsoft cp437 (US latin)
x   o   o   o   cp737         Microsoft cp737
x   o   o   o   cp775         Microsoft cp775
x   o   o   o   cp850         Microsoft cp850
x   o   o   o   cp852         Microsoft cp852
x   o   o   o   cp855         Microsoft cp855
x   o   o   o   cp857         Microsoft cp857
x   o   o   o   cp860         Microsoft cp860
x   o   o   o   cp861         Microsoft cp861
x   o   o   o   cp862         Microsoft cp862
x   o   o   o   cp863         Microsoft cp863
x   o   o   o   cp864         Microsoft cp864
x   o   o   o   cp865         Microsoft cp865
x   o   o   o   cp866         Microsoft cp866
x   o   o   o   cp869         Microsoft cp869
x   o   o   o   cp874         Microsoft cp874
x   o   o   o   cp932         Microsoft cp932 (Japanese)
x   o   o   o   cp1250        Microsoft cp1250 (Central Europe)
x   o   o   o   cp1251        Microsoft cp1251 (Cyrillic)
x   o   o   o   cp1252        Microsoft cp1252 (Latin-1)
x   o   o   o   cp1253        Microsoft cp1253 (Greek)
x   o   o   o   cp1254        Microsoft cp1254 (Turkish)
x   o   o   o   cp1255        Microsoft cp1255
x   o   o   o   cp1258        Microsoft cp1258
--euc-protect-g1
In EUC input mode, suppress sequences to set a
charset to G1. Such sequences are discarded.
--add-annon
Add announcer for JIS X-0208(1990) to X-0208 desig-
nate sequence. This option works only with
iso-2022-based output.
--disable-jis90
Disable 2 added characters of JIS X-0208(1990). If
this option is specified, these two characters are
replaced by Kanji variants. This option is off by
default.
--input-detect-jis78
Distinguish the JIS X-0208(1978) codeset from the JIS
X-0208(1983/90) codeset. By default, both charsets are
regarded as X-0208(1983/90). This option is valid only
when the input encoding is JIS (ISO-2022).
JIS X-0212(Supplement Kanji code) Support
--x0212-enable
By default, skf does not output JIS X-0212 code.
This option enables use of the JIS X-0212 part. The
output code set must be neither a Microsoft code nor
KEIS. For Unicode variant encodings, this option is on
by default. This option is supported for backward
compatibility and may not be supported in future
versions.
Unicode coding specific control options
--use-compat
When output is one of the transformation formats of the
Unicode standard, enable characters in the compatibility
plane (0xfxxx). If disabled, these characters are
converted to variants or treated as undefined.
--use-ms-compat
When output is Unicode, make the translation Microsoft
Windows compatible (i.e. cp932). This only affects
some symbols in JIS Kanji, and adding the --use-compat
option is recommended.
--use-cde-compat
When output is Unicode, make translation CDE stan-
dard codeset compatible.
--little-endian
When output is Unicode, use little endian byte-
order. This is default.
--big-endian
When output is Unicode, use big endian byte-order.
--disable-endian-mark
When output is UTF-16, do not use byte order mark-
ing. To make UTF-16N, use this option with --lit-
tle-endian. This is off by default.
--enable-endian-mark
When output is UTF-8, output byte order marking.
This is off by default.
--input-little-endian
When input is Unicode, assume input is little
endian byte-ordered. This is default, but skf
respects byte-order mark.
--input-big-endian
When input is Unicode, assume input is big endian
byte-ordered. Note that skf respects byte-order
mark.
--endian-protect
Do not use endian mark in the input stream. Endian
mark is just discarded. This is off by default.
--use-replace-char
skf by default converts undefined (except 0x2xxx
part) characters into "geta (U+3013)" code in
Japanese codeset. This option specifies skf to use
replacement char (U-fffc) instead.
--limit-to-ucs2
Do not use > 0x10000 area code in Unicode (i.e.
limit code to ucs2 area). This is off by default.
--disable-cjk-extension
Treat CJK extension A/B area as undefined. This is
off (i.e. these areas are enabled) by default.
--old-hangul-location
Treat input U-3400 area as hangul (Unicode 1.0 com-
patibility). This is off by default.
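To make the byte-order and range options above more concrete,
here is a short Python sketch of the behaviors they describe:
writing UTF-16 with or without a byte order mark, honoring a
mark found in the input, and checking whether text stays
inside the UCS2 range. It uses Python's standard codecs rather
than skf, and the function names encode_utf16,
input_byte_order and within_ucs2 are hypothetical.

```python
import codecs


def encode_utf16(text: str, big_endian: bool = False,
                 endian_mark: bool = True) -> bytes:
    """Encode text as UTF-16; little endian with a mark by default."""
    codec = "utf-16-be" if big_endian else "utf-16-le"
    bom = codecs.BOM_UTF16_BE if big_endian else codecs.BOM_UTF16_LE
    return (bom if endian_mark else b"") + text.encode(codec)


def input_byte_order(data: bytes, assumed: str = "little") -> str:
    """A byte order mark in the input wins over the assumed order."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    return assumed


def within_ucs2(text: str) -> bool:
    """True if no character needs a surrogate pair (all below U+10000)."""
    return all(ord(ch) < 0x10000 for ch in text)


if __name__ == "__main__":
    # U+20000 lies in CJK extension B and needs a surrogate pair.
    print(encode_utf16("\U00020000", endian_mark=False).hex())
    print(within_ucs2("\U00020000"))
```

Characters outside the UCS2 range are the ones affected by
--limit-to-ucs2 and --disable-cjk-extension; in UTF-16 they
occupy four bytes instead of two.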
Codeset/Vendor Specific codeset handling flags
By default, skf assumes that machine-specific parts of kanji
code are Microsoft Windows compatible. The following options
control this behavior. Options in this category are valid
when the output codeset is a Japanese codeset, except
--disable-chart.
--use-apple-gaiji
Assume machine specific part in input file is Mac-
intosh (System 7,8,9 or OS X) compatible.
--disable-ibm-gaiji
Disable machine specific part in input file.
--disable-chart
Do not use Moji-keisen characters. This is for old
Macintosh system (System 6.x or older) compatibil-
ity.
Miscellaneous codeset related options
--old-nec-compat
Enable old NEC kanji sequence (ESC-K,H). Needs com-
pile option --enable-oldnec at configuration.
--no-utf7
Assume input code set is *NOT* UTF-7 encoded Uni-
code. This option disables input utf7 testing.
--no-kana
Assume input code set does *NOT* include JIS x0201
kana. Also suppresses Unicode half width variants.
OUTPUT Conversions options
skf has various features to fit the output file to the local
environment, and many of these are controlled by the extended
control switches described in this section.
--use-g0-ascii
set G0(=GL) for output encoding to ASCII, ignoring
codeset designation.
X-0201 Kana/latin conversions
By default, skf converts X-0201 kana to X-0208 kana. To
output X-0201 kana as it is, use one of the following
options. When output is designated as EUC or SJIS, these
options enable X-0201 kana output in the way provided by
each code set. When Unicode output is specified, output of
the equivalent kana part is controlled by --use-compat, not
by the following switches. These options are valid only when
the output codeset is a non-Unicode Japanese codeset.
--kana-jis7
use SI/SO locking shift sequence to designate
X-0201 kana.
--kana-jis8
output X-0201 kana using 8-bit code right plane.
--kana-esci --kana-call
use ESC-(-I to designate X-0201 kana.
--kana-enable
use X-0201 kana when EUC (with G2) or SJIS output
code is used. When JIS output, it is same as
--kana-call.
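The default conversion of X-0201 (half-width) kana to X-0208
(full-width) kana can be pictured in Unicode terms. The
sketch below is an approximation and not skf's algorithm: it
applies Unicode NFKC normalization only to runs of half-width
katakana (U+FF61 to U+FF9F), which also merges a following
voiced sound mark into the base kana. The name widen_kana is
hypothetical.

```python
import re
import unicodedata

# Runs of half-width katakana and related punctuation (U+FF61..U+FF9F).
HALFWIDTH_KANA = re.compile("[\uff61-\uff9f]+")


def widen_kana(text: str) -> str:
    """Replace half-width kana runs with their full-width equivalents."""
    return HALFWIDTH_KANA.sub(
        lambda m: unicodedata.normalize("NFKC", m.group()), text
    )


if __name__ == "__main__":
    print(widen_kana("\uff76\uff9e\uff72\uff84\uff9e"))
```

Restricting the normalization to the half-width kana block
keeps other compatibility characters, such as full-width
Latin letters, untouched.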
URI/TeX conversion feature options
With Unicode(tm) family output codings, skf outputs the
non-ascii latin character part as it is, but with other
output codings, skf converts these characters using the
following rules:
(1) If the code is defined in the specified output codeset,
it is output in this codeset.
(2) If one of the following html convert modes is enabled
and the code is defined in the html/sgml codeset, it is
converted to an entity reference or a codepoint reference.
(3) If tex convert mode is enabled and the code is defined
in the tex codeset, it is converted to tex format.
(4) If the code is a kind of combined ligature, it is shown
as a set of characters.
(5) Otherwise, a replacement character is shown, with a
warning.
--convert-html --convert-sgml
Enable html convert mode. This mode is cleared by
--reset. These two options are synonyms, and are
treated as same option.
--convert-html-decimal
Enable html code-point decimal convert mode. This
mode is cleared by --reset.
--convert-html-hexadecimal
Enable html code-point hexadecimal convert mode.
This mode is cleared by --reset.
--convert-tex
Enable TeX convert mode. This mode is cleared by
--reset.
--use-iso8859-1
Enable iso-8859-1 output. Iso-8859-1 is invoked to
G1 and set to GR plane.
--use-iso8859-1-right
Enable 7-bit iso-8859-1 output. Iso-8859-1 is
invoked to G1 plane.
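Rules (1) and (2) above, together with the three html convert
modes, can be sketched as follows. This is an illustration
only, using Python's standard html.entities table in place of
skf's html/sgml codeset; the function to_output and its mode
names are hypothetical, and the TeX, ligature and replacement
rules are left out.

```python
from html.entities import codepoint2name


def to_output(text: str, codec: str = "ascii", mode: str = "entity") -> str:
    """Keep characters the output codec can represent; escape the rest.

    mode is "entity" (named reference when one exists), "decimal"
    or "hex" (numeric code-point references).
    """
    out = []
    for ch in text:
        try:
            ch.encode(codec)          # rule (1): defined in output codeset
            out.append(ch)
            continue
        except UnicodeEncodeError:
            pass
        cp = ord(ch)
        if mode == "entity" and cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")   # rule (2), named form
        elif mode == "hex":
            out.append(f"&#x{cp:x};")
        else:
            out.append(f"&#{cp};")
    return "".join(out)


if __name__ == "__main__":
    for mode in ("entity", "decimal", "hex"):
        print(mode, to_output("caf\u00e9", mode=mode))
```

With an output codec that already contains the character
(for example latin-1 for U+00E9), the text passes through
unchanged, which mirrors rule (1) taking precedence.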
Encoding control options
--decode=`encoding scheme'
Specify the encoding scheme of the input stream. Supported
encoding scheme names include `hex', 'mime', 'mime_q',
'mime_b', 'uri_encode', 'puny', 'hex_perc_encode'.
The decodings provided are CAP hex-code, mime, mime
Q-encoding, mime B-encoding, uri character reference,
ACE punycode, uri percent notation, base64, Q-encoding,
rfc2231 and rot13/47; the names above correspond to the
first seven of these, in order. Only one decode option
is valid, and if more than one option is specified, the
last one is used. When mime decoding is specified, the
base text is assumed to be in EUC encoding unless
specified otherwise. Except for rot, which assumes the
input stream is Shift_JIS, EUC or iso-2022-jp, these
encodings assume the input stream is ascii (as defined
in RFC2045). Some of these encodings may co-exist with
other encodings, but this is not guaranteed. In
particular, if the input is UTF-16/UCS2 code, these
encodings are ignored by skf.
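For readers unfamiliar with the transfer encodings named
above, the following Python sketch decodes a few of them with
the standard library. It shows what each scheme looks like,
not how skf decodes it; decode_mime_header and
decode_punycode are hypothetical helper names.

```python
import codecs
from email.header import decode_header
from urllib.parse import unquote


def decode_mime_header(value: str) -> str:
    """Decode MIME B- or Q-encoded words (RFC 2047) into text."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii")
        parts.append(chunk)
    return "".join(parts)


def decode_punycode(label: str) -> str:
    """Decode a Punycode label (without the 'xn--' ACE prefix)."""
    return label.encode("ascii").decode("punycode")


if __name__ == "__main__":
    print(decode_mime_header("=?ISO-8859-1?Q?caf=E9?="))   # MIME Q-encoding
    print(decode_punycode("bcher-kva"))                    # Punycode
    print(unquote("%E6%97%A5%E6%9C%AC"))                   # URI percent notation
    print(codecs.decode("Uryyb", "rot_13"))                # rot13
```

Each of these inputs is plain ascii on the wire, which matches
the note above that the decoders assume an ascii input stream.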
End of line control options
--lineend-thru
Output end of line code as it is. Also output ^Z
code as it is. This is default.
--lineend-cr --lineend-mac
Use CR as end of line code. Also delete ^Z code
from input stream.
--lineend-lf --lineend-unix
Use LF as end of line code. Also delete ^Z code
from input stream.
--lineend-crlf --lineend-windows
Use CRLF as end of line code. Also delete ^Z code
from input stream.
-F[line_length[-kinsoku]]
-f[line_length[-kinsoku]]
Wrap input lines at line_length columns. The f option
deletes CR/LFs in the input, and the F option doesn't
delete them. For Japanese convention, both
gyoutou-kinsoku (by burasage-gumi) and
gyoumatsu-kinsoku (oidasi-gumi) are supported. The
burasage length is controlled by the kinsoku option.
The default value for line_length is 60, and it must be
< 1000. The default value for kinsoku is 5, and it must
be < 10.
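The effect of the --lineend-* options described in this
section can be sketched in a few lines of Python. This is a
simplified model, assuming a byte stream in an
ASCII-compatible encoding where the bytes CR, LF and ^Z
(0x1A) never occur inside a multibyte character; it does not
apply to UTF-16. The function convert_lineend is a
hypothetical name.

```python
CTRL_Z = b"\x1a"


def convert_lineend(data: bytes, eol: bytes) -> bytes:
    """Normalize CR, LF and CRLF to eol and drop ^Z from the stream."""
    data = data.replace(CTRL_Z, b"")
    # Collapse every line-end form to LF first, then expand to the target.
    data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return data.replace(b"\n", eol)


if __name__ == "__main__":
    mixed = b"one\r\ntwo\rthree\n\x1a"
    print(convert_lineend(mixed, b"\n"))      # like --lineend-lf
    print(convert_lineend(mixed, b"\r\n"))    # like --lineend-crlf
```

The default --lineend-thru behavior corresponds to not calling
such a function at all: line ends and ^Z pass through as they
are.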
File control options
--filewise-detect --force-reset
Reset and re-detect input code set at the start of
each file.
--linewise-detect
Reset and re-detect input code set at the start of
each line. This option needs -DKUNIMOTO at compile
time.
Compatibility options
--nkf-compat
Interpret the following options in an nkf-compatible
manner.
--skf-compat
Interpret the following options in the skf-native manner.
Misc. Control options
--disable-space-convert
By default, skf converts an ideographic space into
two ascii spaces. This option disables this behavior.
--html-sanitize
Convert several characters in an HTML document to
entity reference expressions. Specifically, the characters
"!#$&%()/<>:;?' are escaped as entity expressions.
--filewise-detect --force-reset
If multiple input files are given, detect input
code for each file.
--linewise-detect
Detect input code line-wise. Note that this option
weakens the code detection feature. Needs the compile
option --enable-kunimoto (at configure).
--reset
Reset all flags specified by extended controls and
given input code.
--inquiry --guess
skf detects the code and writes the detection result to
stdout. No filtering output is performed. If multiple
input files are given, --show-filename is automatically
enabled.
--suppress-filename
When inquiry(--inquiry) is on, this option disables
file name output. This option overrides
--show-filename.
--show-filename
When inquiry(--inquiry) is on, this option adds
each file name to output.
--invis-strip
Delete all escape sequences not belonging to
ISO-2022 code extension. This is intended to
replace invisstrip command bundled in inews pack-
age.
-I Warn if input has unassigned code points.
-v print version and exit.
-h --help
print brief help.
--show-supported-codeset
Display supported codeset (input) and exit.
--show-supported-charset
Display supported character set (output) and exit.
-%[debug_level]
Enable skf debugging. Debug level is one digit. 0
is the least verbose, and with -%9 you'll get whole
traces within skf. This option needs compile
option --enable-debug.
FILES
/usr/(local/)share/skf/lib/ (Unices)
/Program Files/skf/share/lib (MS Windows)
These directories are where external codeset
conversion tables go. The location that the current skf
assumes is shown by the -h option.
AUTHOR
skf is written by Seiji Kaneko (skaneko@a2.mbn.or.jp),
based on an idea from nkf written by Itaru Ichikawa
(ichikawa@flab.fujitsu.co.jp). The X-0213 code table is
derived from the work of earthian@tama.or.jp.
ACKNOWLEDGEMENT
skf is inspired by works or requests by shinoda@cs.titech,
kato@cs.titech, uematsu@cs.titech, void@global ohta@ricoh,
Hinata(HKE) Ashizawa(CRL) Kunimoto(SDL) Oohara(Univ of
Kyoto). Thanks.
BUGS AND LIMITATIONS
1. skf can handle mixed coding with some limitations.
However, code detection tends to fail for mixed code, and
giving an explicit input code set is strongly encouraged if
the codeset is known beforehand.
If needed, the --linewise-detect option may help, but it is
more likely to fail to detect codes.
2. When using UCS2, UTF-16, UTF-8 and UTF-7, skf tries to
detect the input code, but giving an explicit code set is
encouraged. skf doesn't support UCS4, but does support
UTF-16/UTF-32 (i.e. surrogate pairs). skf just passes
composite characters to output. No further normalization
is performed.
3. skf implements ISO-2022 with the following exceptions:
i) GL 0x20 is always space, even when a 96-character
codeset is invoked to GL.
ii) Sequences for setting codes to C1 and C2 are always
ignored.
iii) If an unknown sequence is given for G0, G0 is set to
ascii, and locking/single shift is cleared. An unknown
sequence call to G1-G3 is just ignored.
iv) Sequences for 96-character multibyte codings are
ignored (currently, no such codeset is registered).
v) Calling the UTF-8 or UTF-16 coding system from iso-2022
is supported, and the standard return sequence returns to
the previous coding system. Calling and return in other
cases are ignored.
vi) Because of cellular phone glyph support, several
private (unregistered) codesets are defined in skf, and can
be called by the appropriate sequences.
4. Since skf by default tests input stream to detect utf7
coding, skf sometimes misdetects pure ascii text as utf7.
If this occurs, use --no-utf7 option.
5. Error output coding is controlled by the LOCALE
environment variables on UN*X systems. Since skf does not
check whether stdout and stderr are redirected into the same
stream, this case should be handled by the user.
6. skf-1.9x converts KEIS/JIS X-0213 code using CJK-exten-
sion B and CJK compatibility area. For this reason, X-0213
and KEIS convert result varies depending on --use-compat
and --limit-to-ucs2 switches.
7. JIS X-0207(1979) is not supported. JIS X-0211(1987) is
designed to be supported (i.e. common terminal control
sequence will be transparently passed to output).
8. Even if unbuffer option(-u) is specified, some code-
translation related bufferings are still performed (in
MIME, kana, VIQR etc.).
9. skf-1.9x recognizes and handles languages in
iso639-1(alpha 2). iso639-2 is not supported as a valid
language set.
Notes
1. Extended options have changed extensively since skf-1.9.
Some archaic options (e.g. -B, -@ and -r) have been deleted
from this version.
2. skf is a project derived from nkf, but doesn't contain
nkf code. The copyright notice is retained out of respect.
3. From version 1.9, default Japanese character set
assumed by skf has changed to JIS X-0208(1990) with
Microsoft Japanese Windows gaiji (i.e. CP932).
4. Code autodetection is not perfect by design. If it has
failed to detect input code properly, please give input
code information explicitly.
5. Some ligatures in Unicode, cp932 gaiji and KEIS83 are
converted using JIS X-0124 and other convention. During
this conversion, its byte length is not preserved.
6. skf is intended to pass ANSI compatible terminal con-
trol code transparently, but this is not guaranteed.
7. nkf's -i and -o options still work, but they are valid
only for iso-2022-jp and are independent of codeset
specifications. Using these options is strongly discouraged.
8. For unconverted characters, skf uses geta, or the
undefined (replacement) character if the --use-replace-char
option is given. If the output codeset doesn't contain the
geta code, skf prefers the 'black square' character, and
then falls back to '.'.
9. There are some undocumented options. These options
should be considered as highly experimental.
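The fallback order in note 8 (geta, then black square, then
'.') can be modeled with a short Python sketch. It assumes
Python's codec tables stand in for the output codeset and is
not taken from skf's source; FALLBACKS, pick_replacement and
convert_with_fallback are hypothetical names.

```python
# Candidate substitutes in preference order:
# geta mark (U+3013), black square (U+25A0), then a plain full stop.
FALLBACKS = ("\u3013", "\u25a0", ".")


def pick_replacement(codec: str) -> str:
    """Return the first fallback character the output codec can encode."""
    for candidate in FALLBACKS:
        try:
            candidate.encode(codec)
            return candidate
        except UnicodeEncodeError:
            continue
    raise ValueError(f"no fallback character fits codec {codec!r}")


def convert_with_fallback(text: str, codec: str) -> bytes:
    """Encode text, substituting the chosen fallback for unmapped chars."""
    substitute = pick_replacement(codec)
    out = []
    for ch in text:
        try:
            ch.encode(codec)
            out.append(ch)
        except UnicodeEncodeError:
            out.append(substitute)
    return "".join(out).encode(codec)


if __name__ == "__main__":
    for codec in ("euc_jp", "cp437", "latin-1"):
        print(codec, repr(pick_replacement(codec)))
```

A Japanese codeset gets the geta mark, a codeset with box
drawing symbols gets the black square, and a plain Latin
codeset ends up with '.'.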
Notice
Unicode(TM) is a trademark of Unicode, Inc. Microsoft and
Windows are registered trademarks of Microsoft Corporation.
Macintosh is a registered trademark of Apple Computer Inc.
Vodafone is a trademark of Vodafone K.K. Other names and
terms may be trademarks or registered trademarks of their
respective owners. Trademark symbol (TM) is omitted in this
manual page.
09/MAY/2004 SKF(1)