argra****@users*****
Tue Apr 16 04:26:02 JST 2013
Index: docs/perl/5.14.1/perlrecharclass.pod diff -u /dev/null docs/perl/5.14.1/perlrecharclass.pod:1.1 --- /dev/null Tue Apr 16 04:26:02 2013 +++ docs/perl/5.14.1/perlrecharclass.pod Tue Apr 16 04:26:02 2013 @@ -0,0 +1,2022 @@ + +=encoding euc-jp + +=head1 NAME +X<character class> + +=begin original + +perlrecharclass - Perl Regular Expression Character Classes + +=end original + +perlrecharclass - Perl 正規表現文字クラス + +=head1 DESCRIPTION + +=begin original + +The top level documentation about Perl regular expressions +is found in L<perlre>. + +=end original + +Perl 正規表現に関する最上位文書は L<perlre> です。 + +=begin original + +This manual page discusses the syntax and use of character +classes in Perl regular expressions. + +=end original + +このマニュアルページは Perl 正規表現の文字クラスの文法と使用法について +議論します。 + +=begin original + +A character class is a way of denoting a set of characters +in such a way that one character of the set is matched. +It's important to remember that: matching a character class +consumes exactly one character in the source string. (The source +string is the string the regular expression is matched against.) + +=end original + +文字クラスは、集合の中の一文字がマッチングするというような方法で、 +文字の集合を指定するための方法です。 +次のことを覚えておくことは重要です: 文字集合はソース文字列の中から正確に +一文字だけを消費します。 +(ソース文字列とは正規表現がマッチングしようとしている文字列です。) + +=begin original + +There are three types of character classes in Perl regular +expressions: the dot, backslash sequences, and the form enclosed in square +brackets. Keep in mind, though, that often the term "character class" is used +to mean just the bracketed form. Certainly, most Perl documentation does that. + +=end original + +Perl 正規表現には 3 種類の文字クラスがあります: ドット、 +逆スラッシュシーケンス、大かっこで囲まれた形式です。 +しかし、「文字クラス」という用語はしばしば大かっこ形式だけを意味するために +使われることに注意してください。 +確かに、ほとんどの Perl 文書ではそうなっています。 + +=head2 The dot + +(ドット) + +=begin original + +The dot (or period), C<.> is probably the most used, and certainly +the most well-known character class. By default, a dot matches any +character, except for the newline. 
The default can be changed to +add matching the newline by using the I<single line> modifier: either +for the entire regular expression with the C</s> modifier, or +locally with C<(?s)>. (The experimental C<\N> backslash sequence, described +below, matches any character except newline without regard to the +I<single line> modifier.) + +=end original + +ドット (またはピリオド) C<.> はおそらくもっともよく使われ、そして確実に +もっともよく知られている文字クラスです。 +デフォルトでは、ドットは改行を除く任意の文字にマッチングします。 +デフォルトは I<単一行> 修飾子を使うことで改行にもマッチングするように +変更されます: 正規表現全体に対して C</s> 修飾子を使うか、ローカルには +C<(?s)> を使います。 +(The experimental C<\N> backslash sequence, described +below, matches any character except newline without regard to the +I<single line> modifier.) +(TBT) + +=begin original + +Here are some examples: + +=end original + +以下は例です: + +=begin original + + "a" =~ /./ # Match + "." =~ /./ # Match + "" =~ /./ # No match (dot has to match a character) + "\n" =~ /./ # No match (dot does not match a newline) + "\n" =~ /./s # Match (global 'single line' modifier) + "\n" =~ /(?s:.)/ # Match (local 'single line' modifier) + "ab" =~ /^.$/ # No match (dot matches one character) + +=end original + + "a" =~ /./ # マッチングする + "." =~ /./ # マッチングする + "" =~ /./ # マッチングしない (ドットは文字にマッチングする必要がある) + "\n" =~ /./ # マッチングしない (ドットは改行にはマッチングしない) + "\n" =~ /./s # マッチングする (グローバル「単一行」修飾子) + "\n" =~ /(?s:.)/ # マッチングする (ローカル「単一行」修飾子) + "ab" =~ /^.$/ # マッチングしない (ドットは一文字にマッチングする) + +=head2 Backslash sequences +X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\p> X<\P> +X<\N> X<\v> X<\V> X<\h> X<\H> +X<word> X<whitespace> + +(逆スラッシュシーケンス) + +=begin original + +A backslash sequence is a sequence of characters, the first one of which is a +backslash. Perl ascribes special meaning to many such sequences, and some of +these are character classes. That is, they match a single character each, +provided that the character belongs to the specific set of characters defined +by the sequence. 
 + +=end original + +A backslash sequence is a sequence of characters, the first one of which is a +backslash. Perl ascribes special meaning to many such sequences, and some of +these are character classes. That is, they match a single character each, +provided that the character belongs to the specific set of characters defined +by the sequence. +(TBT) + +=begin original + +Here's a list of the backslash sequences that are character classes. They +are discussed in more detail below. (For the backslash sequences that aren't +character classes, see L<perlrebackslash>.) + +=end original + +以下は文字クラスの逆スラッシュシーケンスの一覧です。 +以下でさらに詳細に議論します。 +(For the backslash sequences that aren't +character classes, see L<perlrebackslash>.) +(TBT) + +=begin original + + \d Match a decimal digit character. + \D Match a non-decimal-digit character. + \w Match a "word" character. + \W Match a non-"word" character. + \s Match a whitespace character. + \S Match a non-whitespace character. + \h Match a horizontal whitespace character. + \H Match a character that isn't horizontal whitespace. + \v Match a vertical whitespace character. + \V Match a character that isn't vertical whitespace. + \N Match a character that isn't a newline. Experimental. + \pP, \p{Prop} Match a character that has the given Unicode property. + \PP, \P{Prop} Match a character that doesn't have the Unicode property. + +=end original + + \d 10 進数字にマッチング。 + \D 非 10 進数字にマッチング。 + \w 「単語」文字にマッチング。 + \W 非「単語」文字にマッチング。 + \s 空白文字にマッチング。 + \S 非空白文字にマッチング。 + \h 水平空白文字にマッチング。 + \H 水平空白でない文字にマッチング。 + \v 垂直空白文字にマッチング。 + \V 垂直空白でない文字にマッチング。 + \N 改行以外の文字にマッチング。実験的。 + \pP, \p{Prop} 指定された Unicode 特性を持つ文字にマッチング。 + \PP, \P{Prop} 指定された Unicode 特性を持たない文字にマッチング。 + +=head3 Digits + +(数字) + +=begin original + +C<\d> matches a single character considered to be a decimal I<digit>. +If the C</a> modifier is in effect, it matches [0-9]. Otherwise, it +matches anything that is matched by C<\p{Digit}>, which includes [0-9]. 
+(An unlikely possible exception is that under locale matching rules, the +current locale might not have [0-9] matched by C<\d>, and/or might match +other characters whose code point is less than 256. Such a locale +definition would be in violation of the C language standard, but Perl +doesn't currently assume anything in regard to this.) + +=end original + +C<\d> は 10 進 I<数字> と考えられる単一の文字にマッチングします。 +If the C</a> modifier is in effect, it matches [0-9]. Otherwise, it +matches anything that is matched by C<\p{Digit}>, which includes [0-9]. +(An unlikely possible exception is that under locale matching rules, the +current locale might not have [0-9] matched by C<\d>, and/or might match +other characters whose code point is less than 256. Such a locale +definition would be in violation of the C language standard, but Perl +doesn't currently assume anything in regard to this.) +(TBT) + +=begin original + +What this means is that unless the C</a> modifier is in effect, C<\d> not +only matches the digits '0' - '9', but also Arabic, Devanagari, and +digits from other languages. This may cause some confusion, and some +security issues. + +=end original + +What this means is that unless the C</a> modifier is in effect, +C<\d> は数字 '0' - '9' だけでなく、Arabic, +Devanagari およびその他の言語の数字もマッチングします。 +This may cause some confusion, and some +security issues. +(TBT) + +=begin original + +Some digits that C<\d> matches look like some of the [0-9] ones, but +have different values. For example, BENGALI DIGIT FOUR (U+09EA) looks +very much like an ASCII DIGIT EIGHT (U+0038). An application that +is expecting only the ASCII digits might be misled, or if the match is +C<\d+>, the matched string might contain a mixture of digits from +different writing systems that look like they signify a number different +than they actually do. L<Unicode::UCD/num()> can be used to safely +calculate the value, returning C<undef> if the input string contains +such a mixture. 
+ +=end original + +Some digits that C<\d> matches look like some of the [0-9] ones, but +have different values. For example, BENGALI DIGIT FOUR (U+09EA) looks +very much like an ASCII DIGIT EIGHT (U+0038). An application that +is expecting only the ASCII digits might be misled, or if the match is +C<\d+>, the matched string might contain a mixture of digits from +different writing systems that look like they signify a number different +than they actually do. L<Unicode::UCD/num()> can be used to safely +calculate the value, returning C<undef> if the input string contains +such a mixture. +(TBT) + +=begin original + +What C<\p{Digit}> means (and hence C<\d> except under the C</a> +modifier) is C<\p{General_Category=Decimal_Number}>, or synonymously, +C<\p{General_Category=Digit}>. Starting with Unicode version 4.1, this +is the same set of characters matched by C<\p{Numeric_Type=Decimal}>. +But Unicode also has a different property with a similar name, +C<\p{Numeric_Type=Digit}>, which matches a completely different set of +characters. These characters are things such as C<CIRCLED DIGIT ONE> +or subscripts, or are from writing systems that lack all ten digits. + +=end original + +What C<\p{Digit}> means (and hence C<\d> except under the C</a> +modifier) is C<\p{General_Category=Decimal_Number}>, or synonymously, +C<\p{General_Category=Digit}>. Starting with Unicode version 4.1, this +is the same set of characters matched by C<\p{Numeric_Type=Decimal}>. +But Unicode also has a different property with a similar name, +C<\p{Numeric_Type=Digit}>, which matches a completely different set of +characters. These characters are things such as C<CIRCLED DIGIT ONE> +or subscripts, or are from writing systems that lack all ten digits. 
+(TBT) + +=begin original + +The design intent is for C<\d> to exactly match the set of characters +that can safely be used with "normal" big-endian positional decimal +syntax, where, for example 123 means one 'hundred', plus two 'tens', +plus three 'ones'. This positional notation does not necessarily apply +to characters that match the other type of "digit", +C<\p{Numeric_Type=Digit}>, and so C<\d> doesn't match them. + +=end original + +The design intent is for C<\d> to exactly match the set of characters +that can safely be used with "normal" big-endian positional decimal +syntax, where, for example 123 means one 'hundred', plus two 'tens', +plus three 'ones'. This positional notation does not necessarily apply +to characters that match the other type of "digit", +C<\p{Numeric_Type=Digit}>, and so C<\d> doesn't match them. +(TBT) + +=begin original + +In Unicode 5.2, the Tamil digits (U+0BE6 - U+0BEF) can also legally be +used in old-style Tamil numbers in which they would appear no more than +one in a row, separated by characters that mean "times 10", "times 100", +etc. (See L<http://www.unicode.org/notes/tn21>.) + +=end original + +In Unicode 5.2, the Tamil digits (U+0BE6 - U+0BEF) can also legally be +used in old-style Tamil numbers in which they would appear no more than +one in a row, separated by characters that mean "times 10", "times 100", +etc. (See L<http://www.unicode.org/notes/tn21>.) +(TBT) + +=begin original + +Any character not matched by C<\d> is matched by C<\D>. + +=end original + +C<\d> にマッチングしない任意の文字は C<\D> にマッチングします。 + +=head3 Word characters + +(単語文字) + +=begin original + +A C<\w> matches a single alphanumeric character (an alphabetic character, or a +decimal digit) or a connecting punctuation character, such as an +underscore ("_"). It does not match a whole word. To match a whole +word, use C<\w+>. This isn't the same thing as matching an English word, but +in the ASCII range it is the same as a string of Perl-identifier +characters. 
+ +=end original + +C<\w> は単語全体ではなく、単一の英数字(つまり英字または数字)または +下線(C<_>) のような接続句読点にマッチングします。 +It does not match a whole word. To match a whole +word, use C<\w+>. This isn't the same thing as matching an English word, but +in the ASCII range it is the same as a string of Perl-identifier +characters. +(TBT) + +=over + +=item If the C</a> modifier is in effect ... + +(C</a> 修飾子が有効なら ...) + +=begin original + +C<\w> matches the 63 characters [a-zA-Z0-9_]. + +=end original + +C<\w> matches the 63 characters [a-zA-Z0-9_]. +(TBT) + +=item otherwise ... + +(さもなければ ...) + +=over + +=item For code points above 255 ... + +(256 以上の符号位置では ...) + +=begin original + +C<\w> matches the same as C<\p{Word}> matches in this range. That is, +it matches Thai letters, Greek letters, etc. This includes connector +punctuation (like the underscore) which connect two words together, or +diacritics, such as a C<COMBINING TILDE> and the modifier letters, which +are generally used to add auxiliary markings to letters. + +=end original + +C<\w> matches the same as C<\p{Word}> matches in this range. That is, +it matches Thai letters, Greek letters, etc. This includes connector +punctuation (like the underscore) which connect two words together, or +diacritics, such as a C<COMBINING TILDE> and the modifier letters, which +are generally used to add auxiliary markings to letters. +(TBT) + +=item For code points below 256 ... + +(255 以下の符号位置では ...) + +=over + +=item if locale rules are in effect ... + +(ロケール規則が有効なら ...) + +=begin original + +C<\w> matches the platform's native underscore character plus whatever +the locale considers to be alphanumeric. + +=end original + +C<\w> matches the platform's native underscore character plus whatever +the locale considers to be alphanumeric. +(TBT) + +=item if Unicode rules are in effect or if on an EBCDIC platform ... + +(Unicode 規則が有効か、EBCDIC プラットフォームなら ...) + +=begin original + +C<\w> matches exactly what C<\p{Word}> matches. 
 + +=end original + +C<\w> matches exactly what C<\p{Word}> matches. +(TBT) + +=item otherwise ... + +(さもなければ ...) + +=begin original + +C<\w> matches [a-zA-Z0-9_]. + +=end original + +C<\w> matches [a-zA-Z0-9_]. +(TBT) + +=back + +=back + +=back + +=begin original + +Which rules apply are determined as described in L<perlre/Which character set modifier is in effect?>. + +=end original + +Which rules apply are determined as described in L<perlre/Which character set modifier is in effect?>. +(TBT) + +=begin original + +There are a number of security issues with the full Unicode list of word +characters. See L<http://unicode.org/reports/tr36>. + +=end original + +There are a number of security issues with the full Unicode list of word +characters. See L<http://unicode.org/reports/tr36>. +(TBT) + +=begin original + +Also, for a somewhat finer-grained set of characters that are in programming +language identifiers beyond the ASCII range, you may wish to instead use the +more customized Unicode properties, "ID_Start", "ID_Continue", "XID_Start", and +"XID_Continue". See L<http://unicode.org/reports/tr31>. + +=end original + +Also, for a somewhat finer-grained set of characters that are in programming +language identifiers beyond the ASCII range, you may wish to instead use the +more customized Unicode properties, "ID_Start", "ID_Continue", "XID_Start", and +"XID_Continue". See L<http://unicode.org/reports/tr31>. +(TBT) + +=begin original + +Any character not matched by C<\w> is matched by C<\W>. + +=end original + +C<\w> にマッチングしない任意の文字は C<\W> にマッチングします。 + +=head3 Whitespace + +(空白) + +=begin original + +C<\s> matches any single character considered whitespace. + +=end original + +C<\s> は空白と考えられる単一の文字にマッチングします。 + +=over + +=item If the C</a> modifier is in effect ... + +=begin original + +C<\s> matches the 5 characters [\t\n\f\r ]; that is, the horizontal tab, +the newline, the form feed, the carriage return, and the space. 
(Note +that it doesn't match the vertical tab, C<\cK> on ASCII platforms.) + +=end original + +C<\s> matches the 5 characters [\t\n\f\r ]; that is, the horizontal tab, +the newline, the form feed, the carriage return, and the space. (Note +that it doesn't match the vertical tab, C<\cK> on ASCII platforms.) +(TBT) + +=item otherwise ... + +=over + +=item For code points above 255 ... + +=begin original + +C<\s> matches exactly the code points above 255 shown with an "s" column +in the table below. + +=end original + +C<\s> matches exactly the code points above 255 shown with an "s" column +in the table below. +(TBT) + +=item For code points below 256 ... + +=over + +=item if locale rules are in effect ... + +=begin original + +C<\s> matches whatever the locale considers to be whitespace. Note that +this is likely to include the vertical space, unlike non-locale C<\s> +matching. + +=end original + +C<\s> matches whatever the locale considers to be whitespace. Note that +this is likely to include the vertical space, unlike non-locale C<\s> +matching. +(TBT) + +=item if Unicode rules are in effect or if on an EBCDIC platform ... + +=begin original + +C<\s> matches exactly the characters shown with an "s" column in the +table below. + +=end original + +C<\s> matches exactly the characters shown with an "s" column in the +table below. +(TBT) + +=item otherwise ... + +=begin original + +C<\s> matches [\t\n\f\r ]. +Note that this list doesn't include the non-breaking space. + +=end original + +C<\s> matches [\t\n\f\r ]. +Note that this list doesn't include the non-breaking space. +(TBT) + +=back + +=back + +=back + +=begin original + +Which rules apply are determined as described in L<perlre/Which character set modifier is in effect?>. + +=end original + +Which rules apply are determined as described in L<perlre/Which character set modifier is in effect?>. +(TBT) + +=begin original + +Any character not matched by C<\s> is matched by C<\S>. 
+ +=end original + +C<\s> にマッチングしない任意の文字は C<\S> にマッチングします。 + +=begin original + +C<\h> matches any character considered horizontal whitespace; +this includes the space and tab characters and several others +listed in the table below. C<\H> matches any character +not considered horizontal whitespace. + +=end original + +C<\h> は水平空白と考えられる任意の文字にマッチングします; これはスペースと +タブ文字および以下の表に上げられているいくつかのその他の文字です。 +C<\H> は水平空白と考えられない文字にマッチングします。 + +=begin original + +C<\v> matches any character considered vertical whitespace; +this includes the carriage return and line feed characters (newline) +plus several other characters, all listed in the table below. +C<\V> matches any character not considered vertical whitespace. + +=end original + +C<\v> は垂直空白と考えられる任意の文字にマッチングします; これは復帰と +行送り(改行)文字に加えていくつかのその他の文字です; 全ては以下の表に +挙げられています。 +C<\V> は垂直空白と考えられない任意の文字にマッチングします。 + +=begin original + +C<\R> matches anything that can be considered a newline under Unicode +rules. It's not a character class, as it can match a multi-character +sequence. Therefore, it cannot be used inside a bracketed character +class; use C<\v> instead (vertical whitespace). +Details are discussed in L<perlrebackslash>. + +=end original + +C<\R> は Unicode の規則で改行と考えられるものにマッチングします。 +複数文字の並びにマッチングすることもあるので、これは +文字クラスではありません。 +従って、大かっこ文字クラスの中では使えません; 代わりに C<\v> (垂直空白) を +使ってください。 +詳細は L<perlrebackslash> で議論しています。 + +=begin original + +Note that unlike C<\s> (and C<\d> and C<\w>), C<\h> and C<\v> always match +the same characters, without regard to other factors, such as whether the +source string is in UTF-8 format. + +=end original + +C<\s> (および C<\d> と C<\w>) と違って、C<\h> および C<\v> は、ソース文字列が +UTF-8 形式かどうかといった他の要素に関わらず同じ文字にマッチングします。 + +=begin original + +One might think that C<\s> is equivalent to C<[\h\v]>. This is not true. +For example, the vertical tab (C<"\x0b">) is not matched by C<\s>, it is +however considered vertical whitespace. 
+ +=end original + +C<\s> が C<[\h\v]> と等価と考える人がいるかもしれません。 +これは正しくありません。 +例えば、垂直タブ (C<"\x0b">) は C<\s> にマッチングしませんが、垂直空白と +考えられます。 + +=begin original + +The following table is a complete listing of characters matched by +C<\s>, C<\h> and C<\v> as of Unicode 6.0. + +=end original + +以下の表は Unicode 6.0 現在で C<\s>, C<\h>, C<\v> にマッチングする文字の +完全な一覧です。 + +=begin original + +The first column gives the code point of the character (in hex format), +the second column gives the (Unicode) name. The third column indicates +by which class(es) the character is matched (assuming no locale or EBCDIC code +page is in effect that changes the C<\s> matching). + +=end original + +最初の列は文字の符号位置(16 進形式)、2 番目の列は (Unicode の)名前です。 +3 番目の列はどのクラスにマッチングするかを示しています +(C<\s> のマッチングを変更するようなロケールや EBCDIC コードページが +有効でないことを仮定しています)。 + + 0x00009 CHARACTER TABULATION h s + 0x0000a LINE FEED (LF) vs + 0x0000b LINE TABULATION v + 0x0000c FORM FEED (FF) vs + 0x0000d CARRIAGE RETURN (CR) vs + 0x00020 SPACE h s + 0x00085 NEXT LINE (NEL) vs [1] + 0x000a0 NO-BREAK SPACE h s [1] + 0x01680 OGHAM SPACE MARK h s + 0x0180e MONGOLIAN VOWEL SEPARATOR h s + 0x02000 EN QUAD h s + 0x02001 EM QUAD h s + 0x02002 EN SPACE h s + 0x02003 EM SPACE h s + 0x02004 THREE-PER-EM SPACE h s + 0x02005 FOUR-PER-EM SPACE h s + 0x02006 SIX-PER-EM SPACE h s + 0x02007 FIGURE SPACE h s + 0x02008 PUNCTUATION SPACE h s + 0x02009 THIN SPACE h s + 0x0200a HAIR SPACE h s + 0x02028 LINE SEPARATOR vs + 0x02029 PARAGRAPH SEPARATOR vs + 0x0202f NARROW NO-BREAK SPACE h s + 0x0205f MEDIUM MATHEMATICAL SPACE h s + 0x03000 IDEOGRAPHIC SPACE h s + +=over 4 + +=item [1] + +=begin original + +NEXT LINE and NO-BREAK SPACE may or may not match C<\s> depending +on the rules in effect. See +L<the beginning of this section|/Whitespace>. 
+ +=end original + +NEXT LINE と NO-BREAK SPACE はどの規則が有効かによって C<\s> に +マッチングしたりマッチングしなかったりします。 +L<the beginning of this section|/Whitespace> を参照してください。 + +=back + +=head3 \N + +=begin original + +C<\N> is new in 5.12, and is experimental. It, like the dot, matches any +character that is not a newline. The difference is that C<\N> is not influenced +by the I<single line> regular expression modifier (see L</The dot> above). Note +that the form C<\N{...}> may mean something completely different. When the +C<{...}> is a L<quantifier|perlre/Quantifiers>, it means to match a non-newline +character that many times. For example, C<\N{3}> means to match 3 +non-newlines; C<\N{5,}> means to match 5 or more non-newlines. But if C<{...}> +is not a legal quantifier, it is presumed to be a named character. See +L<charnames> for those. For example, none of C<\N{COLON}>, C<\N{4F}>, and +C<\N{F4}> contain legal quantifiers, so Perl will try to find characters whose +names are respectively C<COLON>, C<4F>, and C<F4>. + +=end original + +C<\N> は 5.12 の新機能で、実験的なものです。 +これは、ドットのように、改行以外の任意の文字にマッチングします。 +違いは、C<\N> は I<単一行> 正規表現修飾子の影響を受けないことです +(上述の L</The dot> 参照)。 +Note +that the form C<\N{...}> may mean something completely different. When the +C<{...}> is a L<quantifier|perlre/Quantifiers>, it means to match a non-newline +character that many times. +例えば、C<\N{3}> は三つの非改行にマッチングします; +C<\N{5,}> は五つ以上の非改行にマッチングします。 +But if C<{...}> +is not a legal quantifier, it is presumed to be a named character. See +L<charnames> for those. For example, none of C<\N{COLON}>, C<\N{4F}>, and +C<\N{F4}> contain legal quantifiers, so Perl will try to find characters whose +names are respectively C<COLON>, C<4F>, and C<F4>. +(TBT) + +=head3 Unicode Properties + +(Unicode 特性) + +=begin original + +C<\pP> and C<\p{Prop}> are character classes to match characters that fit given +Unicode properties. 
One letter property names can be used in the C<\pP> form, +with the property name following the C<\p>, otherwise, braces are required. +When using braces, there is a single form, which is just the property name +enclosed in the braces, and a compound form which looks like C<\p{name=value}>, +which means to match if the property "name" for the character has that particular +"value". +For instance, a match for a number can be written as C</\pN/> or as +C</\p{Number}/>, or as C</\p{Number=True}/>. +Lowercase letters are matched by the property I<Lowercase_Letter> which +has as short form I<Ll>. They need the braces, so are written as C</\p{Ll}/> or +C</\p{Lowercase_Letter}/>, or C</\p{General_Category=Lowercase_Letter}/> +(the underscores are optional). +C</\pLl/> is valid, but means something different. +It matches a two character string: a letter (Unicode property C<\pL>), +followed by a lowercase C<l>. + +=end original + +C<\pP> と C<\p{Prop}> は指定された Unicode 特性に一致する文字に +マッチングする文字クラスです。 +一文字特性は C<\pP> 形式で、C<\p> に引き続いて特性名です; さもなければ +中かっこが必要です。 +中かっこを使うとき、単に特性名を中かっこで囲んだ単一形式と、 +C<\p{name=value}> のような形で、文字の特性 "name" が特定の "value" を +持つものにマッチングすることになる複合形式があります。 +例えば、数字にマッチングするものは C</\pN/> または C</\p{Number}/> または +C</\p{Number=True}/> と書けます。 +小文字は I<LowercaseLetter> 特性にマッチングします; これには +I<Ll> と言う短縮形式があります。 +中かっこが必要なので、C</\p{Ll}/> または C</\p{Lowercase_Letter}/> または +C</\p{General_Category=Lowercase_Letter}/> と書きます(下線はオプションです)。 +C</\pLl/> も妥当ですが、違う意味になります。 +これは 2 文字にマッチングします: 英字 (Unicode 特性 C<\pL>)に引き続いて +小文字の C<l> です。 + +=begin original + +If neither the C</a> modifier nor locale rules are in effect, the use of +a Unicode property will force the regular expression into using Unicode +rules. + +=end original + +If neither the C</a> modifier nor locale rules are in effect, the use of +a Unicode property will force the regular expression into using Unicode +rules. +(TBT) + +=begin original + +Note that almost all properties are immune to case-insensitive matching. 
+That is, adding a C</i> regular expression modifier does not change what +they match. There are two sets that are affected. The first set is +C<Uppercase_Letter>, +C<Lowercase_Letter>, +and C<Titlecase_Letter>, +all of which match C<Cased_Letter> under C</i> matching. +The second set is +C<Uppercase>, +C<Lowercase>, +and C<Titlecase>, +all of which match C<Cased> under C</i> matching. +(The difference between these sets is that some things, such as Roman +Numerals, come in both upper and lower case so they are C<Cased>, but +aren't considered to be letters, so they aren't C<Cased_Letter>s. They're +actually C<Letter_Number>s.) +This set also includes its subsets C<PosixUpper> and C<PosixLower>, both +of which under C</i> matching match C<PosixAlpha>. + +=end original + +Note that almost all properties are immune to case-insensitive matching. +That is, adding a C</i> regular expression modifier does not change what +they match. There are two sets that are affected. The first set is +C<Uppercase_Letter>, +C<Lowercase_Letter>, +and C<Titlecase_Letter>, +all of which match C<Cased_Letter> under C</i> matching. +The second set is +C<Uppercase>, +C<Lowercase>, +and C<Titlecase>, +all of which match C<Cased> under C</i> matching. +(The difference between these sets is that some things, such as Roman +Numerals, come in both upper and lower case so they are C<Cased>, but +aren't considered to be letters, so they aren't C<Cased_Letter>s. They're +actually C<Letter_Number>s.) +This set also includes its subsets C<PosixUpper> and C<PosixLower>, both +of which under C</i> matching match C<PosixAlpha>. +(TBT) + +=begin original + +For more details on Unicode properties, see L<perlunicode/Unicode +Character Properties>; for a +complete list of possible properties, see +L<perluniprops/Properties accessible through \p{} and \P{}>, +which notes all forms that have C</i> differences. +It is also possible to define your own properties. 
This is discussed in +L<perlunicode/User-Defined Character Properties>. + +=end original + +Unicode 特性に関するさらなる詳細については、 +L<perlunicode/Unicode Character Properties> を参照してください; 特性の完全な +一覧については +which notes all forms that have C</i> differences +L<perluniprops/Properties accessible through \p{} and \P{}> を参照して +ください。 +独自の特性を定義することも可能です。 +これは L<perlunicode/User-Defined Character Properties> で +議論されています。 +(TBT) + +=head4 Examples + +(例) + +=begin original + + "a" =~ /\w/ # Match, "a" is a 'word' character. + "7" =~ /\w/ # Match, "7" is a 'word' character as well. + "a" =~ /\d/ # No match, "a" isn't a digit. + "7" =~ /\d/ # Match, "7" is a digit. + " " =~ /\s/ # Match, a space is whitespace. + "a" =~ /\D/ # Match, "a" is a non-digit. + "7" =~ /\D/ # No match, "7" is not a non-digit. + " " =~ /\S/ # No match, a space is not non-whitespace. + +=end original + + "a" =~ /\w/ # マッチング; "a" は「単語」文字。 + "7" =~ /\w/ # マッチング; "7" も「単語」文字。 + "a" =~ /\d/ # マッチングしない; "a" は数字ではない。 + "7" =~ /\d/ # マッチング; "7" は数字。 + " " =~ /\s/ # マッチング; スペースは空白。 + "a" =~ /\D/ # マッチング; "a" は非数字。 + "7" =~ /\D/ # マッチングしない; "7" は非数字ではない。 + " " =~ /\S/ # マッチングしない; スペースは非空白ではない。 + +=begin original + + " " =~ /\h/ # Match, space is horizontal whitespace. + " " =~ /\v/ # No match, space is not vertical whitespace. + "\r" =~ /\v/ # Match, a return is vertical whitespace. + +=end original + + " " =~ /\h/ # マッチング; スペースは水平空白。 + " " =~ /\v/ # マッチングしない; スペースは垂直空白ではない。 + "\r" =~ /\v/ # マッチング; 復帰は垂直空白。 + +=begin original + + "a" =~ /\pL/ # Match, "a" is a letter. + "a" =~ /\p{Lu}/ # No match, /\p{Lu}/ matches upper case letters. + +=end original + + "a" =~ /\pL/ # マッチング; "a" は英字。 + "a" =~ /\p{Lu}/ # マッチングしない; /\p{Lu}/ は大文字にマッチングする。 + +=begin original + + "\x{0e0b}" =~ /\p{Thai}/ # Match, \x{0e0b} is the character + # 'THAI CHARACTER SO SO', and that's in + # Thai Unicode class. + "a" =~ /\P{Lao}/ # Match, as "a" is not a Laotian character. 
+ +=end original + + "\x{0e0b}" =~ /\p{Thai}/ # マッチング; \x{0e0b} は文字 + # 'THAI CHARACTER SO SO' で、これは + # Thai Unicode クラスにある。 + "a" =~ /\P{Lao}/ # マッチング; "a" はラオス文字ではない。 + +=begin original + +It is worth emphasizing that C<\d>, C<\w>, etc, match single characters, not +complete numbers or words. To match a number (that consists of digits), +use C<\d+>; to match a word, use C<\w+>. But be aware of the security +considerations in doing so, as mentioned above. + +=end original + +It is worth emphasizing that C<\d>, C<\w>, etc, match single characters, not +complete numbers or words. To match a number (that consists of digits), +use C<\d+>; to match a word, use C<\w+>. But be aware of the security +considerations in doing so, as mentioned above. +(TBT) + +=head2 Bracketed Character Classes + +(かっこ付き文字クラス) + +=begin original + +The third form of character class you can use in Perl regular expressions +is the bracketed character class. In its simplest form, it lists the characters +that may be matched, surrounded by square brackets, like this: C<[aeiou]>. +This matches one of C<a>, C<e>, C<i>, C<o> or C<u>. Like the other +character classes, exactly one character is matched.* To match +a longer string consisting of characters mentioned in the character +class, follow the character class with a L<quantifier|perlre/Quantifiers>. For +instance, C<[aeiou]+> matches one or more lowercase English vowels. + +=end original + +Perl 正規表現で使える文字クラスの第 3 の形式は大かっこ文字クラスです。 +もっとも単純な形式では、以下のように大かっこの中にマッチングする文字を +リストします: C<[aeiou]>. +これは C<a>, C<e>, C<i>, C<o>, C<u> のどれかにマッチングします。 +他の文字クラスと同様、正確に一つの文字にマッチングします。 +文字クラスで言及した文字で構成されるより長い文字列にマッチングするには、 +文字クラスに L<量指定子|perlre/Quantifiers> を付けます。 +例えば、C<[aeiou]+> は一つまたはそれ以上の小文字英語母音に +マッチングします。 + +=begin original + +Repeating a character in a character class has no +effect; it's considered to be in the set only once. 
 + +=end original + +文字クラスの中で文字を繰り返しても効果はありません; 一度だけ現れたものと +考えられます。 + +=begin original + +Examples: + +=end original + +例: + +=begin original + + "e" =~ /[aeiou]/ # Match, as "e" is listed in the class. + "p" =~ /[aeiou]/ # No match, "p" is not listed in the class. + "ae" =~ /^[aeiou]$/ # No match, a character class only matches + # a single character. + "ae" =~ /^[aeiou]+$/ # Match, due to the quantifier. + +=end original + + "e" =~ /[aeiou]/ # マッチング; "e" はクラスにある。 + "p" =~ /[aeiou]/ # マッチングしない; "p" はクラスにない。 + "ae" =~ /^[aeiou]$/ # マッチングしない; 一つの文字クラスは + # 一文字だけにマッチングする。 + "ae" =~ /^[aeiou]+$/ # マッチング; 量指定子により。 + + ------- + +=begin original + +* There is an exception to a bracketed character class matching only a +single character. When the class is to match caselessly under C</i> +matching rules, and a character inside the class matches a +multiple-character sequence caselessly under Unicode rules, the class +(when not L<inverted|/Negation>) will also match that sequence. For +example, Unicode says that the letter C<LATIN SMALL LETTER SHARP S> +should match the sequence C<ss> under C</i> rules. Thus, + +=end original + +* There is an exception to a bracketed character class matching only a +single character. When the class is to match caselessly under C</i> +matching rules, and a character inside the class matches a +multiple-character sequence caselessly under Unicode rules, the class +(when not L<inverted|/Negation>) will also match that sequence. For +example, Unicode says that the letter C<LATIN SMALL LETTER SHARP S> +should match the sequence C<ss> under C</i> rules. 
Thus, +(TBT) + + 'ss' =~ /\A\N{LATIN SMALL LETTER SHARP S}\z/i # Matches + 'ss' =~ /\A[aeioust\N{LATIN SMALL LETTER SHARP S}]\z/i # Matches + +=head3 Special Characters Inside a Bracketed Character Class + +(かっこ付き文字クラスの中の特殊文字) + +=begin original + +Most characters that are meta characters in regular expressions (that +is, characters that carry a special meaning like C<.>, C<*>, or C<(>) lose +their special meaning and can be used inside a character class without +the need to escape them. For instance, C<[()]> matches either an opening +parenthesis, or a closing parenthesis, and the parens inside the character +class don't group or capture. + +=end original + +正規表現内でメタ文字(つまり、C<.>, C<*>, C<(> のように特別な意味を持つ +文字)となるほとんどの文字は文字クラス内ではエスケープしなくても特別な意味を +失うので、エスケープする必要はありません。 +例えば、C<[()]> は開きかっこまたは閉じかっこにマッチングし、文字クラスの中の +かっこはグループや捕捉にはなりません。 + +=begin original + +Characters that may carry a special meaning inside a character class are: +C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be +escaped with a backslash, although this is sometimes not needed, in which +case the backslash may be omitted. + +=end original + +文字クラスの中でも特別な意味を持つ文字は: +C<\>, C<^>, C<->, C<[>, C<]> で、以下で議論します。 +これらは逆スラッシュでエスケープできますが、不要な場合もあり、そのような +場合では逆スラッシュは省略できます。 + +=begin original + +The sequence C<\b> is special inside a bracketed character class. While +outside the character class, C<\b> is an assertion indicating a point +that does not have either two word characters or two non-word characters +on either side, inside a bracketed character class, C<\b> matches a +backspace character. + +=end original + +シーケンス C<\b> は大かっこ文字クラスの内側では特別です。 +文字クラスの外側では C<\b> は、両側が二つの単語文字でも二つの非単語文字でもない +位置を示す表明ですが、大かっこ文字クラスの内側では C<\b> は後退文字に +マッチングします。 + +=begin original + +The sequences +C<\a>, +C<\c>, +C<\e>, +C<\f>, +C<\n>, +C<\N{I<NAME>}>, +C<\N{U+I<hex char>}>, +C<\r>, +C<\t>, +and +C<\x> +are also special and have the same meanings as they do outside a +bracketed character class. 
(However, inside a bracketed character +class, if C<\N{I<NAME>}> expands to a sequence of characters, only the first +one in the sequence is used, with a warning.) + +=end original + +並び +C<\a>, +C<\c>, +C<\e>, +C<\f>, +C<\n>, +C<\N{I<NAME>}>, +C<\N{U+I<hex char>}>, +C<\r>, +C<\t>, +C<\x> +も特別で、大かっこ文字クラスの外側と同じ意味を持ちます。 +(However, inside a bracketed character +class, if C<\N{I<NAME>}> expands to a sequence of characters, only the first +one in the sequence is used, with a warning.) +(TBT) + +=begin original + +Also, a backslash followed by two or three octal digits is considered an octal +number. + +=end original + +また、逆スラッシュに引き続いて 2 または 3 桁の 8 進数字があると 8 進数として +扱われます。 + +=begin original + +A C<[> is not special inside a character class, unless it's the start of a +POSIX character class (see L</POSIX Character Classes> below). It normally does +not need escaping. + +=end original + +C<[> は、POSIX 文字クラス(後述の L</POSIX Character Classes> 参照)の +開始でない限りは文字クラスの中では特別ではありません。 +これは普通エスケープは不要です。 + +=begin original + +A C<]> is normally either the end of a POSIX character class (see +L</POSIX Character Classes> below), or it signals the end of the bracketed +character class. If you want to include a C<]> in the set of characters, you +must generally escape it. + +=end original + +C<]> は普通は POSIX 文字クラス(後述の L</POSIX Character Classes> 参照)の +終わりか、大かっこ文字クラスの終了を示すかどちらかです。 +文字集合に C<]> を含める必要がある場合、一般的には +エスケープしなければなりません。 + +=begin original + +However, if the C<]> is the I<first> (or the second if the first +character is a caret) character of a bracketed character class, it +does not denote the end of the class (as you cannot have an empty class) +and is considered part of the set of characters that can be matched without +escaping. 
+ +=end original + +しかし、C<]> が大かっこ文字クラスの I<最初> (または最初の文字がキャレットなら +2 番目) の文字の場合、(空クラスを作ることはできないので)これはクラスの +終了を意味せず、エスケープなしでマッチングできる文字の集合の一部と +考えられます。 + +=begin original + +Examples: + +=end original + +例: + +=begin original + + "+" =~ /[+?*]/ # Match, "+" in a character class is not special. + "\cH" =~ /[\b]/ # Match, \b inside a character class + # is equivalent to a backspace. + "]" =~ /[][]/ # Match, as the character class contains + # both [ and ]. + "[]" =~ /[[]]/ # Match, the pattern contains a character class + # containing just [, and the character class is + # followed by a ]. + +=end original + + "+" =~ /[+?*]/ # マッチング; 文字クラス内の "+" は特別ではない。 + "\cH" =~ /[\b]/ # マッチング; 文字クラスの内側の \b は後退と + # 等価。 + "]" =~ /[][]/ # マッチング; 文字クラスに [ と ] の両方を + # 含んでいる。 + "[]" =~ /[[]]/ # マッチング; パターンは [ だけを含んでいる + # 文字クラスと、それに引き続く + # ] からなる。 + +=head3 Character Ranges + +(文字範囲) + +=begin original + +It is not uncommon to want to match a range of characters. Luckily, instead +of listing all characters in the range, one may use the hyphen (C<->). +If inside a bracketed character class you have two characters separated +by a hyphen, it's treated as if all characters between the two were in +the class. For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]> +matches any lowercase letter from the first half of the old ASCII alphabet. + +=end original + +文字のある範囲にマッチングしたいというのは珍しくありません。 +幸運なことに、その範囲の文字を全て一覧に書く代わりに、ハイフン (C<->) を +使えます。 +大かっこ文字クラスの内側で二つの文字がハイフンで区切られていると、 +二つの文字の間の全ての文字がクラスに書かれているかのように扱われます。 +例えば、C<[0-9]> は任意の ASCII 数字にマッチングし、C<[a-m]> は +古い ASCII アルファベットの前半分の小文字にマッチングします。 + +=begin original + +Note that the two characters on either side of the hyphen are not +necessarily both letters or both digits. Any character is possible, +although not advisable. C<['-?]> contains a range of characters, but +most people will not know which characters that means. 
Furthermore, +such ranges may lead to portability problems if the code has to run on +a platform that uses a different character set, such as EBCDIC. + +=end original + +ハイフンのそれぞれの側の二つの文字は両方とも英字であったり両方とも +数字であったりする必要はないことに注意してください。 +任意の文字が可能ですが、勧められません。 +C<['-?]> は文字の範囲を含みますが、ほとんどの人はどの文字が含まれるか +分かりません。 +さらに、このような範囲は、コードが EBCDIC のような異なった文字集合を使う +プラットフォームで実行されると移植性の問題を引き起こします。 + +=begin original + +If a hyphen in a character class cannot syntactically be part of a range, for +instance because it is the first or the last character of the character class, +or if it immediately follows a range, the hyphen isn't special, and so is +considered a character to be matched literally. If you want a hyphen in +your set of characters to be matched and its position in the class is such +that it could be considered part of a range, you must escape that hyphen +with a backslash. + +=end original + +例えば文字クラスの最初または最後であったり、範囲の直後のために、文字クラスの +中のハイフンが文法的に範囲の一部となれない場合、ハイフンは特別ではなく、 +リテラルにマッチングするべき文字として扱われます。 +マッチングする文字の集合にハイフンを入れたいけれどもその位置が範囲の +一部として考えられる場合はハイフンを逆スラッシュで +エスケープしなければなりません。 + +=begin original + +Examples: + +=end original + +例: + +=begin original + + [a-z] # Matches a character that is a lower case ASCII letter. + [a-fz] # Matches any letter between 'a' and 'f' (inclusive) or + # the letter 'z'. + [-z] # Matches either a hyphen ('-') or the letter 'z'. + [a-f-m] # Matches any letter between 'a' and 'f' (inclusive), the + # hyphen ('-'), or the letter 'm'. + ['-?] # Matches any of the characters '()*+,-./0123456789:;<=>? + # (But not on an EBCDIC platform). + +=end original + + [a-z] # 小文字 ASCII 英字にマッチング。 + [a-fz] # 'a' から 'f' の英字およびと 'z' の英字に + # マッチング。 + [-z] # ハイフン ('-') または英字 'z' にマッチング。 + [a-f-m] # 'a' から 'f' の英字、ハイフン ('-')、英字 'm' に + # マッチング。 + ['-?] # 文字 '()*+,-./0123456789:;<=>? のどれかにマッチング + # (しかし EBCDIC プラットフォームでは異なります)。 + +=head3 Negation + +(否定) + +=begin original + +It is also possible to instead list the characters you do not want to +match. 
You can do so by using a caret (C<^>) as the first character in the +character class. For instance, C<[^a-z]> matches any character that is not a +lowercase ASCII letter, which therefore includes almost a hundred thousand +Unicode letters. The class is said to be "negated" or "inverted". + +=end original + +代わりにマッチングしたくない文字の一覧を指定することも可能です。 +文字クラスの先頭の文字としてキャレット (C<^>) を使うことで実現します。 +例えば、C<[^a-z]> は小文字の ASCII 英字以外の文字にマッチングします。 +, which therefore includes almost a hundred thousand +Unicode letters. The class is said to be "negated" or "inverted". +(TBT) + +=begin original + +This syntax makes the caret a special character inside a bracketed character +class, but only if it is the first character of the class. So if you want +the caret as one of the characters to match, either escape the caret or +else not list it first. + +=end original + +この文法はキャレットを大かっこ文字クラスの内側で特別な文字にしますが、 +クラスの最初の文字の場合のみです。 +それでマッチングしたい文字の一つでキャレットを使いたい場合、キャレットを +エスケープするか、最初以外の位置に書いてください。 + +=begin original + +In inverted bracketed character classes, Perl ignores the Unicode rules +that normally say that a given character matches a sequence of multiple +characters under caseless C</i> matching, which otherwise could be +highly confusing: + +=end original + +In inverted bracketed character classes, Perl ignores the Unicode rules +that normally say that a given character matches a sequence of multiple +characters under caseless C</i> matching, which otherwise could be +highly confusing: +(TBT) + + "ss" =~ /^[^\xDF]+$/ui; + +=begin original + +This should match any sequences of characters that aren't C<\xDF> nor +what C<\xDF> matches under C</i>. C<"s"> isn't C<\xDF>, but Unicode +says that C<"ss"> is what C<\xDF> matches under C</i>. So which one +"wins"? Do you fail the match because the string has C<ss> or accept it +because it has an C<s> followed by another C<s>? + +=end original + +This should match any sequences of characters that aren't C<\xDF> nor +what C<\xDF> matches under C</i>. 
C<"s"> isn't C<\xDF>, but Unicode +says that C<"ss"> is what C<\xDF> matches under C</i>. So which one +"wins"? Do you fail the match because the string has C<ss> or accept it +because it has an C<s> followed by another C<s>? +(TBT) + +=begin original + +Examples: + +=end original + +例: + +=begin original + + "e" =~ /[^aeiou]/ # No match, the 'e' is listed. + "x" =~ /[^aeiou]/ # Match, as 'x' isn't a lowercase vowel. + "^" =~ /[^^]/ # No match, matches anything that isn't a caret. + "^" =~ /[x^]/ # Match, caret is not special here. + +=end original + + "e" =~ /[^aeiou]/ # マッチングしない; 'e' がある。 + "x" =~ /[^aeiou]/ # マッチング; 'x' は小文字の母音ではない。 + "^" =~ /[^^]/ # マッチングしない; キャレット以外全てにマッチング。 + "^" =~ /[x^]/ # マッチング; キャレットはここでは特別ではない。 + +=head3 Backslash Sequences + +(逆スラッシュシーケンス) + +=begin original + +You can put any backslash sequence character class (with the exception of +C<\N> and C<\R>) inside a bracketed character class, and it will act just +as if you had put all characters matched by the backslash sequence inside the +character class. For instance, C<[a-f\d]> matches any decimal digit, or any +of the lowercase letters between 'a' and 'f' inclusive. + +=end original + +大かっこ文字クラスの中に(C<\N> と C<\R> を例外として)逆スラッシュシーケンス +文字クラスを置くことができ、逆スラッシュシーケンスにマッチングする全ての +文字を文字クラスの中に置いたかのように動作します。 +例えば、C<[a-f\d]> は任意の 10 進数字、あるいは 'a' から 'f' までの小文字に +マッチングします。 + +=begin original + +C<\N> within a bracketed character class must be of the forms C<\N{I<name>}> +or C<\N{U+I<hex char>}>, and NOT be the form that matches non-newlines, +for the same reason that a dot C<.> inside a bracketed character class loses +its special meaning: it matches nearly anything, which generally isn't what you +want to happen. 
+ +=end original + +大かっこ文字クラスの中のドット C<.> が特別な意味を持たないのと同じ理由で、 +大かっこ文字クラスの中の C<\N> は C<\N{I<name>}> または +C<\N{U+I<hex char>}> の形式で、かつ非改行マッチング形式でない形でなければ +なりません: これはほとんど何でもマッチングするので、一般的には起こって +欲しいことではありません。 + +=begin original + +Examples: + +=end original + +例: + +=begin original + + /[\p{Thai}\d]/ # Matches a character that is either a Thai + # character, or a digit. + /[^\p{Arabic}()]/ # Matches a character that is neither an Arabic + # character, nor a parenthesis. + +=end original + + /[\p{Thai}\d]/ # タイ文字または数字の文字に + # マッチングする。 + /[^\p{Arabic}()]/ # アラビア文字でもかっこでもない文字に + # マッチングする。 + +=begin original + +Backslash sequence character classes cannot form one of the endpoints +of a range. Thus, you can't say: + +=end original + +逆スラッシュシーケンス文字クラスは範囲の端点の一つにはできません。 +従って、以下のようにはできません: + + /[\p{Thai}-\d]/ # Wrong! + +=head3 POSIX Character Classes +X<character class> X<\p> X<\p{}> +X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph> +X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit> + +(POSIX 文字クラス) + +=begin original + +POSIX character classes have the form C<[:class:]>, where I<class> is +name, and the C<[:> and C<:]> delimiters. POSIX character classes only appear +I<inside> bracketed character classes, and are a convenient and descriptive +way of listing a group of characters. + +=end original + +POSIX 文字クラスは C<[:class:]> の形式で、I<class> は名前、C<[:> と C<:]> は +デリミタです。 +POSIX 文字クラスは大かっこ文字クラスの I<内側> にのみ現れ、文字のグループを +一覧するのに便利で記述的な方法です。 + +=begin original + +Be careful about the syntax, + +=end original + +文法について注意してください、 + + # Correct: + $string =~ /[[:alpha:]]/ + + # Incorrect (will warn): + $string =~ /[:alpha:]/ + +=begin original + +The latter pattern would be a character class consisting of a colon, +and the letters C<a>, C<l>, C<p> and C<h>. +POSIX character classes can be part of a larger bracketed character class. 
+For example, + +=end original + +後者のパターンは、コロンおよび C<a>, C<l>, C<p>, C<h> の文字からなる +文字クラスです。 +これら文字クラスはより大きな大かっこ文字クラスの一部にできます。 +例えば、 + + [01[:alpha:]%] + +=begin original + +is valid and matches '0', '1', any alphabetic character, and the percent sign. + +=end original + +これは妥当で、'0'、'1'、任意の英字、パーセントマークにマッチングします。 + +=begin original + +Perl recognizes the following POSIX character classes: + +=end original + +Perl は以下の POSIX 文字クラスを認識します: + +=begin original + + alpha Any alphabetical character ("[A-Za-z]"). + alnum Any alphanumeric character. ("[A-Za-z0-9]") + ascii Any character in the ASCII character set. + blank A GNU extension, equal to a space or a horizontal tab ("\t"). + cntrl Any control character. See Note [2] below. + digit Any decimal digit ("[0-9]"), equivalent to "\d". + graph Any printable character, excluding a space. See Note [3] below. + lower Any lowercase character ("[a-z]"). + print Any printable character, including a space. See Note [4] below. + punct Any graphical character excluding "word" characters. Note [5]. + space Any whitespace character. "\s" plus the vertical tab ("\cK"). + upper Any uppercase character ("[A-Z]"). + word A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w". + xdigit Any hexadecimal digit ("[0-9a-fA-F]"). + +=end original + + alpha 任意の英字 ("[A-Za-z]")。 + alnum 任意の英数字。("[A-Za-z0-9]") + ascii 任意の ASCII 文字集合の文字。 + blank GNU 拡張; スペースまたは水平タブ ("\t") と同じ。 + cntrl 任意の制御文字。後述の [2] 参照。 + digit 任意の 10 進数字 ("[0-9]"); "\d" と等価。 + graph 任意の表示文字; スペースを除く。後述の [3] 参照。 + lower 任意の小文字 ("[a-z]")。 + print 任意の表示文字; スペースを含む。後述の [4] 参照。 + punct 任意の「単語」文字を除く表示文字。[5] 参照。 + space 任意の空白文字。"\s" に加えて垂直タブ ("\cK")。 + upper 任意の大文字 ("[A-Z]")。 + word Perl 拡張 ("[A-Za-z0-9_]"); "\w" と等価。 + xdigit 任意の 16 進文字 ("[0-9a-fA-F]")。 + +=begin original + +Most POSIX character classes have two Unicode-style C<\p> property +counterparts. (They are not official Unicode properties, but Perl extensions +derived from official Unicode properties.) 
The table below shows the relation +between POSIX character classes and these counterparts. + +=end original + +ほとんどの POSIX 文字クラスには、対応する二つの Unicode 式の C<\p> 特性が +あります。 +(これは公式 Unicode 特性ではなく、公式 Unicode 特性から派生した Perl +エクステンションです。) +以下の表は POSIX 文字クラスと対応するものとの関連を示します。 + +=begin original + +One counterpart, in the column labelled "ASCII-range Unicode" in +the table, matches only characters in the ASCII character set. + +=end original + +対応物の一つである、表で "ASCII-range Unicode" と書かれた列のものは、 +ASCII 文字集合の文字にのみマッチングします。 + +=begin original + +The other counterpart, in the column labelled "Full-range Unicode", matches any +appropriate characters in the full Unicode character set. For example, +C<\p{Alpha}> matches not just the ASCII alphabetic characters, but any +character in the entire Unicode character set considered alphabetic. +The column labelled "backslash sequence" is a (short) synonym for +the Full-range Unicode form. + +=end original + +もう一つの対応物である、"Full-range Unicode" と書かれた列のものは、 +Unicode 文字集合全体の中の適切な任意の文字にマッチングします。 +例えば、C<\p{Alpha}> は単に ASCII アルファベット文字だけでなく、 +Unicode 文字集合全体の中からアルファベットと考えられる任意の文字に +マッチングします。 +The column labelled "backslash sequence" is a (short) synonym for +the Full-range Unicode form. +(TBT) + +=begin original + +(Each of the counterparts has various synonyms as well. +L<perluniprops/Properties accessible through \p{} and \P{}> lists all +synonyms, plus all characters matched by each ASCII-range property. +For example, C<\p{AHex}> is a synonym for C<\p{ASCII_Hex_Digit}>, +and any C<\p> property name can be prefixed with "Is" such as C<\p{IsAlpha}>.) + +=end original + +(それぞれの対応物には様々な同義語もあります。 +L<perluniprops/Properties accessible through \p{} and \P{}> に +全ての同義語と、ASCII 範囲のそれぞれでマッチングする全ての文字の一覧が +あります。 +例えば、C<\p{AHex}> は C<\p{ASCII_Hex_Digit}> の同義語で、 +任意の C<\p> 特性名は、C<\p{IsAlpha}> のように、"Is" 接頭辞が使えます。) + +=begin original + +Both the C<\p> counterparts always assume Unicode rules are in effect. 
+On ASCII platforms, this means they assume that the code points from 128 +to 255 are Latin-1, and that means that using them under locale rules is +unwise unless the locale is guaranteed to be Latin-1 or UTF-8. In contrast, the +POSIX character classes are useful under locale rules. They are +affected by the actual rules in effect, as follows: + +=end original + +Both the C<\p> counterparts always assume Unicode rules are in effect. +On ASCII platforms, this means they assume that the code points from 128 +to 255 are Latin-1, and that means that using them under locale rules is +unwise unless the locale is guaranteed to be Latin-1 or UTF-8. In contrast, the +POSIX character classes are useful under locale rules. They are +affected by the actual rules in effect, as follows: +(TBT) + +=over + +=item If the C</a> modifier, is in effect ... + +=begin original + +Each of the POSIX classes matches exactly the same as their ASCII-range +counterparts. + +=end original + +Each of the POSIX classes matches exactly the same as their ASCII-range +counterparts. +(TBT) + +=item otherwise ... + +=over + +=item For code points above 255 ... + +=begin original + +The POSIX class matches the same as its Full-range counterpart. + +=end original + +The POSIX class matches the same as its Full-range counterpart. +(TBT) + +=item For code points below 256 ... + +=over + +=item if locale rules are in effect ... + +=begin original + +The POSIX class matches according to the locale. + +=end original + +The POSIX class matches according to the locale. +(TBT) + +=item if Unicode rules are in effect or if on an EBCDIC platform ... + +=begin original + +The POSIX class matches the same as the Full-range counterpart. + +=end original + +The POSIX class matches the same as the Full-range counterpart. +(TBT) + +=item otherwise ... + +=begin original + +The POSIX class matches the same as the ASCII range counterpart. 
+ +=end original + +The POSIX class matches the same as the ASCII range counterpart. +(TBT) + +=back + +=back + +=back + +=begin original + +Which rules apply are determined as described in +L<perlre/Which character set modifier is in effect?>. + +=end original + +Which rules apply are determined as described in +L<perlre/Which character set modifier is in effect?>. +(TBT) + +=begin original + +It is proposed to change this behavior in a future release of Perl so that +whether or not Unicode rules are in effect would not change the +behavior: Outside of locale or an EBCDIC code page, the POSIX classes +would behave like their ASCII-range counterparts. If you wish to +comment on this proposal, send email to C<perl5****@perl*****>. + +=end original + +Perl の将来のバージョンではこの振る舞いを変えることが提案されています; +whether or not Unicode rules are in effect would not change the +behavior: Outside of locale or an EBCDIC code page, the POSIX classes +would behave like their ASCII-range counterparts. +この提案にコメントしたいなら、C<perl5****@perl*****> にメールを +送ってください。 +(TBT) + + [[:...:]] ASCII-range Full-range backslash Note + Unicode Unicode sequence + ----------------------------------------------------- + alpha \p{PosixAlpha} \p{XPosixAlpha} + alnum \p{PosixAlnum} \p{XPosixAlnum} + ascii \p{ASCII} + blank \p{PosixBlank} \p{XPosixBlank} \h [1] + or \p{HorizSpace} [1] + cntrl \p{PosixCntrl} \p{XPosixCntrl} [2] + digit \p{PosixDigit} \p{XPosixDigit} \d + graph \p{PosixGraph} \p{XPosixGraph} [3] + lower \p{PosixLower} \p{XPosixLower} + print \p{PosixPrint} \p{XPosixPrint} [4] + punct \p{PosixPunct} \p{XPosixPunct} [5] + \p{PerlSpace} \p{XPerlSpace} \s [6] + space \p{PosixSpace} \p{XPosixSpace} [6] + upper \p{PosixUpper} \p{XPosixUpper} + word \p{PosixWord} \p{XPosixWord} \w + xdigit \p{PosixXDigit} \p{XPosixXDigit} + +=over 4 + +=item [1] + +=begin original + +C<\p{Blank}> and C<\p{HorizSpace}> are synonyms. 
+ +=end original + +C<\p{Blank}> と C<\p{HorizSpace}> は同義語です。 + +=item [2] + +=begin original + +Control characters don't produce output as such, but instead usually control +the terminal somehow: for example, newline and backspace are control characters. +In the ASCII range, characters whose code points are between 0 and 31 inclusive, +plus 127 (C<DEL>) are control characters. + +=end original + +制御文字はそれ自体は出力されず、普通は何か端末を制御します: 例えば +改行と後退は制御文字です。 +ASCII の範囲では、符号位置が 0 から 31 までの範囲の文字および 127 (C<DEL>) が +制御文字です。 + +=begin original + +On EBCDIC platforms, it is likely that the code page will define C<[[:cntrl:]]> +to be the EBCDIC equivalents of the ASCII controls, plus the controls +that in Unicode have code points from 128 through 159. + +=end original + +EBCDIC プラットフォームでは、コードページは C<[[:cntrl:]]> を、ASCII +制御文字の EBCDIC 等価物に加えて、Unicode で符号位置 128 から 159 を +持つものと定義していると思われます。 + +=item [3] + +=begin original + +Any character that is I<graphical>, that is, visible. This class consists +of all alphanumeric characters and all punctuation characters. + +=end original + +I<graphical>、つまり見える文字。 +このクラスは全ての英数字と全ての句読点文字。 + +=item [4] + +=begin original + +All printable characters, which is the set of all graphical characters +plus those whitespace characters which are not also controls. + +=end original + +全ての表示可能な文字; 全ての graphical 文字に加えて制御文字でない空白文字。 + +=item [5] + +=begin original + +C<\p{PosixPunct}> and C<[[:punct:]]> in the ASCII range match all +non-controls, non-alphanumeric, non-space characters: +C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in effect, +it could alter the behavior of C<[[:punct:]]>). + +=end original + +ASCII の範囲の C<\p{PosixPunct}> と C<[[:punct:]]> は全ての非制御、非英数字、 +非空白文字にマッチングします: +C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (しかしロケールが有効なら、 +C<[[:punct:]]> の振る舞いが変わる可能性があります)。 + +=begin original + +The similarly named property, C<\p{Punct}>, matches a somewhat different +set in the ASCII range, namely +C<[-!"#%&'()*,./:;?@[\\\]_{}]>. 
That is, it is missing C<[$+E<lt>=E<gt>^`|~]>. +This is because Unicode splits what POSIX considers to be punctuation into two +categories, Punctuation and Symbols. + +=end original + +The similarly named property, C<\p{Punct}>, matches a somewhat different +set in the ASCII range, namely +C<[-!"#%&'()*,./:;?@[\\\]_{}]>. That is, it is missing C<[$+E<lt>=E<gt>^`|~]>. +This is because Unicode splits what POSIX considers to be punctuation into two +categories, Punctuation and Symbols. +(TBT) + +=begin original + +C<\p{XPosixPunct}> and (in Unicode mode) C<[[:punct:]]>, match what +C<\p{PosixPunct}> matches in the ASCII range, plus what C<\p{Punct}> +matches. This is different than strictly matching according to +C<\p{Punct}>. Another way to say it is that +if Unicode rules are in effect, C<[[:punct:]]> matches all characters +that Unicode considers punctuation, plus all ASCII-range characters that +Unicode considers symbols. + +=end original + +C<\p{XPosixPunct}> and (in Unicode mode) C<[[:punct:]]>, match what +C<\p{PosixPunct}> matches in the ASCII range, plus what C<\p{Punct}> +matches. This is different than strictly matching according to +C<\p{Punct}>. +Unicode 規則が有効な場合のもう一つの言い方は、C<[[:punct:]]> は Unicode が +句読点として扱うものに加えて、Unicode が "symbols" として扱う ASCII 範囲の +全ての文字にマッチングします。 +(TBT) + +=item [6] + +=begin original + +C<\p{SpacePerl}> and C<\p{Space}> differ only in that in non-locale +matching, C<\p{Space}> additionally +matches the vertical tab, C<\cK>. Same for the two ASCII-only range forms. + +=end original + +C<\p{SpacePerl}> と C<\p{Space}> の違いは、非ロケールマッチングでは +C<\p{Space}> は垂直タブ C<\cK> にもマッチングすると言うことだけです。 +二つの ASCII のみの範囲の形式では同じです。 + +=back + +=begin original + +There are various other synonyms that can be used for these besides +C<\p{HorizSpace}> and \C<\p{XPosixBlank}>. For example, +C<\p{PosixAlpha}> can be written as C<\p{Alpha}>. All are listed +in L<perluniprops/Properties accessible through \p{} and \P{}>. 
+ +=end original + +There are various other synonyms that can be used for these besides +C<\p{HorizSpace}> and \C<\p{XPosixBlank}>. For example, +C<\p{PosixAlpha}> can be written as C<\p{Alpha}>. All are listed +in L<perluniprops/Properties accessible through \p{} and \P{}>. +(TBT) + +=head4 Negation of POSIX character classes +X<character class, negation> + +(POSIX 文字クラスの否定) + +=begin original + +A Perl extension to the POSIX character class is the ability to +negate it. This is done by prefixing the class name with a caret (C<^>). +Some examples: + +=end original + +POSIX 文字クラスに対する Perl の拡張は否定の機能です。 +これはクラス名の前にキャレット (C<^>) を置くことで実現します。 +いくつかの例です: + + POSIX ASCII-range Full-range backslash + Unicode Unicode sequence + ----------------------------------------------------- + [[:^digit:]] \P{PosixDigit} \P{XPosixDigit} \D + [[:^space:]] \P{PosixSpace} \P{XPosixSpace} + \P{PerlSpace} \P{XPerlSpace} \S + [[:^word:]] \P{PerlWord} \P{XPosixWord} \W + +=begin original + +The backslash sequence can mean either ASCII- or Full-range Unicode, +depending on various factors as described in L<perlre/Which character set modifier is in effect?>. + +=end original + +The backslash sequence can mean either ASCII- or Full-range Unicode, +depending on various factors as described in L<perlre/Which character set modifier is in effect?>. +(TBT) + +=head4 [= =] and [. .] + +([= =] と [. .]) + +=begin original + +Perl recognizes the POSIX character classes C<[=class=]> and +C<[.class.]>, but does not (yet?) support them. Any attempt to use +either construct raises an exception. + +=end original + +Perl は POSIX 文字クラス C<[=class=]> と C<[.class.]> を認識しますが、 +これらには(まだ?)対応していません。 +このような構文を使おうとすると例外が発生します。 + +=head4 Examples + +(例) + +=begin original + + /[[:digit:]]/ # Matches a character that is a digit. + /[01[:lower:]]/ # Matches a character that is either a + # lowercase letter, or '0' or '1'. 
+ /[[:digit:][:^xdigit:]]/ # Matches a character that can be anything + # except the letters 'a' to 'f'. This is + # because the main character class is composed + # of two POSIX character classes that are ORed + # together, one that matches any digit, and + # the other that matches anything that isn't a + # hex digit. The result matches all + # characters except the letters 'a' to 'f' and + # 'A' to 'F'. + +=end original + + /[[:digit:]]/ # 数字の文字にマッチングする。 + /[01[:lower:]]/ # 小文字、'0'、'1' のいずれかの文字に + # マッチングする。 + /[[:digit:][:^xdigit:]]/ # 'a' から 'f' 以外の任意の文字に + # マッチング。これはメインの文字クラスでは二つの + # POSIX 文字クラスが OR され、一つは任意の数字に + # マッチングし、もう一つは 16 進文字でない全ての + # 文字にマッチングします。従って + # 'a' から 'f' および 'A' から 'F' を + # 除く全ての文字に + # マッチングすることになります。 + +=begin meta + +Translate: SHIRAKATA Kentaro <argra****@ub32*****> (5.10.1-) +Status: in progress + +=end meta +