Regular expression
From UNL Wiki
Revision as of 17:43, 22 March 2010
Regular expressions, also referred to as regex or regexp, provide a concise and flexible means of matching strings of text, such as particular characters, words, or patterns of characters. In the UNLarium framework, regular expressions follow the PCRE library and must be enclosed in slashes (/ /). They are used mainly to enhance the power of Ph-rules. The main features are the following:
Characters | |
---|---|
a | match the character a |
3 | match the digit 3 |
\n | newline (NL, LF) |
\r | return (CR) |
\f | form feed (FF) |
\t | tab (TAB) |
\x3C | character with the hex code 3C |
\u561A | character with the hex code 561A |
\e | escape character (alias \u001B) |
\c… | control character |
Wildcards | |
. | match any character (in PCRE, any character except a newline unless the s modifier is set) |
\… | quote single metacharacter: \. matches a dot instead of any character and \\ matches a single backslash |
\w | alphanumeric + underscore (shortcut for [0-9a-zA-Z_]) |
\W | any character not covered by \w |
\d | numeric (shortcut for [0-9]) |
\D | any character not covered by \d |
\s | whitespace (shortcut for [ \t\n\r\f]) |
\S | any character not covered by \s |
[…] | any character listed: [a5!d-g] means a, 5, ! and d, e, f, g |
[^…] | any character not listed: [^a5!d-g] means anything but a, 5, ! and d, e, f, g |
Boundaries | |
\b | matches at a word boundary (the position between a \w character and a \W character) |
\B | matches at any position that is not a word boundary |
^ | matches at the beginning of each line in multiline mode (m); otherwise at the beginning of the entire string |
\A | matches at the beginning of the entire string |
$ | matches at the end of each line in multiline mode (m); otherwise at the end of the entire string |
\Z | matches at the end of the entire string, ignoring a trailing \n |
\z | matches at the end of the entire string |
Quantifiers | |
? | match 1 or 0 times |
* | 0 or more times |
+ | 1 or more times |
{n} | exactly n times |
{n,} | at least n times |
{n,m} | at least n but not more than m times, as often as possible |
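
A few of the features listed above can be tried out in a short sketch. It uses Python's standard re module as a stand-in, on the assumption that it behaves like PCRE for the constructs shown (\b, ?, [^…], {n,m}, ^ with the m modifier, and \.). It is not the UNLarium engine: patterns here are written as Python raw strings rather than between / /, and the sample strings are invented for illustration.

```python
import re

text = "cat, cats and a concatenation"

# \b marks a word boundary, so only the standalone word "cat" matches.
whole_words = re.findall(r"\bcat\b", text)
print(whole_words)      # ['cat']

# ? makes the preceding "s" optional: "cat" and "cats" match as whole words.
optional_s = re.findall(r"\bcats?\b", text)
print(optional_s)       # ['cat', 'cats']

# [^...] matches any character not listed; + repeats it one or more times.
non_vowel_runs = re.findall(r"[^aeiou ,]+", "banana")
print(non_vowel_runs)   # ['b', 'n', 'n']

# {n,m} repeats as often as possible, but never more than m times.
greedy = re.match(r"a{2,4}", "aaaaaa").group()
print(greedy)           # aaaa

# With the multiline modifier, ^ matches at the beginning of every line.
line_starts = re.findall(r"^\w+", "one two\nthree four", flags=re.MULTILINE)
print(line_starts)      # ['one', 'three']

# \. quotes the dot, so it matches a literal period instead of any character.
literal_dot = re.findall(r"\d+\.\d+", "pi is 3.14, e is 2.71, not 3x14")
print(literal_dot)      # ['3.14', '2.71']
```

Some entries in the table do differ between engines (for example, Python's re has no \e escape, and its \Z does not ignore a trailing newline), so the PCRE documentation remains the reference for anything beyond these basic constructs.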