Regular expression

From UNL Wiki

Latest revision as of 15:43, 16 June 2014

Regular expressions, also referred to as regex or regexp, provide a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. In the UNLarium framework, regular expressions follow the PCRE library.

Syntax

Regular expressions are written between /forward slashes/. They can be used in place of strings ("/.../", between quotes), natural language entries ([/.../], between brackets), UWs ([[/.../]], between double square brackets) or features (/.../). For instance:

  • "/.../" matches any string made of three characters
  • [/[abc]/] matches the natural language entries "a", "b" and "c"
  • [[/(abc|def)/]] matches the UW "abc" or "def"
  • /(MCL|FEM)/ matches the features MCL or FEM
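The four patterns above can be tried outside the UNLarium with a short sketch. It uses Python's re module, which is not PCRE but behaves the same way for these simple constructs. The helper name unl_pattern is hypothetical, and matching against the whole value (fullmatch) is an assumption about how the framework applies a pattern, not something stated on this page.

```python
import re


def unl_pattern(expr: str) -> re.Pattern:
    """Compile the regex found between /forward slashes/ (hypothetical helper)."""
    if len(expr) < 2 or not (expr.startswith("/") and expr.endswith("/")):
        raise ValueError(f"not delimited by forward slashes: {expr!r}")
    return re.compile(expr[1:-1])


# Assumption: a pattern has to cover the whole value, hence fullmatch.
three_chars = unl_pattern("/.../")
entry = unl_pattern("/[abc]/")
uw = unl_pattern("/(abc|def)/")
feature = unl_pattern("/(MCL|FEM)/")

print(bool(three_chars.fullmatch("cat")))   # a string of three characters
print(bool(three_chars.fullmatch("cats")))  # four characters
print(bool(entry.fullmatch("b")))           # one of the entries a, b, c
print(bool(uw.fullmatch("def")))            # the UW "def"
print(bool(feature.fullmatch("FEM")))       # the feature FEM
```

Only the text between the slashes is handed to the regex engine; the surrounding quotes or brackets tell the framework what kind of object the pattern applies to.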

Metacharacters

For a comprehensive list of metacharacters, please consult the Perl Compatible Regular Expressions (PCRE) documentation at http://www.pcre.org/.


Characters
a         match the character a
3         match the digit 3
Wildcards
.         match any character
\…        escape (quote) a single metacharacter: \. matches a dot instead of any character and \\ matches a single backslash
\w        alphanumeric + underscore (shortcut for [0-9a-zA-Z_])
\W        any character not covered by \w
\d        numeric (shortcut for [0-9])
\D        any character not covered by \d
\s        whitespace (shortcut for [ \t\n\r\f])
\S        any character not covered by \s
[…]       any character listed: [a5!d-g] means a, 5, ! and d, e, f, g
[^…]      any character not listed: [^a5!d-g] means anything but a, 5, ! and d, e, f, g
Quantifiers
?         0 or 1 times
*         0 or more times
+         1 or more times
{n}       exactly n times
{n,}      at least n times
{n,m}     at least n but not more than m times, as often as possible
Grouping
(...)     group a subpattern so that a quantifier or an alternation (|) applies to the group as a whole
Special characters
{ } [ ] ( ) ^ $ . | * + ?  to match these characters literally, escape them with \
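The entries in the table can be checked with a small sketch in Python. Python's re module stands in for PCRE here, which is an assumption that holds for these basic constructs; the re.ASCII flag keeps \w and \d on the ASCII ranges given in the table. The names check, SAMPLES and failures are illustrative only.

```python
import re


def check(pattern: str, text: str) -> bool:
    """True if the pattern covers the whole text (ASCII classes, as in the table)."""
    return re.fullmatch(pattern, text, flags=re.ASCII) is not None


# (pattern, text that should match, text that should not match)
SAMPLES = [
    (r".", "x", ""),               # any single character
    (r"\.", ".", "x"),             # escaped dot is a literal dot
    (r"\\", "\\", "/"),            # escaped backslash
    (r"\w", "_", "-"),
    (r"\W", "-", "_"),
    (r"\d", "7", "a"),
    (r"\D", "a", "7"),
    (r"\s", "\t", "a"),
    (r"\S", "a", " "),
    (r"[a5!d-g]", "f", "h"),
    (r"[^a5!d-g]", "h", "f"),
    (r"ab?", "a", "abb"),
    (r"ab*", "abbb", "b"),
    (r"ab+", "ab", "a"),
    (r"a{2}", "aa", "aaa"),
    (r"a{2,}", "aaaa", "a"),
    (r"a{2,3}", "aaa", "aaaa"),
    (r"(ab)+", "abab", "aba"),     # the quantifier applies to the whole group
    (r"\$", "$", "S"),             # escaped special character
]

failures = [
    (pattern, good, bad)
    for pattern, good, bad in SAMPLES
    if not check(pattern, good) or check(pattern, bad)
]
print(failures)

# "As often as possible": {n,m} takes the longest run it is allowed to.
greedy = re.match(r"a{2,4}", "aaaaa").group()
print(greedy)
```

An empty failures list means every row behaved as the table describes. The last two lines show greedy matching: given five "a" characters, a{2,4} consumes four of them rather than stopping at two.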

Examples

RegEx          Description                                                         Matches
/abc/          match the sequence "abc"                                            abc
/abc./         match the sequence "abc" plus one character                         abca, abcb, abcc, abcd, abce, ...
/abc(a)?/      match the sequence "abc" plus zero or one character "a"             abc, abca
/abc(a)*/      match the sequence "abc" plus zero or more characters "a"           abc, abca, abcaa, abcaaa, abcaaaa, abcaaaaa, ...
/abc(a)+/      match the sequence "abc" plus one or more characters "a"            abca, abcaa, abcaaa, abcaaaa, ...
/abc(a){3}/    match the sequence "abc" plus three characters "a"                  abcaaa
/abc(a){3,}/   match the sequence "abc" plus at least three characters "a"         abcaaa, abcaaaa, abcaaaaa, abcaaaaaa, ...
/abc(a){2,5}/  match the sequence "abc" plus two to five characters "a"            abcaa, abcaaa, abcaaaa, abcaaaaa
/a[bcd]e/      match "a" plus "b", "c" or "d", plus "e"                            abe, ace, ade
/a[^bcd]e/     match "a" plus any character that is not "b", "c" or "d", plus "e"  aae, aee, afe, age, ahe, ...
/a\d/          match "a" plus any single digit                                     a0, a1, a2, a3, a4, a5, a6, a7, a8, a9
/a(\d){2}/     match "a" plus any two digits                                       a00, a01, a02, a03, a04, ...
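The rows of the examples table can be replayed mechanically. The sketch below again uses Python's re module as a stand-in for PCRE and assumes that a pattern must cover the whole string, which is how the "Matches" column reads. The non-matching strings are additions for illustration and do not come from the table; EXAMPLES and mismatches are hypothetical names.

```python
import re

# pattern -> (strings that should match, strings that should not match)
EXAMPLES = {
    r"/abc/": (["abc"], ["ab", "abcd"]),
    r"/abc./": (["abca", "abcb"], ["abc"]),
    r"/abc(a)?/": (["abc", "abca"], ["abcaa"]),
    r"/abc(a)*/": (["abc", "abcaaa"], ["abcb"]),
    r"/abc(a)+/": (["abca", "abcaa"], ["abc"]),
    r"/abc(a){3}/": (["abcaaa"], ["abcaa", "abcaaaa"]),
    r"/abc(a){3,}/": (["abcaaa", "abcaaaaaa"], ["abcaa"]),
    r"/abc(a){2,5}/": (["abcaa", "abcaaaaa"], ["abca", "abcaaaaaa"]),
    r"/a[bcd]e/": (["abe", "ace", "ade"], ["aae"]),
    r"/a[^bcd]e/": (["aae", "afe"], ["abe"]),
    r"/a\d/": (["a0", "a9"], ["ab"]),
    r"/a(\d){2}/": (["a00", "a04"], ["a0"]),
}

mismatches = []
for delimited, (should_match, should_not_match) in EXAMPLES.items():
    regex = re.compile(delimited[1:-1])  # drop the surrounding slashes
    for text in should_match:
        if regex.fullmatch(text) is None:
            mismatches.append((delimited, text, "expected a match"))
    for text in should_not_match:
        if regex.fullmatch(text) is not None:
            mismatches.append((delimited, text, "expected no match"))

print(mismatches)
```

If the framework searches for the pattern anywhere in a string instead of matching the whole value, some of the non-matching strings above (for example "abcd" for /abc/) would match; in that case ^ and $ can be used to anchor the pattern.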