Perl Regular Expressions

Table of Contents

1. Overview

Perl is a scripting language widely used for system administration and programming on the World Wide Web. It originated in the UNIX community and has a strong UNIX slant, but it is also very useful on Win32 platforms. perl (small 'p') is the program used to interpret the Perl language.

2. Introduction to Regular Expression.

A regular expression is a string that describes a pattern to search for in text. In its simplest form, each character of the pattern must match the text exactly. The string can also contain special characters that have a special meaning; these are not matched literally but make the pattern more generic.

The special characters that make a pattern more generic are:

        [   ]   - Square Brackets.
        (  )    - Parentheses.
        {  }    - Braces.
        -       - Hyphen.
        +       - Plus.
        *       - Asterisk.
        .       - Dot or Period.
        ^       - Caret.
        $       - Dollar.
        ?       - Question Mark.
        |       - Pipe or OR.
        \       - Back slash.

These special characters can be combined freely. Which of them to use depends on the pattern to be matched.

Regular expressions make it easy to search text for a pattern rather than for one fixed string.

In Perl, a regular expression is written between two forward slash ( '/' ) characters.
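
As a minimal sketch of the slash notation, the following applies a pattern to a string with Perl's =~ binding operator. The sample text and the variable names are assumptions made only for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample text, chosen only for illustration.
my $text = "Perl is a scripting language";

# =~ binds the string on the left to the pattern between the slashes.
# The match is true when the pattern occurs anywhere in the string.
my $found   = ($text =~ /script/) ? 1 : 0;
my $missing = ($text =~ /python/) ? 1 : 0;

print "found=$found missing=$missing\n";    # prints "found=1 missing=0"
```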

3. Literal Pattern.

A literal pattern is a string that contains no special characters. It matches only an identical string, so it contains none of the regular expression operators.

Example:

a. /PERL Regular Expression/

b. /Pattern matching language/

These are simple examples of literal patterns. Using them is like searching for a word or string in a text editor.
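
A short sketch of literal matching follows. It assumes a made-up source text and shows that a literal pattern must agree with the text character for character, including letter case.

```perl
use strict;
use warnings;

# Hypothetical source text for illustration.
my $text = "Introduction to PERL Regular Expression";

# The literal pattern must appear in the text exactly as written.
my $exact      = ($text =~ /PERL Regular Expression/) ? 1 : 0;
# Same words in a different case: a literal match is case-sensitive.
my $wrong_case = ($text =~ /perl regular expression/) ? 1 : 0;

print "exact=$exact wrong_case=$wrong_case\n";    # prints "exact=1 wrong_case=0"
```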

4. Character Sets.

A character set is a list of characters, any one of which may appear at a given position in the pattern. There are several kinds of character sets, each with its own meaning. When the regex engine encounters a character set, it matches a single character that appears in the list.

Character sets are always enclosed in square brackets ( [ ] ).

Example:

        
        /[a-z]/         Specifies the lower case alphabet character set.
        /[A-Z]/         Specifies the upper case alphabet character set.
        /[0-9]/         Specifies the numeric character set.
        /[,'\.\?]/      Specifies a punctuation character set.
                        Outside a character set, . and ? are special characters 
                        and must be preceded by a backslash to be matched 
                        literally.  Inside square brackets they are already 
                        literal, so the backslashes here are optional.
        /[A-Za-z]/      Specifies both lower & upper case alphabet character set.
        /[A-Za-z0-9]/   Specifies both lower & upper case alphabet with numeric 
                        character set.

        /[^a-z]/        Specifies the character set other than lower case alphabet.
                        A ^ symbol placed first inside square brackets is treated 
                        as a negation symbol.  This character set tells the regex 
                        engine to match any character that is not a lower case 
                        letter.

        /[^0-9]/        Specifies the character set other than numeric characters.


        Which character set to use depends on the pattern being matched.

        A character set is also called a character class.
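
The character sets listed above can be tried out with a sketch like the one below. The sample text is an assumption; each match is wrapped in parentheses and evaluated in list context so that the first matching character is returned.

```perl
use strict;
use warnings;

# Hypothetical sample text for illustration.
my $text = "abc 42, XYZ?";

# Each line captures the first character in $text that belongs to the set.
my ($lower)     = ($text =~ /([a-z])/);     # first lower case letter: "a"
my ($upper)     = ($text =~ /([A-Z])/);     # first upper case letter: "X"
my ($digit)     = ($text =~ /([0-9])/);     # first digit: "4"
my ($punct)     = ($text =~ /([,'.?])/);    # first punctuation mark: ","
my ($not_lower) = ($text =~ /([^a-z])/);    # first non-lower-case character: " "

print "$lower $upper $digit $punct [$not_lower]\n";    # prints "a X 4 , [ ]"
```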

5. Range.

A range is a short form for a list of consecutive characters inside a character set. A range is written with a hyphen ( - ) between its first and last characters.

Example:

        /[a-z]/         Specifies the lower case alphabet character set, which is a short form of 
                        /[abcdefghijklmnopqrstuvwxyz]/.

        /[0-9]/         Specifies the numeric character set, which is a short form of
                        /[0123456789]/.
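
To check that a range and the spelled-out list behave the same, a small sketch such as the following can be used. The test characters and the shorter range [a-f] are arbitrary choices made for illustration.

```perl
use strict;
use warnings;

my $same    = 1;    # stays 1 while both forms agree
my $matched = 0;    # how many test characters fall inside the range

# Try a mix of lower case letters, upper case letters and digits.
for my $ch ('a' .. 'z', 'A' .. 'C', '0' .. '9') {
    my $by_range = ($ch =~ /[a-f]/)      ? 1 : 0;
    my $by_list  = ($ch =~ /[abcdef]/)   ? 1 : 0;
    $same = 0 if $by_range != $by_list;
    $matched++ if $by_range;
}

print "same=$same matched=$matched\n";    # prints "same=1 matched=6"
```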

6. Any Character.

A character set (class of characters) specifies the list of characters to match, and the regex engine will match only the characters listed. To match any character at all, use the dot ( . ) operator.

The dot tells the regex engine to match any single character.

Example:

/.at/ matches all of the following:

1. Bat

2. Cat

3. Eat

4. Fat

5. Rat

Dot is a simple notation to match any character.

By default the dot does not match a newline ( \n ). With the s modifier ( m//s ) it matches a newline as well.
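
The following sketch tries /.at/ against a few words and against a string that starts with a newline. The word list is an assumption; the s modifier is the single-line modifier described in the quick reference.

```perl
use strict;
use warnings;

# "at" alone has no character in front of it, so the dot has nothing to match.
my @words = ("Bat", "Cat", "at", "Rat");
my @hits  = grep { /.at/ } @words;

# A newline in front of "at" is not matched by the dot unless /s is given.
my $default_mode = ("\nat" =~ /.at/)  ? 1 : 0;
my $single_line  = ("\nat" =~ /.at/s) ? 1 : 0;

print "@hits\n";                              # prints "Bat Cat Rat"
print "$default_mode $single_line\n";         # prints "0 1"
```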

7. Grouping

Grouping combines a series of characters or patterns into a single element. The text matched by a group can be reused wherever necessary, which makes it possible to cut a specific piece out of a text and reproduce it elsewhere.

The grouping operator is a pair of parentheses, ( ).

The text matched by the pattern enclosed in the parentheses is stored in a variable. The variables are numbered in order of occurrence: the text matched by the first group is stored in $1, the second in $2, and so on.

Example:

1. RegEx: /This is ([0-9]) testing/ Source Text: /This is 1 testing/

The above RegEx will match the text and store the number 1 in the $1 variable.

2. RegEx: /456 (ULRA) 73/ Source Text: /This is sample text with 456 ULRA 73./

The above RegEx will match the text and store ULRA in the $1 variable.

A pattern can contain any number of groups. Each group is stored in its own variable.
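
The two grouping examples can be reproduced with the sketch below. The third match, with two groups, uses a made-up source text to show how $1 and $2 are filled in order.

```perl
use strict;
use warnings;

my ($digit, $word, $left, $right);

# Example 1: the digit between "This is" and "testing" lands in $1.
if ("This is 1 testing" =~ /This is ([0-9]) testing/) {
    $digit = $1;
}

# Example 2: the grouped literal ULRA lands in $1.
if ("This is sample text with 456 ULRA 73." =~ /456 (ULRA) 73/) {
    $word = $1;
}

# Two groups (hypothetical text): the first fills $1, the second fills $2.
if ("1-3" =~ /([0-9])-([0-9])/) {
    ($left, $right) = ($1, $2);
}

print "$digit $word $left $right\n";    # prints "1 ULRA 1 3"
```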

8. Back references & Extraction.

The text captured by a group can be matched again later in the same expression by using a back reference. A back reference matches exactly the text that the group captured, not the group's pattern.

A back reference is written as the group number preceded by a backslash.

\1 - Represents back referencing 1st grouping.

\2 - Represents back referencing 2nd grouping.

Example 1:

RegEx: /([0-9]) \1 ([0-9])/

Source Text: /1 1 3/

After the match, the variables hold:

$1 = 1

$2 = 3

\1 literally matches the text captured by the first group, which is the second 1 in the source text.

Example 2:

RegEx: /([0-9]) \1 ([0-9])/

Source Text: /1 2 3/

The above RegEx will not match: the first group captures 1, so \1 requires another 1 after the space, but the source text has 2 there.

The text matched by a back reference is not stored in a separate variable.
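
Both back reference examples can be checked with a sketch like this one, which runs the same pattern against the two source texts.

```perl
use strict;
use warnings;

# Example 1: "1 1 3" repeats the first digit, so \1 is satisfied.
my ($first, $second) = ("1 1 3" =~ /([0-9]) \1 ([0-9])/);

# Example 2: "1 2 3" has no repeated digit, so the pattern fails.
my $no_repeat = ("1 2 3" =~ /([0-9]) \1 ([0-9])/) ? 1 : 0;

# Only the two groups produce values; the back reference itself stores nothing.
print "first=$first second=$second no_repeat=$no_repeat\n";
# prints "first=1 second=3 no_repeat=0"
```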

9. Optional Expressions.

Part of a pattern can be made optional with the ? operator, which lets the preceding element match zero or one time.

Example:

RegEx: /[0-9]? This is sample/

Source Text1: /1 This is sample/

Source Text2: / This is sample/

The above regular expression will match both source text 1 and source text 2.
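
A minimal sketch of the optional digit follows. The third text, without the leading space, is an added assumption that shows the rest of the pattern is still required.

```perl
use strict;
use warnings;

my @texts = (
    "1 This is sample",    # digit present
    " This is sample",     # digit absent, leading space present
    "This is sample",      # leading space missing: the pattern needs it
);

# [0-9]? may match one digit or nothing; the space after it is mandatory.
my @matched = grep { /[0-9]? This is sample/ } @texts;

print scalar(@matched), "\n";    # prints "2"
```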

10. Counted Expressions.

An interval expression, {n,m} where 'n' and 'm' are non-negative integers with 'm >= n', applies to the preceding character, character set, subexpression or backreference. It indicates that the preceding element must match at least 'n' times and may match as many as 'm' times.

Example:

RegEx: /cat{1,4}/

Source Text: catttt.

The above regular expression will match the full text "catttt". The expression {1,4} applies only to the preceding t and says that it must occur at least once and at most 4 times.

Types of Counted expressions

1. {n} Matches exactly n times.

2. {n,} Matches at least n times.

3. {n,m} Matches at least n but not more than m times.
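
The three forms of counted expression can be sketched as follows. The source texts are assumptions chosen so that each limit is visible in the captured text.

```perl
use strict;
use warnings;

# {1,4}: six t's are available, but at most four are taken.
my ($run) = ("catttttt" =~ /(cat{1,4})/);

# {1,4}: with no t at all the minimum of one is not met.
my $too_few = ("ca" =~ /cat{1,4}/) ? 1 : 0;

# {4}: exactly four digits.
my ($year) = ("2024-01" =~ /([0-9]{4})/);

# {2,}: two or more b's, with no upper limit.
my ($many) = ("aaa bbbbb" =~ /(b{2,})/);

print "$run $too_few $year $many\n";    # prints "catttt 0 2024 bbbbb"
```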

11. Alternative Expressions.

An alternative expression matches any one of a specified list of patterns, separated by the | operator. This makes it possible to express OR conditions in a pattern.

Example:

RegEx: /(TEXT|text)/

Source Text1: This is sample TEXT.

Source Text2: This is sample text.

The regular expression will match both source text 1 and source text 2, because the alternative expression matches either TEXT or text.
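
A sketch of the alternative expression follows. The third text, with mixed case, is an added assumption that shows only the listed alternatives are accepted.

```perl
use strict;
use warnings;

my @texts = (
    "This is sample TEXT.",
    "This is sample text.",
    "This is sample Text.",    # neither TEXT nor text
);

# Keep the texts in which either alternative occurs.
my @matched = grep { /(TEXT|text)/ } @texts;

# The group records which alternative actually matched.
my ($which) = ("This is sample text." =~ /(TEXT|text)/);

print scalar(@matched), " $which\n";    # prints "2 text"
```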

12. Repeated Expressions.

A repeated expression matches part of a pattern many times in a row. It is like a counted expression, but more generic because no upper limit is given.

Operators used in Repeated Expressions:

1. * ( Asterisk ) - Matches the preceding element 0 or more times.

2. + ( Plus ) - Matches the preceding element 1 or more times.

The * operator makes the preceding element optional and lets it occur any number of times.

The + operator makes the preceding element mandatory and lets it occur any number of times.

Example 1:

RegEx: /[0-9]+/

Source Text1: 123 This is a sample text.

Result: It will match "123".

Example 2:

RegEx: /This is a [0-9]*/

Source Text1: 123 This is a sample text.

Result: It will match "This is a " because [0-9]* may match zero digits.
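
The two examples can be run as the sketch below, with each pattern wrapped in a group so the matched text can be inspected. The "abc" checks are an added assumption that contrasts + with *.

```perl
use strict;
use warnings;

my $text = "123 This is a sample text.";

# + needs at least one digit and takes as many as it can.
my ($digits) = ($text =~ /([0-9]+)/);

# * accepts zero digits, so the match ends right after "This is a ".
my ($prefix) = ($text =~ /(This is a [0-9]*)/);

# Against a text with no digits, + fails while * still succeeds.
my $plus_ok = ("abc" =~ /[0-9]+/) ? 1 : 0;
my $star_ok = ("abc" =~ /[0-9]*/) ? 1 : 0;

print "[$digits] [$prefix] $plus_ok $star_ok\n";
# prints "[123] [This is a ] 0 1"
```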

13. Short Cut Notations.

Perl provides many short cut notations for writing regular expressions. They make regular expressions shorter and easier to read.

List of short cut notations.

1. \w - Match a "word" character ( alphanumeric & _ )

2. \W - Match a non-word character.

3. \s - Match a whitespace character. ( Space, Tab ( \t ), NewLine ( \n ), Return ( \r ) & Form Feed ( \f ) )

4. \S - Match a non-whitespace character.

5. \d - Match a digit character.

6. \D - Match a non-digit character.

These short cut notations can also be used inside character classes. To match one of them repeatedly, combine it with a repeated expression.

Example:

1. RegEx: /[\w]+/

This will match a word, that is, a run of one or more word characters.

2. RegEx: /[^\w]+/

This will match a run of one or more characters other than alphanumerics & _.
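
The short cut notations can be tried with a sketch such as the following. The sample text is an assumption, and the g modifier (covered in the quick reference) is used to collect every match in list context.

```perl
use strict;
use warnings;

# Hypothetical sample text for illustration.
my $text = "user_1 paid 250 USD";

# With the g modifier, a match in list context returns all matches.
my @words   = ($text =~ /[\w]+/g);    # runs of word characters
my @numbers = ($text =~ /\d+/g);      # runs of digits
my @spaces  = ($text =~ /\s/g);       # single whitespace characters

print join("|", @words), "\n";        # prints "user_1|paid|250|USD"
print join("|", @numbers), "\n";      # prints "1|250"
print scalar(@spaces), "\n";          # prints "3"
```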

14. Miscellaneous Information.

The ^ operator anchors the match to the beginning of a line: the text that follows it in the pattern must occur at the start.

The $ operator anchors the match to the end of a line: the text that precedes it in the pattern must occur at the end.
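
A brief sketch of both anchors follows, using made-up strings. Combining ^ and $ is a common way to require that the whole text, not just a part of it, fits the pattern.

```perl
use strict;
use warnings;

# ^ requires the match to begin at the start of the text.
my $starts     = ("Perl is fun" =~ /^Perl/) ? 1 : 0;
my $not_starts = ("I like Perl" =~ /^Perl/) ? 1 : 0;

# $ requires the match to finish at the end of the text.
my $ends = ("I like Perl" =~ /Perl$/) ? 1 : 0;

# Both anchors together: the entire text must consist of digits.
my $all_digits  = ("12345" =~ /^[0-9]+$/) ? 1 : 0;
my $some_digits = ("123a5" =~ /^[0-9]+$/) ? 1 : 0;

print "$starts $not_starts $ends $all_digits $some_digits\n";
# prints "1 0 1 1 0"
```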

15. Summary

1. Literal matching. /Text/

2. Character Sets. /[a-z]/

3. Range /[0-9]/

4. Any character /./

5. Grouping /([0-9]+)/

6. Back references /([0-9]+) \1/

7. Optional Expression /[0-9]?/

8. Counted Expression /([0-9]){1,4}/

9. Alternative Expression /(TEXT|text)/

10. * - Zero or many times.

11. + - One or many times.

16. Quick Reference Guide

Regular Expression

Each character matches itself, unless it is one of the special characters +?.*^$()[]{}|\. The special meaning of these characters can be escaped using a ‘\’.

. matches an arbitrary character, but not a newline unless it is a single-line match (see m//s).

(...) groups a series of pattern elements to a single element.

^ matches the beginning of the target. In multi-line mode (see m//m) also matches after every newline character.

$ matches the end of the line. In multi-line mode also matches before every newline character.

[...] denotes a class of characters to match. [^...] negates the class.

(...|...|...) matches one of the alternatives.

(?# TEXT ) Comment.

(?: REGEXP ) Like (REGEXP) but does not make back-references.

(?= REGEXP ) Zero width positive look-ahead assertion.

(?! REGEXP ) Zero width negative look-ahead assertion.

(? MODIFIER ) Embedded pattern-match modifier. MODIFIER can be one or more of i, m, s or x.

Quantified subpatterns match as many times as possible. When followed with a ‘?’ they match the minimum number of times. These are the quantifiers:

+ matches the preceding pattern element one or more times.

? matches zero or one times.

* matches zero or more times.

{N,M} denotes the minimum N and maximum M match count. {N} means exactly N times; {N,} means at least N times.
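
The difference between a quantifier that matches as many times as possible and one followed by ? can be sketched as below. The tagged sample text is an assumption made for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample text for illustration.
my $text = "<b>bold</b> and <i>italic</i>";

# .+ takes as much as it can: from the first < to the last >.
my ($greedy) = ($text =~ /(<.+>)/);

# .+? takes as little as it can: from the first < to the nearest >.
my ($minimal) = ($text =~ /(<.+?>)/);

print "$greedy\n";     # prints "<b>bold</b> and <i>italic</i>"
print "$minimal\n";    # prints "<b>"
```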

A ‘\’ escapes any special meaning of the following character if non-alphanumeric, but it turns most alphanumeric characters into something special:

\w matches alphanumeric, including ‘_’, \W matches non-alphanumeric.

\s matches whitespace, \S matches non-whitespace.

\d matches numeric, \D matches non-numeric.

\A matches the beginning of the string, \Z matches the end.

\b matches word boundaries, \B matches non-boundaries.

\G matches where the previous m//g search left off.

\n, \r, \f, \t etc. have their usual meaning.

\w, \s and \d may be used within character classes, \b denotes backspace in this context.

Back-references:

\1...\9 refer to matched sub-expressions, grouped with (), inside the match. \10 and up can also be used if the pattern matches that many sub-expressions. See also $1...$9, $+, $&, $` and $' in section ‘Special variables’.

With modifier x, whitespace can be used in the patterns for readability purposes.

Search & Replace

[ EXPR =~ ] [ m ] /PATTERN/ [ g ] [ i ] [ m ] [ o ] [ s ] [ x ]

Searches EXPR (default: $_) for a pattern. If you prepend an m you can use almost any pair of delimiters instead of the slashes. If used in array context, an array is returned consisting of the sub-expressions matched by the parentheses in pattern, i.e. ($1,$2,$3,...).

Optional modifiers: g matches as many times as possible; i searches in a case-insensitive manner; o interpolates variables only once. m treats the string as multiple lines; s treats the string as a single line; x allows for regular expression extensions.

If PATTERN is empty, the most recent pattern from a previous match or replacement is used. With g the match can be used as an iterator in scalar context.
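
The following sketch illustrates three points from the description above: alternative delimiters after m, the list of sub-expressions returned in list context, and the g modifier used as an iterator in scalar context. The sample text is an assumption.

```perl
use strict;
use warnings;

# Hypothetical sample text for illustration.
my $text = "Name: Alice, Name: Bob";

# m with braces as delimiters; in list context the match returns ($1, $2).
my ($key, $value) = ($text =~ m{(\w+): (\w+)});

# With g in scalar context, each loop pass continues after the previous match.
# The i modifier lets "name" match "Name".
my @names;
while ($text =~ /name: (\w+)/gi) {
    push @names, $1;
}

print "$key=$value\n";    # prints "Name=Alice"
print "@names\n";         # prints "Alice Bob"
```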

?PATTERN?

This is just like the /PATTERN/ search, except that it matches only once between calls to the reset operator.

[ $VAR =~ ] s/PATTERN/REPLACEMENT/ [ e ] [ g ] [ i ] [ m ] [ o ] [ s ] [ x ]

Searches a string for a pattern, and if found, replaces that pattern with the replacement text. It returns the number of substitutions made, if any, otherwise it returns false.

Optional modifiers: g replaces all occurrences of the pattern; e evaluates the replacement string as a Perl expression; for the other modifiers, see /PATTERN/ matching. Almost any delimiter may replace the slashes; if single quotes are used, no interpolation is done on the strings between the delimiters, otherwise they are interpolated as if inside double quotes. If bracketing delimiters are used, PATTERN and REPLACEMENT may have their own delimiters, e.g. s(foo)[bar].

If PATTERN is empty, the most recent pattern from a previous match or replacement is used.
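
A short sketch of the substitution operator follows, covering a single replacement, the g modifier with its return value, and the e modifier. The sample strings are assumptions.

```perl
use strict;
use warnings;

my $text = "cat bat cat";

# Without g only the first occurrence is replaced.
my $once = $text;
$once =~ s/cat/dog/;

# With g every occurrence is replaced; the return value is the count.
my $all   = $text;
my $count = ($all =~ s/cat/dog/g);

# With e the replacement is evaluated as a Perl expression.
my $doubled = "2 apples, 3 pears";
$doubled =~ s/(\d+)/$1 * 2/ge;

print "$once\n";           # prints "dog bat cat"
print "$all ($count)\n";   # prints "dog bat dog (2)"
print "$doubled\n";        # prints "4 apples, 6 pears"
```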

[ $VAR =~ ] tr/SEARCHLIST/REPLACEMENTLIST/ [ c ] [ d ] [ s ]

Translates all occurrences of the characters found in the search list with the corresponding character in the replacement list. It returns the number of characters replaced. y may be used instead of tr.

Optional modifiers: c complements the SEARCHLIST; d deletes all characters found in SEARCHLIST that do not have a corresponding character in REPLACEMENTLIST; s squeezes all sequences of characters that are translated into the same target character into one occurrence of this character.
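
The translation operator and its d and s modifiers can be sketched as follows. The sample strings are assumptions. When the replacement list is empty and d is not given, the search list is reused, so the string is left unchanged and tr simply counts.

```perl
use strict;
use warnings;

my $text = "hello world";

# Map every lower case letter to its upper case counterpart.
my $upper = $text;
$upper =~ tr/a-z/A-Z/;

# Empty replacement list without d: nothing changes, the count is returned.
my $vowels = ($text =~ tr/aeiou//);

# d deletes characters that have no counterpart in the replacement list.
my $deleted = "h-e-l-l-o";
$deleted =~ tr/-//d;

# s squeezes runs of the same translated character into one.
my $squeezed = "aaabbbccc";
$squeezed =~ tr/a-c//s;

print "$upper $vowels $deleted $squeezed\n";
# prints "HELLO WORLD 3 hello abc"
```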

pos SCALAR

Returns the position where the last m//g search left off for SCALAR. May be assigned to.
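
A minimal sketch of pos follows, first reading the position after each m//g match and then assigning to it so that the next search starts from a chosen offset. The sample strings are assumptions.

```perl
use strict;
use warnings;

# Read pos after each match: it is the offset just past the matched text.
my $text = "aXbXc";
my @positions;
while ($text =~ /X/g) {
    push @positions, pos($text);
}

# Assign to pos: the next m//g search starts at offset 3.
my $letters = "abcdef";
pos($letters) = 3;
my $next;
if ($letters =~ /(.)/g) {
    $next = $1;
}

print "@positions $next\n";    # prints "2 4 d"
```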

study [ $VAR ]

Studies the scalar variable $VAR in anticipation of performing many pattern matches on its contents before the variable is next modified.