Regular Expressions Part 1 – Introduction

In programming you quite often want to test a string for the occurrence of a character or group of characters.  This is particularly true for such things as data validation on form input boxes.  Regular expressions provide a pattern matching facility.

This tutorial gives a brief overview of basic regular expression syntax and then considers the functions that PHP provides for
working with regular expressions.

The Basics
Matching Patterns
Replacing Patterns
Array Processing

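PHP has historically offered two different types of regular expressions: POSIX-extended and Perl-Compatible Regular Expressions (PCRE). The PCRE functions are more powerful than the POSIX ones, and faster too, so we will concentrate on them. Note that the POSIX (ereg) functions were deprecated in PHP 5.3 and removed in PHP 7, so PCRE is the only built-in option in current PHP versions.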

The Basics

In a regular expression, most characters match only themselves. For instance, if you search for the regular expression “ball” in the string “John plays football,” you get a match because “ball” occurs in that string. Some characters have special meanings in regular expressions. For instance, a dollar sign ($) at the end of a pattern means the pattern must match at the end of the string. Similarly, a caret (^) at the beginning of a pattern means it must match at the beginning of the string. The characters that match themselves are called literals. The characters that have special meanings are called metacharacters.
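As a minimal sketch of literals and anchors, the snippet below uses preg_match(), PHP's PCRE matching function, which returns 1 when the pattern matches and 0 when it does not. The variable names are purely illustrative, and the slashes around each pattern are the PCRE delimiters.

```php
<?php
// Illustrative subject string, taken from the example in the text.
$subject = 'John plays football';

// A literal pattern matches anywhere in the string.
$containsBall = preg_match('/ball/', $subject);

// "$" requires the match to sit at the end of the string.
$endsWithBall = preg_match('/ball$/', $subject);

// "^" requires the match to sit at the start of the string.
$startsWithBall = preg_match('/^ball/', $subject);
$startsWithJohn = preg_match('/^John/', $subject);

// Prints: int(1) int(1) int(0) int(1)
var_dump($containsBall, $endsWithBall, $startsWithBall, $startsWithJohn);
```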

The dot (.) metacharacter matches any single character except a newline (\n). So, the pattern h.t matches hat, hot, hit, hut, h7t, etc. The vertical pipe (|) metacharacter is used for alternatives in a regular expression. It behaves much like a logical OR operator and you should use it if you want to construct a pattern that matches more than one set of characters. For instance, the pattern Utah|Idaho|Nevada matches strings that contain “Utah” or “Idaho” or “Nevada”. Parentheses give us a way to group sequences.

For example:

(Nant|b)ucket matches “Nantucket” or “bucket”. Using parentheses to group together characters for alternation is called grouping.
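The dot, the pipe and grouping can be tried out with a short sketch like the one below. The test words and variable names are assumptions made for illustration. The ^ and $ anchors are added so that each word must match the pattern as a whole.

```php
<?php
// The dot stands for exactly one character (other than a newline).
$dotPattern = '/^h.t$/';
$dotResults = [];
foreach (['hat', 'hot', 'h7t', 'ht', 'heat'] as $word) {
    $dotResults[$word] = preg_match($dotPattern, $word) === 1;
}
// hat, hot and h7t match; "ht" has no middle character, "heat" has two.

// The pipe lists alternatives: any one of the three names is enough.
$mentionsState = preg_match('/Utah|Idaho|Nevada/', 'We drove through Idaho') === 1;

// Parentheses limit the alternation to "Nant" or "b" before "ucket".
$groupPattern = '/^(Nant|b)ucket$/';
$matchesNantucket = preg_match($groupPattern, 'Nantucket') === 1;
$matchesBucket    = preg_match($groupPattern, 'bucket') === 1;
$matchesTucket    = preg_match($groupPattern, 'tucket') === 1;

var_dump($dotResults, $mentionsState, $matchesNantucket, $matchesBucket, $matchesTucket);
```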

If you want to match a literal metacharacter in a pattern, you have to escape it with a backslash.
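Here is a small sketch of why escaping matters, using a dot as the metacharacter. The sample strings are made up for illustration. It also uses preg_quote(), a PCRE helper that escapes metacharacters in arbitrary text for you, which this tutorial does not otherwise cover.

```php
<?php
// Unescaped, the dot matches any character, so "3x14" slips through.
$looseMatch = preg_match('/^3.14$/', '3x14') === 1;

// Escaped with a backslash, the dot only matches a literal dot.
$strictMatchBad  = preg_match('/^3\.14$/', '3x14') === 1;
$strictMatchGood = preg_match('/^3\.14$/', '3.14') === 1;

// preg_quote() adds the backslashes for you; the second argument
// names the delimiter so that it gets escaped as well.
$quoted = preg_quote('$5.00', '/');
$foundPrice = preg_match('/' . $quoted . '/', 'Total: $5.00') === 1;

var_dump($looseMatch, $strictMatchBad, $strictMatchGood, $quoted, $foundPrice);
```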

To specify a set of acceptable characters in your pattern, you can either build a character class yourself or use a predefined one.

A character class lets you represent a bunch of characters as a single item in a regular expression. You can build your own character class by enclosing the acceptable characters in square brackets. A character class matches any one of the characters in the class. For example, the character class [abc] matches a, b or c. To define a range of characters, just put the first and last characters in, separated by a hyphen. For example, to match any alphanumeric character: [a-zA-Z0-9]. You can also create a negated character class, which matches any character that is not in the class. To create a negated character class, begin the character class with ^: [^0-9].
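The following sketch exercises a hand-built class, a range and a negated class. The sample inputs and variable names are illustrative assumptions.

```php
<?php
// [abc] matches exactly one of a, b or c.
$isAbc = preg_match('/^[abc]$/', 'b') === 1;
$isNotAbc = preg_match('/^[abc]$/', 'd') === 1;

// Ranges: one alphanumeric character.
$isAlnumChar = preg_match('/^[a-zA-Z0-9]$/', 'q') === 1;

// Negated class: succeeds if the string holds at least one non-digit.
$hasNonDigit = preg_match('/[^0-9]/', '12a4') === 1;
$digitsOnlyHasNonDigit = preg_match('/[^0-9]/', '1234') === 1;

var_dump($isAbc, $isNotAbc, $isAlnumChar, $hasNonDigit, $digitsOnlyHasNonDigit);
```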

The metacharacters +, *, ?, and {} affect the number of times a pattern should be matched. + means “Match one or more of the preceding expression”, * means “Match zero or more of the preceding expression”, and ? means “Match zero or one of the preceding expression”. Curly braces {} can be used in three ways. With a single integer, {n} means “match exactly n occurrences of the preceding expression”, with one integer and a comma, {n,} means “match n or more occurrences of the preceding expression”, and with two comma-separated integers {n,m} means “match at least n, but no more than m, occurrences of the preceding expression”.
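To see the quantifiers side by side, the sketch below applies each one to the letter b between a and c, then tests strings with zero, one, two and four b's. The pattern labels and test strings are assumptions chosen for illustration.

```php
<?php
// Each pattern quantifies the "b" between "a" and "c".
$patterns = [
    'plus'         => '/^ab+c$/',      // one or more b
    'star'         => '/^ab*c$/',      // zero or more b
    'optional'     => '/^ab?c$/',      // zero or one b
    'exactly2'     => '/^ab{2}c$/',    // exactly two b
    'atLeast2'     => '/^ab{2,}c$/',   // two or more b
    'between1and3' => '/^ab{1,3}c$/',  // one to three b
];
$subjects = ['ac', 'abc', 'abbc', 'abbbbc'];

// $results[label][subject] is true when the subject matches.
$results = [];
foreach ($patterns as $label => $pattern) {
    foreach ($subjects as $subject) {
        $results[$label][$subject] = preg_match($pattern, $subject) === 1;
    }
}

print_r($results);
```

For instance, “abbbbc” (four b's) satisfies +, * and {2,} but fails ?, {2} and {1,3}.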

Now, have a look at the examples:

Regular Expression     Will match…
foo     The string “foo”
^foo     “foo” at the start of a string
foo$     “foo” at the end of a string
^foo$     A string that consists of exactly “foo”
[abc]     a, b, or c
[a-z]     Any lowercase letter
[^A-Z]     Any character that is not an uppercase letter
(gif|jpg)     Matches either “gif” or “jpg”
[a-z]+     One or more lowercase letters
[0-9\.\-]     Any digit, dot, or minus sign
^[a-zA-Z0-9_]{1,}$     Any word of at least one letter, number or _
([wx])([yz])     wy, wz, xy, or xz
[^A-Za-z0-9]     Any symbol (not a number or a letter)
([A-Z]{3}|[0-9]{4})     Matches three uppercase letters or four digits
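A few rows of the table can be checked with the sketch below. The sample inputs are made up. The optional third argument of preg_match() receives the matched text, with index 0 holding the whole match and the following indexes holding what each pair of parentheses matched.

```php
<?php
// ([wx])([yz]): the two groups are reported separately.
preg_match('/([wx])([yz])/', 'xz', $pair);

// ^[a-zA-Z0-9_]{1,}$: a "word" of letters, digits and underscores.
$wordPattern = '/^[a-zA-Z0-9_]{1,}$/';
$validWord   = preg_match($wordPattern, 'user_42') === 1;
$invalidWord = preg_match($wordPattern, 'user 42') === 1; // contains a space

// (gif|jpg): "jpeg" does not contain "jpg", so it is not matched.
$isJpg  = preg_match('/(gif|jpg)/', 'photo.jpg') === 1;
$isJpeg = preg_match('/(gif|jpg)/', 'photo.jpeg') === 1;

// ([A-Z]{3}|[0-9]{4}): the leftmost match in the string wins.
preg_match('/([A-Z]{3}|[0-9]{4})/', 'flight ABC 2024', $code);

print_r($pair);
var_dump($validWord, $invalidWord, $isJpg, $isJpeg, $code[0]);
```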

Perl-Compatible Regular Expressions emulate the Perl syntax for patterns, which means that each pattern must be enclosed in a pair of delimiters. Usually, the slash (/) character is used. For instance, /pattern/.
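The slash is only a convention. As a rough sketch, the snippet below matches the same path fragment twice: once with slash delimiters, which forces every slash inside the pattern to be escaped, and once with # as the delimiter. The URL is a made-up example.

```php
<?php
// Hypothetical URL used only for illustration.
$url = 'http://example.com/docs/index.html';

// With "/" as the delimiter, literal slashes must be escaped.
$withSlashDelimiter = preg_match('/\/docs\//', $url) === 1;

// With "#" as the delimiter, the slashes can be written as-is.
$withHashDelimiter = preg_match('#/docs/#', $url) === 1;

var_dump($withSlashDelimiter, $withHashDelimiter);
```

Picking a delimiter that does not occur in the pattern usually keeps it easier to read.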

The PCRE functions can be divided into several classes: matching, replacing, splitting and filtering.
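As a quick preview, here is one commonly used function from each class. The pairing of preg_match(), preg_replace(), preg_split() and preg_grep() with the four classes is my own assumption about what is meant, and the sample data is invented.

```php
<?php
// Matching: does the string contain a run of digits?
$hasNumber = preg_match('/[0-9]+/', 'room 101') === 1;

// Replacing: swap every run of digits for "#".
$masked = preg_replace('/[0-9]+/', '#', 'room 101, floor 7');

// Splitting: break a string on runs of commas and whitespace.
$colors = preg_split('/[\s,]+/', 'red, green  blue');

// Filtering: keep only the array entries that match the pattern.
// preg_grep() preserves the original keys, so reindex with array_values().
$startsWithA = array_values(preg_grep('/^a/', ['apple', 'banana', 'avocado']));

var_dump($hasNumber, $masked);
print_r($colors);
print_r($startsWithA);
```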
