Thursday, December 27, 2012

Attempting to understand handling regular expressions with PHP

Regular expressions, commonly known as RegEx, are often considered one of the most complex concepts in programming. However, this is not really true. If you have not worked with regular expressions before, a pattern containing a sequence of special characters like /, $, ^, \, ?, *, etc., in combination with alphanumeric characters, may look like a mess. But RegEx is a kind of language: once you have learnt its symbols and understood their meaning, you will find it one of the most useful tools at hand for solving many complex text-search problems.
Consider how you search for files on your computer. You most likely use the ? and * characters to help find the files you're looking for. The ? character matches a single character in a file name, while the * matches zero or more characters. A pattern such as 'file?.txt' would find the following files:
file1.txt
filer.txt
files.txt

Using the * character instead of the ? character expands the number of files found. 'file*.txt' matches all of the following:
file1.txt
file2.txt
file12.txt
filer.txt
filedce.txt
While this method of searching for files can certainly be useful, it is also very limited. The ? and * wildcard characters give you an idea of what regular expressions can do, but regular expressions are much more powerful and flexible.
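The wildcard behaviour described above can be tried out directly in PHP. The following is a minimal sketch, assuming PHP 7 or later and the built-in fnmatch() function; the helper name match_wildcard() is made up for this illustration.

```php
<?php
// File names taken from the lists above.
$files = ['file1.txt', 'file2.txt', 'file12.txt', 'filer.txt', 'files.txt', 'filedce.txt'];

// Hypothetical helper: return the names that match a shell-style wildcard pattern.
function match_wildcard(string $pattern, array $names): array
{
    return array_values(array_filter($names, function ($name) use ($pattern) {
        return fnmatch($pattern, $name);
    }));
}

$single = match_wildcard('file?.txt', $files); // ? = exactly one character
$any    = match_wildcard('file*.txt', $files); // * = zero or more characters

echo implode(', ', $single), "\n";
echo implode(', ', $any), "\n";
```

Here 'file?.txt' skips file12.txt and filedce.txt because they have more than one character between "file" and ".txt", while 'file*.txt' accepts all six names.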
Let Us Start on RegEx
A regular expression is a pattern of text that consists of ordinary characters (for example, letters a through z) and special characters, known as metacharacters. The pattern describes one or more strings to match when searching a body of text. The regular expression serves as a template for matching a character pattern to the string being searched.
The following table contains the list of some metacharacters and their behavior in the context of regular expressions:
Character Description
\ Marks the next character as either a special character, a literal, a backreference, or an octal escape. For example, 'n' matches the character "n". '\n' matches a newline character. The sequence '\\' matches "\" and "\(" matches "(".
^ Matches the position at the beginning of the input string.
$ Matches the position at the end of the input string.
* Matches the preceding subexpression zero or more times.
+ Matches the preceding subexpression one or more times.
? Matches the preceding subexpression zero or one time.
{n} Matches exactly n times, where n is a nonnegative integer.
{n,} Matches at least n times, where n is a nonnegative integer.
{n,m} Matches at least n and at most m times, where m and n are nonnegative integers and n <= m.
? When this character immediately follows any of the other quantifiers (*, +, ?, {n}, {n,}, {n,m}), the matching pattern is non-greedy. A non-greedy pattern matches as little of the searched string as possible, whereas the default greedy pattern matches as much of the searched string as possible.
. Matches any single character except "\n".
x|y Matches either x or y.
[xyz] A character set. Matches any one of the enclosed characters.
[^xyz] A negative character set. Matches any character not enclosed.
[a-z] A range of characters. Matches any character in the specified range.
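[^a-z] A negative range of characters. Matches any character not in the specified range.
\b Matches a word boundary, that is, the position between a word character and a nonword character (for example, between a word and a space). 'er\b' matches the 'er' in "never" but not the 'er' in "verb".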
\B Matches a nonword boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".
\d Matches a digit character.
\D Matches a nondigit character.
\f Matches a form-feed character.
\n Matches a newline character.
\r Matches a carriage return character.
\s Matches any whitespace character including space, tab, form-feed, etc.
\S Matches any non-whitespace character.
\t Matches a tab character.
\v Matches a vertical tab character. (In PCRE, which PHP's preg functions use, \v matches any vertical whitespace character.)
\w Matches any word character including underscore.
\W Matches any nonword character.
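\un Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©). Note that PHP's preg functions do not accept this \u form; there the same character is written \x{00A9} and the pattern needs the u modifier.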
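A few of these metacharacters are easier to grasp when seen in action. The following is a small sketch using preg_match(), assuming PHP 7 or later; the sample strings are made up for illustration.

```php
<?php
$text = '<b>one</b> and <b>two</b>';

// Greedy: .* grabs as much as possible, so the match runs to the last </b>.
preg_match('/<b>.*<\/b>/', $text, $greedy);

// Non-greedy: .*? stops at the first </b>.
preg_match('/<b>.*?<\/b>/', $text, $lazy);

// Anchors: ^ and $ force the whole string to consist of digits.
$all_digits = preg_match('/^\d+$/', '92692');     // 1
$not_all    = preg_match('/^\d+$/', 'CA 92692');  // 0

// Word boundary: \bcat\b matches "cat" as a word, not inside "concatenate".
$word   = preg_match('/\bcat\b/', 'concatenate a cat');  // 1
$inside = preg_match('/\bcat\b/', 'concatenate');        // 0

// Unicode code point: PCRE writes it as \x{...} together with the u modifier.
$copyright = preg_match('/\x{00A9}/u', "\u{00A9} 2012"); // 1

echo $greedy[0], "\n";
echo $lazy[0], "\n";
```

The first echo prints the whole string, while the second prints only the first bold element.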

RegEx functions in PHP
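PHP has functions for complex string manipulation using RegEx. The following are the POSIX-style RegEx functions provided in PHP. Note that this family has been deprecated since PHP 5.3 and was removed in PHP 7.0; the Perl-compatible preg functions (preg_match(), preg_replace(), preg_split()) are their replacements.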

Function Description
ereg This function matches the text pattern in a string using a RegEx pattern.
eregi This function is similar to ereg(), but ignores case.
ereg_replace This function matches the text pattern in a string using a RegEx pattern and replaces it with the given text.
eregi_replace This is similar to ereg_replace(), but ignores case.
split This function splits a string into an array using RegEx.
spliti This is similar to split(), but ignores case.
sql_regcase This function creates a RegEx from the given string to make a case-insensitive match.
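For readers on a PHP version where the functions above are not available, the same tasks can be done with the Perl-compatible preg functions. This is a rough sketch of the correspondence, not an exhaustive mapping; the patterns and the sample string are chosen only for illustration.

```php
<?php
$str = 'Mission Viejo, CA 92692';

// eregi() -> preg_match() with the i (case-insensitive) modifier.
$found = preg_match('/mission/i', $str);

// ereg_replace() -> preg_replace().
$masked = preg_replace('/[0-9]{5}/', 'XXXXX', $str);

// split() -> preg_split(); here the delimiter is a run of commas or whitespace.
$parts = preg_split('/[\s,]+/', $str);

echo $found, "\n";
echo $masked, "\n";
echo implode('|', $parts), "\n";
```

Unlike the POSIX functions, the preg functions expect the pattern to be wrapped in delimiters (here /), with modifiers such as i placed after the closing delimiter.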

Finding US Zip Code
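Now let us see a simple example that matches a US 5-digit zip code in a string:
<?php
// Note: ereg() needs PHP older than 7.0; it was removed in PHP 7.
$zip_pattern = "[0-9]{5}";   // POSIX pattern: no delimiters
$str = "Mission Viejo, CA 92692";
if (ereg($zip_pattern, $str, $regs)) {
    echo $regs[0];   // the full match
}
?>
This script outputs the following: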
92692
The above example can also be rewritten using Perl-compatible regular expression syntax with the preg_match() function.
<?php
$zip_pattern = "/\d{5}/";   // PCRE pattern: wrapped in / delimiters
$str = "Mission Viejo, CA 92692";
if (preg_match($zip_pattern, $str, $regs)) {
    echo $regs[0];   // the full match
}
?>
Note the change in the RegEx pattern between the two examples: the PCRE pattern is wrapped in / delimiters and uses \d in place of [0-9]. preg_match() is often a faster alternative to ereg().
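One thing to keep in mind is that a pattern like \d{5} is happy to match the first five digits of a longer number. A possible refinement, sketched here with a made-up sample string, is to wrap the pattern in \b word boundaries.

```php
<?php
// Hypothetical input that contains a longer number before the zip code.
$str = 'Order 1234567 ships to Mission Viejo, CA 92692';

// Without boundaries, the first five digits of the order number match.
preg_match('/\d{5}/', $str, $loose);

// With \b on both sides, only a standalone run of exactly five digits matches.
preg_match('/\b\d{5}\b/', $str, $strict);

echo $loose[0], "\n";   // 12345
echo $strict[0], "\n";  // 92692
```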
RegEx for US Phone Numbers
Now let us try to create a RegEx pattern to match a US telephone number. US telephone numbers are 10-digit numbers usually written in three parts, like xxx xxx xxxx. These three parts are normally separated with hyphens, parentheses, and blank spaces. The most common patterns are as follows:
XXX XXX XXXX
(XXX) XXX XXXX
XXX-XXX-XXXX
(XXX) XXX-XXXX
In some cases, the US ISD (country) code is added at the front, like +1 XXX XXX XXXX.
Let us create a Perl-compatible RegEx pattern to match the above patterns. First we need to match the single-digit ISD code (let us not restrict it to 1). But this may or may not be present in a phone number, hence we write it as follows:
$Phone_Pattern = "/(\d)?/";
Here \d is equivalent to [0-9], and the succeeding '?' indicates that the digit may appear once or not at all.
Now what would appear next in the sequence? The possibilities are a blank space or a hyphen. So we add the pattern "(\s|-)?" to the above RegEx. This pattern indicates that either a blank space or a hyphen may or may not appear. So our RegEx becomes:
$Phone_Pattern = "/(\d)?(\s|-)?/";
The next sequence would be either XXX or (XXX). To match this sequence, we first need to match the opening parenthesis with the pattern "(\()?". As parentheses are used to group patterns in RegEx, they are metacharacters, and to match a metacharacter literally we need to precede it with the escape character "\". Hence we use "\(" in our RegEx pattern. Now we need to match the three digits and a closing parenthesis. This can be written as "(\d){3}(\))?". Adding these patterns, our RegEx becomes:
$Phone_Pattern = "/(\d)?(\s|-)?(\()?(\d){3}(\))?/";
After the first part XXX, there should be either a blank space or a hyphen. So we add "(\s|-){1}" to the phone pattern (the {1} just makes "exactly once" explicit; a group matches once by default).
$Phone_Pattern = "/(\d)?(\s|-)?(\()?(\d){3}(\))?(\s|-){1}/";
Further construction of the RegEx is much simpler, as we need to match either XXX-XXXX or XXX XXXX. This can be written as "(\d){3}(\s|-){1}(\d){4}". Adding this part of the pattern to our RegEx:
$Phone_Pattern = "/(\d)?(\s|-)?(\()?(\d){3}(\))?(\s|-){1}(\d){3}(\s|-){1}(\d){4}/";
Yippee!!! We have created a RegEx to match US phone numbers.
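Before using the pattern on real data, it may help to check it against each of the formats listed earlier. The following is a minimal sketch, assuming PHP 7.1 or later; match_phone() is a hypothetical helper and the numbers are placeholders.

```php
<?php
$Phone_Pattern = '/(\d)?(\s|-)?(\()?(\d){3}(\))?(\s|-){1}(\d){3}(\s|-){1}(\d){4}/';

// Hypothetical helper: return the matched text, or null when nothing matches.
function match_phone(string $pattern, string $text): ?string
{
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

// Placeholder numbers in each format listed above, plus one with no separators.
$samples = [
    '650 253 0000',
    '(650) 253 0000',
    '650-253-0000',
    '(650) 253-0000',
    '+1 650 253 0000',
    '6502530000',
];

foreach ($samples as $sample) {
    $result = match_phone($Phone_Pattern, $sample);
    echo $sample, ' => ', $result === null ? 'no match' : $result, "\n";
}
```

Two things stand out when this is run: the + sign is not part of the pattern, so the match for the fifth sample starts at the digit 1, and the last sample does not match at all because the pattern insists on a space or hyphen between the parts.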

Now let us put this RegEx to work, so that we can better understand its significance. Let us write a script to fetch the phone numbers from Google's contact us page. First we need to fetch the HTML content of the page.
$str = implode("",file("http://www.google.com/intl/en/contact/index.html"));
Then we need to search for the phone number pattern with the help of our just-created RegEx. preg_match() stops after the first match, so to get more than one match we use preg_match_all().
preg_match_all($Phone_Pattern,$str,$phone);
Now putting all these pieces into a single script:
<?php
// Read the page into one string (needs allow_url_fopen enabled).
$str = implode("", file("http://www.google.com/intl/en/contact/index.html"));
$Phone_Pattern = "/(\d)?(\s|-)?(\()?(\d){3}(\))?(\s|-){1}(\d){3}(\s|-){1}(\d){4}/";
// $phone[0] holds the full matches; $phone[1] onward hold the captured groups.
preg_match_all($Phone_Pattern, $str, $phone);
foreach ($phone[0] as $number) {
    echo $number . "<br>";
}
?>
At the time of writing, this script displays the following output:
(650) 253-0000
(650) 253-0001
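Since a live page can change or be unreachable, the same extraction can also be tried on a local string. The sketch below assumes PHP 7 or later; extract_phone_numbers() is a hypothetical helper and the markup is invented to stand in for the downloaded page.

```php
<?php
// Hypothetical helper: return every phone-like match in $html, trimmed.
function extract_phone_numbers(string $html): array
{
    $pattern = '/(\d)?(\s|-)?(\()?(\d){3}(\))?(\s|-){1}(\d){3}(\s|-){1}(\d){4}/';
    preg_match_all($pattern, $html, $phone);
    // $phone[0] lists the full matches. The optional (\s|-)? near the start
    // can swallow a space before the number, hence the trim().
    return array_map('trim', $phone[0]);
}

// Invented markup standing in for the downloaded page.
$html = '<p>Phone: (650) 253-0000</p><p>Fax: (650) 253-0001</p>';

$numbers = extract_phone_numbers($html);
echo implode("\n", $numbers), "\n";
```

The leading space is invisible when the result is echoed into an HTML page, but it matters if the numbers are stored or compared, which is why the sketch trims them.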
Wrap Up
Hope you had a good session with RegEx and now have some understanding of how to tackle text pattern matching problems using RegEx. To become a specialist in RegEx, you need to practice continuously, identify complex problems, and try to solve them. Happy practicing with RegEx.
