Perl has extensive support for regular expressions. Regular expressions, for those who may not know them, are a formal notation for describing patterns in text. They are a huge area of knowledge, at times bordering on an art, and mastering them can take years of practice as well as a good deal of book study. Numerous books have been written on the subject for anyone who wants to go further; I will only discuss very basic usage here.
Regular expressions are a syntax, implemented in Perl and certain other environments, that makes it not only possible but easy to compare strings against patterns, replace matching text, and translate characters. Each of these is covered below.
First we will do some examples of string comparisons. Much like other forms
of comparison, we are looking for a result of either true or false. The basic
format for a pattern match is m/some stuff/. We
then compare a known string, usually held in a variable, against the
m// pattern, using either the =~ operator (true if the pattern matches)
or the !~ operator (true if it does not). There
are numerous little tricks that can be added to this basic idea. Here are a
couple of elementary examples:
#!/usr/bin/perl -w
$string = "Testing This String";
$search = "Testing";
print "The string:\t", $string, "\n";
print "The pattern:\t", $search, "\n";
# This returns true if string $string contains substring $search,
# false otherwise.
if($string =~ m/$search/) {
    print "Success, '", $search, "' was found.\n";
}
else {
    print "Failure, '", $search, "' was not found.\n";
}
# If you want only those strings where the $search appears at the
# very beginning, you could write the following:
if($string =~ m/^$search/) {
    print "Success, '", $search, "' was found at the start.\n";
}
else {
    print "Failure, '", $search, "' was not found at the start.\n";
}
# Similarly, the $ operator indicates "end of string". If you
# wanted to find out if the $search was the very last text in
# the string, you could write this:
if($string =~ m/$search$/) {
    print "Success, '", $search, "' was found at the end.\n";
}
else {
    print "Failure, '", $search, "' was not found at the end.\n";
}
# Now, if you want the comparison to be true only if $string
# contains $search and nothing but the sought text,
# simply do this:
if($string =~ m/^$search$/) {
    print "Success, '", $search, "' is the entire string.\n";
}
else {
    print "Failure, '", $search, "' is not the entire string.\n";
}
# Now what if you want the comparison to be case insensitive? All you do
# is add the letter i after the ending delimiter:
if($string =~ m/^$search$/i) {
    print "Success, '", $search, "' is the entire string, ignoring case.\n";
}
else {
    print "Failure, '", $search, "' is not the entire string, ignoring case.\n";
}
The output should be:
The string: Testing This String
The pattern: Testing
Success, 'Testing' was found.
Success, 'Testing' was found at the start.
Failure, 'Testing' was not found at the end.
Failure, 'Testing' is not the entire string.
Failure, 'Testing' is not the entire string, ignoring case.
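The script above only uses =~. As a minimal sketch of the !~ operator, the helper below returns true when a string does not contain the search text. The name lacks_pattern is made up for this illustration, and \Q...\E is used on the assumption that the search text should be matched literally rather than as a pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script):
# returns 1 when $string does NOT contain $search, 0 otherwise.
# \Q...\E makes any special characters in $search match literally.
sub lacks_pattern {
    my ($string, $search) = @_;
    return $string !~ m/\Q$search\E/ ? 1 : 0;
}

my $string = "Testing This String";
foreach my $search ("Testing", "Perl") {
    if (lacks_pattern($string, $search)) {
        print "'$search' does not appear in '$string'.\n";
    }
    else {
        print "'$search' appears in '$string'.\n";
    }
}
```

Writing $string !~ m/.../ is equivalent to writing !($string =~ m/.../), so which one to use is mostly a matter of readability.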
We can now play a little bit and perform some more interesting pattern matching. Calling the following escapes "wildcards" may conflict with the formal terminology of Perl, but it is the most intuitive way to think of them and will not lead to any coding mistakes.
| Character | Description |
|---|---|
| `.` | Match any character except a newline |
| `\w` | Match "word" character (alphanumeric plus "_") |
| `\W` | Match non-word character |
| `\s` | Match whitespace character |
| `\S` | Match non-whitespace character |
| `\d` | Match digit character |
| `\D` | Match non-digit character |
| `\t` | Match tab |
| `\n` | Match newline |
| `\r` | Match carriage return |
| `\e` | Match escape |
| `\021` | Match octal char (in this case 21 octal) |
| `\xf0` | Match hex char (in this case f0 hexadecimal) |
| `\b` | Match a word boundary (Perl has no escape for a binary character code; `\b01` means a word boundary followed by the text 01) |
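To make the table a little more concrete, here is a small sketch that sorts single characters into the classes listed above. The helper name classify_char and the labels it returns are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: report which class a single character falls in.
# \d is tested before \w because every digit is also a "word" character.
sub classify_char {
    my ($char) = @_;
    return "digit"      if $char =~ m/^\d$/;
    return "whitespace" if $char =~ m/^\s$/;
    return "word"       if $char =~ m/^\w$/;
    return "other";
}

foreach my $char ("7", "x", "_", " ", "-") {
    printf "'%s' -> %s\n", $char, classify_char($char);
}
```

The order of the tests matters only because the classes overlap; \W, \S, and \D are simply the complements of the three classes tested here.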
You can follow any character, wildcard, or parenthesized group of characters and wildcards with a repetition count. Here's where you start getting some power:
| Character | Description |
|---|---|
| `*` | Match 0 or more times |
| `+` | Match 1 or more times |
| `?` | Match 1 or 0 times |
| `{n}` | Match exactly n times |
| `{n,}` | Match at least n times |
| `{n,m}` | Match at least n but not more than m times |
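The repetition counts are easiest to understand by trying each one against a short string. The sketch below does that; the helper name matches and the sample patterns are made up for illustration, and qr// is used only as a convenient way to store a pattern in a variable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: 1 if $text matches the stored pattern, else 0.
sub matches {
    my ($text, $pattern) = @_;
    return $text =~ $pattern ? 1 : 0;
}

# Each entry: description, pattern stored with qr//, string to test.
my @checks = (
    [ "b*  (0 or more b)",      qr/^ab*c$/,    "ac"    ],
    [ "b+  (1 or more b)",      qr/^ab+c$/,    "ac"    ],
    [ "u?  (0 or 1 u)",         qr/^colou?r$/, "color" ],
    [ "\\d{3}   (exactly 3)",   qr/^\d{3}$/,   "1234"  ],
    [ "\\d{2,}  (at least 2)",  qr/^\d{2,}$/,  "1234"  ],
    [ "\\d{2,3} (2 to 3)",      qr/^\d{2,3}$/, "1234"  ],
);

foreach my $check (@checks) {
    my ($desc, $pattern, $text) = @$check;
    my $verdict = matches($text, $pattern) ? "matches" : "does not match";
    printf "%-24s '%s' %s\n", $desc, $text, $verdict;
}
```

Every pattern here is anchored with ^ and $ so that the count applies to the whole string; without the anchors, \d{3} would happily match the first three digits of a longer number.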
Finally, some characters have a special meaning inside a regular expression and will not match themselves. To match one of them literally, precede it with a backslash. These are the metacharacters:
| Character | How to represent it |
|---|---|
| `\` | `\\` |
| `\|` | `\\|` |
| `(` | `\(` |
| `)` | `\)` |
| `[` | `\[` |
| `{` | `\{` |
| `^` | `\^` |
| `$` | `\$` |
| `*` | `\*` |
| `+` | `\+` |
| `?` | `\?` |
| `.` | `\.` |
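As a small sketch of why the backslash matters, the code below compares an escaped and an unescaped period, and then uses Perl's built-in quotemeta function to add the backslashes automatically. The helper name contains_literal and the sample strings are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: 1 if $text contains $literal exactly as written.
# quotemeta() puts a backslash in front of every special character.
sub contains_literal {
    my ($text, $literal) = @_;
    my $escaped = quotemeta($literal);
    return $text =~ m/$escaped/ ? 1 : 0;
}

# Single quotes keep Perl from treating $5 as a variable.
my $text = 'Total: $5.00 (approx.)';

print "Unescaped dot accepts '5x00'\n" if '5x00' =~ m/5.00/;
print "Escaped dot rejects '5x00'\n"   if '5x00' !~ m/5\.00/;
print "Found the literal price\n"      if contains_literal($text, '$5.00');
print "Escaped form: ", quotemeta('$5.00'), "\n";
```

Inside a pattern, \Q...\E has the same effect as calling quotemeta on the enclosed text, which is handy when the text to find comes from a variable.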
All of these are confusing when seen in a chart, so let's see some in action. This little script verifies whether a phone number has a legal format. We allow only 3 digits, followed by a hyphen, a period, or a space, and then the last 4 digits. Note that the pattern is not anchored to the start or end of the string: if any of the conditions fails, the match fails, but if all of the conditions are met, the match succeeds even when there is more to the string. Here it goes:
#!/usr/bin/perl -w
@phones = ("123-4567", "12-4567", "123-567", "123.4567",
           "123 4567", "123.456", "123-45678");
foreach $phone (@phones) {
    # The period is escaped (\.) so that it matches only a literal
    # period; an unescaped . would match any character at all.
    if($phone =~ m/\d{3}(-|\.| )\d{4}/) {
        print $phone, " :\tthis is legal\n";
    }
    else {
        print $phone, " :\tthis is not legal\n";
    }
}
The output should be:
123-4567 : this is legal
12-4567 : this is not legal
123-567 : this is not legal
123.4567 : this is legal
123 4567 : this is legal
123.456 : this is not legal
123-45678: this is legal
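The last line of that output shows the effect of leaving the pattern unanchored: "123-45678" is accepted because its first eight characters satisfy the pattern. If the intent is to accept nothing but a seven-digit number, one possible variation is to add the ^ and $ anchors from the first example. The helper name is_legal_phone below is made up for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stricter check: the whole string must be 3 digits,
# one separator (hyphen, period, or space), and 4 digits.
sub is_legal_phone {
    my ($phone) = @_;
    return $phone =~ m/^\d{3}(-|\.| )\d{4}$/ ? 1 : 0;
}

foreach my $phone ("123-4567", "123.4567", "123-45678", "x123-4567") {
    my $verdict = is_legal_phone($phone) ? "legal" : "not legal";
    print "$phone :\tthis is $verdict\n";
}
```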
If you need to look through a long list of first and last names for specific criteria, let's say people whose last name begins with a 'Z' and ends with an 'i', you can write something like this:
#!/usr/bin/perl -w
@names = ("Jason Zurawski", "Stefan Robila", "Mark Zlotek",
          "Andreas Koeller", "Roman Zaritski", "Zeke Jones",
          "Zurawski Jason", "Jason Zurawski", "RomanZaritski");
print "Looking for a first name (series of letters), anywhere from 0
to MAXNUM spaces, and a last name (series of letters) that starts with
'Z' and ends in 'i'.\n\n";
foreach $name (@names) {
    # If I changed \s* into something else like \s{1}
    # or \s+ I would change the amount of spaces allowed.
    if($name =~ m/^\w+\s*Z\w+i$/) {
        print "YES : \t", $name, "\n";
    }
    else {
        print "NO : \t", $name, "\n";
    }
}
The output should be:
Looking for a first name (series of letters), anywhere from 0
to MAXNUM spaces, and a last name (series of letters) that starts with
'Z' and ends in 'i'.
YES : Jason Zurawski
NO : Stefan Robila
NO : Mark Zlotek
NO : Andreas Koeller
YES : Roman Zaritski
NO : Zeke Jones
NO : Zurawski Jason
YES : Jason Zurawski
YES : RomanZaritski
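The comment in the script mentions that swapping \s* for \s+ changes how many spaces are allowed. A short sketch of that variation follows; the helper name has_z_i_surname is invented here. With \s+ at least one space is required, so "RomanZaritski" is no longer accepted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical variation of the pattern above: \s+ demands at least
# one whitespace character between the first and last name.
sub has_z_i_surname {
    my ($name) = @_;
    return $name =~ m/^\w+\s+Z\w+i$/ ? 1 : 0;
}

foreach my $name ("Jason Zurawski", "Roman   Zaritski", "RomanZaritski") {
    printf "%-4s: %s\n", has_z_i_surname($name) ? "YES" : "NO", $name;
}
```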
Besides doing simple comparisons, we can also do string replacements. This is
extremely handy if you need to port code from machine to machine, or change small
details in a document, like a maiden name to a married name. Replacement is
accomplished in a similar manner to matching, except we use the
s/some stuff/replacement stuff/ construct. Again, we can use
any text or variable, as well as the modifiers above. Let's see it in action:
#!/usr/bin/perl -w
$string = "AT&T Wireless Is The Best Service! I will say it again, ";
$string = $string . "AT&T Wireless!";
$search = "AT&T";
$replace = "Cingular";
print "The string:\t\t", $string, "\n";
print "The pattern:\t\t", $search, "\n";
print "The replacement:\t", $replace, "\n\n";
if($string =~ s/$search/$replace/) {
    print "Success, '", $search, "' was found and replaced with '",
          $replace, "'.\n";
}
else {
    print "Failure, '", $search, "' was not found.\n";
}
print "The string:\t", $string, "\n\n\n";
The output should be:
The string: AT&T Wireless Is The Best Service! I will say it again,
AT&T Wireless!
The pattern: AT&T
The replacement: Cingular
Success, 'AT&T' was found and replaced with 'Cingular'.
The string: Cingular Wireless Is The Best Service! I will say it again,
AT&T Wireless!
A more complex example can be seen below. If we wish to replace all occurrences of a pattern, rather than only the first one, we need to add the global (g) modifier after the final delimiter.
#!/usr/bin/perl -w
$string = "We are at war with Eurasia, we have always been at war with Eurasia, ";
$string = $string . "and we always will be at war with Eurasia.";
$replace = "Eastasia";
print "The string:\t", $string, "\n\n";
$string =~ s/\w*asia/$replace/g;
print "The string:\t", $string, "\n";
The output should be:
The string: We are at war with Eurasia, we have always been at war
with Eurasia, and we always will be at war with Eurasia.
The string: We are at war with Eastasia, we have always been at war
with Eastasia, and we always will be at war with Eastasia.
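One detail worth knowing about s///g is that it returns the number of replacements it made, which lets a script report how much was changed. The sketch below wraps that in a helper; the name replace_all is made up for illustration, and the search text is matched literally with \Q...\E as an assumption of this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace every literal occurrence of $search
# and return both the new text and how many replacements were made.
sub replace_all {
    my ($text, $search, $replace) = @_;
    # s///g returns the number of substitutions, or a false value
    # when nothing matched, hence the "|| 0".
    my $count = ($text =~ s/\Q$search\E/$replace/g) || 0;
    return ($text, $count);
}

my $string = "AT&T Wireless Is The Best Service! I will say it again, "
           . "AT&T Wireless!";
my ($new, $count) = replace_all($string, "AT&T", "Cingular");
print "Replaced $count occurrence(s):\n$new\n";
```

Because the helper works on its own copy of the text, the caller's original string is left untouched.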
Groups and character classes make things much easier, especially when you want to write compact yet complex patterns. Groups are enclosed in parentheses and can contain the or symbol (|) to separate alternatives; there is no corresponding and (&) symbol, and an & inside a pattern simply matches itself. Character classes are enclosed in square brackets ([]) and can contain specific values or ranges of values; a class matches any one of the characters listed. The last statement in the example is what is called a translation (tr). It substitutes characters one for one and takes lists of characters rather than a regular expression. Here are some examples of usage:
#!/usr/bin/perl -w
$string = "The quick brown fox jumped over the lazy dog's back.";
# A group of alternatives: any one of the listed vowels
if($string =~ m/(A|E|I|O|U|Y|a|e|i|o|u|y)/) {
    print "The string '", $string,"' contains a vowel!\n";
}
else {
    print "No vowels in the string '", $string, "'.\n";
}
# The same test written as a character class, ignoring case
if($string =~ m/[AEIOUY]/i) {
    print "The string '", $string,"' contains a vowel!\n";
}
else {
    print "No vowels in the string '", $string, "'.\n";
}
# Translate lower case to upper case. tr takes plain character
# ranges, so no square brackets are needed.
$string =~ tr/a-z/A-Z/;
print "The new string: \t", $string, "\n";
The output should be:
The string 'The quick brown fox jumped over the lazy dog's back.' contains a vowel!
The string 'The quick brown fox jumped over the lazy dog's back.' contains a vowel!
The new string: THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG'S BACK.
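Ranges inside a character class can be combined freely. As a minimal sketch, the first helper below accepts only strings made of hexadecimal digits, and the second wraps the translation from the script above. Both helper names are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: 1 if $text consists only of hex digits.
# The class combines three ranges: 0-9, a-f, and A-F.
sub is_hex_string {
    my ($text) = @_;
    return $text =~ m/^[0-9a-fA-F]+$/ ? 1 : 0;
}

# Hypothetical helper: upper-case a copy of $text with tr.
sub to_upper {
    my ($text) = @_;
    $text =~ tr/a-z/A-Z/;
    return $text;
}

foreach my $candidate ("1aF0", "cafe", "xyz") {
    my $verdict = is_hex_string($candidate) ? "is" : "is not";
    print "'$candidate' $verdict a hex string\n";
}
print to_upper("the lazy dog's back."), "\n";
```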
The first assignment we did in this class (parsing through text to find sentences and periods) can be accomplished using this general idea. Here is the program, written in Perl and including some bells and whistles:
#!/usr/bin/perl -w
use Fcntl ':flock';
$fileContents = "";
$numWords = 0;
$numSentences = 0;
if($#ARGV == 0) {
    # Read and store the file
    $fileContents = readFile($ARGV[0]);
    # Count the number of words
    $numWords = countWords($fileContents);
    # Count the number of sentences
    $numSentences = countSentences($fileContents);
    # Print the results
    printResults($fileContents, $numWords, $numSentences);
}
else {
    # Error message
    print "Usage: -- ./readFile.pl filename\n";
}
sub readFile {
    my $fileContents = "";          # Where to store the file
    my ($file) = @_;                # First argument: the file name
                                    # (the parentheses matter: without
                                    # them $file would get the number
                                    # of arguments instead)
    open(INFILE, "<", $file) or die "The file '$file' could not be opened: $!\n";
    flock(INFILE, LOCK_SH);         # A shared lock is enough for reading
    while (<INFILE>) {              # Read each line
        $fileContents = $fileContents . $_;
    }
    flock(INFILE, LOCK_UN);         # Unlock the file
    close(INFILE);                  # Be polite and close the file
    return $fileContents;           # Return the contents to main
}
sub countWords {
    my ($fileContents) = @_;        # First argument: the text to examine
    my @wordsArray;                 # Array for the words
    my $numWords = 0;               # Word counter
                                    # Split up the file's contents by
                                    # 'word'. split ' ' breaks on any
                                    # run of whitespace (spaces, tabs,
                                    # returns) and ignores leading
                                    # whitespace
    @wordsArray = split(' ', $fileContents);
    foreach (@wordsArray) {
        $numWords++;                # Count each word
    }
    return $numWords;               # Return to main
}
sub countSentences {
    my ($fileContents) = @_;        # Private copy of the text, so the
                                    # caller's string is not modified
    my $numSentences = 0;           # Sentence counter
    # These are just some rules written using regular
    # expressions. We want to count periods, but only when
    # they are used to end a sentence. Some periods, such
    # as those in a real number or after an abbreviation,
    # are not counted. Others are counted but only one
    # time, such as when an ellipsis is used.
    # This rule is: a period followed by any letters or
    # numbers is NOT counted
    $fileContents =~ s/\.\w+/0/g;
    # This rule is: a period after any abbreviation is
    # NOT counted (the pattern must stay on one line,
    # otherwise the line break becomes part of it)
    $fileContents =~ s/\b(mr|mrs|ms|dr|etc|prof|esq)\./$1/ig;
    # This rule is: any number of periods is counted as
    # just ONE period
    $fileContents =~ s/\.+/./g;
    # This is where we do the counting, we utilize our rules
    # and come out with a final number. s///g returns the
    # number of substitutions it made.
    $numSentences += ($fileContents =~ s/\.(\W|$)/./gm);
    return $numSentences;           # Pass back to main
}
sub printResults
{
    my($fileContents, $numWords, $numSentences) = @_;
    print "\n------------------------------------------\n";
    print " Entered Text\n";
    print "------------------------------------------\n\n";
    print "$fileContents\n\n";
    print "------------------------------------------\n";
    print " Results\n";
    print "------------------------------------------\n";
    print "Number of Words:\t $numWords \n";
    print "Number of Sentences:\t $numSentences \n\n";
}
The output should be:
------------------------------------------
Entered Text
------------------------------------------
When I am grown to man's estate
I shal be very proud and great.
And tell the other girls and boys
Not to meddle with my toys.
------------------------------------------
Results
------------------------------------------
Number of Words: 27
Number of Sentences: 2
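The sentence rules can also be tried on a short string without going through a file. The following is a sketch only: the helper name count_sentences_in and the sample sentence are made up, and the rules are the same simple heuristics used in the program, so unusual punctuation will still fool them.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: apply the period rules to a copy of $text
# and return the number of sentence-ending periods found.
sub count_sentences_in {
    my ($text) = @_;
    $text =~ s/\.\w+/0/g;                                  # 3.50 -> 30
    $text =~ s/\b(mr|mrs|ms|dr|etc|prof|esq)\./$1/ig;      # Dr. -> Dr
    $text =~ s/\.+/./g;                                    # ... -> .
    # Count periods followed by a non-word character or end of line.
    my $count = ($text =~ s/\.(\W|$)/./gm) || 0;
    return $count;
}

my $sample = "Dr. Smith paid 3.50 for tea... He left. Mr. Jones stayed.";
print "Sentences found: ", count_sentences_in($sample), "\n";
```

The decimal point and the two abbreviations are removed before counting, and the ellipsis collapses to a single period, so only the three sentence endings remain.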
And now for an example using some IPC from the previous sections: we can parse
the output of a command like ls -als and print only what we want.
We want to print all Perl files in a directory, and mark the ones
that were last modified in March. Something like this could be applied recursively and
used to find all of the Perl files on a hard drive, etc.
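#!/usr/bin/perl -w
pipe(README, WRITEME);
$child_pid = fork();
die "Can't fork: $!" unless defined $child_pid;
if($child_pid == 0) {
    # Child: send the output of ls into the pipe
    close(README);
    open(STDOUT, ">&WRITEME") or die "Can't redirect stdout";
    system("ls -als");
    close(STDOUT);
    exit;
}
else {
    print "All Perl Files:\n\n";
    # Parent: close the unused write end so that reading stops at
    # end of file, read everything, then collect the child
    close(WRITEME);
    @strings = <README>;
    close(README);
    waitpid($child_pid, 0);
    foreach $string (@strings) {
        # Each line ends with the file name, so look for a name
        # ending in ".pl"; the dot is escaped to match literally
        if($string =~ m/\.pl$/) {
            print $string;
            # The month column is surrounded by whitespace
            if($string =~ m/\sMar\s/) {
                print "\t\tThis file was written in March.\n";
            }
        }
    }
    print "\n";
}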
The output at the time was:
All Perl Files:
4 -rwx------ 1 jason jason 220 Feb 26 08:13 grade.pl
4 -rwx------ 1 jason jason 45 Feb 23 10:03 HelloWorld.pl
8 -rwxr-xr-x 1 jason jason 4354 Feb 19 12:06 hw.pl
4 -rwx------ 1 jason jason 604 Feb 26 10:25 ps.pl
4 -rwx------ 1 jason jason 226 Feb 25 15:07 readdir.pl
4 -rwx------ 1 jason jason 252 Feb 25 12:47 read.pl
4 -rwx------ 1 jason jason 684 Mar 1 09:43 stuff.pl
This file was written in March.
4 -rwx------ 1 jason jason 558 Feb 25 17:47 try.pl
4 -rwx------ 1 jason jason 141 Feb 25 12:37 var2.pl
4 -rwx------ 1 jason jason 2908 Feb 25 10:19 var.pl
4 -rwx------ 1 jason jason 477 Feb 25 14:14 write.pl
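The filtering step can be tested on its own, without running ls, by feeding it a few lines in the same format as the listing above. In the sketch below the helper name perl_files_from is made up, two of the sample lines are copied from the listing, and the notes.txt line is invented purely to show a file that is skipped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: from a list of "ls -als" style lines, return
# those naming a .pl file whose date column shows the given month.
sub perl_files_from {
    my ($month, @lines) = @_;
    my @found;
    foreach my $line (@lines) {
        next unless $line =~ m/\.pl$/;          # name must end in .pl
        push @found, $line if $line =~ m/\s\Q$month\E\s/;
    }
    return @found;
}

my @listing = (
    "4 -rwx------ 1 jason jason 684 Mar 1 09:43 stuff.pl\n",
    "4 -rwx------ 1 jason jason 558 Feb 25 17:47 try.pl\n",
    "4 -rw------- 1 jason jason 120 Mar 2 11:00 notes.txt\n",  # invented
);

print "Perl files from March:\n";
print perl_files_from("Mar", @listing);
```

Keeping the pattern matching in a helper like this separates it from the fork and pipe plumbing, so either part can be changed without touching the other.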