Regular Expressions


Perl has extensive support for regular expressions. A regular expression, for those who may not know, is a formal notation for describing patterns in text. Regular expressions are a huge area of knowledge, sometimes bordering on an art; mastering them can take years of practice and a good deal of study. Numerous books have been written on the subject for anyone who wants more, so I will discuss only very basic usage here.

Regular expressions are a syntax, implemented in Perl and certain other environments, that makes it not only possible but easy to do things such as testing whether a string matches a pattern and replacing the text that matched.

First we will look at some examples of string comparison. As with other comparisons, the result is either true or false. The basic format of a pattern match is m/some stuff/. We compare a known string, usually held in a variable, against the m// pattern using the =~ operator (true if the pattern matches) or the !~ operator (true if it does not). There are numerous little tricks that can be added to this basic idea; here are a couple of elementary examples:

      #!/usr/bin/perl -w

      $string = "Testing This String";

      $search = "Testing";

      print "The string:\t", $string, "\n";
      print "The pattern:\t", $search, "\n";

      # This returns true if string $string contains substring $search, 
      # false otherwise.

      if($string =~ m/$search/) {
        print "Success, '", $search, "' was found.\n";  
      }
      else {
        print "Failure, '", $search, "' was not found.\n"; 
      }


      # If you want only those strings where the $search appears at the 
      # very beginning, you could write the following:

      if($string =~ m/^$search/) {
        print "Success, '", $search, "' was found at the start.\n";  
      }
      else {
        print "Failure, '", $search, "' was not found at the start.\n"; 
      }


      # Similarly, the $ operator indicates "end of string". If you 
      # wanted to find out if the $search was the very last text in 
      # the string, you could write this:

      if($string =~ m/$search$/) {
        print "Success, '", $search, "' was found at the end.\n";  
      }
      else {
        print "Failure, '", $search, "' was not found at the end.\n"; 
      }


      # Now, if you want the comparison to be true only if $string 
      # contains $search and nothing but the sought text, 
      # simply do this:

      if($string =~ m/^$search$/) {
        print "Success, '", $search, "' is the entire string.\n";  
      }
      else {
        print "Failure, '", $search, "' is not the entire string.\n"; 
      }


      # Now what if you want the comparison to be case insensitive? All you do 
      # is add the letter i after the ending delimiter:

      if($string =~ m/^$search$/i) {
        print "Success, '", $search, "' is the entire string, ignoring case.\n";  
      }
      else {
        print "Failure, '", $search, "' is not the entire string, ignoring case.\n"; 
      }
      
The output should be:
 
      The string:     Testing This String
      The pattern:    Testing
      Success, 'Testing' was found.
      Success, 'Testing' was found at the start.
      Failure, 'Testing' was not found at the end.
      Failure, 'Testing' is not the entire string.
      Failure, 'Testing' is not the entire string, ignoring case.      
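
One caveat with m/$search/ is that the contents of the variable are treated as a pattern, not as literal text, so a search string such as "a.c" would also match "abc". The following is a minimal sketch, assuming you want a purely literal substring test; the helper name contains_literal is made up for this illustration. It wraps the variable in \Q...\E, which backslash-escapes any special characters it contains.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if $string contains $search as literal text.
# \Q...\E escapes any regex metacharacters found in $search.
sub contains_literal {
    my ($string, $search) = @_;
    return $string =~ m/\Q$search\E/ ? 1 : 0;
}

my $search = "a.c";

# Without \Q...\E the period acts as a wildcard, so "abc" matches too.
print "pattern match on 'abc':\t", ("abc" =~ m/$search/ ? "yes" : "no"), "\n";
print "literal match on 'abc':\t", (contains_literal("abc", $search) ? "yes" : "no"), "\n";
print "literal match on 'a.c':\t", (contains_literal("a.c", $search) ? "yes" : "no"), "\n";
```

The first and third lines report yes and the second reports no, since only the literal test insists on a real period.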
      

We can now play a little and perform some more interesting pattern matching. Calling these characters "wildcards" may conflict with Perl's formal terminology, but it is the most intuitive way to think of them and will not lead to any coding mistakes.

      Character   Description
      .           Match any character (except newline)
      \w          Match "word" character (alphanumeric plus "_")
      \W          Match non-word character
      \s          Match whitespace character
      \S          Match non-whitespace character
      \d          Match digit character
      \D          Match non-digit character
      \t          Match tab
      \n          Match newline
      \r          Match return
      \e          Match escape
      \021        Match octal char (in this case 21 octal)
      \xf0        Match hex char (in this case f0 hexadecimal)
      \b          Match a word boundary (there is no escape for binary character codes)
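
To make the chart a little more concrete, here is a small sketch that reports which kinds of characters appear in a few sample strings. The helper name classes_found and the sample strings are invented for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: list which kinds of characters appear in $text.
sub classes_found {
    my ($text) = @_;
    my @found;
    push @found, "digit"      if $text =~ m/\d/;   # any digit
    push @found, "whitespace" if $text =~ m/\s/;   # space, tab, newline
    push @found, "non-word"   if $text =~ m/\W/;   # anything not in \w
    return @found;
}

foreach my $sample ("room 101", "under_score", "a-b") {
    my @found = classes_found($sample);
    print "'$sample': ",
          (@found ? join(", ", @found) : "only word characters"), "\n";
}
```

Note that the space in "room 101" counts as both whitespace and a non-word character, while the underscore in "under_score" is a word character.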

You can follow any character, wildcard, or series of characters and/or wildcards with a repetition count. Here's where you start getting some power:


      Character   Description
      *           Match 0 or more times
      +           Match 1 or more times
      ?           Match 1 or 0 times
      {n}         Match exactly n times
      {n,}        Match at least n times
      {n,m}       Match at least n but not more than m times
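
As a rough illustration of the repetition characters, the sketch below applies each one to the letter u in a few invented spellings. The helper matching_words is a name made up here, and every pattern is anchored with ^ and $ so that the whole word has to fit.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the words that match the compiled pattern.
sub matching_words {
    my ($re, @words) = @_;
    return grep { $_ =~ $re } @words;
}

my @words = ("color", "colour", "colouur");

# Each entry pairs a label with a pattern; only the repetition on "u" varies.
my @tests = (
    [ 'colou?r',     qr/^colou?r$/     ],   # 0 or 1 "u"
    [ 'colou*r',     qr/^colou*r$/     ],   # 0 or more
    [ 'colou+r',     qr/^colou+r$/     ],   # 1 or more
    [ 'colou{2}r',   qr/^colou{2}r$/   ],   # exactly 2
    [ 'colou{1,2}r', qr/^colou{1,2}r$/ ],   # 1 or 2
);

foreach my $test (@tests) {
    my ($label, $re) = @$test;
    print "$label matches: ", join(", ", matching_words($re, @words)), "\n";
}
```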

Finally, some characters have a special meaning inside a regular expression and do not match themselves. To match one of them literally, precede it with a backslash. These are the metacharacters:


      Character   How to represent it
      \           \\
      |           \|
      (           \(
      )           \)
      [           \[
      {           \{
      ^           \^
      $           \$
      *           \*
      +           \+
      ?           \?
      .           \.
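
The difference a backslash makes is easy to see with the + character. This is only a sketch with invented helper names (is_escaped_sum and is_unescaped_sum): the first pattern treats + as a literal plus sign, the second treats it as "one or more of the preceding 1".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: \+ matches a literal plus sign.
sub is_escaped_sum {
    my ($text) = @_;
    return $text =~ m/^1\+1=2$/ ? 1 : 0;
}

# Hypothetical helper: + here means "one or more of the preceding 1".
sub is_unescaped_sum {
    my ($text) = @_;
    return $text =~ m/^1+1=2$/ ? 1 : 0;
}

foreach my $text ("1+1=2", "111=2") {
    print "'$text'\tescaped: ",   (is_escaped_sum($text)   ? "match" : "no match"),
          "\tunescaped: ",        (is_unescaped_sum($text) ? "match" : "no match"), "\n";
}
```

The escaped pattern accepts only "1+1=2"; the unescaped one rejects it and accepts "111=2" instead.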

Charts like these are confusing on their own, so let's see some of this in action. This is a little script that verifies whether a phone number is in a legal format. We require 3 digits, followed by a hyphen, a period, or a space, and then the last 4 digits. Note that the match fails as soon as any of these conditions cannot be satisfied. Because the pattern is not anchored with ^ and $, a string that satisfies all of the conditions succeeds even if it contains more text, which is why 123-45678 is reported as legal. Here it goes:

      #!/usr/bin/perl -w

      @phones = ("123-4567", "12-4567", "123-567", "123.4567", 
                 "123 4567", "123.456", "123-45678");

      foreach $phone (@phones) {

        if($phone =~ m/\d{3}(-|\.| )\d{4}/) {
          print $phone, " :\tthis is legal\n"; 
        }
        else {
          print $phone, " :\tthis is not legal\n";
        }

      }
      
The output should be:
 
      123-4567 :      this is legal
      12-4567 :       this is not legal
      123-567 :       this is not legal
      123.4567 :      this is legal
      123 4567 :      this is legal
      123.456 :       this is not legal   
      123-45678 :     this is legal  
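
If extra digits should not be accepted, the pattern can be anchored at both ends. The following is a sketch rather than a complete validator: is_phone is a hypothetical helper, and it uses the character class [-. ] in place of the group of alternatives.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: exactly 3 digits, one separator, exactly 4 digits.
# ^ and $ anchor the pattern so nothing may come before or after.
sub is_phone {
    my ($phone) = @_;
    return $phone =~ m/^\d{3}[-. ]\d{4}$/ ? 1 : 0;
}

foreach my $phone ("123-4567", "123.4567", "123 4567", "123-45678", "123x4567") {
    print $phone, " :\t", (is_phone($phone) ? "this is legal" : "this is not legal"), "\n";
}
```

With the anchors in place, "123-45678" is rejected, and because a period inside square brackets is literal, "123x4567" is rejected as well.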
      

If you need to look through a long list of first and last names for specific criteria, let's say people whose last name begins with a 'Z' and ends with an 'i', you can write something like this:

      #!/usr/bin/perl -w

      @names = ("Jason Zurawski", "Stefan Robila", "Mark Zlotek", 
                "Andreas Koeller", "Roman Zaritski", "Zeke Jones", 
      	        "Zurawski Jason", "Jason     Zurawski", "RomanZaritski");
	  
      print "Looking for a first name (series of letters), zero or more spaces,\n",
            "and a last name (series of letters) that starts with 'Z' and ends in 'i'.\n\n";

      foreach $name (@names) {

        # If I changed \s* into something else like \s{1}
        # or \s+ I would change the amount of spaces allowed.

        if($name =~ m/^\w+\s*Z\w+i$/) {
          print "YES : \t", $name, "\n"; 
        }
        else {
          print "NO : \t", $name, "\n";
        }

      }
      
The output should be:
 
      Looking for a first name (series of letters), zero or more spaces,
      and a last name (series of letters) that starts with 'Z' and ends in 'i'.
      
      YES :   Jason Zurawski
      NO :    Stefan Robila
      NO :    Mark Zlotek
      NO :    Andreas Koeller
      YES :   Roman Zaritski
      NO :    Zeke Jones
      NO :    Zurawski Jason
      YES :   Jason     Zurawski
      YES :   RomanZaritski
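
The comment in the script mentions swapping \s* for \s+. A quick sketch of that variant, using a hypothetical helper named has_z_i_surname, shows the effect: at least one space is now required, so a name with no space between first and last name no longer qualifies.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: same pattern as above, but \s+ demands
# at least one whitespace character between the two names.
sub has_z_i_surname {
    my ($name) = @_;
    return $name =~ m/^\w+\s+Z\w+i$/ ? 1 : 0;
}

foreach my $name ("Jason Zurawski", "Jason     Zurawski", "RomanZaritski") {
    print((has_z_i_surname($name) ? "YES" : "NO"), " : \t", $name, "\n");
}
```

Only "RomanZaritski" changes from YES to NO.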
   
      

Besides doing simple comparisons, we can also do string replacements. This is extremely handy if you need to port code from machine to machine, or change small details in a document, such as a maiden name to a married name. Replacement is accomplished in a similar manner to matching, except that we use the s/some stuff/replacement stuff/ construct. Again, we can use any text or variable as well as the modifiers above. Let's see it in action:

      #!/usr/bin/perl -w

      $string = "AT&T Wireless Is The Best Service!  I will say it again, ";
      $string = $string . "AT&T Wireless!";
      $search = "AT&T";
      $replace = "Cingular";

      print "The string:\t\t", $string, "\n";
      print "The pattern:\t\t", $search, "\n";
      print "The replacement:\t", $replace, "\n\n";

      if($string =~ s/$search/$replace/) {
        print "Success, '", $search, "' was found and replaced with '", 
              $replace, "'.\n";  
      }
      else {
        print "Failure, '", $search, "' was not found.\n"; 
      }

      print "The string:\t", $string, "\n\n\n";
      
      
The output should be:
 
      The string:             AT&T Wireless Is The Best Service!  I will say it again, 
       AT&T Wireless!
      The pattern:            AT&T
      The replacement:        Cingular
 
      Success, 'AT&T' was found and replaced with 'Cingular'.
      The string:     Cingular Wireless Is The Best Service!  I will say it again, 
       AT&T Wireless!     
      

A more complex example can be seen below. If we wish to replace all occurrences of a word, we need to add the global (g) switch to the end.

      #!/usr/bin/perl -w

      $string = "We are at war with Eurasia, we have always been at war with Eurasia, ";
      $string = $string . "and we always will be at war with Eurasia.";

      $replace = "Eastasia";

      print "The string:\t", $string, "\n\n";

      $string =~ s/\w*asia/$replace/g;

      print "The string:\t", $string, "\n";
      
The output should be:
 
      The string:     We are at war with Eurasia, we have always been at war 
      with Eurasia, and we always will be at war with Eurasia.
                                                                               
      The string:     We are at war with Eastasia, we have always been at war 
      with Eastasia, and we always will be at war with Eastasia.      
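
A substitution with the g switch also reports how many replacements it made, which is handy for logging. Here is a minimal sketch; the helper replace_eurasia and the shortened sample sentence are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace every "Eurasia" and report how many were changed.
# s///g returns the number of substitutions, or a false value if there were none.
sub replace_eurasia {
    my ($text) = @_;
    my $count = ($text =~ s/Eurasia/Eastasia/g) || 0;
    return ($text, $count);
}

my ($new_text, $replaced) =
    replace_eurasia("We are at war with Eurasia, we have always been at war with Eurasia.");

print "Replaced $replaced occurrence(s).\n";
print $new_text, "\n";
```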
      

Groups and character classes make things much easier, especially when you want to write compact yet complex patterns. Groups are enclosed in parentheses and can contain alternatives separated by the or symbol (|); there is no corresponding "and" symbol. This lets you specify a set of potential items. Character classes are enclosed in square brackets ([]) and can contain specific values or ranges of values. The last example is what is called a translation (tr///). It is not a regular expression, but it is applied with the same =~ operator and substitutes characters one for one. Here are some examples of usage:

      #!/usr/bin/perl -w

      $string = "The quick brown fox jumped over the lazy dog's back.";

      if($string =~ m/(A|E|I|O|U|Y|a|e|i|o|u|y)/) {
        print "The string '", $string,"' contains a vowel!\n";
      }
      else {
        print "No vowels in the string '", $string, "'.\n";
      }

      if($string =~ m/[AEIOUY]/i) {
        print "The string '", $string,"' contains a vowel!\n";
      }
      else {
        print "No vowels in the string '", $string, "'.\n";
      }

      # tr takes plain character ranges; no square brackets are needed
      $string =~ tr/a-z/A-Z/;

      print "The new string: \t", $string, "\n";
      
The output should be:
 
      The string 'The quick brown fox jumped over the lazy dog's back.' contains a vowel!
      The string 'The quick brown fox jumped over the lazy dog's back.' contains a vowel!
      The new string:         THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG'S BACK.      
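
A translation also returns the number of characters it matched, so it can double as a counter. The sketch below assumes the same vowel list as the examples above (with y counted as a vowel); count_vowels is a hypothetical helper. Leaving the replacement side empty leaves the string unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count vowels using tr.
# With an empty replacement list, tr changes nothing and just counts.
sub count_vowels {
    my ($text) = @_;
    return ($text =~ tr/aeiouyAEIOUY//);
}

my $string = "The quick brown fox jumped over the lazy dog's back.";

print "Vowels found: ", count_vowels($string), "\n";
```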
      

The first assignment we did in this class (parsing through text to find sentences and periods) can be accomplished using this general idea. Here is the program, written in Perl and including some bells and whistles:

      #!/usr/bin/perl -w

      use Fcntl ':flock';

      $fileContents = "";
      $numWords = 0;
      $numSentences = 0;

      if($#ARGV == 0) {	
				# Read and store the file
         $fileContents = readFile($ARGV[0]);

				# Count the number of words
         $numWords = countWords($fileContents); 

				# Count the number of sentences
         $numSentences = countSentences($fileContents);	

				# Print the results
         printResults($fileContents, $numWords, $numSentences);
      }
      else {
				# Error message
         print "Usage: -- ./readFile.pl filename\n";
      }


      sub readFile {
         my $fileContents = "";		# Where to store the file

         my ($file) = @_;		# First argument: the file name

         open(INFILE, "<", $file) or die "The File Was Not Found: $!";

         flock(INFILE, LOCK_SH);	# Shared lock is enough for reading

         while (<INFILE>) { 		# Read each line
            $fileContents = $fileContents . $_;
         }

         flock(INFILE, LOCK_UN);	# Unlock the file

         close(INFILE);			# Be polite and close the file

         return $fileContents;		# Return the contents to main
      }


      sub countWords {
         my ($fileContents) = @_;	# First argument: the file's text
         my @wordsArray;		# Array for the words
         my $numWords = 0;		# Word counter

   				# Split up the file's contents by
				# 'word'.  Splitting on ' ' treats any
				# run of spaces, tabs, or returns as
				# one separator and skips leading
				# whitespace
				
         @wordsArray = split(' ', $fileContents);

         foreach(@wordsArray) 	
         {
            $numWords++;		# Count each word
         }

         return $numWords;   		# Return to main
      }


      sub countSentences {
         my ($fileContents) = @_;	# Private copy of the file's text
         my $numSentences = 0;		# Sentence counter

         # These are just some rules written using regular 
         # expressions.  We want to count periods, but only when
         # they are used to end a sentence.  Some periods, such
         # as those in a real number or an abbreviation,
         # are not counted.  Others are counted but only one
         # time, such as when an ellipsis is used.  The rules are
         # applied to our copy of the text, so the caller's
         # string is left untouched.


         # This rule is: a period followed by any letters or 
         # numbers is NOT counted

         $fileContents =~ s/\.\w+/0/igm;


         # This rule is: a period after any abbreviation is 
         # NOT counted

         $fileContents =~ s/\b(mr|mrs|ms|dr|etc|prof|esq)\./$1/igm;


         # This rule is: any number of periods is counted as
         # just ONE period

         $fileContents =~ s/\.+/./gm;

         # This is where we do the counting; we utilize our rules
         # and come out with a final number.  s///g returns the
         # number of substitutions it made.

         $numSentences += ($fileContents =~ s/\.(\W|$)/./igm);

         return $numSentences;	# Pass back to main
      }


      sub printResults
      {
         my($fileContents, $numWords, $numSentences) = @_;

         print"\n------------------------------------------\n";
         print"            Entered Text\n";
         print"------------------------------------------\n\n";

         print "$fileContents\n\n";

         print"------------------------------------------\n";
         print"              Results\n";
         print"------------------------------------------\n";

         print "Number of Words:\t $numWords \n";
         print "Number of Sentences:\t $numSentences \n\n";
      }
      
The output should be:
 
      ------------------------------------------
                  Entered Text
      ------------------------------------------
                                                                               
      When I am grown to man's estate
      I shal be very proud and great.
      And tell the other girls and boys
      Not to meddle with my toys.
                                                                               
                                                                               
      ------------------------------------------
                    Results
      ------------------------------------------
      Number of Words:         27
      Number of Sentences:     2
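
One detail of word counting deserves a closer look: how the text is split. The sketch below compares splitting on the pattern /\s+/ with splitting on the special string ' '. The helper count_words and the sample text are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count whitespace-separated words.
# split ' ' skips leading whitespace; split /\s+/ would instead
# produce an empty first field when the text starts with whitespace.
sub count_words {
    my ($text) = @_;
    my @words = split(' ', $text);
    return scalar @words;
}

my $text = "  leading spaces\nand a newline\n";

my @by_pattern = split(/\s+/, $text);

print "split /\\s+/ fields:\t", scalar(@by_pattern), "\n";
print "split ' ' fields:\t",   count_words($text), "\n";
```

For this sample the pattern version reports one more field than the ' ' version, because of the empty field in front of "leading".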
      

And now for an example using some IPC from the previous sections: we can parse the output of a command like ls -l and print only what we want. We want to list all Perl files in a directory and mark the ones that were last modified in March. Something like this could be applied recursively to find all of the Perl files on a hard drive, and so on.

      #!/usr/bin/perl -w

      pipe(README, WRITEME);

      $child_pid = fork();

      if($child_pid == 0) {
  
        close(README);
  
        open(STDOUT, ">&WRITEME") or die "Can't redirect stdout";
  
        system("ls -als");
  
        close(STDOUT);
  
        exit;
      }
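      else {
        print "All Perl Files:\n\n";
  
        close(WRITEME);
  
        # Read everything first, then wait for the child; waiting
        # before reading could stall if the listing fills the pipe.
        @strings = <README>;
  
        close(README);
  
        waitpid($child_pid, 0);
  
        foreach $string (@strings) {
  
          # The file name is the last thing on each line of ls output
          if($string =~ m/\.pl$/) {
	    
	    print $string;
	    
	    # The month sits between the file size and the day
	    if($string =~ m/\sMar\s+\d/) {
	      print "\t\tThis file was written in March.\n";
	    }
	  }
        }
  
        print "\n";
      } 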
      
The output at the time was:
 
      All Perl Files:
                                                                               
         4 -rwx------    1 jason    jason         220 Feb 26 08:13 grade.pl
         4 -rwx------    1 jason    jason          45 Feb 23 10:03 HelloWorld.pl
         8 -rwxr-xr-x    1 jason    jason        4354 Feb 19 12:06 hw.pl
         4 -rwx------    1 jason    jason         604 Feb 26 10:25 ps.pl
         4 -rwx------    1 jason    jason         226 Feb 25 15:07 readdir.pl
         4 -rwx------    1 jason    jason         252 Feb 25 12:47 read.pl
         4 -rwx------    1 jason    jason         684 Mar  1 09:43 stuff.pl
                      This file was written in March.
         4 -rwx------    1 jason    jason         558 Feb 25 17:47 try.pl
         4 -rwx------    1 jason    jason         141 Feb 25 12:37 var2.pl
         4 -rwx------    1 jason    jason        2908 Feb 25 10:19 var.pl
         4 -rwx------    1 jason    jason         477 Feb 25 14:14 write.pl      
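
The same kind of line can be picked apart with capturing parentheses, without running ls at all. This is a sketch under some assumptions: the lines look like the listing above, file names contain no spaces, and parse_ls_line is a hypothetical helper. The notes.txt entry is invented to show a line that is skipped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return (month, file name) for a line of ls -l
# style output that ends in a .pl file name, or an empty list otherwise.
sub parse_ls_line {
    my ($line) = @_;
    # month, day, time (or year), then a name ending in .pl
    if ($line =~ m/\s([A-Z][a-z]{2})\s+\d{1,2}\s+[\d:]+\s+(\S+\.pl)$/) {
        return ($1, $2);
    }
    return;
}

my @lines = (
    "   4 -rwx------    1 jason    jason         220 Feb 26 08:13 grade.pl",
    "   4 -rwx------    1 jason    jason         684 Mar  1 09:43 stuff.pl",
    "   4 -rw-------    1 jason    jason         100 Mar  2 10:00 notes.txt",  # invented line
);

foreach my $line (@lines) {
    my ($month, $file) = parse_ls_line($line);
    next unless defined $file;          # not a Perl file
    print "$file was last modified in $month\n";
}
```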
      


Created By: Jason Zurawski
Last Modified: June 11, 2004