Steve Litt is the author of Troubleshooting Techniques of the Successful Technologist, Rapid Learning: Secret Weapon of the Successful Technologist, and Samba Unleashed.
Regular expressions are a HUGE area of knowledge, bordering on an art. Rather than regurgitate the contents of the Perl documentation or the plethora of Perl books at your local bookstore, this page will attempt to give you the 10% of regular expressions you'll use 90% of the time. For that reason, we assume all strings are single-line strings containing no newline characters.
$string =~ m/sought_text/;
The above returns true if string $string contains substring "sought_text", false otherwise. If you want only those strings where the sought text appears at the very beginning, you could write the following:
$string =~ m/^sought_text/;
Similarly, the $ operator indicates "end of string". If you wanted to find out if the sought text was the very last text in the string, you could write this:
$string =~ m/sought_text$/;
Now, if you want the comparison to be true only if $string contains the sought text and nothing but the sought text, simply do this:
$string =~ m/^sought_text$/;
Now what if you want the comparison to be case insensitive? All you do is add the letter i after the ending delimiter:
$string =~ m/^sought_text$/i;
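To see the anchors side by side, here is a minimal sketch that runs each form against a few sample strings. The helper name describe_match and the sample strings are made up for illustration; only the patterns come from the text above.

```perl
use strict;
use warnings;

# describe_match is a hypothetical helper, not part of Perl itself.
# It lists which of the anchored forms shown above match a string.
sub describe_match {
    my ($string) = @_;
    my @hits;
    push @hits, 'contains'          if $string =~ m/sought_text/;
    push @hits, 'starts with'       if $string =~ m/^sought_text/;
    push @hits, 'ends with'         if $string =~ m/sought_text$/;
    push @hits, 'equals'            if $string =~ m/^sought_text$/;
    push @hits, 'equals (any case)' if $string =~ m/^sought_text$/i;
    return join(', ', @hits);
}

for my $sample ('sought_text', 'the sought_text here', 'SOUGHT_TEXT') {
    my $result = describe_match($sample) || 'no match';
    print "$sample => $result\n";
}
```

The middle sample only reports "contains", and the upper-case sample only passes the form with the trailing i.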
. Match any character (except newline)
\w Match "word" character (alphanumeric plus "_")
\W Match non-word character
\s Match whitespace character
\S Match non-whitespace character
\d Match digit character
\D Match non-digit character
\t Match tab
\n Match newline
\r Match return
\f Match formfeed
\a Match alarm (bell, beep, etc)
\e Match escape
\021 Match octal char (in this case 21 octal)
\xf0 Match hex char (in this case f0 hexadecimal)
You can follow any character, wildcard, or series of characters and/or wildcards with a repetition. Here's where you start getting some power:
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
Now for some examples:
$string =~ m/^\s*rem/i; #true if the first printable text is rem or REM
$string =~ m/^\S{1,8}\.\S{0,3}/; # check for DOS 8.3 filename
# (note a few illegals can sneak thru)
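The following sketch wraps the DOS 8.3 pattern in a small sub and tries it on a handful of names, mainly to show which illegals sneak through. The sub name looks_like_8_3 and the file names are invented for the demonstration.

```perl
use strict;
use warnings;

# looks_like_8_3 is a hypothetical wrapper around the DOS 8.3 pattern.
sub looks_like_8_3 {
    my ($name) = @_;
    return $name =~ m/^\S{1,8}\.\S{0,3}/ ? 1 : 0;
}

for my $name ('index.htm', 'toolongname.txt', 'my file.txt', 'a.toolong', 'a*b.txt') {
    printf "%-16s %s\n", $name, looks_like_8_3($name) ? 'accepted' : 'rejected';
}
```

The last two names are accepted even though they are not legal 8.3 names: the pattern has no $ anchor, so an overlong extension passes, and \S is happy to match a "*".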
Powerful regular expressions can be made with groups. At its simplest, you can match either all lowercase or name case like this:
if($string =~ m/(B|b)ill (C|c)linton/)
{print "It is Clinton, all right!\n"}
Detect all strings containing vowels:
if($string =~ m/(A|E|I|O|U|Y|a|e|i|o|u|y)/)
{print "String contains a vowel!\n"}
Detect if the line starts with any of the last three presidents:
if($string =~ m/^(Clinton|Bush|Reagan)/i)
{print "$string\n"};
Note that the parenthesized element will appear as $1 in statements that follow the regular expression. That's OK. If you don't want to use $1, just ignore it. The use of $1, etc., will be explained in the section on Doing String Selections.
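As a small illustration of what ends up in $1, the sketch below returns the text the group actually matched. The sub name leading_president and the sample sentences are hypothetical.

```perl
use strict;
use warnings;

# Returns whatever the group matched at the start of the string,
# or an undefined value if the string starts with none of the names.
sub leading_president {
    my ($string) = @_;
    if ($string =~ m/^(Clinton|Bush|Reagan)/i) {
        return $1;    # $1 holds the text matched by the first parenthesized group
    }
    return;
}

for my $line ('bush signed the bill', 'President Reagan spoke') {
    my $found = leading_president($line);
    print "$line => ", (defined $found ? $found : 'no match'), "\n";
}
```

Because of the i modifier, $1 keeps the case found in the string ("bush"), not the case written in the pattern.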
Character classes have three main advantages:
if($string =~ /[Clinton|Bush|Reagan]/){$office = "President"}
The above may even appear to work upon casual testing. Don't do it. Remember that everything inside the brackets represents ONE character, simply listing all its alternative possibilities.
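A quick way to convince yourself is to run the bracketed version and the grouped version against the same names. This is only a sketch; the test names are arbitrary.

```perl
use strict;
use warnings;

my $class = qr/[Clinton|Bush|Reagan]/;   # ONE character out of this set of letters
my $group = qr/(Clinton|Bush|Reagan)/;   # one whole name

for my $name ('Reagan', 'Gore', 'Nixon') {
    my $by_class = $name =~ $class ? 'yes' : 'no';
    my $by_group = $name =~ $group ? 'yes' : 'no';
    printf "%-7s class=%-3s group=%s\n", $name, $by_class, $by_group;
}
```

The character class says yes to all three names, because each contains at least one letter from the set. The group says yes only to Reagan.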
An uparrow (^) immediately following the opening square bracket means "Anything but these characters", and effectively negates the character class. For instance, to match anything that is not a vowel, do this:
if($string =~ /[^AEIOUYaeiouy]/){print "This string contains a non-vowel"}
Contrast to this:
if($string !~ /[AEIOUYaeiouy]/){print "This string contains no vowels at all"}
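The difference between the two tests is easy to miss, so here is a sketch that applies both to the same words. The sub names has_non_vowel and has_no_vowels are made up for this example.

```perl
use strict;
use warnings;

# Negated class: true if at least one character is not a vowel.
sub has_non_vowel { my ($s) = @_; return $s =~ /[^AEIOUYaeiouy]/ ? 1 : 0 }

# Negated match: true only if no character at all is a vowel.
sub has_no_vowels { my ($s) = @_; return $s !~ /[AEIOUYaeiouy]/ ? 1 : 0 }

for my $word ('bird', 'brr', 'aeiou') {
    printf "%-6s non-vowel=%d no-vowels=%d\n",
        $word, has_non_vowel($word), has_no_vowels($word);
}
```

"bird" contains a non-vowel but still has a vowel, "brr" passes both tests, and "aeiou" passes neither.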
if($string =~ m/^[A-E]/)
{print "$string\n"}
If character classes are giving you quirky results, consider using groups!
if($string =~ m/^\S+\s+(Clinton|Bush|Reagan)/i)
{print "$string\n"};
Print every line with a valid phone number.
if($string =~ m/[\)\s\-]\d{3}-\d{4}[\s\.\,\?]/)
{print "Phone line: $string\n"};
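Here is a sketch that tries the phone number pattern on a few lines. The sub name has_phone_number and the sample lines are invented; the pattern is the one shown above.

```perl
use strict;
use warnings;

# has_phone_number is a hypothetical wrapper around the phone pattern.
sub has_phone_number {
    my ($line) = @_;
    return $line =~ m/[\)\s\-]\d{3}-\d{4}[\s\.\,\?]/ ? 1 : 0;
}

for my $line ('Call (813) 555-1212 today', 'Fax: 813-555-1212.', '555-1212') {
    printf "%-28s %s\n", $line, has_phone_number($line) ? 'phone' : 'no phone';
}
```

Note that the bare "555-1212" is not detected. The pattern insists on a character before the number and one after it, so a number sitting at the very start or very end of the string slips by.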
For instance, create a program whose input is a piped-in directory command and whose output, written to stdout, is a batch file which copies every file (not directory) older than 12/22/97 to a directory called \oldie. This would be pretty nasty in C or C++. The directory output would look something like this:
 Volume in drive D has no label
 Volume Serial Number is 4547-15E0
 Directory of D:\polo\marco
.              <DIR>        12-18-97 11:14a .
..             <DIR>        12-18-97 11:14a ..
INDEX    HTM         3,237  02-06-98  3:12p index.htm
APPDEV   HTM         6,388  12-24-97  5:13p appdev.htm
NORM     HTM         5,297  12-24-97  5:13p norm.htm
IMAGES         <DIR>        12-18-97 11:14a images
TCBK     GIF           532  06-02-97  3:14p tcbk.gif
LSQL     HTM         5,027  12-24-97  5:13p lsql.htm
CRASHPRF HTM        11,403  12-24-97  5:13p crashprf.htm
WS_FTP   LOG         5,416  12-24-97  5:24p WS_FTP.LOG
FIBB     HTM        10,234  12-24-97  5:13p fibb.htm
MEMLEAK  HTM        19,736  12-24-97  5:13p memleak.htm
LITTPERL       <DIR>        02-06-98  1:58p littperl
         9 file(s)         67,270 bytes
         4 dir(s)     132,464,640 bytes free
UUUUgly! I'd hate to do this in C or C++. But wait. It's 18 lines in Perl?
while(<STDIN>)
  {
  my($line) = $_;
  chomp($line);
  if($line !~ /<DIR>/) #directories don't count
    {
    #** only lines with dates at position 28 and (long) filename at pos 44 **
    if ($line =~ /^.{28}(\d\d)-(\d\d)-(\d\d).{8}(.+)$/)
      {
      my($filename) = $4;
      my($yymmdd) = "$3$1$2";
      if($yymmdd lt "971222")
        {
        print "copy $filename \\oldie\n";
        }
      }
    }
  }
Not bad for 18 lines of code. It could have been shorter, but I wanted to keep it readable. In the snippet above, $1, $2, $3 and $4 are the scalars inside the first, second, third and fourth parenthesis sets. The first three are re-assembled into a yymmdd date string which can be compared with the constant "971222". The fourth holds the filename which will be copied to the \oldie directory if it's not a directory, it's a line with a date, and the date is before 971222. This is the true power of regular expressions and Perl.
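If you'd like to try the idea without piping in a real directory command, the sketch below feeds the same kind of pattern from an in-memory list. Everything here is an assumption made for the demo: the subs dir_row and copy_commands are hypothetical, and dir_row builds rows in the column layout the regular expression expects (date at offset 28, long filename at offset 44).

```perl
use strict;
use warnings;

# Build one listing row in the assumed fixed-width layout:
# 8-char name, space, 3-char extension, 14-char size, 2 spaces,
# date, space, 6-char time, space, long filename.
sub dir_row {
    my ($name, $ext, $size, $date, $time, $long) = @_;
    return sprintf("%-8s %-3s%14s  %s %6s %s", $name, $ext, $size, $date, $time, $long);
}

# Return a "copy" command for every plain file dated before $cutoff (yymmdd).
sub copy_commands {
    my ($cutoff, @lines) = @_;
    my @commands;
    for my $line (@lines) {
        next if $line =~ /<DIR>/;    # directories don't count
        next unless $line =~ /^.{28}(\d\d)-(\d\d)-(\d\d).{8}(.+)$/;
        my ($mm, $dd, $yy, $filename) = ($1, $2, $3, $4);
        push @commands, "copy $filename \\oldie" if "$yy$mm$dd" lt $cutoff;
    }
    return @commands;
}

my @listing = (
    dir_row('IMAGES', '',    '<DIR>', '12-18-97', '11:14a', 'images'),
    dir_row('INDEX',  'HTM', '3,237', '02-06-98', '3:12p',  'index.htm'),
    dir_row('TCBK',   'GIF', '532',   '06-02-97', '3:14p',  'tcbk.gif'),
    dir_row('NORM',   'HTM', '5,297', '12-24-97', '5:13p',  'norm.htm'),
);

print "$_\n" for copy_commands('971222', @listing);
```

With these four rows only tcbk.gif is older than the cutoff, so a single copy command comes out. Copying $1 through $4 into named variables right after the match is a design choice: the next successful match overwrites them.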
Now count the bytes in the directory:
my($totalBytes) = 0;
while(<STDIN>)
  {
  my($line) = $_;
  chomp($line);
  if($line !~ /<DIR>/) #directories don't count
    {
    #*** only lines with dates at position 28 ****
    if ($line =~ /.{12}((\d| |,){14}) \d\d-\d\d-\d\d/)
      {
      my($bytes) = $1;
      $bytes =~ s/,//g; #substitute nothing for commas -- delete commas
      $totalBytes += $bytes;
      }
    }
  }
print "$totalBytes bytes in directory.\n";
Note the group within a group, where the inner one is used for character alternation, and the outer is used as a selection.
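The comma stripping is the part most likely to bite you, so here it is in isolation. This is a sketch only: size_to_number is a hypothetical helper and the size fields are sample values.

```perl
use strict;
use warnings;

# size_to_number is a hypothetical helper: it turns a padded size field
# such as "   132,464,640" into a plain number.
sub size_to_number {
    my ($field) = @_;
    (my $digits = $field) =~ s/[,\s]//g;    # g: drop EVERY comma and space
    return $digits + 0;
}

my $total = 0;
$total += size_to_number($_) for ('         3,237', '           532', '   132,464,640');
print "$total bytes\n";
```

Without the g modifier only the first comma would be removed, and a size with two commas would be added up as a much smaller number.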
$string =~ s/Bill Clinton/Al Gore/;
Now do it ignoring the case of bIlL ClInToN.
$string =~ s/Bill Clinton/Al Gore/i;
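One thing worth knowing about s/// is that it replaces only the first match unless told otherwise. The sketch below adds the g modifier, which the examples above don't use, to replace every occurrence. The sample sentence is made up.

```perl
use strict;
use warnings;

my $original = "bIlL ClInToN met Bill Clinton";

# Copy first, then substitute, so $original stays untouched.
(my $first_only = $original) =~ s/Bill Clinton/Al Gore/i;     # first match only
(my $every_one  = $original) =~ s/Bill Clinton/Al Gore/gi;    # g = every match

print "$first_only\n";
print "$every_one\n";
```

The first line printed still contains one "Bill Clinton"; the second contains none.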
$string =~ tr/aeiouy/AEIOUY/;
Change everything to upper case:
$string =~ tr/a-z/A-Z/;
Change everything to lower case:
$string =~ tr/A-Z/a-z/;
Change all vowels to numbers to avoid "4 letter words" in a serial number.
$string =~ tr/AEIOUY/123456/;
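Keep in mind that tr maps single characters position by position: the first character of the left list becomes the first character of the right list, and so on. It is not a regular expression, so brackets and commas would be treated as ordinary characters. A small sketch, with a made-up serial number:

```perl
use strict;
use warnings;

my $serial = "BADGE-YOU";

# Copy first, then translate, so $serial itself stays untouched.
(my $numbered = $serial) =~ tr/AEIOUY/123456/;    # A->1, E->2, I->3, O->4, U->5, Y->6
(my $lower    = $serial) =~ tr/A-Z/a-z/;          # ranges are allowed in tr lists

# With an empty replacement list tr changes nothing and just counts matches.
my $vowel_count = ($serial =~ tr/AEIOUY//);

print "$numbered $lower $vowel_count\n";
```

The counting form in the last statement is handy when you only want to know how many of certain characters a string holds.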
my($text) = "mississippi";
$text =~ m/(i.*s)/;
print $1 . "\n";
Run the preceding code, and here's what you get:
ississ
It matches the first i, the last s, and everything in between them. But what if you want to match the first i to the s most closely following it? Use this code:
my($text) = "mississippi";
$text =~ m/(i.*?s)/;
print $1 . "\n";
Now look what the code produces:
is
Clearly, the use of the question mark makes the match ungreedy. But there's another problem in that regular expressions always try to match as early as possible. Read on...
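The sketch below puts the greedy and ungreedy forms next to each other and adds a third pattern, m/(i.*?p)/, which is not from the text above. It is there to show the "as early as possible" rule: even an ungreedy match starts at the first i it can.

```perl
use strict;
use warnings;

my $text = "mississippi";

# A match in list context returns the captured groups.
my ($greedy)   = $text =~ m/(i.*s)/;     # first i through the LAST s
my ($lazy)     = $text =~ m/(i.*?s)/;    # first i through the NEAREST s
my ($leftmost) = $text =~ m/(i.*?p)/;    # still starts at the FIRST i

print "greedy:   $greedy\n";
print "lazy:     $lazy\n";
print "leftmost: $leftmost\n";
```

You might expect the third pattern to give "ip", the shortest stretch from an i to a p. It gives "ississip" instead, because the match begins at the earliest position that can succeed and only then is kept as short as possible.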
This is MUCH trickier than it might seem. It's likely that all your ideas about greedy matching, replacement strings and the like won't work. Here's the regular expression to resolve A SINGLE double dot:
$text =~ s/\/[^\/]*\/\.\.//;
In English, this says "find a slash, followed by any number of nonslashes, followed by a slash, followed by two dots, and replace them with nothing." This technique will resolve doubledots in a string as long as that string has only one doubledot. But the plot thickens...
Doubledots can occur alternately with directories (/a/b/../c/../d) or nested (/a/b/c/../../d). The best way I've found to reliably resolve all doubledots is to make a function that loops through the preceding regular expression until there are no more doubledots. Here's the function:
sub deleteDoubleDots($)
  {
  while($_[0] =~ m/\.\./)
    {
    $_[0] =~ s/\/[^\/]*\/\.\.//;
    }
  }
The preceding function will resolve all doubledots, be they alternating or nested, or combinations thereof.
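Here is one possible variation on the same loop, offered as a sketch rather than a replacement. The sub name resolve_double_dots is hypothetical. It returns a cleaned copy instead of changing its argument, and it loops on the substitution itself, so it stops as soon as nothing more can be removed.

```perl
use strict;
use warnings;

# resolve_double_dots is a hypothetical variant that works on a copy.
sub resolve_double_dots {
    my ($path) = @_;
    # Each successful substitution removes one "/dir/.." pair.
    # The loop ends when the substitution no longer finds anything.
    1 while $path =~ s/\/[^\/]*\/\.\.//;
    return $path;
}

for my $path ('/a/b/../c/../d', '/a/b/c/../../d') {
    print "$path => ", resolve_double_dots($path), "\n";
}
```

Both the alternating and the nested sample come out as /a/d. Looping on the substitution rather than on a separate test for ".." means a path whose leftover doubledot has no directory in front of it cannot keep the loop spinning.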
my($text) = "/etc/sysconfig/network-scripts/ifcfg-eth0";
my($directory, $filename) = $text =~ m/(.*\/)(.*)$/;
print "D=$directory, F=$filename\n";
Is that cool or what?
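One case the one-liner doesn't cover is a string with no slash at all: the match fails and both variables come back undefined. The sketch below wraps the same pattern in a hypothetical sub, split_path, and treats a slashless string as a bare filename. That fallback is my assumption, not something the text above prescribes.

```perl
use strict;
use warnings;

# split_path is a hypothetical wrapper around the directory/filename pattern.
sub split_path {
    my ($path) = @_;
    # The greedy (.*\/) runs to the LAST slash, leaving the filename for (.*).
    if (my ($directory, $filename) = $path =~ m/(.*\/)(.*)$/) {
        return ($directory, $filename);
    }
    return ('', $path);    # no slash: assume the whole string is a filename
}

for my $path ('/etc/sysconfig/network-scripts/ifcfg-eth0', 'junk.pl') {
    my ($directory, $filename) = split_path($path);
    print "D=$directory, F=$filename\n";
}
```

This relies on the greediness described earlier: because .* grabs as much as it can, the directory part always ends at the final slash.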
$string =~ m/Bill Clinton/; #return true if var $string contains the name of the president
$string =~ s/Bill Clinton/Al Gore/; #replace the president with the vice president
$string =~ m/Bill Clinton/; #return true if var $string contains the name of the president
$string =~ /Bill Clinton/; #same result as previous statement
$string =~ m/^Bill Clinton/; #true only when "Bill Clinton" is the first text in the string
$string =~ m/Bill Clinton$/; #true only when "Bill Clinton" is the last text in the string
$string =~ m/Bill Clinton/i; #true when $string contains "Bill Clinton" or "BilL ClInToN"