Regular Expressions in ASP.NET
A Crash Course
Steven A. Smith
March 2004
Applies to:
Microsoft® .NET Framework
Microsoft® ASP.NET
Regular Expression API
Summary: Regular expressions are an extremely useful tool for working with text. Whether you need to validate user input, search for patterns within strings, or reformat text in powerful ways, regular expressions can help. (14 printed pages)
Download the source code for this article.
Contents
Introduction
Brief History of Regular Expressions
Simple Expressions
Quantifiers
Metacharacters
Character Classes
Predefined Set Metacharacters
Sample Expressions
Validation in ASP.NET
Regular Expression API
Free Tools
Advanced Topics
Conclusion
Resources
About the Author
Introduction
Support for regular expressions in the Microsoft® .NET Framework is first-class, and even just within Microsoft® ASP.NET there are controls that rely on the language of regular expressions. This article covers the basics and recommends where to go to learn more.
This article is designed for beginners with little or no experience with regular expressions, but who are familiar with ASP.NET and programming in .NET. I hope it will also make a handy reference/refresher for developers who have used regular expressions before, in conjunction with my regular expression cheat sheet. In this article, I will discuss:
- Brief History of Regular Expressions
- Simple Expressions
- Quantifiers
- Metacharacters
- Character Classes
- Predefined Set Metacharacters
- Sample Expressions In Detail
- Validation in ASP.NET
- Regular Expression API
- Free Tools
- Advanced Topics Overview
- Summary and Additional Resources
If you have questions about this article or regular expressions in general, I invite you to ask them on the regex mailing list at http://aspadvice.com, which as I'm writing this has over 350 subscribers.
Brief History of Regular Expressions
Regular expressions as they exist today were invented in the 1950s. Regular expressions were originally used to describe "regular sets," which were patterns under study by neurophysiologists. Credit for the first regular expressions is given to the mathematician Stephen Kleene. Eventually, Ken Thompson built support for regular expressions into qed and grep, both very popular text utilities. Jeffrey Friedl goes into more depth in his book, Mastering Regular Expressions (2nd edition), which is strongly recommended for those wishing to learn more about the theory and history behind regular expressions.
In the last five decades, regular expressions have slowly made their way from mathematical obscurity to a staple feature of many tools and software packages. While regular expressions were supported by many UNIX tools for decades, it was only in the last decade or so that they have found their way into most Windows developers' toolkits. Using regular expressions in Microsoft® Visual Basic® 6 or Microsoft® VBScript was awkward at best, but with the introduction of the .NET Framework, regular expression support is top-notch and available to all Microsoft developers and all .NET languages.
So just what are regular expressions? Regular expressions are a language that can be used to explicitly describe patterns within strings of text. In addition to simply describing such patterns, regular expression engines can typically be used to iterate through matches, to parse strings into substrings using patterns as delimiters, or to replace or reformat text in an intelligent fashion. They provide a powerful and usually very succinct way to solve many common tasks related to text manipulation.
It is common when discussing regular expressions to analyze them based on text they would or would not match. In this article (and in the System.Text.RegularExpressions classes), we'll refer to three players in the regular expression interaction: the regular expression pattern, the input string, and any matches the pattern makes within the string.
Simple Expressions
The simplest regular expression is one you're already familiar with: the literal string. A particular string can be described, literally, by itself, and thus a regular expression pattern like foo would match the input string foo exactly once. In this case, it would also match the input: The food was quite tasty, which might not be desired if only a precise match is sought.
Of course, matching exact strings to themselves is a trivial implementation of regular expressions, and doesn't begin to reveal their power. What if instead of foo you wanted to find all words starting with the letter f, or all three letter words? Now you've gone beyond what literal strings can do (within reason)—it's time to learn some more about regular expressions. Below is a sample literal expression and some inputs it would match.
Pattern | Inputs (Matches) |
---|---|
foo | foo, food, foot, "There's evil afoot." |
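To try a literal pattern from code, here is a minimal sketch using the static Regex.IsMatch and Regex.Matches methods (both are covered later in this article). The class and method names are my own, chosen only for illustration.

```csharp
using System.Text.RegularExpressions;

public static class LiteralPatternDemo
{
    // Returns true when the literal pattern "foo" occurs anywhere in the input.
    public static bool ContainsFoo(string input)
    {
        return Regex.IsMatch(input, "foo");
    }

    // Counts how many separate times "foo" occurs in the input.
    public static int CountFoo(string input)
    {
        return Regex.Matches(input, "foo").Count;
    }
}
```

Note that matching is case sensitive by default, so ContainsFoo("Foo") returns false while ContainsFoo("There's evil afoot.") returns true.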
Quantifiers
Quantifiers provide a simple way to specify within a pattern how many times a particular character or set of characters is allowed to repeat itself. There are three non-explicit quantifiers:
- *, which describes "0 or more occurrences",
- +, which describes "1 or more occurrences", and
- ?, which describes "0 or 1 occurrence".
Quantifiers always refer to the pattern immediately preceding (to the left of) the quantifier, which is normally a single character unless parentheses are used to create a pattern group. Below are some sample patterns and inputs they would match.
Pattern | Inputs (Matches) |
---|---|
fo* | foo, foe, food, fooot, "forget it", funny, puffy |
fo+ | foo, foe, food, foot, "forget it" |
fo? | foo, foe, food, foot, "forget it", funny, puffy |
In addition to specifying that a given pattern may occur exactly 0 or 1 time, the ? character has a second role: placed directly after another quantifier (as in *?, +?, or ??), it forces the pattern or subpattern to match the minimal number of characters when it might otherwise match several in an input string.
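The difference between the default (greedy) behavior and the minimal behavior is easiest to see by looking at the text a match actually returns. The helper below is a small sketch of my own (FirstMatch is a hypothetical name, not part of the framework) that returns the first match for a pattern.

```csharp
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Returns the text of the first match of the pattern in the input,
    // or null when the pattern does not match at all.
    public static string FirstMatch(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }
}
```

With the input fooot, the pattern fo* returns fooo (as many o characters as possible), fo+? returns fo (the fewest that still satisfy +), and fo*? returns just f.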
In addition to the non-explicit quantifiers (generally just referred to as quantifiers, but I'm distinguishing them from this next group), there are also explicit quantifiers. Where quantifiers are fairly vague in terms of how many occurrences there may be of a pattern, explicit quantifiers allow an exact number or a range to be specified. Explicit quantifiers are positioned following the pattern they apply to, just like regular quantifiers. They use curly braces {} with number values for the lower and upper occurrence limits inside the braces. For example, x{5} would match exactly five x characters (xxxxx). When only one number is specified, it is the exact count, unless it is followed by a comma, such as x{5,}, which makes it a lower bound and would match five or more x characters. Two numbers, as in x{2,5}, give a lower and an upper bound. In the .NET Framework the lower bound cannot be omitted (x{,2} is treated as literal text), so to say "at most two" write x{0,2}. Below are some sample patterns and inputs they would match.
Pattern | Inputs (Matches) |
---|---|
ab{2}c | abbc, aaabbccc |
ab{0,2}c | ac, abc, abbc, aabbcc |
ab{2,3}c | abbc, abbbc, aabbcc, aabbbcc |
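Here is a short sketch of explicit quantifiers in use. It borrows the ^ and $ metacharacters, described in the next section, so that each pattern must describe the whole input rather than a piece of it. The class and method names are illustrative only.

```csharp
using System.Text.RegularExpressions;

public static class ExplicitQuantifierDemo
{
    // True when the whole input is exactly five x characters.
    public static bool IsExactlyFiveX(string input)
    {
        return Regex.IsMatch(input, "^x{5}$");
    }

    // True when the whole input is five or more x characters.
    public static bool IsFiveOrMoreX(string input)
    {
        return Regex.IsMatch(input, "^x{5,}$");
    }

    // True when the whole input is an a, then two or three b characters, then a c.
    public static bool IsAbbcOrAbbbc(string input)
    {
        return Regex.IsMatch(input, "^ab{2,3}c$");
    }
}
```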
Metacharacters
The constructs within regular expressions that have special meaning are referred to as metacharacters. You've already learned about several metacharacters, such as the *, ?, +, and { } characters. Several other characters have special meaning within the language of regular expressions. These include the following: $ ^ . [ ( | ) ] and \.
The . (period or dot) metacharacter is one of the simplest and most used. It matches any single character. This can be useful for specifying that certain patterns can contain any combination of characters, but must fall within certain length ranges by using quantifiers. Also, we have seen that expressions will match any instance of the pattern they describe within a larger string, but what if you only want to match the pattern exactly? This is often the case for validation scenarios, such as ensuring the user entered something that is the proper format for a postal code or telephone number. The ^ metacharacter is used to designate the beginning of a string (or line), and the $ metacharacter is used to designate the end of a string (or line). By adding these characters to the beginning and end of a pattern, you can force it to only match input strings that exactly match the pattern. The ^ metacharacter also has special meaning when used at the start of a character class, designated by hard braces [ ]. These are covered below.
The \ (backslash) metacharacter is used to "escape" characters from their special meaning, as well as to designate instances of predefined set metacharacters. These too are covered below. In order to include a literal version of a metacharacter in a regular expression, it must be "escaped" with a backslash. So for instance if you wanted to match strings that begin with "c:\" you might use this: ^c:\\ Note that we used the ^ metacharacter to indicate that the string must begin with this pattern, and we escaped our literal backslash with a backslash metacharacter.
The | (pipe) metacharacter is used for alternation, essentially to specify 'this OR that' within a pattern. So something like a|b would match anything with an 'a' or a 'b' in it, and would be very similar to the character class [ab].
Finally, the parentheses ( ) are used to group patterns. This can be done to allow a complete pattern to occur multiple times using quantifiers, for readability only, or to allow certain portions of the input to be matched separately, perhaps to allow for reformatting or parsing.
Some examples of metacharacter usage are listed below.
Pattern | Inputs (Matches) |
---|---|
. | a, b, c, 1, 2, 3 |
.* | Abc, 123, any string, even no characters would match |
^c:\\ | c:\windows, c:\\\\\, c:\foo.txt, c:\ followed by anything else |
abc$ | abc, 123abc, any string ending with abc |
(abc){2,3} | abcabc, abcabcabc |
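The sketch below exercises a few of these metacharacters from C#. It uses verbatim string literals (the @ prefix) so that the backslashes reach the regular expression engine unchanged. The method names are hypothetical and exist only for this example.

```csharp
using System.Text.RegularExpressions;

public static class MetacharacterDemo
{
    // True when the input begins with the literal text c:\
    // The pattern escapes the backslash, and ^ anchors it to the start.
    public static bool StartsWithCDrive(string input)
    {
        return Regex.IsMatch(input, @"^c:\\");
    }

    // True when the whole input is "abc" repeated two or three times.
    // The parentheses make the quantifier apply to all three letters.
    public static bool IsRepeatedAbc(string input)
    {
        return Regex.IsMatch(input, @"^(abc){2,3}$");
    }

    // True when the input contains either "cat" or "dog" (alternation).
    public static bool MentionsPet(string input)
    {
        return Regex.IsMatch(input, "cat|dog");
    }
}
```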
Character Classes
Character classes are a mini-language within regular expressions, defined by the enclosing hard braces [ ]. The simplest character class is simply a list of characters within these braces, such as [aeiou]. When used in an expression, any one of these characters can be used at this position in the pattern (but only one unless quantifiers are used). It's important to note that character classes cannot be used to define words or patterns, only single characters.
To specify any numeric digit, the character class [0123456789] could be used. However, since this would quickly get cumbersome, ranges of characters can be defined within the braces by using the hyphen character, -. The hyphen character has special meaning within character classes, not within regular expressions (thus it doesn't qualify as a regular expression metacharacter, exactly), and it only has special meaning within a character class if it is not the first character. To specify any numeric digit using a hyphen, you would use [0-9]. Similarly for any lowercase letter, you could use [a-z], or for any uppercase letter [A-Z]. The range defined by the hyphen depends on the character set being used, so the order in which the characters occur in the (for example) ASCII or Unicode table determines which characters are included in the range. If you need a hyphen to be included in your class, specify it as the first character. For example, [-.? ] would match any one of those four characters (note the last character is a space). Also note that most regular expression metacharacters are not treated specially within character classes, so they do not need to be escaped; the exceptions are the backslash, the closing brace ], and the ^ and - characters in the positions described here. Consider character classes to be a separate language from the rest of the regular expression world, with their own rules and syntax.
You can also match any character except a member of a character class by negating the class using the caret ^ as the first character in the character class. Thus, to match any non-vowel character, you could use a character class of [^aAeEiIoOuU]. Note that if you want to negate a hyphen, it should be the second character in the character class, as in [^-]. Remember that the ^ has a totally different meaning within a character class than it has at the start of a regular expression pattern.
Some examples of character classes in action are listed below.
Pattern | Inputs (Matches) |
---|---|
^b[aeiou]t$ | bat, bet, bit, bot, but |
^[0-9]{5}$ | 11111, 12345, 99999 |
^[^-][0-9]$ | 10, 42, x7 (any one character other than a hyphen, followed by one digit; will not match -0, -1, -2, etc.) |
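As a quick illustration of character classes in code, the following sketch (with names of my own invention) checks one of the patterns from the table and uses a vowel class with Regex.Replace to strip vowels from a string.

```csharp
using System.Text.RegularExpressions;

public static class CharacterClassDemo
{
    // True for bat, bet, bit, bot, and but: a b, exactly one lowercase vowel, then a t.
    public static bool IsBVowelT(string input)
    {
        return Regex.IsMatch(input, "^b[aeiou]t$");
    }

    // Removes every vowel, upper or lower case, from the input.
    public static string RemoveVowels(string input)
    {
        return Regex.Replace(input, "[aAeEiIoOuU]", "");
    }

    // True when the input is a single hyphen, period, question mark, or space.
    // The hyphen comes first in the class so it is taken literally.
    public static bool IsSeparator(string input)
    {
        return Regex.IsMatch(input, "^[-.? ]$");
    }
}
```

For example, RemoveVowels("Regular Expressions") returns "Rglr xprssns".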
In the next version of the .NET Framework, code-named "Whidbey", a new feature is slated to be added to character classes, called character class subtraction. Basically this would allow one character class to be subtracted from another, which would provide a more readable way to describe some patterns. The specification is available now, at https://www.gotdotnet.com/team/clr/bcl/TechArticles/techarticles/Specs/Regex/CharacterClassSubtraction.doc. The syntax would be something like [a-z-[aeiou]] to match all lowercase consonants.
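If you are running a framework version in which character class subtraction shipped as specified (to my knowledge, .NET Framework 2.0 and later), a sketch like the following should work. Treat the version requirement as an assumption on my part; on version 1.x the pattern will not behave this way.

```csharp
using System.Text.RegularExpressions;

public static class ClassSubtractionDemo
{
    // Counts lowercase letters that are not vowels, using class subtraction:
    // the class a-z with the class aeiou taken away from it.
    public static int CountLowercaseConsonants(string input)
    {
        return Regex.Matches(input, "[a-z-[aeiou]]").Count;
    }
}
```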
Predefined Set Metacharacters
There's a great deal that can be done with the tools we've covered so far. However, it is still rather longwinded to use [0-9] for every numeric digit in a pattern, or worse, [0-9a-zA-Z] for any alphanumeric character. To ease the pain of dealing with these common but lengthy patterns, a set of predefined metacharacters was defined. Different implementations of regular expressions define different sets of predefined metacharacters—the ones I describe here are supported by the System.Text.RegularExpressions API in the .NET Framework. The standard syntax for these predefined metacharacters is a backslash \ followed by one or more characters. Most of these are just one character long, making them easy to use and an ideal replacement for lengthy character classes. Two such examples are \d which matches any numeric digit and \w which matches any word character (alphanumeric plus underscore). The exceptions are specific character code matches, which must specify the address of the character they are matching, such as \u000D which would match the Unicode carriage return character. Some of the most common character classes and their metacharacter equivalents are listed below.
Metacharacter | Equivalent Character Class |
---|---|
\a | Matches a bell (alarm); \u0007 |
\b | Matches a word boundary except in a character class, where it matches a backspace character, \u0008 |
\t | Matches a tab; \u0009 |
\r | Matches a carriage return; \u000D |
\v | Matches a vertical tab; \u000B |
\f | Matches a form feed; \u000C |
\n | Matches a new line; \u000A |
\e | Matches an escape; \u001B |
\040 | Matches an ASCII character with a three-digit octal. \040 represents a space (Decimal 32). |
\x20 | Matches an ASCII character using 2-digit hexadecimal. In this case, \x20 represents a space. |
\cC | Matches an ASCII control character, in this case ctrl-C. |
\u0020 | Matches a Unicode character using exactly four hexadecimal digits. In this case \u0020 is a space. |
\* | Any character that does not represent a predefined character class is simply treated as that character. Thus \* is the same as \x2A (a literal *, not the * metacharacter). |
\p{name} | Matches any character in the named character class 'name'. Supported names are Unicode groups and block ranges. For example Ll, Nd, Z, IsGreek, IsBoxDrawing, and Sc (currency). |
\P{name} | Matches text not included in the named character class 'name'. |
\w | Matches any word character. For non-Unicode and ECMAScript implementations, this is the same as [a-zA-Z_0-9]. In Unicode categories, this is the same as [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]. |
\W | The negation of \w, this equals the ECMAScript compliant set [^a-zA-Z_0-9] or the Unicode character categories [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]. |
\s | Matches any white-space character. Equivalent to the Unicode character classes [\f\n\r\t\v\x85\p{Z}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \s is equivalent to [ \f\n\r\t\v] (note leading space). |
\S | Matches any non-white-space character. Equivalent to the Unicode character categories [^\f\n\r\t\v\x85\p{Z}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \S is equivalent to [^ \f\n\r\t\v] (note space after ^). |
\d | Matches any decimal digit. Equivalent to [\p{Nd}] for Unicode and [0-9] for non-Unicode, ECMAScript behavior. |
\D | Matches any non-decimal digit. Equivalent to [\P{Nd}] for Unicode and [^0-9] for non-Unicode, ECMAScript behavior. |
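The difference between the Unicode and ECMAScript definitions in the table is easy to overlook, so here is a small sketch that makes it visible. It checks the same \d pattern with and without RegexOptions.ECMAScript, and uses \s with Regex.Split. The class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class PredefinedSetDemo
{
    // True when the input is one or more decimal digits from any Unicode script.
    public static bool IsAllDigits(string input)
    {
        return Regex.IsMatch(input, @"^\d+$");
    }

    // True when the input is one or more of the characters 0 through 9 only,
    // because the ECMAScript option narrows \d to [0-9].
    public static bool IsAllAsciiDigits(string input)
    {
        return Regex.IsMatch(input, @"^\d+$", RegexOptions.ECMAScript);
    }

    // Splits the input wherever one or more white-space characters occur.
    public static string[] SplitOnWhitespace(string input)
    {
        return Regex.Split(input, @"\s+");
    }
}
```

Passing the Arabic-Indic digits "\u0661\u0662\u0663" to IsAllDigits returns true, while IsAllAsciiDigits returns false for the same input. This matters if you expect \d to mean strictly 0 through 9.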
Sample Expressions
Most people learn best by example, so here are a very few sample expressions. For more samples, you should visit the online regular expression library, at http://RegexLib.com.
Pattern | Description |
---|---|
^\d{5}$ | 5 numeric digits, such as a US ZIP code. |
^(\d{5}|\d{5}-\d{4})$ | 5 numeric digits, or 5 digits-dash-4 digits. This matches a US ZIP or US ZIP+4 format. |
^(\d{5})(-\d{4})?$ | Same as previous, but more efficient. Uses ? to make the -4 digits portion of the pattern optional, rather than requiring two separate patterns to be compared individually (via alternation). |
^[+-]?\d+(\.\d+)?$ | Matches any real number with optional sign. |
^[+-]?\d*\.?\d*$ | Same as above, but also matches empty string. |
^(20|21|22|23|[01]\d)[0-5]\d$ | Matches any 24-hour time value. |
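/\*.*\*/ | Matches a C-style comment /* … */, delimiters included, when it sits on a single line. The .* is greedy, so with two comments on one line the match runs from the first /* to the last */. |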
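A few of the expressions above can be wrapped in small helper methods, as in this sketch. The method names are mine; the patterns are taken from the table.

```csharp
using System.Text.RegularExpressions;

public static class SampleExpressionDemo
{
    // US ZIP or ZIP+4: five digits, optionally followed by a dash and four digits.
    public static bool IsZipOrZipPlus4(string input)
    {
        return Regex.IsMatch(input, @"^(\d{5})(-\d{4})?$");
    }

    // A number with an optional sign and an optional fractional part.
    public static bool IsSignedNumber(string input)
    {
        return Regex.IsMatch(input, @"^[+-]?\d+(\.\d+)?$");
    }

    // A 24-hour time written as four digits with no separator, such as 2359.
    public static bool Is24HourTime(string input)
    {
        return Regex.IsMatch(input, @"^(20|21|22|23|[01]\d)[0-5]\d$");
    }
}
```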
Validation in ASP.NET
ASP.NET provides a suite of validation controls, which make validating inputs on web forms extremely easy compared to the same task using legacy (or classic if you prefer) ASP. One of the more powerful validators is the RegularExpressionValidator which, as you might guess, allows you to validate inputs by providing a regular expression which must match the input. The regular expression pattern is specified by setting the ValidationExpression property of the control. An example validator for a ZIP code field is shown below:
    <asp:RegularExpressionValidator runat="server"
        id="ZipCodeValidator"
        ControlToValidate="ZipCodeTextBox"
        ErrorMessage="Invalid ZIP code format; format should be either 12345 or 12345-6789."
        ValidationExpression="\d{5}(-\d{4})?" />
A few things to note about the RegularExpressionValidator:
- It will never be activated by an empty string in the control it is validating. Only the RequiredFieldValidator catches empty strings.
- You do not need to specify beginning of string and end of string matching characters (^ and $)—they are assumed. If you add them, it won't hurt (or change) anything—it's simply unnecessary.
- As with all validation controls, the validation is done client-side as well as server side. If your regular expression is not ECMAScript compliant, it will fail on the client. To avoid this, either ensure your expression is ECMAScript compliant, or set the control to perform its validation only on the server.
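To make the first two points concrete, here is a rough sketch of a helper that behaves the way the list describes: empty input passes, and the pattern must account for the entire input even without ^ and $. This is my own illustration of the described behavior, not the control's actual source, and MatchesWholeInput is a hypothetical name.

```csharp
using System.Text.RegularExpressions;

public static class ValidatorSketch
{
    // Mimics the validator behavior described above (an approximation).
    public static bool MatchesWholeInput(string input, string pattern)
    {
        // Empty input is never rejected here; requiring a value is a separate concern.
        if (input.Length == 0)
        {
            return true;
        }

        // The first match must start at the beginning and cover every character,
        // which has the same effect as wrapping the pattern in ^ and $.
        Match m = Regex.Match(input, pattern);
        return m.Success && m.Index == 0 && m.Length == input.Length;
    }
}
```

With the pattern \d{5}(-\d{4})? this returns true for 12345 and 12345-6789, and false for x12345 and 123456.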
Regular Expression API
Outside of the ASP.NET validation controls, most of the time when you're using regular expressions in .NET, you'll use the classes found in the System.Text.RegularExpressions namespace. In particular, the main classes you'll want to become familiar with are Regex, Match, and MatchCollection.
Incidentally, there is some dispute as to whether the shortened version of regular expression, regex, should be pronounced /reg-eks/ or /rej-eks/. Personally I prefer the latter, but there are experts in both pronunciation camps, so pick whichever sounds better to you.
The Regex class has a rich set of methods and properties, which can be rather daunting if you haven't used it before. A summary of the most frequently used methods is included here:
Method | Description |
---|---|
Escape / Unescape | Escape converts metacharacters in a string so they can be used as literals in an expression; Unescape reverses the process. |
IsMatch | Returns true if the regex finds a match in the input string. |
Match | Returns a Match object if a match is found in the input string. |
Matches | Returns a MatchCollection object containing any and all matches found in the input string. |
Replace | Replaces matches in the input string with a given replacement string. |
Split | Returns an array of strings by splitting up the input string into array elements separated by regex matches. |
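Escape and Split are the two methods in this table that tend to surprise people, so here is a brief sketch of each. CountLiteral and SplitOnCommas are names I made up for the example.

```csharp
using System.Text.RegularExpressions;

public static class RegexMethodDemo
{
    // Counts occurrences of arbitrary text by escaping it first, so that
    // characters such as + or . in the text are matched literally.
    public static int CountLiteral(string input, string literalText)
    {
        return Regex.Matches(input, Regex.Escape(literalText)).Count;
    }

    // Splits a comma-separated list, discarding white space around each comma.
    public static string[] SplitOnCommas(string input)
    {
        return Regex.Split(input, @"\s*,\s*");
    }
}
```

Searching "1+1=2, 1+1+1=3" for the text 1+1 this way finds two occurrences. Used unescaped as a pattern, 1+1 would mean "one or more 1 characters followed by a 1" and would find none.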
In addition to many methods, there are also a number of options that can be specified, usually in the constructor of the Regex object. These options are part of a bitmask, and thus can be OR'd together (yes, you can have both Multiline and Singleline turned on at the same time).
Option | Description |
---|---|
Compiled | Compiles the expression to MSIL instead of interpreting it. Matching runs faster at the cost of a slower startup, so use this option when the same expression will perform many match operations, such as in a loop. |
Multiline | Has nothing to do with how many lines are in the input string. Rather, this simply modifies the behavior of ^ and $ so that they match BOL and EOL instead of the beginning and end of the entire input string. |
IgnoreCase | Causes the pattern to ignore case sensitivity when matching the search string. |
IgnorePatternWhitespace | Allows pattern to have as much white space as desired, and also enables end-of-line comments in the pattern, which start with # and run to the end of the line. (Inline comments using the (?# comment) syntax work with or without this option.) |
Singleline | Has nothing to do with how many lines are in the input string. Rather, will cause the . (period) metacharacter to match any character, instead of any character except \n, which is the default. |
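Because Multiline and Singleline are so often confused, a short sketch may help. The first method relies on Multiline to make ^ and $ work per line; the second takes the options as a parameter so you can see how Singleline changes what the dot matches. Both method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class RegexOptionsDemo
{
    // Counts the lines that consist only of digits. With Multiline, ^ and $
    // match at the start and end of each line (lines separated by \n).
    public static int CountNumericLines(string input)
    {
        return Regex.Matches(input, @"^\d+$", RegexOptions.Multiline).Count;
    }

    // True when an a and a b are separated by exactly one character.
    // Whether that character may be \n depends on the options passed in.
    public static bool HasAThenAnyThenB(string input, RegexOptions options)
    {
        return Regex.IsMatch(input, "a.b", options);
    }
}
```

CountNumericLines("12\nab\n34") returns 2. HasAThenAnyThenB("a\nb", RegexOptions.None) returns false, while the same call with RegexOptions.Singleline returns true.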
Some common things you may use regular expressions for include validating, matching, and replacing. In many cases, these can be accomplished using static methods of the Regex class, without any need to instantiate the Regex class itself. To perform validation, all you must do is create or find the right expression and apply it to your input string using the IsMatch() method of the Regex class. For example, the following function demonstrates how to use a regular expression to validate a ZIP code:
    private void ValidateZipButton_Click(object sender, System.EventArgs e)
    {
        // Five digits, anchored to the whole input.
        String ZipRegex = @"^\d{5}$";
        if (Regex.IsMatch(ZipTextBox.Text, ZipRegex))
        {
            ResultLabel.Text = "ZIP is valid!";
        }
        else
        {
            ResultLabel.Text = "ZIP is invalid!";
        }
    }
Similarly, the static Replace() method can be used to replace matches with a particular string, as this snippet demonstrates:
String newText = Regex.Replace(inputString, pattern, replacementText);
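For a concrete case, the sketch below uses Replace to collapse every run of white space in a string into a single space. The method name is my own.

```csharp
using System.Text.RegularExpressions;

public static class ReplaceDemo
{
    // Replaces each run of spaces, tabs, or line breaks with one space.
    public static string CollapseWhitespace(string input)
    {
        return Regex.Replace(input, @"\s+", " ");
    }
}
```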
Finally, you can iterate through a collection of matches in an input string using code like this:
    private void MatchButton_Click(object sender, System.EventArgs e)
    {
        MatchCollection matches = Regex.Matches(SearchStringTextBox.Text,
            MatchExpressionTextBox.Text);
        MatchCountLabel.Text = matches.Count.ToString();
        MatchesLabel.Text = "";
        foreach (Match match in matches)
        {
            MatchesLabel.Text += "Found " + match.ToString() +
                " at position " + match.Index + ".<br>";
        }
    }

You'll typically need to create an instance of the Regex class when you need anything outside the default behavior, in particular when setting options. For example, to create an instance of Regex that ignores case and pattern white space, and then retrieve the set of matches for that expression, you would use code like the following:

    Regex re = new Regex(pattern,
        RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
    MatchCollection mc = re.Matches(inputString);
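As a fuller sketch of those two options working together, the pattern below is spread over several lines with a comment on each part, and matches its words in any letter case. The class, field, and method names are illustrative assumptions.

```csharp
using System.Text.RegularExpressions;

public static class CommentedPatternDemo
{
    // IgnorePatternWhitespace lets the pattern be laid out and commented;
    // IgnoreCase makes Red, GREEN, and blue all match.
    private static readonly Regex ColorWord = new Regex(@"
        \b                 # start at a word boundary
        (red|green|blue)   # one of three color names
        \b                 # end at a word boundary",
        RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

    // Counts whole-word occurrences of the three color names.
    public static int CountColorWords(string input)
    {
        return ColorWord.Matches(input).Count;
    }
}
```

CountColorWords("Red sky, GREEN grass, blueberry") returns 2, because the word boundary after blue rules out blueberry.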
Complete working versions of these samples are included in the download for this article, as simple ASP.NET pages.
Free Tools
The Regulator (http://royo.is-a-geek.com/iserializable/regulator/) – A regular expression testing tool designed to run client-side, it includes tight integration with RegexLib via web services and provides support for Match, Split, Replace and more. Includes performance analysis and syntax highlighting.
RegexDesigner.NET (http://www.sellsbrothers.com/tools/#regexd) – A powerful visual tool for helping you construct and test regular expressions. Will generate C# and/or VB.NET code and compiled assemblies to help you integrate expressions into your applications.
Regular Expression Workbench (v2.0) (https://www.gotdotnet.com/Community/UserSamples/Details.aspx?SampleGuid=C712F2DF-B026-4D58-8961-4EE2729D7322) – Eric Gunnerson's tool for creating, testing, and studying regular expressions. Has "Examine-o-matic" feature, allowing you to hover the mouse over a regex to decode its meaning.
Advanced Topics
Two regular expression features that really make me have to think are named groups and lookaround processing. Since you'll only need these on rare occasions, I'll only briefly describe these topics here.
With named groups, you can name individual matching groups and then refer to these groups within the expression programmatically. This can be especially powerful when combined with the Replace method as a way of reformatting an input string by re-arranging the order and placement of the elements within the input string. For example, suppose you were given a date in string format of the form MM/DD/YYYY and you wanted it in the form DD-MM-YYYY. You could write an expression to capture the first format, iterate through its Matches collection, parse each string, and use string manipulation to build the replacement string. This would require a fair amount of code and a fair amount of processing. Using named groups, you could accomplish the same thing like so:

    String MDYToDMY(String input)
    {
        // Capture month, day, and year by name, then reorder them in the replacement.
        return Regex.Replace(input,
            @"\b(?<month>\d{1,2})/(?<day>\d{1,2})/(?<year>\d{4})\b",
            "${day}-${month}-${year}");
    }

You can also refer to groups by number as well as by name. In any event such references are collectively referred to as backreferences. Another common use of backreferences is within matching expressions themselves, such as this expression for finding repeated letters: ([a-z])\1. The parentheses capture one letter as group 1, and \1 then requires that same letter to appear again. This will match 'aa', 'bb', 'cc' and is not the same as [a-z]{2} or [a-z][a-z] which are equivalent and would allow 'ab' or 'ac' or any other two-letter combination. Backreferences allow an expression to remember things about parts of the input string it has already parsed and matched.
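Here is a minimal sketch of a backreference inside a matching expression. The letter is wrapped in a capturing group so that \1 has something to refer to. FindDoubledLetters is a hypothetical helper name.

```csharp
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // Returns each pair of identical adjacent lowercase letters in the input.
    // Group 1 captures one letter; \1 requires the same letter again.
    public static string[] FindDoubledLetters(string input)
    {
        MatchCollection matches = Regex.Matches(input, @"([a-z])\1");
        string[] result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            result[i] = matches[i].Value;
        }
        return result;
    }
}
```

For the input bookkeeper this returns oo, kk, and ee.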
"Lookaround processing" refers to positive and negative lookahead and lookbehind capabilities supported by many regular expression engines. Not all regular expression engines support all variations of lookaround processing. These constructs do not consume characters even though they may match them. Some patterns are impossible to describe without lookaround processing, especially ones in which the existence of one part of the pattern depends on the existence of a separate part. The syntax for each flavor of lookaround is described below.
Syntax | Description |
---|---|
(?=...) | Positive Lookahead |
(?!...) | Negative Lookahead |
(?<=...) | Positive Lookbehind |
(?<!...) | Negative Lookbehind |
One example of where lookaround processing is necessary is password validation. Consider a password restriction where the password must be between 4 and 8 characters long, and must contain at least one digit. You could do this by just testing \d for a match and using string operations to test the length, but to do the whole thing in a regular expression requires lookahead. Specifically positive lookahead, as this expression demonstrates: ^(?=.*\d).{4,8}$
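Wrapped in a method, the password check might look like the sketch below. The method name is mine; the expression is the one shown above.

```csharp
using System.Text.RegularExpressions;

public static class PasswordRuleDemo
{
    // The lookahead (?=.*\d) checks for a digit somewhere ahead without
    // consuming anything, and .{4,8} then checks the overall length.
    public static bool IsValidPassword(string input)
    {
        return Regex.IsMatch(input, @"^(?=.*\d).{4,8}$");
    }
}
```

This accepts abc1 and 12345678, and rejects abcd (no digit), a1 (too short), and abcdefgh1 (too long).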
Conclusion
Regular expressions provide a very powerful way to describe patterns in text, making them an excellent resource for string validation and manipulation. The .NET Framework provides first-rate support for regular expressions in its System.Text.RegularExpressions namespace and specifically the Regex class found there. Using the API is simple; coming up with the right regular expression is often the tough part. Luckily, regular expressions are highly reusable, and there are many resources online where you can find expressions designed by others or get help with ones you are struggling to create.
Resources
Regular Expression Library http://regexlib.com/
Regular Expression Discussion List http://aspadvice.com/login.aspx?ReturnUrl=%2fSignUp%2flist.aspx%3fl%3d68%26c%3d16&l=68&c=16
Regular Expression Forums http://forums.regexadvice.com/
Regular Expression Web Logs http://blogs.regexadvice.com/
Mastering Regular Expressions (O'Reilly), by Jeffrey Friedl http://regex.info/
.NET Regular Expression Reference https://msdn.microsoft.com/library/en-us/cpref/html/frlrfSystemTextRegularExpressions.asp
JScript Regular Expression Syntax https://msdn.microsoft.com/library/en-us/script56/html/js56jsgrpregexpsyntax.asp
Regular Expression Info http://www.regular-expressions.info
About the Author
Steven A. Smith, Microsoft ASP.NET MVP, is president and owner of ASPAlliance.com and DevAdvice.com. He is also the owner and head instructor for ASPSmith Ltd, a .NET-focused training company. He has authored two books, the ASP.NET Developer's Cookbook and ASP.NET By Example, as well as articles in MSDN and AspNetPRO magazines. Steve speaks at several conferences each year and is a member of the INETA speaker's bureau. Steve has a Master's degree in Business Administration and a Bachelor of Science degree in Computer Science Engineering.
Steve can be reached at ssmith@aspalliance.com.