Managing String Manipulation with Regular Expressions
Posted in Technology by UpperStrata on December 23, 2009
Searching data for particular strings is a task most developers have faced. The standard String methods in the Microsoft .NET Framework are sufficient, but they are tedious to use in certain situations, and the problem is magnified in projects that require a great deal of string manipulation and processing. A simple example:
// example of finding values in between patterns
string origStr = "<Voyage ID=\"REG080910\"><Sync Code=\"P\"/><Sync Code=\"E\"/></Voyage><Voyage ID=\"INS071205\" />";
string tempStr = origStr;
string foundStr = "";
int cntStart = 0;
int cntEnd = 0;
do
{
    cntStart = tempStr.IndexOf("ID=\"");
    if (cntStart < 0) break;
    cntStart += 4; // skip past ID=" to the first character of the value
    cntEnd = tempStr.IndexOf("\"", cntStart); // closing quote of the value
    if (cntEnd < 0) break;
    foundStr = tempStr.Substring(cntStart, cntEnd - cntStart); // value between the quotes
    tempStr = tempStr.Substring(cntEnd + 1); // continue searching after the closing quote
} while (!String.IsNullOrEmpty(tempStr));
This kind of code is often repeated ad nauseam, once for each string to be retrieved. In projects that require a great deal of string manipulation and processing, the volume of such code quickly becomes unmanageable and unmaintainable. A more elegant solution is required.
Enter Regular Expressions
Because the task is fundamentally one of detecting patterns in strings, regular expressions are a natural foundation for our solution. If you are unfamiliar with regular expressions, a great application is available to help you learn.
The regular expression “Named Groups” concept is a very useful tool in matching particular patterns within strings. The “Named Groups” syntax is as follows:
(?<groupName>Pattern)
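To see the syntax in isolation, here is a minimal sketch that uses named groups to pull the parts out of a date string. The date format, the group names (year, month, day), and the NamedGroupDemo class are illustrative assumptions and are not part of the voyage example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Hypothetical pattern: three named groups separated by hyphens.
    private static readonly Regex DatePattern =
        new Regex(@"(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})");

    // Returns the text captured by the named group,
    // or an empty string when the input does not match.
    public static string Extract(string input, string groupName)
    {
        Match m = DatePattern.Match(input);
        return m.Success ? m.Groups[groupName].Value : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(Extract("Posted on 2009-12-23", "year"));  // 2009
        Console.WriteLine(Extract("Posted on 2009-12-23", "month")); // 12
    }
}
```

Each group is read back by name rather than by position, so the pattern can be rearranged later without breaking the code that consumes it.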
An implementation example follows below:
// Requires: using System.Text.RegularExpressions;
Regex pattern = new Regex("<Voyage ID=\"(?<voyage>[a-z0-9]*)\"|<PortCode>(?<port>[a-z0-9]*)", RegexOptions.IgnoreCase);
string sampleStr = "<Voyage ID=\"REG080910\"><Sync Code=\"P\"/><Sync Code=\"E\"/></Voyage><Voyage ID=\"INS071205\" />";
MatchCollection matches = pattern.Matches(sampleStr);
foreach (Match m in matches)
{
    // m.Groups["groupName"].Value retrieves the value captured by that group
    if (m.Groups["voyage"].Success)
    {
        // do something with m.Groups["voyage"].Value
    }
    if (m.Groups["port"].Success)
    {
        // do something with m.Groups["port"].Value
    }
}
The example regular expression above has two named groups: “voyage” and “port”. In plain language, it matches either of two patterns:
<Voyage ID="some alphanumeric"
or
<PortCode>some alphanumeric
The MatchCollection returned by the regular expression can be enumerated and processed by checking the Success property of each named group; because the two patterns are alternatives, each match populates only one of them. As you can see, this approach is very clean and easy to read. It definitely beats determining lengths, calculating positions, and writing extraneous looping code to obtain the required data.
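As a rough, self-contained sketch of how the “do something” placeholders might be filled in, the following collects every value captured by a given named group into a list. The VoyageParser class and its Collect method are hypothetical names, and the <PortCode> element in the sample string is an assumed addition so that the “port” group has something to match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class VoyageParser
{
    // Same pattern as in the example above: one alternative per named group.
    private static readonly Regex Pattern = new Regex(
        "<Voyage ID=\"(?<voyage>[a-z0-9]*)\"|<PortCode>(?<port>[a-z0-9]*)",
        RegexOptions.IgnoreCase);

    // Returns every value captured by the given named group, in order of appearance.
    public static List<string> Collect(string input, string groupName)
    {
        var values = new List<string>();
        foreach (Match m in Pattern.Matches(input))
        {
            Group g = m.Groups[groupName];
            if (g.Success)
            {
                values.Add(g.Value);
            }
        }
        return values;
    }

    public static void Main()
    {
        // The <PortCode> element is a hypothetical addition to the sample data.
        string sample = "<Voyage ID=\"REG080910\"><PortCode>MIA</PortCode></Voyage><Voyage ID=\"INS071205\" />";
        Console.WriteLine(string.Join(",", Collect(sample, "voyage"))); // REG080910,INS071205
        Console.WriteLine(string.Join(",", Collect(sample, "port")));   // MIA
    }
}
```

Compared with the IndexOf loop at the start of the post, there are no offsets to calculate; supporting another element would only mean adding another alternative and group name to the pattern.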
Bonus
Regular expressions are common to many programming languages, so this approach is easily ported. Be careful, however: the syntax for “Named Groups” differs slightly between languages. .NET writes (?<name>...), for example, while Python writes (?P<name>...). This is easily remedied by changing a few characters and should not deter a developer from trying out this feature. You can read more about this here: http://www.regular-expressions.info/named.html
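To illustrate how small these syntax differences are, the sketch below runs the same capture written in the two named-group forms that .NET accepts, angle brackets and apostrophes. The NamedGroupSyntax class and FirstVoyage method are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupSyntax
{
    // Returns the "voyage" group captured by the supplied pattern,
    // or an empty string when nothing matches.
    public static string FirstVoyage(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["voyage"].Value : string.Empty;
    }

    public static void Main()
    {
        string sample = "<Voyage ID=\"INS071205\" />";

        // Angle-bracket form, as used throughout this post.
        Console.WriteLine(FirstVoyage("ID=\"(?<voyage>[a-z0-9]*)\"", sample)); // INS071205

        // Apostrophe form, also accepted by .NET.
        Console.WriteLine(FirstVoyage("ID=\"(?'voyage'[a-z0-9]*)\"", sample)); // INS071205

        // Python, by contrast, would write the group as (?P<voyage>[a-z0-9]*).
    }
}
```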