VBScript Complex Regex Replace

A few days ago I posted a blog entry on simple regular expression replacements in VBScript. Let me show you a more complex example. It helps to have a purpose, even for a demonstration, so my goal is to convert an HTML table to CSV output using regular expressions. We’re going to need the functions I’ve written about before, but I’ll post them again so you don’t have to go looking for them.

Function RegExReplace(strString,strPattern,strReplace)
    On Error Resume Next
    Dim RegEx
    Set RegEx = New RegExp              ' Create regular expression.
    RegEx.Pattern = strPattern          ' Set the pattern to search for.
    RegEx.IgnoreCase = True             ' Make case insensitive.
    RegEx.Global = True                 ' Search the entire string.
    If RegEx.Test(strString) Then       ' Test if a match is made.
        RegExReplace = RegEx.Replace(strString, strReplace) ' Make replacement.
    Else
        RegExReplace = strString        ' No match: return original string.
    End If
End Function
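Function RegExMatch(strString,strPattern)
    Dim RegEx
    RegExMatch = False                  ' Default when nothing matches.
    Set RegEx = New RegExp
    RegEx.Pattern = strPattern          ' Set the pattern to search for.
    RegEx.IgnoreCase = True
    If RegEx.Test(strString) Then RegExMatch = True
End Function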
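Function GetMatch(strString,strPattern)
    Dim RegEx,colMatches
    Set RegEx = New RegExp
    RegEx.Pattern = strPattern          ' Set the pattern to search for.
    RegEx.IgnoreCase = True
    ' Global is left off, so the collection holds at most the first match.
    Set colMatches=RegEx.Execute(strString)
    Set GetMatch=colMatches
End Function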
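
The helpers above differ mainly in whether they switch on the Global flag, so here is a minimal sketch of what that flag changes. It uses the RegExp object directly; the sample string and the \d+ pattern are made up for illustration and are not part of the table script.

```vbscript
Option Explicit
Dim re
Set re = New RegExp
re.Pattern = "\d+"          ' illustrative pattern: one or more digits
re.IgnoreCase = True

re.Global = False           ' stop after the first match
WScript.Echo re.Replace("a1b22c333", "#")    ' a#b22c333

re.Global = True            ' keep going through the whole string
WScript.Echo re.Replace("a1b22c333", "#")    ' a#b#c#
WScript.Echo re.Execute("a1b22c333").Count   ' 3
```

This is why RegExReplace sets Global to True: without it only the first tag in the table would be replaced.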

Let’s dig in. Here’s the table I want to parse.

DeviceID  Size          FreeSpace     Volumename   SystemName
C:        80024170496   8556191744                 CHAOS
F:        80023715840   71938871296   EDGE_DISKGO  CHAOS
L:        80024170496   8556191744                 CHAOS
Q:        250057060352  136200368128  New Volume   CHAOS

I’ll begin by reading in the contents of the html file and saving it to a variable.

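Set objfso=CreateObject("Scripting.FileSystemObject")
Set objFile=objfso.OpenTextFile("c:\test\drives.html")
Do While objFile.AtEndOfStream <> True
    'append each line to the html variable, keeping the line break
    html = html & objFile.ReadLine & vbCrLf
Loop
objFile.Close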
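
If you would rather not loop, the TextStream object can also hand back the whole file in one call. This is only a sketch of that alternative, not the approach used in the rest of this entry; the FileExists and AtEndOfStream guards are my additions so the snippet does not fail on a missing or empty file.

```vbscript
Option Explicit
Const ForReading = 1
Dim fso, file, html, path
path = "c:\test\drives.html"     ' same sample path as above
Set fso = CreateObject("Scripting.FileSystemObject")
html = ""
If fso.FileExists(path) Then
    Set file = fso.OpenTextFile(path, ForReading)
    ' ReadAll raises an error on an empty file, so check first
    If Not file.AtEndOfStream Then html = file.ReadAll
    file.Close
    WScript.Echo Len(html) & " characters read"
Else
    WScript.Echo "File not found: " & path
End If
```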

The next step is to strip out just the table. Using a regular expression, I find the text that matches everything between and including the table tags.

'get just the table
If RegExMatch(html,"<table[^>]*>([\S\s]*?)</table>") Then
    Set matches=GetMatch(html,"<table[^>]*>([\S\s]*?)</table>")
    For Each match In matches
        'keep the matched text, table tags included
        tableText = match.Value
    Next
End If
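
The table pattern also contains a capture group, ([\S\s]*?), so the text between the table tags is available separately through the match's SubMatches collection. Here is a small sketch of the difference between the two; the sample HTML string is invented for the demonstration.

```vbscript
Option Explicit
Dim re, matches, sample
' made-up page fragment with one small table
sample = "<p>intro</p><table border=""1""><tr><td>C:</td></tr></table><p>end</p>"
Set re = New RegExp
re.Pattern = "<table[^>]*>([\S\s]*?)</table>"
re.IgnoreCase = True
Set matches = re.Execute(sample)
If matches.Count > 0 Then
    ' Value is the whole match, table tags included
    WScript.Echo matches(0).Value           ' <table border="1"><tr><td>C:</td></tr></table>
    ' SubMatches(0) is only what the parentheses captured
    WScript.Echo matches(0).SubMatches(0)   ' <tr><td>C:</td></tr>
End If
```

Either one works for this script, since any leftover table tags get stripped in a later step anyway.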

All that’s left at this point is to get rid of unnecessary tags like TR and convert the TH and TD tags into quotes and commas. Each step is a single call to my RegExReplace function, which replaces every match of the pattern in the string.

'strip off <tr> tags
tableText = RegexReplace(tableText,"</?tr[^>]*>","")
'convert </th><th> to ","
tableText = RegexReplace(tableText,"</th><th>",CHR(34) & "," & CHR(34))
'convert <th> or </th> to "
tableText = RegexReplace(tableText,"<th>|</th>",CHR(34))
'convert </td><td> to ","
tableText = RegexReplace(tableText,"</td><td>",CHR(34) & "," & CHR(34))
'convert <td> or </td> to "
tableText = RegexReplace(tableText,"<td>|</td>",CHR(34))
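
To make the order of those replacements easier to follow, here is one row traced through the same three steps. This is a sketch only: the Swap function is a hypothetical stand-in for RegExReplace so the snippet runs on its own, and the row is a shortened version of the sample data.

```vbscript
Option Explicit

' Hypothetical stand-in for RegExReplace so this runs by itself
Function Swap(text, pattern, replacement)
    Dim re
    Set re = New RegExp
    re.Pattern = pattern
    re.IgnoreCase = True
    re.Global = True
    Swap = re.Replace(text, replacement)
End Function

Dim q, row
q = Chr(34)   ' a double quote
row = "<tr><td>F:</td><td>80023715840</td><td>CHAOS</td></tr>"
row = Swap(row, "</?tr[^>]*>", "")          ' <td>F:</td><td>80023715840</td><td>CHAOS</td>
row = Swap(row, "</td><td>", q & "," & q)   ' <td>F:","80023715840","CHAOS</td>
row = Swap(row, "<td>|</td>", q)            ' "F:","80023715840","CHAOS"
WScript.Echo row
```

The cell boundaries have to be converted before the lone tags, otherwise there is nothing left to tell the script where the commas go. Note that these patterns only match bare tags; cells written with attributes, such as a td with an align setting, would need a looser pattern.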

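It’s possible there might be some tags still in my tableText variable, so I’ll process it one more time, looking for any remaining HTML tag and replacing it with an empty string (""). The pattern I’m using deliberately leaves A, B, I and U tags alone, which doesn’t matter for this table.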

'strip off any remaining tags (the pattern skips <a>, <b>, <i> and <u>)
tableText = RegexReplace(tableText,"<(?![!/]?[ABIU][>\s])[^>]*>","")
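
That last pattern relies on a negative lookahead, (?!...), which makes the match fail when the tag name is A, B, I or U followed by a space or a closing bracket. A short sketch on an invented string shows which tags survive.

```vbscript
Option Explicit
Dim re, sample, cleaned
' made-up markup mixing tags the pattern strips and tags it keeps
sample = "<p>See <b>bold</b> and <a href=""x"">link</a></p><br>"
Set re = New RegExp
re.Pattern = "<(?![!/]?[ABIU][>\s])[^>]*>"
re.IgnoreCase = True
re.Global = True
cleaned = re.Replace(sample, "")
' <p>, </p> and <br> are removed; <b> and <a> tags are left in place
WScript.Echo cleaned    ' See <b>bold</b> and <a href="x">link</a>
```

If you really do want every tag gone, a plain <[^>]*> pattern would be the simpler choice.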

Now the tricky part. If I look at tableText there will be blank lines left behind where tags that sat on their own line were removed. Plus, if I want to save the output to a text file, I need some way to parse this variable. My solution is to turn it into an array and enumerate it, only displaying lines with a length greater than 0.

'turn remaining text into an array
arrText = Split(tableText, vbCrLf)
'strip out blank lines
For i=0 To UBound(arrText)
    If Len(arrText(i)) > 0 Then
        'or send output to a text file
        WScript.Echo arrText(i)
    End If
Next
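
Sending the lines to a file instead of the screen only takes a few more lines. Here is a minimal sketch, assuming an output file named drives.csv in the temporary folder (both are my choices for the example) and a small hand-built array standing in for arrText.

```vbscript
Option Explicit
Const TemporaryFolder = 2
Dim fso, outFile, outPath, arrText, q, i
q = Chr(34)
' stand-in data: two CSV lines with a blank entry between them
arrText = Array(q & "C:" & q & "," & q & "CHAOS" & q, "", q & "F:" & q & "," & q & "CHAOS" & q)

Set fso = CreateObject("Scripting.FileSystemObject")
' assumed output location; change it to wherever you want the CSV
outPath = fso.BuildPath(fso.GetSpecialFolder(TemporaryFolder).Path, "drives.csv")
Set outFile = fso.CreateTextFile(outPath, True)   ' True = overwrite existing file

For i = 0 To UBound(arrText)
    ' skip the blank entries, write everything else
    If Len(arrText(i)) > 0 Then outFile.WriteLine arrText(i)
Next
outFile.Close
WScript.Echo "Wrote " & outPath
```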

When I run my script I get output like this:

"Q:","250057060352","136200368128","New Volume","CHAOS"

Now before you think I’m some Regex guru (not by any means), I didn’t come up with any of the more complex regular expression patterns. Instead I went to my favorite site for this sort of thing, RegexLib.com. Fortunately many people have already done the hard work of developing regular expression patterns for all sorts of things. A little searching and copy/paste and I’m in business. Because regular expressions work much the same just about everywhere, you can usually use these expressions in VBScript, PowerShell, PHP, Perl or probably anything you happen to be working in, though it’s worth testing them since each regex flavor differs in the details.


As always, if you need help with regular expression scripts or any other scripting problem, please join me in the forums at ScriptingAnswers.com. Oh…don’t forget there is an entire chapter on using the RegExp object in VBScript in WSH and VBScript Core: TFM.