VBScript Complex Regex Replace

A few days ago I posted a blog entry on simple regular expression replacements in VBScript. Let me show you a more complex example. It helps to have a purpose, even for a demonstration, so my goal is to convert an HTML table to CSV output using regular expressions. We’re going to need the functions I’ve written about before, but I’ll post them again so you don’t have to go looking for them.

Function RegExReplace(strString,strPattern,strReplace)
    On Error Resume Next                ' Suppress runtime errors, such as one raised by an invalid pattern.
    Dim RegEx
    Set RegEx = New RegExp              ' Create regular expression.
    RegEx.IgnoreCase = True             ' Make case insensitive.
    RegEx.Global=True                   ' Search the entire string.
    RegEx.Pattern=strPattern

    If RegEx.Test(strString) Then       ' Test if match is made.
        RegExReplace = RegEx.Replace(strString, strReplace) ' Make replacement.
    Else
        ' Return original string.
        RegExReplace=strString
    End If
End Function
 
Function RegExMatch(strString,strPattern)
    Dim RegEx
    RegExMatch=False
    
    Set RegEx = New RegExp              
    RegEx.IgnoreCase = True             
    RegEx.Global=True                   
    RegEx.Pattern=strPattern
    
    If RegEx.Test(strString) Then RegExMatch=True
 
End Function 
 
Function GetMatch(strString,strPattern)
    Dim RegEx,colMatches
    Set RegEx = New RegExp              
    RegEx.IgnoreCase = True             
    RegEx.Global=True                   
    RegEx.Pattern=strPattern
    'Execute returns a collection of Match objects
    Set colMatches=RegEx.Execute(strString)
    Set GetMatch=colMatches
End Function
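
As a quick illustration of what the three RegExp methods behind these functions do, here is a minimal sketch. It uses the RegExp object directly so it can run on its own; the sample string and the pattern are made up for the demonstration and are not part of the conversion script.

```vbscript
Option Explicit
Dim re, sample, m

' Hypothetical sample input: two table cells
sample = "<td>C:</td><td>CHAOS</td>"

Set re = New RegExp
re.IgnoreCase = True
re.Global = True
re.Pattern = "<td>([^<]*)</td>"

' Test: reports whether at least one match exists
WScript.Echo CStr(re.Test(sample))

' Replace: every match is replaced because Global is True;
' $1 refers to the text captured by the first pair of parentheses
WScript.Echo re.Replace(sample, "[$1]")     ' [C:][CHAOS]

' Execute: returns a collection of Match objects
For Each m In re.Execute(sample)
    WScript.Echo m.Value & " -> " & m.SubMatches(0)
Next
```

Test is what RegExMatch wraps, Replace is what RegExReplace wraps, and Execute is what GetMatch wraps.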

Let’s dig in. Here’s the table I want to parse.

DeviceID  Size          FreeSpace     Volumename   SystemName
C:        80024170496   8556191744                 CHAOS
F:        80023715840   71938871296   EDGE_DISKGO  CHAOS
L:        80024170496   8556191744                 CHAOS
Q:        250057060352  136200368128  New Volume   CHAOS

I’ll begin by reading in the contents of the html file and saving it to a variable.

Set objfso=CreateObject("Scripting.FileSystemObject")
Set objFile=objfso.OpenTextFile("c:\test\drives.html")
 
'ReadAll reads the whole file in one call; skip it if the file is empty
If Not objFile.AtEndOfStream Then html=objFile.ReadAll()
 
objFile.Close

The next step is to strip out just the table. Using a regular expression, I find the text that matches everything between and including the table tags.

'get just the table
If RegExMatch(html,"<table[^>]*>([\S\s]*?)</table>") Then
    Set matches=GetMatch(html,"<table[^>]*>([\S\s]*?)</table>")
    For Each match In matches
        tableText=Trim(match.value)
    Next
End If
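
If you only want what sits between the table tags, the parentheses in the pattern already capture it. The sketch below is one way to read that capture through the Match object's SubMatches collection. The html string is a small made-up stand-in for the file contents, and Global is set to False on the assumption that only the first table matters.

```vbscript
Option Explicit
Dim re, matches, html, innerText

' Hypothetical sample input standing in for the contents of drives.html
html = "<html><body><table border=1>" & vbCrLf & _
       "<tr><th>DeviceID</th><th>SystemName</th></tr>" & vbCrLf & _
       "<tr><td>C:</td><td>CHAOS</td></tr>" & vbCrLf & _
       "</table></body></html>"

Set re = New RegExp
re.IgnoreCase = True
re.Global = False          ' assumption: only the first table is needed
re.Pattern = "<table[^>]*>([\S\s]*?)</table>"

Set matches = re.Execute(html)
If matches.Count > 0 Then
    ' SubMatches(0) holds the text captured by the first pair of parentheses,
    ' that is, the rows without the surrounding table tags
    innerText = Trim(matches(0).SubMatches(0))
    WScript.Echo innerText
End If
```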

All that’s left at this point is to get rid of unnecessary tags like TR and convert the TH and TD tags into quoted, comma-delimited values. Because my RegExReplace function sets Global to True, each call replaces every match of its pattern in the string.

'strip off <tr> tags
tableText = RegexReplace(tableText,"</?tr[^>]*>","")
'convert <th></th> to ","
tableText = RegexReplace(tableText,"</th><th>",CHR(34) & "," & CHR(34))
'convert <th> or </th> to "
tableText = RegexReplace(tableText,"<th>|</th>",CHR(34))
'convert </td><td> to ","
tableText = RegexReplace(tableText,"</td><td>",CHR(34) & "," & CHR(34))
'convert <td> or </td> to "
tableText = RegexReplace(tableText,"<td>|</td>",CHR(34))
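
To see how these replacements build up a CSV line, here is a small sketch that applies the same patterns to a single row and echoes the result after each step. ReplaceAll is a hypothetical, cut-down stand-in for RegExReplace so the snippet runs on its own, and the row is sample data.

```vbscript
Option Explicit

' Hypothetical compact stand-in for the RegExReplace function
Function ReplaceAll(text, pattern, replacement)
    Dim re
    Set re = New RegExp
    re.IgnoreCase = True
    re.Global = True
    re.Pattern = pattern
    ReplaceAll = re.Replace(text, replacement)
End Function

Dim row, q
q = Chr(34)   ' a double quote character
row = "<tr><td>F:</td><td>80023715840</td><td>EDGE_DISKGO</td></tr>"

' Step 1: strip the row tags
row = ReplaceAll(row, "</?tr[^>]*>", "")
WScript.Echo row   ' <td>F:</td><td>80023715840</td><td>EDGE_DISKGO</td>

' Step 2: turn each cell boundary into ","
row = ReplaceAll(row, "</td><td>", q & "," & q)
WScript.Echo row   ' <td>F:","80023715840","EDGE_DISKGO</td>

' Step 3: turn the remaining outer cell tags into quotes
row = ReplaceAll(row, "<td>|</td>", q)
WScript.Echo row   ' "F:","80023715840","EDGE_DISKGO"
```

The order matters: the cell boundaries have to be converted before the leftover opening and closing tags, otherwise every boundary would become two quotes with no comma between them.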

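It’s possible there might be some tags still in my tableText variable, such as the table tags themselves, so I’ll process it one more time, looking for any remaining HTML tag and replacing it with an empty string (""). Note that the negative lookahead in this pattern deliberately skips A, B, I and U tags, which makes no difference for this table.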

'strip off any remaining tags
tableText = RegexReplace(tableText,"<(?![!/]?[ABIU][>\s])[^>]*>","")
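
The lookahead in that pattern is easier to follow with a concrete input. This sketch runs the same pattern against a made-up string containing a mix of tags; the sample text is purely illustrative.

```vbscript
Option Explicit
Dim re, sample

' Hypothetical sample with tags the pattern removes and tags it keeps
sample = "<p><b>CHAOS</b> <font color=red>C:</font> <a href=x>link</a></p>"

Set re = New RegExp
re.IgnoreCase = True
re.Global = True
' (?!...) is a negative lookahead: the match is rejected when the tag name
' is A, B, I or U (optionally preceded by ! or /) followed by > or whitespace
re.Pattern = "<(?![!/]?[ABIU][>\s])[^>]*>"

WScript.Echo re.Replace(sample, "")   ' <b>CHAOS</b> C: <a href=x>link</a>
```

If you want every tag gone regardless of its name, a simpler pattern such as "<[^>]*>" would do that instead.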

Now the tricky part. If I look at tableText there will be blank lines wherever a line held nothing but tags that I removed. Plus, if I want to save the output to a text file I need some way to process this variable line by line. My solution is to split it into an array and enumerate it, only displaying lines with a length greater than 0.

'turn remaining text into an array
arrText=Split(tabletext,VbCrLf)
 
'strip out blank lines
For i=0 To UBound(arrText)
    if Len(arrText(i)) >0 Then
        'or send output to a text file
        WScript.Echo arrText(i)
    End if
Next    
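
If you would rather save the result than echo it, something along these lines should work. It is only a sketch: the output path c:\test\drives.csv is an assumption, and the array is sample data standing in for the lines produced by the conversion.

```vbscript
Option Explicit
Const ForWriting = 2
Dim fso, outFile, arrText, line

' Sample data standing in for the converted lines, including one blank line
arrText = Array("""DeviceID"",""SystemName""", "", """C:"",""CHAOS""")

Set fso = CreateObject("Scripting.FileSystemObject")
' Assumed output path; the third argument (True) creates the file if it is missing
Set outFile = fso.OpenTextFile("c:\test\drives.csv", ForWriting, True)

For Each line In arrText
    ' Skip blank lines, write everything else
    If Len(Trim(line)) > 0 Then outFile.WriteLine line
Next

outFile.Close
```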

When I run my script I get output like this:

"DeviceID","Size","FreeSpace","Volumename","SystemName"
"C:","80024170496","8556191744","","CHAOS"
"F:","80023715840","71938871296","EDGE_DISKGO","CHAOS"
"L:","80024170496","8556191744","","CHAOS"
"Q:","250057060352","136200368128","New Volume","CHAOS"

Now before you think I’m some Regex guru (not by any means), I didn’t come up with any of the more complex regular expression patterns. Instead I went to my favorite site for this sort of thing, RegexLib.com. Fortunately many people have already done the hard work of developing regular expression patterns for all sorts of things. A little search and copy/paste and I’m in business. Because regular expressions work much the same just about everywhere, you can use these expressions, with at most minor adjustments for the local regex flavor, in VBScript, PowerShell, PHP, Perl or probably anything you happen to be working in.

Download a text file with code from this entry here.

As always, if you need help with regular expression scripts or any other scripting problem please join me in the forums at ScriptingAnswers.com.  Oh…don’t forget there is an entire chapter on using the REGEX object in VBScript in WSH and VBScript Core: TFM.