A few days ago I posted a blog entry on simple regular expression replacements in VBScript. Let me show you a more complex example. It helps to have a purpose, even for a demonstration, so my goal is to convert an HTML table to CSV output using regular expressions. We’re going to need the functions I’ve written about before, but I’ll post them again so you don’t have to go looking for them.
Function RegExReplace(strString,strPattern,strReplace)
    On Error Resume Next
    Dim RegEx
    Set RegEx = New RegExp ' Create regular expression.
    RegEx.IgnoreCase = True ' Make case insensitive.
    RegEx.Global=True 'Search the entire String
    RegEx.Pattern=strPattern
    If RegEx.Test(strString) Then 'Test if match is made
        RegExReplace = RegEx.Replace(strString, strReplace) ' Make replacement.
    Else
        'return original string
        RegExReplace=strString
    End If
End Function

Function RegExMatch(strString,strPattern)
    Dim RegEx
    RegExMatch=False
    Set RegEx = New RegExp
    RegEx.IgnoreCase = True
    RegEx.Global=True
    RegEx.Pattern=strPattern
    If RegEx.Test(strString) Then RegExMatch=True
End Function

Function GetMatch(strString,strPattern)
    Dim RegEx,colMatches
    Set RegEx = New RegExp
    RegEx.IgnoreCase = True
    RegEx.Global=True
    RegEx.Pattern=strPattern
    'return the whole Matches collection to the caller
    Set colMatches=RegEx.Execute(strString)
    Set GetMatch=colMatches
End Function
Let’s dig in. Here’s the table I want to parse.
DeviceID | Size | FreeSpace | Volumename | SystemName |
---|---|---|---|---|
C: | 80024170496 | 8556191744 | | CHAOS |
F: | 80023715840 | 71938871296 | EDGE_DISKGO | CHAOS |
L: | 80024170496 | 8556191744 | | CHAOS |
Q: | 250057060352 | 136200368128 | New Volume | CHAOS |
I’ll begin by reading in the contents of the html file and saving it to a variable.
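Set objfso=CreateObject("Scripting.FileSystemObject")
Set objFile=objfso.OpenTextFile("c:\test\drives.html")
'ReadAll takes the whole file in one call, so no loop is needed;
'the check only guards against an empty file
If Not objFile.AtEndOfStream Then
    html=objFile.ReadAll()
End If
objFile.Close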
The next step is to strip out just the table. Using a regular expression, I find the text that matches everything between and including the table tags.
'get just the table
If RegExMatch(html,"<table[^>]*>([\S\s]*?)</table>") Then
    Set matches=GetMatch(html,"<table[^>]*>([\S\s]*?)</table>")
    For Each match In matches
        tableText=Trim(match.value)
    Next
End If
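The table pattern also contains a capture group, ([\S\s]*?), that the script never reads. As a minimal sketch, assuming you want only the rows inside the table and not the table tags themselves, the group can be read through the match's SubMatches collection. The sample HTML string is made up for illustration and is not the real drives.html.

```vbscript
Option Explicit
Dim re, matches, html, inner

' Hypothetical sample input, not the real drives.html
html = "<html><body><table border=""1"">" & vbCrLf & _
       "<tr><td>C:</td><td>CHAOS</td></tr>" & vbCrLf & _
       "</table></body></html>"

Set re = New RegExp
re.IgnoreCase = True
re.Global = False            ' the first table is enough for this sketch
re.Pattern = "<table[^>]*>([\S\s]*?)</table>"

Set matches = re.Execute(html)
If matches.Count > 0 Then
    ' .Value holds the whole match, table tags included;
    ' SubMatches(0) holds only what the first group captured
    inner = matches(0).SubMatches(0)
    WScript.Echo inner
End If
```

Using the group would make the later "strip off any remaining tags" step do a little less work, but either approach ends up with the same rows.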
All that’s left at this point is to get rid of unnecessary tags like TR and convert the TH and/or TD tags into quotes and commas. Each step is a call to my RegexReplace function, which makes the replacement wherever the pattern matches.
'strip off <tr> tags
tableText = RegexReplace(tableText,"</?tr[^>]*>","")
'convert </th><th> to ","
tableText = RegexReplace(tableText,"</th><th>",CHR(34) & "," & CHR(34))
'convert <th> or </th> to "
tableText = RegexReplace(tableText,"<th>|</th>",CHR(34))
'convert </td><td> to ","
tableText = RegexReplace(tableText,"</td><td>",CHR(34) & "," & CHR(34))
'convert <td> or </td> to "
tableText = RegexReplace(tableText,"<td>|</td>",CHR(34))
It’s possible there might be some tags still in my tableText variable, so I’ll process it one more time, looking for any remaining HTML tag (the pattern deliberately leaves A, B, I and U tags alone) and replacing it with a blank (“”).
'strip off any remaining tags
tableText = RegexReplace(tableText,"<(?![!/]?[ABIU][>\s])[^>]*>","")
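The negative lookahead in that last pattern is what spares a few tags. Here is a small sketch, using a made-up sample line, that shows which tags the pattern removes and which it skips.

```vbscript
Option Explicit
Dim re, sample

Set re = New RegExp
re.IgnoreCase = True
re.Global = True
' (?!...) rejects a tag whose name is A, B, I or U, with an optional
' leading ! or /, followed by > or whitespace; everything else is matched
re.Pattern = "<(?![!/]?[ABIU][>\s])[^>]*>"

' Hypothetical sample line, not taken from the real table
sample = "<span class=""x""><b>New</b> <a href=""#"">Volume</a></span>"

' The span tags are removed; the b and a tags are kept
WScript.Echo re.Replace(sample, "")
```

If you want plain CSV with no markup at all, a simpler pattern such as "<[^>]*>" would remove every tag, at the cost of losing links and bold text.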
Now the tricky part. If I look at tableText there will be blank lines left behind by the tags I removed, including some at the end. Plus if I wanted to save the output to a text file I need some way to parse this variable. My solution was to turn it into an array and enumerate it, only displaying lines with a length greater than 0.
'turn remaining text into an array
arrText=Split(tableText,VbCrLf)
'strip out blank lines
'loop through UBound itself so the last element is not skipped
For i=0 To UBound(arrText)
    If Len(arrText(i)) >0 Then
        'or send output to a text file
        WScript.Echo arrText(i)
    End If
Next
When I run my script I get output like this:
"DeviceID","Size","FreeSpace","Volumename","SystemName"
"C:","80024170496","8556191744","","CHAOS"
"F:","80023715840","71938871296","EDGE_DISKGO","CHAOS"
"L:","80024170496","8556191744","","CHAOS"
"Q:","250057060352","136200368128","New Volume","CHAOS"
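The comment inside the display loop mentions sending the output to a text file instead. A minimal sketch of that variation follows. The output path c:\test\drives.csv and the stand-in value of tableText are assumptions for illustration only.

```vbscript
Option Explicit
Dim fso, outFile, arrText, i, tableText

' Stand-in for the tableText built earlier in the script
tableText = """DeviceID"",""Size""" & vbCrLf & vbCrLf & _
            """C:"",""80024170496"""

Set fso = CreateObject("Scripting.FileSystemObject")
' Hypothetical output path; True overwrites an existing file
Set outFile = fso.CreateTextFile("c:\test\drives.csv", True)

arrText = Split(tableText, vbCrLf)
For i = 0 To UBound(arrText)
    ' Skip the blank lines left behind by removed tags
    If Len(Trim(arrText(i))) > 0 Then
        outFile.WriteLine arrText(i)
    End If
Next
outFile.Close
```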
Now before you think I’m some Regex guru (not by any means), I didn’t come up with any of the more complex regular expression patterns. Instead I went to my favorite site for this sort of thing, RegexLib.com. Fortunately many people have already done the hard work of developing regular expression patterns for all sorts of things. A little search and copy/paste and I’m in business. Because regular expressions work much the same just about everywhere, you can use these expressions, sometimes with small syntax changes, in VBScript, PowerShell, PHP, Perl or probably anything you happen to be working in.
Download a text file with code from this entry here.
As always, if you need help with regular expression scripts or any other scripting problem please join me in the forums at ScriptingAnswers.com. Oh…don’t forget there is an entire chapter on using the REGEX object in VBScript in WSH and VBScript Core: TFM.
Excellent, Jeff. I hope you keep going and don’t give up on explaining this. We scripters should become proficient in regular expressions as they can save a tremendous amount of coding time and can do some things that are not really possible with linear code.
There is a small bug in your code. It will only work on the version of the HTML you tested with, for a couple of reasons. First, the patterns do not allow for line breaks, so the line terminators will break the match logic if table elements cross line boundaries. My first attempt to run this showed that it was missing the match because it was spread across two lines in my copy of the table. In many cases the td and th pairs were separated by one or more spaces. In HTML it is permissible for spaces and line formatters to appear between tags in any number without breaking the HTML. This is the same as the “C”, “C#” and “C++” specifications, which allow us to make the code look any way that suits our needs.
is the same as or or
All line format characters are ignored by the HTML parsers but may not be ignored by RegEx match logic.
From years of parsing HTML and C code for various reasons I have learned that we need to normalize the line formatters. The easiest way to do this is to strip all of them. There is RegEx code on RegexLib.com that will do this in one transform, but I have done it in two to show what is happening.
First remove all line feeds and carriage returns. This will prevent future multiline issues, although you could also turn on multiline mode (RegEx.MultiLine=True). Note that this is a separate setting from RegEx.Global=True, which you already have enabled and which only makes the pattern apply to every match instead of just the first. Stripping the line breaks will also require a rethink of your match logic in some cases, as matches will now cross line boundaries. Your table pattern already copes with line breaks because it uses [\S\s], but I prefer stripping newlines anyway and you will see why shortly.
Remove all “tab” characters and space characters, except space characters in the middle of words that occur in the text area of the tags.
Example: My Empty Value needs to be My Empty Value. There are many regexes to do this in various ways depending on the application.
Replace all </tr> elements with vbCrLf, which will always work correctly if you do the above first.
After this the remaining conversion steps will work correctly most of the time. Remember that this will only work for simple tables. Tables with style or other formatting will have to be stripped further first. There are numerous match scripts that will remove all attributes or convert them in some way as needed. The above will help ensure that they work correctly, as many fail to factor in the line formatter issues.
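To make the settings mentioned above concrete, here is a minimal sketch (the sample text is made up) of what Global and MultiLine each control in the VBScript RegExp object, and why a dot will not match across a line break.

```vbscript
Option Explicit
Dim re, text

' Made-up two-line sample
text = "row one" & vbCrLf & "row two"

Set re = New RegExp
re.Global = True          ' find every match, not just the first
re.Pattern = "^row"

re.MultiLine = False
WScript.Echo re.Execute(text).Count   ' 1: ^ matches only at the start of the string

re.MultiLine = True
WScript.Echo re.Execute(text).Count   ' 2: ^ also matches after the line break

' MultiLine does not change the dot, which never matches a line feed
re.Pattern = "one.*two"
WScript.Echo re.Test(text)            ' no match: the dot stops at the line break
re.Pattern = "one[\S\s]*two"
WScript.Echo re.Test(text)            ' match: [\S\s] crosses the line break
```

This is why a pattern written with [\S\s] can span lines without any mode switch, while a literal pattern such as "</td><td>" still fails as soon as a line break or space sits between the two tags.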
Here is my adjusted version of your code which will work with a few more variations but still not all. Notice that only one output statement is needed as all formatting necessary is already in the text.
' remove all line enders to prevent cross line match failures
tableText = RegexReplace(tableText,vbCrLf,"")
' remove all blocks of two or more spaces
tableText = RegexReplace(tableText," {2,}","")
'strip off <tr> tags
tableText = RegexReplace(tableText,"<tr>","")
' convert </tr> to vbCrLf newline chars
tableText = RegexReplace(tableText,"</tr>",vbCrLf)
'convert </th><th> to ","
tableText = RegexReplace(tableText,"</th><th>",CHR(34) & "," & CHR(34))
'convert <th> or </th> to "
tableText = RegexReplace(tableText,"<th>|</th>",CHR(34))
'convert </td><td> to ","
tableText = RegexReplace(tableText,"</td><td>",CHR(34) & "," & CHR(34))
'convert <td> or </td> to "
tableText = RegexReplace(tableText,"<td>|</td>",CHR(34))
'strip off any remaining tags
tableText = RegexReplace(tableText,"<(?![!/]?[ABIU][>\s])[^>]*>","")
' text should now dump as properly formatted CSV.
WScript.Echo tableText
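The normalization steps described above can also be gathered into one helper that runs before the th/td conversions. This is only a sketch under stated assumptions: NormalizeTableText is a hypothetical name, the sample row is made up, and the patterns are one possible way to trim whitespace around tags while keeping single spaces inside cell text.

```vbscript
Option Explicit

' Hypothetical helper, not part of the original script.
' ByVal keeps the caller's variable unchanged.
Function NormalizeTableText(ByVal strText)
    Dim re
    Set re = New RegExp
    re.Global = True
    re.IgnoreCase = True

    re.Pattern = "[\r\n\t]+"        ' line breaks and tabs become a space
    strText = re.Replace(strText, " ")

    re.Pattern = " {2,}"            ' collapse runs of spaces to one
    strText = re.Replace(strText, " ")

    re.Pattern = ">\s+"             ' drop whitespace just after a tag
    strText = re.Replace(strText, ">")

    re.Pattern = "\s+<"             ' drop whitespace just before a tag
    strText = re.Replace(strText, "<")

    re.Pattern = "</tr>"            ' end each table row with a line break
    strText = re.Replace(strText, "</tr>" & vbCrLf)

    NormalizeTableText = strText
End Function

Dim sample
' Made-up row spread over several lines with stray spaces and a tab
sample = "<tr>" & vbCrLf & "  <td> New Volume </td>" & vbTab & _
         "<td>CHAOS</td>" & vbCrLf & "</tr>"
WScript.Echo NormalizeTableText(sample)
```

Because the helper leaves the tr tags in place and only appends a line break after each closing one, the existing tag-stripping and quoting steps can run on its result unchanged.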
Have you made any headway with the use of callbacks to do group replacement?
Jeff
In case you haven’t already seen this, take a look here: the dot
It is the best explanation for some behaviors that I have found so far.
I’m not surprised there are problems. I should have been clearer that my example wasn’t intended as a ready to roll solution. It worked for me with the particular HTML file I was using. Thanks for the clarifications and suggestions.