Sanitizing HTML with TextMate using Regular Expressions
Sometimes you find yourself in a situation where you have to deal with very large documents and prepare them for publication on your website.
I came across such a document with ~7,300 lines, or 117,500 words.
(The result of a research project that had cost the author approximately one and a half years of his time.)
The problem: the document was formatted with whole shiploads of inline and table formatting code. There were hundreds of inline style attributes, empty and/or doubled paragraphs, and other formatting data that needed to be removed.
To get this job done in a reasonable amount of time, the best tool available is TextMate, which, if you haven’t heard about it yet, is quite a powerful text editor.
Now, for sanitizing HTML, TextMate already has a built-in command called Tidy HTML. You can find it in the Gear menu under:
HTML > Tidy, or more simply with ⌃ ⇧ H (Tidy is a command-line tool originally written by Dave Raggett, which you can find in lots of other apps as well).
Some other commands that are useful in sanitizing or converting HTML documents are:
- Strip HTML tags from document (you might have to assign this one a shortcut manually from within the Bundle Editor, if it has none)
- Convert document to Markdown (here you first need to create an empty Markdown document with a .markdown extension and a scratch project; then open your Markdown document and simply drag the file you want to convert from your project drawer into it)
However, just running the document through Tidy wouldn’t have been sufficient in this case. What made the situation more complex was that the document had 345 footnotes, which also needed to be converted into another format. Besides this, I wanted to split the document into two parts, convert the table of contents, and fix the image paths of the 30 image tags that were involved. I also decided that, in order to publish the document with Textpattern, it would be better not to leave the document to Textile’s fate but to publish it directly as XHTML.
To deal with the situation, we can do a Find and Replace with ⌘ F. Below the two text input fields you will see a checkbox labeled Regular Expression. If you want to try the examples given below, switch this on.
To get this job done in less than two days, I used a set of approximately 25 regular expressions (RE) and also ran the document through Tidy at several stages. Below are a couple of examples of the REs that were used:
The procedure is fairly simple. Once you have RE switched on, you put your matching expression in the first input field and the replacement expression in the second. If you simply want to remove all occurrences entirely, leave the second field empty (without whitespace).
Note: Switch to the HTML document scope. The toolbar at the bottom should indicate this scope for you.
Before you hit the Replace All button in the bottom left corner, it is a good idea to use Replace and Find first, or to check the search count indicator in the top right corner, just to see whether your RE works as expected.
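The same check-before-you-replace habit can be sketched outside TextMate. The snippet below is only an illustration under assumptions: it uses Python's re module rather than TextMate's Oniguruma engine, the helper name preview_matches is hypothetical, and the sample string is made up. The pattern is the first one from the examples that follow.

```python
import re

def preview_matches(pattern, text, limit=5):
    """Return the number of matches and the first few matched strings."""
    found = [m.group(0) for m in re.finditer(pattern, text)]
    return len(found), found[:limit]

# Hypothetical sample input, not taken from the original document.
sample = '<p class="a">x</p><p>y</p><p id="b">z</p>'
count, first = preview_matches(r"<p(\s+[^>]*)>", sample)
print(count, first)
```

Looking at the count and the first few hits before replacing anything plays the same role as the search count indicator in the Find dialog.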
Remove all inline attributes from paragraph tags
Matching Expression:
<p(\s+[^>]*)>
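Here we first match an opening angle bracket, then the literal character p, then one or more whitespace characters followed by zero or more occurrences of any character that is not a closing angle bracket, followed by a closing angle bracket. Because at least one whitespace character is required after the p, a bare <p> and tags such as <pre> are left alone.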
Replacement Expression:
(Simple)
<p>
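For readers who want to try this expression outside the Find dialog, here is a minimal sketch in Python. It assumes Python's re module, which happens to accept this particular pattern unchanged; the function name and the sample string are hypothetical.

```python
import re

# Same pattern as above: <p, at least one whitespace character, anything up to >.
P_WITH_ATTRS = re.compile(r"<p(\s+[^>]*)>")

def strip_p_attributes(html):
    """Replace every <p ...> opening tag that carries attributes with a bare <p>."""
    return P_WITH_ATTRS.sub("<p>", html)

# Hypothetical sample input.
sample = '<p class="style3" align="left">Text</p><pre>kept</pre>'
cleaned = strip_p_attributes(sample)
print(cleaned)
```

The <pre> tag in the sample survives because the pattern insists on whitespace directly after the p.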
Remove all table formatting data
Matching Expression:
</?(table|tr|td|th)(\s?[^>]*)>
This example is similar to the previous one, but here we make the forward slash (the second character in the expression) optional, so that all the closing table tags are matched as well. We also use alternation in the first group: separating the items with the pipe character matches any one of the tags table, tr, td, or th.
Replacement Expression:
Empty (nothing)
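A small Python sketch of the same removal, again assuming Python's re module instead of Oniguruma and using made-up sample strings; strip_table_tags is a hypothetical helper name.

```python
import re

# Same pattern as above: optional slash, one of four tag names, optional attributes.
TABLE_TAGS = re.compile(r"</?(table|tr|td|th)(\s?[^>]*)>")

def strip_table_tags(html):
    """Delete opening and closing table, tr, td and th tags, keeping their content."""
    return TABLE_TAGS.sub("", html)

# Hypothetical sample input.
sample = '<table width="100%"><tr><td class="x">Cell</td></tr></table>'
stripped = strip_table_tags(sample)
print(stripped)
```

One thing worth knowing: because the whitespace after the tag name is optional, the pattern also matches tags that merely begin with one of the four names, such as <thead>, while <tbody> is left in place. If that matters for your document, a word boundary (\b) directly after the group is one way to restrict the match.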
Remove all instances of empty p tags with optional br tags
Matching Expression:
<p>\s*(<br\s?/>)*\s*</p>
Replacement Expression:
\n (a newline character, i.e. a line break)
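As a quick way to see what this expression does and does not catch, here is a hedged Python sketch. It assumes Python's re module; the helper name and the sample are hypothetical.

```python
import re

# Same pattern as above: a <p> that contains only whitespace and <br/> or <br /> tags.
EMPTY_P = re.compile(r"<p>\s*(<br\s?/>)*\s*</p>")

def drop_empty_paragraphs(html):
    """Replace each empty paragraph with a single newline."""
    return EMPTY_P.sub("\n", html)

# Hypothetical sample input.
sample = "<p>One</p><p> <br /><br/> </p><p></p><p>Two</p>"
result = drop_empty_paragraphs(sample)
print(repr(result))
```

Paragraphs with real content are untouched. A <br> written without the slash is not covered by the pattern, so such paragraphs would stay as they are.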
Replace the Markup for the Footnotes
Now for the more complex example:
The Footnotes are in the following format:
Reference Links:
<span class="style87">(<a href="#342">342</a>)</span>
Anchors:
[342]<a name="342"></a>
which we want to transform into a format that looks like this:
Reference Links:
<p id="fn00r342">
<sup class="footnote">
<a href="#fn342">342</a>
</sup>
</p>
Anchors:
<p id="fn342" class="footnote"><sup>342</sup></p>
How can we actually do the conversion in a matter of seconds? You might have noticed that, from the beginning, some parts of the expressions were enclosed in round brackets. This serves two purposes. The first is grouping, and the second is to create a capture register that we can later refer to in our replacement expression, so it works like a variable. The index names for these variables are numbers, starting from one.
If you have several groups, $1 refers to the first group (round-bracketed expression) counting from the left, $2 to the second, and so on. Because in our sample code the same number also appears as the link text itself, we can refer to it within the matching expression itself using a back-reference with the notation \1, which stands for whatever the first group has matched.
Broken down: (\d{1,3}) matches any number that is one, two, or three digits long.
Matching Expression for Reference Links:
<span class="style\d+">\(<a href="#(\d{1,3})">\1</a>\)</span>
Replacement Expression for Reference Links:
<p id="fn00r$1"><sup class="footnote">
<a href="#fn$1">$1</a></sup></p>
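The interplay of capture group and back-reference can be tried out with the following Python sketch. It is an illustration under assumptions: Python's re module is used instead of Oniguruma, so the replacement refers to the group as \g<1> rather than $1, and the replacement is written on one line. The names and the sample strings are hypothetical.

```python
import re

# Same matching expression as above; \1 must repeat the number captured in the href.
REF_LINK = re.compile(r'<span class="style\d+">\(<a href="#(\d{1,3})">\1</a>\)</span>')

# Python spells the TextMate placeholder $1 as \g<1> in replacement strings.
REF_REPLACEMENT = (
    r'<p id="fn00r\g<1>"><sup class="footnote">'
    r'<a href="#fn\g<1>">\g<1></a></sup></p>'
)

def convert_reference_links(html):
    """Rewrite old-style footnote reference links into the new markup."""
    return REF_LINK.sub(REF_REPLACEMENT, html)

# Hypothetical sample input in the format shown earlier.
sample = 'text<span class="style87">(<a href="#342">342</a>)</span>'
converted = convert_reference_links(sample)
print(converted)
```

A link whose text differs from its href number is not matched at all, which is exactly what the back-reference is there for.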
The matching expression for the anchors is even more interesting. Here the pipe symbol creates an either/or condition, which was needed because some footnote anchors had the square-bracketed footnote number after the anchor and some before.
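Matching Expression for Anchors:
\[(\d{1,3})\]<a name="\1"></a>|<a name="(\d{1,3})"></a>\[\2\]
(Enter this as a single line.) Only one of the two alternatives matches at a time, so the footnote number ends up in $1 when the brackets come before the anchor and in $2 when they come after it; the group that did not take part in the match is empty. Writing both placeholders next to each other therefore yields the number in either case.
Replacement Expression for Anchors:
<p id="fn$1$2" class="footnote"><sup>$1$2</sup></p>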
Another important thing to notice here is the backslashes. We want to match [ and ] (and, in the reference link expression, the round brackets) as literal characters. Since all brackets except angle brackets are special characters used to describe what is being matched, we need to escape them.
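To see how the either/or condition behaves with both anchor orderings, here is a hedged Python sketch. It assumes Python's re module, and it is a variation rather than a literal copy: the second alternative uses a back-reference for the bracketed number, and a small function picks whichever group actually matched. All names and sample strings are hypothetical.

```python
import re

# Brackets before the anchor (group 1) or after it (group 2).
ANCHOR = re.compile(
    r'\[(\d{1,3})\]<a name="\1"></a>'
    r'|<a name="(\d{1,3})"></a>\[\2\]'
)

def _anchor_replacement(match):
    # Only one alternative matches, so exactly one of the two groups is set.
    number = match.group(1) or match.group(2)
    return f'<p id="fn{number}" class="footnote"><sup>{number}</sup></p>'

def convert_anchors(html):
    """Rewrite old-style footnote anchors into the new markup."""
    return ANCHOR.sub(_anchor_replacement, html)

# Hypothetical sample input with one anchor of each kind.
sample = '[342]<a name="342"></a> and <a name="7"></a>[7]'
anchors = convert_anchors(sample)
print(anchors)
```

Using a function as the replacement is a Python convenience; it makes explicit that the footnote number can live in a different capture group depending on which side of the pipe matched.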
Regular Expressions in TextMate
Needless to say, TextMate makes very heavy use of regular expressions; they are used all over the application, in the language grammars for instance.
TextMate uses the Oniguruma regular expression library by K. Kosako. It is useful to look up the syntax rules, which are included in TextMate’s Help (Section 20.3), and the TextMate replacement syntax (Section 20.4).
What this also means is that we can switch on multi-line or extended mode, include comments, and all sorts of other powerful stuff.
Now, apart from our very basic examples above, we sometimes want to match something that comes after or before something else that is not itself part of the match, but only marks a so-called zero-width position.
Such expressions are called lookarounds. A good usage example is an expression that matches anything in between two HTML tags, for instance to clean up some arbitrary junk that hangs around in our document.
A look-around Example
(?<=>)(?m:.*?)(?=<)
Here the expression has three parts. The first one is a lookbehind: it makes the second group match only directly after the closing angle bracket of an HTML tag. The second group is what we actually want to match, and the third part describes where our match should stop, without matching what is described within (the match must end before the first occurrence of an opening angle bracket). This is called a lookahead.
Now, if you look at the second group, which is our actual match, you will have noticed that it starts with a set of three characters:
?m:
This is a special notation that switches on multi-line mode for this group. In Oniguruma this lets the dot match line breaks as well, so that our match does not stop at a line break.
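The effect of the mode switch is easy to observe in a small Python sketch. Note the assumption: Python's re module is not Oniguruma, and its counterpart to Oniguruma's (?m: ... ) is the s flag (DOTALL), so the group is written (?s: ... ) here. The sample string is made up.

```python
import re

# Lookbehind for ">", lazy content, lookahead for "<".
# (?s: ... ) lets the dot match line breaks inside the group.
BETWEEN_TAGS = re.compile(r"(?<=>)(?s:.*?)(?=<)")

# The same expression without the mode switch, for comparison.
BETWEEN_TAGS_SINGLE_LINE = re.compile(r"(?<=>).*?(?=<)")

# Hypothetical sample input with a line break inside and between the tags.
sample = "<h1>Title</h1>\n<p>Hello\nworld</p>"

with_mode = BETWEEN_TAGS.findall(sample)
without_mode = BETWEEN_TAGS_SINGLE_LINE.findall(sample)
print(with_mode)
print(without_mode)
```

Without the mode switch only the text that sits on a single line between two tags is found; with it, the content that spans a line break is matched too. The brackets themselves never end up in the result, because lookarounds match positions, not characters.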
Automating a RE with Macros
If we want to take real advantage of such powerful expressions, we can help ourselves with a simple macro that turns one into a keyboard shortcut for convenience.
Here’s how:
- From within a document window where a match exists, press ⌥ ⌘ M, which starts the macro recording.
- Press ⌘ F to perform a Find
- Insert the desired expression, like the ones above, in the search input field and press the Next button
- Press ⌥ ⌘ M again, which stops the macro recording
- Press ⌃ ⌘ ⌥ M to store the macro. This pops up the macro editor, so give your macro a name and assign a keyboard shortcut, ⌃ ⌘ ⌥ → for instance.
Now, if you repeat the above procedure to create a second macro, but press the Back button in the third step and assign the shortcut ⌃ ⌘ ⌥ ← in the last step, you have created a convenient way to tab through all those instances, forwards and backwards.
You could name those Macros:
- Tab content Forward
- Tab content Backwards
Which RE you use does not matter in this example; it depends on your personal needs, and those can differ of course.
A small, useful tip
If you haven’t already discovered Allan Odgaard’s secretive Edit in TextMate command (⌃ ⌥ E), it is a good idea to configure it now. Once you are inside the input field of the Find dialog, you can trigger this command, which opens a scratch document window in its proper Perl scope. This makes it much easier to construct your regular expressions without running into typos such as unclosed brackets, or to break up your expression over multiple lines before putting it together again.
Once you have finished, just press ⌘ S to save your changes and ⌘ W to close the window.
In fact, you can now even write an RE directly on multiple lines. In Oniguruma syntax this is done like this:
(?<=>)(?x:   # Turn on extended mode for this group
# Match zero or more occurrences of any character
.*?
# End of match
)(?=<)
If you remember our previous example, you saw that we had switched on multi-line mode. If we use an x character instead, the RE engine ignores all the whitespace characters and line breaks it finds inside the second group of the expression. This allows you to split your expressions over multiple lines and even leave comments (# notation), so your code becomes more readable and understandable. This is known as the extended mode of the RE engine.
Just keep in mind that without using the Edit in TextMate command (or working from a separate document window) you cannot try the above example, as it is not possible to insert literal line breaks and tabs in the Find and Replace input fields.
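Extended mode exists in other engines as well. As a hedged illustration, the sketch below writes the look-around expression in Python's verbose style; it assumes Python's re module, where the mode is switched on for the whole pattern with the re.VERBOSE flag, and re.DOTALL is added so that the dot also matches line breaks. The sample string is hypothetical.

```python
import re

BETWEEN_TAGS = re.compile(
    r"""
    (?<=>)   # lookbehind: the match starts right after a closing angle bracket
    .*?      # as few characters as possible, line breaks included (DOTALL)
    (?=<)    # lookahead: the match ends right before an opening angle bracket
    """,
    re.VERBOSE | re.DOTALL,
)

# Hypothetical sample input.
sample = "<p>Hello\nworld</p>"
found = BETWEEN_TAGS.findall(sample)
print(found)
```

Whitespace and comments inside the pattern are ignored by the engine, so the layout is purely for the human reader.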
Useful Resources about regular Expressions
- Regular Expression Info
- Regular Expression HOWTO by A.M. Kuchling
Books and Articles
- TextMate: Power Editing for the Mac is a very useful book written by Rails developer James Edward Gray II. Inside you will find material on how to work with RE in TextMate, both for beginners and advanced users
- An article about how to customize TextMate, written by James Edward Gray II at MacDevCenter
- Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools by Jeffrey E. F. Friedl (by far the best book available on the subject)
Posted ·2008-03-04 · by · Marios ·
© 2006-2008 marios buttner