Tuesday, November 10, 2009

Grep replacement using my own Regex library

This one took a lot of testing! I had no idea it would be this much work to rewrite grep (the regex search utility on *nix.) But finally it's at a level of functionality that is actually useful to me.

A regular expression (regex) is a formal syntax that identifies patterns in text. There are variants, such as Perl's regex engine which differs from the Unix standard. I stuck with the Linux grep syntax, since I'm familiar with that. If you're not already using a grep-like utility in Windows, try this one, it's powerful enough to work with directories full of source files (it's especially handy for working with HTML/XML tags.) Moreover, it's a simple library which can be extended or used to build more complex apps.

Source:

regex.cpp (General purpose Regex library)
main.cpp (Grep program - uses the Boost Filesystem Library)

The Regex library can also be used in C programs - simply rename the source to regex.c

It supports upto 9 backreferences, case insensitive search and can search any file containing text (it examines the first 1Kb for text characters) in terms of lines or bytes-from-start. It supports numerical ranges, shorthand range symbols like \d, \D and most parts of the standard regex language.

The heart of the program is the B+ tree that is created when a regex is fed into it. This tree then stays in memory while different input strings are matched against it's leaf nodes. The advantage of this NFA-based scheme is that matching at runtime is faster than with a system built on, say, derivatives where the regex string needs to be processed on every input character. Another advantage is that once built, complex regexes can be stored as search trees so they never have to be rebuilt.

It's not a paragon of efficiency - construction is O(n), worst case search performance is O(n^2) over a block of text. Only leaf nodes store data, but every node occupies 16 bytes. Despite this it can search very large directories in seconds.

There is only one issue with the program right now - it is designed for the Linux End-Of-Line scheme (LF). For any other encoding, bytes-from-start shown may be off by upto 10%.

It will probably be a long time before I work on a text manipulation program again... testing for all the edge cases was quite painful.

No comments:

Post a Comment