Tuesday, November 17, 2009

Working with Big Numbers

When you need big integers in C++ you know you can turn to any mature library available online. If that isn't good enough, you can put something together using vectors and other STL goodies.

But if you ever need to find the factorial of 30 in the non-standard Borland TC 2 compiler, then you need...
the bignum<int> template!

The need for this arose because most of the code I wrote in school was for the TC 2 compiler. It uses a different template system, non-standard headers and matches function prototypes differently than GCC. It is also very relaxed about how you use const references. It invariably takes work to port the simplest programs to GCC, so anytime I had to reuse my old TC 2 code this template let me do any useful computation. I've used it many times to work with large numbers - computing factorials, testing primality (Rabin-Miller algorithm) and finding GCDs.

Source: bignum.cpp

It supports overloaded arithmetic operators (+ - * / ++ --), bitwise operators (& | ~ ^ >> <<), iostream I/O (>> <<), all the relational operators and a function to return a boolean value. While + and - are O(n), * and / are O(n^2) for n-byte integers. The code cannot be moved to GCC since it makes assumptions like sizeof(unsigned) being 2 bytes.

When I do eventually rewrite it, I'll probably build it around a vector<type> and let the bignum object manage it's own size dynamically. Moving from a 32 to 64-bit base type would be as simple as :%s/uint32_t/uint64_t/g

Tuesday, November 10, 2009

Grep replacement using my own Regex library

This one took a lot of testing! I had no idea it would be this much work to rewrite grep (the regex search utility on *nix.) But finally it's at a level of functionality that is actually useful to me.

A regular expression (regex) is a formal syntax that identifies patterns in text. There are variants, such as Perl's regex engine which differs from the Unix standard. I stuck with the Linux grep syntax, since I'm familiar with that. If you're not already using a grep-like utility in Windows, try this one, it's powerful enough to work with directories full of source files (it's especially handy for working with HTML/XML tags.) Moreover, it's a simple library which can be extended or used to build more complex apps.

Source:

regex.cpp (General purpose Regex library)
main.cpp (Grep program - uses the Boost Filesystem Library)

The Regex library can also be used in C programs - simply rename the source to regex.c

It supports upto 9 backreferences, case insensitive search and can search any file containing text (it examines the first 1Kb for text characters) in terms of lines or bytes-from-start. It supports numerical ranges, shorthand range symbols like \d, \D and most parts of the standard regex language.

The heart of the program is the B+ tree that is created when a regex is fed into it. This tree then stays in memory while different input strings are matched against it's leaf nodes. The advantage of this NFA-based scheme is that matching at runtime is faster than with a system built on, say, derivatives where the regex string needs to be processed on every input character. Another advantage is that once built, complex regexes can be stored as search trees so they never have to be rebuilt.

It's not a paragon of efficiency - construction is O(n), worst case search performance is O(n^2) over a block of text. Only leaf nodes store data, but every node occupies 16 bytes. Despite this it can search very large directories in seconds.

There is only one issue with the program right now - it is designed for the Linux End-Of-Line scheme (LF). For any other encoding, bytes-from-start shown may be off by upto 10%.

It will probably be a long time before I work on a text manipulation program again... testing for all the edge cases was quite painful.

Bad, bad batchfile...

This set of batchfiles is the product of an evening spent online with nothing to do. I was reading up on the Windows command line, stuff I haven't touched since class 8 or so, and the end result isn't particularly useful (except to show your friends why Autorun should always be disabled on Windows XP.)

To see what it does, copy these files onto a pen drive with the same names:

hendrix.m3u (This file must not be hidden)

The first batchfile, grue.bat is launched by autorun.inf when you double click on the pen drive. This batchfile opens itself recursively in minimised mode before quitting. It then copies Hendrix.m3u into C:\Documents and Settings\All Users\Start Menu\Programs\Startup, under the name syslock.bat, and then deletes C:\NTDETECT.COM before shutting down. As soon as the user logs in again syslock.bat is launched, which forces a shutdown. Congratulations, you're locked out of your computer!

One problem was that syslock.bat could not be a hidden file if it was to be copied into the startup folder. So I hid it in plain sight as a music playlist with the .M3U extension. There's probably a way to copy hidden files in DOS, but this works well enough for my purposes.

Fixing the damage done by the batchfile is straightforward - remove syslock.bat, then restore NTDETECT.COM (copy it from any other Windows installation.) You could connect the hard disk as a slave to do this but it's easier with BartPE (Preinstalled Environment), or a Linux live CD plus the drivers to write to an NTFS partition.

So there are essentially two things to learn from this:

(1) You can do a lot with batchfiles. Not as much as a shell script in Linux, but it's still neat!

(2) Antiviruses don't consider a batchfile a threat. How would they react to a more malicious script, I wonder.