Monday, December 15, 2008

Quick Review of Microsoft's Seadragon

Just got around to installing Seadragon on my iPhone and took it for a quick spin. First impressions are that it rocks. If I load the map views, for example, zooming seems far far smoother than Google Maps. Zooming in on the articles in the Library of Congress Set was amazingly smooth as well.

Herding Code: Episode 29 w/ Miguel de Icaza (Part 2)

Part 2 of the Herding Code podcast w/ Miguel de Icaza is up: http://herdingcode.com/?p=114

I'll be listening to this tonight when I get home.

Sunday, December 14, 2008

Fun with Tries

A number of years ago, Michael Zucchi introduced me to the Aho-Corasick algorithm which he implemented in Evolution for the "Find in Message" feature. Later, I ended up extracting that logic out and refactoring it a bit into the ETrie class which I ended up using to replace the regex logic (which is common among most GNOME apps that do this) to scan for urls in the message body (in order to insert hyperlinks, etc). This turned out to be an order of magnitude faster.

For those who aren't in the know, the Aho-Corasick algorithm is for searching for multiple strings in a given input. In other words, it searches for one of multiple needles in a single haystack.

The pseudocode for the search itself would look something like this:

q = root
FOR i = 1 TO n
  WHILE q != fail AND g(q, text[i]) == fail
    q = h(q)
  ENDWHILE
  IF q == fail
    q = root
  ELSE
    q = g(q, text[i])
  ENDIF
  IF isElement(q, final)
    RETURN TRUE
  ENDIF
ENDFOR
RETURN FALSE

Of course, ETrie has a slight modification to the above algorithm, which is that instead of returning TRUE or FALSE, it returns a pointer to the beginning of the match or NULL if it find no matches.

The problem with this particular implementation, though, is that it would simply return a pointer to the offset into the string of the first match found. But what if you had multiple pattern strings that started with identical characters?

Like the ETrie implementation, the traditional Aho-Corasick algorithm (if I am remembering correctly) stops searching as soon as it finds any match. The difference, of course, is that given access to the current to the state / tree structure, a programmer could choose to continue matching to see if there is a more greedy match. With ETrie, however, in the interest of simplicity, it just returned the first / shortest match it got to.

In case I'm explaining this badly, say our search patterns are "the", "there", and "therefor". In the case of ETrie, it would always return that it matched "the" even when the "the" that it matched was the starting substring of the larger string, "therefor".

In something I was working on more recently (in c++), I wanted a greedier match (e.g. I wanted it to match "therefor" instead of "the").

In order to achieve this, I modified the implementation a slight bit:

const char *
Trie::Search (const char *buf, size_t buflen, int *matched_id)
{
    const char *inptr, *inend, *prev, *pat;
    size_t matched = 0;
    TrieMatch *m = 0;
    TrieState *q;
    char c;
    
    inend = buf + buflen;
    inptr = buf;
    
    q = &root;
    pat = prev = inptr;
    while (inptr < inend) {
        c = *inptr++;
        
        if (icase && (c >= 'A' && c <= 'Z'))
            c += 0x20;
        
        while (q != 0 && (m = g (q, c)) == 0 && matched == 0)
            q = q->fail;
        
        if (q == &root) {
            if (matched)
                return pat;
            
            pat = prev;
        }
        
        if (q == 0) {
            if (matched)
                return pat;
            
            q = &root;
            pat = inptr;
        } else if (m != 0) {
            q = m->state;
            
            if (q->final > matched) {
                if (matched_id)
                    *matched_id = q->id;
                
                matched = q->final;
            }
        }
        
        prev = inptr;
    }
    
    return matched ? pat : 0;
}

I have published the full source code for trie.cpp and trie.h on my website for your perusal.

Saturday, December 13, 2008

Herding Code: Episode 28 w/ Miguel de Icaza

Been waiting to listen to this for a few days now. Figured others might also be interested: Herding Code: Episode 28.

Discussion of the history of Mono and some of the new features that have been added recently.

Wednesday, December 3, 2008

Moonlight 1.0 beta 1

As most of you have probably heard by now, the first Beta release of Moonlight 1.0 has been announced, you can install it from http://www.go-mono.com/moonlight/.

Miguel's got a great post explaining how the multimedia stack works in Moonlight in case that sort of thing interests you.

Code Snippet Licensing

All code posted to this blog is licensed under the MIT/X11 license unless otherwise stated in the post itself.