Andreas Fuchs' Journal

I hack on things for people.

IDNA Now Supports Punycode Decoding

My IDNA library now supports decoding IDNA strings via the to-unicode function:

    (to-unicode "xn--mller-kva.example.com")
    ;; =>  "müller.example.com"

That’s in addition to the regular encoding for unicode domain names:

    (to-ascii "müller.example.com")
    ;; => "xn--mller-kva.example.com"

Sadly, I haven’t managed to get the logic for case-sensitive punycode encoding to work yet. But fortunately, IDNA domain name encoding doesn’t require that! Anyone looking for some low-hanging fruit-shaped lisp projects is welcome to add that! (-:

Accessing the Stripe API From Lisp

Stripe is a new payment processor on the Web, and they seem to be a lot less insane than Paypal. On a whim, I made a little (almost completely untested, toy) CL library for accessing their HTTP API from Lisp. Maybe you’ll find it useful: cl-stripe.

This was pretty great fun! Thanks to their nice HTTP API, drakma, and alexandria, I have been able to write this with a minimum of horribly hacky code, in just 5 or 6 hours of working on it, on and off, this saturday afternoon.

If it still looks like fun, I think I may add some clucumber tests to it tomorrow. Stay tuned.

A Weird Problem With ASDF on NFS (and a Workaround)

Recently, we at Franz have been seeing weird failures when building a certain ASDF system on NFS: We were intermittently getting redefinition warnings when loading a system - not all the time, but more often when we compiled during certain timeslots.

This was a weird one, and I think it would be nice to record what this is, and how we figured out what’s going on (and how we arrived at our work-around).

Update 2011-10-11: Faré informs me that this problem is fixed (all the way, no need for a workaround) in ASDF 2.017.9!

The Symptom

We have a system and a specialized operation (concat-op, it generates a single loadable .fasl we can ship) that depends on asdf:load-op. In our build process, we first load the system, then generate some data with the loaded system, and then perform the custom operation on the system.

When performing that operation with a source tree that was checked out on NFS, the load-op that it depends on sometimes got performed a second time: The lisp loaded all the .fasls again, and for some constructs in some .fasls, signaled a WARNING, which made the build break.

Oddly enough, the failure happened only during certain time slots - we would see the build work between 3pm and 4:30pm, and starting at 4:30 it failed consistently until it was time to go home. Huh.

Aside: How ASDF Operates

Not everyone might be familiar with how ASDF works (if you are, feel free to skip to the next section, or stay and nitpick (-:), so here’s a small primer on what happens when you type (asdf:load-system :some-system). Here’s a little walkthrough:

  1. ASDF runs the generic function traverse with the system and the operation as parameters.

  2. traverse walks the dependencies of the system and the contents of the system itself, and determines which operations are not yet done.

    For a load-op on a CL source file, traverse will try to generate a load-op for the input-file of that load-op (the .fasl file), check if that .fasl file exists, and if it doesn’t, then it will also generate a compile-op for the corresponding .lisp file.

  3. As a result, traverse returns a list of operations that must be performed on each component (or module, or system). For a clean source tree, that list looks something like:

    ((compile-op . source-1) (load-op . source-1) 
     (compile-op . source-2) (load-op . source-2) ...)
    
  4. operate takes that list and just performes each operation on its component in order.

All this means that ASDF takes a two-step approach: It first determines what needs to be done, then does it. All the smarts in ASDF are in that traverse operation and the underlying mechanisms. The rest is just a dolist.

OK, with that out of the way:

The Hunt

I’d gotten this error before, but that was when I was running on a source tree checked out on an NFS-mounted file system on Windows. I didn’t pay it much mind, because, hey, it’s the NFS client on Windows.

But then this exact same problem started happening to a client using two Linux machines as the client and the server. We had a problem.

At first, we suspected that there was an issue with build order (that result list of traverse). This was a blind alley: The files were loaded in exactly the same order in the failing and working scenarios. No luck.

The next thing was to instrument operation-done-p before performing the operation, and there we saw what happened: operation-done-p reported that load-op had not been performed on a file. But that file had been loaded into this very same image just minutes before! Huh?

operation-done-p a generic function and has a method that attempts to handle the most common cases of operations on ASDF components: the method specialized on (operation component), which does the following in the branch that applies to load-op:

(defmethod operation-done-p (operation component))
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
  (let ((out-files (output-files o c))
        (op-time (component-operation-time o c)))
    (flet ((latest-in ()
             (reduce #'max (mapcar #'safe-file-write-date in-files))))
      (cond
        ;; ...[cut some branches]

        ((not out-files)
         ;; an operation without output-files is probably meant
         ;; for its side-effects in the current image,
         ;; assumed to be idem-potent,
         ;; e.g. LOAD-OP or LOAD-SOURCE-OP of some CL-SOURCE-FILE.
         (and op-time (>= op-time (latest-in))))

         ;; ...[some more branches here]
         )

This consults a registry of times when an operation was performed on a component: component-operation-time returns a universal-time, that is a number of seconds, and compares that to the file-write-date - also a universal-time - of the input file (the .fasl). After some tracing, we determined that for some reason, the .fasl file was one second younger than the time that ASDF thought the load-op had been performed on it. In other words, the compiler had written the file AFTER load had had a chance to read it. ASDF was reading a file from the future.

This was the time when we started scratching our heads.

First, we wrote a little test program to verify we weren’t crazy:

wtf.c
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
#include <stdio.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/stat.h>

int main(int argc, char **argv) {
  int fd;
  struct timeval tv1, tv2;
  struct stat sb1, sb2;

  char buf[1024];

  while (1) {
    gettimeofday(&tv1, NULL);
    printf("%u.%u\n", tv1.tv_sec, tv1.tv_usec);

    fd=open(argv[1], O_WRONLY|O_CREAT|O_TRUNC, 0660);

    write(fd, buf, sizeof(buf));
    write(fd, buf, sizeof(buf));
    write(fd, buf, sizeof(buf));

    gettimeofday(&tv1, NULL);
    stat(argv[1], &sb1);
    close(fd);
    gettimeofday(&tv2, NULL);
    stat(argv[1], &sb2);

    /* never seems to be triggered */
    if (sb2.st_mtime != sb1.st_mtime) {
      printf("mtime changed between last write (%d) and close (%u)\n",
       sb1.st_mtime, sb2.st_mtime);
      exit(1);
    }
    if (sb1.st_mtime > tv1.tv_sec) {
      printf("mtime after last write has a future timestamp (%u > %u)\n",
       sb1.st_mtime, tv1.tv_sec);
      exit(1);
    }
    if (sb2.st_mtime > tv2.tv_sec) {
      printf("mtime after close has a future timestamp (%u > %u)\n",
       sb2.st_mtime, tv2.tv_sec);
      exit(1);
    }
  }

  return 0;
}

When we ran it, after a while, at timestamps very close to the boundary to the next second, we’d get “mtime after close has a future timestamp”. What. The.

We checked that all machines were synchronized with NTP. They were, to the same machine on the local network. What is going on?

Luckily, my colleague Ahmon has a lot of experience with NFS. His expertise and ample use of tcpdump finally provided the final puzzle piece: NFS protocol 3 on Linux has a feature called “weak cache consistency”: information can be supplied by servers after most NFS calls (e.g., WRITE) and has the server’s take on file attributes (such as mtime). So if the time on the server is just a tiny bit ahead of the client, the server will report the file that the client just wrote is from the client’s future.

When one apparently time-traveling file appears in the source tree, the traverse method will consider the system to not have been loaded, and will reload the .fasl files starting at the time-traveling file. Anything after that file in the build order could (and did!) potentially mess up the lisp image. In the best case, it would just slow down the build a lot by re-loading a ton of .fasl files. Argh.

Fixing this Mess (aka, the Workaround)

Since ASDF consults a registry of times that a file was loaded, we decided it would be easiest to alter the method that records this timestamp: Instead of the current time, it should record whichever is later: the current time or the timestamp of the file that it loaded.

ASDF work-around for NFS files created in the future
1
2
3
4
(defmethod perform :after ((operation load-op) (c component))
  (setf (gethash (type-of operation) (component-operation-times c))
    (reduce #'max (cons (get-universal-time)
                        (mapcar #'safe-file-write-date (input-files operation c))))))

And that’s it - with this method in place, asdf can now accurately build our system repeatedly, on NFS, even if wtf.c triggers.

Lessons Learned

That was a pretty fun afternoon spent debugging our build process. As a result, we got a working build, and a few shiny new ideas in our heads:

One, a program should never rely on the system time and some file’s creation time being comparable. This just doesn’t work anymore in a distributed system, especially if you’re using full seconds to represent time.

Two, ASDF is pretty flexible (almost to the point of being too flexible). To diagnose ASDF’s internal state, all we had to do was trace some functions it defines, and we managed to put this workaround in without having to deeply modify any of its sources: All it takes is an additional :after method. Sweet.

And three, the Allegro CL fasl loader is very fast (at least it feels so to me, coming from SBCL): In that tiny window (less than 0.07 seconds of real time) it would load a pretty substantial .fasl file and asdf would register it as loaded. That’s pretty impressive! (-:

“Basic Lisp Techniques” for Sale on Amazon

I just saw that David Cooper’s Basic Lisp Techniques is now available on Amazon, as a Kindle edition, for $9.95.

Problem is, this book is freely available on the Franz web site, doesn’t seem like it is an authorized conversion, and judging from the free sample the Kindle edition it is a slightly crappy (footnotes didn’t convert properly, ToC is ugly) pdf->azw conversion of the PDF. I recommend using the free PDF (you can even convert it yourself using Calibre if you want the book on an e-reader, it’s not hard at all to get very readable results).

Switching This Blog to Octopress

I’ve just made the jump from a creaky, old, badly-styled and annoying Movable Type installation to Octopress. It looks pretty nice, and as of now works better than MT has ever worked here, although I hope I didn’t break too many things with the switch over. If you find any 404s where they shouldn’t be, please contact me.

SBCL Git Repo Is Now Official!

Almost exactly four years after I announced the availability of the first iteration of the SBCL cvs->git mirror (and a few more years after the cvs->arch mirror even!), we now have an official SBCL repository! Many thanks to Nikodemus for doing all the heavy lifting! (-:

I’ve stopped the (now defuct) cvs->git mirror. In its place, there is a new repository that mirrors the official git repo. To get the old, now static repo, see sbcl-cvs-import.

If you are a committer and have any old checkouts around, please make sure that you follow the procedure for eliminating unnecessary branches / tags.

Userscript for Nicer l1sp.org Search

I love the l1sp.org documentation redirection service. It is quick and easy to look stuff up there if you know the name, and it has a pretty good search if you don’t. However, the search results are not presented very nicely: They’re very close together, and the ones I’m looking for most often (mostly CLHS pages) are buried somewhere in the middle.

So I wrote a userscript (for Firefox through greasemonkey or Google Chrome/Chromium) to improve things a bit: It enables keyboard navigation (j/k or cursor-down/up select the next/previous result, enter opens the page, and / focuses the input field), and searches for the closest match from the results (ranks them by section and then selects the shortest matching entry).

Get the l1sp.org userscript here.

I do hope that Xach will consider adding these features to l1sp proper (maybe de-uglified a bit… that yellow highlight is really not too great). Until then, you can already enjoy the benefits of just a little javascript (-:

Visualizing the CLHS Interlinking Structure

Lately, I’ve been wondering what we could use the ~110k Hyperlinks in the CLHS for, other than click through the spec in a web browser. For example, given a glossary entry, how do you find out which functions refer to it?

So I wrote a little program that crawls (a local copy of) the HyperSpec, and creates an RDF graph of its link structure. This graph can be used to answer these questions, or it could be the basis of a useful research overlay on top of the HyperSpec, who knows.

Lots of tools exist to visualize and manipulate the facts in an RDF graph already; Here’s a screen shot of gruff with a small trace through the HyperSpec:

You can get the code to create the RDF graph at github.