Archive for October, 2007

Shuffling Algorithms

Tuesday, October 30th, 2007

Last week, my Ph.D. advisor needed to implement a randomized shuffle. The standard algorithm to do this is usually credited to Knuth, and is very easy to describe. Given an array x[] of length N

for i in 1..N:
    j = random number in range [i, N], inclusive
    swap x[i] with x[j]

This algorithm has $$N!$$ possible code paths, which conveniently corresponds to the number of unique permutations of x[]. All that remains is to prove that given a particular permutation of x[], you can reach it with N swaps, and you’ve shown that the above algorithm is complete (can generate any permutation) and unbiased (produces all permutations with equal probability.

Amazingly, any trivial change to the above algorithm will ruin it. The two common mistakes are:

  1. j = random number in range [1,N], inclusive: Here, you might imagine that allowing an element to swap with elements both before and after it will increase randomness somehow. It is true that this version has $$N^N$$ code paths, which is greater than $$N!$$, so all permutations are potentially reachable. However, not all permutations are equally likely (as we’ll see below), which results in a biased algorithm.
  2. j = random number in range [i+1,N], inclusive: By removing the possibility for j = i you might think you are also increasing randomness by never allowing an element to swap with itself, but that is not true. This incorrect version of the algorithm only has $$(N-1)!$$ possible code paths, but you know there are $$N!$$ possible permutations. Therefore some permutations must be unreachable, and this algorithm is therefore incomplete.

To explore these variations, I wrote a small Python program to test the correct and two incorrect versions of the algorithm and report some statistics. The variable $$n$$ sets the length of the sequence to shuffle. Since $$n^n$$ grows very quickly and this program stores every generated sequence in memory, you probably don’t want to try anything larger than $$n=7$$. Here are the results for $$n=7$$:

swap_incorrect: orderings, total = 823543 unique = 5040
repetitions = 163 +/- 50
variation = 30.495%
least common seqs = ['GABCDEF']

swap_nearly_correct: orderings, total = 720 unique = 720
repetitions = 1 +/- 0
variation = 0.000%
least common seqs = (all)

swap_correct: orderings, total = 5040 unique = 5040
repetitions = 1 +/- 0
variation = 0.000%
least common seqs = (all)

Note that swap_incorrect corresponds to case 1 above, and swap_nearly_correct corresponds to case 2. The first incorrect algorithm and the correct algorithm both produce the expected number of unique permutations, $$7! = 5040$$. However, the first incorrect algorithm has substantial variation the frequency of each ordering, with a standard deviation of 30%! The second incorrect algorithm (which never allows j = i) has no repetition, but fails to generate all the possible permutations.

A general result you find if you vary $$n$$ in the above program is that the least common sequence in the first incorrect algorithm is always the original sequence shifted to the right by one index, with the last element moved to the front. It’s not immediately obvious to me how one would prove that for all $$n$$, but it is interesting.

If your $$n$$ gets big enough, $$n!$$ might be larger than the periodicity of your psuedo-random number generator, which means that in practice, you might never be able to generate some orderings. You probably won’t care, since with $$n$$ that large, it would be tough to enumerate all possible orderings anyway. Still good to keep in mind, though…

How Much Is Your Data Worth?

Saturday, October 27th, 2007

Backups are one of those things you only take seriously after you experience serious data loss and realize the cost, monetary and otherwise, of losing files. In my case, I started thinking very hard about backups last year when a new hard drive in my MacBook died after a few weeks. I even had a backup, but it was incomplete and a month out of date. It was then I realized that my backup approach was haphazard, and not indicative of how important my data was to me.

Now I’ve decided that a good way to approach the problem is to imagine that one morning, you boot your computer and all of your files are gone. How much would you pay to get your data back? $100? $500? $1000? Without a backup strategy, you might find that when a thief or a malfunctioning disk head separates you from your data, no amount of money will bring it back. Only with some amount of planning can you have any control over how much data recovery will cost.

The types of failure modes you may have to deal with include:

  • Catastrophic hardware or software failure: Your hard drive, computer, or software suddenly and without (much) warning destroys some or all of your data. Moreover, it is obvious when this failure happens, so you can take immediate action. This is probably the most common failure, and is the first thing to be addressed by a backup system.
  • Theft: Someone steals your laptop, or breaks into your house and steals your computer.
  • Physical Disaster: Fire, flood, dropping your laptop on the pavement, backing over it with a car, etc.
  • Silent corruption: Malfunctioning software or hardware might corrupt data too slowly for you to notice immediately.
  • User Failure: This is when you accidentally delete a directory, overwrite an important document, or otherwise make some kind of localized, preventable mistake.

Combating these problems require balancing a number of backup tradeoffs:

  • Frequency: How often you backup determines the amount of recent data you will lose when a disk fails.
  • History: The number of backup revisions you save determines how quickly you need to discover the data loss. Disk failure and natural disasters are immediately obvious, but silent corruption, and even user failure, might take a while to identify.
  • Distance: The further away your backup media is from your computer, the less correlated backup failure is with the data loss event. Hard drive failure is very localized, but a thief will steal your entire laptop bag, including the backup drive in the side pocket. Fire can potentially destroy all devices (backup and computer) in your home, or even a larger area.
  • Convenience: You are more likely to backup if it is fast and easy to do. You also want to be able to restore your files quickly and get back to work.

It is interesting to note that there is interplay between these factors. High frequency backups need to be paired with deep history, or you will not be able to recover from silent corruption and some kinds of user failure. Distance and convenience are usually inversely related. Online backups put the storage media very far away, but can be less convenient to restore due to limited network bandwidth.

After balancing these factors, I have some suggestions for people with Mac laptops or desktops. You should consider these stages, stopping whenever you hit the value limit of your data. That is, stage 1 is the most important, then stage 2, and finally stage 3.

Stage 1: Bootable External Hard Disk (~$150)

Buy an external 3.5″ Firewire hard disk that is at least as big as the hard disk inside your computer. For most people, this should cost no more than $100-$150. I suggest Firewire since all Macs in the last 5 years can boot from an external Firewire disk. Intel Macs can now boot from USB 2.0 disks as well, but Firewire in my experience still performs better. Don’t skimp on the size either! Disks are cheap these days, so there is no excuse for not backing up your entire computer.

Purchase SuperDuper! for $28, or download Carbon Copy Cloner. CCC is free, but I haven’t tried out version 3.0, so I can’t comment on whether it has solved the usability problems with 2.0. I know SuperDuper! works, so that’s why I still recommend it.

Now use SuperDuper! (or CCC) to make bootable, full disk backups of your computer. Both programs have backup modes which quickly refresh the backup by only copying changed files. After your first backup, later backups will probably only require 20-30 minutes to complete. Most importantly, if your disk fails, you can boot your backup and keep working while you replace the hardware. If the whole computer is shot, you can boot your backup disk on another Mac and still keep working. This is also a great thing to have when you perform major software upgrades.

A bootable, full disk backup is easy to do, and covers probably 80% of possible problems. You should keep the disk close to your computer desk, but only plug it into the computer during the backup. This will isolate the backup media from transient software problems, or other bugs, that might affect disks connected to the computer.

Stage 2: Online Backup ($60 + friends, hopefully free)

The two major problems with the bootable disk backup is a lack of history, and a lack of distance. Without history, you can only recover files damaged since your last backup. That is sufficient in the case of sudden disk failure, but not so good when you realize you corrupted your photo database a week ago. And, if you keep your backup disk nearby for quick and easy backups, disaster may strike both your computer and backup disk at the same time.

To mitigate both of these risks, I’ve concluded that online backups provide a sensible tradeoff. In particular, CrashPlan has impressed me with an attractive, simple, cross-platform program that does almost exactly what I want in a backup utility. Unlike some other online backup utilities, CrashPlan lets you save your backup data (in compressed and encrypted form) on their servers and/or your friend’s computers. They don’t even have to pay for the program if they just store backups for others. You only buy licenses for computers that you actively backup. Note that only the $60 version of the program supports any kind of version history, which I consider essential in this case.

You should check out the feature details. Perhaps the smartest feature is the emphasis on diverse backup destinations. If you save your data on several friends computers, you don’t have to worry so much if one of them happens to be offline when you need a backup. Additionally, when restoring, the software can stream your data from several sources at once, so if you have lots of friends, your restoration will go faster. Of course, if you want at least one stable, always available backup destination, you can store data on the CrashPlan server for $0.10/GB/month, with a $5 minimum.

So in stage 2, the recovery strategy is: backup your entire computer to the bootable external disk, and continuously backup your irreplaceable files (documents, photos, etc) with CrashPlan to your friend’s computers. Then, if your hard disk dies, you first go to backup disk, and supplement with the more recent files saved online. If your external backup disk is stolen/destroyed/lost, then at least you can recover your irreplaceable files, even if it means you are having to download them for a week.

(Aside: I haven’t yet decided how to fit Time Machine in 10.5 into this strategy. Time Machine provides revision history, but requires an external drive plugged into your computer. That doesn’t provide any backup distance, and it isn’t clear how this will work with a laptop, where I don’t want to have any disks plugged in most of the time.)

Stage 3: Offsite External Backup Disk ($100)

This extension is pretty simple: Buy a second external disk, and do a full, bootable backup to it once a month. Store the disk somewhere away from your computer and home, like at school or work. Then, if your main backup disk is destroyed or stolen, you can still retrieve the offsite backup disk, and then supplement it with the last month’s worth of files from the online backup.

Conclusion

After reaching stage 3, I decided my paranoia had been satisfied. There is a clear recovery plan for all likely failure scenarios, and the cost is very reasonable. Nominally it only requires $300 for this kind of peace of mind, but it can even be cheaper if you have some spare disks laying around (as I did) that you can put into external Firewire enclosures. Considering how much of my work (and leisure) involves my laptop, I consider $300 a pretty reasonable price for my data.

Jonathan Coulton

Sunday, October 21st, 2007

The catchy credits song (which in a strange way could be considered a spoiler for the game ending) in Portal introduced me to Jonathan Coulton, a very talented musician. For lack of a better term, you could call his music “nerdfolk.” Think of They Might Be Giants, but with less abstract lyrics. Like TMBG, Coulton’s songs are about topics you might consider non-musical, but unlike TMBG, you can always figure out what the song is about. (Oh, and Coulton never uses an accordion, which, depending on your musical tastes, could be a plus or minus.)

He’s got a nice intro page here. Lots of good stuff there, but Code Monkey is my current favorite. Coulton portrays the most endearing programmer/primate ever, and wraps it up in a catchy tune. The Code Monkey doesn’t get the girl by the end of the song, but he still thinks he might, and that leaves you smiling. There’s heart and smirk in the song, unlike a lot of nerdfolk artist, who only remember the smirk.

Other good songs include Chiron Beta Prime (where I swear Coulton is channeling John Linnell), Re: Your Brains and Mandelbrot Set.

Parallel Processing in Python with processing

Sunday, October 7th, 2007

There seems to be lots of discussion these days about concurrency going around the Python blogs. I first clued into the arguments when I read Bruce Eckel’s critique of Python 3000 and Guido’s discussion of why the GIL persists in Python. Ignoring the philosophical question of processes vs. threads, you will, as a practical matter, probably need to use multiple processes in Python to scale your programs to more than one core.

I just finished reading the free first issue of Python Magazine, where Doug Hellmann reviewed three options for running processes in parallel in Python: the subprocess module, the parallel python package, and the processing module. (The article is very good, as is the rest of the magazine. Go read it! I only wish that the US dollar were doing a little better as the magazine is priced in CAN$.)

In a bit of serendipity, Fredrik Lundh posted an article where he optimizes a log parsing program in Python, starting from a single process implementation, moving up to threaded and multi-process implementations. Remembering the processing module from Doug’s article, I decided to take a whack at porting one of Fredrik’s versions to use it.

The processing module is designed to be a nearly drop-in replacement for threading, so I started with the threaded version of the log parser. The changes were very simple, as you can see from the diff:

--- wf-4.py 2007-10-07 00:08:04.000000000 -0500
+++ wf-4-processing.py 2007-10-07 00:37:12.000000000 -0500
@@ -1,6 +1,7 @@
import re, sys
from collections import defaultdict
-import threading, Queue
+import processing
+

# 1: 2.7 seconds
# 2: 2.5 seconds
@@ -28,14 +29,13 @@
if not s:
break

-class Worker(threading.Thread):
+class Worker(processing.Process):
def run(self):
while 1:
chunk = queue.get()
if chunk is None:
break
- result.append(process(*chunk))
- queue.task_done()
+ result_queue.put(process(*chunk))

# --------------------------------------------------------------------

@@ -51,7 +51,8 @@
except:
count = 2

-queue = Queue.Queue()
+queue = processing.Queue()
+result_queue = processing.Queue()
result = []

for i in range(count):
@@ -59,10 +60,13 @@
w.setDaemon(1)
w.start()

+chunk_count = 0
for chunk in getchunks(FILE):
queue.put((FILE, chunk))
+ chunk_count += 1

-queue.join()
+for i in range(chunk_count):
+ result.append(result_queue.get())

count = defaultdict(int)
for item in result:

Basically, I had to add a results queue to pass the parsed log results back to the parent process, since the child processes no longer had direct access to a list object in the parent memory space. I could have used processing.Namespace to create a shared list, but a queue to feed back to the parent seemed more natural. It also provided a convenient way to test when all of the workers were finished. (Send one chunk out, look for one item in the return queue)

Update: Based on Fredrik’s comment below, I’ve updated to his new test version which reports the wallclock time difference with both time() and clock(). clock() doesn’t work so well measure the right quantity on OS X, so with time() the numbers make more sense now.

The speed improvement was amazing pretty good! Here are the times on my MacBook 1.83 GHz Core Duo sytem for other versions that were posted, as well as the processing version:

wf-1-gala.py: 6.40 sec (original by Santiago Gala)
wf-2.py: 2.43 sec (optimized, single threaded version)
wf-3.py: 2.25 sec (chunked version)
wf-4.py: 2.10 sec (chunked version with threads)
wf-5.py: 4.95 sec (2 processes using the subprocess module)
wf-6.py: 1.28 sec (2 processes, subprocess module, using memory-mapped files)
wf-4-processing.py: 1.34 sec (chunked version with processing module, 2 worker processes)

I should note that to get wf-5.py and wf-6.py to run, I had to add a call to flush() on the file object in the putobject() function. The flush() is shown in the blog entry, but is not present in the actual source files. Flush problem fixed now! Strangely, wf-5.py does very poorly, but wf-6.py does much better.

So it looks like the processing version does almost as well as the final, fancy version of the log parser, but with a lot less code. I think that counts as a win. :)

You can download wf-4-processing.py here, and the README over here explains how to download the test data set.

Update: As another aside: I saw very little benefit, and usually some penalty, when increasing the number of processes in any of the multi-process tests beyond two. I suspect this is because, unlike Linux, OS X is a little sluggish with the process creation.

Update #2: I finally tried this code on Ubuntu (single proc virtual machine, so no benchmarks) and discovered that the default implementation of processing.Queue uses a POSIX message queue when available. The objects passed around in this program are too large for such a queue, so you need to explicitly ask for a PipeQueue. The code has been updated to reflect this.

Update #3: And as proof that you can screw up even dead-easy parallel tasks, the above code in fact has a deadlock. A threading.Queue object has no length limit, so put() never blocks. However, processing.PipeQueue.get() can block if the pipe fills up.

The obvious way around this would be fix processing.PipeQueue so that it has the same semantics as threading.Queue. Ironically, one could solve that problem by having an extra thread in each process that monitored the pipes. Perhaps something like select() could be used instead to avoid having to introduce threads. That’s probably not portable to Windows, so more thinking here might be required.

Update #4: The processing test code in fact includes a program called test_worker.py which shows how to avoid this problem using a thread in the main process. Would be nice if there was a way to hide that in the Queue implementation though.

Entries (RSS)