Archive for the ‘Computers’ Category

64-bit Python on Macs

Saturday, June 7th, 2008

There was a question recently on the ROOT mailing list where someone was having a problem using the python executable that comes with Mac OS X 10.5 together with 64-bit libraries. I went digging around, and noticed a strange discrepancy. The compiled python libraries that ship with Leopard are four-architecture universal binaries:

stan@Rover:/usr/lib/python2.5/config$ file libpython2.5.a 
libpython2.5.a: Mach-O universal binary with 4 architectures
libpython2.5.a (for architecture ppc7400):
    Mach-O dynamically linked shared library ppc
libpython2.5.a (for architecture ppc64):
    Mach-O 64-bit dynamically linked shared library ppc64
libpython2.5.a (for architecture i386):
    Mach-O dynamically linked shared library i386
libpython2.5.a (for architecture x86_64):
    Mach-O 64-bit dynamically linked shared library x86_64

(Reformatted to avoid spilling into my sidebar…)

However, the python executable is not compiled for 64-bit architectures:

stan@Rover:/usr/bin$ file python2.5
python2.5: Mach-O universal binary with 2 architectures
python2.5 (for architecture ppc7400):
    Mach-O executable ppc
python2.5 (for architecture i386):
    Mach-O executable i386

I hadn’t noticed this since my MacBook is the early Core Duo model, rather than Core 2 Duo, so the hardware does not support x86_64.  Apple may have good reasons to force all python scripts to run as 32-bit applications even on 64-bit systems, but I don’t know what they are.

If you find yourself wanting 64-bit python, it’s very easy to make your own, since all the Python libraries on Leopard are already 32/64-bit universal.  Just go grab the very short python64.c from the ROOT svn repository, and compile it like this:

gcc -arch ppc64 -arch x86_64 -arch i386 -arch ppc  \
    -o python64 -I/usr/include/Python2.5 -lpython2.5 python64.c

(Note that this has nothing to do with the ROOT libraries. If you have no idea what ROOT is, the above will still work.)

Now you can check the python64 executable:

Rover:tmp stan$ file python64
python64: Mach-O universal binary with 4 architectures
python64 (for architecture ppc64):
    Mach-O 64-bit executable ppc64
python64 (for architecture x86_64):
    Mach-O 64-bit executable x86_64
python64 (for architecture i386):
    Mach-O executable i386
python64 (for architecture ppc7400):
    Mach-O executable ppc

All four architectures are now present. I haven’t got a 64-bit Mac to try this out on, so I don’t know if it actually runs correctly there. Being universal, this binary works just fine on my 32-bit Mac, of course.
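
One quick way to see which slice of a universal binary is actually running is to ask the interpreter for its pointer size. This is a minimal sketch using only the standard library; the helper name pointer_bits is made up for illustration.

```python
import platform
import struct


def pointer_bits():
    """Return the width in bits of a C pointer in the running interpreter."""
    # "P" is the struct format code for a C void*, so its size is
    # 4 bytes in a 32-bit process and 8 bytes in a 64-bit one.
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print("%d-bit interpreter on %s" % (pointer_bits(), platform.machine()))
```

The stock python2.5 described above should report 32 bits. A python64 build running on 64-bit hardware would be expected to report 64, but as noted, that has not been tested here.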

Four Languages

Saturday, May 31st, 2008

Having read the Slashdot story Programming as Part of a Science Education?, I have to agree with the author, mostly. These days, every scientist and engineer should be fluent in four kinds of languages:

  1. Verbal: English still works, but the way things are going, Mandarin Chinese is becoming a useful #2.
  2. Pictorial: Each field has its own way to visualize information. In physics, the art of the histogram is paramount.
  3. Mathematical: Calculus is the most relevant language for most sciences, but the language of graph theory and abstract algebra can be handy in some niches.
  4. Computational: FORTRAN was king for a while, and C++ is the current king, but on a very slow decline. Personally, I think all scientists would benefit from learning Python first, and then supplement with whatever compiled language is used by their colleagues.

Some people might argue that the first two are languages, and the second two are tools, but I disagree. Both math and programming are ways to express ideas, one symbolic and one procedural. (Unless you do your data analysis in PROLOG, in which case you are impressively loony.) On the physics experiment I work on, our 15-year-old simulation program (woo FORTRAN!) is in many respects the most precise, unambiguous description of how our experiment actually functions. Sure, there are piles of words, plots, and equations also documenting everything, but sometimes when I want to know how something behaves, I go straight for the code and read it.

In effect, we have built a procedural description of our particle detector, and it is worth recognizing the code as a form of communication. It certainly changes the way you think about programming.

OS X Leopard Roundup

Tuesday, November 6th, 2007

After spending 4 days with the new 10.5 release of Mac OS X, I’ve been pretty impressed. Visually, things have improved, except for the obvious problems with the dock and the menu bar. I had the same initial negative reaction to the translucent 3D dock that most other people did, but it has grown on me slowly. The translucent menu bar, however, is simply atrocious if you do not have a very color-neutral (black, white or grey) background. The most common way to correct this is to add a white stripe to the top of your background image. OpaqueMenuBar does this for you automatically whenever your background changes.

Looks aside, I think Leopard’s biggest advance is the amount of attention Apple has shown to developers. (Clearly, OS X is stealing the love from the iPhone…)

X11.app 2.0 is generally a huge improvement. Now based on the X.org 7.2 code base, it draws on the XDarwin code base for X/OS X integration, rather than the source of the old Panther/Tiger X11.app, which was based on XFree86. As a result, there have been a number of regressions, but it sounds like the future of X11.app will be much better. In particular, the source is now being hosted in the X.org git repository, and the main developer is committed to engaging with the user community.

There are a number of glitches, though. Using a fullscreen X desktop (which I suspect is not terribly common) is broken, as is dragging an X window to another display if you have two monitors on your computer. Most annoying, the patch to fix the yellow cursor bug was dropped on the floor, and didn’t make it into X11.app 2.0. The author has since fixed this in an alpha release on the XDarwin wiki page. The Xquartz binary he posts there works great for me, so I’m happy for now.

The launchd program, which is like init/rc/cron/at/inetd all rolled together, is used to pull off two neat tricks in Leopard. First, the $DISPLAY variable is set to a socket that launchd monitors, so the X server now starts automatically on demand. This means you can start up Terminal, do some work, and as soon as you start an X application, you’ll see X11.app appear. The second trick is that an ssh-agent is now started on demand when you use SSH. Apple’s ssh-agent can fetch passphrases for your keys from the OS X Keychain as well. You don’t need to use SSHKeychain any more (which is good since it had a major memory leak on my system). The only downside to the ssh-agent Keychain support is that there is no obvious way to expire the ssh keys in the agent when the Keychain locks. Once those keys are decrypted into the ssh-agent memory, they stay valid even after you lock your Keychain.

Python has been updated to 2.5.1, which is great because it solves a linker problem I had with the Python bindings for ROOT. The Leopard install of Python includes easy_install, numpy, twisted, and some other handy stuff. In addition, there are new Objective-C/Cocoa bindings, and it comes along with py2app, for generating proper-looking Mac applications entirely written in Python!

The Cocoa programmers are probably excited about Objective-C 2.0, which adds garbage collection and some other improvements, like a compact syntax for looping over an iterator. I’ve been reading up on Objective-C, and the message passing style of object-orientation reminds me greatly of Python’s duck-typing. I find the syntax unspeakably ugly looking, but that’s really just a matter of taste. You can get used to anything, really. :)

Shuffling Algorithms

Tuesday, October 30th, 2007

Last week, my Ph.D. advisor needed to implement a randomized shuffle. The standard algorithm to do this is usually credited to Knuth (it is also known as the Fisher-Yates shuffle), and is very easy to describe. Given an array x[] of length N:

for i in 1..N:
    j = random number in range [i, N], inclusive
    swap x[i] with x[j]

This algorithm has $$N!$$ possible code paths, which conveniently corresponds to the number of unique permutations of x[]. All that remains is to prove that given a particular permutation of x[], you can reach it with N swaps, and you’ve shown that the above algorithm is complete (can generate any permutation) and unbiased (produces all permutations with equal probability).
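
For reference, the pseudocode translates almost line for line into Python. This is a minimal sketch with 0-based indices; the function name shuffle_in_place and the optional rng argument are illustrative choices, not anything from the original program.

```python
import random


def shuffle_in_place(x, rng=random):
    """Shuffle the list x in place using the swap rule described above."""
    n = len(x)
    for i in range(n):
        # Pick j uniformly from i..n-1 inclusive (the 0-based form of [i, N]).
        j = rng.randrange(i, n)
        x[i], x[j] = x[j], x[i]
    return x


if __name__ == "__main__":
    print(shuffle_in_place(list("ABCDEFG")))
```

Passing a seeded random.Random instance as rng makes a run repeatable, which is handy when testing.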

Amazingly, any trivial change to the above algorithm will ruin it. The two common mistakes are:

  1. j = random number in range [1,N], inclusive: Here, you might imagine that allowing an element to swap with elements both before and after it will increase randomness somehow. It is true that this version has $$N^N$$ code paths, which is greater than $$N!$$, so all permutations are potentially reachable. However, $$N^N$$ is not a multiple of $$N!$$ for $$N > 2$$, so the code paths cannot be divided evenly among the permutations. Not all permutations are equally likely (as we’ll see below), which results in a biased algorithm.
  2. j = random number in range [i+1,N], inclusive: By removing the possibility of j = i, you might think you are increasing randomness by never allowing an element to swap with itself, but that is not true. This incorrect version of the algorithm (which has to stop at i = N-1, since the range is empty for i = N) only has $$(N-1)!$$ possible code paths, but you know there are $$N!$$ possible permutations. Therefore some permutations must be unreachable, and the algorithm is incomplete.

To explore these variations, I wrote a small Python program to test the correct and two incorrect versions of the algorithm and report some statistics. The variable $$n$$ sets the length of the sequence to shuffle. Since $$n^n$$ grows very quickly and this program stores every generated sequence in memory, you probably don’t want to try anything larger than $$n=7$$. Here are the results for $$n=7$$:

swap_incorrect: orderings, total = 823543 unique = 5040
repetitions = 163 +/- 50
variation = 30.495%
least common seqs = ['GABCDEF']

swap_nearly_correct: orderings, total = 720 unique = 720
repetitions = 1 +/- 0
variation = 0.000%
least common seqs = (all)

swap_correct: orderings, total = 5040 unique = 5040
repetitions = 1 +/- 0
variation = 0.000%
least common seqs = (all)

Note that swap_incorrect corresponds to case 1 above, and swap_nearly_correct corresponds to case 2. The first incorrect algorithm and the correct algorithm both produce the expected number of unique permutations, $$7! = 5040$$. However, the first incorrect algorithm has substantial variation in the frequency of each ordering, with a standard deviation of 30%! The second incorrect algorithm (which never allows j = i) has no repetition, but fails to generate all the possible permutations.
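
The program itself is not shown here, so the following is only a minimal reconstruction of the idea: walk every possible code path for each variant and count how often each ordering comes out. The names enumerate_paths and VARIANTS are assumptions made for this sketch; only the three variant labels are taken from the output above.

```python
from collections import Counter
from itertools import product


def enumerate_paths(n, choices):
    """Count how often each ordering appears over every possible code path.

    choices(i, n) returns the 0-based indices j that step i may swap with.
    """
    start = [chr(ord("A") + k) for k in range(n)]
    # Skip steps with nothing to choose from (the last step of the
    # variant that never lets an element swap with itself).
    steps = [(i, choices(i, n)) for i in range(n)]
    steps = [(i, js) for i, js in steps if len(js) > 0]
    counts = Counter()
    for path in product(*(js for _, js in steps)):
        x = list(start)
        for (i, _), j in zip(steps, path):
            x[i], x[j] = x[j], x[i]
        counts["".join(x)] += 1
    return counts


VARIANTS = {
    "swap_incorrect": lambda i, n: range(0, n),           # j in [1, N]
    "swap_nearly_correct": lambda i, n: range(i + 1, n),  # j in [i+1, N]
    "swap_correct": lambda i, n: range(i, n),             # j in [i, N]
}

if __name__ == "__main__":
    n = 4
    for name, choices in VARIANTS.items():
        counts = enumerate_paths(n, choices)
        print("%s: total = %d unique = %d min = %d max = %d" % (
            name, sum(counts.values()), len(counts),
            min(counts.values()), max(counts.values())))
```

Because every path is enumerated rather than sampled, the counts are exact. The price is that the work grows as $$n^n$$ for the first variant, so small $$n$$ is the only practical choice.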

A general result you find if you vary $$n$$ in the above program is that the least common sequence in the first incorrect algorithm is always the original sequence shifted to the right by one index, with the last element moved to the front. It’s not immediately obvious to me how one would prove that for all $$n$$, but it is interesting.

If your $$n$$ gets big enough, $$n!$$ might be larger than the period of your pseudo-random number generator, which means that in practice, you might never be able to generate some orderings. You probably won’t care, since with $$n$$ that large, it would be tough to enumerate all possible orderings anyway. Still good to keep in mind, though…
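
To put a number on "big enough", here is a small sketch that finds the first $$n$$ whose $$n!$$ exceeds a given generator period. The choice of generator is an assumption: the example uses the Mersenne Twister period of $$2^{19937}-1$$, which is what Python’s random module is built on. The function name is made up for illustration.

```python
def smallest_n_exceeding(period):
    """Return the smallest n for which n! is larger than period."""
    n, factorial = 1, 1
    while factorial <= period:
        n += 1
        factorial *= n
    return n


# Assumption: the Mersenne Twister, whose period is 2**19937 - 1.
MT_PERIOD = 2**19937 - 1

if __name__ == "__main__":
    print(smallest_n_exceeding(MT_PERIOD))
```

Python integers have arbitrary precision, so the factorial can be compared against the period exactly rather than through logarithms.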

How Much Is Your Data Worth?

Saturday, October 27th, 2007

Backups are one of those things you only take seriously after you experience serious data loss and realize the cost, monetary and otherwise, of losing files. In my case, I started thinking very hard about backups last year when a new hard drive in my MacBook died after a few weeks. I even had a backup, but it was incomplete and a month out of date. It was then I realized that my backup approach was haphazard, and not indicative of how important my data was to me.

Now I’ve decided that a good way to approach the problem is to imagine that one morning, you boot your computer and all of your files are gone. How much would you pay to get your data back? $100? $500? $1000? Without a backup strategy, you might find that when a thief or a malfunctioning disk head separates you from your data, no amount of money will bring it back. Only with some amount of planning can you have any control over how much data recovery will cost.

The types of failure modes you may have to deal with include:

  • Catastrophic hardware or software failure: Your hard drive, computer, or software suddenly and without (much) warning destroys some or all of your data. Moreover, it is obvious when this failure happens, so you can take immediate action. This is probably the most common failure, and is the first thing to be addressed by a backup system.
  • Theft: Someone steals your laptop, or breaks into your house and steals your computer.
  • Physical Disaster: Fire, flood, dropping your laptop on the pavement, backing over it with a car, etc.
  • Silent corruption: Malfunctioning software or hardware might corrupt data too slowly for you to notice immediately.
  • User Failure: This is when you accidentally delete a directory, overwrite an important document, or otherwise make some kind of localized, preventable mistake.

Combating these problems requires balancing a number of backup tradeoffs:

  • Frequency: How often you back up determines the amount of recent data you will lose when a disk fails.
  • History: The number of backup revisions you save determines how quickly you need to discover the data loss. Disk failure and natural disasters are immediately obvious, but silent corruption, and even user failure, might take a while to identify.
  • Distance: The further away your backup media is from your computer, the less correlated backup failure is with the data loss event. Hard drive failure is very localized, but a thief will steal your entire laptop bag, including the backup drive in the side pocket. Fire can potentially destroy all devices (backup and computer) in your home, or even a larger area.
  • Convenience: You are more likely to back up if it is fast and easy to do. You also want to be able to restore your files quickly and get back to work.

It is interesting to note that there is interplay between these factors. High frequency backups need to be paired with deep history, or you will not be able to recover from silent corruption and some kinds of user failure. Distance and convenience are usually inversely related. Online backups put the storage media very far away, but can be less convenient to restore due to limited network bandwidth.

After balancing these factors, I have some suggestions for people with Mac laptops or desktops. You should consider these stages, stopping whenever you hit the value limit of your data. That is, stage 1 is the most important, then stage 2, and finally stage 3.

Stage 1: Bootable External Hard Disk (~$150)

Buy an external 3.5″ Firewire hard disk that is at least as big as the hard disk inside your computer. For most people, this should cost no more than $100-$150. I suggest Firewire since all Macs in the last 5 years can boot from an external Firewire disk. Intel Macs can now boot from USB 2.0 disks as well, but Firewire in my experience still performs better. Don’t skimp on the size either! Disks are cheap these days, so there is no excuse for not backing up your entire computer.

Purchase SuperDuper! for $28, or download Carbon Copy Cloner. CCC is free, but I haven’t tried out version 3.0, so I can’t comment on whether it has solved the usability problems with 2.0. I know SuperDuper! works, so that’s why I still recommend it.

Now use SuperDuper! (or CCC) to make bootable, full disk backups of your computer. Both programs have backup modes which quickly refresh the backup by only copying changed files. After your first backup, later backups will probably only require 20-30 minutes to complete. Most importantly, if your disk fails, you can boot your backup and keep working while you replace the hardware. If the whole computer is shot, you can boot your backup disk on another Mac and still keep working. This is also a great thing to have when you perform major software upgrades.

A bootable, full disk backup is easy to do, and covers probably 80% of possible problems. You should keep the disk close to your computer desk, but only plug it into the computer during the backup. This will isolate the backup media from transient software problems, or other bugs, that might affect disks connected to the computer.

Stage 2: Online Backup ($60 + friends, hopefully free)

The two major problems with the bootable disk backup are a lack of history and a lack of distance. Without history, you can only recover files damaged since your last backup. That is sufficient in the case of sudden disk failure, but not so good when you realize you corrupted your photo database a week ago. And, if you keep your backup disk nearby for quick and easy backups, disaster may strike both your computer and backup disk at the same time.

To mitigate both of these risks, I’ve concluded that online backups provide a sensible tradeoff. In particular, CrashPlan has impressed me with an attractive, simple, cross-platform program that does almost exactly what I want in a backup utility. Unlike some other online backup utilities, CrashPlan lets you save your backup data (in compressed and encrypted form) on their servers and/or your friends’ computers. Your friends don’t even have to pay for the program if they just store backups for others. You only buy licenses for computers that you actively back up. Note that only the $60 version of the program supports any kind of version history, which I consider essential in this case.

You should check out the feature details. Perhaps the smartest feature is the emphasis on diverse backup destinations. If you save your data on several friends’ computers, you don’t have to worry so much if one of them happens to be offline when you need a backup. Additionally, when restoring, the software can stream your data from several sources at once, so if you have lots of friends, your restoration will go faster. Of course, if you want at least one stable, always available backup destination, you can store data on the CrashPlan server for $0.10/GB/month, with a $5 minimum.
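
The server pricing is simple enough to sketch as a formula. The example below assumes the $5 minimum is a minimum charge per month, which the quoted price does not spell out; the function and constant names are hypothetical.

```python
RATE_CENTS_PER_GB = 10   # $0.10 per GB per month, from the text
MINIMUM_CENTS = 500      # assumed to be a $5 minimum charge per month


def monthly_cost_dollars(gigabytes):
    """Estimated monthly charge in dollars for storing the given number of GB."""
    # Work in whole cents so the comparison with the minimum is exact.
    cents = max(MINIMUM_CENTS, round(gigabytes * RATE_CENTS_PER_GB))
    return cents / 100.0


if __name__ == "__main__":
    for gb in (20, 50, 80, 120):
        print("%d GB: $%.2f per month" % (gb, monthly_cost_dollars(gb)))
```

Under that assumption the minimum covers the first 50 GB, so the per-gigabyte rate only matters for larger backups.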

So in stage 2, the recovery strategy is: back up your entire computer to the bootable external disk, and continuously back up your irreplaceable files (documents, photos, etc.) with CrashPlan to your friends’ computers. Then, if your hard disk dies, you first go to the backup disk, and supplement with the more recent files saved online. If your external backup disk is stolen/destroyed/lost, then at least you can recover your irreplaceable files, even if it means you have to spend a week downloading them.

(Aside: I haven’t yet decided how to fit Time Machine in 10.5 into this strategy. Time Machine provides revision history, but requires an external drive plugged into your computer. That doesn’t provide any backup distance, and it isn’t clear how this will work with a laptop, where I don’t want to have any disks plugged in most of the time.)

Stage 3: Offsite External Backup Disk ($100)

This extension is pretty simple: Buy a second external disk, and do a full, bootable backup to it once a month. Store the disk somewhere away from your computer and home, like at school or work. Then, if your main backup disk is destroyed or stolen, you can still retrieve the offsite backup disk, and then supplement it with the last month’s worth of files from the online backup.

Conclusion

After reaching stage 3, I decided my paranoia had been satisfied. There is a clear recovery plan for all likely failure scenarios, and the cost is very reasonable. Nominally it only requires $300 for this kind of peace of mind, but it can even be cheaper if you have some spare disks lying around (as I did) that you can put into external Firewire enclosures. Considering how much of my work (and leisure) involves my laptop, I consider $300 a pretty reasonable price for my data.
