Archive for March, 2005

Please put CERNLIB out to pasture

Thursday, March 10th, 2005

Since I haven’t actually said much physics-related on this blog yet, I decided to post on CERNLIB. This topic is especially near to my heart this week as I try to compile it with Sun Studio 10 on a Solaris 10 workstation.

CERNLIB is the granddaddy of all modern physics software. This library is decades old and has accumulated hundreds of thousands of lines of FORTRAN (and a little C) code to do just about everything you could possibly want. Need to take the cross product of two vectors? Sure! How about evaluate some gamma functions? No problem! It will even let you simulate the propagation of electrons through a complex device, AND then make pretty plots of the results too! It’s like a fully stocked hardware store loaded with tools, ready to reduce any physics programming problem to a small pile of rubble.

Well, if you can get it to compile, that is. While a brilliant piece of physics computing, from a software engineering standpoint, CERNLIB has become a disaster. On every modern OS and architecture I have tried, CERNLIB has failed to build out of the box. Having done battle with autoconf/automake, I can understand why some people complain about them for managing cross-platform builds. But after a session with CERNLIB’s imake-based system, I would take back every mean thing I’ve ever said about autoconf.

Though, imake isn’t half as bad as the in-code hacks used to “achieve” cross-platformness. Today’s inductee into the C hall of shame is tcpaw.c. I think the goal here was to create a uniform interface to the TCP/IP API of every OS ever used in scientific computing. The result is so many #ifdefs, it might take longer to preprocess this file than to compile it. Please, don’t ever, EVER, write something like this. When it fails to build on a platform I’m working on (as it always does), I have to spend 10 minutes just tracing #ifdef’s before I can even start fixing it.

The other exciting build annoyance of the day was the abuse of the #include directive in FORTRAN. From looking at the code, it seems that FORTRAN lacks the equivalent of the C/C++ macro substitution or the typedef facility. The problem is that if you want to define a type alias my_float which is single precision in one build configuration and double precision in another, or to deal with different syntaxes for specifying double precision on different platforms, then you have trouble in FORTRAN. The hack used in CERNLIB 2004 is to do the following:

  • Define an include file with the appropriate preprocessor logic to select the type expression you want. For example:
    *  def64.inc
    *
    #if !defined(CERNLIB_DOUBLE)
          REAL
    #endif
    #if (defined(CERNLIB_DOUBLE))&&(defined(CERNLIB_F90))
          REAL(2)
    #endif
    #if (defined(CERNLIB_DOUBLE))&&(!defined(CERNLIB_F90))
          DOUBLE PRECISION
    #endif
    
  • Wherever you need to use your user-defined type, include the header file you defined earlier. Follow the include line immediately with a FORTRAN line continuation (by putting something in column 6), then put your variable names there:
          FUNCTION BINOM(X,K)
    #include "gen/def64.inc"
         + D,DBINOM
          SROUND(D)=D+(D-SNGL(D))
          BINOM=SROUND(DBINOM(DBLE(X),K))
          RETURN
          END
    

    Here, D and DBINOM will be declared with the appropriate expression for double precision.

Then, if you are really lucky, this will give you the desired effect. If you are not lucky, then this won’t compile at all. Sun Studio 10 (and probably g77 according to some random googling) explicitly forbid line continuations immediately after an #include statement. It took quite a lot of sed action to remove all instances of this idiom from the mathlib code.
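For illustration, here is a minimal Python sketch of that kind of cleanup. It is not the sed script used above, which isn't shown; it assumes a double-precision, non-F90 build, so that the include can simply be replaced by a literal DOUBLE PRECISION declaration. The function name inline_def64 and its type_decl parameter are made up for this example.

```python
import re

# The include line used by the CERNLIB idiom shown above.
INCLUDE_RE = re.compile(r'^#include\s+"gen/def64\.inc"\s*$')
# Fixed-form continuation line: five blanks, then any character in
# column 6 that is neither a blank nor '0'.
CONTINUATION_RE = re.compile(r'^ {5}[^ 0](.*)$')


def inline_def64(source, type_decl="DOUBLE PRECISION"):
    """Replace '#include "gen/def64.inc"' + continuation with one declaration.

    Assumption: the build wants type_decl everywhere, so the preprocessor
    switch inside def64.inc is no longer needed.
    """
    lines = source.split("\n")
    out = []
    i = 0
    while i < len(lines):
        if INCLUDE_RE.match(lines[i]) and i + 1 < len(lines):
            cont = CONTINUATION_RE.match(lines[i + 1])
            if cont:
                # Statement text starts in column 7 in fixed-form FORTRAN.
                out.append("      " + type_decl + " " + cont.group(1).strip())
                i += 2
                continue
        out.append(lines[i])
        i += 1
    return "\n".join(out)
```

Applied to the BINOM example, the include line and the "+ D,DBINOM" continuation collapse into a single "DOUBLE PRECISION D,DBINOM" statement. The price is that the single/double switch is now hard-coded rather than selected by the preprocessor.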

I’m really just beating a dead horse here. CERNLIB has lots of known portability problems, and apparently some real 64-bit platform showstoppers. CERN voted to drop active support for it in 2003, and just limp along with a few more bug fix releases. CERNLIB’s replacement is ROOT, a set of C++ libraries along with an “analysis environment” that uses interpreted C++ (Really! Who’d a thunk it?). Some of the FORTRAN fans I know like to bash ROOT for being bloated, buggy, unreliable, and slow. Some of that is anti-C++ sentiment, which I can sympathize with a little (though not much). But I think most of it is that ROOT is not nearly as old as CERNLIB and still has a feeling of rapid development (i.e. instability) to it.

I don’t think I’ll mind seeing the days of FORTRAN and CERNLIB go. Besides, ROOT now has Python bindings, so I can use ROOT libraries in the One True Scripting Language.

udev + Software RAID + LVM2 == Storage Nirvana

Wednesday, March 9th, 2005

My storage rack is pretty much set up now, thanks to some really great Linux kernel technology. Here’s a summary of how I set it all up. All config files are for Gentoo, your distribution may vary (YDMV).

udev

udev is now recognizing all 4 of my IDE drives in the firewire rack and assigning them persistent device names. I do this with the following udev config file:

# /etc/udev/rules.d/10-local.rules
BUS="scsi", KERNEL="sd*", SYSFS{model}="WD2500JB-00FUA0 ", NAME="%k", SYMLINK="ext/a1d%n"
BUS="scsi", KERNEL="sd*", SYSFS{model}="WD2500PB-19FBA0 ", NAME="%k", SYMLINK="ext/a2d%n"
BUS="scsi", KERNEL="sd*", SYSFS{model}="6L060J3         ", NAME="%k", SYMLINK="ext/b1d%n"
BUS="scsi", KERNEL="sd*", SYSFS{model}="IC35L060AVER07-0", NAME="%k", SYMLINK="ext/b2d%n"

This produces device nodes like /dev/ext/a1d for the disk and /dev/ext/a1d1 for the first partition, etc. The identification key is the model number, which is good enough for me, but possibly frustrating if you have several identical disks. I haven’t found a better way than this yet…
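One detail that is easy to get wrong by hand: the model strings in the rules above all appear to be padded with trailing spaces to 16 characters, and the match has to include that padding. Here is a small Python sketch that generates rules in the same syntax from a model/alias table. The helper name udev_rule and the 16-character width are assumptions based on the strings shown, not something udev provides.

```python
def udev_rule(model, alias, width=16):
    """Build one rule line in the old udev syntax used above.

    Assumption: sysfs reports the model padded with spaces to `width`
    characters, as the four strings in the rules file suggest.
    """
    padded = model.ljust(width)
    return (
        f'BUS="scsi", KERNEL="sd*", SYSFS{{model}}="{padded}", '
        f'NAME="%k", SYMLINK="ext/{alias}%n"'
    )


# The four drives from the rules file, keyed by their unpadded model names.
DRIVES = [
    ("WD2500JB-00FUA0", "a1d"),
    ("WD2500PB-19FBA0", "a2d"),
    ("6L060J3", "b1d"),
    ("IC35L060AVER07-0", "b2d"),
]

for model, alias in DRIVES:
    print(udev_rule(model, alias))
```

Comparing the output against the model file under /sys for each disk is still worthwhile, since the padding width is only inferred here.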

Software RAID

When I first learned about RAID, I thought hardware was the only way to go. Now I’m convinced that, unless you are one of the big boys with the budget for a serious RAID controller, software is the only way to go. First of all, most consumer-level RAID controllers are crap. They aren’t true RAID controllers, and are significantly slower than using the native CPU, unless your computer is rather old. Just read through Jeff Garzik’s SATA RAID FAQ to get a flavor for the problems with most “RAID” controllers. Second, by implementing RAID in the operating system, your configuration is actually portable. Your computer, storage controller, whatever, can die, and you can take the whole array to a different computer, boot a similar Linux kernel, and *BAM*, your RAID is back in business. You can even move to a whole different bus (say, Firewire to IDE).

The biggest drawback I could think of to software RAID was actually pointed out somewhere in the HOWTO. For RAID 1 (the setup I’ll be using), writes take twice the I/O bandwidth that reads require because the system has to send the same block of data to both drives. A real hardware controller would do the copy internally, keeping the second copy of the data off your main bus. This could be a major problem doing NFS writes over gigabit where you would basically have three copies of the same data flying around. This won’t be a major problem for me since my current configuration uses Firewire 400, which easily limits the individual drive speeds to the point where redundant transfers on the PCI bus don’t matter. When I move to Firewire 800, the controller will run at least at 66 MHz PCI, and the GigE is on a separate PCI-X bus, so it still won’t be a problem.
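The “three copies” bookkeeping can be written down as a back-of-the-envelope calculation. This Python sketch assumes the simplest possible model: every byte crosses one shared bus once on the way in from the network and once per mirror on the way out to disk. The function name and the 40 MB/sec figure are made up for illustration.

```python
def raid1_bus_traffic(write_rate_mb_s, mirrors=2, via_network=True):
    """Total bus traffic (MB/sec) caused by one software RAID 1 write stream.

    Assumption: network card and disk controller share a single bus, and
    each copy of the data crosses it exactly once.
    """
    inbound = write_rate_mb_s if via_network else 0.0
    outbound = write_rate_mb_s * mirrors  # one copy per mirror disk
    return inbound + outbound


# A hypothetical 40 MB/sec NFS write stream onto a two-disk mirror.
print(raid1_bus_traffic(40))                     # network + two disk copies
print(raid1_bus_traffic(40, via_network=False))  # local writes only
```

For comparison, plain 32-bit, 33 MHz PCI tops out at a theoretical 133 MB/sec, so in this model a fast NFS write stream could get uncomfortably close to the limit if everything shared that bus. Putting the network card on its own bus removes the inbound term.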

The actual RAID setup was very, very easy. I used mdadm to build the RAID out of the two 250 GB drives:

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/ext/a[12]d1

Just one line, and then the md subsystem went off to build the RAID set by synchronizing both disks. You can monitor the progress by looking at /proc/mdstat:

Personalities : [linear] [raid0] [raid1] [raid5]
md0 : active raid1 sdc1[1] sdb1[0]
      244195904 blocks [2/2] [UU]
      [======>..............]  resync = 34.9% (85238016/244195904) finish=215.2min speed=12307K/sec
unused devices: <none>

The cool part is that you can start using the array immediately without waiting for the synchronization to finish. The task just keeps running in the background, only using I/O bandwidth when you aren’t doing anything else with it. Of course, at this stage, the array is vulnerable and a single disk failure would take you down, but I don’t mind living dangerously. :)
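If you would rather script the check than eyeball it, the progress line in the /proc/mdstat output above is easy to pick apart. This is a minimal Python sketch that assumes the line format shown in that output; the function name sync_progress and the dictionary keys are made up for this example.

```python
import re

# Matches the progress line md prints while a mirror is syncing, e.g.
#   resync = 34.9% (85238016/244195904) finish=215.2min speed=12307K/sec
# \s* also matches a newline, so a wrapped line still parses.
PROGRESS_RE = re.compile(
    r"(resync|recovery)\s*=\s*([\d.]+)%\s*\((\d+)/(\d+)\)\s*"
    r"finish=([\d.]+)min\s+speed=(\d+)K/sec"
)


def sync_progress(mdstat_text):
    """Return details of the first resync/recovery found, or None if idle."""
    match = PROGRESS_RE.search(mdstat_text)
    if match is None:
        return None
    action, percent, done, total, finish, speed = match.groups()
    return {
        "action": action,
        "percent": float(percent),
        "done_blocks": int(done),
        "total_blocks": int(total),
        "finish_minutes": float(finish),
        "speed_k_per_sec": int(speed),
    }
```

On a live system you would feed it the contents of /proc/mdstat, for example sync_progress(open("/proc/mdstat").read()). A None result means no array is currently syncing, or that the kernel prints the line in a format this sketch does not expect.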

Then I wrote a config file to help the startup scripts find and activate the RAID:

#/etc/mdadm.conf
MAILADDR stan
PROGRAM /usr/sbin/handle-mdadm-events
DEVICE /dev/ext/a1d1 /dev/ext/a2d1
ARRAY  /dev/md0 devices=/dev/ext/a1d1,/dev/ext/a2d1

LVM2

Finally, I put LVM2 on the top layer to abstract filesystems from the physical storage. My plan is to have one file system span several RAID 1 disk pairs. That way, when I want to add more storage, I just get two disks, make them into a RAID 1 set, then add them to the LVM volume group and resize the filesystem to make use of them. LVM2 is dead easy to use. Here’s what I did (not quite in this order, but whatever) to make a logical volume out of my RAID 1 set, plus the other two disks (not RAID since I’m running short of disk space at the moment):

pvcreate /dev/md0
pvcreate /dev/ext/b1d1
pvcreate /dev/ext/b2d1
vgcreate firewire /dev/md0 /dev/ext/b1d1 /dev/ext/b2d1
lvcreate --size 200G --name storage firewire
mkreiserfs /dev/firewire/storage
mount /dev/firewire/storage /storage

Later, I resized it while it was mounted just for fun:

lvextend -L+100G firewire/storage
resize_reiserfs /dev/firewire/storage

Finally, I modified /etc/lvm/lvm.conf to help the startup scripts find things. The important line was:

filter = ["a|/dev/ext/.*|", "a|/dev/md.*|", "r|.*|"]
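The growth plan described at the top of this section (buy two disks, mirror them, add the mirror to the volume group, grow the filesystem) can be spelled out as a command sequence. The Python sketch below only builds and prints the commands and does not run anything. The device names /dev/ext/c1d1 and /dev/ext/c2d1, the array name /dev/md1, and the +200G size are hypothetical placeholders.

```python
def grow_storage_commands(disk_a, disk_b, md_device="/dev/md1",
                          vg="firewire", lv="storage", extra="+200G"):
    """List the commands that add a new RAID 1 pair to the volume group.

    Nothing is executed; the caller decides whether to run the commands.
    All default values are placeholders for illustration.
    """
    return [
        # Mirror the two new disks.
        ["mdadm", "--create", md_device, "--level=1",
         "--raid-devices=2", disk_a, disk_b],
        # Hand the new mirror to LVM and add it to the existing group.
        ["pvcreate", md_device],
        ["vgextend", vg, md_device],
        # Grow the logical volume, then the filesystem on top of it.
        ["lvextend", "-L" + extra, f"{vg}/{lv}"],
        ["resize_reiserfs", f"/dev/{vg}/{lv}"],
    ]


for command in grow_storage_commands("/dev/ext/c1d1", "/dev/ext/c2d1"):
    print(" ".join(command))
```

A new array would presumably also need its own udev rules and an ARRAY line in /etc/mdadm.conf, following the same pattern as the first pair.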

Happy, happy storage

All in all, it was very pleasant. One last tip for firewire hard drives: I’ve found that on just about every Linux system I’ve tried with a variety of firewire enclosures, I need to load the sbp2 module with the serialize_io=1 option. Without it, the firewire communication with the disk will occasionally stall and fail during heavy load. The system retries and things work again, so the problem is not fatal. Nevertheless, the stall-fail-retry sequence can slow things down a lot. I don’t know what the performance impact of serialize_io is, but it’s got to be better than the alternative.

A Real Physics Processor?

Tuesday, March 8th, 2005

I just saw a link on Slashdot to the Ageia PhysX chip, a “physics processor unit” (PPU). Clearly from their website they are aiming squarely for the gaming market. That’s a smart thing because the market there is large, but I am wondering if the PPU would be a useful coprocessor for real physics simulation work (and not just game physics).

I can think of a lot of things in Monte Carlo simulations that we do that could benefit from some dedicated hardware. Collision detection algorithms are all over our geometry code, and if that could be offloaded to hardware, there might be big benefits.

I’m still parsing the website to see what this PPU actually does. I should ask if they’ve considered the scientific computing market.

Lurch is dead, long live lurch!

Sunday, March 6th, 2005

Last night I finally switched my primary server, lurch (you know, the butler), from a little Athlon XP Shuttle PC to the Opteron. It actually went fairly smoothly. The I/O performance is much, much better serving up files from the Opteron with a PCI-X gigabit controller. bonnie++ was clocking 60 MB/sec write speed and 36 MB/sec read on NFS clients accessing the new server. The write speed is actually faster than the drive itself, but that’s because bonnie++ chose to do the test with a 1 GB file, and the NFS server has 1 GB of memory. I want to rerun with a 2 or 3 GB test file, but bonnie++ is being uncooperative.

I also started tinkering with putting drives in the Firewire rack, but got annoyed with the default Linux device naming scheme. A first-come, first-served assignment of device names (sda, sdb, etc.) is very bad for a setup like this, especially when I could potentially be hotplugging or moving devices. This is a problem both for static device names and devfs (the default Gentoo scheme). I was willing to put up with that for a while, but then I discovered that device naming isn’t even stable from boot to boot. Sometimes disks #1 and #2 on the firewire chain would come up as sdb and sdc, but sometimes they would come up as sdc and sdb instead. Maybe this is some sort of race condition inherent in Firewire, though I would have guessed that drives would be discovered in daisy chain order. Regardless, it is unacceptable and currently prevents me from automatically mounting my firewire disks at boot.

Looks like it’s time to check out udev. From this comparison of udev and devfs, it sounds like udev is the answer to my problems with its persistent naming feature. I’m not quite sure why the author has had to put up with so much argument about devfs since it sounds like a no-brainer that udev is better. (Maybe it just seems that way since the udev author is the one who wrote the comparison.) I’m a little hesitant to switch to udev and potentially screw up my system, but there is a nice Gentoo udev howto to hold my hand.

Memo To Apple

Tuesday, March 1st, 2005

Dear Apple,

I was reading a PDF with my iBook on the bus today and once again was hit with a desperate desire for a tablet PC. I was able to calm myself by having Preview display the document rotated 90 degrees in fullscreen mode, and then physically rotating my iBook. It felt like I was holding a heavy, very unbalanced book, but nevertheless it made me feel better.

However, I would be much happier if you sold a product that I will call the “Mac Slate.” (“Tablet” is overused and conjures up very clunky and fragile-looking Windows machines.) I can even give you directions on how to make one:

  1. Start with a 12″ G4 iBook. Underclock it to 900 MHz. You can even use a G3 if the power consumption is better with the G3 vs. the G4.
  2. Rip the screen off. (You may need two hands for this part.)
  3. Rip the keyboard/trackpad off.
  4. Rip the CD-ROM drive out.
  5. Replace the 2.5″ hard drive with one of those sexy Toshiba 1.8″ hard drives. Ten GB should be plenty.
  6. Keep the Airport.
  7. Now rotate the base so that it is in portrait orientation.
  8. Pry off the I/O ports and attach them to the left side.
  9. Add a dock connector to the bottom.
  10. Now peel the LCD off the screen part (you didn’t throw that away yet, right?) and tape it to the base.
  11. Remove some of the little function and arrow keys from the old keyboard and glue them to the right edge of the screen.
  12. Find a sturdy rolling pin and squeeze about 1/8″ off the thickness.
  13. Now take your new Mac Slate to the design department and ask them to make it pretty.
  14. Once they finish, take it to the OS division and have them install OS X on it, but with a simplified interface to allow you to operate the thing with just a handful of buttons on the edge. Bonus points for using Rendezvous to access a “Bookshelf” on the local WiFi network.
  15. Mail the prototype to me for final approval and “testing.”

Oh, and a sleek docking station/charger for the thing would be nice. A USB keyboard and mouse plugged into the dock will allow you to use it like a normal computer when you aren’t on the go. Think “iPod on steroids” here.

I know this shouldn’t be too much trouble for you folks. Let me know when you’ll have it ready. Thanks.
