Distributed Computation (chunking)

Structure of a Simple Distributed Computing System

The question came up: how to structure a distributed computing system, such as a distributed ray tracer, or a cryptographic attack program?

You could use shell connections to retrieve the data to be worked on. Have the clients SSH to the server and "check out" a chunk of work -- no "custom" programming needed for that part. The only thing you need to worry about is two clients connecting to check out at the same time, so the "check out" piece needs to be synchronized.

Clients then ssh back in to "check in" the pieces they complete. The master would likely predict how long it "should" take a client to complete a piece of work, and if the client is way behind, it'd give that piece to someone else as well, and accept the first one to complete; this adds robustness in the face of client disconnection.

I think this would be pretty scalable (each client only needs to know about the one server), and actually easy to implement. The components are:

job setup
Run on the server, once; delimits the "chunks" of work and marks them all as started by 0 clients so far. Put all chunks in a queue.

chunk checkout
Serialized. A client connects, and the server takes the next un-finished chunk in the queue and hands it to the client, incrementing the "client count" of the cunk by one.

chunk checkin
Serialized. A client connects, and hands results back to the server. If the chunk is currently not complete, the data is accepted, and the chunk is removed from the incomplete queue. If all chunks are complete, you declare success. If the chunk was already complete, just ignore the check-in.

status
Used on the server to see what chunks are being processed, what chunks are done, and what chunks are left.

client
Whatever perl or shell script that runs on the client, connects to the server, and retrieves a chunk to work on. It completes work on the chunk, checks the results back in, and attempt to get another chunk, until there are no more chunks.

The hardest part would be implementing the serialization while clients check out or in chunks of work. Specifically, if you just use lock files, and a client dies while in the middle of checking out (or in), then the lock might stay there forever. However, even solving that problem isn't particularly challenging; there are several approaches, from easy (put process ID in lock file and re-validate on lock) to robust (use a separate lock server).