When considering the scalability of serving a text-based multiplayer game, here are the major steps I would be concerned about:

  • 1) Reading input from players.
  • 2) Mutating the world state based on that input.
  • 3) Sending out the observed changes to the world to each observing player.

In a typical MUD, you'll use TCP for the connections (because clients will telnet in), which makes step 1) a bit of a performance bottleneck.
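To make step 1) concrete, here's a rough sketch of pulling telnet-style line input off a non-blocking TCP socket (plain POSIX sockets; the `struct client` fields and the `handle_line` callback are just illustrative names, not from any particular MUD):

    /* Minimal sketch of line-oriented input for one client.
       Assumes the socket is already non-blocking; names are illustrative. */
    #include <errno.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    struct client {
        int    fd;
        char   inbuf[1024];
        size_t inlen;
    };

    /* Returns -1 on disconnect/error, 0 otherwise; calls handle_line()
       for each complete, newline-terminated command. */
    int read_client_input(struct client *c,
                          void (*handle_line)(struct client *, const char *))
    {
        ssize_t n = recv(c->fd, c->inbuf + c->inlen,
                         sizeof(c->inbuf) - c->inlen - 1, 0);
        if (n == 0) return -1;                        /* peer closed */
        if (n < 0) return (errno == EAGAIN || errno == EWOULDBLOCK) ? 0 : -1;
        c->inlen += (size_t)n;
        c->inbuf[c->inlen] = '\0';

        char *nl;
        while ((nl = strchr(c->inbuf, '\n')) != NULL) {
            *nl = '\0';
            if (nl > c->inbuf && nl[-1] == '\r') nl[-1] = '\0';  /* telnet CR */
            handle_line(c, c->inbuf);                 /* one complete command */
            size_t rest = c->inlen - (size_t)(nl + 1 - c->inbuf);
            memmove(c->inbuf, nl + 1, rest);
            c->inlen = rest;
            c->inbuf[c->inlen] = '\0';
        }
        if (c->inlen == sizeof(c->inbuf) - 1) c->inlen = 0;  /* overlong line: drop it */
        return 0;
    }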

In a typical multi-user game, there's a single, shared world state, which means that everyone may end up serializing on step 2). After all, the simulation needs to resolve who goes "first" if there are any resources under contention (be it space, items, engagements, chat, etc).
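One common way to make that serialization explicit -- a sketch of the general idea, not a prescription -- is to funnel every parsed command into one ordered queue and have a single consumer apply them in arrival order, so "who goes first" is simply queue position. (No locking is shown; in a threaded server the queue itself would need protecting, which is exactly the cost discussed below.)

    /* Sketch: a single FIFO of commands decides "who goes first".
       The types and field names here are illustrative. */
    #include <stdlib.h>

    struct command {
        int             player_id;
        char            verb[16];      /* e.g. "get", "kill", "say" */
        char            args[240];
        struct command *next;
    };

    static struct command *cmd_head, *cmd_tail;

    /* Producer side: called as player input is parsed. */
    void enqueue_command(struct command *c)
    {
        c->next = NULL;
        if (cmd_tail) cmd_tail->next = c; else cmd_head = c;
        cmd_tail = c;
    }

    /* Consumer side: the world update drains the queue in arrival order,
       so contention over rooms, items, etc. is resolved by queue position. */
    void apply_pending_commands(void (*apply)(struct command *))
    {
        while (cmd_head) {
            struct command *c = cmd_head;
            cmd_head = c->next;
            if (!cmd_head) cmd_tail = NULL;
            apply(c);
            free(c);
        }
    }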

Sending out observed changes can be done in two ways: either you send each change to every player who can observe it, as it happens, or you diff the previous and current observed state for each observer every so often (depending on what your update/pulse threshold is). Hybrids are also possible, such as accumulating all change events as they happen, but generating output only on the pulse.
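Here's what that hybrid might look like in outline -- the `observer` structure and buffer sizes are made up for the example, and the overflow policy is left as a design decision:

    /* Sketch of the hybrid scheme: change events accumulate per observer
       as they happen, output is generated only on the pulse. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_EVENTS 64

    struct observer {
        int  fd;
        int  nevents;
        char events[MAX_EVENTS][128];   /* queued change descriptions */
    };

    /* Called whenever something the observer can see changes. */
    void note_change(struct observer *o, const char *description)
    {
        if (o->nevents < MAX_EVENTS) {
            strncpy(o->events[o->nevents], description, sizeof(o->events[0]) - 1);
            o->events[o->nevents][sizeof(o->events[0]) - 1] = '\0';
            o->nevents++;
        }
        /* else: overflow policy -- coalesce events, or fall back to a full diff */
    }

    /* Called once per pulse: turn accumulated events into one outgoing blob. */
    void flush_on_pulse(struct observer *o, void (*send_text)(int fd, const char *))
    {
        char out[MAX_EVENTS * 130];
        size_t used = 0;
        for (int i = 0; i < o->nevents; i++)
            used += (size_t)snprintf(out + used, sizeof(out) - used,
                                     "%s\r\n", o->events[i]);
        if (used > 0) send_text(o->fd, out);
        o->nevents = 0;
    }

A nice property of the hybrid is that all the output formatting happens in one batch per pulse, rather than interleaved with each world mutation.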

Threading is a typical response to scaling problems in step 1), at least on Windows. I/O completion ports allow you to fairly efficiently leverage multiple CPUs to suck the data out of the kernel without too much management overhead. By comparison, non-threaded approaches, especially on Windows, may have trouble scaling, although the alternatives on Linux and FreeBSD are much better at this. (Some good papers on this are linked in the references.)
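For reference, the shape of an I/O completion port worker is roughly this (error handling and the re-issuing of WSARecv are elided; the per-connection structure and the `on_bytes_received` handler are placeholders for this sketch):

    /* Sketch of an I/O completion port worker loop on Windows. */
    #include <winsock2.h>
    #include <windows.h>

    struct connection {
        SOCKET        sock;
        WSAOVERLAPPED ov;      /* used by the pending WSARecv */
        /* ... per-connection input buffer ... */
    };

    /* Defined elsewhere: parse the bytes, hand commands to the world update,
       re-arm the next WSARecv on this connection. */
    void on_bytes_received(struct connection *conn, DWORD bytes);

    DWORD WINAPI iocp_worker(LPVOID param)
    {
        HANDLE port = (HANDLE)param;   /* created with CreateIoCompletionPort() */
        for (;;) {
            DWORD       bytes = 0;
            ULONG_PTR   key   = 0;     /* the connection is passed as the key */
            OVERLAPPED *ov    = NULL;

            BOOL ok = GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE);
            struct connection *conn = (struct connection *)key;
            if (!ok || bytes == 0) {
                /* failed or closed connection: clean up conn here */
                continue;
            }
            on_bytes_received(conn, bytes);
        }
        return 0;
    }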

However, then the question is: what do you do with this data? For a shared world, you want to mutate that shared world. That means each thread needs to somehow ensure serialization of those world changes -- i.e., you start suffering locking overhead. In many cases, the additional threads will not actually achieve any parallelization across world updates, so the additional CPUs end up only solving the Windows-specific socket API scalability problem. Meanwhile, you are paying the locking cost for each world update. Because locking needs globally synchronized memory (often even a kernel trip!), the overhead of this operation might be comparable to the overhead of actually processing a world state update for a text-based MUD (depending on the MUD, of course).
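In other words, each worker ends up doing something like the following, where every CPU funnels through the same lock (shown with a Windows CRITICAL_SECTION; a pthread mutex is the same idea -- the names are placeholders):

    #include <windows.h>

    struct connection;                             /* defined elsewhere */
    void apply_command_to_world(struct connection *, const char *);  /* the actual mutation */

    static CRITICAL_SECTION world_lock;  /* InitializeCriticalSection(&world_lock) at startup */

    /* Called from every IOCP worker thread: the extra CPUs all funnel
       through this one lock, so world updates still run one at a time. */
    void handle_command_from_worker(struct connection *conn, const char *line)
    {
        EnterCriticalSection(&world_lock);    /* pay the locking overhead... */
        apply_command_to_world(conn, line);   /* ...for an update that is serialized anyway */
        LeaveCriticalSection(&world_lock);
    }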

Step 3) is never much of a scalability concern, because you can easily fit all the data you need to send to a client in the outgoing TCP buffer, and the sending of that data will be done by the kernel in response to a network adapter interrupt, not in response to any specific user thread. If the buffer fills up, you either drop data, or disconnect the user (those are really the only two options).
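A sketch of that per-client output buffering, with the "drop or disconnect" decision made explicit (field names and the buffer size are arbitrary choices for the example):

    /* Sketch of step 3): append to a per-client output buffer and let
       non-blocking send() drain it; when the buffer is full, drop or disconnect. */
    #include <errno.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    struct client_out {
        int    fd;
        char   outbuf[8192];
        size_t outlen;
        int    dead;          /* set when we decide to disconnect */
    };

    void queue_output(struct client_out *c, const char *text, size_t len)
    {
        if (len > sizeof(c->outbuf) - c->outlen) {
            c->dead = 1;      /* or: silently drop `text` instead */
            return;
        }
        memcpy(c->outbuf + c->outlen, text, len);
        c->outlen += len;
    }

    /* Called when the socket is writable; the kernel does the real work. */
    void flush_output(struct client_out *c)
    {
        if (c->outlen == 0) return;
        ssize_t n = send(c->fd, c->outbuf, c->outlen, 0);
        if (n < 0) {
            if (errno != EAGAIN && errno != EWOULDBLOCK) c->dead = 1;
            return;
        }
        memmove(c->outbuf, c->outbuf + n, c->outlen - (size_t)n);
        c->outlen -= (size_t)n;
    }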

For a multi-threaded approach, especially on Windows, I'd expect scalability to be bounded mostly by the locking involved in mutating world state. The fact that you have multiple, asynchronous request handlers doesn't take away the need to serialize on world state, nor the cost of doing so. For it to actually be a gain, you would need to design the world state mutation so that the parallelism it achieves outweighs the cost of the additional locking -- certainly a very tricky problem to solve!

As if that weren't enough, going higher than two CPUs for your server machine may end up not gaining you much, because those CPUs will just be fighting for the same memory bus. World state mutation is typically bound on memory latencies, as it's not a cache-friendly, compute-bound task. Thus, attempting to scale these servers by more and more CPUs sharing the same memory may quickly turn futile.

For a single-threaded approach, especially on various UNIX flavors, I'd expect scalability to be bounded mostly by the API for servicing all the client socket inputs. select() only scales so far; after that you need to go to the various kqueue/sys_epoll flavors (again, see the papers in the references). Meanwhile, world state update is inherently serialized without needing any locking primitives -- there's only one thread! Also, running on a single CPU means multiple CPUs won't contend for the same pieces of memory, so what memory bandwidth there is doesn't get spent on cache-line thrashing between CPUs.
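The skeleton of such a single-threaded loop, using the sys_epoll flavor on Linux (kqueue on FreeBSD is analogous), might look like this; the handlers named in the comments are the earlier sketches, and the 250 ms timeout is just an example pulse interval:

    /* Sketch of a single-threaded server loop using epoll. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/epoll.h>

    #define MAX_EVENTS 256

    void serve_forever(int listen_fd)
    {
        int ep = epoll_create1(0);
        if (ep < 0) { perror("epoll_create1"); exit(1); }

        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

        struct epoll_event events[MAX_EVENTS];
        for (;;) {
            /* Wake up at least every 250 ms so the world pulse can still run. */
            int n = epoll_wait(ep, events, MAX_EVENTS, 250);
            for (int i = 0; i < n; i++) {
                if (events[i].data.fd == listen_fd) {
                    /* accept() the new connection, add its fd to the epoll set */
                } else {
                    /* read_client_input() on this fd, enqueue resulting commands */
                }
            }
            /* One thread, no locks: apply_pending_commands(), and when enough
               time has elapsed, run the pulse (flush_on_pulse(), flush_output()). */
        }
    }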

So, the scalability (in number of MUD clients served) of the single-threaded system may be higher than that of the multi-threaded system, if the savings on shared state update costs outweigh the potential reduction in efficiency of serving many sockets.

When it comes to numbers, I haven't measured anything relating to multiple TCP streams in ages. Given available data and seat-of-the-pants calculations, I'd expect a single-threaded server to handle many thousands of clients before load actually becomes a concern -- assuming you use an efficient socket polling API.

Perhaps you can design a world update system that doesn't actually end up serializing much on world state; for example, by freezing certain parts of world state (say, room connectivity) so they can be read without locking, and by splitting mutable state into many small pieces (say, one lock per room, per engagement, etc). The simpler your rules are, the easier this will be to accomplish; an advanced simulation would likely find it very challenging to constrain itself to those design conditions.
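As a sketch of that finer-grained design -- the structure layout and the fixed lock ordering are illustrative choices, not the only way to do it:

    /* Sketch: room connectivity is frozen after boot (readable without locks),
       and each room carries its own mutex guarding only its mutable contents.
       Each lock is pthread_mutex_init()ed at world load. */
    #include <pthread.h>

    struct room {
        /* Immutable after world load: safe to read from any thread. */
        int              exits[6];        /* indices of adjacent rooms */

        /* Mutable state: protected by this room's own lock. */
        pthread_mutex_t  lock;
        int              item_ids[32];
        int              nitems;
    };

    /* Moving an item between rooms touches only two locks. */
    void move_item(struct room *from, struct room *to, int item_id)
    {
        struct room *first  = from < to ? from : to;
        struct room *second = from < to ? to : from;

        pthread_mutex_lock(&first->lock);
        if (second != first) pthread_mutex_lock(&second->lock);

        /* remove item_id from `from` */
        for (int i = 0; i < from->nitems; i++) {
            if (from->item_ids[i] == item_id) {
                from->item_ids[i] = from->item_ids[--from->nitems];
                break;
            }
        }
        /* append it to `to` (overflow ignored in this sketch) */
        if (to->nitems < 32) to->item_ids[to->nitems++] = item_id;

        if (second != first) pthread_mutex_unlock(&second->lock);
        pthread_mutex_unlock(&first->lock);
    }

Locking the two rooms in a fixed (address) order is what keeps two threads moving items in opposite directions from deadlocking each other.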

References

  • USENIX select() improvements paper (1998)
  • Various UNIX socket polling API benchmarks (2000)
  • Image of various system calls' scalability