Network Descriptions (The Machine Emulator (TME) 0.12rc10)

1.8.6 Network Descriptions

The tmesh machine description language can describe not only a single machine but an entire network of machines. This is done with a configuration of TAP and BPF network devices that lets machines described by different tmesh scripts communicate. You connect the machines via their network devices using the very same statements that connect their other components. In particular, you create a "master" machine that includes a tap connection, and other "slave" machines then connect to it using bpf filters. The master’s tap connection can also act as a gateway by including a nat configuration, so that you may nat it to other subnets or to the external network as described above. The slaves simply connect to the tap device the same way they would to a real NIC. All machines connected in this way can communicate directly with each other, including the master, using ethernet protocols. When configured correctly, they can also reach the external network, or other networks outside the internal one, via nat.

The naive approach is to create the master tap system the same way you would any ordinary nat machine: the emulated NIC is connected directly to a tap device that may be nat’d to an external network. Other machines could then be added to the internal network thus created by using bpf to set a filter on the tap device. This almost works. All the machines in the network can communicate with each other, but only the master machine can communicate with the external network.

Why is this? The answer is that writing to the bpf does not write back to the tap device, and to communicate outside the network it is necessary to write to the tap. Remember that the tap device is actually a software NIC that works in the reverse direction from the bpf devices. Writing to tap sends packets to the machine; writing to bpf sends them to the network. Reading is the opposite in both cases: tap receives packets from the network, and bpf receives them from the machine. (It’s actually a little more complicated than that, as bpf reads also read from the network, but that’s not an issue here.)

So why is this a problem? Think about what happens when one of the bpf machines sends a packet with a route through the tap device’s address (i.e., a default gateway route). The tap will never see this packet! The write to bpf sent it "out" on the tap device’s network, so it was never delivered to the tap itself.
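The packet flow described above can be sketched as a toy model. This is only an illustration under stated assumptions: the class and attribute names (NaiveTapNetwork, host_stack, and so on) are made up for this example and are not part of TME, and the model reduces the tap and bpf devices to simple lists of frames.

```python
# Toy model of the naive topology. All names here are illustrative
# assumptions; none of this is TME or kernel code.

class NaiveTapNetwork:
    """Master NIC paired directly with the tap; slaves attach through bpf."""

    def __init__(self):
        self.host_stack = []   # frames delivered to the host side of the tap (nat-eligible)
        self.master_nic = []   # frames delivered to the master's emulated NIC
        self.slave_nics = {}   # slave name -> frames captured by its bpf filter

    def add_slave(self, name):
        self.slave_nics[name] = []

    def master_send(self, frame):
        # The master's NIC is paired with the tap, so its frames are written
        # to the tap device and reach the host.
        self.host_stack.append(frame)
        # bpf filters on the same interface capture the frame as well.
        for inbox in self.slave_nics.values():
            inbox.append(frame)

    def slave_send(self, sender, frame):
        # A bpf write goes "out" on the tap's network side. Reading the tap
        # picks it up, and the tap's peer (the master NIC) receives it.
        self.master_nic.append(frame)
        # The other slaves' bpf filters capture it too.
        for name, inbox in self.slave_nics.items():
            if name != sender:
                inbox.append(frame)
        # Nothing writes the frame to the tap, so host_stack is untouched.


net = NaiveTapNetwork()
net.add_slave("slave1")
net.add_slave("slave2")
net.master_send("master -> external")
net.slave_send("slave1", "slave1 -> external")

print(net.host_stack)            # only the master's frame reached the host
print(net.slave_nics["slave2"])  # slave2 saw both frames on the internal network
```

In this sketch the slave’s frame reaches the master and the other slave, but it never appears in host_stack, which is the symptom described above: the internal network works, yet only the master can get out through nat.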

Confusing, isn’t it? The solution is a little tricky, but it works out quite well. More importantly, it works within the original methodology of TME and requires no changes to it, once again validating its approach of describing complex systems through simple components. Once you figure it out, it’s quite straightforward, and it provides a way to build up ever more complex configurations from simple connections.

You just have to understand how information flows between network components in tmesh. Each network component is one side of a paired connection. It has a read path and a write path, and both are connected to the same other side. So when two network components are connected, a bidirectional path is set up between them, much like a pipe: writes to one end are read from the other. This is why they must be paired. As it turns out, this is all you need to set up just about any network configuration.

The key lies in how you set up the master tap connection to the network. The sample tap machine description simply paired the emulated NIC with the tap device. This is not good enough: we also have to write back the packets sent through the bpf filters set on the tap. This can be accomplished by setting up a different pairing. Instead of pairing the tap with the emulated NIC, we pair the tap with itself! Packets sent out on the tap’s network then get written back to the tap device, creating a loopback. This allows the tap to see the packets written by bpf.

Here is a sample machine description that establishes the writeback tap necessary to create a fully nat’ing network of machines. It is a simple modification of the standard sun4c description included with the latest TME distribution. The relevant lines describing the master configuration are changed to this:

    tap0: tme/host/tun/tap interface tapA inet 10.0.77.1 netmask 255.255.255.0 bcast 10.0.77.255
    tap0 at tap0
    bpf0 at le0: tme/host/bsd/bpf interface tapA

There are several things to note here. First, the tap device, tap0, is no longer connected directly to the emulated NIC, le0. Instead, it is now a root node. Second, you may see duplicate ICMP packets when doing pings. This is not because of suboptimal routing, but because the bpf device detects both reads and writes of packets on the tap device. Since all the packets are written back, they are all duplicated. This is inefficient, since only packets destined for the tap need to be written back. Unfortunately, there’s no way around this in the current version of TME, because we can’t yet set a filter on tap devices directly. Future revisions should fix this issue. Finally, the first two lines are unique to this master configuration. The last line is shared among all the slave configurations as well as this one. The bpf filter is set on the tap device by giving it the same interface name as tap0’s. It is paired with the emulated NIC (le0), effectively connecting each machine to the tap network.
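The effect of pairing the tap with itself, including the duplicated packets, can be sketched with another toy model. As before, this is an illustration only: WritebackTapNetwork and its methods are hypothetical names, not TME code, and the choice to hide a sender’s own frames from its bpf filter is a simplifying assumption of the model.

```python
# Toy model of the writeback topology (tap0 paired with itself, every
# machine attached through bpf). Names are illustrative assumptions.

class WritebackTapNetwork:
    def __init__(self):
        self.host_stack = []   # frames delivered to the host side of the tap
        self.nics = {}         # machine name -> frames captured by its bpf filter

    def add_machine(self, name):
        self.nics[name] = []

    def _bpf_capture(self, frame, sender):
        # Every bpf filter on the interface sees the frame. Skipping the
        # sender is a simplification of this model.
        for name, inbox in self.nics.items():
            if name != sender:
                inbox.append(frame)

    def send(self, sender, frame):
        # 1. The bpf write puts the frame "out" on the tap's network side.
        self._bpf_capture(frame, sender)
        # 2. The tap element reads it and, being its own peer, writes it back.
        self._tap_write(frame, sender)

    def _tap_write(self, frame, sender):
        # The written-back frame finally reaches the host, which can nat it.
        self.host_stack.append(frame)
        # bpf also detects the write, so listeners capture a second copy.
        self._bpf_capture(frame, sender)


net = WritebackTapNetwork()
for name in ("master", "slave1", "slave2"):
    net.add_machine(name)
net.send("slave1", "slave1 -> external")

print(net.host_stack)      # the slave's frame now reaches the host
print(net.nics["slave2"])  # two copies: the original and the written-back one
```

Here the master is just another machine attached through bpf, matching the shared bpf0 line, and every frame shows up twice at the other machines. That is the duplication seen with ping, and it would go away only if the writeback could be limited to frames addressed to the tap.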

Note: The sample master configuration is for Linux. The interface name (tapA) may need to be changed on platforms that do not support arbitrary tap interface names. In NetBSD, for instance, the default names start at tap0 and go up from there.

Note: For efficiency, the different machines should each run in their own process. Looked at this way, the ethernet trunk is analogous to an IPC bus connecting the processes. Indeed, it might make a nice alternate mechanism for just such a purpose. The tap/bpf devices each run in a separate "thread" within each process, just like all the other components, so the more components there are, the more threads are needed. Currently the threading model is cooperative, using a coprocess technique that does not take full advantage of the multithreading capabilities of modern processors. This is one area for future improvement.