This is an HTML version of a presentation I gave. For far more details, see:
Virtualization and ClustersMany Systems Within One Computer
|
Run one operating system on top of another. It simulates the OS, you provide the application.
This provides "just enough" to allow applications to run in what appears to be their native OS. This may require something to take the place of the native DLLs if the emulated OS is Windows, since many applications expect to have the Windows API available.
Provides:
It emulates a hardware platform, you provide the operating system and applications. The guest OS can be anything that runs on an IA32 PC:
Provides:
The effect is as if there were a hub-based Ethernet LAN within the box, and a 2-port Ethernet switch between that and the outside. Within the system, all operating systems see all frames of all guests and the host. But only frames destined for exterior systems leave the box. The guests have routable IP addresses. Ethernet frames from them have unique MAC addresses, with the first 3 octets (the manufacturer code) one assigned to VMware.
+----------------------------------+ | | | Host OS | | | | | | 2-port | | Hub-based ---------- Ethernet ---> to real LAN | Ethernet LAN switch | | | | | | | | | | Guest Guest | | OS OS | | | +----------------------------------+
The host OS does network address translation for the guests. Within the system, all operating systems see all frames of all guests and the host. But only frames destined for exterior systems leave the box, bearing the source IP and MAC address of the host OS
+-----------------------------+ | | | Host | | Hub-based ---------- OS ---> to real LAN | Ethernet LAN (NAT) | | | | | | | | | | Guest Guest | | OS OS | | | +-----------------------------+
You specify disc type(s), IDE or SCSI, and size. Disc images are enormous files GuestOS-s00*.vmdk. Not really the data structure of the guest's native file system, but plaintext and modifiable with a hex editor.
You specify RAM. Save at least 50% for host OS. UNIX guests can get by with 128 MB or so.
Guest OS details are stored in GuestOS.vmx.
vmware service is started at boot, loads two kernel modules. Built from source code supplied by VMware.
Workstation version — login to host, run guest in X window.
Server version — guests run "invisibly", controllable through VMware management console.
Cluster may be a single cabinet, or a physically clustered set of cabinets.
Modules include CPU and some RAM, plus access to shared bus for RAM and disk storage. Each module runs a separate instance of an OS: VMS, Tru64 (Digital UNIX), Windows NT, Linux. Mix and match on one cluster.
File systems can be shared between OS instances. Several mount it simultaneously, issues with file locking! Ditto for physical memory (harder to manage!)
OS instances can be individually shut down and booted, and to some extent, hardware swapped while the rest keep running.
More formally called Solaris 10 N1 Grid Computing Environment. Start with one Solaris 10 OS running on one hardware platform. A zone is then a new instance of Solaris 10 started from within the primary (main, host) OS. The zone has its own:
Up to 8191 simultaneous zones per Solaris 10 host!
Multiple processors within one cabinet, supporting one instance of an OS.
Simplest case: dual-core CPU (Pentium 4). If kernel recognizes both, multi-threaded applications can use both cores.
More complex: two or more CPUs.
UMA = Uniform Memory Access. Also called SMP, Symmetric MultiProcessing. All CPUs have equal access to all memory.
CPU CPU | | ===bus=== | RAM
NUMA = Non-Uniform Memory Access. All CPUs have access to all memory, but some is closer.
CPU CPU | | RAM RAM | | ===bus===
CPUs in both UMA and NUMA have their own internal caches. There are issues of cache coherency in both cases!
Just how symmetric is that SMP, anyway? This has been a complaint with Linux, at least for people wanting large systems: "It doesn't scale well to a 128-CPU platform!"
Can my process be distributed across multiple CPUs? That depends on what it's doing!
Even it the OS does the right thing, and the process is distributable, see Amdahl's Law: Parts of a program may be parallel, but parts will always be serial. The serial parts will limit the amount of speedup.
In other words, just because you threw N processors at the problem, it won't run N times faster!
High performance computing (DOE doing 3D physics simulations, or protein modeling)
High availability: several copies of the same data or network service.
Use available hardware, unwanted PCs or SPARCs. Install a modified Linux kernel on each node. One is the controller or "head", the rest are compute nodes.
User process space now spans all platforms. Let's say I start a process and get PID #12345. If you find a PID #12345 running on any node, it is my process! Not some random process that happened to get that PID locally.
High speed switched Ethernet | |--- Node | |--- Node | |--- Node Public --- Head --| Network |--- Node | |--- Node | |--- Node |
The original popular method was the "Beowulf cluster", started around 1994 at NASA Goddard and starring Don Becker and Thomas Sterling. The Anglo-Saxon hero "had the strength of many".
Beowulf has the name recognition but it's no longer the overwhelmingly common choice. There is Scyld Beowulf from Penguin Computer (expensive!). And there are earlier and unsupported versions available on the Internet.
It has been and continues to be used to build several very capable supercomputers. Some for US DOE (LANL and Sandia). For some you get rather vague answers from the hardware suppliers, meaning some three-letter agencies. See the public list at: http://www.top500.org/
Some mechanism is needed to manage processes and inter-process communication. This requires a new compiler and shared libraries, plus more.
PVM (Parallel Virtual Machine) library was developed 1989 through mid-1990's at ORNL, UTenn, CMU, and Emory University.
MPI (Message Passing Interface) is a newer standard, and seems to be the method of choice. Also called LAM/MPI (Local Area Multicomputer / MPI). Maintained by Open Systems Laboratory at Indiana University.
Both are robust and useful, some ready-to-install cluster solutions install both.
An extension to Linux kernel plus user-space tools.
Processes can migrate transparently among nodes to balance the workload.
Granularity is the process — even a multi-threaded process is limited to one node at a time.
Process migration is automatic, but it can be explicitly controlled. Tools mps and mtop look like ps and top, but with an extra column showing node where process is running.
http://openmosix.sourceforge.net/
Pros
Cons
Open Source Cluster Application Resources. It's added to an existing Linux system and provides a new kernel and user-space tools.
It uses the traditional head - back-end LAN - nodes model.
You build the head by hand, then clone the nodes.
It supports both PVM and LAM/MPI.
http://www.openclustergroup.org/
NPACI Rocks (NPACI = National Partnership for Advanced Computational Infrastructure).
Also uses traditional head - back-end LAN - nodes model.
You install head from Rocks media, then clone the nodes.
Supports both PVM and MPICH (a variant of MPI)
Pro: designed to automated cluster building
Con: You must re-write the applications to use PVM or MPI!
Places like Google and Amazon use something like the following to provide high availability (and load balancing) at one location:
========= Heartbeat network ========== | | | | | | | | | | | | Public ---- Head Node Node Node Node Node Network | | | | | | | | | | | | ===== Data communication network =====
Linux solutions include:
NASA's situation:
The philosophy is to limit the load on any one server. Any Internet backbone traffic is the problem of the backbone!
The solution is to put one server in each facility, each mirroring the popular web site. The DNS server provides 6 IP addresses for www.nasa.gov, and the order in which they're provided is shuffled with each response.
You and I both want to see a NASA site, and our browsers (probably) get two different answers. The same set of IP addresses, but in different orders, and our browsers will use the first one. Neither of us will necessarily use the closest server (again, backbone traffic is the backbone's problem), but at least we are probably using different ones.
The DNS root server's situation:
The solution:
Speedera is a UK-based content-hosting company. They host the large parts of clients' pages: images, etc. They have some heavy web service load, and what seems to be a peculiar concern with page loading performance!
They distribute mirrors around the world, and attempt to figure out where you are so they can refer you to a close host. That means that two of us may get contradictory DNS info, as opposed to just differently order answers.
Their solution is based on their DNS servers logging all the DNS queries made by other people's DNS local servers. They then feed those address lists to probing systems scattered around the world with their web hosting mirrors. These probing systems have synchronized system clocks.
Once any one of your systems have looked at a Speedera-hosted page, and therefore your local DNS server has asked a Speedera DNS server for an IP address, you will see some strange DNS probes. Every 6 to 18 hours you will see a flurry of synchronized requests coming in from literally all over the world, all of them asking your server for the PTR record for its IP address. In other words, asking it "What is your name?", probably the most obvious question a DNS server should most often be able to answer. Their distributed system uses the round-trip travel times to decide which of their mirrors your DNS server is closest to. So when your DNS server next sends a query to a Speedera system, it will get an answer that Speedera thinks is most appropriate.
Given that web page loads are very different from DNS queries and replies, and also given that round-trip time to the other side of the world is usually under 0.25 seconds, this concern seems rather strange!
A control node divides some problem into pieces. Compute notes "out on the Internet" then request a piece. Their availability, speed, and even OS are not predictable. Will they:
There are also some interesting security problems.
SETI@home / BOINC is the obvious example
For more on grid computing see:
| Home Page | Site Map | Public Key |
|
|
|
|
|
|
| © Bob Cromwell Jul 2008. Created with /bin/vi, hosted on OpenBSD with Apache. Root password available here | ||||