What are they

A couple of my favourite definitions

  • A group of computers that work together on a problem and have to communicate over a shared, unreliable network
  • You know you have a distributed system when some computer somewhere you didn’t know existed can take your program out :) (a paraphrase of Leslie Lamport's well-known quip)

Why we need them

  • Reliability: single points of failure can be avoided. How much fun is it when the one web server responsible for running a critical piece of software goes down in the middle of the night?
  • Scalability: lets us add capacity incrementally. This works best when the components are stateless, so any instance can handle any request
  • Durability: if your bits only exist on a single disk and that disk fails, you’re going to have a bad day. How well is your process for restoring your data written down? When was the last time you practiced it?
  • Performance efficiency: How well are the compute, network, disk and memory you have being utilized? Well-designed distributed systems will let you add / remove capacity as needed without disrupting work in flight
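
To make the scalability point a bit more concrete, here is a minimal sketch of why stateless components make it easy to add capacity. Everything in it is made up for illustration: `shared_store` stands in for a database or cache that lives outside the workers, and `handle` and `dispatch` are hypothetical names, not part of any real framework.

```python
from itertools import cycle

# Hypothetical shared store, standing in for a database or cache
# that lives outside the workers.
shared_store = {}

def handle(worker_id, user, amount):
    """Stateless handler: the worker keeps nothing between requests."""
    shared_store[user] = shared_store.get(user, 0) + amount
    return worker_id, shared_store[user]

def dispatch(workers, requests):
    """Send each request to the next worker in round-robin order."""
    rotation = cycle(workers)
    return [handle(next(rotation), user, amount) for user, amount in requests]

requests = [("ana", 5), ("ana", 7), ("bo", 1), ("ana", 3)]

# Serve the same requests with two workers, then with three.
two = dispatch(["w1", "w2"], requests)
shared_store.clear()
three = dispatch(["w1", "w2", "w3"], requests)

# The running totals match no matter how many workers served the requests.
print([total for _, total in two])
print([total for _, total in three])
```

Because no worker remembers anything between requests, adding `w3` changes who does the work but not the results. If each worker kept its own totals in memory instead, the answers would depend on which worker a request happened to land on.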