Introduction

When personal computers were scattered machines, with data ferried by hand from one to another on floppy disks or tapes, finding a piece of data meant remembering which physical medium it was stored on.

Interconnection between these computers was slow, and grew slower with distance. These constraints placed control over applications and data in the operators' hands. Application providers made various attempts to control applications, mostly futile, but relatively few attempts to control or collect user data. Power over computing and data rested with the operators.

The high-speed connections of the 2010s changed those constraints, and the ecosystem of software changed accordingly. An enormous amount of software now runs on computers in a datacenter somewhere, leaving us to interact with but a thin skin. The power, meaning the control and utilization of data and applications, has shifted, for the most part, to application providers. In the connected world, finding a piece of data means remembering which device or cloud service each version is stored on.

People gave up that power for many reasons, but the most significant one is the ease of collaboration. It is orders of magnitude easier to work on a centralized document than to send different versions of that document around and keep track of changes. Distributed version control systems provide a theoretical solution, but in practice their use is mostly restricted to software development, and even there providers have introduced significant centralization.

Where files are a basic interchange unit for individual local applications, nodes are a basic interchange unit for collaborative local applications.

  • References are one of the biggest differences between files and nodes.

    • Files are referenced by user-assigned names. Any global reference must locate a file by host and by path within that host. These names are subject to "link rot" as the organization scheme changes, and they make it difficult to provide access to the same file at multiple locations.

    • Nodes are referenced by globally unique, unforgeable identifiers, which are the same no matter which host the node is stored on, or what organizational scheme is used. They may then additionally be given local names for user convenience.

  • Files and nodes can both reference other files and nodes.

    • References to other files must be in the data of the file. Extracting those references requires understanding each file format, and requires that the data be decrypted.

    • Nodes store references to other nodes outside their data. This allows the data to be encrypted or garbage collected without preventing synchronization.

  • Version handling and conflict resolution are completely different.

    • Files only have one identifier, their name. There are myriad systems to track changes in files over time, for backup, concurrent modification, etc., all working around this central flaw. On the flip side, if history retention is not desired, and there's no need for collaboration, files are an excellent fit.

    • Versioned nodes have multiple identifiers: one that refers to any version of the node (in practice this usually means the latest), and one for each specific version. While this makes concurrent modification very conveniently detectable and resolvable, saving ephemeral data to nodes may require special handling.
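
The distinctions above can be illustrated with a minimal sketch. It is not the handbook's actual data model: the names Version, digest, and heads are hypothetical, the node identifier is just a random token standing in for a real unforgeable identifier, and each version identifier is assumed to be a SHA-256 hash over the version's contents. The sketch shows references kept outside the payload, one stable identifier per node plus one per version, and concurrent modification appearing as more than one "head" version.

```python
import hashlib
import secrets
from dataclasses import dataclass
from typing import Iterable, List, Tuple


def digest(*fields: bytes) -> str:
    """Hash length-prefixed fields so field boundaries stay unambiguous."""
    h = hashlib.sha256()
    for f in fields:
        h.update(len(f).to_bytes(8, "big"))
        h.update(f)
    return h.hexdigest()


@dataclass(frozen=True)
class Version:
    node_id: str                 # same for every version of the node
    parents: Tuple[str, ...]     # version ids this version was derived from
    references: Tuple[str, ...]  # ids of other nodes, kept outside the payload
    payload: bytes               # opaque data; could just as well be ciphertext

    @property
    def version_id(self) -> str:
        # Assumption: a version is identified by a hash of its contents.
        return digest(
            self.node_id.encode(),
            ",".join(self.parents).encode(),
            ",".join(self.references).encode(),
            self.payload,
        )


def heads(versions: Iterable[Version]) -> List[str]:
    """Return ids of versions that no other known version lists as a parent."""
    versions = list(versions)
    superseded = {p for v in versions for p in v.parents}
    return sorted(v.version_id for v in versions if v.version_id not in superseded)


# Hypothetical ids: random tokens stand in for real node identifiers.
node_id = secrets.token_hex(16)
other_node = secrets.token_hex(16)

v1 = Version(node_id, (), (), b"draft")
# Two people edit the same version independently.
v2a = Version(node_id, (v1.version_id,), (), b"first edit")
v2b = Version(node_id, (v1.version_id,), (other_node,), b"second edit")

# Two heads means the node was modified concurrently.
conflict = heads([v1, v2a, v2b])

# A merge version names both heads as parents and resolves the conflict.
merged = Version(node_id, (v2a.version_id, v2b.version_id), (other_node,), b"merged")
resolved = heads([v1, v2a, v2b, merged])
```

In this sketch, a host that cannot read the payload can still follow v2b.references to decide which other nodes to synchronize, because the references never pass through the payload.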

Updates on versioned nodes can be passed from stranger to stranger or friend to friend, across the Internet or on external media. Subscriptions, which ask for these updates, tie in nicely with existing relationships between friends and family, colleagues, and collaborators, while strangers on the Internet can share with a more swarm-like protocol.

This handbook is a guide for working with versioned nodes, from their core data model and cryptography to containers and formats built with them, to protocols for building data networks for them.