Date: May 1, 2018
Name: Maxwell Bernstein
UTLN: mberns01

Beaker browser and the Dat protocol

An analysis for COMP 117: Internet-scale Distributed Systems

In this paper, I will introduce the Beaker browser and the Dat project upon which it is based. Then, I will critically analyze the design of the project using the principles and practices we have discussed throughout this course. Finally, I will provide references to related and future work on similar topics.

1 What is this?

The Beaker browser is a browser that is built to surf and publish peer-to-peer web pages. It relies on a networking stack and shared data layer provided by the Dat project.

1.1 Dat project

According to the authors, one of Dat’s goals is to solve the problem of “link rot and content drift as files are moved, updated or deleted” (Ogden et al. 2017, 1). In particular, the authors focus on the data used in scientific literature. This is a major problem on the existing internet, ameliorated somewhat by projects like the Internet Archive. When websites change locations or people stop paying for hosting, their websites die.

Additional goals include decentralizing the internet, which currently relies on several tech giants to survive, and encrypting all traffic. In the eyes of the creators of Dat, the network should not know what content a user is requesting, and the user should be able to verify that the content received was produced by the author of said content (i.e. it is signed).

Some of the challenges that Dat faces are the very problems it aims to solve: dead links are also possible on the Dat network. Dead links are less likely, because anybody can serve anybody else’s content, but once the last host stops serving a piece of content, that content is effectively dead.

Dat has less of a problem with files moving, being updated, or being deleted; once a user shares an object on the Dat network, it has a name that will always refer to that content.

Other challenges include peer discovery and making Dat hosting accessible for the “average” web user. Peer discovery is often tricky due to Network Address Translation (NAT). NAT ensures that from the perspective of the broader internet, many computers can live behind one external IP (generally assigned by an Internet Service Provider) and have their traffic switched by a router.

Figure 1: NAT traversal (created by Ingo Blechschmidt, CC by SA)

In this diagram, without the tunnel, the computers on the left side of the left NAT gateway and the computers on the right side of the right NAT gateway would not be able to “see” one another. Fortunately, Dat features three modes of discovery over a single interface (Ogden et al. 2017, 3) [1]:

join(key, [port])
leave(key, [port])
foundpeer(key, ip, port)

A node calls join when it would like to be kept up-to-date on the whereabouts of the data named key. Additionally, it can provide port if it also has access to and wants to share that data. A node calls leave when it would like to stop being kept up-to-date. Additionally, it can provide port if it would like to stop sharing the data. The function foundpeer is a callback that will be called when a peer that has a given key is found.
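To make the three calls concrete, here is a minimal single-process sketch of the interface. It is not the Dat implementation: the DiscoveryHub class, the explicit ip argument, and passing foundpeer in as a callback argument to join are all assumptions made for illustration, and a real backend would do this over DNS or a DHT rather than in memory.

```python
from collections import defaultdict
from typing import Callable, Dict, Optional, Set, Tuple

FoundPeer = Callable[[str, str, int], None]  # foundpeer(key, ip, port)


class DiscoveryHub:
    """Hypothetical in-memory stand-in for a discovery backend."""

    def __init__(self) -> None:
        self._sharers: Dict[str, Set[Tuple[str, int]]] = defaultdict(set)
        self._watchers: Dict[str, Dict[str, FoundPeer]] = defaultdict(dict)

    def join(self, ip: str, key: str, foundpeer: FoundPeer,
             port: Optional[int] = None) -> None:
        # Tell the newcomer about everyone already sharing this key.
        for peer_ip, peer_port in sorted(self._sharers[key]):
            foundpeer(key, peer_ip, peer_port)
        self._watchers[key][ip] = foundpeer
        # A port means the node also has the data and wants to share it.
        if port is not None:
            self._sharers[key].add((ip, port))
            for watcher_ip, callback in self._watchers[key].items():
                if watcher_ip != ip:
                    callback(key, ip, port)

    def leave(self, ip: str, key: str, port: Optional[int] = None) -> None:
        # Stop receiving updates; with a port, also stop sharing.
        self._watchers[key].pop(ip, None)
        if port is not None:
            self._sharers[key].discard((ip, port))
```

A node that only wants to read calls join without a port; it learns about existing sharers immediately and about new ones as they arrive.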

The three modes of discovery that they have made fit this interface are DNS name servers, DNS multicast, and the Kademlia Distributed Hash Table, which are all explained in the next two sections.

1.1.1 DNS name servers and DNS multicast

The Dat team maintains a Node module called dns-discovery that implements a custom DNS server that adheres to the aforementioned interface. It also uses DNS multicast to find peers on the local area network (“Multicast DNS” 2018). To give an idea of its ubiquity, DNS multicast, or mDNS, is the same protocol that the Chromecast uses to find devices to cast to on the network.

1.1.2 Kademlia Distributed Hash Table

A Distributed Hash Table, or DHT, is a map of keys to values that is split across some number of peers in a “swarm”. It is an oft-used piece of software in large decentralized systems, like file sharing networks. In a system like Dat, it is important for discovering who can serve what content.

The BitTorrent team maintains a software package for DHT bootstrap nodes. Bootstrap nodes are used to introduce new nodes to the many other existing nodes in the “swarm”. The Dat team runs a DHT bootstrap node and the Dat software is configured to connect to it automatically to discover peers. The Dat software also wraps the DHT connection in the aforementioned interface.
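The defining idea of Kademlia is that the “distance” between two identifiers is their bitwise XOR, interpreted as an integer; a lookup repeatedly asks the nodes closest to the target key. The sketch below shows only that metric and a closest-node selection. The function names are mine, the one-byte IDs in the usage note are far shorter than real ones, and nothing here reflects how the Dat software wires up its DHT.

```python
from typing import Iterable, List


def xor_distance(a: bytes, b: bytes) -> int:
    """Kademlia distance: XOR of the two IDs, read as a big-endian integer."""
    if len(a) != len(b):
        raise ValueError("IDs must have the same length")
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")


def closest_nodes(target: bytes, node_ids: Iterable[bytes], k: int = 2) -> List[bytes]:
    """Return the k node IDs closest to target under the XOR metric."""
    return sorted(node_ids, key=lambda node: xor_distance(node, target))[:k]
```

For example, with target b"\x05" and nodes b"\x04", b"\x07", b"\x0d", and b"\x01", the distances are 1, 2, 8, and 4, so the two closest nodes are b"\x04" and b"\x07".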

1.2 Beaker browser

The Beaker browser is a front-end to the Dat API. It has the following features:

  1. Traditional web-browsing features such as navigating to a URL, navigating between pages, and all of the other standard functions one might expect. In addition, it supports browsing Dat archives.
  2. Publishing archives on the Dat network. Uploading content is as easy as clicking the “publish” button, and content is automatically available on the network. The Beaker browser integrates with Dat’s built-in “share” functionality.
  3. Versioning content and updates to Dat websites. Every click of the “Publish” button creates a new revision, which is then logged. This makes finding or reverting to previous revisions simple.

Figure 2: A picture of the Beaker browser in action

One interesting feature that helps ease the transition between the normal internet and Dat is Beaker’s ability to take a website served on a normal domain (say, https://bernsteinbear.com), and find the equivalent content on the Dat network. It achieves this by using the .well-known directory (described in RFC 5785 (Nottingham and Hammer-Lahav 2010)), whose Dat entry points the browser to the Dat version of the site. Beaker, therefore, considers an SSL-backed website with a Dat pointer to be the authority.
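As a rough illustration of the lookup, the sketch below extracts a Dat key from the body of a .well-known entry. The exact file layout is not given in this paper; I am assuming that the first line holds a dat:// URL with a 64-character hexadecimal key, and the function name is hypothetical. Fetching the file over HTTPS is left out.

```python
import re
from typing import Optional

# Assumed form of the pointer: dat:// followed by a 64-hex-character key.
_DAT_URL = re.compile(r"^dat://([0-9a-fA-F]{64})/?$")


def parse_well_known_dat(body: str) -> Optional[str]:
    """Return the lower-case Dat key named on the first line, or None."""
    lines = body.strip().splitlines()
    if not lines:
        return None
    match = _DAT_URL.match(lines[0].strip())
    return match.group(1).lower() if match else None
```

A browser would treat a None result as “this domain has no Dat version” and fall back to plain HTTPS.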

2 Analysis of the system

There are some features and drawbacks to using Dat. In this section, I will analyze what I think are several interesting facets of and principles embedded in both Dat and the Beaker browser.

2.1 Metcalfe’s Law

Given its relatively short history, the Dat protocol has limited adoption. While the Beaker browser interfaces well with the “broader internet”, the Dat network’s requirements for special software and constraints on content dampen its advantages.

For example, joining the Dat network currently requires either downloading the Beaker browser and using it like a “normal” user, or downloading the dat NPM module and serving content programmatically. Both of these are barriers to adoption. If Dat could be implemented as, say, a browser extension, this might not be so tricky. Nobody wants yet another WebKit engine running on their computer.

Additionally, many people use software like WordPress to build dynamic web-pages. Some report that WordPress accounts for up to 25 percent of websites on the internet (Gelbmann 2015). These websites cannot exist on Dat because Dat is built only to serve static files. Until there is a good enough website builder that reaches feature-parity with WordPress, those users will likely be hard to convert.

Unfortunately, the Dat network needs some amount of critical mass before it can reach its true potential. At present it is unlikely that any given website will be hosted by any peer other than its author; discoverability is simply too low. Metcalfe’s law says that if more people were to join the network, however, the number of possible connections between peers would grow quadratically, making more sites available.
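The quadratic growth is easy to see with a little arithmetic: n peers can form n(n-1)/2 distinct pairs. The helper below is only that formula; its name is mine.

```python
def potential_connections(peers: int) -> int:
    """Number of distinct peer pairs among `peers` nodes: n * (n - 1) / 2."""
    if peers < 0:
        raise ValueError("peer count cannot be negative")
    return peers * (peers - 1) // 2
```

Going from 10 peers to 20 peers takes the count from 45 possible connections to 190, so doubling the population roughly quadruples the number of pairs.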

The authors’ addition of “super-hosts” (explained in more detail later) somewhat sidesteps this hosting problem, but does not do away with it entirely; it makes the Dat network just like any other.

2.2 End-to-End Principle

In a network where any user can request some Dat content from any other user, it is important that all nodes agree on what to call a particular blob. The Dat protocol gives names to Dat objects based on their contents: according to the whitepaper, “Dat uses BLAKE2b cryptographically secure hashes to address content” (Ogden et al. 2017, 2). The hashing helps ensure that the content requested by name (URL) is the content retrieved — that there has been no tampering in the middle. This guarantee is particularly important for peer-to-peer systems that have potentially untrusted peers in the network.

Figure 3: A brief explanation of hashing (Wikimedia Commons 2015)

Importantly, this method does not require that every node in the middle be trusted or verify the hash; it only requires a small check after receiving the content. This end-to-end check is what makes a network like Dat both interesting and reliable.
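The check at the receiving end can be sketched in a few lines using the BLAKE2b implementation in Python’s standard library. The 32-byte digest size and the function names are assumptions for illustration; this is the general shape of a content-address check, not Dat’s wire format.

```python
import hashlib


def content_name(data: bytes) -> str:
    """Name a blob by the hex BLAKE2b hash of its bytes (32-byte digest assumed)."""
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def verify(name: str, data: bytes) -> bool:
    """End-to-end check: does the received data hash to the requested name?"""
    return content_name(data) == name.lower()
```

No intermediate node needs to be trusted: a receiver calls verify once on what arrived and discards anything that fails.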

The end-to-end check does not hold only on individual files; it holds on entire repositories, too. Files on the Dat protocol can be grouped together in bunches called “repositories”. These naming conventions and content guarantees (content-addressable naming, etc) must be extended to repositories as well — the name of a repository should provide some information about its contents. Dat uses Merkle trees to provide this property. The hash of a repository is the hash of the hashes of all of its contents. If any one sub-node changes, the changes propagate back up the tree until the root hash changes, too.

These guarantees make maintaining some principles like the End-to-End Principle on the Dat network much easier. Software written for the Dat network need not verify that an entire Dat repository has arrived in one piece, file by file; instead, it can build up the Merkle tree and use the root hash to verify that the files have arrived. If there is a mismatch in the tree, the Dat software can work to correct the error.

It seems as though the authors of the Dat protocol and accompanying software kept the End-to-End Principle in mind. Dat makes good use of end-to-end correctness checks.

2.3 Name Everything

All Dat resources and their sub-components are named. This is not notably different from the traditional network model; users can still refer to web resources by URIs.

In Dat, naming is slightly different and subtly more powerful. A Dat name has no information about how to locate a resource. Instead, all content that is byte-for-byte the same will have the same name. There is no difference in name if some user A hosts a technical report about Dat or if the Tufts CS department hosts that same report; it is discoverable and can be fetched from either peer by the same name.

This means that names are even more global than in the World Wide Web. In order to see the real-world implications of this, Dat will need to grow and gain users.

2.4 This “hosting” situation

In the current internet, the creator of a web page must either host it themselves (i.e. personally run a server in their home) or pay somebody else to host it for them. Self-hosting has become trickier for two reasons:

  1. ISPs have gotten warier of letting people host from home, sometimes banning it
  2. There has been a global shift from desktops to more portable computers, like laptops, tablets, and phones

Dat partially solves the first point by changing the normal patterns for web hosting. Instead of a user opening port 80 on their router and allowing the big wide web to direct web traffic to their personal computer, they simply open either the dat command-line application or the Beaker browser. They do not need to open a port, nor will people send them HTTP messages over the wire. It is possible that ISPs start filtering or blocking Dat traffic in the future, though.

The second point is trickier to combat. Because many people own a laptop and therefore don’t leave their personal computer at home, there may not be a candidate machine to host their website in the traditional model. It’s possible to set up a Virtual Private Server or other web hosting deal with one company or another, so many people do that. Companies like DigitalOcean, Linode, or Vultr offer inexpensive server space. Dat again partially solves this problem by allowing peers other than the creator to re-host that content (which still comes with the security guarantees).

It is entirely possible that in the future such hosting providers will be unnecessary; when a piece of web content gets popular, it may be also hosted by the peers currently accessing that content, or the peers that choose to re-host it. But if Joe Schmoe wants to host a personal website and it so far is not popular, there is no choice but to find a computer to leave on all day — or risk the content being inaccessible.

Since Dat aims to decentralize the internet, the notion of “hosting providers” seems counter-productive. Paying Google, Amazon, or some other service to host personal web content re-centralizes what should be distributed. But this seems to be a solution that the Dat project is advertising: Paul Frazee and Tara Vancil, the creators of the Beaker browser, have also created Hashbase. According to its website, Hashbase is “Hosting for the peer-to-peer Web” (Frazee and Vancil, n.d.). On the About page, the authors clarify further:

“Hashbase acts as a ‘super peer’ and rehosts your Dat archives, so your files are always available, even when you’re offline.”

It is not clear how different this offering is from those of Google and Amazon, especially considering that there are both paid and free service tiers. One notable difference, though, is that there is no vendor lock-in; since any computer running Dat software can act as a peer, any vendor could do.

Last, it is important to note that much software is being delivered over the internet as a web application written in JavaScript. These application bundles are constantly changing, rely on dynamic content from a central server, and often include analytics. This is a massive shift from the document-based internet and is hard for Dat to contend with. It’s impossible to host rich applications like Google Docs, Facebook, or TurboTax on the Dat internet — it does not support the kind of client-server interaction that these applications need. Dat does not offer an alternative for such websites.

While Dat appears to offer a more democratized internet, it lacks solutions that would enable a truly decentralized internet. Instead, it offers a centralized hosting system. As of yet, there is also no answer to the call for an internet of rich applications.

2.5 One Name for Each Thing

Since a Dat resource is content-addressable, and a resource is made up of content, each resource can only have one name. Different revisions of the “same resource” have different names because the content is different. Versions are the original content name tagged with a number that indicates the revision. A name without a version tag gets the latest content.

Because the names are hexadecimal numbers, they could theoretically be written and rendered in either upper case or lower case. In Dat, either is accepted, but lower case seems to be preferred; the Beaker browser automatically converts upper case URLs to lower case.

URLs must have a scheme section, as per the RFC (Berners-Lee, Fielding, and Masinter 2005). Dat uses the dat scheme. Since Dat aims to provide a bridge between the Dat network and the “old” internet, any Dat link with the name [key] can be referred to by one of the following four names (Ogden et al. 2017, 2):

  • [key]
  • dat://[key]
  • https://datproject.org/[key]
  • the normal DNS name, as long as it is configured properly

Since not all clients will be able to use the dat:// scheme, this improves accessibility. It increases aliasing, though, which increases the overall system complexity. Now any part of the Dat system that handles URLs must be able to handle any of the four types of resource names, and be able to convert between them.
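To give a sense of the extra work this aliasing creates, here is a sketch of a normalizer that maps the first three forms to one canonical lower-case key. The function name and the 64-hex-character key pattern are assumptions; the fourth form, a DNS name, cannot be resolved by string handling alone, so the sketch returns None for it.

```python
import re
from typing import Optional

_KEY = r"([0-9a-fA-F]{64})"
_FORMS = [
    re.compile(rf"^{_KEY}$"),                             # [key]
    re.compile(rf"^dat://{_KEY}/?$"),                     # dat://[key]
    re.compile(rf"^https://datproject\.org/{_KEY}/?$"),   # https://datproject.org/[key]
]


def canonical_key(name: str) -> Optional[str]:
    """Return the lower-case key for a recognized name, else None."""
    candidate = name.strip()
    for pattern in _FORMS:
        match = pattern.match(candidate)
        if match:
            return match.group(1).lower()
    return None  # e.g. a DNS name, which needs a network lookup
```

Every component that accepts a name would need to call something like this before comparing two names for equality.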

2.6 Plan for security

Unlike our current internet, the Dat protocol has planned for security from the outset. There are several features that Dat provides that either the normal internet does not, or is still in the process of adding:

  • When requesting a resource, the user has a method for detecting if the resource was compromised in transit
  • When requesting a resource, the user has a method for verifying that the content was published by a given entity

These are both explained in more detail in the section about the end-to-end principle.

Unfortunately, there are some features that Dat has that could be considered detrimental. Since the network only serves static pages, it is not possible to have a server-side method of authentication like in the current version of the World Wide Web. The leadership behind the Dat project has decided that having a link to a resource means that one should be able to read that resource.

This method of permissions is commonly known as “security by obscurity” and widely considered a bad idea. It strikes a funny balance between two extremes:

  • Discoverable URLs with no authentication and no encryption
  • Hard-to-discover URLs with authentication and encryption

But as far as I am aware, there is no good method anywhere for dynamic permissions on static and freely discoverable resources.

However, the Dat protocol does one thing well. According to the author of the Beaker browser, pfraze, URLs do not leak when being requested on the network:

The public key of a Dat archive is hashed before querying or announcing on the discovery network, and then the traffic is encrypted using the public key as a symmetric key. This has the effect of hiding the content from the network, and thus making the public key of a Dat a “read capability”: you have to know the key to access its files.

(The public keys referenced here are Ed25519 keys (Daniel J. Bernstein et al. 2012). Ed25519 keys are focused on being fast, secure, small, and resistant to hash collisions (Daniel J Bernstein et al. 2017).)
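The first half of that description, hashing the public key before announcing it, can be sketched as a keyed hash. The sketch assumes BLAKE2b keyed with the public key over a fixed label; the label value and the function name are assumptions on my part, and the encryption step is not shown.

```python
import hashlib


def discovery_key(public_key: bytes) -> bytes:
    """Derive the value announced on the discovery network.

    Assumption: a 32-byte BLAKE2b digest of a fixed label, keyed with the
    archive's public key. The label below is illustrative.
    """
    label = b"hypercore"
    return hashlib.blake2b(label, key=public_key, digest_size=32).digest()
```

Because the derivation is one-way, an observer who sees only the discovery key cannot recover the public key, but anyone who already holds the public key can compute the same value and recognize requests for it.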

This is notably different from the HTTP model, where any plaintext request is clearly visible to other users on the same network and even to intermediate nodes delivering the page. This kind of baked-in privacy is a marked improvement.

This method also has one disadvantage: if a user knows the public key of some content, they know when anyone in the area is requesting that content. To some users, for example users behind a government firewall, this might be of some concern. This behavior is different from SSL in that users cannot introspect any other users’ SSL traffic.

Hacker News user skybrian poses several interesting comments.

skybrian: You can publish new versions to a URL until you somehow forget the private key, and then it’s fixed forever, so long as people hang onto copies.

[…]

Suppose someone chooses to publish a private key? Is it a world-writable URL? Hmm.

To address skybrian’s first point: this is not a problem unique to Dat, but it is not a problem that Dat solves particularly well. The same issue exists with PGP (if one loses their private key, they can no longer send messages as themself [2]), SSL (if one loses their cert private key, they can no longer prove that they are themself), etc. In Dat, though, there is:

  1. No means for revoking lost keys, like in PGP
  2. No backing central identity provider (like a Certificate Authority) that can re-issue a private key

Dat project and Beaker browser creators Max Ogden and Paul Frazee respond to skybrian:

maxogden: Great analysis. We anticipate that in order to fix these three usability issues around trust we will need to provide a centralized identity provider in the future.

Ogden’s response seems to somewhat contradict the original goal of the project: decentralizing the internet. If the Dat protocol is based around a centralized identity server, it is possible that many of the benefits that could be reaped from decentralization (less worrying about downtime, no reliance on a central service, smaller attack surface) are reduced or eliminated. This proposed centralized identity server, coupled with the already extant Hashbase, makes for a project that seems to lack a strong decentralized direction.

pfraze: All true, though if you leak the private key, what will happen is that (due to lack of strict consensus between the leaked-key users) conflicting updates will be published, causing a detectable split history. That’s a corruption event.

At the moment that would result in each leaked-key user maintaining a different history, with different peers only downloading the updates from the leak-author they happen to receive data from first. But in the future what will happen, once we get to writing the software for it, is the corruption event will be detected and recorded by all possible peers, freezing the dat from receiving future updates.

This is a thorny problem. pfraze’s proposed solution (detecting and freezing the corruption event) limits the effects of the fork, but it is imperfect. How will this software know the difference between two parties sharing a private key and contributing to the same dataset, and one party accidentally leaking a key to another party? The former is a reasonable course of action (that is, until some kind of “contributor” vs “owner” distinction is implemented), whereas the latter should be stopped in its tracks.

Dat offers good but not great security as of right now. It could be improved by adding some means for access control into the content distribution layer (perhaps that is an area that is ripe for development), but certainly not by centralizing user identity.
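The detection half of pfraze’s proposal can be sketched as a comparison of two histories received from different peers. I am assuming each entry is a (sequence number, entry hash) pair and that signatures have already been verified against the same key; both the representation and the function name are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

Entry = Tuple[int, str]  # (sequence number, hash of the entry)


def find_fork(history_a: Iterable[Entry], history_b: Iterable[Entry]) -> Optional[int]:
    """Return the first sequence number where the two histories disagree.

    Two validly signed entries with the same sequence number but different
    hashes mean the log has split. Returns None when one history is simply
    a prefix of (or identical to) the other.
    """
    seen: Dict[int, str] = dict(history_a)
    conflicts = [seq for seq, digest in history_b
                 if seq in seen and seen[seq] != digest]
    return min(conflicts) if conflicts else None
```

Note that this check reports only that a split exists. It says nothing about whether the two writers were collaborators or whether one of them obtained the key by accident.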

3 Conclusion

The Dat project seems well-reasoned and well-developed. Its principles are sound. The software works, and it works well. There is a small but active community that continues to develop the software and its documentation. I have some recommendations, though.

I used to run a WordPress website, including a blog. I switched to Jekyll, but only because I found a script that converted my website to a static Jekyll site. If I had to manually move over all of my content, I would not have transitioned. So my recommendation to the Dat folks is thus: create some software that users of blogging engines, website builders, etc can use to easily and gracefully switch to Dat. It need not be complicated and it need not cover every case; as long as it is possible to convert one percent of site-builder users, it will be a huge win.

Even if the Dat project converts many users, those users may not have reasonable hosting options at home. Providing an easy-to-setup hosting alternative to WordPress/Wix/Squarespace is key. So I have several recommendations in this vein:

  • Provide pre-configured plug computers or small boards that run Dat. The team already has service discovery implemented, so it should be reasonable to plug a Raspberry Pi (for example) into the wall and the network, then have a Beaker browser find and talk to it. This small computer could be left on all the time to serve web-pages while the user brings their personal computer around.
  • For users wary of buying hardware devices, it could be worth investigating software packages to install on routers. Since these devices are already on all the time, they would be good Dat nodes. Installation on a huge variety of routers is probably tricky, though.
  • There is a large community of software pirates who run BitTorrent nodes all the time to “seed” content (i.e. act as peers that serve content to peers who request it). They do this because it is possible to compute a ratio of content served to content downloaded. In many sub-communities it is either considered rude or not allowed to fall below a certain ratio. If Dat can adopt similar motivation, the hosting problem might not be so dire; people would have incentive to leave computers on all day.
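The ratio incentive in the last recommendation amounts to very little bookkeeping. The sketch below shows one way a community tracker might compute it; the function names and the default threshold of 1.0 are assumptions, not rules of any particular BitTorrent community.

```python
def share_ratio(uploaded_bytes: int, downloaded_bytes: int) -> float:
    """Content served divided by content downloaded."""
    if downloaded_bytes == 0:
        # A peer that has downloaded nothing cannot be in deficit.
        return float("inf")
    return uploaded_bytes / downloaded_bytes


def in_good_standing(uploaded_bytes: int, downloaded_bytes: int,
                     minimum_ratio: float = 1.0) -> bool:
    """True when the peer has served at least minimum_ratio times what it took."""
    return share_ratio(uploaded_bytes, downloaded_bytes) >= minimum_ratio
```

A peer that has served 200 MB and downloaded 100 MB has a ratio of 2.0 and is in good standing; one that has served 50 MB against 100 MB downloaded is not.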

I hope the Dat project continues to attract new users and prosper. Until it gets a larger population, it will have trouble.

This paper can be found at dat://bernsteinbear.com/dat-paper.html or dat://54cb4accdabc258240d76df28ee66c0900aeaadddb822e2b99e0cef113ae128b/dat-paper.html. It will likely be down, though, since my laptop is normally closed and in my bag.

4 Related work

Here I have collected a short list of projects that likely inspired Dat, or have similar goals.

4.1 The Inter-Planetary File System

IPFS, as it is commonly abbreviated, also features content-addressable files and is also peer-to-peer. It also uses Merkle trees in a fashion similar to Git (featured below) and Dat.

The IPFS project is working on a naming system overlaid on top of the content-addressable web it has already created. This naming system, IPNS, is similar to DNS.

4.2 BitTorrent & the Kademlia DHT

BitTorrent is a file-sharing protocol that uses file chunking and peer-to-peer chunk delivery to speed up and also decentralize file sharing. Unlike its precursors (Napster, Usenet, Kazaa, etc), BitTorrent is (almost) completely decentralized, and it fetches the chunks of a file from multiple peers at once.

BitTorrent uses the Kademlia Distributed Hash Table to bootstrap a network of peers, similar to Dat.

4.3 Git and other version-control systems

Git is a version control system. It uses a block hashing method that is similar to that of the Dat protocol. Each commit is hashed. Each commit contains both metadata (including the previous commit hash) and the changes bundled. It is therefore possible to build a coherent and verifiable history of the repository, and a checksum that ensures that the repository was built correctly end-to-end.
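The chaining described here can be sketched as follows. This is not Git’s object format: the way the fields are joined, the empty string used as the first parent, and the function names are simplifications of mine. SHA-1 is used only because it is the function Git uses.

```python
import hashlib
from typing import List, Tuple


def commit_id(parent: str, metadata: str, changes: str) -> str:
    """Hash a commit together with the ID of its parent."""
    payload = "\n".join([parent, metadata, changes]).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def build_history(commits: List[Tuple[str, str]]) -> List[str]:
    """Return the chain of commit IDs for a list of (metadata, changes) pairs."""
    parent = ""  # the first commit has no parent
    ids = []
    for metadata, changes in commits:
        parent = commit_id(parent, metadata, changes)
        ids.append(parent)
    return ids
```

Because each ID covers its parent’s ID, altering any earlier commit changes every later ID, so the final ID acts as a checksum over the whole history.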

Recently, Marc Stevens, Elie Bursztein, and their team produced the first practical collision for SHA-1, the hash function that Git uses (Stevens et al. 2017). That attack is specific to SHA-1, so it does not apply directly to the BLAKE2b hash function (Saarinen and Aumasson 2015) that Dat employs. BLAKE2 has drawn cryptanalysis of its own, though: a researcher from Tsinghua University in Beijing has published boomerang attacks on BLAKE and BLAKE2 (Hao 2015). Both the hash function BLAKE2b and the potential attack paper are new (2012 and 2014, respectively), so there will still be research to come.

4.4 Tox chat

Tox is a decentralized chat service. It uses a DHT for swarm discovery and then, once a connection to a peer is established, encrypts all traffic between the nodes (“A New Kind of Instant Messaging,” n.d.). It also uses Ed25519 via the libsodium library for public-key signatures.

References

“A New Kind of Instant Messaging.” n.d. Project Tox. https://tox.chat/faq.html.

Berners-Lee, Tim, Roy T. Fielding, and Larry Masinter. 2005. “Uniform Resource Identifier (URI): Generic Syntax.” STD 66. RFC Editor; Internet Requests for Comments; RFC Editor. http://www.rfc-editor.org/rfc/rfc3986.txt.

Bernstein, Daniel J, Niels Duif, Tanja Lange, Peter Schwabe, and Bo-Yin Yang. 2017. “Ed25519: High-Speed High-Security Signatures.” Ed25519: High-Speed High-Security Signatures. https://ed25519.cr.yp.to/.

Bernstein, Daniel J., Niels Duif, Tanja Lange, Peter Schwabe, and Bo-Yin Yang. 2012. “High-Speed High-Security Signatures.” Journal of Cryptographic Engineering 2 (2): 77–89. doi:10.1007/s13389-012-0027-1.

Frazee, Paul, and Tara Vancil. n.d. “Hosting for the Peer-to-Peer Web.” Hashbase. https://hashbase.io/.

Gelbmann, Matthias. 2015. “WordPress Powers 25.” W3Techs. https://w3techs.com/blog/entry/wordpress-powers-25-percent-of-all-websites.

Hao, Yonglin. 2015. “The Boomerang Attacks on BLAKE and BLAKE2.” Information Security and Cryptology Lecture Notes in Computer Science, 286–310. doi:10.1007/978-3-319-16745-9_16.

“Multicast DNS.” 2018. Wikipedia. Wikimedia Foundation. https://en.wikipedia.org/wiki/Multicast_DNS.

Nottingham, M., and E. Hammer-Lahav. 2010. “Defining Well-Known Uniform Resource Identifiers (URIs).” RFC 5785. RFC Editor; Internet Requests for Comments; RFC Editor. http://www.rfc-editor.org/rfc/rfc5785.txt.

Ogden, Maxwell, Karissa McKelvey, Mathias Buus Madsen, and Code for Science. 2017. “Dat - Distributed Dataset Synchronization and Versioning.” Code for Science & Society. https://datproject.org/paper.

Saarinen, Markku-Juhani, and Jean-Philippe Aumasson. 2015. The BLAKE2 Cryptographic Hash and Message Authentication Code (MAC): IETF RFC 7693. Request for Comments 7693. Internet Engineering Task Force. doi:10.17487/RFC7693.

Stevens, Marc, Elie Bursztein, Pierre Karpman, Ange Albertini, Yarik Markov, Alex Petit Bianco, and Clement Baisse. 2017. “Announcing the First SHA1 Collision.” Google Online Security Blog. https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html.

Wikimedia Commons. 2015. “File:Hash Function.svg — Wikimedia Commons, the Free Media Repository.” https://commons.wikimedia.org/w/index.php?title=File:Hash_function.svg&oldid=172142077.


  1. In addition, there are more resources in the Dat docs, found in the docs repository.

  2. If the user has the forethought, however, to generate a revocation key ahead of time, they are in more luck. That revocation key is an easy way to mark the lost key as “revoked” in a trusted way.