Streaming Graph vs. Graph Database¶
Similarities¶
Quine is a streaming graph, and that's a brand new thing! A good place to begin understanding streaming graphs is from the perspective of a graph database. The two have a lot in common: both store data and represent it as a native graph, both get to that data through query languages, and both can be plugged into other tools to build pipelines for data processing workflows.
Differences¶
Counting Is Hard¶
When learning a new database, the first thing most engineers do is load a couple of rows of data and then query the system to count how many items were loaded. This is a common sanity check: confirm that a really simple operation behaves as expected, then build your understanding from there.
If you try to count nodes in Quine, you might be disappointed, because the simple operation of counting can actually be quite expensive. Databases usually keep a running count of how many rows they manage. Every time a new row is added, that count is incremented by one. Remove a row and the count is decremented by one. For the sake of high-volume streaming performance, Quine works differently, both in how nodes are created (see the next section) and in how a node contributes to the count of "all nodes".
All Nodes Exist¶
A normal database starts empty and gets filled with data over time. When you start Quine for the first time, every node that could exist does exist! What, what? Yep. As a design principle, Quine never "creates" a node. Instead you simply address a node and start using it. This turns out to be a critical design principle for building a graph from streaming data. If data from two separate streams each refer to the same node:
- In a normal database you'd first have to check if the node exists (a read query), create it if it doesn't exist (a write query), and then continue with your operation (another read, write, both, or more). That needs to be done for every single item arriving on every stream! It turns one operation into three, and it interleaves reading and writing, which defeats many optimization techniques. These are real-world performance destroyers which most benchmarks just skip over.
- In Quine, you never create a node. The node you want is already there and waiting for you to use it. So if two streams reference the same node, they can both simply start using that node and avoid all the extra bookkeeping and error handling that normal databases would have to do.
This is a design principle; it does not require allocating all possible space. "All nodes exist"… by convention.
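The difference between the two approaches above can be sketched with a toy in-memory store. This is only an illustration of the principle, not Quine's implementation: the class names and the `operations` counter are hypothetical, and a `defaultdict` stands in for "every node is addressable, but space is only allocated on first use".

```python
from collections import defaultdict


class CheckThenCreateStore:
    """Conventional pattern: read to check, write to create, then operate."""

    def __init__(self):
        self.nodes = {}
        self.operations = 0  # counts round trips to the store

    def set_property(self, node_id, key, value):
        self.operations += 1          # read: does the node exist?
        if node_id not in self.nodes:
            self.operations += 1      # write: create the node
            self.nodes[node_id] = {}
        self.operations += 1          # write: the operation we actually wanted
        self.nodes[node_id][key] = value


class AllNodesExistStore:
    """Every ID is addressable; storage is allocated lazily on first use."""

    def __init__(self):
        self.nodes = defaultdict(dict)
        self.operations = 0

    def set_property(self, node_id, key, value):
        self.operations += 1          # just use the node
        self.nodes[node_id][key] = value


conventional = CheckThenCreateStore()
lazy = AllNodesExistStore()
for store in (conventional, lazy):
    store.set_property("user-42", "name", "Ada")          # arrives on stream A
    store.set_property("user-42", "country", "UK")        # arrives on stream B

print(conventional.operations, lazy.operations)
```

Both stores end up holding the same data, but the second never has to ask whether the node exists before writing to it.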
IDs Over Indices¶
In order to find starting points for queries, databases usually maintain indices (a.k.a. indexes). An index is a duplicate copy of some of the data which is organized in a different way to make finding specific values faster/easier. It comes at the cost of having to maintain that duplicate copy (the index) every time another value changes. That's why indices degrade performance. A smart database admin/user will carefully limit which values are indexed. The more that values change and the more values there are, the greater the performance cost.
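The maintenance cost described above can be made concrete with a small sketch. The `IndexedTable` class and its `index_writes` counter are hypothetical names used only for illustration; real databases use more sophisticated index structures, but the extra write work per change is the same idea.

```python
from collections import defaultdict


class IndexedTable:
    """Toy table with one secondary index on the stored value."""

    def __init__(self):
        self.rows = {}                     # primary storage: row_id -> email
        self.by_email = defaultdict(set)   # index: email -> row_ids (duplicate copy)
        self.index_writes = 0              # extra writes spent keeping the index in sync

    def put(self, row_id, email):
        old = self.rows.get(row_id)
        if old is not None:
            self.by_email[old].discard(row_id)   # remove the stale index entry
            self.index_writes += 1
        self.rows[row_id] = email
        self.by_email[email].add(row_id)         # add the new index entry
        self.index_writes += 1

    def find(self, email):
        # Fast lookup by value, paid for by the writes counted above.
        return set(self.by_email.get(email, ()))


table = IndexedTable()
table.put(1, "ada@example.com")
table.put(1, "ada@new.example.com")   # one logical update, two index writes

print(table.find("ada@new.example.com"), table.index_writes)
```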
Quine is built to operate on infinite streams of data. Literally. You can't realistically build an index over an infinite set of data. So how would Quine find starting points for queries? In a graph, every node could be a starting point, and the edge connections that make up the graph are the perfect way to navigate through the data. But how would you get started?
Quine is built so that addressing a node by its ID is an extremely fast and efficient operation. If you can find a node by identifying its ID, then every other part of the query can execute efficiently. So Quine includes a special query function called idFrom(…) to determine node IDs based on values—exactly the kind of values that would get indexed. But instead of paying the maintenance cost of indexing those values, the idFrom(…) function efficiently turns a set of values into a consistent node ID. It's similar to a consistent hash, but it produces a consistent node ID — even if you're using custom ID types or any of the other types of IDs that Quine ships with.
The key takeaway for understanding this difference is: when running ad hoc or ingest queries in Quine, you should always include at least one call to idFrom(…) to efficiently find a starting point for your query. Without a call to idFrom(…) Quine will have to scan all nodes to find starting points for your query—which will be very slow!
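To illustrate the idea of deriving an ID from values instead of looking values up in an index, here is a minimal Python sketch. It is an assumption-laden stand-in, not Quine's algorithm: the `id_from` function, the JSON encoding, the SHA-256 hash, and the UUID-shaped result are all choices made for this example.

```python
import hashlib
import json
import uuid


def id_from(*values):
    """Hypothetical stand-in for idFrom: the same values always give the same ID."""
    # Serialize deterministically so equal inputs produce identical bytes.
    encoded = json.dumps(values, separators=(",", ":")).encode("utf-8")
    digest = hashlib.sha256(encoded).digest()
    # Fold the first 16 bytes of the digest into a UUID-shaped identifier.
    return uuid.UUID(bytes=digest[:16])


graph = {}  # node ID -> properties; no index over property values is kept

# Ingest: compute the ID from the identifying values and write to that node.
ingest_id = id_from("user", "ada@example.com")
graph.setdefault(ingest_id, {})["name"] = "Ada"

# Later query: recompute the same ID from the same values and jump straight to the node.
lookup_id = id_from("user", "ada@example.com")
print(graph[lookup_id])
```

The lookup never scans the graph or consults an index; knowing the identifying values is enough to compute where the node lives.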
Tip
If you're writing a Standing Query, you do not need to use idFrom(…) because Standing Queries are an efficient way of scanning all nodes…
Standing Queries¶
The most unusual and most powerful capability in Quine is the ability to execute a graph query as a Standing Query. Standing Queries are just like normal ad hoc queries, except that instead of running once and terminating when all results are found, a Standing Query continues to run forever or until it's cancelled. A Standing Query efficiently monitors the entire graph, continually watching each change to determine whether it produces a new result for the query. Standing Queries mean that you can subscribe to a stream of future results. Whenever the data changes in a way that produces a new result, the Standing Query output handler is triggered and a new result is produced.
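The subscription behavior described above can be sketched with a toy model. This is not how Quine implements or expresses Standing Queries (which match graph patterns, not single-node predicates); `TinyGraph`, `register_standing_query`, and the predicate and handler signatures are hypothetical and exist only to show "re-check on each change, emit each new result once".

```python
class TinyGraph:
    """Toy property store that re-checks registered patterns on every change."""

    def __init__(self):
        self.nodes = {}
        self.standing_queries = []

    def register_standing_query(self, predicate, on_result):
        # Remember which node IDs were already reported so each match is emitted once.
        self.standing_queries.append((predicate, on_result, set()))

    def set_property(self, node_id, key, value):
        props = self.nodes.setdefault(node_id, {})
        props[key] = value
        # Only the node that just changed is re-evaluated, not the whole graph.
        for predicate, on_result, reported in self.standing_queries:
            if node_id not in reported and predicate(props):
                reported.add(node_id)
                on_result(node_id, dict(props))


results = []
graph = TinyGraph()
graph.register_standing_query(
    lambda props: props.get("type") == "login" and props.get("failed", 0) >= 3,
    lambda node_id, props: results.append(node_id),  # output handler
)

graph.set_property("evt-1", "type", "login")   # pattern incomplete: nothing emitted
graph.set_property("evt-1", "failed", 3)       # completes the pattern: handler fires
graph.set_property("evt-1", "failed", 4)       # still matches, but already reported

print(results)
```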
Save History, Not State¶
Databases save the state of each item to disk. That's their job. To lean in to the world of infinite streaming and continuous updates, Quine takes a different approach. Quine doesn't save the state of each node in the graph it manages; Quine saves the history of changes to each node. It's a technique known as event sourcing, and it's used widely in stream processing tools, but not so often in databases. Some databases produce a change data capture (CDC) stream as a secondary feature alongside their primary job of storing state. Quine essentially inverts this picture: the log of changes is the primary data stored, and saving the rolled-up, materialized state of those changes is just an optimization.
Note
This is customizable in Quine. It's possible to disable saving journals (the append-only log of changes) and instead to save only the materialized state.
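A minimal sketch of event sourcing with an optional snapshot may help make this concrete. The `NodeJournal` class, its event tuples, and its method names are hypothetical; Quine's actual journal and snapshot formats are not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class NodeJournal:
    events: list = field(default_factory=list)    # append-only log of changes
    snapshot: dict = field(default_factory=dict)  # optional rolled-up state
    snapshot_upto: int = 0                        # how many events the snapshot covers

    def append(self, event):
        # Events are (kind, key, value) tuples; nothing is ever rewritten in place.
        self.events.append(event)

    def take_snapshot(self):
        self.snapshot = self.materialize()
        self.snapshot_upto = len(self.events)

    def materialize(self):
        # Start from the snapshot and replay only the events recorded after it.
        state = dict(self.snapshot)
        for kind, key, value in self.events[self.snapshot_upto:]:
            if kind == "set":
                state[key] = value
            elif kind == "remove":
                state.pop(key, None)
        return state


journal = NodeJournal()
journal.append(("set", "name", "Ada"))
journal.append(("set", "city", "London"))
journal.take_snapshot()                     # optimization only; the log stays primary
journal.append(("remove", "city", None))

state = journal.materialize()
print(state, len(journal.events))
```

The current state is always derivable from the log; the snapshot just shortens the replay.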
Read / Write Performance¶
Databases are usually optimized for either read-heavy workloads OR write-heavy workloads. Quine is designed to solve for both. How is that possible? Well, because Quine stores data as an append-only log, it's very fast to save new updates. Those updates have to be computed somewhere before they're written, so that is done in the graph, in memory. The graph-shaped representation of data is only held in memory, and each node is its own independent process. When new data comes in, each node computes what changes are needed and saves those deltas to disk asynchronously and with a timestamp.
When a query arrives to read data, that query only needs to work with the in-memory state of the node, which ends up behaving like a cache. So read operations (which require the complex structure of the graph) don't usually need to touch the disk. The only time a read causes a read from disk is when a node has been expired out of memory and needs to be loaded back in to serve a query. This causes the log to be read (sped up by a snapshot if one is available) and the node to become cached in memory and available for other operations without touching disk until it is expired from the cache.
This approach lets Quine use both the streaming activity and the graph structure itself to make smart choices about which nodes are worth keeping in memory for longer.
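The read and write paths described above can be modeled roughly as follows. This sketch makes several simplifying assumptions that are not Quine's behavior: writes to the log are synchronous, timestamps and snapshots are omitted, and eviction is plain least-recently-used rather than anything informed by the graph. `NodeCache` and `log_replays` are hypothetical names.

```python
from collections import OrderedDict


class NodeCache:
    """Toy model: in-memory node states backed by a per-node change log."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_memory = OrderedDict()  # node_id -> state, least recently used first
        self.log = {}                   # stand-in for the on-disk append-only log
        self.log_replays = 0            # how often a node had to be loaded from the log

    def _wake(self, node_id):
        if node_id in self.in_memory:
            self.in_memory.move_to_end(node_id)   # mark as recently used
            return self.in_memory[node_id]
        # Node is not in memory: rebuild its state by replaying its log.
        self.log_replays += 1
        state = {}
        for key, value in self.log.get(node_id, []):
            state[key] = value
        self.in_memory[node_id] = state
        if len(self.in_memory) > self.capacity:
            self.in_memory.popitem(last=False)    # expire the least recently used node
        return state

    def write(self, node_id, key, value):
        state = self._wake(node_id)
        state[key] = value
        self.log.setdefault(node_id, []).append((key, value))  # append-only delta

    def read(self, node_id):
        return dict(self._wake(node_id))


cache = NodeCache(capacity=2)
cache.write("a", "x", 1)
cache.write("b", "x", 2)
cache.write("c", "x", 3)          # "a" is expired from memory here

hot = cache.read("c")             # served from memory, no log access
replays_after_hot = cache.log_replays
cold = cache.read("a")            # expired earlier, so its log is replayed
replays_after_cold = cache.log_replays

print(hot, cold, replays_after_hot, replays_after_cold)
```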
The Persistor Layer¶
Underneath the in-memory graph there is a layer of Quine known as the persistor. This is where data is actually stored to disk. And this is where we finally get to the part of Quine that is exactly like a database, because it IS a database. Which database? Well, you can choose. Quine supports many types of databases for the persistence layer and we add new ones all the time. You can even add your own!
As a general rule of thumb, the persistence layer should be at least a key-value store. Key-value stores are usually the fastest option, at the expense of the ability to do complex queries. Quine handles complex queries in memory, as mentioned above, so the database doesn't need to do anything complex; it just needs to append values quickly and occasionally return a range of values in a chunk. Quine inherits much of the functionality of the chosen underlying database, and the latency of these operations is usually the most significant factor. So scalable key-value stores (or column stores used simply) are usually the best choice. Quine has integrations with Cassandra and ScyllaDB for high-volume networked use cases, but also ships with MapDB for in-process data storage (including in-memory only as an option) and RocksDB as the default persistor for persisting to disk on the same machine running Quine.
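The two operations named above (append a value quickly, return a range of values) suggest how small the storage contract can be. The sketch below is a hypothetical illustration of such a contract, not Quine's persistor interface; the `Persistor` protocol, the `InMemoryPersistor` class, and the `(node_id, timestamp)` key layout are all assumptions.

```python
from typing import Iterable, Protocol


class Persistor(Protocol):
    """Hypothetical minimal contract: append deltas, read back a range."""

    def append(self, node_id: str, timestamp: int, delta: bytes) -> None: ...

    def read_range(self, node_id: str, start: int, end: int) -> Iterable[bytes]: ...


class InMemoryPersistor:
    """Dictionary-backed stand-in for a key-value store."""

    def __init__(self):
        self._rows = {}  # (node_id, timestamp) -> delta

    def append(self, node_id, timestamp, delta):
        self._rows[(node_id, timestamp)] = delta

    def read_range(self, node_id, start, end):
        # Return one node's deltas with start <= timestamp <= end, in time order.
        keys = sorted(
            k for k in self._rows if k[0] == node_id and start <= k[1] <= end
        )
        return [self._rows[k] for k in keys]


store: Persistor = InMemoryPersistor()
store.append("n1", 10, b"set name=Ada")
store.append("n1", 20, b"set city=London")
store.append("n2", 15, b"set name=Bob")

recent = store.read_range("n1", 15, 30)
everything = store.read_range("n1", 0, 100)
print(recent, everything)
```

Any store that can serve these two calls with low latency fits the role; nothing here requires joins, secondary indices, or a query planner.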