Another day, another mind-blowing show of capabilities from the graph database market. Neo Technology recently released its “world’s biggest graph database” report, which features a Linked Data Benchmark Council (LDBC) workload run against a graph of 1 trillion relationships, with throughput and latency improving near-linearly with the number of shards the graph is distributed across.

If there was any doubt, Neo4j has now joined the trillion-scale graph party, on the heels of its headline-making funding announcement. I wrote recently about another entrant as well: a Stardog demonstration of a massive knowledge graph consisting of materialized and virtual graphs spanning multiple cloud platforms, which showed that a one-trillion-edge knowledge graph can deliver sub-second query times without storing all the data in a central location.

Several RDF-based systems have published benchmark results for graphs of 1 trillion triples, for example Cambridge Semantics, Oracle, and Cray. There are key differences to consider between these results, including the use of distributed (i.e., real-world) data, truly randomized queries, inference, and the cost of operations. Cost should be weighed against the time required to do the work: a system that completes the work faster runs for less time and bills for less.

The Cambridge Semantics benchmark helped start the “trillion” craze five years ago by reporting a completed load and query of one trillion triples on the Google Cloud Platform in just under two hours.

Still impressive by today’s standards, AnzoGraph’s benchmark performed complex analytic-style queries that traversed large portions of the graph to create and return enormous result sets, highlighting its MPP performance and scale. The benchmark also featured materialized inference of billions of triples in minutes, demonstrating the powerful ELT and data integration capabilities required for enterprise knowledge graph and data fabric use cases.

Graph technology has come a long way since then. As we noted about CSI’s trillion-triple benchmark setup, $200/hr of compute was required. Even though price performance has improved since 2016, no infrastructure group would feel great about that running 24×7. With Anzo’s Kubernetes-based cloud automation, clusters are now effortlessly spun up on demand, on any cloud, to take advantage of competitive spot pricing. Because the parallel load, transform, and query times are so fast, these clusters do not require a long life to accomplish substantial analytic and data preparation tasks.

CSI has also added parallel virtualization capabilities to the AnzoGraph MPP engine, allowing a single SPARQL query to scan, join, and aggregate data in memory, on disk, or fully virtualized against the source, without the query writer needing to worry about which parts of the graph are stored where. The engine intelligently creates a plan across memory and virtual endpoints, pushing down automatically generated native queries and filters to source systems such as SQL RDBMSs and Elasticsearch.
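Standard SPARQL has a built-in construct for expressing this kind of federation: the SERVICE keyword from SPARQL 1.1 Federated Query. The sketch below is illustrative only — the endpoint URL, prefix, and property names are hypothetical placeholders, and AnzoGraph’s planner generates this kind of remote pushdown automatically rather than requiring the query writer to spell it out:

```sparql
# Join locally stored graph data with a virtualized remote source.
# The endpoint and predicates below are hypothetical placeholders.
PREFIX ex: <http://example.org/>

SELECT ?customer ?name (SUM(?amount) AS ?total)
WHERE {
  # Resolved against the locally loaded (in-memory) graph
  ?customer a ex:Customer ;
            ex:name ?name .
  # Evaluated at a remote endpoint (e.g., a SQL source exposed via SPARQL)
  SERVICE <http://example.org/sales-endpoint/sparql> {
    ?order ex:placedBy ?customer ;
           ex:amount ?amount .
  }
}
GROUP BY ?customer ?name
```

In AnzoGraph’s case, the query writer can omit the SERVICE clause entirely; the engine decides which triple patterns to evaluate in memory and which to translate into native queries and filters against the underlying source.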

Have you reconsidered what a graph database can do for your workloads, without limitation?

McKnight Consulting Group