Graph storage
Real-time analytics, transactional graph storage
A recent trend in graph analytics has been real-time analytics on “fresh” data, that is, on dynamic graphs with recent information that is incrementally updated. Use cases include social networking systems, system security, recommendations, and NLP applications. These workloads can be defined as Hybrid Transactional-Analytical Processing (HTAP) and require specific system support.
In my recent work on LiveGraph, I designed the first graph storage system that supports transactional updates and fast snapshot analytics. The work showed that popular data structures used in database management systems and storage systems to support updates, such as B+ trees and Log-Structured Merge Trees (LSMTs), yield significantly worse performance in graph analytics than read-only graph-aware data structures like Compressed Sparse Rows (CSR). This is because the former require random access when scanning adjacency lists, i.e., the list of vertices neighboring a given vertex. Adjacency list scans are a key operation in graph analytics and they are executed very frequently. LiveGraph pro- posed a new graph-aware data structure called the Transactional Edge Log (TEL) to store adjacency lists. The TEL integrates a log-based sequential data layout with a low-overhead transactional concurrency control algorithm. The sequential data layout ensures fast adjacency list scans even while the graph is being updated. The concurrency control algorithm leverages the cache coherency features of the CPU to make sure that long-running reads do not interfere with each other and with writes. LiveGraph outperforms existing state-of-the-art storage and database management systems supporting transactions. It outperforms Facebook’s RocksDB by up to 7.45× using Facebook’s social graph benchmark. On real-time HTAP analytics workloads like LDBC SNB interactive, LiveGraph is up to 36.4× faster than the runner-up. Incrementally updating the graph storage enables running iterative graph analytics (like PageRank) on the latest snapshot without the cost of loading the graph, resulting in lower end-to-end processing time.