Systems for general graph mining
Programming abstractions, scalability, performance
Graph mining is an important class of graph analytics tasks such as frequent subgraph mining and motif counting. These algorithms are used to discover relevant patterns that comprise both structure-based and label-based properties of the graph. Graph mining is widely used for several applications, for example, discovering 3D motifs in protein structures or chemical compounds, mining attributed patterns over knowledge bases, finding communities in social networks, among others. Graph mining algorithms iteratively inspect subgraphs of increasing size, prune them if they are not interesting, or otherwise expand them further. This subgraph exploration process is a major potential bottleneck: for example, even a small citation graph with only 4000 vertices generates more than a billion six-vertex subgraphs.
I developed Arabesque, the first system to offer a simple and general programming model for graph mining and a runtime specifically optimized for distributed subgraph exploration. Arabesque simplifies and thus democratizes the design of scalable graph mining algorithms by offering a simple functional API that abstracts away parallelism. Data scientists can write highly performant distributed implementations of fundamental graph mining applications using only a handful of lines of code, ranging from 6 to 280 for some common algorithms we initially implemented. Among these algorithms was the first distributed implementation of frequent subgraph mining. Arabesque’s subgraph-centric programming model was based on the observation that existing vertex- and edge-centric graph processing systems, like Pregel or GraphX, were not a good fit for graph mining in terms of both expressivity and performance. Arabesque directly influenced the development of many other systems for graph mining that targeted the same class of applications and further optimized subgraph exploration. Independent work on NScale started from a similar observation and focused on the complementary problem of scaling the execution of subgraph-centric programs after relevant subgraphs are found.
I then developed QFrag, a system that optimized distributed subgraph isomorphism, a problem related to graph mining. Rather than using distributed joins, as done by previous work, QFrag replicates the graph at all workers and performs subgraph isomorphism in parallel. This strategy is potentially fast but it yields poor load balancing due to the skew present in natural graphs. QFrag addresses this problem by splitting up expensive tasks and distributing the splits across multiple workers, a technique called task fragmentation.
More recent graph mining systems use graph pattern matching as a fundamental underlying primitive. My recent work on GraphMini explores how to proactively prune graphs to speed up graph pattern matching by leveraging the structure of the query pattern and the input graph. We propose building auxiliary graphs, which are different pruned versions of the graph, during query execution. This requires careful balancing between the upfront cost of building and managing auxiliary graphs and the gains of faster set operations. GraphMini is a new system that uses query compilation and a new cost model to minimize the cost of building and maintaining auxiliary graphs and maximize gains. Our evaluation shows that using GraphMini can achieve one order of magnitude speedup compared to state-of-the-art subgraph enumeration systems on commonly used benchmarks. Our evaluation shows that GraphMini outperforms state-of- the-art systems like GraphPi and Dryadic by up to 30.6x and 60.7x respectively. In addition, the code generation algorithm has a minimal impact on the end-to-end query execution time.
References
2023
2017
-
Qfrag: Distributed Graph Search via Subgraph IsomorphismIn Proceedings of the 2017 ACM Symposium on Cloud Computing (SoCC), 2017