Da Yan, University of Alabama at Birmingham
The following Great Innovative Idea is from Da Yan, an assistant professor in the Department of Computer and Information Sciences (CIS) at the University of Alabama at Birmingham (UAB). Yan presented his poster, Big Data Frameworks: Bridging High Performance of HPC Community with Programming Friendliness of Data Science Community, at the CCC Symposium on Computing Research, October 23-24, 2017.
The Idea
Existing Big Data frameworks such as Hadoop, Spark and Google’s Pregel emphasize on programming simplicity, where a distributed algorithm can be written with just a few lines of code. However, they only target data-intensive analytics, where the workloads are mainly generated by data volume, and network communication is the performance bottleneck. For compute-intensive tasks where the workloads are mainly generated by the huge search space, existing systems need to materialize (and transmit) huge amounts of intermediate data, and severely underutilize the CPU resources.
As a researcher with five years of experience in data-intensive graph analytics (e.g., random walks and graph traversal), I observe that there lacks a scalable solution to compute-intensive graph analytic tasks like graph mining and subgraph matching, and I envision that Big Data frameworks for compute-intensive analytics would become the next trend of Big Data systems research. My group has developed a distributed system called G-thinker, which spawns subgraphs from individual vertices, and processes each subgraph in a single machine without incurring network communication. Our G-thinker prototype is orders of magnitude faster than existing systems, and can scale to graphs that are orders of magnitude larger given the same cluster environment.
Impact
Our data-intensive graph analytics systems have already been used by peer researchers in their work in first-tier conferences like SIGMOD and ICDE. We have also applied them in finding cybercriminals from online social networks and forums, by running random walk algorithms such as TrustRank, HITS and SALSA to get graph-based scores, and then integrating them for collective classification. This work is a collaboration with the UAB Center for Information Assurance and Joint Forensics Research (CIA|JFR). Another application is in large-scale genome assembly, where contigs (long DNA sequences) can be found from the de Bruijn graph using the list ranking algorithm on Pregel+.
Our new system, G-thinker, opens more opportunities for graph-based research. My group plans to use G-thinker to find communities involving cybercriminals from online social networks and forums, and visualize them using force-directed algorithms. This would assist computer forensic experts in UAB CIR|JFR in finding new cybercriminals. G-thinker will also enable numerous research involving compute-intensive graph analytics that was impossible before.
Other Research
My group works on scalable systems and algorithms for Big Data analytics, and their application in various real and often interdisciplinary projects. In addition to graph data, we also process geospatial data, and matrix/tensor data. Existing systems such as Apache SystemML perform matrix computations using data-intensive frameworks like Hadoop MapReduce and Spark, and we plan to design a dedicated platform for compute-intensive matrix/tensor computations. The target applications of our systems and algorithms include (but are not limited to) digital forensics, genomic and medical data analysis, etc.
Researcher’s Background
I’ve been an assistant professor at the CS department of the University of Alabama at Birmingham (UAB) since August 2016. I obtained my Ph.D. degree in CS from the Hong Kong University of Science and Technology in 2014, and my B.S. degree in CS from Fudan University in Shanghai. I am the sole winner of Hong Kong 2015 Young Scientist Award in Physical/Mathematical Science, due to my contribution to Big Data analytics. I developed a number of systems for Big Graph analytics, such as Pregel+, Blogel, Quegel and GraphD (collectively called BigGraph@CUHK). The systems are up to orders of magnitude faster than their competitors, and are widely used with a high reputation. Relevant publications include those in first-tier journals like PVLDB (4 papers) and TPDS, and first-tier conferences include SIGMOD, WWW and SoCC. I have published a survey book in Foundations and Trends in Databases, which summarizes the state-for-the-art systems for big graph analytics. More information about my research can be found at
http://www.cs.uab.edu/yanda.