Publications

Books & Book Chapters

Heterogeneous Computing with OpenCL 2.0
D. Kaeli, P. Mistry, D. Schaa, D. P. Zhang. Morgan Kaufmann Publishers, 2015. Available in bookstores and on Amazon.
Chapter 13: OpenCL Profiling and Debugging, and Chapter 14: Performance Optimisation of an Image Analysis Application on dGPUs and APUs, in Heterogeneous Computing with OpenCL, 2nd Edition
D. P. Zhang. Morgan Kaufmann Publishers, 2012.
Multi-dimensional image segmentation and registration: Coronary artery segmentation and motion modelling
D. P. Zhang, LAP LAMBERT Academic Publishing, 2013.

Papers

Horton Tables: Fast Hash Tables for In-Memory Data-Intensive Computing
Alex Breslow, Dong Ping Zhang, Joseph Greathouse, Nuwan Jayasena, Dean Tullsen. USENIX Annual Technical Conference, 2016.
Abstract

Hash tables are important data structures that lie at the heart of key applications such as key-value stores and relational databases. Typically bucketized cuckoo hash tables (BCHTs) are used because they provide high-throughput lookups and load factors that exceed 95%. Unfortunately, this performance comes at the cost of reduced memory access efficiency. Positive lookups (key is in the table) and negative lookups (where it is not) on average access 1.5 and 2.0 buckets, respectively, which results in 50 to 100% more table-containing cache lines being accessed than minimally necessary.

To reduce these surplus accesses, this paper presents the Horton table, a revamped BCHT that reduces the expected cost of positive and negative lookups to fewer than 1.18 and 1.06 buckets, respectively, while still achieving load factors of 95%. The key innovation is remap entries, small in-bucket records that allow (1) more elements to be hashed using a single, primary hash function, (2) items that overflow buckets to be tracked and rehashed with one of many alternate functions while maintaining a worst-case lookup cost of 2 buckets, and (3) the vast majority of negative searches to be shortened to 1 bucket access. With these advancements, Horton tables outperform BCHTs by 17% to 89%.
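For intuition, the remap-entry idea can be sketched in a few lines of Python (a deliberate simplification with invented constants, not the paper's implementation; real Horton tables also handle tag collisions and secondary-bucket overflow):

SLOTS, REMAPS, NUM_ALT = 4, 8, 4

class HortonTable:
    def __init__(self, num_buckets):
        self.n = num_buckets
        self.buckets = [[] for _ in range(num_buckets)]           # up to SLOTS items each
        self.remaps = [[0] * REMAPS for _ in range(num_buckets)]  # 0 means 'no overflow'

    def _primary(self, key):
        return hash(('h0', key)) % self.n

    def _alt(self, key, fn):
        return hash(('alt', fn, key)) % self.n

    def _tag(self, key):
        return hash(('tag', key)) % REMAPS

    def insert(self, key, value):
        b = self._primary(key)
        if len(self.buckets[b]) < SLOTS:
            self.buckets[b].append((key, value))       # common case: primary bucket
            return
        fn = 1 + hash(('pick', key)) % NUM_ALT         # choose an alternate hash function
        self.remaps[b][self._tag(key)] = fn            # remember it in the remap entry
        self.buckets[self._alt(key, fn)].append((key, value))

    def lookup(self, key):
        b = self._primary(key)
        for k, v in self.buckets[b]:
            if k == key:
                return v                               # positive lookup: 1 bucket
        fn = self.remaps[b][self._tag(key)]
        if fn == 0:
            return None                                # negative lookup: still 1 bucket
        for k, v in self.buckets[self._alt(key, fn)]:
            if k == key:
                return v                               # remapped item: worst case 2 buckets
        return None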

HADM: Hybrid Analysis for Detection of Malware
Lifan Xu, Dong Ping Zhang, Nuwan Jayasena, John Cavazos. IntelliSys, 2016.
Abstract

Android is the most popular mobile operating system with a market share of over 80% [1]. Due to its popularity and also its open source nature, Android is now the platform most targeted by malware, creating an urgent need for effective defense mechanisms to protect Android-enabled devices.

In this paper, we propose a novel Android malware classification method called HADM, Hybrid Analysis for Detection of Malware. We first extract static and dynamic information, and convert this information into vector-based representations. It has been shown that combining advanced features derived by deep learning with the original features provides significant gains [2]. Therefore, we feed both the original dynamic and static feature vector sets to a Deep Neural Network (DNN) which outputs a new set of features. These features are then concatenated with the original features to construct DNN vector sets. Different kernels are then applied to the DNN vector sets. We also convert the dynamic information into graph-based representations and apply graph kernels to the graph sets. Learning results from the various vector and graph feature sets are combined using hierarchical Multiple Kernel Learning (MKL) to build a final hybrid classifier.
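The vector branch of this pipeline can be sketched roughly as follows (synthetic data and illustrative shapes, not the paper's code; the hierarchical MKL fusion step is only indicated by training one SVM per kernel):

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X = np.random.rand(200, 64)             # stand-in for static/dynamic feature vectors
y = np.random.randint(0, 2, 200)        # 1 = malware, 0 = benign (synthetic labels)

dnn = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X, y)
hidden = np.maximum(X @ dnn.coefs_[0] + dnn.intercepts_[0], 0)   # ReLU activations
X_dnn = np.hstack([X, hidden])          # 'DNN vector set': original + learned features

# One SVM per kernel; MKL would learn a weighted combination of such kernels.
svms = {k: SVC(kernel=k).fit(X_dnn, y) for k in ('linear', 'rbf', 'poly')}
print({k: clf.score(X_dnn, y) for k, clf in svms.items()})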

Dynamic Android Malware Classification Using Graph-Based Representations
Lifan Xu, Dong Ping Zhang, Marco A. Alvarez, Jose Andre Morales, John Cavazos. IEEE CSCloud, 2016.
Abstract

Malware classification for the Android ecosystem can be performed using a range of techniques. One major technique that has been gaining ground recently is dynamic analysis based on system call invocations recorded during the execution of Android applications. Dynamic analysis has traditionally been based on converting system calls into flat feature vectors and feeding the vectors into machine learning algorithms for classification.

In this paper, we implement three traditional feature-vector-based representations for Android system calls. For each feature vector representation, we also propose a novel graph-based representation. We then use graph kernels to compute pair-wise similarities and feed these similarity measures into a Support Vector Machine (SVM) for classification. To speed up the graph kernel computation, we compress the graphs using the Compressed Row Storage format, and then we apply OpenMP to parallelize the computation. Experiments show that the graph-based representations are able to improve the classification accuracy over the corresponding feature-vector-based representations from the same input. Finally, we show that different representations can be combined to further improve classification accuracy.
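A toy illustration of the graph-based idea (the transition-graph construction is an assumed simplification, and a cosine similarity over edge counts stands in for the paper's graph kernels):

from collections import Counter
import math

def call_graph(trace):
    # Weighted digraph over syscall names: edge (a, b) counts a->b transitions.
    return Counter(zip(trace, trace[1:]))

def edge_cosine(g1, g2):
    dot = sum(g1[e] * g2[e] for e in set(g1) & set(g2))
    norm = math.sqrt(sum(v * v for v in g1.values())) * \
           math.sqrt(sum(v * v for v in g2.values()))
    return dot / norm if norm else 0.0

benign = ['open', 'read', 'read', 'close', 'open', 'read']
suspect = ['open', 'read', 'close', 'socket', 'connect']
print(edge_cosine(call_graph(benign), call_graph(suspect)))

Pairwise similarities like this one would populate the kernel matrix fed to the SVM.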

Fine-Grained Task Migration for Graph Algorithms using Processing in Memory
Paula Aguilera, Dong Ping Zhang, Nam Sung Kim, Nuwan Jayasena. 18th Workshop on Advances in Parallel and Distributed Computational Models, 2016.
Abstract

Graphs are used in a wide variety of application domains, from social science to machine learning. Graph algorithms present large numbers of irregular accesses with little data reuse to amortize the high cost of memory accesses, requiring high memory bandwidth. Processing in memory (PIM) implemented through 3D die-stacking can deliver this high memory bandwidth. In a system with multiple memory modules with PIM, the in-memory compute logic has low latency and high bandwidth access to its local memory, while accesses to remote memory introduce high latency and energy consumption. Ideally, in such a system, computation and data are partitioned among the PIM devices to maximize data locality. But the irregular memory access patterns present in graph applications make it difficult to guarantee that the computation in each PIM device will only access its local data. A large number of remote memory accesses can negate the benefits of using PIM.

In this paper, we examine the feasibility and potential of fine-grained work migration to reduce remote data accesses in systems with multiple PIM devices. First, we propose a data-driven implementation of our study algorithms: breadth-first search (BFS), single source shortest path (SSSP) and betweenness centrality (BC), where each PIM device has a queue holding the vertices it needs to process. New vertices that need to be processed are enqueued at the PIM device co-located with the memory that stores those vertices (see the sketch below). Second, we propose hardware support that takes advantage of PIM to implement highly efficient queues that improve the performance of the queuing framework by up to 16.7%. Third, we develop a timing model for the queueing framework to explore the benefits of work migration vs. remote memory accesses. Finally, our analysis using the above framework shows that naïve task migration can lead to performance degradations, and identifies trade-offs among data locality, redundant computation, and load balance among PIM devices that must be taken into account to realize the potential benefits of fine-grained task migration.
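A serial Python sketch of the data-driven queuing scheme for BFS (the vertex-to-device mapping and the level-synchronous schedule are our assumptions, not the paper's hardware design):

from collections import deque

def partitioned_bfs(adj, source, num_pim=4):
    owner = lambda v: v % num_pim                 # assumed vertex-to-device mapping
    dist = {source: 0}
    queues = [deque() for _ in range(num_pim)]
    queues[owner(source)].append(source)
    while any(queues):
        nxt = [deque() for _ in range(num_pim)]   # next level's per-device queues
        for q in queues:                          # each device drains its own queue
            for u in q:
                for v in adj[u]:
                    if v not in dist:             # first discovery fixes the level
                        dist[v] = dist[u] + 1
                        nxt[owner(v)].append(v)   # enqueue at the owning device
        queues = nxt
    return dist

adj = {0: [1, 2], 1: [3], 2: [3], 3: [0]}
print(partitioned_bfs(adj, 0))                    # {0: 0, 1: 1, 2: 1, 3: 2}

The key property is that work migrates to the device that owns the data, instead of data being fetched remotely.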

Scaling Deep Learning on Multiple In-Memory Processors
Lifan Xu, Dong Ping Zhang, Nuwan Jayasena. WoNDP: 3rd Workshop on Near-Data Processing, 2015.
Abstract

Deep learning methods are proven to be state-of-the-art in addressing many challenges in machine learning domains. However, this comes at the cost of high computational requirements and energy consumption. The emergence of Processing In Memory (PIM) with die-stacking technology presents an opportunity to speed up deep learning computation and reduce energy consumption by providing low-cost high-bandwidth memory accesses. PIM uses 3D die stacking to move computations closer to memory and therefore reduces data movement overheads. In this paper, we study the parallelization of deep learning methods on a system with multiple PIM devices. We select three typical layers, the convolutional, pooling, and fully connected layers, from common deep learning models and parallelize them using different schemes. Preliminary results show we are able to reach competitive or even better performance using multiple PIM devices when compared with traditional GPU parallelization.
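One of the parallelization schemes can be illustrated with a toy model-parallel split of a fully connected layer (synthetic sizes; NumPy products stand in for work dispatched to the PIM devices):

import numpy as np

num_pim, batch, n_in, n_out = 4, 32, 512, 256
x = np.random.rand(batch, n_in)
W = np.random.rand(n_in, n_out)
shards = np.array_split(W, num_pim, axis=1)   # each device stores a slice of the weights

partials = [x @ w for w in shards]            # each product would run on one PIM device
y = np.concatenate(partials, axis=1)          # gather the output slices
assert np.allclose(y, x @ W)                  # identical to the unpartitioned layer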

Realizing the Full Potential of Heterogeneity through Processing in Memory
Nuwan Jayasena, Dongping Zhang, Amin Farmahini-Farahani, Mike Ignatowski. WoNDP: 3rd Workshop on Near-Data Processing, 2015.
Abstract

While many processing in memory (PIM) research studies demonstrate significant improvements in memory system energy efficiency, relatively little attention has been paid to the sources of overall energy efficiency of PIM systems. In this paper, we quantify the sources of energy efficiency of a GPU-based PIM design and show that selecting low-power operating points for the in-memory processors is an important aspect, accounting for a 1.9x improvement in energy efficiency compared to a mainstream implementation of the evaluated GPU design. Memory interface efficiency of PIM provides an additional 3.8x improvement over that. These results also demonstrate that, due to memory system inefficiencies, implementing high-performance and low-power heterogeneous cores on the same die attached to a conventional memory system can only realize a fraction of the overall improvement realized by PIM (52% in our study). While these results in general confirm conventional wisdom, we quantify the relative importance of these processor and memory efficiency factors across a wide range of benchmarks and encourage further research to enable and leverage the symbiosis between PIM and heterogeneous computing to further improve energy efficiency.

TOP-PIM: Throughput-Oriented Programmable Processing in Memory
Dong Ping Zhang, Nuwan Jayasena, Alexander Lyashevsky, Joseph Greathouse, Lifan Xu, Mike Ignatowski. The 23rd International ACM Symposium on High Performance Parallel and Distributed Computing (HPDC), Best Paper Award Finalist, 2014.
Abstract

As computation becomes increasingly limited by data movement and energy consumption, exploiting locality throughout the memory hierarchy becomes critical to continued performance scaling. Moving computation closer to memory presents an opportunity to reduce both energy and data movement overheads. We explore the use of 3D die stacking to move memory-intensive computations closer to memory. This approach to processing in memory addresses some drawbacks of prior research on in-memory computing and is commercially viable in the foreseeable future.

Because 3D stacking provides increased bandwidth, we study throughput-oriented computing using programmable GPU compute units across a broad range of benchmarks, including graph and HPC applications. We also introduce a methodology for rapid design space exploration by analytically predicting performance and energy of in-memory processors based on metrics obtained from execution on today’s GPU hardware. Our results show that, on average, viable PIM configurations show moderate performance losses (27%) in return for significant energy efficiency improvements (76% reduction in EDP) relative to a representative mainstream GPU at 22nm technology. At 16nm technology, on average, viable PIM configurations are performance competitive with a representative mainstream GPU (7% speedup) and provide even greater energy efficiency improvements (85% reduction in EDP).
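The flavour of such an analytical model can be conveyed with a simple roofline-style estimate (the formula and device numbers below are illustrative assumptions, not the paper's calibrated model):

def kernel_time(flops, bytes_moved, peak_flops, peak_bw):
    # Roofline assumption: the kernel is bound by whichever resource saturates first.
    return max(flops / peak_flops, bytes_moved / peak_bw)

kernel = dict(flops=2e12, bytes_moved=4e12)        # a memory-bound kernel profile
host_gpu = dict(peak_flops=5e12, peak_bw=0.3e12)   # fast logic, conventional DRAM
pim_gpu = dict(peak_flops=1e12, peak_bw=1.0e12)    # slower logic, stacked DRAM

for name, dev in (('host', host_gpu), ('pim', pim_gpu)):
    t = kernel_time(kernel['flops'], kernel['bytes_moved'], **dev)
    print(name, t)   # the PIM point wins on this memory-bound kernel

In the paper, the inputs to the model are metrics measured on today's GPU hardware rather than invented profiles like these.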

Bibtex

@inproceedings{Zhang:2014:TTP:2600212.2600213, author = {Zhang, Dongping and Jayasena, Nuwan and Lyashevsky, Alexander and Greathouse, Joseph L. and Xu, Lifan and Ignatowski, Michael}, title = {TOP-PIM: Throughput-oriented Programmable Processing in Memory}, booktitle = {Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing}, series = {HPDC '14}, year = {2014}, isbn = {978-1-4503-2749-7}, location = {Vancouver, BC, Canada}, pages = {85--98}, numpages = {14}, url = {http://doi.acm.org/10.1145/2600212.2600213}, doi = {10.1145/2600212.2600213}, acmid = {2600213}, publisher = {ACM}, address = {New York, NY, USA}, keywords = {energy efficiency, gpgpu, performance modeling and analysis, processing in memory}, }

Efficient Parallel Image Clustering and Search on a Heterogeneous Platform
Dong Ping Zhang, Lifan Xu, Lee Howes. 22nd High Performance Computing Symposium (HPC), Best Paper Award, 2014.
Abstract

We present a parallel image clustering and search framework for large scale datasets that does not require image annotation, segmentation or registration. This work addresses the image search problem while avoiding the need for user-specified or auto-generated metadata; instead we rely on image data alone to avoid the ambiguity inherent in user-provided information. First, we propose a parallel algorithm exploiting heterogeneous hardware resources to generate global descriptors for the set of input images; given a group of query images, we derive their global descriptors in parallel. Second, we propose building a customisable search tree over the image database by performing a hierarchical K-means (H-Kmeans) clustering of the corresponding descriptors. Last, we design a novel parallel vBFS algorithm to search through the H-Kmeans tree and locate the set of closest matches for query image descriptors.

To validate our design we analyse the search performance and energy efficiency under a range of hardware clock frequencies and in comparison with alternative approaches. The result of our analysis shows that the framework greatly increases the search efficiency and thereby reduces the energy consumption per query.
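A compact sketch of the H-Kmeans search tree (the branching factor and leaf size are assumptions; the paper's parallel vBFS search explores several branches at once rather than descending greedily):

import numpy as np
from sklearn.cluster import KMeans

def build(descriptors, k=4, leaf=16):
    if len(descriptors) <= leaf:
        return descriptors                    # leaf bucket of candidate matches
    km = KMeans(n_clusters=k, n_init=4).fit(descriptors)
    return [(c, build(descriptors[km.labels_ == i], k, leaf))
            for i, c in enumerate(km.cluster_centers_)]

def search(node, q):
    if isinstance(node, np.ndarray):
        return node                           # reached a leaf bucket
    centers = np.array([c for c, _ in node])
    best = np.argmin(np.linalg.norm(centers - q, axis=1))
    return search(node[best][1], q)           # descend into the closest branch

db = np.random.rand(1000, 128)                # stand-in global image descriptors
tree = build(db)
print(search(tree, np.random.rand(128)).shape)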

Bibtex

@inproceedings{Zhang:2014HPC, author = {Zhang, Dongping and Xu, Lifan and Howes, Lee}, title = {Efficient Parallel Image Clustering and Search on a Heterogeneous Platform}, booktitle = {22nd High Performance Computing Symposium (HPC)}, year = {2014}, }

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPUs
Lifan Xu, Wei Wang, Marco A. Alvarez, John Cavazos, Dong Ping Zhang. The Seventh Workshop on Programmability Issues for Heterogeneous Multicores (in conjunction with HiPEAC), Best Paper Award, 2014.
Abstract

In this paper, we present a study on the parallelization of the shortest path graph kernel from machine learning theory. We first propose a modification of the original algorithm, which we refer to as Fast Computation of Shortest Path kernel (FCSP). We then explore two different parallelization schemes on the CPU and four different implementations on the GPU. We investigate the advantages of each and implement a hybrid version which, for different pairs of graphs, dynamically chooses the best implementation from multicore execution and GPU execution. Finally, we apply each of these implementations to several data sets composed of graphs from different domains. We first create a set of synthetic data sets to evaluate the benefits and drawbacks of our different implementations. Then, we evaluate our implementations on a set of four real-world graph data sets. The results show the GPU version of FCSP offers a maximum 18x speedup over the sequential CPU version and a maximum 2x speedup over a parallel CPU implementation.
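A sequential reference version of the shortest path kernel looks roughly like this (a toy delta kernel on unlabeled path lengths; FCSP's contribution is restructuring this all-pairs comparison for multicore and GPU execution):

import numpy as np
from scipy.sparse.csgraph import shortest_path

def sp_kernel(adj1, adj2):
    d1, d2 = shortest_path(adj1), shortest_path(adj2)
    p1 = d1[np.isfinite(d1) & (d1 > 0)]       # all shortest-path lengths in graph 1
    p2 = d2[np.isfinite(d2) & (d2 > 0)]
    return sum(1.0 for a in p1 for b in p2 if a == b)   # delta kernel on lengths

g1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)   # path graph
g2 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)   # triangle
print(sp_kernel(g1, g2))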

Bibtex

@inproceedings{Zhang:2014MultiProg, author = {Lifan Xu and Wei Wang and Marco A. Alvarez and John Cavazos and Dong Ping Zhang}, title = {Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPUs}, booktitle = {The Seventh Workshop on Programmability Issues for Heterogeneous Multicores in conjunction with HiPEAC}, year = {2014}, }

High-level Programming Model Abstractions for Processing in Memory
M. Chu, N. Jayasena, D. P. Zhang, M. Ignatowski. WoNDP: 1st Workshop on Near-Data Processing, in conjunction with the 46th IEEE/ACM International Symposium on Microarchitecture (MICRO-46), 2013.
Abstract

While the idea of processing in memory (PIM) has been around for decades, both hardware and software limitations have kept it from growing in practical, real-world use. Recent advancements in 3D die-stacking technology have begun to make inroads towards solving some of the implementation issues, but software programmability questions remain. This position paper presents high-level programming models as a solution for PIM programmability by abstracting away many of the low-level architectural details. While we acknowledge that expert programmers still will want low-level, detailed control for optimization, we see high-level programming abstractions as a way to broaden the use of PIM to a larger audience of developers and increase the adoption of PIM architectures in future systems.
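As a purely hypothetical illustration of what such an abstraction could look like (this tiny API is invented for this page, not proposed in the paper):

import numpy as np

class PIMArray:
    """Hypothetical container that partitions an array across PIM stacks."""
    def __init__(self, data, num_devices=4):
        self.shards = np.array_split(data, num_devices)   # one shard per memory stack

    def map(self, fn):
        # A real runtime would launch fn on the logic die beneath each shard;
        # applying it serially here only illustrates the programming model.
        return [fn(shard) for shard in self.shards]

arr = PIMArray(np.arange(1_000_000))
partials = arr.map(np.sum)       # conceptually runs in memory, one result per device
print(sum(partials))             # final reduction on the host

The point of the abstraction is that placement and dispatch decisions belong to the runtime, not the programmer.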

Bibtex

@inproceedings{Zhang:2013Wondp1, author = {M. Chu and N. Jayasena and D. P. Zhang and M. Ignatowski}, title = {High-level Programming Model Abstractions for Processing in Memory}, booktitle = {WoNDP: 1st Workshop on Near-Data Processing in conjunction with the 46th IEEE/ACM International Symposium on Microarchitecture (MICRO-46)}, year = {2013}, }

A Processing in Memory Taxonomy and a Case for Studying Fixed-Function PIM
G. Loh, N. Jayasena, M. Oskin, M. Nutter, D. Roberts, M. Meswani, D. P. Zhang, M. Ignatowski. WoNDP: 1st Workshop on Near-Data Processing, in conjunction with the 46th IEEE/ACM International Symposium on Microarchitecture (MICRO-46), 2013.
Abstract

The emergence of die-stacking technology with mixed logic and memory processes has brought about a renaissance in “processing in memory” (PIM) concepts, first envisioned decades ago. For some, the PIM concept conjures an image of a complete processing unit (e.g., CPU, GPU) integrated directly with memory, perhaps on a logic chip 3D-stacked under one or more memory chips. However, PIM potentially covers a very wide spectrum of compute capabilities embedded in/with the memory. This position paper presents an initial taxonomy for in-memory computing, and advocates for the exploration of simpler computing mechanisms in the memory stack in addition to fully-programmable PIM architectures.

Bibtex

@inproceedings{Zhang:2013Wondp2, author = {G. Loh and N. Jayasena and M. Oskin and M. Nutter and D. Roberts and M. Meswani and D. P. Zhang and M. Ignatowski}, title = {A Processing in Memory Taxonomy and a Case for Studying Fixed-Function PIM}, booktitle = {WoNDP: 1st Workshop on Near-Data Processing in conjunction with the 46th IEEE/ACM International Symposium on Microarchitecture (MICRO-46)}, year = {2013}, }

A new perspective on processing-in-memory architecture design
D. P. Zhang, N. Jayasena, A. Lyashevsky, J. Greathouse, M. Meswani, M. Nutter, M. Ignatowski. ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, in conjunction with the Conference on Programming Language Design and Implementation, 2013.
Abstract

As computation becomes increasingly limited by data movement and energy consumption, exploiting locality throughout the memory hierarchy becomes critical for maintaining the performance scaling that many have come to expect from the computing industry. Moving computation closer to main memory presents an opportunity to reduce the overheads associated with data movement. We explore the potential of using 3D die stacking to move memory-intensive computations closer to memory. This approach to processing-in-memory addresses some drawbacks of prior research on in-memory computing and appears commercially viable in the foreseeable future. We show promising early results from this approach and identify areas that are in need of research to unlock its full potential.

Bibtex

@inproceedings{zhang_2012_new_perspective_pim, author = {Dongping Zhang and Nuwan Jayasena and Joseph Greathouse and Mitesh Meswani and Mark Nutter and Alexander Lyashevsky and Mike Ignatowski}, title = {A new perspective on processing-in-memory architecture design}, booktitle = {ACM SIGPLAN Workshop on Memory Systems Performance and Correctness}, year = {2013}, location = {Seattle, Washington, USA}, }

Vasculature segmentation using parallel multi-hypothesis template tracking on heterogeneous platforms
D. P. Zhang, L. Howes. SPIE Electronic Imaging: Parallel Processing in Image Processing Systems, 2013.
Abstract

We present a parallel multi-hypothesis template tracking algorithm on heterogeneous platforms using a layered dispatch programming model. The contributions of this work are: an architecture-specific optimised solution for vasculature structure enhancement, an approach to segment the vascular lumen network from volumetric CTA images, and a layered dispatch programming model to free developers from hand-crafting mappings to particularly constrained execution domains on high-throughput architectures. This abstraction is demonstrated through a vasculature segmentation application and can also be applied in other real-world applications.

Current GPGPU programming models define a grouping concept which may lead to poorly scoped local/shared memory regions and an inconvenient approach to projecting complicated iteration spaces. To improve on this situation, we propose a simpler and more flexible programming model that leads to easier computation projections and hence a more convenient mapping of the same algorithm to a wide range of architectures.

We first present an optimised image enhancement solution step by step, then solve a separable nonlinear least squares problem using a parallel Levenberg-Marquardt algorithm for template matching, and perform an energy efficiency analysis and performance comparison on a variety of platforms, including multi-core CPUs, discrete GPUs and APUs. We propose and discuss the efficiency of a layered-dispatch programming abstraction for mapping algorithms onto heterogeneous architectures.
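The template-matching step can be illustrated as a small nonlinear least squares fit (a 1-D toy signal and an assumed Gaussian vessel profile; the paper uses a parallel Levenberg-Marquardt solver on real volumetric data):

import numpy as np
from scipy.optimize import least_squares

x = np.linspace(-5, 5, 101)
observed = 2.0 * np.exp(-0.5 * ((x - 0.7) / 1.2) ** 2) + 0.05 * np.random.randn(101)

def residuals(p):
    amp, mu, sigma = p                        # template parameters to recover
    return amp * np.exp(-0.5 * ((x - mu) / sigma) ** 2) - observed

fit = least_squares(residuals, x0=[1.0, 0.0, 1.0], method='lm')
print(fit.x)                                  # approx. amplitude 2.0, centre 0.7, width 1.2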

Bibtex

@inproceedings{zhang_2013_vascular_segmentation, author = {Dongping Zhang and Lee Howes}, title = {Vasculature segmentation using parallel multi-hypothesis template tracking on heterogeneous platforms}, booktitle = {SPIE Electronic Imaging: Parallel Processing in Image Processing Systems}, year = {2013}, location = {San Francisco, California, USA}, }

Multi-Method Analysis of MRI Images in Early Diagnostics of Alzheimer’s Disease
Robin Wolz, Valtteri Julkunnen, Juha Koikkalainen, Eini Niskanen, Dong Ping Zhang, Daniel Rueckert, Hilkka Soininen, Jyrki Lötjönen, and the Alzheimer's Disease Neuroimaging Initiative. PLoS ONE, 2011.
Abstract

The role of structural brain magnetic resonance imaging (MRI) is becoming more and more emphasized in the early diagnostics of Alzheimer's disease (AD). This study aimed to assess the improvement in classification accuracy that can be achieved by combining features from different structural MRI analysis techniques. Automatically estimated MR features used are hippocampal volume, tensor-based morphometry, cortical thickness and a novel technique based on manifold learning. Baseline MRIs acquired from all 834 subjects (231 healthy controls (HC), 238 stable mild cognitive impairment (S-MCI), 167 MCI to AD progressors (P-MCI), 198 AD) from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database were used for evaluation. We compared the classification accuracy achieved with linear discriminant analysis (LDA) and support vector machines (SVM). The best results achieved with individual features are 90% sensitivity and 84% specificity (HC/AD classification), 64%/66% (S-MCI/P-MCI) and 82%/76% (HC/P-MCI) with the LDA classifier. The combination of all features improved these results to 93% sensitivity and 85% specificity (HC/AD), 67%/69% (S-MCI/P-MCI) and 86%/82% (HC/P-MCI). Compared with previously published results in the ADNI database using individual MR-based features, the presented results show that a comprehensive analysis of MRI images combining multiple features improves classification accuracy and predictive power in detecting early AD. The most stable and reliable classification was achieved when combining all available features.
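Schematically, the feature-combination experiment looks like this (synthetic arrays in place of the ADNI features; the real study uses the four MR feature families named above):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

n = 300
hippo_vol = np.random.rand(n, 1)      # stand-ins for the four feature families
tbm       = np.random.rand(n, 10)
cortical  = np.random.rand(n, 10)
manifold  = np.random.rand(n, 5)
X = np.hstack([hippo_vol, tbm, cortical, manifold])   # combined feature vector
y = np.random.randint(0, 2, n)        # e.g. HC vs AD labels

for clf in (LinearDiscriminantAnalysis(), SVC(kernel='linear')):
    print(type(clf).__name__, cross_val_score(clf, X, y, cv=5).mean())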

Bibtex

@article{wolz_2011_alzheimers, author = {Robin Wolz and Valtteri Julkunnen and Juha Koikkalainen and Eini Niskanen and Dong Ping Zhang and Daniel Rueckert and Hilkka Soininen and Jyrki Lötjönen and The Alzheimer's Disease Neuroimaging Initiative}, title = {Multi-Method Analysis of MRI Images in Early Diagnostics of Alzheimer's Disease}, journal = {PLoS ONE}, year = {2011}, }

Motion tracking of left ventricle and coronaries in 4D CTA
D. P. Zhang, X. H. Zhuang, P. Edwards, S. Ourselin and D. Rueckert. SPIE Medical Imaging, 2011.
Abstract

In this paper, we present a novel approach for simultaneous motion tracking of the left ventricle and coronary arteries from cardiac Computed Tomography Angiography (CTA) images. We first use the multi-scale vesselness filter proposed by Frangi et al. to enhance vessels in the cardiac CTA images. The vessel centrelines are then extracted as the minimal cost path from the enhanced images. The centrelines at end-diastole (ED) are used as prior input for the motion tracking. All other centrelines are used to evaluate the accuracy of the motion tracking. To segment the left ventricle automatically, we perform three levels of registration using a cardiac atlas obtained from MR images. The cardiac motion is derived from cardiac CTA sequences by using a non-rigid registration algorithm driven by local-phase information. The CTA image at each time frame is registered to the ED frame by maximising the proposed similarity function and following a serial registration scheme. Once the images have been aligned, a dynamic motion model of the left ventricle can be obtained by applying the computed free-form deformations to the segmented left ventricle at the ED phase. A similar propagation method also applies to the coronary arteries. To validate the accuracy of the motion model we compare the actual position of the coronaries and left ventricle in each time frame with the predicted ones as estimated from the proposed tracking method.
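The enhancement-then-centreline idea can be sketched with off-the-shelf routines (a 2-D toy image; the paper operates on 3-D CTA volumes with its own implementation):

import numpy as np
from skimage.filters import frangi
from skimage.graph import route_through_array

img = np.zeros((64, 64))
img[30:34, :] = 1.0                           # a synthetic bright horizontal 'vessel'
vesselness = frangi(img, black_ridges=False)  # enhance bright tubular structures
cost = 1.0 / (vesselness + 1e-6)              # strong response becomes cheap to traverse
path, _ = route_through_array(cost, (32, 0), (32, 63))
print(len(path))                              # points on the extracted centreline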

Bibtex

@inproceedings{zhang_2011_motiontracking, author = {D. P. Zhang and X. H. Zhuang and P. Edwards and S. Ourselin and D. Rueckert}, title = {Motion tracking of left ventricle and coronaries in 4D CTA}, booktitle = {SPIE Medical Imaging}, year = {2011}, }

Coronary Motion Estimation from CTA Using Probability Atlas and Diffeomorphic Registration
D. P. Zhang, L. Risser, F.-X. Vialard, P. Edwards, C. Metz, L. Neefjes, N. Mollet, W. Niessen and D. Rueckert. 5th International Workshop on Medical Imaging and Augmented Reality (paper + presentation), 2010.
Abstract

In this paper, we present a method for coronary artery motion estimation from 4D cardiac CT angiogram (CTA) data sets. The proposed method potentially allows the construction of a patient-specific 4D coronary motion model from pre-operative CTA which can be used for guiding totally endoscopic coronary artery bypass surgery (TECAB). The proposed approach consists of three steps. Firstly, prior to motion tracking, we form a coronary probability atlas from manual segmentations of the CTA scans of a number of subjects. Secondly, the vesselness response image is calculated and enhanced for the end-diastolic and end-systolic CTA images in each 4D sequence. Thirdly, we design a special-purpose registration framework for tracking the highly localized coronary motion. It combines the coronary probability atlas, the intensity information from the CTA image and the corresponding vesselness response image to fully automate the coronary motion tracking procedure and improve its accuracy. We perform pairwise 3D registration of cardiac time frames using a multi-channel implementation of the Large Deformation Diffeomorphic Metric Mapping (LDDMM) framework, where each channel contains a given level of description of the registered shapes. For validation, we compare the automatically tracked coronaries with those segmented manually at the end-diastolic phase for each subject.

Bibtex

@inproceedings{zhang_2010_coronary_motion_estimation, author = {D. P. Zhang and L. Risser and F.-X. Vialard and P. Edwards and C. Metz and L. Neefjes and N. Mollet and W. Niessen and D. Rueckert}, title = {Coronary Motion Estimation from CTA Using Probability Atlas and Diffeomorphic Registration}, booktitle = {5th International Workshop on Medical Imaging and Augmented Reality}, year = {2010}, }

Nonrigid Registration and Template Matching for Coronary Motion Modeling from 4D CTA
D. P. Zhang, L. Risser, O. Friman, C. Metz, L. Neefjes, N. Mollet, W. Niessen and D. Rueckert. 4th International Workshop on Biomedical Image Registration (paper + presentation), 2010.
Abstract

In this paper, we present a method for coronary artery motion tracking in 4D cardiac CT angiogram data sets. The proposed method allows the construction of a patient-specific 4D coronary motion model from pre-operative CTA which can be used for guiding totally endoscopic coronary artery bypass surgery (TECAB). The proposed approach consists of three steps. Firstly, the coronary arteries are extracted in the end-diastolic time frame using a minimal cost path approach. To achieve this, the start and end points of the coronaries are identified interactively and the minimal cost path between the start and end points is computed using the A* graph search algorithm. Secondly, the cardiac motion is estimated throughout the cardiac cycle by using a non-rigid image registration technique based on a free-form B-spline transformation model and maximization of normalized mutual information. Finally, the coronary arteries are tracked automatically through all other phases of the cardiac cycle, by deforming the extracted coronaries at end-diastole to all other time frames according to the motion field computed in the second step. The estimated coronary centerlines are then refined by a template matching algorithm to improve the accuracy. We compare the proposed approach with two alternative approaches: the first is based on the minimal cost path extraction of the coronaries with start and end points manually identified in each time frame, while the second is based on propagating the extracted coronaries from the end-diastolic time frame to other time frames using image-based non-rigid registration only. Our results show that the proposed approach performs more robustly than the non-rigid registration based method and that the resulting motion model is comparable to the motion model constructed from semi-automatic extractions of the coronaries in all time frames.
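The normalised mutual information measure maximised in the second step can be written as NMI(A, B) = (H(A) + H(B)) / H(A, B); a small sketch from joint histograms (the bin count is an assumption):

import numpy as np

def nmi(a, b, bins=32):
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))   # Shannon entropy
    return (h(px) + h(py)) / h(pxy)

fixed = np.random.rand(64, 64)
print(nmi(fixed, fixed))                      # perfect alignment scores the maximum, 2.0
print(nmi(fixed, np.random.rand(64, 64)))     # unrelated images score close to 1.0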

Bibtex

@incollection{zhang_2010_bir, title={Nonrigid Registration and Template Matching for Coronary Motion Modeling from 4D CTA}, author={Zhang, DongPing and Risser, Laurent and Friman, Ola and Metz, Coert and Neefjes, Lisan and Mollet, Nico and Niessen, Wiro and Rueckert, Daniel}, booktitle={Biomedical Image Registration}, series={Lecture Notes in Computer Science}, volume={6204}, editor={Fischer, Bernd and Dawant, Benoît M. and Lorenz, Cristian}, publisher={Springer Berlin Heidelberg}, year={2010}, pages={210--221}, isbn={978-3-642-14365-6}, doi={10.1007/978-3-642-14366-3_19}, url={http://dx.doi.org/10.1007/978-3-642-14366-3_19}, keywords={Nonrigid Deformation; Computer Integrated Surgery; Intra-modality Registration; Motion Detection and Tracking} }

Coronary Artery Motion Modeling from 3D Cardiac CT Sequences Using Template Matching and Graph Search
D. P. Zhang, L. Risser, C. Metz, L. Neefjes, N. Mollet, W. Niessen and D. Rueckert. IEEE International Symposium on Biomedical Imaging: From Nano to Macro (paper + presentation), 2010.
Abstract

We present a novel method for coronary artery motion tracking in 4D cardiac CT data sets. The algorithm allows the automatic construction of a 4D coronary motion model from pre-operative CT which can be used for guiding totally-endoscopic coronary artery bypass surgery (TECAB). The proposed approach is based on two steps: In the first step, the coronary arteries are extracted in the end-diastolic time frame using a minimal cost path approach. To achieve this, the start and end points of the coronaries are identified interactively and the minimal cost path between the start and end points is computed using the A* graph algorithm. In the second stage the coronaries are tracked automatically through all other phases of the cardiac cycle. This is achieved by automatically identifying the start and end points in subsequent time points through a non-rigid template-tracking algorithm. Once the start and end points have been located, the minimal cost path is constructed in every time frame.

We compare the proposed approach to two alternative approaches: the first is based on a semi-automatic extraction of the coronaries with start and end points manually supplied in each time frame, and the second is based on propagating the extracted coronaries from the end-diastolic time frame to other time frames using non-rigid registration. Our results show that the proposed approach performs significantly better than the non-rigid registration based method and that the resulting motion model is comparable to the motion model constructed from semi-automatic extractions of the coronaries.

Bibtex

@INPROCEEDINGS{5490171, author={Dong Ping Zhang and Risser, L. and Metz, C. and Neefjes, L. and Mollet, N. and Niessen, W. and Rueckert, D.}, booktitle={Biomedical Imaging: From Nano to Macro, 2010 IEEE International Symposium on}, title={Coronary artery motion modeling from 3D cardiac CT sequences using template matching and graph search}, year={2010}, pages={1053-1056}, keywords={blood vessels;cardiology;computerised tomography;image sequences;medical image processing;surgery;3D cardiac CT sequence;4D cardiac CT data set;cardiac cycle;coronary artery motion modeling;graph search;nonrigid template tracking algorithm;template matching;totally endoscopic coronary artery bypass surgery;Arteries;Cardiology;Computed tomography;Costs;Educational institutions;Heart;Radiology;Robots;Surgery;Tracking;Cardiovascular Image Analysis;Image Guided Surgery;Image Registration;Motion Detection and Tracking}, doi={10.1109/ISBI.2010.5490171}, ISSN={1945-7928}, }

Coronary Artery Tracking from Dynamic Cardiac CT Sequences
D. P. Zhang, O. Pedro, K. Mori, P. J. Edwards and D. Rueckert. 13th Annual Conference on Medical Image Understanding and Analysis (paper + presentation), 2009.

Bibtex

@inproceedings{zhangMIUA, author = {D. P. Zhang and O. Pedro and K. Mori and P. J. Edwards and D. Rueckert}, title = {Coronary Artery Tracking from Dynamic Cardiac CT Sequences}, booktitle = {13th Annual Conference on Medical Image Understanding and Analysis}, year = {2009}, location = {UK}, }

4D motion modeling of the coronary arteries from CT images for robotic assisted minimally invasive surgery
D. P. Zhang, E. Edwards, L. Mei and D. Rueckert. SPIE Medical Imaging (paper + presentation), 2009.

Bibtex

@inproceedings{2009SPIE.7259E..31Z, author = {{Zhang}, D.~P. and {Edwards}, E. and {Mei}, L. and {Rueckert}, D.}, title = {4D motion modeling of the coronary arteries from CT images for robotic assisted minimally invasive surgery}, booktitle = {Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series}, year = {2009}, series = {Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series}, volume = {7259}, month = feb, eid = {72590X}, doi = {10.1117/12.811518}, adsurl = {http://adsabs.harvard.edu/abs/2009SPIE.7259E..31Z}, adsnote = {Provided by the SAO/NASA Astrophysics Data System} }

Image Guidance for Robotic Minimally Invasive Coronary Artery Bypass
M. Figl, D. Rueckert, D. J. Hawkes, R. Casula, M. Hu, O. Pedro, D. P. Zhang, G. P. Penney, F. Bello and P. J. Edwards. Computerized Medical Imaging and Graphics, 2010.
Abstract

A novel system for image guidance in totally endoscopic coronary artery bypass (TECAB) is presented. A key requirement is the availability of 2D-3D registration techniques that can deal with non-rigid motion and deformation. Image guidance for TECAB is mainly required before the mechanical stabilisation of the heart, when the most dominant source of misregistration is the deformation and non-rigid motion of the heart.

To augment the images in the endoscope of the da Vinci robot, we have to find the transformation from the coordinate system of the preoperative imaging modality to the system of the endoscopic cameras.

In a first step we build a 4D motion model of the beating heart. Intraoperatively we can use the ECG or video processing to determine the phase of the cardiac cycle, as well as the heart and respiratory frequencies. We then take the heart surface from the motion model and register it to the stereo endoscopic images of the da Vinci robot, or of a validation system, using photo-consistency. To take advantage of the fact that a whole image sequence is available for registration, we use the different phases together to drive the registration. We found the similarity function to be much smoother when using more phases. This also showed promising behaviour in convergence tests.

Images of the vessels available in the preoperative coordinate system can then be transformed to the camera system and projected into the calibrated endoscope view using two video mixers with chroma keying. It is hoped that the augmented view can improve the efficiency of TECAB surgery and reduce the conversion rate to more conventional procedures.
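The photo-consistency idea behind the registration step can be illustrated schematically (toy arrays; camera projection and the surface model are abstracted away):

import numpy as np

def photo_consistency(colours_left, colours_right):
    # Mean squared colour difference over projected surface points; lower is better.
    return np.mean((colours_left - colours_right) ** 2)

phases = 8
scores = []
for t in range(phases):
    left = np.random.rand(500, 3)                   # colours of model points, left view
    right = left + 0.05 * np.random.randn(500, 3)   # right view with noise
    scores.append(photo_consistency(left, right))
print(np.mean(scores))                              # phase-averaged cost drives the search

Averaging the score over cardiac phases is what smooths the similarity function, as reported above.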

Bibtex

@article{figl_2010_cmig, author = {Figl, Michael and Rueckert, Daniel and Hawkes, David and Casula, Roberto and Hu, Mingxing and Pedro, Ose and Zhang, DongPing and Penney, Graeme and Bello, Fernando and Edwards, Philip}, title = {Image Guidance for Robotic Minimally Invasive Coronary Artery Bypass}, journal = {Computerized Medical Imaging and Graphics}, year = {2010}, }

Coronary Motion Modeling for Augmented Reality Guidance of Endoscopic Coronary Artery Bypass
M. Figl, D. Rueckert, D. J. Hawkes, R. Casula, M. Hu, O. Pedro, D. P. Zhang, G. P. Penney, F. Bello and P. J. Edwards. International Symposium on Biomedical Simulation, 2008.
Abstract

The overall aim of our project is to guide totally endoscopic coronary artery bypass. This requires construction of a 4D preoperative model of the coronary arteries and myocardium. The model must be aligned with the endoscopic view of the patient’s beating heart and presented to the surgeon using augmented reality. We propose that the model can be constructed from coronary CT. Segmentation can be performed for one phase of the cardiac cycle only and propagated to the others using non-rigid registration. We have compared the location of the coronaries produced by this method to hand segmentation.

Registration of the model to the endoscopic view of the patient is achieved in two phases. Temporal registration is performed by identification of corresponding motion between model and video. Then we calculate photo-consistency between the two da Vinci endoscope views and average over the frames of the motion model. This has been shown to improve the shape of the cost function. Phantom results are presented.

The model can then be transformed to the calibrated endoscope view and overlaid using two video mixers.

Bibtex

@incollection{figl_2008_coronary_motion, title={Coronary Motion Modelling for Augmented Reality Guidance of Endoscopic Coronary Artery Bypass}, author={Figl, Michael and Rueckert, Daniel and Hawkes, David and Casula, Roberto and Hu, Mingxing and Pedro, Ose and Zhang, DongPing and Penney, Graeme and Bello, Fernando and Edwards, Philip}, booktitle={Biomedical Simulation}, series={Lecture Notes in Computer Science}, volume={5104}, editor={Bello, Fernando and Edwards, P. J.}, publisher={Springer Berlin Heidelberg}, year={2008}, isbn={978-3-540-70520-8}, pages={197--202} }

Image Guidance for Robotic Minimally Invasive Coronary Artery Bypass
M. Figl, D. Rueckert, D. J. Hawkes, R. Casula, M. Hu, O. Pedro, D. P. Zhang, G. P. Penney, F. Bello and P. J. Edwards. International Workshop on Medical Imaging and Augmented Reality, 2008.
Abstract

A novel system for image guidance in totally endoscopic coronary artery bypass (TECAB) is presented. A key requirement is the availability of 2D-3D registration techniques that can deal with non-rigid motion and deformation. Image guidance for TECAB is mainly required before the mechanical stabilization of the heart, thus the most dominant source of non-rigid deformation is the motion of the beating heart.

To augment the images in the endoscope of the da Vinci robot, we have to find the transformation from the coordinate system of the preoperative imaging modality to the system of the endoscopic cameras.

In a first step we build a 4D motion model of the beating heart. Intraoperatively we can use the ECG or video processing to determine the phase of the cardiac cycle. We can then take the heart surface from the motion model and register it to the stereo-endoscopic images of the da Vinci robot using 2D-3D registration methods. We are investigating robust feature tracking and intensity-based methods for this purpose.

Images of the vessels available in the preoperative coordinate system can then be transformed to the camera system and projected into the calibrated endoscope view using two video mixers with chroma keying. It is hoped that the augmented view can improve the efficiency of TECAB surgery and reduce the conversion rate to more conventional procedures.

Bibtex

@incollection{figl_2008_image_guidance, title={Image Guidance for Robotic Minimally Invasive Coronary Artery Bypass}, author={Figl, Michael and Rueckert, Daniel and Hawkes, David and Casula, Roberto and Hu, Mingxing and Pedro, Ose and Zhang, DongPing and Penney, Graeme and Bello, Fernando and Edwards, Philip}, booktitle={Medical Imaging and Augmented Reality}, series={Lecture Notes in Computer Science}, volume={5128}, editor={Dohi, Takeyoshi and Sakuma, Ichiro and Liao, Hongen}, publisher={Springer Berlin Heidelberg}, year={2008}, isbn={978-3-540-79981-8}, url={http://dx.doi.org/10.1007/978-3-540-79982-5_23}, pages={202--209} }

Augmented Reality Image Guidance for Minimally Invasive Coronary Artery Bypass
M. Figl, D. Rueckert, D. J. Hawkes, R. Casula, M. Hu, O. Pedro, D. P. Zhang, G. P. Penney, F. Bello and P. J. Edwards. SPIE Medical Imaging: Visualization, Image-Guided Procedures and Modeling, 2008.
Abstract

We propose a novel system for image guidance in totally endoscopic coronary artery bypass (TECAB). A key requirement is the availability of 2D-3D registration techniques that can deal with non-rigid motion and deformation. Image guidance for TECAB is mainly required before the mechanical stabilization of the heart, thus the most dominant source of non-rigid deformation is the motion of the beating heart. To augment the images in the endoscope of the da Vinci robot, we have to find the transformation from the coordinate system of the preoperative imaging modality to the system of the endoscopic cameras. In a first step we build a 4D motion model of the beating heart. Intra-operatively we can use the ECG or video processing to determine the phase of the cardiac cycle. We can then take the heart surface from the motion model and register it to the stereo-endoscopic images of the da Vinci robot using 2D-3D registration methods. We are investigating robust feature tracking and intensity-based methods for this purpose. Images of the vessels available in the preoperative coordinate system can then be transformed to the camera system and projected into the calibrated endoscope view using two video mixers with chroma keying. It is hoped that the augmented view can improve the efficiency of TECAB surgery and reduce the conversion rate to more conventional procedures.

Registration of a 4D Cardiac Motion Model to Endoscopic Video for Augmented Reality Image Guidance of Robotic Coronary Artery Bypass
M. Figl, D. Rueckert, D. J. Hawkes, R. Casula, M. Hu, O. Pedro, D. P. Zhang, G. P. Penney, F. Bello and P. J. Edwards. International Workshop on Augmented Environments for Medical Imaging and Computer-Aided Surgery, 2008.
Abstract

The aim of the work described in this paper is the registration of a 4D preoperative motion model of the heart to the video view of the patient through the intraoperative endoscope, in order to overlay the real video sequence with it. As the heart motion is cyclical, it can be modeled using multiple reconstructions of cardiac gated coronary CT. We propose the use of photoconsistency between the two views through the da Vinci endoscope to align the views to the preoperative heart surface model from CT, and we propose averaging the photoconsistency over the cardiac cycle to improve the registration compared to a single view. Results are presented for simulated renderings and for real video of a beating heart phantom. We found much smoother behaviour of the test function at the minimum when using multiple phases for the registration; furthermore, convergence was better when more phases were used.

Cardiac CT Image Analysis with Subdivision Method and Nonrigid Image Registration
D. P. Zhang, P. J. Edwards, D. Rueckert. Medical Image and Signal Analysis, 2007.

Invited talks

Exploring the design space of processing-in-memory architecture for exascale computing
D. P. Zhang. Keynote, The 7th Workshop on Programmability Issues for Heterogeneous Multicores, 2014.
Abstract

This talk highlights the growing importance of co-designing software solutions and hardware architectures in the general-purpose computing industry. It also explores why computational capacity is out of date as a single metric, and why the power consumed by communication and data movement is just as important. As a concrete example, the talk explores the processing-in-memory (PIM) design space through the software design of a PIM API, a simulator and performance models.

Processing-in-memory research
D. P. Zhang. Grace Hopper Celebration of Women in Computing, 2013.
Supporting Computer Vision through High Performance GPU Programming
D. P. Zhang. Plenary talk, IEEE Winter Vision Meetings, 2013.
Abstract

In this talk, I will discuss the support that AMD hardware and software infrastructure can provide for developing applications in computer vision and related domains. This includes offering and supporting the OpenCL C++ bindings and the OpenCL C++ kernel language extension, the Bolt C++ template library for harnessing heterogeneous compute power, the OpenCL module developed for the industry-standard OpenCV library, and two university collaboration projects: content-based image retrieval and future computing architectures for simultaneous localization and mapping (SLAM). The presentation will also highlight the evolution of AMD discrete GPU and APU architecture designs and how AMD is working to increase programmability and ease domain scientists' access to this new level of compute resources.

Biomedical data analysis on heterogeneous platforms
D. P. Zhang. AMD Fusion Developer Summit, 2012.

PhD Thesis

Coronary artery segmentation and motion modelling
Dong Ping Zhang. Department of Computing, Imperial College London, UK, 2010.
Abstract

Conventional coronary artery bypass surgery requires an invasive sternotomy and the use of a cardiopulmonary bypass, which leads to a long recovery period and has high infectious potential. Totally endoscopic coronary artery bypass (TECAB) surgery based on image-guided robotic surgical approaches has been developed to allow clinicians to conduct the bypass surgery off-pump, with only three pin-hole incisions in the chest cavity through which two robotic arms and one stereo endoscopic camera are inserted. However, the restricted field of view of the stereo endoscopic images leads to possible vessel misidentification and coronary artery mis-localization. This results in 20-30% conversion rates from TECAB surgery to the conventional approach.

We have constructed patient-specific 3D + time coronary artery and left ventricle motion models from preoperative 4D Computed Tomography Angiography (CTA) scans. By temporally and spatially aligning this model with the intraoperative endoscopic views of the patient's beating heart, this work assists the surgeon in identifying and locating the correct coronaries during TECAB procedures, and thus has the prospect of reducing the conversion rate from TECAB to conventional coronary bypass procedures.

This thesis mainly focuses on designing segmentation and motion tracking methods for the coronary arteries in order to build pre-operative patient-specific motion models. Various vessel centreline extraction and lumen segmentation algorithms are presented, including intensity-based approaches, a geometric model matching method and a morphology-based method. A probabilistic atlas of the coronary arteries is formed from a group of subjects to facilitate the vascular segmentation and registration procedures. A non-rigid registration framework based on a free-form deformation model and multi-level multi-channel large deformation diffeomorphic metric mapping is proposed to track the coronary motion. The methods are applied to 4D CTA images acquired from various groups of patients and quantitatively evaluated.

Bibtex

@phdthesis{Zhang2010, author = {Zhang, Dong Ping}, title = {Coronary artery segmentation and motion modelling}, year = {2010}, school = {Department of Computing, Imperial College London, United Kingdom}, }

MSc Thesis

Constrained Optimization Techniques for Image Registration
Dong Ping Zhang. Department of Computing, Imperial College London, UK, 2006. Awarded with Distinction.
Abstract

Medical image registration problems seek the best transformations that minimize the cost function composed of similarity measurements. Normalized mutual information is used to measure the alignment of different modality images in this project. Optimization techniques are used to accelerate the automatic registration process. This thesis compares various optimization methods with respect to accuracy and robustness.

Five gradient-based and three gradient-free optimization methods are compared: nonlinear conjugate gradient DR, conjugate gradient HZDZ, conjugate gradient HSDY, Powell, quasi-Newton, downhill descent DR, Kiefer-Wolfowitz, and simultaneous perturbation stochastic approximation.

The performance of the methods is tested in two types of experiments: first, CT brain images are registered to MR brain images; second, PET and MR brain images are registered. The six MR-CT and four MR-PET data sets used in this project have manually registered deformations available as a gold standard. Special attention is paid to the accuracy and convergence speed of the optimization methods. The experiments show that the Powell optimizer is the best choice for both registration problems. All three conjugate gradient approaches achieve similar performance.
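The experimental setup can be illustrated with a 1-D toy registration driven by the Powell optimiser (negative correlation stands in for the normalized mutual information measure used in the thesis):

import numpy as np
from scipy.optimize import minimize
from scipy.ndimage import shift

fixed = np.random.rand(128)
moving = shift(fixed, 7.5, mode='nearest')    # ground-truth offset of 7.5 samples

def cost(t):
    # Dissimilarity between the fixed image and the shifted-back moving image.
    return -np.corrcoef(fixed, shift(moving, -t[0], mode='nearest'))[0, 1]

res = minimize(cost, x0=[0.0], method='Powell')
print(res.x)                                  # should approach 7.5

Being gradient-free, Powell's method needs only cost evaluations, which suits similarity measures whose gradients are awkward to compute.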

Bibtex

@mastersthesis{Zhang2006, author = {Zhang, Dong Ping}, title = {Constrained Optimization Techniques for Image Registration}, year = {2006}, school = {Department of Computing, Imperial College London, United Kingdom}, }