Yanqi Zhou - Academia.edu

Papers by Yanqi Zhou

Most compilers for machine learning (ML) frameworks need to solve many correlated optimization problems to generate efficient machine code. Current ML compilers rely on heuristics-based algorithms to solve these optimization problems one at a time. However, this approach is not only hard to maintain but often leads to sub-optimal solutions, especially for newer model architectures. Existing learning-based approaches in the literature are sample inefficient, tackle a single optimization problem, and do not generalize to unseen graphs, making them infeasible to deploy in practice. To address these limitations, we propose an end-to-end, transferable deep reinforcement learning method for computational graph optimization (GO), based on a scalable sequential attention mechanism over an inductive graph neural network. GO generates decisions on the entire graph rather than on each individual node autoregressively, drastically speeding up the search compared to prior methods. Moreover, w...

2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)

Information leaks based on timing side channels in computing devices have serious consequences for user security and privacy. In particular, malicious applications in multiuser systems such as data centers and cloud-computing environments can exploit memory timing as a side channel to infer a victim's program access patterns/phases. Memory timing channels can also be exploited for covert communications by an adversary. We propose Camouflage, a hardware solution to mitigate timing channel attacks not only in the memory system, but also along the path to and from the memory system (e.g. NoC, memory scheduler queues). Camouflage introduces the novel idea of shaping the interarrival times of memory requests and responses into a predetermined distribution for security purposes, even creating additional fake traffic if needed. This limits untrusted parties (either cloud providers or coscheduled clients) from inferring information from another security domain by probing the bus to and from memory, or by analyzing the memory response rate. We design three different memory traffic shaping mechanisms for different security scenarios by having Camouflage work on requests, responses, and bi-directional (both) traffic. Camouflage is complementary to ORAMs and can optionally be used in conjunction with ORAMs to protect against information leaks via both memory access timing and memory access patterns. Camouflage offers a tunable trade-off between system security and system performance. We evaluate Camouflage's security and performance both theoretically and via simulations, and find that Camouflage outperforms state-of-the-art solutions in performance by up to 50%.
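
The shaping idea in this abstract can be illustrated with a small software sketch. It is not the Camouflage hardware design: the function name `shape_traffic`, its parameters, and the use of a fixed repeating list of gaps in place of a sampled distribution are all assumptions made for illustration. Requests leave only at predetermined release slots, and a slot with no waiting request is filled with a fake one, so the observable release times do not depend on the real arrival pattern.

```python
from collections import deque


def shape_traffic(arrival_times, intervals, horizon):
    """Release exactly one request per slot (hypothetical helper).

    arrival_times: times at which real requests arrive.
    intervals: fixed, repeating gaps between release slots; a stand-in
        for the predetermined interarrival distribution (assumption).
    horizon: last time at which a slot may fire.
    Returns (released, leftover): released is a list of (time, kind)
    pairs, leftover holds real requests still queued at the horizon.
    """
    pending = deque(sorted(arrival_times))
    released = []
    t = 0
    i = 0
    while True:
        # Advance to the next slot; the gap never depends on arrivals.
        t += intervals[i % len(intervals)]
        i += 1
        if t > horizon:
            break
        if pending and pending[0] <= t:
            # A real request is waiting: it occupies this slot.
            pending.popleft()
            released.append((t, "real"))
        else:
            # Nothing is waiting: emit fake traffic to keep the shape.
            released.append((t, "fake"))
    return released, list(pending)


busy, _ = shape_traffic([1, 2, 10], intervals=[3, 5], horizon=20)
idle, _ = shape_traffic([], intervals=[3, 5], horizon=20)
# An observer who sees only release times cannot tell the two apart.
print([t for t, _ in busy] == [t for t, _ in idle])
```

The cost of this scheme shows up as queueing delay for real requests and bandwidth spent on fake ones, which corresponds to the security/performance trade-off the abstract mentions.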

2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016

ACM SIGOPS Operating Systems Review, 2016

Industry is building larger, more complex, manycore processors on the back of strong institutional knowledge, but academic projects face difficulties in replicating that scale. To alleviate these difficulties and to develop and share knowledge, the community needs open architecture frameworks for simulation, synthesis, and software exploration which support extensibility, scalability, and configurability, alongside an established base of verification tools and supported software. In this paper we present OpenPiton, an open source framework for building scalable architecture research prototypes from 1 core to 500 million cores. OpenPiton is the world's first open source, general-purpose, multithreaded manycore processor and framework. OpenPiton leverages the industry-hardened OpenSPARC T1 core with modifications and builds upon it with a scratch-built, scalable uncore, creating a flexible, modern manycore design. In addition, OpenPiton provides synthesis and backend scripts for ASIC and FPGA to enable other researchers to bring their designs to implementation. OpenPiton provides a complete verification infrastructure of over 8000 tests, is supported by mature software tools, runs full-stack multiuser Debian Linux, and is written in industry-standard Verilog. Multiple implementations of OpenPiton have been created, including a taped-out 25-core implementation in IBM's 32nm process and multiple Xilinx FPGA prototypes.
