My primary research area is Computer Architecture, with a major interest in full-stack optimizations—spanning software, systems, and hardware—aimed at enhancing both the efficiency and privacy of AI systems.
[Aug. 2024] [Award] I was selected as the student highlight for the SRC ACE Center Newsletter!
[Aug. 2024] [Talk] I gave a talk on FEATHER at the SRC Liaison Meeting of the ACE Center!
[Aug. 2024] [Career] I joined Google as a student researcher on the Phazon team in PSS; more realistic privacy-preserving acceleration is coming, stay tuned!
[Jul. 2024] [Talk] I gave talks on FEATHER at NVIDIA (HQ) and NVIDIA (Westford)!
[Jun. 2024] [Talk] We debuted FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching at ISCA in Buenos Aires!
[May. 2024] [Talk] I gave a talk on FEATHER at MIT!
[May. 2024] [Award] I was selected as an "ML and Systems Rising Star" by MLCommons; excited to meet you all at NVIDIA HQ on Jul. 15~16!
[May. 2024] [Award] Our team "CipherFlitFort" was awarded Startup Launch by CreateX at Georgia Tech. Go Jackets!
[Apr. 2024] [Award] I was selected as a DAC Young Fellow for DAC 2024.
[Mar. 2024] [Paper] Our work FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching was accepted to the International Symposium on Computer Architecture (ISCA'24).
[Feb. 2024] [Paper] Our work SmartPAF: Accurate Low-Degree Polynomial Approximation of Non-polynomial Operators for Fast Private Inference in Homomorphic Encryption was accepted to the Seventh Conference on Machine Learning and Systems (MLSys'24).
[Feb. 2024] [Service] We launched the course 6.192 Constructive Computer Architecture jointly across three schools (MIT, EPFL, Georgia Tech) this year; recordings are available online. Go Architects!
[Nov. 2023] [Service] I joined the Computer Architecture Student Association (CASA) steering team: from the architects, for the architects.
[Oct. 2023] [Talk] I gave a talk on SUSHI and PAF-FHE at HAN Lab @ MIT.
[Sep. 2023] [Award] I won the Best Poster Award for presenting our work SUSHI at the IAP Workshop @ MIT.
[Sep. 2023] [Paper] Our work Hardware-Software co-design for real-time latency-accuracy navigation in tinyML applications was accepted to IEEE Micro.
[Sep. 2023] [Career] I joined MIT as a visiting researcher at CSAIL, hosted by Prof. Arvind.
[Aug. 2023] [Paper] Our work SNATCH: Stealing Neural Network Architecture from ML Accelerator in Intelligent Sensors was accepted to the IEEE SENSORS conference (SENSORS'23).
[Jul. 2023] [Paper] Our work On Continuing DNN Accelerator Architecture Scaling Using Tightly-coupled Compute-on-Memory 3D ICs was accepted to IEEE Transactions on Very Large Scale Integration (VLSI) Systems (TVLSI'23).
[Apr. 2023] [Paper] Our work FPGA-Based High-Performance Real-Time Emulation of Radar System using Direct Path Compute Model was accepted to the International Microwave Symposium (IMS'23).
[Mar. 2023] [Talk] I gave a talk on Enable Best ML Inference and Training: A Systematic Approach at the EIC Lab @ Georgia Tech.
[Mar. 2023] [Paper] Our work A High Performance Computing Architecture for Real-Time Digital Emulation of RF Interactions was accepted to the IEEE Radar Conference (RadarConf'23).
[Jul. 2022] [Talk] I presented our work FastSwitch: Enabling Real-time DNN Switching via Weight-Sharing at the 2nd Architecture, Compiler, and System Support for Multi-model DNN Workloads Workshop @ ISCA'23.
[Apr. 2022] [Award] I was a finalist for the Qualcomm Innovation Fellowship. Thank you, Qualcomm!
[Mar. 2022] [Award] I won 2nd place in the SCS Poster Competition at Georgia Tech. Thank you, SCS!
The inference efficiency of diverse ML models on spatial accelerators boils down to the execution of different dataflows (i.e., different tiling, ordering, parallelism, and shapes). Using the optimal dataflow for every layer of a workload can reduce latency by up to two orders of magnitude over a suboptimal dataflow. Unfortunately, reconfiguring hardware for different dataflows involves on-chip data layout reordering and datapath reconfiguration, leading to non-trivial overhead that hinders ML accelerators from exploiting different dataflows, resulting in suboptimal performance. To address this challenge, we propose FEATHER, an innovative accelerator that leverages a novel spatial array termed NEST and a novel multi-stage reduction network called BIRRD to perform flexible data reduction with layout reordering under the hood, enabling seamless switching between optimal dataflows with negligible latency and resource overhead. To systematically evaluate the performance interaction between dataflows and layouts, we enhance Timeloop, a state-of-the-art dataflow cost modeling and search framework, with layout assessment capabilities, and term it Layoutloop. We model FEATHER in Layoutloop and also deploy FEATHER end-to-end on the edge ZCU104 FPGA. FEATHER delivers 1.27~2.89x inference latency speedup and 1.3~6.43x energy efficiency improvement compared to various SotAs like NVDLA, SIGMA, and Eyeriss under ResNet-50 and MobileNet-V3 in Layoutloop. On practical FPGA devices, FEATHER achieves 2.65x/3.91x higher throughput than the Xilinx DPU/Gemmini. Remarkably, these performance and energy efficiency gains come at only 6% extra area over a fixed-dataflow Eyeriss-like accelerator. Our code is available at https://github.com/maeri-project/FEATHER.
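For intuition, the gap is easy to reproduce with even a crude cost model. Below is a minimal sketch, assuming a hypothetical toy latency function (the names, penalties, and numbers are illustrative and do not come from FEATHER or Layoutloop), of how the best dataflow shifts from layer to layer:

# Toy illustration of per-layer dataflow selection (not the actual
# Layoutloop cost model; all names and numbers here are hypothetical).
from itertools import permutations

def toy_latency(layer, order, par):
    # A "dataflow" here is just a loop order over (K, C, X) plus a
    # parallelized dimension; latency is scored with crude penalties.
    K, C, X = layer["K"], layer["C"], layer["X"]
    # Penalize parallelizing a dimension smaller than the array width.
    util = min(layer[par], 128) / 128
    # Penalize loop orders whose innermost loop has little data reuse.
    inner = order[-1]
    reuse = {"K": C, "C": K, "X": K * C}[inner]
    return (K * C * X) / (util * min(reuse, 64))

layers = [{"K": 512, "C": 512, "X": 49},    # late ResNet-ish layer
          {"K": 32,  "C": 16,  "X": 12544}]  # early layer: very different shape

for layer in layers:
    cands = [(toy_latency(layer, o, p), o, p)
             for o in permutations("KCX") for p in "KCX"]
    best, worst = min(cands), max(cands)
    print(f"layer {layer}: best {best[1:]} is "
          f"{worst[0] / best[0]:.1f}x faster than worst {worst[1:]}")

Even this toy model picks a different winning dataflow for the two layer shapes, which is exactly the situation where cheap on-chip dataflow switching pays off.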
@inproceedings{tong2024FEATHER,
author = {Tong, Jianming and Itagi, Anirudh and Chatarasi, Prasanth and Krishna, Tushar},
title = {FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching},
year = {2024},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
abstract = {The inference of ML models composed of diverse structures, types, and sizes boils down to the execution of different dataflows (i.e. different tiling, ordering, parallelism, and shapes). Using the optimal dataflow for every layer of workload can reduce latency by up to two orders of magnitude over a suboptimal dataflow. Unfortunately, reconfiguring hardware for different dataflows involves on-chip data layout reordering and datapath reconfigurations, leading to non-trivial overhead that hinders ML accelerators from exploiting different dataflows, resulting in suboptimal performance. To address this challenge, we propose FEATHER, an innovative accelerator that leverages a novel spatial array termed Nest and a novel multi-stage reduction network called BIRRD for performing flexible data reduction with layout reordering under the hood, enabling seamless switching between optimal dataflows with negligible latency and resources overhead. For systematically evaluating the performance interaction between dataflows and layouts, we enhance Timeloop, a state-of-the-art dataflow cost modeling and search framework, with layout assessment capabilities, and term it as Layoutloop. We model FEATHER into Layoutloop and also deploy FEATHER end-to-end on the edge ZCU104 FPGA. FEATHER delivers 1.27~2.89x inference latency speedup and 1.3~6.43x energy efficiency improvement compared to various SoTAs like NVDLA, SIGMA and Eyeriss under ResNet-50 and MobileNet-V3 in Layoutloop. On practical FPGA devices, FEATHER achieves 2.65/3.91x higher throughput than Xilinx DPU/Gemmini. Remarkably, such performance and energy efficiency enhancements come at only 6% area over a fixed-dataflow Eyeriss-like accelerator.},
booktitle = {Proceedings of the 51st Annual International Symposium on Computer Architecture},
keywords = {flexible accelerator, dataflow-layout coswitching},
location = {Buenos Aires, Argentina},
series = {ISCA '24}
}
SmartPAF: Accurate Low-Degree Polynomial Approximation of Non-polynomial Operators for Fast Private Inference in Homomorphic Encryption
As machine learning (ML) permeates fields like healthcare, facial recognition, and blockchain, the need to protect sensitive data intensifies. Fully Homomorphic Encryption (FHE) allows inference on encrypted data, preserving the privacy of both the data and the ML model. However, it slows down non-secure inference by up to five orders of magnitude, with a root cause of replacing non-polynomial operators (ReLU and MaxPooling) with high-degree Polynomial Approximated Functions (PAFs). We propose SmartPAF, a framework to replace non-polynomial operators with low-degree PAFs and then recover the accuracy of the PAF-approximated model through four techniques: (1) Coefficient Tuning (CT) -- adjusting PAF coefficients based on the input distributions before training; (2) Progressive Approximation (PA) -- progressively replacing one non-polynomial operator at a time, followed by fine-tuning; (3) Alternate Training (AT) -- alternating training between PAFs and other linear operators in a decoupled manner; and (4) Dynamic Scale (DS) / Static Scale (SS) -- dynamically scaling PAF input values into (-1, 1) during training, and fixing the scale to the running max value for FHE deployment. The synergistic effect of CT, PA, AT, and DS/SS enables SmartPAF to enhance the accuracy of various models approximated by PAFs of various low degrees on multiple datasets. For ResNet-18 under ImageNet-1k, the Pareto frontier spotted by SmartPAF in the latency-accuracy tradeoff space achieves 1.42x~13.64x accuracy improvement and 6.79x~14.9x speedup over prior works. Further, SmartPAF enables a 14-degree PAF (f_1^2 g_1^2) to achieve a 7.81x speedup compared to the 27-degree PAF obtained by minimax approximation with the same 69.4% post-replacement accuracy. Our code is available at https://github.com/EfficientFHE/SmartPAF.
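To make the PAF replacement concrete, here is a minimal sketch assuming a plain least-squares fit; the fit_paf and paf_relu helpers are hypothetical, and SmartPAF's CT/PA/AT fine-tuning is not reproduced:

# Minimal sketch of the core idea behind PAF replacement (hypothetical
# fitting procedure; SmartPAF's CT/PA/AT training steps are not shown).
import numpy as np

def fit_paf(degree, n_samples=4096):
    """Least-squares fit of a degree-`degree` polynomial to ReLU on (-1, 1)."""
    x = np.linspace(-1, 1, n_samples)
    return np.polynomial.polynomial.polyfit(x, np.maximum(x, 0), degree)

def paf_relu(x, coeffs, scale):
    """Dynamic Scale idea: map activations into (-1, 1), evaluate only the
    polynomial (the operation FHE supports cheaply), then undo the scale.
    This is valid because ReLU(x) = s * ReLU(x / s) for s > 0."""
    return scale * np.polynomial.polynomial.polyval(x / scale, coeffs)

coeffs = fit_paf(degree=8)
acts = np.random.randn(10) * 3
scale = np.abs(acts).max()  # a running max plays the role of SS's fixed scale
print(np.round(paf_relu(acts, coeffs, scale), 3))
print(np.round(np.maximum(acts, 0), 3))  # compare against exact ReLU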
@inproceedings{tong2024accurate,
author={Jianming Tong and Jingtian Dang and Anupam Golder and Callie Hao and Arijit Raychowdhury and Tushar Krishna},
booktitle = {Proceedings of Machine Learning and Systems (MLSys)},
title={Accurate Low-Degree Polynomial Approximation of Non-polynomial Operators for Fast Private Inference in Homomorphic Encryption},
url = {https://arxiv.org/abs/2404.03216},
year = {2024}
}
Hardware-Software co-design for real-time latency-accuracy navigation in tinyML applications
tinyML applications increasingly operate in dynamically changing deployment scenarios, requiring optimization for both accuracy and latency. Existing methods mainly target a single point in the accuracy/latency tradeoff space---insufficient, as no single static point can be optimal under variable conditions. We draw on a recently proposed weight-shared SuperNet mechanism to enable serving a stream of queries that activates different SubNets within a SuperNet. This creates an opportunity to exploit the inherent temporal locality of different queries that use the same SuperNet. We propose a hardware-software co-design called SUSHI that introduces a novel SubGraph Stationary optimization. SUSHI consists of a novel FPGA implementation and a software scheduler that controls which SubNets to serve and what SubGraph to cache in real time. SUSHI yields up to a 32% improvement in latency, a 0.98% increase in served accuracy, and up to 78.7% off-chip energy savings across several neural network architectures.
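A toy model of the SubGraph Stationary intuition, with made-up block IDs and a deliberately simplified caching policy (not SushiSched's actual algorithm), shows how overlap between consecutive SubNets cuts off-chip weight traffic:

# Toy sketch of the SubGraph Stationary idea: consecutive queries activate
# overlapping SubNets, so keeping the shared blocks on-chip avoids weight
# reloads. Policy and numbers are hypothetical, not SushiSched's algorithm.
def offchip_traffic(queries, keep_stationary):
    cached, loads = set(), 0
    for subnet in queries:
        needed = set(subnet)
        loads += len(needed if not keep_stationary else needed - cached)
        cached = needed if keep_stationary else set()
    return loads

# Each query activates a SubNet = subset of SuperNet blocks (IDs made up).
queries = [{0, 1, 2, 3}, {0, 1, 2, 4}, {0, 1, 3, 4}, {0, 2, 3, 4}]
print("reload everything:  ", offchip_traffic(queries, False))  # 16 blocks
print("subgraph stationary:", offchip_traffic(queries, True))   # 4+1+1+1 = 7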
@ARTICLE{10257666,
author = {P. Behnam and J. Tong and A. Khare and Y. Chen and Y. Pan and P. Gadikar and A. Bambhaniya and T. Krishna and A. Tumanov},
journal = {IEEE Micro},
title = {Hardware-Software co-design for real-time latency-accuracy navigation in tinyML applications},
year = {2023},
volume = {},
number = {01},
issn = {1937-4143},
pages = {1-7},
abstract = {tinyML applications increasingly operate in dynamically changing deployment scenarios, requiring optimizing for both accuracy and latency. Existing methods mainly target a single point in the accuracy/latency tradeoff space—insufficient as no single static point can be optimal under variable conditions. We draw on a recently proposed weight-shared SuperNet mechanism to enable serving a stream of queries that activates different SubNets within a SuperNet. This creates an opportunity to exploit the inherent temporal locality of different queries that use the same SuperNet. We propose a hardware-software co-design called SUSHI that introduces a novel SubGraph Stationary optimization. SUSHI consists of a novel FPGA implementation and a software scheduler that controls which SubNets to serve and what SubGraph to cache in real-time. SUSHI yields up to 32% improvement in latency, 0.98% increase in served accuracy, and achieves up to 78.7% saved off-chip energy across several neural network architectures.},
keywords = {kernel;training;real-time systems;optimization;neural networks;system-on-chip;software},
doi = {10.1109/MM.2023.3317243},
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA},
month = {sep}
}
A growing number of applications depend on Machine Learning (ML) functionality and benefit from both higher-quality ML predictions and better timeliness (latency) at the same time. A growing body of research in the computer architecture, ML, and systems software literature focuses on reaching better latency/accuracy tradeoffs for ML models. Efforts include compression, quantization, pruning, early-exit models, and mixed DNN precision, as well as ML inference accelerator designs that minimize latency and energy while preserving delivered accuracy. All of them, however, yield improvements for a single static point in the latency/accuracy tradeoff space. We make a case for applications that operate in dynamically changing deployment scenarios, where no single static point is optimal. We draw on a recently proposed weight-shared SuperNet mechanism to enable serving a stream of queries that uses (activates) different SubNets within this weight-shared construct. This creates an opportunity to exploit the inherent temporal locality with our proposed SubGraph Stationary (SGS) optimization. We take a hardware-software co-design approach with a real implementation of SGS in SushiAccel and the implementation of a software scheduler SushiSched controlling which SubNets to serve and what to cache in real time. Combined, they are vertically integrated into SUSHI---an inference serving stack. For the stream of queries, SUSHI yields up to a 25% improvement in latency and a 0.98% increase in served accuracy. SUSHI can achieve up to 78.7% off-chip energy savings.
@misc{behnam2023subgraph,
title={Subgraph Stationary Hardware-Software Inference Co-Design},
author={Payman Behnam and Jianming Tong and Alind Khare and Yangyu Chen and Yue Pan and Pranav Gadikar and Abhimanyu Rajeshkumar Bambhaniya and Tushar Krishna and Alexey Tumanov},
year={2023},
eprint={2306.17266},
archivePrefix={arXiv},
primaryClass={cs.DC}
}
SMMR-explore: Submap-based multi-robot exploration system with multi-robot multi-target potential field exploration method
Collaborative exploration of an unknown environment without external positioning and under limited communication is an essential task for multi-robot applications. For inter-robot positioning, various Distributed Simultaneous Localization and Mapping (DSLAM) systems share Place Recognition (PR) descriptors and sensor data to estimate the relative pose between robots and merge the robots' maps. Since maps are constantly shared among robots during exploration, we design a map-based DSLAM framework that shares only the submaps, eliminating the transfer of PR descriptors and sensor data. Our framework saves 30% of total communication traffic. For exploration, each robot is assigned to gather as much unknown information about the environment as possible while paying little travel cost. As the number of sampled points increases, the goal changes back and forth among sampled frontiers, degrading exploration efficiency and causing trajectories to overlap. We propose an exploration strategy based on the Multi-robot Multi-target Potential Field (MMPF), which eliminates the goal's back-and-forth changes, boosting exploration efficiency by 1.03x~1.62x while saving 3%~40% of travel cost. Our SubMap-based Multi-Robot Exploration method (SMMR-Explore) is evaluated in both the Gazebo simulator and on real robots. The simulator and the exploration framework are published as an open-source ROS project at https://github.com/efc-robot/SMMR-Explore.
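For intuition, below is a minimal potential-field sketch with hypothetical constants and field shape (not the paper's exact MMPF formulation): frontiers attract, teammates repel, and a travel-cost term steers each robot to a nearby frontier away from its peers:

# Toy sketch of potential-field goal selection (constants and field shape
# are hypothetical, not the paper's exact MMPF formulation).
import math

def potential(cell, frontiers, robots, k_att=1.0, k_rep=4.0):
    """Lower is better: frontiers attract, other robots repel."""
    p = sum(k_att * math.dist(cell, f) for f in frontiers)          # attraction
    p += sum(k_rep / (math.dist(cell, r) + 1e-6) for r in robots)   # repulsion
    return p

frontiers = [(9, 0), (0, 9), (10, 10)]
others = [(8, 1)]  # a teammate already heading toward the (9, 0) frontier
me = (5, 5)
# Add travel cost from my own pose so nearby frontiers are preferred.
goal = min(frontiers,
           key=lambda f: potential(f, frontiers, others) + math.dist(me, f))
print("chosen frontier:", goal)  # frontiers near the teammate score worse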
@INPROCEEDINGS{9561328,
author={Yu, Jincheng and Tong, Jianming and Xu, Yuanfan and Xu, Zhilin and Dong, Haolin and Yang, Tianxiang and Wang, Yu},
booktitle={2021 IEEE International Conference on Robotics and Automation (ICRA)},
title={SMMR-Explore: SubMap-based Multi-Robot Exploration System with Multi-robot Multi-target Potential Field Exploration Method},
year={2021},
volume={},
number={},
pages={8779-8785},
doi={10.1109/ICRA48506.2021.9561328}}
Collaborative Publications (as Collaborator or Mentor; * Equal Contribution)
Real-time Digital RF Emulation – II: A Near Memory Custom Accelerator
A near-memory hardware accelerator for real-time emulation of radio frequency systems, based on a novel direct path computational model, is demonstrated. Our evaluation of hardware performance uses both application-specific integrated circuit (ASIC) and field-programmable gate array (FPGA) methodologies: (1) The ASIC test-chip implementation, using TSMC 28nm CMOS, leverages distributed autonomous control to extract concurrency in compute as well as low latency. It achieves 518 MHz per-channel bandwidth in a prototype 4-node system. The maximum emulation range supported in this paradigm is 9.5 km with 0.24 μs of per-sample emulation latency. (2) The FPGA-based implementation, evaluated on a Xilinx ZCU104 board, demonstrates a 9-node test case (two transmitters, one receiver, and six passive reflectors) with an emulation range of 1.13 km to 27.3 km at 215 MHz bandwidth.
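As a back-of-the-envelope check (our reading, not the paper's accounting): a real-time emulator can only mimic echoes whose physical round-trip delay exceeds its own processing latency, so per-sample latency lower-bounds the emulation range:

# Range/latency arithmetic for a direct-path emulator (our reading; the
# paper's exact accounting may differ). Latency sets the minimum range a
# real-time emulator can mimic, since the echo must not arrive "early".
C = 3.0e8  # speed of light, m/s

def round_trip_delay(range_m):
    return 2 * range_m / C

def min_range(latency_s):
    return C * latency_s / 2

print(f"min range at 0.24 us/sample latency: {min_range(0.24e-6):.0f} m")
print(f"round-trip delay at 9.5 km: {round_trip_delay(9.5e3) * 1e6:.1f} us")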
@ARTICLE{10672547,
author={Mao, X. and Mukherjee, M. and Mizanur Rahman, N. and DeLude, C. and Driscoll, J. and Sharma, S. and Behnam, P. and Kamal, U. and Woo, J. and Kim, D. and Khan, S. and Tong, J. and Seo, J. and Sinha, P. and Swaminathan, M. and Krishna, T. and Pande, S. and Romberg, J. and Mukhopadhyay, S.},
journal={IEEE Transactions on Radar Systems},
title={Real-time Digital RF Emulation – II: A Near Memory Custom Accelerator},
year={2024},
volume={},
number={},
pages={1-1},
keywords={Radio frequency;Computational modeling;Emulation;Computer architecture;Hardware;Real-time systems;Pulse modulation;hardware accelerators;near-memory;radio frequency emulator;real-time},
doi={10.1109/TRS.2024.3457523}}
SNATCH: Stealing Neural Network Architecture from ML Accelerator in Intelligent Sensors
The use of Machine Learning (ML) models executing on ML Accelerators (MLAs) in intelligent sensors for feature extraction has garnered substantial interest. The Neural Network (NN) architectures implemented on an MLA are intellectual property of the vendors. Along with improved power efficiency and reduced bandwidth, the hardware-based ML models embedded in the sensor also provide additional security against cyber-attacks on the ML. In this paper, we introduce an attack, referred to as SNATCH, which uses a profiling-based side-channel attack (SCA) to steal the NN architecture executing on a digital MLA (the Deep Learning Processing Unit (DPU) IP by Xilinx). We use electromagnetic side-channel leakage from a clone device to create a profiler and then attack the victim's device to steal the NN architecture. Stealing the ML model undermines the intellectual property rights of a sensor's vendors. Further, it also allows an adversary to mount critical Denial-of-Service and misuse attacks.
On Continuing DNN Accelerator Architecture Scaling Using Tightly-coupled Compute-on-Memory 3D ICs
This work identifies the architectural and design scaling limits of 2D flexible interconnect DNN accelerators and addresses them with 3D ICs. We demonstrate how scaling up a baseline 2D accelerator in the X/Y dimension fails and how vertical stacking effectively overcomes the failure. We designed multi-tier accelerators that are 1.67X faster than the 2D design. Using our 3D architecture and circuit co-design methodology, we improve throughput, energy-efficiency, and area-efficiency by up to 5X, 1.2X, and 3.9X, respectively, over 2D counterparts. The IR-drop in our 3D designs is within 10.7% of VDD, and the temperature variation is within 12°C.
@ARTICLE{10221779,
author={Murali, Gauthaman and Iyer, Aditya and Zhu, Lingjun and Tong, Jianming and Martínez, Francisco Muñoz and Srinivasa, Srivatsa Rangachar and Karnik, Tanay and Krishna, Tushar and Lim, Sung Kyu},
journal={IEEE Transactions on Very Large Scale Integration (VLSI) Systems},
title={On Continuing DNN Accelerator Architecture Scaling Using Tightly Coupled Compute-on-Memory 3-D ICs},
year={2023},
volume={},
number={},
pages={1-11},
doi={10.1109/TVLSI.2023.3299564}}
FPGA-Based High-Performance Real-Time Emulation of Radar System using Direct Path Compute Model
This paper proposes a Field-Programmable Gate Array (FPGA) based platform for real-time emulation of Radio Frequency (RF) signal interactions among radars and reflectors. Unlike conventional tapped-delay Finite-Impulse-Response (FIR) models, the paper presents an FPGA realization of the Direct Path Compute model for RF signal propagation. Experimental results on a Xilinx ZCU104 board demonstrate a 5-platform system with an emulation range of 2.78 km to 27.3 km at 180 MHz bandwidth.
@INPROCEEDINGS{10187950,
author={Mao, X. and Mukherjee, M. and Rahman, N. M. and Kamal, U. and Sharma, S. and Behnam, P. and Tong, J. and Driscoll, J. and Krishna, T. and Romberg, J. and Mukhopadhyay, S.},
booktitle={2023 IEEE/MTT-S International Microwave Symposium - IMS 2023},
title={FPGA-Based High-Performance Real-Time Emulation of Radar System Using Direct Path Compute Model},
year={2023},
volume={},
number={},
pages={419-422},
keywords={Radio frequency;Microwave measurement;Computational modeling;Emulation;RF signals;Radar;Bandwidth;RF emulation;direct path compute model;FPGA;real-time},
doi={10.1109/IMS37964.2023.10187950}}
A High Performance Computing Architecture for Real-Time Digital Emulation of RF Interactions
A high-performance architecture for emulating real-time radio frequency systems is presented. The architecture is developed based on a novel compute model and uses near-memory techniques coupled with highly distributed autonomous control to simultaneously optimize throughput and minimize latency. A cycle-level C++-based simulator is used to validate the proposed architecture with simulations of complex RF scenarios.
@INPROCEEDINGS{10149577,
author={Mukherjee, Mandovi and Rahman, Nael Mizanur and DeLude, Coleman and Driscoll, Joseph and Kamal, Uday and Woo, Jongseok and Seo, Jamin and Sharma, Sudarshan and Mao, Xiangyu and Behnam, Payman and Khan, Sharjeel and Kim, Daehyun and Tong, Jianming and Sinha, Prachi and Pande, Santosh and Krishna, Tushar and Romberg, Justin and Swaminathan, Madhavan and Mukhopadhyay, Saibal},
booktitle={2023 IEEE Radar Conference (RadarConf23)},
title={A High Performance Computing Architecture for Real-Time Digital Emulation of RF Interactions},
year={2023},
volume={},
number={},
pages={1-6},
doi={10.1109/RadarConf2351548.2023.10149577}}
A Configurable Architecture for Efficient Sparse FIR Computation in Real-time Radio Frequency Systems
A low-latency, high-throughput, configurable architecture for computing sparse Finite Impulse Response filters in the real-time Radio Frequency domain is proposed. The massively parallel architecture uses distributed control in association with near-memory techniques to optimize area and power. It supports configurable filter tap locations and handles locally dense taps, making it more adaptable to Radio Frequency environments.
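The payoff of sparsity is that work scales with the number of nonzero taps rather than the filter span. A minimal software sketch of that computation (tap values and locations are made up; the paper realizes this with near-memory hardware):

# Minimal sketch of sparse FIR evaluation (tap values and locations are
# made up; the paper's architecture computes this in hardware).
def sparse_fir(x, taps):
    """taps: {delay: coefficient} storing only the nonzero entries, so work
    scales with the number of taps, not the filter span."""
    y = [0.0] * len(x)
    for n in range(len(x)):
        y[n] = sum(c * x[n - d] for d, c in taps.items() if n - d >= 0)
    return y

# A long, mostly-empty impulse response with one locally dense cluster.
taps = {0: 0.5, 3: 0.2, 1024: 0.1, 1025: 0.05, 1026: 0.025}
x = [1.0] + [0.0] * 1100  # unit impulse recovers the taps themselves
print([round(v, 3) for v in sparse_fir(x, taps) if v])
# -> [0.5, 0.2, 0.1, 0.05, 0.025]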
@INPROCEEDINGS{9865486,
author={Seo, Jamin and Mukherjee, Mandovi and Rahman, Nael Mizanur and Tong, Jianming and DeLude, Coleman and Krishna, Tushar and Romberg, Justin and Mukhopadhyay, Saibal},
booktitle={2022 IEEE/MTT-S International Microwave Symposium - IMS 2022},
title={A Configurable Architecture for Efficient Sparse FIR Computation in Real-time Radio Frequency Systems},
year={2022},
volume={},
number={},
pages={998-1001},
doi={10.1109/IMS37962.2022.9865486}}
ac2SLAM: FPGA Accelerated High-Accuracy SLAM with Heapsort and Parallel Keypoint Extractor
To fulfill the rich functions of the application layer, a robust and accurate Simultaneous Localization and Mapping (SLAM) technique is critical for robotics. However, due to the lack of sufficient computing power and storage capacity, it is challenging to deploy high-accuracy SLAM on embedded devices efficiently. In this work, we propose a complete acceleration scheme, termed ac2SLAM, based on the ORB-SLAM2 algorithm, including both the front and back ends, and implement it on an FPGA platform. Specifically, ac2SLAM features: 1) a scalable and parallel ORB extractor to extract sufficient keypoints and scores for throughput matching with 4% error; 2) a PingPong heapsort component (pp-heapsort) that selects the significant keypoints and achieves a single-cycle initiation interval, reducing the amount of data transferred between the accelerator and the host CPU; and 3) potential parallel acceleration strategies for the back-end optimization. Compared with running ORB-SLAM2 on the ARM processor, ac2SLAM is 2.1x and 2.7x faster on the TUM and KITTI datasets, while maintaining 10% of the error of the SOTA eSLAM. In addition, the FPGA-accelerated front end is 4.55x and 40x faster than eSLAM and ARM, respectively. ac2SLAM is fully open-sourced at https://github.com/SLAM-Hardware/acSLAM.
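A software analogue of the keypoint-selection step, using Python's heapq as a stand-in for the pp-heapsort hardware (the single-cycle initiation interval is a hardware property not modeled here):

# Software analogue of keypoint selection: keep only the k strongest
# keypoints in a bounded min-heap so only k points cross to the host.
# heapq is an idiomatic stand-in, not the pp-heapsort implementation.
import heapq, random

def top_k_keypoints(keypoints, k):
    """Return the k highest-score keypoints using a bounded min-heap."""
    heap = []  # min-heap of (score, x, y); the root is the weakest survivor
    for score, x, y in keypoints:
        if len(heap) < k:
            heapq.heappush(heap, (score, x, y))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, x, y))  # evict the weakest
    return sorted(heap, reverse=True)

pts = [(random.random(), random.randrange(640), random.randrange(480))
       for _ in range(5000)]
print(top_k_keypoints(pts, 3))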
@INPROCEEDINGS{9609808,
author={Wang, Cheng and Liu, Yingkun and Zuo, Kedai and Tong, Jianming and Ding, Yan and Ren, Pengju},
booktitle={2021 International Conference on Field-Programmable Technology (ICFPT)},
title={ac2SLAM: FPGA Accelerated High-Accuracy SLAM with Heapsort and Parallel Keypoint Extractor},
year={2021},
volume={},
number={},
pages={1-9},
doi={10.1109/ICFPT52863.2021.9609808}}
PIT: Processing-In-Transmission with Fine-Grained Data Manipulation Networks
In the domain of data-parallel computation, most works focus on dataflow optimization inside the PE array and a favorable memory hierarchy to pursue maximum parallelism and efficiency, while the importance of data contents has long been overlooked. As we observe, for structured data, insights on the contents (i.e., their values and locations within a structured form) can greatly benefit computation performance, as fine-grained data manipulation can be performed. In this paper, we claim that by providing a flexible and adaptive data path, an efficient architecture with the capability of fine-grained data manipulation can be built. Specifically, we design SOM, a portable and highly-adaptive data transmission network with the capability of operand sorting, non-blocking self-route ordering, and multicasting. Based on SOM, we propose the processing-in-transmission architecture (PITA), which extends the traditional SIMD architecture to perform some fundamental data processing during transmission by embedding multiple levels of SOM networks on the data path. We evaluate the performance of PITA on two irregular computation problems. We first map the matrix inversion task onto PITA and show that considerable performance gains can be achieved, resulting in 3x-20x speedup against Intel MKL and 20x-40x against cuBLAS. Then we evaluate PITA on sparse CNNs. The results indicate that PITA can greatly improve computation efficiency and reduce memory bandwidth pressure. We achieve 2x-9x speedup against several state-of-the-art accelerators on sparse CNNs, where nearly 100 percent PE efficiency is maintained under high sparsity. We believe the concept of PIT is a promising computing paradigm that can enlarge the capability of traditional parallel architectures.
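Sorting can ride along a transmission path because sorting networks use a fixed, data-independent compare-exchange pattern. A textbook bitonic sorter (SOM's actual wiring and feature set differ) illustrates the idea:

# Toy compare-exchange network in the spirit of SOM's operand sorting (the
# actual SOM design differs; this is a textbook bitonic sorter for 2^k lanes).
def bitonic_sort(lanes, up=True):
    if len(lanes) <= 1:
        return lanes
    mid = len(lanes) // 2
    first = bitonic_sort(lanes[:mid], True)    # ascending half
    second = bitonic_sort(lanes[mid:], False)  # descending half
    return bitonic_merge(first + second, up)

def bitonic_merge(lanes, up):
    if len(lanes) <= 1:
        return lanes
    mid = len(lanes) // 2
    # One stage of compare-exchange units; the pattern is data-independent,
    # which is what lets it be laid out as a fixed transmission datapath.
    for i in range(mid):
        if (lanes[i] > lanes[i + mid]) == up:
            lanes[i], lanes[i + mid] = lanes[i + mid], lanes[i]
    return bitonic_merge(lanes[:mid], up) + bitonic_merge(lanes[mid:], up)

print(bitonic_sort([7, 3, 9, 1, 6, 0, 8, 2]))  # [0, 1, 2, 3, 6, 7, 8, 9]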
@ARTICLE{9311185,
author={Zong, Pengchen and Xia, Tian and Zhao, Haoran and Tong, Jianming and Li, Zehua and Zhao, Wenzhe and Zheng, Nanning and Ren, Pengju},
journal={IEEE Transactions on Computers},
title={PIT: Processing-In-Transmission With Fine-Grained Data Manipulation Networks},
year={2021},
volume={70},
number={6},
pages={877-891},
doi={10.1109/TC.2020.3048233}}
COCOA: Content-Oriented Configurable Architecture Based on Highly-Adaptive Data Transmission Networks
Proceedings of the 2020 Great Lakes Symposium on VLSI (GLSVLSI), 2020.
Insight: Adding a NoC between memory, cache, and CPU to support Sorting, Ordering, and Multicasting (SOM) could boost CPU performance for matrix inversion by 25x.
In the domain of parallel computation, most works focus on optimizing PE organization or the memory hierarchy to pursue maximum efficiency, while the importance of data contents has long been overlooked. Actually, for structured data, insights on data contents (i.e., values and locations within a structured form) can greatly benefit computation performance, as fine-grained data manipulation can be performed. In this paper, we claim that by providing a flexible and adaptive data path, an efficient architecture with the capability of fine-grained data manipulation can be built. Specifically, we propose COCOA, a novel content-oriented configurable architecture, which integrates multi-functional data reorganization networks into the traditional computing scheme to handle the contents of data along the transmission path, so that they can be processed more efficiently. We evaluate COCOA on various problems: a complex matrix algorithm (matrix inversion) and sparse DNNs. The results indicate that COCOA is versatile enough to achieve high computation efficiency in both cases.
@inproceedings{10.1145/3386263.3406924,
author = {Xia, Tian and Zong, Pengchen and Zhao, Haoran and Tong, Jianming and Zhao, Wenzhe and Zheng, Nanning and Ren, Pengju},
title = {COCOA: Content-Oriented Configurable Architecture Based on Highly-Adaptive Data Transmission Networks},
year = {2020},
isbn = {9781450379441},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3386263.3406924},
doi = {10.1145/3386263.3406924},
abstract = {In domain of parallel computation, most works focus on optimizing PE organization or memory hierarchy to pursue the maximum efficiency, while the importance of data contents has been overlooked for a long time. Actually for structured data, insights on data contents (i.e. values and locations within a structured form) can greatly benefit the computation performance, as fine-grained data manipulation can be performed. In this paper, we claim that by providing a flexible and adaptive data path, an efficient architecture with capability of fine-grained data manipulation can be built. Specifically, we propose COCOA, a novel content-oriented configurable architecture, which integrates multi-functional data reorganization networks in traditional computing scheme to handle the contents of data during the transmission path, so that they can be processed more efficiently. We evaluate COCOA on various problems: complex matrix algorithm (matrix inversion) and sparse DNN. The results indicates that COCOA is versatile enough to achieve high computation efficiency in both cases.},
booktitle = {Proceedings of the 2020 on Great Lakes Symposium on VLSI},
pages = {253–258},
numpages = {6},
keywords = {transmission network, computing architecture, high-performance computing, data reorganization},
location = {Virtual Event, China},
series = {GLSVLSI '20}
}
Workshops
A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching
The increasing prevalence of Machine Learning (ML) in various applications has led to the emergence of ML models with diverse structures, types, and sizes. ML model inference boils down to the execution of different dataflows (tiling, ordering, parallelism, and shapes), and using the optimal dataflow can reduce latency by up to two orders of magnitude over an inefficient one. Unfortunately, reconfiguring hardware for different dataflows involves on-chip data reordering and datapath reconfiguration, leading to non-trivial overheads that hinder ML accelerators from exploiting different dataflows, resulting in suboptimal performance. To address this challenge, we propose LAMBDA, an innovative accelerator that leverages a novel multi-stage reduction network called the Additive Folded Fat Tree (AFFT) for reordering data in data reduction (RIR), enabling seamless switching between optimal dataflows with negligible latency and resource overhead. LAMBDA creates an opportunity to change the optimal dataflow at the granularity of layers without incurring additional latency overhead, and to explore optimal dataflows on real hardware with faster and more precise evaluation results. LAMBDA demonstrates a 0.5~2x speedup in end-to-end inference latency over the SotA Xilinx DPU on the Xilinx ZCU104 embedded FPGA board.
ReLU-FHE: Low-cost Accurate ReLU Polynomial Approximation in Fully Homomorphic Encryption Based ML Inference
Machine learning (ML) is getting more pervasive. Wide adoption of ML in healthcare, facial recognition, and blockchain involves private and sensitive data. One of the most promising candidates for inference on encrypted data, Fully Homomorphic Encryption (FHE), preserves the privacy of both the data and the ML model. However, it slows down plaintext inference by six orders of magnitude, with a root cause of replacing non-polynomial operators with a latency-prohibitive 27-degree Polynomial Approximated Function (PAF). While prior research has investigated low-degree PAFs, naive stochastic gradient descent (SGD) training fails to converge on PAFs with degrees higher than 5, leading to limited accuracy compared to the state-of-the-art 27-degree PAF. Therefore, we propose four training techniques to enable convergence of the post-approximation model using PAFs of arbitrary degree: (1) Dynamic Scaling (DS) and Static Scaling (SS) to minimize approximation error, (2) Coefficient Tuning (CT) to obtain a good initial coefficient value for each PAF, (3) Progressive Approximation (PA) to simplify the two-variable regression optimization problem into a single-variable one for fast and easy convergence, and (4) Alternate Training (AT) to retrain the post-replacement PAFs and other linear layers in a decoupled divide-and-conquer manner. The combination of DS/SS, CT, PA, and AT enables the exploration of the accuracy-latency space for FHE-domain ReLU replacement. Leveraging the proposed techniques, we propose a systematic approach (PAF-FHE) that enables low-degree PAFs to demonstrate the same accuracy as SotA high-degree PAFs. We evaluated PAFs of various degrees on different models and datasets, and PAF-FHE consistently enables low-degree PAFs to achieve higher accuracy than SotA PAFs. Specifically, for ResNet-18 under the ImageNet-1k dataset, our spotted optimal 12-degree PAF reduces latency by 56% compared to the SotA 27-degree PAF with the same post-replacement accuracy (69.4%). For VGG-19 under the CIFAR-10 dataset, the optimal 12-degree PAF achieves even 0.84% higher accuracy with 72% latency savings. Our code is open-sourced at: https://github.com/TorchFHE/PAF-FHE.
@misc {PPR:PPR658940,
Title = {PAF-FHE: Low-Cost Accurate Non-Polynomial Operator Polynomial Approximation in Fully Homomorphic Encryption Based ML Inference},
Author = {Dang, Jingtian and Tong, Jianming and Golder, Anupam and Raychowdhury, Arijit and Hao, Cong and Krishna, Tushar},
DOI = {10.21203/rs.3.rs-2910088/v1},
Abstract = {Machine learning (ML) is getting more pervasive. Wide adoption of ML in healthcare, facial recognition, and blockchain involves private and sensitive data. One of the most promising candidates for inference on encrypted data, termed Fully Homomorphic Encryption (FHE), preserves the privacy of both data and the ML model. However, it slows down plaintext inference by six magnitudes, with a root cause of replacing non-polynomial operators with latency-prohibitive 27-degree Polynomial Approximated Function (PAF). While prior research has investigated low-degree PAFs, naive stochastic gradient descent (SGD) training fails to converge on PAFs with degrees higher than 5, leading to limited accuracy compared to the state-of-the-art 27-degree PAF. Therefore, we propose four training techniques to enable convergence in the post-approximation model using PAFs with an arbitrary degree, including (1) Dynamic Scaling (DS) and Static Scaling (SS) to enable minimal approximation error during approximation, (2) Coefficient Tuning (CT) to obtain a good initial coefficient value for each PAF, (3) Progressive Approximation (PA) to simply the two-variable regression optimization problem into single-variable for fast and easy convergence, and (4) Alternate Training (AT) to retraining the post-replacement PAFs and other linear layers in a decoupled divide-and-conquer manner. A combination of DS/SS, CT, PA, and AT enables the exploration of accuracy-latency space for FHE-domain ReLU replacement. Leveraging the proposed techniques, we propose a systematic approach (PAF-FHE) to enable low-degree PAF to demonstrate the same accuracy as SotA high-degree PAFs. We evaluated PAFs with various degrees on different models and variant datasets, and PAF-FHE consistently enables low-degree PAF to achieve higher accuracy than SotA PAFs. Specifically, for ResNet-18 under the ImageNet-1k dataset, our spotted optimal 12-degree PAF reduces 56% latency compared to the SotA 27-degree PAF with the same post-replacement accuracy (69.4%). While as for VGG-19 under the CiFar-10 dataset, optimal 12-degree PAF achieves even 0.84% higher accuracy with 72% latency saving. Our code is open-sourced at: https://github.com/TorchFHE/PAF-FHE},
Publisher = {Research Square},
Year = {2023},
URL = {https://doi.org/10.21203/rs.3.rs-2910088/v1},
}
FastSwitch: Enabling Real-time DNN Switching via Weight-Sharing
A growing number of applications depend on Machine Learning (ML) functionality and benefit from both higher-quality ML predictions and better timeliness (latency) at the same time. A growing body of research in the computer architecture, ML, and systems software literature focuses on reaching better latency/accuracy tradeoffs for ML models. Efforts include compression, quantization, pruning, early-exit models, and mixed DNN precision, as well as ML inference accelerator designs that minimize latency and energy while preserving delivered accuracy. All of them, however, yield improvements for a single static point in the latency/accuracy tradeoff space. We make a case for applications that operate in dynamically changing deployment scenarios, where no single static point is optimal. We draw on a recently proposed weight-shared SuperNet mechanism to enable serving a stream of queries that uses (activates) different SubNets within this weight-shared construct. This creates an opportunity to exploit the inherent temporal locality with our proposed SubGraph Stationary (SGS) optimization. We take a hardware-software co-design approach with a real implementation of SGS in SushiAccel and the implementation of a software scheduler SushiSched controlling which SubNets to serve and what to cache in real time. Combined, they are vertically integrated into SUSHI---an inference serving stack. For the stream of queries, SUSHI yields up to a 25% improvement in latency and a 0.98% increase in served accuracy. SUSHI can achieve up to 78.7% off-chip energy savings.
Stay tuned
Education
Georgia Institute of Technology, USA
Ph.D. in Computer Science
• Jan. 2021 to Present
Advisor: Prof. Tushar Krishna
Georgia Institute of Technology, USA
M.S. in Computer Science
• Jan. 2021 to May 2024
Advisor: Prof. Tushar Krishna
Xi'an Jiaotong University, China
B.E. in Electrical Engineering and Automation (EE)
• Sep. 2016 to Jun. 2020
Advisor: Prof. Pengju Ren
Experience
Google, USA
Student Researcher
• Aug. 2024 to Present
Host: Asra Ali
Massachusetts Institute of Technology, USA
Visiting Researcher
• Sep. 2023 to Present
Advisor: Prof. Tushar Krishna , Host: Prof. Arvind
Rivos Inc., Mountain View, CA
Ph.D. Intern in Computer Architecture
• May. 2023 to Aug. 2023
Pacific Northwest National Lab (PNNL), Battelle WA
Research Intern in Computer Architecture
• Jun. 2022 to Aug. 2022
Alibaba DAMO Academy, Beijing
Research Intern in Fully Homomorphic Encryption Accelerator • Jul. 2021 to Aug. 2021
Tsinghua University, Beijing
(Visiting Student) Research Assistant in Robotics • Aug. 2020 to Jan. 2021
Advisor: Prof. Yu Wang
Book
On-Chip Networks (Chinese Translation)
Translator
Abstract
This book targets engineers and researchers familiar with basic computer architecture concepts who are interested in learning about on-chip networks. This work is designed to be a short synthesis of the most critical concepts in on-chip network design. It is a resource for both understanding on-chip network basics and providing an overview of state-of-the-art research in on-chip networks.
[purchase translated version] [English version -- Free for University] [obtain original version]