TY - GEN
T1 - High-Performance NTT Hardware Accelerator to Support ML-KEM and ML-DSA
AU - Kundi, Dur E.Shahwar
AU - Bermudo Mera, Jose M.
AU - Strub, Pierre Yves
AU - Hutter, Michael
N1 - Publisher Copyright: © 2024 Copyright held by the owner/author(s).
PY - 2024/11/19
Y1 - 2024/11/19
AB - Large polynomial multiplications are crucial for Post-Quantum Cryptography standards like Module-Lattice-based Key Encapsulation Mechanism (ML-KEM) and Module-Lattice-based Digital Signature (ML-DSA). Because these multiplications are computationally expensive, they are often accelerated using the Number Theoretic Transform (NTT). This work presents a novel architecture for a high-performance NTT accelerator capable of performing both NTT and inverse NTT operations using a single set of hardware resources. The design makes use of a single butterfly configuration unit to reduce resource requirements and shorten the critical path. The Multi-path Delay Commutator (MDC) strategy is employed to enable fully pipelined and parallel processing of multiple coefficients, supporting both ML-KEM and ML-DSA computations. Practical results show that the proposed NTT engine requires 3,821 LUTs, 2,970 FFs, 20 DSPs, and 5 BRAMs on an AMD Zynq UltraScale+ FPGA, and can run at up to 322 MHz. The design provides the best Area-Time Product (ATP) among current NTT architectures.
KW - CRYSTALS-Dilithium
KW - CRYSTALS-Kyber
KW - ML-DSA
KW - ML-KEM
KW - Multi-path Delay Commutator (MDC)
KW - NTT
KW - Polynomial Multiplication
UR - https://www.scopus.com/pages/publications/85214111735
U2 - 10.1145/3689939.3695785
DO - 10.1145/3689939.3695785
M3 - Conference contribution
AN - SCOPUS:85214111735
T3 - ASHES 2024 - Proceedings of the 2024 Workshop on Attacks and Solutions in Hardware Security, Co-Located with: CCS 2024
SP - 100
EP - 105
BT - ASHES 2024 - Proceedings of the 2024 Workshop on Attacks and Solutions in Hardware Security, Co-Located with: CCS 2024
PB - Association for Computing Machinery, Inc
T2 - 2024 Workshop on Attacks and Solutions in Hardware Security, ASHES 2024
Y2 - 14 October 2024 through 18 October 2024
ER -