Staff Researcher
Samsung Advanced Institute of Technology (SAIT)
Feb 2022 - Present
- Led research on optimizing distributed LLM inference, developing automated pipelining frameworks to mitigate inter-GPU communication bottlenecks.
- Developed a fully automated performance-tuning framework for mobile NPU compilers (MLIR-based and in-house), achieving significant inference throughput gains on commercial devices.
- Accelerated on-device LLM inference through compiler and runtime optimizations, enabling its first deployment on Samsung Galaxy smartphones.