Minsung Jang — AI Systems & Accelerated Computing

Research identity

One workload — large-model training and inference — driven down the full stack, from accelerator runtime to cluster, network, and a secure cloud foundation.

1 · Accelerator-aware inference & runtimes

Real-hardware processing-in-memory scheduling and matrix mapping, low-precision kernels, and LLM serving.

PAISE · Omni-MP · FireQ · DSDE

2 · GPU-cluster & systems-for-ML training

Convergence-safe, efficient training across heterogeneous GPU fleets; GPU-as-a-Service.

JABAS · x.Cloud

3 · AI-datacenter networking

End-host RDMA/RoCEv2 congestion control for GPU interconnect at fabric scale.

CORN (AMD hardware-NIC prototype)

4 · Cloud, OS & secure systems

Cloud platforms and storage, persistent memory, and confidential computing for secure AI execution.

Samsung Cloud Platform · TIPS · Nested Enclaves

Diagram: one workload — large-model training and inference — flows down four layers: accelerator-aware runtimes, GPU-cluster training, AI-datacenter networking, and cloud and secure systems.

Selected research

PAISE · IEEE HPCA 2025 · Peer-reviewed

Problem. Decoder-only LLM inference is memory-bound: attention re-reads a growing KV cache and stalls the GPU on HBM bandwidth.

Contribution. A per-layer GPU-vs-PIM offloading scheduler with data-layout adjustment and an interleaved-batched GEMM kernel on real HBM-PIM — up to 48.3% faster than GPU-only.

Role. Senior (last) author; set the direction and the offloading-scheduling algorithm and cost model.

PAISE schedules memory-bound attention to HBM-PIM and compute-bound layers to the GPU.
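The per-layer offloading decision described above can be illustrated with a small roofline-style sketch. This is not the PAISE cost model: the `Layer` and `Device` types, the `transfer_penalty` parameter, and all hardware numbers are hypothetical and chosen only to show how a memory-bound layer might land on PIM while a compute-bound layer stays on the GPU.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    flops: float        # floating-point operations for one pass over the layer
    bytes_moved: float  # bytes read from and written to memory


@dataclass
class Device:
    name: str
    peak_flops: float   # FLOP/s (hypothetical figure)
    bandwidth: float    # bytes/s (hypothetical figure)


def estimated_time(layer, dev):
    # Roofline-style estimate: the slower of compute time and memory time.
    return max(layer.flops / dev.peak_flops, layer.bytes_moved / dev.bandwidth)


def schedule(layers, gpu, pim, transfer_penalty=0.0):
    # Assumption: each layer is placed independently on the cheaper device.
    # transfer_penalty stands in for any data-layout or movement cost.
    plan = {}
    for layer in layers:
        t_gpu = estimated_time(layer, gpu)
        t_pim = estimated_time(layer, pim) + transfer_penalty
        plan[layer.name] = pim.name if t_pim < t_gpu else gpu.name
    return plan


# Made-up devices: the GPU has far more compute, the PIM more usable bandwidth.
gpu = Device("gpu", peak_flops=100e12, bandwidth=1e12)
pim = Device("pim", peak_flops=5e12, bandwidth=4e12)
layers = [
    Layer("attention_decode", flops=2e9, bytes_moved=2e9),  # about 1 FLOP/byte
    Layer("ffn_batched", flops=4e12, bytes_moved=4e9),      # about 1000 FLOP/byte
]
plan = schedule(layers, gpu, pim)
print(plan)
```

With these illustrative numbers the low-intensity attention layer is assigned to PIM and the high-intensity feed-forward layer to the GPU, which matches the split described in the caption above.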

Omni-MP · MCCSys @ ACM ICS 2026 · to appear · Peer-reviewed

Problem. Real PIM accepts only fixed matrix tiles, so prior PIM-for-LLM work accelerated only feed-forward layers, not attention.

Contribution. Generalized batched PIM-GEMV mapping (interlacing/cutting), zero-tile skipping, and PIM-aware KV concatenation — 40.2% average end-to-end speedup on real hardware.

Role. Senior co-author and real-PIM program lead; set the generalization goal.

Omni-MP reshapes arbitrary LLM matrices to fit the fixed PIM compute tile and skips zero-padded tiles.
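A minimal sketch of the fixed-tile idea, assuming a plain list-of-rows matrix and a hypothetical tile size: the matrix is cut into fixed tiles with zero padding at the edges, and tiles that contain only zeros are skipped during the matrix-vector product. It does not reproduce Omni-MP's interlacing/cutting mapping or its KV concatenation, and for simplicity it skips any all-zero tile, whether the zeros come from padding or from the data.

```python
def split_into_tiles(matrix, tile_rows, tile_cols):
    """Cut a matrix (list of rows) into fixed-size tiles, zero-padding edges."""
    rows, cols = len(matrix), len(matrix[0])
    tiles = {}
    for r0 in range(0, rows, tile_rows):
        for c0 in range(0, cols, tile_cols):
            tiles[(r0, c0)] = [
                [
                    matrix[r][c] if r < rows and c < cols else 0.0
                    for c in range(c0, c0 + tile_cols)
                ]
                for r in range(r0, r0 + tile_rows)
            ]
    return tiles


def is_zero_tile(tile):
    return all(v == 0 for row in tile for v in row)


def tiled_gemv(matrix, vector, tile_rows, tile_cols):
    """Compute matrix @ vector tile by tile; return (result, skipped tiles)."""
    rows = len(matrix)
    out = [0.0] * rows
    skipped = 0
    for (r0, c0), tile in split_into_tiles(matrix, tile_rows, tile_cols).items():
        if is_zero_tile(tile):
            skipped += 1  # nothing to compute for this tile
            continue
        for i, tile_row in enumerate(tile):
            r = r0 + i
            if r >= rows:
                continue  # row exists only as padding
            for j, w in enumerate(tile_row):
                c = c0 + j
                if c < len(vector):
                    out[r] += w * vector[c]
    return out, skipped


# A 3x5 matrix mapped onto hypothetical 2x2 tiles.
matrix = [
    [1, 2, 0, 0, 3],
    [4, 5, 0, 0, 6],
    [7, 8, 0, 0, 9],
]
vector = [1, 1, 1, 1, 1]
result, skipped = tiled_gemv(matrix, vector, tile_rows=2, tile_cols=2)
print(result, skipped)
```

Here the 3x5 matrix needs six 2x2 tiles, two of which are entirely zero and are never multiplied.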

JABAS · EuroSys 2025 · with UNIST · Peer-reviewed

Problem. Training on heterogeneous GPU clusters wastes resources, and the usual remedy breaks the i.i.d. assumption and hurts convergence.

Contribution. IIDP preserves convergence via virtual stream workers while batching and GPU allocation are tuned jointly — −33.3% time, −54.2% cost, no accuracy loss.

Role. Industry co-author (collaboration with UNIST); supplied the production heterogeneous-GPU problem and real-cluster validation context — did not lead the algorithm design.

JABAS keeps a shared local batch size across heterogeneous GPUs via virtual stream workers and jointly tunes batching and allocation.
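One way to picture the shared local batch size is the sketch below, in which faster GPUs host proportionally more virtual workers so that every worker uses the same batch and per-step times stay balanced. It is an illustration only, not the JABAS or IIDP algorithm: the function names, the `max_per_gpu` limit, the proportional rule, and the throughput figures are all assumptions, and the joint tuning of batching and allocation is not modeled.

```python
def assign_virtual_workers(throughput, local_batch, max_per_gpu=8):
    """throughput: samples/s each GPU sustains (hypothetical measurements).

    Returns (virtual workers per GPU, resulting global batch size).
    """
    slowest = min(throughput.values())
    workers = {
        # Assumption: worker count scales with throughput relative to the
        # slowest GPU, rounded down and clamped to [1, max_per_gpu].
        gpu: max(1, min(max_per_gpu, int(rate // slowest)))
        for gpu, rate in throughput.items()
    }
    global_batch = local_batch * sum(workers.values())
    return workers, global_batch


def step_time(throughput, workers, local_batch):
    """Seconds per step if each GPU processes workers * local_batch samples."""
    return {
        gpu: workers[gpu] * local_batch / throughput[gpu] for gpu in throughput
    }


# Made-up fleet with a 6:3:1 throughput ratio.
throughput = {"fast": 3000, "mid": 1500, "slow": 500}
workers, global_batch = assign_virtual_workers(throughput, local_batch=32)
times = step_time(throughput, workers, local_batch=32)
print(workers, global_batch, times)
```

In this example every virtual worker trains on 32 samples, the fleet runs 10 workers in total, and each GPU finishes its step in the same estimated time.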

CORN · IEEE IPCCC 2023 · Peer-reviewed

Problem. RoCEv2 leans on switch-level PFC, which causes head-of-line blocking and PFC storms at the scale of AI GPU fabrics.

Contribution. End-host, RTT/BDP-driven congestion control with selective-ACK recovery makes RDMA run PFC-free and switch-transparent — zero PFC events in evaluation.

Role. Senior (last) author; set the cloud-RDMA direction and the RTT-based congestion-detection idea. Implemented as a hardware-NIC prototype in collaboration with AMD and demonstrated at SC25.

CORN places a shim in the end-host NIC between RDMA transport and UDP, using RTT signals so RDMA runs over Ethernet with no switch PFC.
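The RTT/BDP-driven idea can be sketched as a generic delay-based window controller at the end host. This is not CORN's algorithm: the class name, the RTT threshold, the decrease factor `beta`, and the additive increase of one MTU are hypothetical choices, and selective-ACK recovery is not modeled.

```python
class RttWindow:
    """Delay-based congestion window bounded by the bandwidth-delay product."""

    def __init__(self, base_rtt_us, link_rate_gbps, mtu_bytes=4096):
        self.base_rtt_us = base_rtt_us
        self.mtu = mtu_bytes
        # BDP in bytes: (Gbit/s * 1000 / 8) bytes per microsecond * RTT in us.
        self.bdp = link_rate_gbps * 1000 / 8 * base_rtt_us
        self.cwnd = self.bdp  # start at one BDP (an assumption)

    def on_ack(self, rtt_sample_us, threshold=1.5, beta=0.5):
        if rtt_sample_us > threshold * self.base_rtt_us:
            # RTT inflation suggests queueing in the fabric: back off.
            self.cwnd = max(self.mtu, self.cwnd * beta)
        else:
            # Path looks uncongested: grow by one MTU, never beyond the BDP.
            self.cwnd = min(self.bdp, self.cwnd + self.mtu)
        return self.cwnd


# Made-up path: 100 Gbit/s link with a 10 microsecond base RTT.
win = RttWindow(base_rtt_us=10, link_rate_gbps=100)
after_congestion = win.on_ack(rtt_sample_us=30)  # inflated RTT
after_recovery = win.on_ack(rtt_sample_us=10)    # RTT back at baseline
print(win.bdp, after_congestion, after_recovery)
```

Because the sender reacts to measured RTT alone, nothing in this loop depends on switch-generated pause frames, which is the property the contribution above emphasizes.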

DSDE · IEEE BigData 2025 · Peer-reviewed

Problem. A fixed speculation length in speculative decoding wastes compute and creates batch stragglers under diverse serving load.

Contribution. A training-free, KLD-variance-driven per-sequence speculation length with an adaptive cap, implemented in vLLM — robust in low-acceptance-rate regimes.

Role. Senior co-author and research lead; drove the adaptive speculation-length cap and the vLLM integration.
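A rough sketch of a variance-driven, per-sequence speculation length with a batch-level cap follows. It is not the DSDE method and uses no vLLM API: the mapping from KLD variance to length, the `sensitivity` constant, and the cap rule (the batch mean rounded up) are hypothetical choices made only to illustrate the shape of such a policy.

```python
import math
from statistics import pvariance


def speculation_length(kld_history, min_len=1, max_len=8, sensitivity=10.0):
    """Map the variance of recent draft-vs-target KLD values to a length.

    Assumption: stable KLD (low variance) permits longer speculation.
    """
    if len(kld_history) < 2:
        return min_len
    var = pvariance(kld_history)
    scale = 1.0 / (1.0 + sensitivity * var)
    return max(min_len, min(max_len, int(max_len * scale)))


def plan_batch(histories):
    """Pick a per-sequence length, then cap it to limit batch stragglers."""
    raw = {seq: speculation_length(h) for seq, h in histories.items()}
    # Hypothetical adaptive cap: the batch mean, rounded up.
    cap = math.ceil(sum(raw.values()) / len(raw))
    return {seq: min(length, cap) for seq, length in raw.items()}, cap


# Made-up KLD traces for three sequences in one batch.
histories = {
    "stable": [0.1, 0.1, 0.1, 0.1],
    "noisy": [0.0, 1.0, 0.0, 1.0],
    "mild": [0.2, 0.4],
}
lengths, cap = plan_batch(histories)
print(lengths, cap)
```

The stable sequence would speculate furthest on its own, but the cap pulls it back toward the rest of the batch so that one long speculation does not hold up the others.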

Selected publications

Peer-reviewed

H. Lee, D. Baek, J. Son, J. Choi, K. Moon, M. Jang. PAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM. IEEE HPCA 2025.
G. Yun, J. Kang, H. Jeong, S. Eom, M. Jang, Y.-r. Choi. JABAS: Joint Adaptive Batching and Automatic Scaling for DNN Training on Heterogeneous GPUs. EuroSys 2025 (with UNIST).
D. Baek, J. Son, J. Choi, K. Bin, S. Choi, K. Moon, M. Jang, H. Lee. Omni-MP: Practical PIM Matrix Mapping for Accelerating LLM Inference on Real PIM Hardware. MCCSys @ ACM ICS 2026 (to appear).
M. Yang, J.-Y. Choi, K. Moon, M. Jang, E. Jeon. DSDE: Dynamic Speculative Decoding with KLD Stability for Real-World Serving. IEEE BigData 2025.
J.-H. Cha, S. Kang, Y. Kang, H. Seo, J. Lee, J. Kim, M. Jang. CORN: Cloud-optimized RDMA Networking. IEEE IPCCC 2023.

Preprints & under review

K. Bin, S. Choi, J. Son, J. Choi, D. Bae, D. Baek, K. Moon, M. Jang, H. Lee. FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving. preprint · under review · arXiv:2509.06261

Full list: Google Scholar · DBLP

Selected talks

SC25 (2025) — CORN implemented as a hardware-NIC prototype in collaboration with AMD, demonstrated at SC25.
Intel AI Summit Seoul (2025) — Performance analysis of Intel Gaudi 3 for LLM inference.
Seoul National University (2024) — Trends in AI infrastructure technologies (invited).
NVIDIA GTC (2021) — AI model training cluster on Kubernetes (x.Cloud).

Research & industry leadership

In AI Systems Research at Samsung SDS, set a full-stack AI-infrastructure research agenda — GPU-as-a-Service / x.Cloud, PIM-accelerated inference, and AI-datacenter networking — and built a publishing systems-research program (HPCA, IEEE BigData, MCCSys@ICS). Earlier, conducted GPU-accelerated networking and 5G vRAN research at AT&T Labs Research.

Research collaborations: UC Berkeley Sky Computing Lab — represented and led Samsung SDS's participation as a founding sponsor · UNIST · KAIST · AMD · Intel · Ultra Ethernet Consortium.

Open source · Personal research artifacts

Local-Inference-System (LIS) · Apache-2.0 · CPU-only local inference runtime focused on correctness, inspectability, reproducible artifacts, and transparent diagnostics.
llm-avx-lab · MIT · educational AVX2/FMA kernels for understanding LLM serving primitives from scalar C to SIMD.

About

Minsung Jang is a systems researcher whose work spans the full AI-infrastructure stack. He conducts AI systems research at Samsung SDS and currently serves as an Executive Advisor. He was previously at AT&T Labs Research, and earned his Ph.D. in Computer Science at the Georgia Institute of Technology (advisor: Prof. Karsten Schwan), with earlier degrees from Yonsei University.

Samsung SDS: AI and Cloud Systems Research · 2021–present
AT&T Labs Research: GPU-accelerated NFV, 5G vRAN, persistent memory · 2015–2020
Georgia Institute of Technology: Ph.D., Computer Science · 2008–2015
Yonsei University: B.S. 1998, M.S. 2000

Contact: mdpe36kr@gmail.com