Optimizing vLLM Inference Latency: A Deep Dive into PagedAttention
Exploring how memory fragmentation impacts large model serving and how PagedAttention solves the KV cache bottleneck in production environments.
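The core idea behind PagedAttention can be illustrated with a toy block allocator: instead of reserving one contiguous KV cache region per sequence, memory is split into fixed-size blocks that any sequence claims on demand. A minimal sketch, assuming 16-token blocks (the `BlockAllocator` class and its names are illustrative, not vLLM's actual API):

```python
# Toy sketch of PagedAttention-style KV cache paging (illustrative names,
# not vLLM's real internals). Fixed-size blocks are handed out on demand,
# so memory is wasted only in each sequence's last partial block.

BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # seq_id -> list of block ids

    def append_token(self, seq_id, seq_len):
        """Map the newest logical token to a physical block."""
        table = self.tables.setdefault(seq_id, [])
        if (seq_len - 1) % BLOCK_SIZE == 0:  # current block full (or first token)
            table.append(self.free.pop())    # grab any free block; no contiguity needed
        return table[-1]

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free.extend(self.tables.pop(seq_id, []))

alloc = BlockAllocator(num_blocks=8)
for pos in range(1, 40):                     # stream 39 tokens for one sequence
    alloc.append_token("seq-0", pos)
print(len(alloc.tables["seq-0"]))            # 3 blocks: ceil(39 / 16)
```

Because freed blocks go straight back into a shared pool, a new request can reuse them immediately, which is how paging avoids the external fragmentation that contiguous per-sequence allocation suffers from.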
Read on Medium

Bridging Academic Research and Industry Engineering to build AI systems that understand, reason, and collaborate with humans.
I am a Master's student in Artificial Intelligence at the University of Pennsylvania. My research focuses on Large Language Models (LLMs) and Brain-Computer Interfaces (BCIs), aiming to improve LLM reasoning, alignment, and interaction quality: building models that understand, respond, and collaborate with humans more effectively across modalities.
In industry, I work as a Machine Learning Engineer specializing in Large Language Model systems, building production-grade GenAI applications using vLLM, TensorRT-LLM, RAG pipelines, and agentic workflows. I design scalable LLM inference infrastructure and full-stack solutions across Azure, AWS, and GCP, integrating efficient serving, orchestration, and cloud deployment into real-world AI systems.
A showcase of my production-grade solutions.
Deep dives into LLM Systems, Engineering, and AI Research.
Why simple vector search isn't enough: implementing hybrid search, query rewriting, and re-ranking pipelines for enterprise-grade QA systems.
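One common way to fuse keyword and vector results in a hybrid pipeline (not necessarily the exact method in the article) is Reciprocal Rank Fusion. A minimal sketch, with invented doc ids and ranked lists:

```python
# Sketch of hybrid retrieval via Reciprocal Rank Fusion (RRF).
# The doc ids and both ranked lists below are made up for illustration.

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc ids; higher fused score = better."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1/(k + rank + 1); k damps the top ranks.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# One list from keyword (BM25-style) search, one from dense vector search.
keyword_hits = ["doc3", "doc1", "doc7"]
vector_hits  = ["doc1", "doc9", "doc3"]
fused = rrf_fuse([keyword_hits, vector_hits])
print(fused[0])  # "doc1": strong in both lists beats a single top hit
```

RRF needs no score calibration between retrievers, which is why it is a popular first fusion step before a cross-encoder re-ranker refines the top results.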
Designing a high-throughput API gateway for multiple LLM providers with rate limiting, fallback strategies, and semantic caching.
Whether you're interested in research collaboration, industry projects, or just want to discuss AI and LLMs, I'd love to hear from you.