Thesis supervision topics & Working with me
I am currently offering thesis topic supervision to students at TU Darmstadt. If you are not a student here but are interested in collaboration, feel free to reach out for a chat.
Currently, I am working closely with GPU data processing and file formats such as Apache Parquet.
If you are interested in any of these topics or would like to propose a custom one, please contact me with a brief self-introduction. I value motivation and prior experience more than grades. I look forward to working with motivated students!
GPU-Accelerated Data Systems
I have developed a multi-GPU data processing system prototype, which can be extended along several dimensions through the following topics. Each topic shares the same prerequisite: a solid understanding of NVIDIA GPU architecture and CUDA programming (memory management, streams, kernels, and so on).
Topic: GPU Cost-Performance in Cloud Query Processing
- While prior work has explored cost-optimal query processing using CPU-based cloud instances, there is limited understanding of how GPU hardware affects cost-performance trade-offs in cloud environments.
- Resources: [VLDB 2021, CODAC], Slides, [CIDR 2021, CODAC], [HPTS 2024, Sirius], [CIDR 2026, Sirius], [CIDR 2026, CODAC]
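As a rough illustration of the trade-off in question, the sketch below compares the cost per query on a CPU instance and a GPU instance billed by the hour. All prices and runtimes are hypothetical placeholders, not measurements, and the function names are made up for this example.

```python
def cost_per_query(price_per_hour_usd: float, runtime_seconds: float) -> float:
    """Dollar cost of one query on an instance billed by wall-clock time."""
    return price_per_hour_usd * runtime_seconds / 3600.0


def breakeven_speedup(cpu_price_per_hour: float, gpu_price_per_hour: float) -> float:
    """Speedup the GPU instance needs so that its cost per query matches the CPU instance."""
    return gpu_price_per_hour / cpu_price_per_hour


# Hypothetical numbers, for illustration only.
cpu_cost = cost_per_query(price_per_hour_usd=2.0, runtime_seconds=90.0)
gpu_cost = cost_per_query(price_per_hour_usd=8.0, runtime_seconds=18.0)
needed = breakeven_speedup(cpu_price_per_hour=2.0, gpu_price_per_hour=8.0)

print(f"CPU: ${cpu_cost:.3f}/query, GPU: ${gpu_cost:.3f}/query, break-even speedup: {needed:.1f}x")
```

With these placeholder values the GPU instance costs four times as much per hour, so it only becomes cheaper per query once it runs the query more than four times faster. A real study would also have to account for factors such as data transfer, instance startup, and utilization.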
Topic: Explore Cloud Object Storage with GPU Data Processing
- Although remote I/O to cloud object storage is supported, there is currently no direct I/O: data is typically read into host memory and then asynchronously copied into device memory. There has not yet been a clear comparison of query processing performance under this setup.
- Resources: [VLDB 2023, AnyBlob], [NVIDIA Blog] High-Performance Remote IO With NVIDIA KvikIO
Topic: Clever Handling of Spilling in GPU OOM Scenarios
- Due to GPU memory limitations, OOM issues are unavoidable in GPU data processing. While NVIDIA provides unified or managed memory, naive spilling can introduce significant overhead because data moves back and forth over PCIe. We are seeking a clear strategy to overcome this challenge.
- Resources: [SIGMOD 2025, Umami], [RMM FEA Issue] Expose CUDA 13 async pools for managed and pinned memory
File Formats
There has been an ongoing discussion about the configuration of Parquet and to what extent it resembles a proprietary file format, mainly between Xiangpeng, Andrew, and myself. Several possible research directions are outlined in the story issue. If you are interested in this topic, we can define a concrete angle that you would enjoy working on.
Requirements:
- System programming on CPU with C++ or Rust
- Experience with Parquet, arrow-rs, or Apache DataFusion is a strong plus