Hi, I'm Mengyu. I completed my master's degree in Mechatronics and Robotics at the Technical University of Munich. During my studies, I was advised by Dr. Hu Cao on neuromorphic vision for autonomous driving and supervised by Prof. Alois Knoll at the Lehrstuhl für Robotik, Künstliche Intelligenz und Echtzeitsysteme. I also took part in the DISRUPT project with Prof. Klaus Kefferpütz and worked with Longfei Han at Fraunhofer IVI.

My academic interests center on multi-modal fusion for scene understanding, especially within autonomous driving and robotics. I am also curious about how large language models (LLMs) can support more reliable and safer scene analysis. I am currently looking for a PhD position where I can continue exploring multi-modal fusion and contribute to advancing research in scene understanding.

Research

RGB-X Fusion

Multimodal Fusion of RGB and Complementary Modalities for Semantic Segmentation

Mengyu Li et al.
Preprint, submitted to ICLR 2026 (under review)

Multi-modal semantic segmentation augments RGB imagery with an auxiliary sensing stream X (RGB-X)—such as thermal, depth, LiDAR, event, polarization, or light field—to enhance robustness under adverse illumination and motion blur. However, sensor heterogeneity often leads to misaligned features and unstable fusion.

To alleviate these issues, we propose a bidirectional polarity-aware cross-modality fusion (BPCF) module that effectively captures complementary cues while improving feature alignment. We evaluate the framework on six modality pairings (RGB-Thermal, RGB-LiDAR, RGB-Depth, RGB-Event, RGB-Polarization, and RGB-Light Field) and achieve state-of-the-art results on eight public datasets, including MFNet, KITTI-360, DELIVER, DDD17, DSEC, MCubeS, ZJU, and UrbanLF.
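For illustration, the sketch below shows one common way a bidirectional cross-modality block can let each stream re-weight the other before merging. The gating design, layer choices, and class names here are simplified assumptions and do not reproduce the actual BPCF architecture from the paper.

```python
import torch
import torch.nn as nn

class BidirectionalFusionBlock(nn.Module):
    """Illustrative cross-modality fusion: each stream is gated by the other.

    A simplified sketch, not the BPCF module proposed in the paper.
    """
    def __init__(self, channels: int):
        super().__init__()
        # Gates that predict per-pixel weights from the opposite modality.
        self.gate_rgb = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.gate_x = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        # 1x1 convolution to merge the two enhanced streams.
        self.merge = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, feat_rgb: torch.Tensor, feat_x: torch.Tensor) -> torch.Tensor:
        # RGB features are re-weighted by cues from the X modality, and vice versa.
        rgb_enhanced = feat_rgb * self.gate_rgb(feat_x) + feat_rgb
        x_enhanced = feat_x * self.gate_x(feat_rgb) + feat_x
        return self.merge(torch.cat([rgb_enhanced, x_enhanced], dim=1))

# Example: fuse two 256-channel feature maps at 1/8 resolution.
fusion = BidirectionalFusionBlock(channels=256)
fused = fusion(torch.randn(1, 256, 60, 80), torch.randn(1, 256, 60, 80))
```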

Master's Thesis

Multi-Modal Fusion of Image Sequences for Dense Prediction with RGB and Event Cameras in Autonomous Driving

Mengyu Li
Master's Thesis, Technical University of Munich

Integrating RGB and event camera data through multi-modal fusion in autonomous driving significantly enhances dense prediction tasks such as depth estimation and object detection. RGB cameras provide high-resolution color imagery crucial for visual perception. In contrast, event cameras offer high temporal resolution and dynamic range, capturing pixel-level changes caused by motion even in challenging lighting conditions.

By fusing the continuous visual stream from RGB cameras with the asynchronous intensity changes captured by event cameras, this work constructs a more comprehensive and robust representation of the environment. The thesis focuses primarily on combining these multi-modal features for semantic segmentation.
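As a rough illustration of how an asynchronous event stream becomes something a standard encoder can consume, the snippet below accumulates events into a voxel grid before fusion with RGB features. The bin count, normalization, and function name are illustrative assumptions rather than the exact preprocessing used in the thesis.

```python
import numpy as np

def events_to_voxel_grid(events: np.ndarray, num_bins: int, height: int, width: int) -> np.ndarray:
    """Accumulate an event stream of rows (t, x, y, polarity) into a dense
    voxel grid so it can be processed by a standard CNN encoder.

    Illustrative preprocessing; bin count and normalization are example choices.
    """
    voxel = np.zeros((num_bins, height, width), dtype=np.float32)
    t = events[:, 0]
    x, y = events[:, 1].astype(int), events[:, 2].astype(int)
    p = events[:, 3]
    # Normalize timestamps to [0, num_bins - 1] and distribute polarities into bins.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
    bins = np.clip(t_norm.astype(int), 0, num_bins - 1)
    np.add.at(voxel, (bins, y, x), np.where(p > 0, 1.0, -1.0))
    return voxel

# Example: 5 temporal bins for a VGA-resolution event camera.
dummy_events = np.array([[0.00, 10, 20, 1], [0.01, 11, 20, 0], [0.02, 12, 21, 1]])
grid = events_to_voxel_grid(dummy_events, num_bins=5, height=480, width=640)
```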

TUMTraf EMOT

Event-Based Multi-Object Tracking Dataset and Baseline for Traffic Scenarios

Mengyu Li
Semester Thesis, Technical University of Munich

In Intelligent Transportation Systems (ITS), multi-object tracking primarily relies on frame-based cameras. However, these cameras tend to perform poorly in dim lighting and under high-speed motion. Event cameras, characterized by low latency, high dynamic range, and high temporal resolution, have considerable potential to mitigate these issues. Compared to frame-based vision, event-based vision remains far less studied. To address this research gap, we introduce a dataset tailored for event-based ITS, covering vehicle and pedestrian detection and tracking.

Based on this dataset, we establish a tracking-by-detection benchmark with a specialized feature extractor. The experimental results demonstrate the strong performance of our method. We hope this work facilitates further research on the use of event cameras for ITS.
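For context, the snippet below sketches the association step at the core of a generic tracking-by-detection pipeline: existing tracks are matched to new detections by IoU with a Hungarian assignment. It is a simplified baseline illustration and does not include the specialized feature extractor used in the benchmark.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks: np.ndarray, detections: np.ndarray, iou_threshold: float = 0.3):
    """Match existing track boxes to new detection boxes by maximizing total
    IoU (Hungarian assignment), the core step of tracking-by-detection."""
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    # Keep only matches whose overlap exceeds the threshold.
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_threshold]
```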

Internship

Decentralized Tracking in the Context of the DISRUPT Project

Mengyu Li
Internship, Fraunhofer IVI

During my internship at Fraunhofer IVI, I worked on making multi-camera object tracking in urban traffic more robust and scalable. I built a decentralized tracking framework that keeps object identities consistent across cameras, even when objects are occluded or leave and re-enter the scene. I also improved how information from multiple cameras is fused so that the system remains stable and reliable as more sensors are added.
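As a simplified illustration of the cross-camera identity problem, the sketch below matches track IDs from two cameras by cosine similarity of appearance embeddings. The threshold and the embedding source are hypothetical and not the re-identification approach actually used in the project.

```python
import numpy as np

def match_tracks_across_cameras(embeddings_cam_a: dict, embeddings_cam_b: dict,
                                similarity_threshold: float = 0.7) -> dict:
    """Associate track IDs from two cameras by cosine similarity of appearance
    embeddings, so an object keeps one global identity when it appears in the
    field of view of another sensor. Illustrative sketch only."""
    mapping = {}
    for id_a, emb_a in embeddings_cam_a.items():
        best_id, best_sim = None, similarity_threshold
        for id_b, emb_b in embeddings_cam_b.items():
            sim = float(np.dot(emb_a, emb_b) /
                        (np.linalg.norm(emb_a) * np.linalg.norm(emb_b) + 1e-9))
            if sim > best_sim:
                best_id, best_sim = id_b, sim
        if best_id is not None:
            mapping[id_a] = best_id  # camera-A track inherits camera-B's identity
    return mapping
```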

In addition, I designed the communication and deployment setup so the tracking system can run both on real vehicles and in cloud-based simulations. I created a lightweight telemetry and ROS2-based communication layer that supports wireless connections between cars and the backend, and packaged the whole stack into containers for easy reuse by project partners. I then evaluated network latency in different settings to show that the architecture is suitable for real-time, distributed perception research.
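To give a flavor of the ROS2 side, the minimal node below periodically publishes a serialized track list. The topic name, message type (a plain JSON string for simplicity), and publishing rate are placeholder assumptions, not the interfaces used in the project.

```python
import json
import rclpy
from rclpy.node import Node
from std_msgs.msg import String

class TrackPublisher(Node):
    """Minimal ROS2 node that periodically publishes tracked-object state.

    Topic name, message type, and rate are placeholder choices for illustration."""
    def __init__(self):
        super().__init__('track_publisher')
        self.publisher = self.create_publisher(String, 'tracked_objects', 10)
        self.timer = self.create_timer(0.1, self.publish_tracks)  # 10 Hz

    def publish_tracks(self):
        msg = String()
        msg.data = json.dumps({'tracks': [{'id': 1, 'x': 12.3, 'y': -4.5}]})
        self.publisher.publish(msg)

def main():
    rclpy.init()
    node = TrackPublisher()
    rclpy.spin(node)
    rclpy.shutdown()

if __name__ == '__main__':
    main()
```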