Ahmed Sam Fouad
Senior Software Engineer in HPC, AI, Numerics, and Graphics
Summary
9+ years of experience in HPC, Graphics, and AI optimization. Proven expertise in developing high-fidelity render engines, designing safety-critical automotive architectures, and optimizing large AI models for edge computing. Extensive experience in multi-core/multi-node simulations and high-performance communication systems, with a strong focus on optimizing complex data structures, building scalable databases, and implementing efficient search and compression algorithms.
Experience
Senior ML Optimization Engineer / Core AI
CARIAD, Volkswagen, Munich, Germany | 09/24 - Present | Full-time
- Optimized ONNX models for edge deployment through post-training quantization (PTQ) and pruning, and orchestrated multi-node reduction ops on experimental Qualcomm Hexagon NPU hardware
- Developed and optimized custom operations for Qualcomm's Hexagon NPU (HTP, HVX, HMX) cores, achieving compute-bound performance at near-theoretical peak efficiency using 4-way SIMD VLIW vector ops
- Enhanced QNN AI model inference runtime on edge hardware through state-of-the-art quantization and pruning, integrating experimental TVM workflows to compile optimized kernels
- Maintained and refactored vision models' multi-accelerator offloading environment supporting CUDA, HTP, CPU, and ONNX ops, experimenting with XLA for JIT compilation of tensor operations
- Prototyped distributed training setups using NCCL for multi-GPU synchronization, adapting an MPI-based single CPU-GPU offloading training setup to collective operations for direct multi-GPU offload from a single CPU host, reducing sync barriers by nearly 40%
- Provided technical support to feature teams on vision model deployment and performance optimization
Senior Software Engineer / HMI Infotainment Cluster
CARIAD, Volkswagen, Munich, Germany | 03/22 - 08/24 | Full-time
- Architected and implemented robust HMI backend algorithms, optimizing the processing of complex HD map data
- Implemented zero-copy inter-process communication using iceoryx and low-latency RPC with FlatBuffers for the HMI abstraction backend module in Simulation-in-the-Loop (SiL)
- Reduced interface maintenance by 90% through an automated interface-generation pipeline using Jinja2 templates
- Optimized HMI backend performance, achieving 2x core kernel speedup and 30% memory footprint reduction via advanced C++ template meta-programming, enabling multi-target deployment and requiring significantly less maintenance
- Led development of the HMI infotainment digital twin cluster initiative using the Qt SDK
- Spearheaded dynamic ADAS HMI communication stack development, adopted across multiple projects
- Eliminated heavy, repetitive use of interface adapters through lightweight variant-based interface decomposition
- Designed a unified HMI backend interface that statically carries predefined and compiled custom payloads between multiple customer functions
- Developed enabler GUI tooling for visually designing custom payloads for the HMI interface, using React and React Flow
- Migrated the development environment from VM-based development to a local devhost in WSL2 with Docker devcontainers
Graduate Research Assistant / ExaHyPE 2
Technische Universität München, Germany | 04/23 - 12/23 | Part-time
- Accelerated the ExaHyPE Rusanov solver with CUDA-based Euler kernels, profiling the bandwidth-bound problem and achieving near-theoretical peak performance
- Improved GPU offloading efficiency for complex grid geometries using CUDA
- Achieved a 28x speedup over baseline OpenMP implementations
Software Engineer / ADAS Simulation
Altran Deutschland, Munich, Germany | 10/18 - 02/22 | Full-time
- Co-developed the Linux Model Environment (LiME) server for robust ADAS sensor simulation
- Built LiME Juicer, a high-fidelity 3D environment rendering engine based on Open Simulation Interface (OSI)
- Integrated OSI, FMI, and Protobuf standards into LiME Juicer
- Designed the multi-threaded/multi-process architecture of a real-time render engine for sensor data and point-cloud OSI payloads at a steady 60 FPS, using CUDA, OpenMP, MPI, and OpenGL
- Achieved 6x rendering speedup through CUDA kernel optimization
- Developed reliable low-latency communication drivers using TCP/IPv4, SOME/IP, FMU, and ZMQ
Software Engineer / Automotive Steering Development
Robert Bosch, Germany/Vietnam | 12/17 - 09/18 | Full-time
- Developed and deployed an ASIL-D-compliant functional Auto Steering module, backed by comprehensive SWRE requirement analysis
- Implemented robust CAN and UDS protocols with 100% test coverage
- Reviewed and maintained functional safety requirements and software against ASIL-D, MISRA, and ISO 26262 standards
- Developed user-friendly GUI/CLI applications using Tkinter and Qt for configuration management and for automation of repetitive tasks
Embedded Systems Engineer
FPT Software Co./Renesas RVC, Vietnam | 07/15 - 11/17 | Full-time
- Developed a lightweight car rear-view camera driver
- Optimized SPI and MCU drivers for i.MX6, reducing latency by 10% on bare-metal x86_64 Linux
- Migrated legacy modules to AUTOSAR BSW drivers
Field Maintenance Engineer
General Motors, Egypt | 02/15 - 05/15 | Full-time
- Maintained PLC-based plant operations and developed efficient maintenance plans to reduce production downtime
- Proposed and implemented IP to integrate advanced sensors into production lines, improving ergonomics and safety
Projects
Surflow - Chrome Extension
Personal Project | Chrome Store
- Streamlines browsing and tab management
- Built with TypeScript, React, and Chrome Extension APIs
Distributed ONNX Training with NCCL
Personal Project | 09/23
- Developed a PoC for distributed training of an ONNX-based CNN model across two local GPUs using NCCL’s AllReduce and Broadcast ops
- Achieved ~1.5x speedup over single-GPU baseline by optimizing inter-GPU networking with NCCL over NVLink
TVM and XLA Optimization for TPU-Accelerated Inference Prototype
TUM | Advanced Computer Architecture Seminar | 02/23
- Developed a demo using TVM and XLA to compile and optimize an ONNX model for runtime inference on edge hardware (Google's TPU), integrating PyTorch with XLA’s JIT compilation
- Used AutoTVM to tune convolution kernels and XLA to fuse tensor operations, reducing inference latency and improving throughput by 10% compared to the baseline ONNX Runtime
gem5 Simulation Model to Evaluate an Intel Skylake-based CPU
Case study
- Developed a gem5 simulation model to evaluate Intel Skylake-based CPU performance for HPC workloads
- Implemented a custom out-of-order core model in C++ within gem5’s x86 framework, simulating a 4-core Skylake processor with a 3-level cache hierarchy
- Configured gem5 to run an OpenMP-parallelized matrix multiplication benchmark and an MPI-based stencil computation workload
- Analyzed cache miss rates and branch prediction accuracy, optimizing prefetching to reduce memory latency by 12% in simulations
Publications
Guided Research
TUM | Optimizing GPU Offloading with CUDA for a Patch-based Hyperbolic Finite Volume Solver in ExaHyPE
Education
M.Sc. Computational Science and Engineering (Part-time)
Technical University of Munich, Germany | 10/20 - 09/25 | Grade: 1.7/1.0
Specialized in HPC, High Dimensional Data Reduction, Deep Learning, and Scientific Computing
B.Sc. Mechatronics Engineering
Ain Shams University, Egypt | 09/07 - 06/13 | GPA: 2.7/4.0
With graduation project GPA of 4.0/4.0.
Secondary School - Natural Sciences
Al-Shati Secondary School, Saudi Arabia | 09/04 - 06/07 | GPA: 3.9/4.0
Graduated with distinction, studied natural and engineering sciences
Languages
- Arabic: Native
- English: Full fluency (C1.2)
- German: Upper-Intermediate (B2.1)
Email
+49 160 57 54 511
GitHub
LinkedIn