Ahmed Sam Fouad

Senior Software Engineer in HPC, AI, Numerics, and Graphics

+49 160 57 54 511

GitHub

Resume

Summary

9+ years of experience in HPC, Graphics, and AI optimization. Proven expertise in developing high-fidelity render engines, designing safety-critical automotive architectures, and optimizing large AI models for edge computing. Extensive experience in multi-core/multi-node simulations and high-performance communication systems, with a strong focus on optimizing complex data structures, building scalable databases, and implementing efficient search and compression algorithms.

C++ Python Qt ROS/ROS2 CUDA OpenMP MPI SIMD AVX Intrinsics Qualcomm HTP TensorFlow PyTorch ONNX OpenGL HPC ADAS AUTOSAR Edge Computing CI/CD Jenkins GitHub Workflows Microservices Docker React SQLite HTML CSS

Experience

Senior ML Optimization Engineer / Core AI

CARIAD, Volkswagen, Munich, Germany | 09/24 - Present | Full-time

Optimized ONNX models for edge deployment through post-training quantization (PTQ) and pruning, and orchestrated multi-node reduction ops on experimental Qualcomm Hexagon NPU hardware
Developed and optimized custom operations for Qualcomm's Hexagon NPU (HTP, HVX, HMX) cores, achieving compute-bound performance near theoretical peak using 4-way SIMD VLIW vector ops
Enhanced QNN AI model inference runtime on edge hardware through state-of-the-art quantization and pruning, integrating experimental TVM workflows to compile optimized kernels
Maintained and refactored vision models' multi-accelerator offloading environment supporting CUDA, HTP, CPU, and ONNX ops, experimenting with XLA for JIT compilation of tensor operations
Prototyped distributed training setups using NCCL for multi-GPU synchronization, adapting an MPI-based single CPU-GPU offloading training setup to collective operations for direct multi-GPU offload from a single CPU host, reducing sync barriers by nearly 40%
Provided technical support for feature teams in vision models deployment and performance optimization

C++17/20 Python Assembly VLIW QNN CUDA ONNX Qualcomm HTP TensorFlow PyTorch Microbenchmarking CNN Vision Models NCCL TVM

Senior Software Engineer / HMI Infotainment Cluster

CARIAD, Volkswagen, Munich, Germany | 03/22 - 08/24 | Full-time

Architected and implemented robust HMI backend algorithms, optimizing the processing of complex HD map data
Implemented zero-copy inter-process communication using Iceoryx and low-latency RPC with Flatbuffers for the HMI abstraction backend module in Simulation-in-the-Loop (SiL)
Reduced interface maintenance by 90% through an automated interface-generation pipeline built on Jinja2 templates
Optimized HMI backend performance, achieving 2x core kernel speedup and 30% memory footprint reduction via advanced C++ template meta-programming, enabling multi-target deployment and requiring significantly less maintenance
Led development of HMI infotainment digital twin cluster initiative in Qt SDK
Spearheaded dynamic ADAS HMI communication stack development, adopted across multiple projects
Eliminated heavy, repetitive use of interface adapters through lightweight variant-based interface decomposition
Designed a unified HMI backend interface that statically carries predefined and compiled custom payloads between multiple customer functions
Developed enabler GUI tooling for visually designing custom payloads for the HMI interface, using React and React Flow
Migrated the development environment from VM-based development to a local devhost in WSL2 with Docker devcontainers

C++14/17 Qt AOS ROS Perf OpenMP SIMD Qualcomm GTEST-GMOCK JS React Jinja2 Python

Graduate Research Assistant / ExaHyPE 2

Technische Universität München, Germany | 04/23 - 12/23 | Part-time

Accelerated the ExaHyPE Rusanov solver with CUDA-based Euler kernels, profiling the bandwidth-bound problem and achieving near-theoretical peak performance
Improved GPU offloading efficiency for complex grid geometries using CUDA
Achieved a 28x speedup over baseline OpenMP implementations

C++20 Python OpenMP CUDA Perf PDE Jinja2 Numerical Solvers

Software Engineer / ADAS Simulation

Altran Deutschland, Munich, Germany | 10/18 - 02/22 | Full-time

Co-developed Linux Model Environment (LiME) server for robust ADAS sensor simulation
Built LiME Juicer, a high-fidelity 3D environment rendering engine based on Open Simulation Interface (OSI)
Integrated OSI, FMI, and Protobuf standards into LiME Juicer
Designed the multi-threaded/multi-process render architecture for a real-time engine that renders sensor data and point-cloud OSI payloads at a steady 60 FPS, using CUDA, OpenMP, MPI, and OpenGL
Achieved 6x rendering speedup through CUDA kernel optimization
Developed reliable low-latency communication drivers using TCP/IPv4, SOME/IP, FMU, and ZMQ

C++14/17 Qt Bazel ROS/ROS2 OpenGL CUDA GTEST-GMOCK AVX intrinsics SIMD

Software Engineer / Automotive Steering Development

Robert Bosch, Germany/Vietnam | 12/17 - 09/18 | Full-time

Developed and deployed an ASIL-D-compliant functional Auto Steering module, backed by comprehensive SWRE requirement analysis
Implemented robust CAN and UDS protocols with 100% test coverage
Reviewed and maintained functional safety requirements and software for ASIL-D, MISRA, and ISO 26262 standards
Developed user-friendly GUI/CLI applications using Tkinter and Qt for configuration management and for automation of repetitive tasks

C C++99/14 Assembly Qt CANoe CANalyzer CAPL Lauterbach Debug DOORS

Embedded Systems Engineer

FPT Software Co./Renesas RVC, Vietnam | 07/15 - 11/17 | Full-time

Developed a lightweight car rear-view camera driver
Optimized SPI and MCU drivers for i.MX6, reducing latency by 10% on bare-metal x86_64 Linux
Migrated legacy modules to AUTOSAR BSW drivers

C C++ NXP iMX8 AUTOSAR Python

Field Maintenance Engineer

General Motors, Egypt | 02/15 - 05/15 | Full-time

Maintained PLC-based plant operations and developed efficient maintenance plans to reduce production downtime
Proposed and implemented IP to integrate advanced sensors into production lines, improving ergonomics and safety

PLC Fuzzy Logic Python SolidWorks

Projects

Surflow - Chrome Extension

Personal Project | Chrome Store

Streamlines browsing and tab management
Built with TypeScript, React, and Chrome Extension APIs

JavaScript React Chrome API SQLite HTML CSS

Distributed ONNX Training with NCCL

Personal Project | 09/23

Developed a PoC for distributed training of an ONNX-based CNN model across two local GPUs using NCCL’s AllReduce and Broadcast ops
Achieved ~1.5x speedup over single-GPU baseline by optimizing inter-GPU networking with NCCL over NVLink

Python PyTorch ONNX NCCL CUDA

TVM and XLA Optimization for TPU-Accelerated Inference Prototype

TUM | Advanced Computer Architecture Seminar | 02/23

Developed a demo using TVM and XLA to compile and optimize an ONNX model for runtime inference on edge hardware (Google's TPU), integrating PyTorch with XLA’s JIT compilation
Used AutoTVM to tune convolution kernels and XLA to fuse tensor operations, reducing inference latency and improving throughput by 10% compared to the baseline ONNX Runtime

Python PyTorch ONNX TVM XLA Microbenchmarking

gem5 Simulation Model to Evaluate an Intel Skylake-based CPU

Case study

Developed a gem5 simulation model to evaluate Intel Skylake-based CPU performance for HPC workloads
Implemented a custom out-of-order core model in C++ within gem5’s x86 framework, simulating a 4-core Skylake processor with a 3-level cache hierarchy
Configured gem5 to run an OpenMP-parallelized matrix multiplication benchmark and an MPI-based stencil computation workload
Analyzed cache miss rates and branch prediction accuracy, optimizing prefetching to reduce memory latency by 12% in simulations

C++ gem5

Publications

Guided Research

TUM | Optimizing GPU Offloading with CUDA for a Patch-based Hyperbolic Finite Volume Solver in ExaHyPE

CUDA PTX Microbenchmarking

Education

M.Sc. Computational Science and Engineering (Part-time)

Technical University of Munich, Germany | 10/20 - 09/25 | Grade: 1.7/1.0

Specialized in HPC, High Dimensional Data Reduction, Deep Learning, and Scientific Computing

B.Sc. Mechatronics Engineering

Ain Shams University, Egypt | 09/07 - 06/13 | GPA: 2.7/4.0

Graduation project GPA: 4.0/4.0

Secondary School - Natural Sciences

Al-Shati Secondary School, Saudi Arabia | 09/04 - 06/07 | GPA: 3.9/4.0

Graduated with distinction, studied natural and engineering sciences

Languages

Arabic: Native
English: Full fluency (C1.2)
German: Upper-Intermediate (B2.1)