Ahmed Sam Fouad

Senior Software Engineer in HPC, AI, Numerics, and Graphics

Email | +49 160 57 54 511 | GitHub | LinkedIn | Resume

Summary

9+ years of experience in HPC, graphics, and AI optimization. Proven expertise in developing high-fidelity render engines, designing safety-critical automotive architectures, and optimizing large AI models for edge computing. Extensive experience in multi-core/multi-node simulations and high-performance communication systems, with a strong focus on optimizing complex data structures, building scalable databases, and implementing efficient search and compression algorithms.

C++ Python Qt ROS/ROS2 CUDA OpenMP MPI SIMD AVX Intrinsics Qualcomm HTP TensorFlow PyTorch ONNX OpenGL HPC ADAS AUTOSAR Edge Computing CI/CD Jenkins GitHub Workflows Microservices Docker React SQLite HTML CSS

Experience

Senior ML Optimization Engineer / Core AI

CARIAD, Volkswagen, Munich, Germany | 09/24 - Present | Full-time

  • Optimized ONNX models for edge deployment through post-training quantization (PTQ) and pruning, and orchestrated multi-node reduction ops on experimental Qualcomm Hexagon NPU hardware
  • Developed and optimized custom operations for Qualcomm's Hexagon NPU (HTP, HVX, HMX) cores, achieving compute-bound performance at near-theoretical peak efficiency using efficient 4-way SIMD VLIW vector ops
  • Enhanced QNN AI model inference runtime on edge hardware through state-of-the-art quantization and pruning, integrating experimental TVM workflows to compile optimized kernels
  • Maintained and refactored vision models' multi-accelerator offloading environment supporting CUDA, HTP, CPU, and ONNX ops, experimenting with XLA for JIT compilation of tensor operations
  • Prototyped distributed training setups using NCCL for multi-GPU synchronization, adapting an MPI-based single CPU-GPU offloading training setup to collective operations for direct multi-GPU offload from a single CPU host, reducing sync barriers by nearly 40%
  • Provided technical support to feature teams on vision model deployment and performance optimization
C++17/20 Python Assembly VLIW QNN CUDA ONNX Qualcomm HTP TensorFlow PyTorch Microbenchmarking CNN Vision Models NCCL TVM

Senior Software Engineer / HMI Infotainment Cluster

CARIAD, Volkswagen, Munich, Germany | 03/22 - 08/24 | Full-time

  • Architected and implemented robust HMI backend algorithms, optimizing the processing of complex HD map data
  • Implemented zero-copy inter-process communication using Iceoryx and low-latency RPC with Flatbuffers for the HMI abstraction backend module in Simulation-in-the-Loop (SiL)
  • Reduced interface maintenance by 90% through an automated interface-generation pipeline using Jinja2 templates
  • Optimized HMI backend performance, achieving 2x core kernel speedup and 30% memory footprint reduction via advanced C++ template meta-programming, enabling multi-target deployment and requiring significantly less maintenance
  • Led development of the HMI infotainment digital-twin cluster initiative using the Qt SDK
  • Spearheaded dynamic ADAS HMI communication stack development, adopted across multiple projects
  • Eliminated heavy, repetitive use of interface adapters through lightweight variant-based interface decomposition
  • Designed a unified HMI backend interface that statically carries predefined and compiled custom payloads between multiple customer functions
  • Developed enabler tooling GUI apps to visually design a custom payload for the HMI interface using React SDK and React Flow
  • Migrated the development environment from VM-based development to a local devhost inside WSL2 and Docker devcontainers
C++14/17 Qt AOS ROS Perf OpenMP SIMD Qualcomm GTEST-GMOCK JS React Jinja2 Python

Graduate Research Assistant / ExaHyPE 2

Technische Universität München, Germany | 04/23 - 12/23 | Part-time

  • Accelerated the ExaHyPE Rusanov solver with CUDA-based Euler kernels, profiling the bandwidth-bound problem and achieving near-theoretical peak performance
  • Improved GPU offloading efficiency for complex grid geometries using CUDA
  • Achieved a 28x speedup over baseline OpenMP implementations
C++20 Python OpenMP CUDA Perf PDE Jinja2 Numerical Solvers

Software Engineer / ADAS Simulation

Altran Deutschland, Munich, Germany | 10/18 - 02/22 | Full-time

  • Co-developed Linux Model Environment (LiME) server for robust ADAS sensor simulation
  • Built LiME Juicer, a high-fidelity 3D environment rendering engine based on Open Simulation Interface (OSI)
  • Integrated OSI, FMI, and Protobuf standards into LiME Juicer
  • Designed the multi-threaded/multi-process architecture of a real-time render engine sustaining a steady 60 FPS for sensor data and point-cloud OSI payloads, using CUDA, OpenMP, MPI, and OpenGL
  • Achieved 6x rendering speedup through CUDA kernel optimization
  • Developed reliable low-latency communication drivers using TCP/IPv4, SOME/IP, FMU, and ZMQ
C++14/17 Qt Bazel ROS/ROS2 OpenGL CUDA GTEST-GMOCK AVX intrinsics SIMD

Software Engineer / Automotive Steering Development

Robert Bosch, Germany/Vietnam | 12/17 - 09/18 | Full-time

  • Developed and deployed an ASIL-D-compliant functional Auto Steering module, backed by comprehensive SWRE requirement analysis
  • Implemented robust CAN and UDS protocols with 100% test coverage
  • Reviewed and maintained functional safety requirements and software for ASIL-D, MISRA, and ISO 26262 standards
  • Developed user-friendly GUI/CLI applications using Tkinter and Qt for configuration management and for automation of repetitive tasks
C C++99/14 Assembly Qt CANoe CANalyzer CAPL Lauterbach Debug DOORS

Embedded Systems Engineer

FPT Software Co./Renesas RVC, Vietnam | 07/15 - 11/17 | Full-time

  • Developed lightweight Car Rear View Camera driver
  • Optimized SPI and MCU drivers for i.MX6, reducing latency by 10% on bare-metal x86_64 Linux
  • Migrated legacy modules to AUTOSAR BSW drivers
C C++ NXP iMX8 AUTOSAR Python

Field Maintenance Engineer

General Motors, Egypt | 02/15 - 05/15 | Full-time

  • Maintained PLC-based plant operations and developed efficient maintenance plans to reduce production downtime
  • Proposed and implemented IP to integrate advanced sensors into production lines, improving ergonomics and safety
PLC Fuzzy Logic Python SolidWorks

Projects

Surflow - Chrome Extension

Personal Project | Chrome Store

  • Streamlines the browsing and tab management experience
  • Built with TypeScript, React, and Chrome Extension APIs
JavaScript React Chrome API SQLite HTML CSS

Distributed ONNX Training with NCCL

Personal Project | 09/23

  • Developed a PoC for distributed training of an ONNX-based CNN model across two local GPUs using NCCL’s AllReduce and Broadcast ops
  • Achieved ~1.5x speedup over single-GPU baseline by optimizing inter-GPU networking with NCCL over NVLink
Python PyTorch ONNX NCCL CUDA

TVM and XLA Optimization for TPU-Accelerated Inference (Prototype)

TUM | Advanced Computer Architecture Seminar | 02/23

  • Developed a demo using TVM and XLA to compile and optimize an ONNX model for runtime inference on edge hardware (Google's TPU), integrating PyTorch with XLA’s JIT compilation
  • Used AutoTVM to tune convolution kernels and XLA to fuse tensor operations, reducing inference latency and improving throughput by 10% compared to the baseline ONNX Runtime
Python PyTorch ONNX TVM XLA Microbenchmarking

Gem5 simulation model to evaluate Intel Skylake-based CPU

Case study

  • Developed a gem5 simulation model to evaluate Intel Skylake-based CPU performance for HPC workloads
  • Implemented a custom out-of-order core model in C++ within gem5’s x86 framework, simulating a 4-core Skylake processor with a 3-level cache hierarchy
  • Configured gem5 to run an OpenMP-parallelized matrix multiplication benchmark and an MPI-based stencil computation workload
  • Analyzed cache miss rates and branch prediction accuracy, optimizing prefetching to reduce memory latency by 12% in simulations
C++ gem5

Publications

Education

M.Sc. Computational Science and Engineering (Part-time)

Technical University of Munich, Germany | 10/20 - 09/25 | Grade: 1.7/1.0

Specialized in HPC, High Dimensional Data Reduction, Deep Learning, and Scientific Computing

B.Sc. Mechatronics Engineering

Ain Shams University, Egypt | 09/07 - 06/13 | GPA: 2.7/4.0

With graduation project GPA of 4.0/4.0.

Secondary School - Natural Sciences

Al-Shati Secondary School, Saudi Arabia | 09/04 - 06/07 | GPA: 3.9/4.0

Graduated with distinction, studied natural and engineering sciences

Languages

  • Arabic: Native
  • English: Full fluency (C1.2)
  • German: Upper-Intermediate (B2.1)

Last Update: March 2025