Why work at Nebius
Nebius is leading a new era in cloud computing to serve the global AI economy. We create the tools and resources our customers need to solve real-world challenges and transform industries, without massive infrastructure costs or the need to build large in-house AI/ML teams. Our employees work at the cutting edge of AI cloud infrastructure alongside some of the most experienced and innovative leaders and engineers in the field.
Where we work
Headquartered in Amsterdam and listed on Nasdaq, Nebius has a global footprint with R&D hubs across Europe, North America, and Israel. The team of over 1400 employees includes more than 400 highly skilled engineers with deep expertise across hardware and software engineering, as well as an in-house AI R&D team.
The role
At Nebius, we're building a next-generation AI compute platform for large-scale ML training and inference, from a few nodes to thousands of GPUs.
We're looking for a Technical Product Manager to lead Mission Control, the product area responsible for reliability and performance across the full infrastructure stack.
As PM for Mission Control, you will own foundational capabilities that determine how well AI infrastructure performs in real-world training and inference workloads: from bare metal and networking to scheduler/runtime behavior and user-facing outcomes. This is a deeply technical PM role.
Prior PM title is not mandatory: strong candidates from HPC, ML infrastructure, distributed systems, SRE, cloud engineering, or ML solution architecture who want to grow into product are welcome.
Your responsibilities will include:
• Own reliability and performance opportunities across the Nebius stack, from bare metal to applications.
• Define product direction end-to-end: problem discovery → design → delivery → adoption.
• Drive cross-functional execution across compute, networking, storage, observability, platform, and hardware teams.
• Lead deep problem research using customer interviews, analytics, workload studies, and log investigations.
• Identify and prioritize bottlenecks affecting large-scale training/inference performance and stability.
• Translate advanced ML/infrastructure research into practical, scalable product capabilities.
• Define and operationalize product metrics for cluster experience (e.g. reliability, efficiency, latency-to-start, utilization, throughput).
We expect you to have:
• 3–5+ years of experience in one or more of: product management, HPC, ML infrastructure/MLOps, distributed systems, SRE, cloud architecture, or GPU platforms.
• Strong technical foundation in distributed systems, cloud infrastructure, or ML platforms.
• Hands-on familiarity with ML orchestration environments (e.g. Slurm, Kubernetes, Ray, or similar).
• Experience delivering technically complex initiatives with multiple engineering teams.
• Strong communication skills and the ability to influence engineering, research, and customer stakeholders.
• Experience using analytics and data to prioritize roadmap decisions.
• High ownership, fast learning, and comfort in fast-evolving AI infrastructure environments.
It will be an added bonus if you have:
• Experience with GPU platforms and HPC technologies (InfiniBand/RDMA, topology-aware systems).
• Familiarity with modern ML training stacks (PyTorch, DeepSpeed, FSDP/ZeRO, NCCL).
• Understanding of training efficiency metrics and operational signals (Goodput, MFU, scheduling quality, health checks).
• Exposure to large-scale LLM training or inference systems.
• Background in observability, performance tuning, or reliability engineering.
• Customer-facing technical experience supporting ML or infrastructure workloads.
About Nebius
Nebius AI is an AI cloud platform with one of the largest GPU capacities in Europe. Launched in November 2023, the Nebius AI platform provides high-end, training-optimized infrastructure for AI practitioners. As an NVIDIA preferred cloud service provider, Nebius AI offers a variety of NVIDIA GPUs for training and inference, as well as a set of tools for efficient multi-node training.
Nebius AI owns a data center in Finland, built from the ground up by the company's R&D team and showcasing our commitment to sustainability. The data center is home to ISEG, the most powerful commercially available supercomputer in Europe and the 16th most powerful globally (Top 500 list, November 2023).
Nebius's headquarters are in Amsterdam, Netherlands, with teams working out of R&D hubs across Europe and the Middle East.
Nebius AI is built with the talent of more than 500 highly skilled engineers with a proven track record in developing sophisticated cloud and ML solutions and designing cutting-edge hardware. This allows all the layers of the Nebius AI cloud, from hardware to UI, to be built in-house, distinctly differentiating Nebius AI from the majority of specialized clouds: Nebius customers get a true hyperscaler-cloud experience tailored for AI practitioners.
What we offer
• Competitive salary and comprehensive benefits package.
• Opportunities for professional growth within Nebius.
• Flexible working arrangements.
• A dynamic and collaborative work environment that values initiative and innovation.
We're growing and expanding our products every day. If you're up to the challenge and are as excited about AI and ML as we are, join us!