Job Description
As a Principal AI Infrastructure Abstraction Engineer, you will design and implement the foundational systems that make shared AI compute environments scalable, secure, and developer-friendly. Your work will focus on creating abstractions that hide hardware complexity while providing predictable, cloud-native interfaces for AI workloads.
This position bridges infrastructure and applied AI—turning raw GPUs and accelerators into programmable, elastic, and multi-tenant resources for both internal developers and enterprise clients.
Key Responsibilities
- Architect abstractions that map logical compute constructs (vGPUs, GPU pools, workload queues) to physical devices.
- Build APIs, services, and control planes that expose GPU and accelerator resources with strong isolation and quality-of-service guarantees.
- Develop mechanisms for secure GPU sharing, including time-slicing, partitioning, and namespace isolation.
- Work with orchestration and scheduling systems to ensure intelligent mapping of resources based on utilization, priority, and network topology.
- Define policies for quotas, fair allocation, and resource elasticity in shared environments.
- Integrate with AI/ML frameworks (PyTorch, TensorFlow, Triton, etc.) to optimize model training and inference workflows.
- Deliver observability and monitoring capabilities that trace resource usage from logical abstractions to hardware.
- Partner with platform security teams to strengthen access controls, onboarding processes, and tenant isolation.
- Support internal developer adoption of abstraction APIs while maintaining high performance and low overhead.
- Contribute to long-term compute platform strategy with a focus on modularity, abstraction, and scale.
Minimum Qualifications
- Bachelor’s degree with 15+ years of experience, Master’s with 12+ years, or PhD with 8+ years.
- Proven track record building production-grade infrastructure systems, preferably in Go, Python, or C++.
- Strong experience with containerization and orchestration platforms (Kubernetes, Docker, KubeVirt).
- Background in designing logical abstractions for compute, storage, or networking in multi-tenant systems.
- Familiarity with integrating with machine learning platforms (e.g., PyTorch, TensorFlow, Triton, MLflow).
Preferred Qualifications
- Hands-on experience with GPU sharing, scheduling, or isolation (MIG, MPS, vGPUs, time-slicing, or device plugin models).
- Deep knowledge of resource management: quotas, prioritization, fairness, elasticity.
- Strong ability to think across hardware/software boundaries and design abstractions that scale.