Overview 
Dev.Pro, Bogota, D.C., Capital District, Colombia
We invite a skilled Kubernetes Developer to join our fully remote, international team.
In this role, you'll build and optimize the Kubernetes orchestration platform and develop custom operators to run HPC/AI workloads efficiently on GPU clusters.
You'll enhance infrastructure performance and reliability, create internal tools to improve the developer experience, and ensure multi-tenant HPC workloads remain secure and compliant.
What’s in it for you 
- Work on cutting-edge GPU infrastructure and next-gen HPC/AI workloads 
- Build a Slurm-on-Kubernetes product from scratch and shape its architecture 
- Collaborate with a top-tier international team and grow through continuous learning and conference participation 
Key Responsibilities 
- Design, develop, and manage Kubernetes platforms for GPU-intensive AI/HPC workloads 
- Design and build a Slurm-like orchestration layer on Kubernetes for HPC/AI workloads 
- Develop custom operators and controllers for GPU job scheduling and execution 
- Integrate batch schedulers with Kubernetes to provide a hybrid HPC/Cloud product 
- Implement advanced GPU resource management and multi-tenant isolation policies 
- Build internal tools and a self-service platform to simplify AI/HPC job deployment and management 
- Monitor GPU clusters, troubleshoot production issues, and ensure high availability, fault tolerance, and disaster recovery 
- Develop CI/CD pipelines for GPU-intensive workloads 
- Ensure compliance with data sovereignty and international regulations 
Qualifications 
- 3+ years of hands-on Kubernetes experience in production 
- Experience with HPC schedulers (Slurm, PBS, LSF, Volcano) 
- Strong background in GPU resource management and distributed systems 
- Experience with cloud/hybrid cloud architectures (AWS, GCP, Azure, on-prem GPU clusters) 
- Knowledge of Kubernetes operators, CRDs, scheduling, networking, and storage 
- Deep knowledge of HPC job scheduling and workload orchestration 
- Expertise in IaC (Terraform, Helm, or GitOps: ArgoCD/Flux) and monitoring & observability (Prometheus, Grafana, Jaeger, ELK) 
- Programming skills in Go, Python, and Bash/shell scripting 
- Familiarity with PyTorch, TensorFlow, distributed training, and model serving 
- Skills in Linux administration, performance tuning, and advanced networking (RDMA, InfiniBand, TCP/IP, DNS, load balancing) 
- Experience in storage management and optimization for large datasets 
Note: This role is fully remote and international, with a focus on collaboration across time zones.