Nuvepro - Task Intelligence for the Enterprise
OpenAI · Frontiers Clusters · San Francisco

Software Engineer, Hardware Health

Comp: $250K – $445K

Classified Tasks (18)

Automate 0% · Augment 83% · Human-Only 17%

Augment (15)

AI assists, human decides

Maximize healthy, usable compute across accelerator vendors, generations, cloud providers, and regions

operational

Define health signals across GPUs, CPUs, networking, and platform infrastructure

technical

Maintain health signals across GPUs, CPUs, networking, and platform infrastructure

technical

Build systems that observe, detect, remediate, and verify hardware issues across GPUs, CPUs, networking, and platform infrastructure

technical

Build and evolve health checks that detect, remediate, and verify failures at scale

technical

Ensure critical health checks execute with minimal latency

technical

Own node lifecycle workflows including drain, quarantine, repair, RMA, and return-to-service processes

operational

Build automation and tooling to enable global cluster management with minimal manual intervention

technical

Build automated remediation systems that operate across millions of GPUs globally

technical

Integrate health signals into training and inference systems in partnership with workload, reliability, and provider teams

communication

Minimize downtime of large-scale training and inference workloads

operational

Improve fleet efficiency through operational tooling and remediation systems

operational

Ensure compute resources remain continuously available to researchers and product teams

operational

Develop scalable operational tooling to support hardware health and observability at hyperscale

technical

Enable frontier model training and inference workloads to run reliably at hyperscale

operational

Human-Only (3)

Requires human judgment

Own the end-to-end health lifecycle of the global compute fleet

leadership

Investigate hardware failures and system-level issues across large-scale compute environments

analytical

Debug failures in production and research workloads across the compute fleet

technical

Job description

## Software Engineer, Hardware Health

Frontiers Clusters - San Francisco

## **About the Team**

The Hardware Health and Observability team owns the end-to-end health lifecycle of OpenAI’s global compute fleet. Our mission is to maximize healthy, usable compute across accelerator vendors, generations, cloud providers, and regions through reliable health signals, automated remediation, and scalable operational tooling.

We build the systems that observe, detect, remediate, and verify hardware issues across GPUs, CPUs, networking, and platform infrastructure, enabling frontier model training and inference workloads to run reliably at hyperscale. We are the last line of defense for the success of OAI’s production and research workloads.

## **About the Role**

On the Hardware Health and Observability team, you’ll build critical infrastructure that keeps OpenAI’s largest compute clusters healthy and operational at scale. Even small numbers of unhealthy systems can impact large-scale training and inference workloads. This team focuses on minimizing downtime, improving fleet efficiency, and ensuring compute resources remain continuously available to researchers and product teams.

Engineers on this team own problems end-to-end, from defining health signals and debugging failures to building automated remediation systems that operate across millions of GPUs globally.

In this role, you will:

* Define and maintain health signals across GPUs, CPUs, networking, and platform infrastructure.
* Build and evolve health checks that detect, remediate, and verify failures at scale.
* Ensure critical health checks execute with minimal latency to maximize workload uptime.
* Investigate hardware failures and system-level issues across large-scale compute environments.
* Own node lifecycle workflows including drain, quarantine, repair, RMA, and return-to-service processes.
* Build automation and tooling that enables global cluster management with minimal manual intervention.
* Partner with workload, reliability, and provider teams to integrate health signals into training and inference systems.

## **You might thrive in this role if you have:**

* 7+ years of industry experience in software or infrastructure engineering.
* Strong proficiency with Python and shell scripting.
* Experience building large-scale distributed systems or infrastructure platforms.
* Comfort digging into noisy operational data using SQL, PromQL, or similar tooling.
* Experience building reproducible analyses and operational tooling.
* Strong systems debugging and operational instincts with an ownership mindset.

## **Bonus if you have:**

* Experience with low-level hardware systems and Linux tooling (e.g. PCIe, InfiniBand, RoCE, networking, power management, kernel performance tuning, FW/SW debugging).
* Experience operating or debugging large-scale GPU or accelerator clusters.
* Expertise in network operations, observability, or systems telemetry.
* Experience with automated remediation systems or fleet lifecycle management.
* Experience improving reliability, utilization, or workload uptime in distributed compute environments.

**About OpenAI**

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability,
Source: OpenAI careers · scraped 2026-05-22