About Us

100ms operates two product lines at scale: a real-time Live Video platform powering latency-sensitive, high-concurrency video experiences, and an AI Agents platform that automates complex patient access workflows in U.S. healthcare. Both products run on a shared, robust infrastructure foundation. You'll be joining the central platform team responsible for keeping both running reliably, securely, and at scale, serving developers and healthcare operators who depend on us around the clock.

What Will You Do

• Own and operate production infrastructure across multiple GKE clusters supporting both real-time video workloads and AI agent pipelines, with HA, autoscaling, and full observability tuned to the demands of each.
• Manage GitOps workflows using Argo CD for automated, version-controlled, and auditable deployments across both product lines.
• Maintain and optimize monitoring and alerting stacks using open-source monitoring tools, with product-specific SLOs for low-latency video (jitter, packet loss, stream health) and AI workflow reliability (task throughput, failure rates, retry queues).
• Implement infrastructure as code using Terraform for GCP resources and Helm charts for Kubernetes manifests, with a strong bias toward repeatability and auditability.
• Support the unique infrastructure demands of real-time video, including media server scaling, WebRTC infrastructure, low-latency networking, and high-throughput data paths.
• Support AI agent workloads, including LLM inference infrastructure, async task queues, and integration pipelines with external healthcare systems.
• Lead or support incident response, cluster upgrades, and disaster recovery procedures across both platforms.
• Own the security posture of our infrastructure: enforce least-privilege access controls, manage secrets hygiene, and drive security hardening across clusters and services.
• Implement and maintain compliance-aligned controls relevant to healthcare data environments (e.g., encryption at rest/in transit, audit logging, network segmentation).
• Collaborate with product and engineering teams to embed security early in the development lifecycle: shift left on vulnerability scanning, dependency audits, and policy enforcement.

Who Can Apply

• Computer Science / Engineering degree or equivalent practical experience.
• Minimum 3 years of hands-on experience with Kubernetes in a production environment.
• Strong knowledge of CI/CD pipelines and GitOps workflows using Argo CD or similar tools.
• Proficient in infrastructure automation using Terraform and Helm.
• Experience in managing open-source monitoring and logging stacks (Prometheus, Loki, Grafana, Alertmanager, etc.).
• Working knowledge of cloud security principles: IAM, network policies, pod security, RBAC, and secrets management.
• Comfortable with Linux systems, shell scripting, and basic networking, including an understanding of UDP/TCP behaviour relevant to real-time media or distributed systems.

Good to Have

• Prior experience managing large-scale, multi-tenant, or mixed-workload infrastructure.
• Exposure to real-time media infrastructure: WebRTC, SFUs, TURN/STUN servers, or media server orchestration.
• Hands-on experience with secrets management tools such as HashiCorp Vault or Sealed Secrets.
• Familiarity with security scanning and policy tools (e.g., Trivy, OPA/Gatekeeper, Falco).
• Experience with GCP and GKE specifically.
• Exposure to compliance frameworks relevant to healthcare or regulated industries (HIPAA awareness is a plus).
• Experience with AI/ML inference workloads or async pipeline infrastructure (queues, workers, orchestrators).
• Experience with open-source contributions.
• Strong inclination to stay current with evolving infrastructure, security, and platform engineering practices, and a willingness to share ideas internally or externally.
• Ability to communicate fluently and clearly in English, written and spoken.

Why 100ms

• You'll work on genuinely varied infrastructure: real-time video at scale and AI-driven healthcare automation are both hard problems with different constraints, and you'll own both.
• You'll be part of a small, high-ownership team at a fast-growing, engineering-first startup with a meaningful mission: powering real-time experiences and helping patients access treatment faster.
• You'll work alongside engineers with deep experience in distributed systems, real-time media, AI infrastructure, and platform engineering at scale.
• You'll have the freedom to grow as an individual contributor or step into a team leadership role, with room to define your own goals and impact.
• Security and infrastructure are first-class concerns here, not support functions; your work directly shapes the trust and reliability our customers depend on.

Additional Information

• We place a strong emphasis on in-office collaboration to maintain a tight feedback loop and a strong engineering culture.
• Employees are expected to work from the office at least three days a week.

Platform Engineer — Core Infrastructure at 100ms
