Tanmay Sahay · तन्मय

Reliability for AI Systems
Software Engineer, SRE — Jules · Google, Mountain View
oagiaamch9satl@.m4aynym +1-6775650-5-10
Abstract. Customer-obsessed engineer with 8+ years of experience making service operability sustainable. Currently focused on reliability, observability, and operability of Jules — Google's autonomous AI coder. Expert in automating complex infrastructure, reducing operational toil by >50%, and pioneering AI-powered troubleshooting. Proven track record of saving 30+ SWE-years through efficiency optimizations.

1 Positions

Software Engineer, SRE (Jules — Autonomous AI Coder)
Google · Feb '26 - Present · US-MTV
tl;dr Building and maintaining critical reliability infrastructure and automation for Jules, Google's autonomous AI coding agent (jules.google.com), so that a global user base gets a dependable, scalable, and highly performant service.
Building observability pipelines to surface agent health, task success rates, and failure modes — enabling data-driven reliability decisions.
Designing operability frameworks that make Jules continually easier to operate, maintain, and on-call for.
Automating turnup processes for new regions and capacity, reducing manual toil and accelerating global expansion.
Driving change management and capacity management practices to ensure safe, predictable rollouts at scale.
Software Engineer, SRE (Gemini & Vertex AI)
Google · Apr '25 - Jan '26 · US-PIT
tl;dr Leading reliability for Google Cloud's LLM offerings (Gemini, Veo, Imagen).
Automated months-long turnup process for Vertex AI in new regions.
Saved Google ~30 SWE-years and relinquished ~7.6k TPUs by enforcing Vertex Endpoint Health.
Pioneered 'Gemini Powered Vertex Operations', using AI to reduce Mean Time To Mitigate (MTTM).
Designed capacity presubmits preventing Vertex capacity overconsumption.
Software Engineer, SRE (Network Infrastructure)
Google · Feb '24 - Apr '25 · US-PIT
tl;dr Ensuring reliability for Google's global backbone network telemetry and monitoring systems.
Maintained system stability in a fast-changing environment with constantly shifting requirements.
Collaborated with NetInfra Telemetry teams to improve network observability.
Built tooling to enhance network monitoring and alerting capabilities.
Contributed to infrastructure supporting Google's global network operations.
Software Engineer, SRE (Cloud Infrastructure)
Google · Feb '23 - Feb '24 · CH-ZRH
tl;dr Continued driving reliability improvements and tooling development from Zurich.
Continued development and expansion of Khoj (InvDash), scaling adoption across Google SRE teams.
Mentored junior engineers and drove knowledge transfer across regions.
Contributed to cross-functional infrastructure reliability initiatives.
Prepared for and executed smooth transition to US-based Network Infrastructure team.
Software Engineer, SRE (Serverless Platform)
Google · Mar '19 - Feb '23 · UK-LON / CH-ZRH
tl;dr Enhanced reliability for Cloud Run, Cloud Functions, and App Engine.
Reduced team's oncall load by over 50% through actionable metrics and democratization of data.
Enabled safe, slow rollouts for 500+ Spanner databases, preventing global outages.
Created 'Khoj' (InvDash) in 2021 — an automated incident root-causing system now used Google-wide and still actively maintained.
Led Log4j Code Red response for Serverless products.
Creator & Lead Developer — Khoj (InvDash)
Google (Internal Project) · 2021 - Present · Global
tl;dr Built and continuously evolved an automated incident investigation and root-causing system.
Conceived and built Khoj in 2021 to automate tedious incident root-cause analysis.
System correlates logs, metrics, and change events to surface probable causes during outages.
Adopted Google-wide across multiple SRE teams, significantly reducing Mean Time To Diagnose (MTTD).
Continuously maintained and enhanced for 4+ years, adapting to new infrastructure patterns.
Software Developer
Booking.com · Jun '17 - Feb '19 · NL-AMS
tl;dr Machine Learning Services & Image Infrastructure.
Reduced image serving latency by 50% and storage costs by 80% via on-the-fly resizing service.
Built ML platform features used by 200+ Data Scientists.
Migrated image building pipelines to Google Cloud Dataproc.

2 Methods & instruments

programming Python, Go, Java, C++, SQL, Shell.
sre & cloud Kubernetes, Terraform, Bazel, GCP, Incident Response, Observability.
ai & ml Vertex AI, LLM Ops, Gemini CLI, AI Agents, Model Serving, TPU Fleet Mgmt.
tools Prodspec, Spanner, Bigtable, Prometheus/Monarch.

3 Selected peer review

43 peer bonuses, 2019–2025. Six representative reviews:

"Thank you Tanmay for your contribution to key V1P Initiatives: Groot Turnup Automation Scripting, Observability, Troubleshooting and Incident response Improvements with eCatcher, Fireaxe playbooks, Instructions and…" TJ Angelo, Vertex AI, 2025 · Automation
"Thanks Tanmay for introducing me and keeping me up to date with all the innovative things happening in the world of AI. Your presentation on gemini-cli and how to prompt was awesome. Your push towards using AI to…" Himanshu Raj, Vertex AI, 2025 · Innovation
"Thank you for taking time and providing detailed feedback on the Production Agent insights in IRM. Your valuable insights will help us improve the Outage Investigator experience for all Googlers." Abhishek Gupta, Cross-team, 2025 · Collaboration
"Thank you for going above and beyond during my Pittsburgh trip. You had dinner with me every night I was there, organized a small team dinner, and collaborated with me in person. It really added to the welcoming…" Mark Langer, Cross-team, 2025 · Leadership
"Tanmay has really ramped up this quarter quickly on Autoscaler and metrics work. He's been quick to engage and iterate with dev partners on how to format metrics and dashboards, what debugging looks like and how we can…" Kevin Shumaker, Vertex AI, 2025 · Leadership
"Thanks for helping us prepare for ProdEx. We got an overall score of 4/5, which is great! We couldn't get there without your contributions!" Andrei-Marius Dincu, Vertex AI, 2025 · Technical Excellence

4 Education & honors

B.Tech in Computer Science
IIIT Hyderabad · 2013 - 2017 · Hyderabad, India
ACM ICPC Regional Finalist (2014, 2015) — among top competitive programmers in Asia.
JEE Mains Rank 719 (Top 0.05%) out of 1.4M candidates nationwide.
National Talent Scholar — recognized for academic excellence.
CodeChef Campus Chapter Ambassador — organized coding competitions.
Teaching Assistant for Algorithms & Data Structures courses.

5 Languages

germanic English (native), Dutch (conversational), German (basic).
romance French (conversational), Spanish (basic).
indo-aryan Hindi (native), Urdu (conversational), Sanskrit (exposure).
dravidian Kannada (fluent).
scripts Cyrillic.

6 Citation

@misc{sahay2026,
  author = {Sahay, Tanmay},
  title  = {Reliability for AI Systems},
  year   = {2026},
  note   = {ICPC regional finalist; polyglot (9 languages); reachable at the address above}
}