Tanmay Sahay

Software Engineer — Reliability for AI Systems

Making autonomous AI coding reliable, observable, and operable at scale.

Let's talk: always up for conversations about reliability, AI infrastructure, and systems that don't page you at 3 a.m.
Polyglot: English · Dutch · German · French · Spanish · Hindi · Urdu · Kannada · Sanskrit
├─ Germanic: 🇬🇧 English (English) · 🇳🇱 Dutch (Nederlands) · 🇩🇪 German (Deutsch)
├─ Romance: 🇫🇷 French (Français) · 🇪🇸 Spanish (Español)
├─ Indo-Aryan: 🇮🇳 Hindi (हिन्दी) · 🇮🇳 Urdu (اردو) · 🕉️ Sanskrit (संस्कृतम्)
├─ Dravidian: 🇮🇳 Kannada (ಕನ್ನಡ)
└─ Scripts: 🇷🇺 Cyrillic (Кириллица)
  • 43 Peer Bonuses: Recognized for impact & collaboration
  • 30+ SWE-Years (Impact): Saved via Vertex Endpoint Health
  • 50% Reduction (Latency): Image generation at Booking.com
  • 500+ Databases (Adoption): Safe Spanner rollouts enabled

Experience Map

Google

Software Engineer, SRE (Jules — Autonomous AI Coder)
Feb '26 - Present
US-MTV

Building and maintaining critical reliability infrastructure and automation for Jules, Google's autonomous AI coding agent (jules.google.com), so that a global user base can count on it to be reliable, scalable, and highly performant.

  • Building observability pipelines to surface agent health, task success rates, and failure modes — enabling data-driven reliability decisions.
  • Designing operability frameworks that make Jules continually easier to operate, maintain, and on-call for.
  • Automating turnup processes for new regions and capacity, reducing manual toil and accelerating global expansion.
  • Driving change management and capacity management practices to ensure safe, predictable rollouts at scale.
Reliability · Observability · Operability · Turnup Automation · Change Management · Capacity Management · AI Agents · Python · Go

Google

Software Engineer, SRE (Gemini & Vertex AI)
Apr '25 - Jan '26
US-PIT

Led reliability for Google Cloud's LLM offerings (Gemini, Veo, Imagen).

  • Automated the months-long turnup process for Vertex AI in new regions.
  • Saved Google ~30 SWE-years and relinquished ~7.6k TPUs by enforcing Vertex Endpoint Health.
  • Pioneered 'Gemini Powered Vertex Operations', using AI to reduce Mean Time To Mitigate (MTTM).
  • Designed capacity presubmits that prevent Vertex capacity overconsumption.
AI/ML · Automation · Capacity Planning · Python · Go

Google

Software Engineer, SRE (Network Infrastructure)
Feb '24 - Apr '25
US-PIT

Ensured reliability for Google's global backbone network telemetry and monitoring systems.

  • Navigated a chaotic environment with constantly changing requirements while maintaining system stability.
  • Successfully collaborated with NetInfra Telemetry teams to improve network observability.
  • Built tooling to enhance network monitoring and alerting capabilities.
  • Contributed to infrastructure supporting Google's global network operations.
Networking · Telemetry · Observability · Collaboration · Python

Google

Software Engineer, SRE (Cloud Infrastructure)
Feb '23 - Feb '24
CH-ZRH

Continued driving reliability improvements and tooling development from Zurich.

  • Continued development and expansion of Khoj (InvDash), scaling adoption across Google SRE teams.
  • Mentored junior engineers and drove knowledge transfer across regions.
  • Contributed to cross-functional infrastructure reliability initiatives.
  • Prepared for and executed a smooth transition to the US-based Network Infrastructure team.
Mentoring · Tooling · Cross-team Collaboration · Infrastructure

Google

Software Engineer, SRE (Serverless Platform)
Mar '19 - Feb '23
UK-LON / CH-ZRH

Enhanced reliability for Cloud Run, Cloud Functions, and App Engine.

  • Reduced the team's on-call load by over 50% through actionable metrics and democratization of data.
  • Enabled safe, slow rollouts for 500+ Spanner databases, preventing global outages.
  • Created 'Khoj' (InvDash) in 2021 — an automated incident root-causing system now used Google-wide and still actively maintained.
  • Led Log4j Code Red response for Serverless products.
Serverless · Spanner · Incident Management · Mentoring

Google (Internal Project)

Creator & Lead Developer — Khoj (InvDash)
2021 - Present
Global

Built and continuously evolved an automated incident investigation and root-causing system.

  • Conceived and built Khoj in 2021 to automate tedious incident root-cause analysis.
  • System correlates logs, metrics, and change events to surface probable causes during outages.
  • Adopted Google-wide across multiple SRE teams, significantly reducing Mean Time To Diagnose (MTTD).
  • Continuously maintained and enhanced for 4+ years, adapting to new infrastructure patterns.
Incident Response · Automation · Data Correlation · Python · Observability

Booking.com

Software Developer
Jun '17 - Feb '19
NL-AMS

Machine Learning Services & Image Infrastructure.

  • Reduced image serving latency by 50% and storage costs by 80% via an on-the-fly resizing service.
  • Built ML platform features used by 200+ Data Scientists.
  • Migrated image building pipelines to Google Cloud Dataproc.
Java · Machine Learning · Optimization · Distributed Systems

IIIT Hyderabad

B.Tech in Computer Science
2013 - 2017
Hyderabad, India

Premier research university, ranked among top CS programs in India.

  • ACM ICPC Regional Finalist (2014, 2015) — among top competitive programmers in Asia.
  • JEE Mains Rank 719 (Top 0.05%) out of 1.4M candidates nationwide.
  • National Talent Scholar — recognized for academic excellence.
  • CodeChef Campus Chapter Ambassador — organized coding competitions.
  • Teaching Assistant for Algorithms & Data Structures courses.
Algorithms · Data Structures · Competitive Programming · Problem Solving

Skills & Technologies

Programming
Python · Go · Java · C++ · SQL · Shell
SRE & Cloud
Kubernetes · Terraform · Bazel · GCP · Incident Response · Observability
AI & ML
Vertex AI · LLM Ops · Gemini CLI · AI Agents · Model Serving · TPU Fleet Mgmt
Tools
Prodspec · Spanner · Bigtable · Prometheus/Monarch

Peer Recognition

43 peer bonuses • Showing 10

Validated Google Peer Bonus Data
TJ Angelo
Dec 04, 2025
Vertex AI · Automation

"Thank you Tanmay for your contribution to key V1P Initiatives: Groot Turnup Automation Scripting, Observability, Troubleshooting and Incident response Improvements with eCatcher, Fireaxe playbooks, Instructions and prompting guidance on simplifying ops with Gemini CLI, etc."

#Automation · #AI/ML · #Leadership
Himanshu Raj
Sep 30, 2025
Vertex AI · Innovation

"Thanks Tanmay for introducing me and keeping me up to date with all the innovative things happening in the world of AI. Your presentation on gemini-cli and how to prompt was awesome. Your push towards using AI to automate our operations and investigations will be really impactful for the team."

#AI/ML · #Innovation · #Knowledge Sharing
Kevin Shumaker
Jun 27, 2025
Vertex AI · Leadership

"Tanmay has really ramped up this quarter quickly on Autoscaler and metrics work. He's been quick to engage and iterate with dev partners on how to format metrics and dashboards, what debugging looks like and how we can better enable it, and has been excellent at using Taskflow and bug updates to keep the wider audience informed. It's been awesome to see such confident structure introduced to the working group."

#Autoscaler · #Communication · #Structure
Maria Samokhina
Jun 03, 2025
Vertex AI · Technical Excellence

"Tanmay, thank you for your work on capacity presubmits. You laid the foundation for capacity presubmits and later improved them to account for models in migration. As a result, we now have an effective mechanism to preventing Vertex capacity overconsumption before it even happens, and saving hours of debugging for many people working with TPU fleet. Thank you very much, this is a game changer."

#Capacity · #TPU · #Prevention
Antonino Radici
May 07, 2025
Vertex AI · Incident Response

"Thanks for being around in irm/i_G9B23gUQ13v66zRreanP and helping me and Lucky with the incident. This was a multiple hours situation where the utilization of gemini 1.5 hit 100% and we couldn't find the chips. Thanks to your help we were able to harvest chips from multiple endpoints until autoscaler finally kicked in!"

#Incident · #Gemini · #Collaboration
Masha Pospelova
Oct 03, 2024
Network Infra · Collaboration

"Thank you Tanmay for your great work migrating B2 Device Linecards dashboard for TPC under extremely challenging circumstances and a very tight timeline. You did an amazing job navigating a truly chaotic environment where everything changes every day, things don't work as expected and one has to follow up with multiple teams at the same time to get unblocked. You successfully collaborated with NetInfra Telemetry team and merged the overlapping work which is something I wasn't able to do on my own. Thank you and keep up the good work!"

#Dashboard · #Migration · #Resilience
Mihai Guran
Sep 14, 2024
Cross-team · Collaboration

"Thank you Tanmay for always reviewing my Khoj CLs quickly! I often write CLs and sometimes nobody from my team is able to review them. Tanmay is always very responsive and reviews the CLs and offers great feedback. His help is crucial for making progress quickly on my project."

#Khoj · #Code Review · #Responsiveness
Enrique García Torres
Sep 27, 2023
Cloud Infra · Technical Excellence

"Thank you for your great work on the AMC->MAC migration. Your efforts and attention to detail to this critical part of the project has made a significant contribution to the success of the project. Your work made possible to have a smooth transition. Also, thank you for always looking on different ways to contribute and help others on their tasks."

#Migration · #Attention to Detail · #Helpfulness
Dora Diao
Dec 08, 2022
Serverless · Mentoring

"Tanmay was my mentor from Serverless Platform team, London site. He introduced and guided me through different internal tools, and walked through with me the problems I had. He was able to commit to frequent 1:1s, and keep me up to date. I appreciate his time and help a lot. Thank you so much for the mentoring during the first three months!"

#Mentoring · #Onboarding · #Guidance
Philip Beevers
Jan 10, 2022
Serverless · Incident Response

"Thank you for going above and beyond the call of duty in your response to the log4j security vulnerabilities in December 2021. Your commitment to securing Google and our customers is truly appreciated!"

#Log4j · #Security · #Dedication