Tanmay Sahay
Software Engineer — Reliability for AI Systems
Making autonomous AI coding reliable, observable, and operable at scale.
Experience Map
Building and maintaining critical reliability infrastructure and automation for Jules, Google's autonomous AI coding agent (jules.google.com), so that it stays reliable, scalable, and performant for a global user base.
- Building observability pipelines to surface agent health, task success rates, and failure modes — enabling data-driven reliability decisions.
- Designing operability frameworks that make Jules continually easier to operate, maintain, and support on call.
- Automating turnup processes for new regions and capacity, reducing manual toil and accelerating global expansion.
- Driving change management and capacity management practices to ensure safe, predictable rollouts at scale.
Leading reliability for Google Cloud's LLM offerings (Gemini, Veo, Imagen).
- Automated months-long turnup process for Vertex AI in new regions.
- Saved Google ~30 SWE-years and relinquished ~7.6k TPUs by enforcing Vertex Endpoint Health.
- Pioneered 'Gemini Powered Vertex Operations', using AI to reduce Mean Time To Mitigate (MTTM).
- Designed capacity presubmits preventing Vertex capacity overconsumption.
Ensuring reliability for Google's global backbone network telemetry and monitoring systems.
- Navigated chaotic environment with constantly changing requirements while maintaining system stability.
- Successfully collaborated with NetInfra Telemetry teams to improve network observability.
- Built tooling to enhance network monitoring and alerting capabilities.
- Contributed to infrastructure supporting Google's global network operations.
Continued driving reliability improvements and tooling development from Zurich.
- Continued development and expansion of Khoj (InvDash), scaling adoption across Google SRE teams.
- Mentored junior engineers and drove knowledge transfer across regions.
- Contributed to cross-functional infrastructure reliability initiatives.
- Prepared for and executed smooth transition to US-based Network Infrastructure team.
Enhanced reliability for Cloud Run, Cloud Functions, and App Engine.
- Reduced the team's on-call load by over 50% through actionable metrics and democratized access to data.
- Enabled safe, slow rollouts for 500+ Spanner databases, preventing global outages.
- Created 'Khoj' (InvDash) in 2021 — an automated incident root-causing system now used Google-wide and still actively maintained.
- Led Log4j Code Red response for Serverless products.
Google (Internal Project)
Built and continuously evolved an automated incident investigation and root-causing system.
- Conceived and built Khoj in 2021 to automate tedious incident root-cause analysis.
- System correlates logs, metrics, and change events to surface probable causes during outages.
- Adopted Google-wide across multiple SRE teams, significantly reducing Mean Time To Diagnose (MTTD).
- Continuously maintained and enhanced for 4+ years, adapting to new infrastructure patterns.
Booking.com
Machine Learning Services & Image Infrastructure.
- Reduced image serving latency by 50% and storage costs by 80% via an on-the-fly resizing service.
- Built ML platform features used by 200+ Data Scientists.
- Migrated image building pipelines to Google Cloud Dataproc.
IIIT Hyderabad
Premier research university, ranked among top CS programs in India.
- ACM ICPC Regional Finalist (2014, 2015) — among top competitive programmers in Asia.
- JEE Mains Rank 719 (Top 0.05%) out of 1.4M candidates nationwide.
- National Talent Scholar — recognized for academic excellence.
- CodeChef Campus Chapter Ambassador — organized coding competitions.
- Teaching Assistant for Algorithms & Data Structures courses.
Skills & Technologies
Peer Recognition
43 peer bonuses (10 shown)
"Thank you Tanmay for your contribution to key V1P Initiatives: Groot Turnup Automation Scripting, Observability, Troubleshooting and Incident response Improvements with eCatcher, Fireaxe playbooks, Instructions and prompting guidance on simplifying ops with Gemini CLI, etc."
"Thanks Tanmay for introducing me and keeping me up to date with all the innovative things happening in the world of AI. Your presentation on gemini-cli and how to prompt was awesome. Your push towards using AI to automate our operations and investigations will be really impactful for the team."
"Tanmay has really ramped up this quarter quickly on Autoscaler and metrics work. He's been quick to engage and iterate with dev partners on how to format metrics and dashboards, what debugging looks like and how we can better enable it, and has been excellent at using Taskflow and bug updates to keep the wider audience informed. It's been awesome to see such confident structure introduced to the working group."
"Tanmay, thank you for your work on capacity presubmits. You laid the foundation for capacity presubmits and later improved them to account for models in migration. As a result, we now have an effective mechanism to preventing Vertex capacity overconsumption before it even happens, and saving hours of debugging for many people working with TPU fleet. Thank you very much, this is a game changer."
"Thanks for being around in irm/i_G9B23gUQ13v66zRreanP and helping me and Lucky with the incident. This was a multiple hours situation where the utilization of gemini 1.5 hit 100% and we couldn't find the chips. Thanks to your help we were able to harvest chips from multiple endpoints until autoscaler finally kicked in!"
"Thank you Tanmay for your great work migrating B2 Device Linecards dashboard for TPC under extremely challenging circumstances and a very tight timeline. You did an amazing job navigating a truly chaotic environment where everything changes every day, things don't work as expected and one has to follow up with multiple teams at the same time to get unblocked. You successfully collaborated with NetInfra Telemetry team and merged the overlapping work which is something I wasn't able to do on my own. Thank you and keep up the good work!"
"Thank you Tanmay for always reviewing my Khoj CLs quickly! I often write CLs and sometimes nobody from my team is able to review them. Tanmay is always very responsive and reviews the CLs and offers great feedback. His help is crucial for making progress quickly on my project."
"Thank you for your great work on the AMC->MAC migration. Your efforts and attention to detail to this critical part of the project has made a significant contribution to the success of the project. Your work made possible to have a smooth transition. Also, thank you for always looking on different ways to contribute and help others on their tasks."
"Tanmay was my mentor from Serverless Platform team, London site. He introduced and guided me through different internal tools, and walked through with me the problems I had. He was able to commit to frequent 1:1s, and keep me up to date. I appreciate his time and help a lot. Thank you so much for the mentoring during the first three months!"
"Thank you for going above and beyond the call of duty in your response to the log4j security vulnerabilities in December 2021. Your commitment to securing Google and our customers is truly appreciated!"