GenAI Site Reliability Engineer (SRE) - Senior Associate - AI Managed Services - Operate
Posted: 2026-05-05
Industry/Sector: Not Applicable
Specialism: Managed Services
Management Level: Senior Associate
Job Description & Summary
At PwC, our people in managed services focus on a variety of outsourced solutions and support clients across numerous functions. These individuals help organisations streamline their operations, reduce costs, and improve efficiency by managing key processes and functions on their behalf. They are skilled in project management, technology, and process optimisation to deliver high-quality services to clients.
Those in managed service management and strategy at PwC focus on transitioning and running services, along with managing delivery teams, programmes, commercials, performance and delivery risk. Your work will involve continuously improving and optimising managed services processes, tools and services.
Focused on relationships, you are building meaningful client connections, and learning how to manage and inspire others. Navigating increasingly complex situations, you are growing your personal brand, deepening technical expertise and awareness of your strengths. You are expected to anticipate the needs of your teams and clients, and to deliver quality. Embracing increased ambiguity, you are comfortable when the path forward isn’t clear, you ask questions, and you use these moments as opportunities to grow.
Examples of the skills, knowledge, and experiences you need to lead and deliver value at this level include but are not limited to:
- Respond effectively to the diverse perspectives, needs, and feelings of others.
- Use a broad range of tools, methodologies and techniques to generate new ideas and solve problems.
- Use critical thinking to break down complex concepts.
- Understand the broader objectives of your project or role and how your work fits into the overall strategy.
- Develop a deeper understanding of the business context and how it is changing.
- Use reflection to develop self-awareness, enhance strengths and address development areas.
- Interpret data to inform insights and recommendations.
- Uphold and reinforce professional and technical standards (e.g. refer to specific PwC tax and audit guidance), the Firm's code of conduct, and independence requirements.
GenAI Site Reliability Engineer
Observability | Incident Response | Reliability Engineering | AWS and GenAI Operations
Purpose: Operate, monitor, and continuously improve the reliability of in-scope AI platforms and services.
Role: GenAI Site Reliability Engineer
Level: AC - Staff - Experienced
Tower: AI Operations & Platform Support (AI Managed Services)
Experience: 4+ years in SRE, production support, cloud operations, or a similar run-state engineering role
Work Location: Bangalore / Hyderabad, India (Remote)
Key Platforms: AWS / Amazon Bedrock, OpenAI / ChatGPT Enterprise, observability and ITSM tooling
Role profile: Hands-on reliability engineer focused on monitoring, incident response, service health, and operational stability for AI workloads.
Primary focus: Observability, alerting, incident investigation, RCA support, automation, and post-change validation.
Best fit: An engineer who likes messy production problems, can separate signal from noise, and is comfortable owning issues through restoration and follow-up.
Role Summary
As a GenAI Site Reliability Engineer, you will operate and improve monitoring for in-scope AI services, investigate incidents, restore service, and implement reliability improvements. The role is oriented around real run-state support rather than net-new build work, so we need people who can work from alerts, logs, traces, tickets, dashboards, and imperfect documentation to drive structured troubleshooting and better outcomes over time.
Key Responsibilities
1. Monitoring, alerting, and service health
- Build and maintain dashboards and alerts for availability, latency, error rates, usage, and other service-health indicators.
- Tune thresholds and alert routing to reduce noise and improve actionable detection, MTTA, and MTTR.
- Monitor platform health across AWS and GenAI services and escalate emerging issues before they become user-impacting incidents.
2. Incident triage, restoration, and problem management
- Investigate incidents using logs, metrics, traces, ticket history, and runbooks; execute restoration steps and coordinate escalation when deeper resolver groups are needed.
- Contribute to RCA and post-incident corrective actions for major incidents and recurring issues.
- Support severity assessment with the incident commander and provide clear technical updates during active events.
3. Reliability improvement and automation
- Identify recurring failure modes and implement improvements such as alert tuning, diagnostics automation, repeatable checks, or resilience enhancements.
- Automate routine support activities to reduce manual effort and improve consistency of triage and recovery.
- Support performance and cost troubleshooting by isolating contributing factors and validating the impact of fixes.
4. Operational readiness and knowledge management
- Maintain runbooks, known-error patterns, troubleshooting guides, and standard operating procedures.
- Support change readiness and post-change validation so monitoring, documentation, and restoration steps stay current as the platform evolves.
- Provide inputs to service reporting on trends, recurring issues, and improvement opportunities.
Preferred Skills and Experience
- SRE and production operations: Hands-on experience supporting production services in a cloud environment, including monitoring, troubleshooting, incident response, and restoration.
- Observability: Experience building dashboards and alerts and using logs, metrics, and traces to diagnose issues. CloudWatch, Datadog, Splunk, New Relic, Grafana, or OpenTelemetry experience is relevant.
- Cloud and GenAI platform operations: Working knowledge of AWS operations and familiarity with Bedrock, OpenAI, or adjacent AI platform services used in enterprise production environments.
- Incident and problem management: Experience working within ITIL-aligned processes for incident, problem, request, and change management, including strong ticket hygiene and runbook discipline.
- Automation and scripting: Ability to automate diagnostics or repetitive support activities using Python, shell scripting, or similar tools.
- Critical thinking and collaboration: Ability to solve ambiguous production issues, work across teams, ask the right questions, and engage stakeholders to move investigations and actions forward.
Nice to Have
- Experience supporting Bedrock or OpenAI-powered workloads in production.
- Experience with service reliability metrics such as SLIs, SLOs, MTTA, MTTR, and error trends.
- Exposure to cost and usage monitoring, quota or throttling investigation, and post-change validation.
- AWS certifications or other cloud reliability certifications.
Working Style & Core Behaviors
- Thinks in a structured, evidence-based way and does not jump to conclusions.
- Can stay effective when documentation is incomplete and the right path is not obvious up front.
- Communicates clearly during live incidents and keeps others aligned on status, risks, and next steps.
- Works well with engineers, platform owners, service desk teams, and vendors without creating friction.
What Good Looks Like
- Can trace a noisy alert to a meaningful root cause and either restore service or escalate with the right evidence.
- Improves monitoring quality over time instead of simply reacting to tickets.
- Turns recurring pain points into automation, better documentation, or better alert logic.
- Builds confidence with stakeholders because updates are clear, grounded, and action-oriented.
Team Context
You will join PwC’s AI Operations & Platform Support team supporting a client’s run-state AI environment. The operating model is centered on Level 2 and Level 3 support, monitoring, incident response, service requests, minor enhancements, and continuous improvement across AWS/Bedrock, OpenAI, and related platform components.
This role will work in a managed-services model focused on incident management, service requests, monitoring, minor enhancements, knowledge management, and continuous improvement. Success depends not only on technical skill, but also on ownership, collaboration, and the ability to engage stakeholders to progress work.
Travel Requirements: 0%
Job Posting End Date