Your Opportunity
At Schwab, you're empowered to make an impact on your career. Here, innovative thought meets creative problem solving, helping us "challenge the status quo" and transform the finance industry together. Schwab Technology Services enables the future of how clients manage their money by providing innovative and reliable technology products and services as a part of our ongoing commitment to democratize access to investing and financial planning.
Join Schwab Data's Data Services team to help shape the next generation of Cloud and AI capabilities that power enterprise-scale analytics and innovation. In this role you will establish operational readiness for Google Cloud across multi-tenant Cloud platforms, implementing observability, monitoring, and incident response for data and AI workloads. You will define production readiness criteria, automate recovery processes, and enforce SLA compliance as these services scale across Schwab.
What you have
Required Qualifications:
- Bachelor's degree in Computer Science, Engineering or related field OR related practical experience
- 5+ years of experience in a software engineering role (Infrastructure or Application)
- 2+ years of hands-on experience implementing and operationalizing observability frameworks for GCP-based platforms (e.g., Vertex AI, BigQuery, Dataflow, Composer, Dataplex, Looker), ensuring SLA compliance and production readiness.
- Proven expertise in designing and automating incident detection, response, and recovery workflows to minimize downtime and operational risk.
- Strong knowledge of cloud infrastructure management, capacity planning, and high-availability strategies for multi-tenant deployments.
- Experience establishing and enforcing operational governance controls, including audit readiness, resiliency standards, and compliance guardrails.
- Demonstrated ability to partner with platform engineers and architects to integrate operational controls into module deployment pipelines.
Preferred Qualifications:
- Experience designing disaster recovery and failover automation strategies for regulated, enterprise-scale environments.
- Familiarity with AI/ML operational monitoring practices, including model performance tracking and drift detection.
- Ability to implement proactive remediation frameworks to address service disruptions and maintain high availability.
- Strong interpersonal skills to influence adoption of operational best practices across engineering and SRE teams.