Overview
Are you passionate about improving customer experience for millions of customers of Walmart and its subsidiaries? As a Principal Site Reliability Engineer in the Customer Engagement Services (CES) Tech Org, you'll lead efforts to ensure our customer service platforms are resilient, scalable, and lightning-fast. You'll architect reliability frameworks, drive automation across incident response and observability, and collaborate with engineering and product teams to embed SRE principles into every layer of the stack. This role offers the excitement of solving real-world challenges at massive scale, where every improvement directly enhances customer satisfaction and operational excellence. If you're energized by building systems that empower associates and delight customers, this is your opportunity to lead with purpose.
About Team
Customer Engagement Services (CES) Technology builds best-in-class customer service experiences for hundreds of millions of Walmart customers and customer service agents globally. We are a group of software engineers, data scientists, and machine learning experts pushing the boundaries of GenAI technology in complex enterprise applications. The CES Technology team is part of the Enterprise Business Systems organization in Walmart Global Tech. We partner with our product, business, and UX teams to drive significant, measurable business impact. Our mission is to help customers save money and live better.
What you'll do
- Drive the design and evolution of monitoring and observability frameworks that enable proactive detection, root cause analysis, and rapid resolution of customer-impacting incidents.
- Lead the development and integration of automation tools to streamline operational workflows, reduce toil, and enhance the reliability of customer service platforms.
- Participate in on-call rotations, applying deep technical expertise to swiftly diagnose and mitigate production issues, ensuring high availability and minimal disruption to customer support experiences.
- Collaborate closely with engineering teams to embed reliability into the software development lifecycle, championing a culture of shared ownership and "you build it, you run it."
- Define and manage SLIs, SLOs, and SLAs to align service reliability with business expectations and continuously improve system performance.
- Apply proven reliability patterns and practices, leveraging hands-on experience to architect resilient systems that scale with customer demand.
- Lead post-incident reviews and blameless retrospectives, identifying systemic improvements and fostering a culture of continuous learning and operational excellence.
- Analyze system performance and advocate for cost-effective optimizations, balancing infrastructure efficiency with world-class service reliability.
What you'll bring
- 8+ years of experience engineering and scaling highly available, customer-facing systems with a focus on reliability and operational excellence.
- A proven ability to lead the design and implementation of resilient infrastructure and automation solutions that solve complex reliability challenges.
- Strong judgment in making architectural trade-offs, balancing long-term system health with short-term delivery needs.
- Deep expertise in distributed systems, service ownership models, CI/CD pipelines, and observability practices.
- Exceptional communication and collaboration skills, with a track record of influencing cross-functional teams and driving consensus on reliability strategies.
- Experience mentoring engineers in incident response, reliability patterns, and career growth within SRE disciplines.
- A curious mindset and eagerness to explore new technologies and domains that enhance customer support platforms at scale.
Preferred Qualifications
- Experience in site reliability engineering, site and system administration, infrastructure management, or related area.
- Master's degree in site reliability engineering, site and system administration, infrastructure management, or related area and 2 years' experience in site reliability engineering, site and system administration, infrastructure management, or related area.
- SRE certification (for example, IBM Cloud Site Reliability Engineer).
- Knowledge of accessibility best practices and WCAG 2.2 AA standards, assistive technologies, and integrating digital accessibility into products and services.
Benefits and Location
At Walmart, we offer competitive pay as well as performance-based bonus awards and other great benefits for a happier mind, body, and wallet. Health benefits include medical, vision and dental coverage. Financial benefits include 401(k), stock purchase and company-paid life insurance. Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting. Other benefits include short-term and long-term disability, company discounts, Military Leave Pay, adoption and surrogacy expense reimbursement, and more. You will also receive PTO and/or PPTO that can be used for vacation, sick leave, holidays, or other purposes. The amount you receive depends on your job classification and length of employment. It will meet or exceed the requirements of paid sick leave laws, where applicable.
For information about PTO, see One.Walmart.
Primary Location
2501 SE J St, Ste A, Bentonville, AR 72716-3724, United States of America