
Site Reliability Engineer - National Remote

Optum
Washington, District of Columbia, US
$70.2K-$137.8K a year
Remote
Full-time

Optum is a global organization that delivers care, aided by technology to help millions of people live healthier lives. The work you do with our team will directly improve health outcomes by connecting people with the care, pharmacy benefits, data and resources they need to feel their best.

Here, you will find a culture guided by diversity and inclusion, talented peers, comprehensive benefits and career development opportunities.

Come make an impact on the communities we serve as you help us advance health equity on a global scale. Join us to start Caring. Connecting. Growing together.

Check out the role overview below. If you are confident you have the right skills and experience, apply today.

As a Site Reliability Engineer (SRE) you will employ software engineering to automate critical IT operations tasks, including production system management, change management, and incident response.

You will be responsible for design review and control; prediction, estimation, and apportionment methodology; failure mode and effects analysis; the planning, operation, and analysis of reliability testing and field failures; and the development and administration of reliability information systems for failure analysis, design and performance improvement, and reliability program management over the entire product life cycle.

You will help ensure swift incident response and scalable emergency handling, fostering greater reliability and resilience in managing complex systems.

You will support our efforts in optimizing system performance and ensuring the reliability of our technology ecosystem.

You'll enjoy the flexibility to telecommute* from anywhere within the U.S. as you take on some tough challenges.

Primary Responsibilities :

  • System Reliability and Incident Management : Ensure the reliability, availability, and performance of services. Respond to, troubleshoot, and resolve service outages or degradation. Lead post-incident reviews and drive root cause analysis and mitigation.
  • Monitoring and Performance Tuning : Develop and maintain advanced monitoring and alerting systems to detect and mitigate issues proactively. Continuously measure and optimize system performance, identifying bottlenecks and points of failure.
  • Continuous Improvement : Advocate for and implement changes to improve system reliability and scalability. Innovate new ways to manage and automate operations tasks.
  • Collaboration and Advocacy : Work closely with development teams to incorporate best practices and influence architecture, code health, and operational processes. Promote a culture of shared responsibility for production stability and performance. Integrate SRE principles into the engineering workflow.
  • Capacity Planning and Scalability : Forecast and plan for the infrastructure needs. Implement scalable systems and resource allocation strategies to handle growth and peaks in demand.
  • Documentation and Knowledge Sharing : Create and maintain detailed documentation of the systems, processes, and procedures. Facilitate knowledge sharing through regular technical presentations and training sessions.

  • Configure, implement, and manage / optimize end-to-end APM solutions, with a focus on Dynatrace, AppDynamics, Splunk, or other relevant tools.
  • Work closely with IT teams to seamlessly integrate APM solutions into the existing infrastructure and applications.
  • Develop and maintain customized dashboards, reports, and alerts to offer real-time insights into the health and performance of the system.
  • Collaborate with diverse teams to understand business requirements and configure APM solutions to meet performance monitoring needs.
  • Conduct system analysis, troubleshooting, and optimization across various applications and infrastructure components.
  • Provide support to internal stakeholders and support teams regarding tweaking configurations, troubleshooting, and tool-specific nuances.
  • Manage performance continuously, measuring it and working with stakeholders to improve it.
  • Build quality frameworks that provide a feedback loop to stakeholders to ease and improve APM product management, system patching, and the implementation of security controls.
  • Document automation procedures to improve the velocity and quality of the effort.
  • Handle continuous performance management, software release management, configuration management, and transition to stakeholders.
  • Request feedback from teams, perform tool implementation assessments, and offer recommendations for improvements to enhance system reliability and responsiveness.

You'll be rewarded and recognized for your performance in an environment that will challenge you and give you clear direction on what it takes to succeed in your role as well as provide development for other roles you may be interested in.

Required Qualifications :

  • Must possess an industry-recognized Reliability Engineer Certification (CRE).
  • 4+ years' hands-on experience with scripting languages (e.g., Python, PowerShell) for automation and customization across various APM tools.
  • 4+ years' experience monitoring software performance in terms of service-level agreements (SLAs), service-level indicators (SLIs), and service-level objectives (SLOs).
  • 4+ years' experience with APM features such as real user monitoring, synthetic monitoring, and effective root cause analysis.
  • 4+ years' experience with one or more of the following platforms : Salesforce, Pega, Appian, Microsoft Power Platform.

Preferred Qualifications :

  • ITIL Foundation Certification.
  • Bachelor's Degree in computer science or equivalent technical degree.
  • Understanding of application architecture, infrastructure, and cloud environments.
  • Proficiency in configuring and customizing multiple APM tools like Dynatrace, Splunk, AppDynamics for optimal performance monitoring.
  • Additional certifications (e.g., Salesforce Developer, Quality Engineer Certification (CQE)) are highly desirable.

Soft Skills :

  • Ability to communicate both verbally and in written form.
  • Excellent communication skills to collaborate effectively with cross-functional teams and convey technical concepts to non-technical stakeholders.
  • Strong problem-solving skills, including the ability to analyze complex systems and identify performance bottlenecks.

Telecommuting Requirements :

  • Must have reliable internet service that allows for effective telecommuting.
  • Must be able to obtain and maintain a government security clearance.
  • All work must be conducted in the United States.
  • Must be eligible to work in the United States.
  • All Telecommuters will be required to adhere to UnitedHealth Group's Telecommuter Policy.

California, Colorado, Nevada, Connecticut, New York, New Jersey, Rhode Island, Hawaii, Washington, or Washington, D.C. Residents Only :

The salary range for California, Colorado, Nevada, Connecticut, New York, New Jersey, Rhode Island, Hawaii, Washington, or Washington, D.C. residents is $70,200 to $137,800 per year. Pay is based on several factors including but not limited to local labor markets, education, work experience, certifications, etc.

UnitedHealth Group complies with all minimum wage laws as applicable. In addition to your salary, UnitedHealth Group offers benefits such as a comprehensive benefits package, incentive and recognition programs, equity stock purchase, and 401k contribution (all benefits are subject to eligibility requirements).

No matter where or when you begin a career with UnitedHealth Group, you'll find a far-reaching choice of benefits and incentives.

Pursuant to the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.

Application Deadline : This will be posted for a minimum of 2 business days or until a sufficient candidate pool has been collected.

Job posting may come down early due to volume of applicants.

At UnitedHealth Group, our mission is to help people live healthier lives and make the health system work better for everyone.

We believe everyone, of every race, gender, sexuality, age, location and income, deserves the opportunity to live their healthiest life.

Today, however, there are still far too many barriers to good health which are disproportionately experienced by people of color, historically marginalized groups and those with lower incomes.

We are committed to mitigating our impact on the environment and enabling and delivering equitable care that addresses health disparities and improves health outcomes - an enterprise priority reflected in our mission.

Diversity creates a healthier atmosphere : UnitedHealth Group is an Equal Employment Opportunity / Affirmative Action employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, age, national origin, protected veteran status, disability status, sexual orientation, gender identity or expression, marital status, genetic information, or any other characteristic protected by law.

UnitedHealth Group is a drug-free workplace. Candidates are required to pass a drug test before beginning employment.


Remote working / work at home options are available for this role.

1 day ago