
Site Reliability Engineer (EMO Engineer)

Computer World Services (CWS) Corporation
Washington, DC, United States
Full-time

Job Description

The mission of the Office of Financial Research (OFR) is to support the Financial Stability Oversight Council (FSOC) in promoting financial stability by:

  • collecting data on behalf of FSOC;
  • providing such data to FSOC and member agencies;
  • standardizing the types and formats of data reported and collected;
  • performing applied research and essential long-term research;
  • developing tools for risk measurement and monitoring;
  • performing other related services;
  • making the results of the activities of the OFR available to financial regulatory agencies; and
  • assisting such member agencies in determining the types and formats of data authorized to be collected by such member agencies.

The Senior Systems Engineer - Observability (SSE) will define and implement infrastructure and application observability, set up governance, optimization, monitoring, and control for a consolidated common operating picture for IT operations.

The role will work with engineering, application, security operations, Service Desk, and enterprise / solution architects to develop and implement services, and to monitor, report, and automate where applicable.

This role serves as a subject matter expert in a complex array of full stack solutions. It is responsible for migrating feeds from Splunk to Cribl, onboarding new feeds, and providing Tier 3 support.

The role also works with vendors on open tickets, operates in an Agile environment, and uses enterprise change control systems.

This role serves as a subject matter expert performing research, analysis, design, creation, and implementation to meet current and future requirements.

The role is responsible for building, implementing, and operationalizing an enterprise observability strategy.

Key Tasks and Responsibilities

  • Design, implement, and maintain high-performance and scalable observability solutions in a cloud environment.
  • Collaborate with cross-functional teams to gather requirements, architect solutions, and deploy logging and monitoring environments that align with business needs.
  • Configuration and maintenance of Datadog integrations, including Webhooks, Amazon, Cisco, CrowdStrike, Cribl Stream, Container, VMware, SNMP, journald, Okta, Python, Zscaler, Microsoft 365, and Palo Alto.
  • Configuration of telemetry logs through Cribl Stream including syslog, SNMP traps, JSON, AWS CloudWatch, AWS S3.
  • Development of custom data / telemetry pipelines including Grok parsing, GeoIP parsing, field remapping, and error tracking.
  • Ingest telemetry logs directly from cloud SaaS providers such as Zscaler, Okta, CrowdStrike, ServiceNow, Microsoft 365.
  • Installation and configuration of the Datadog Agent and Datadog Synthetics Agent on Windows servers, Linux servers, and Docker / Kubernetes containers.
  • Configuration of the Datadog Agent to collect host logs, processes, custom metrics (including SNMP), and network performance monitoring (NPM).
  • Configuration of Synthetic testing to monitor infrastructure uptime SLAs and SLOs using private locations.
  • Configuration of service-related monitors based on metrics, logs, live processes, service checks, and anomalies / outliers, including monitoring of serverless workloads such as AWS Lambda functions.
  • Development of custom dashboards with a focus on reliability and performance of services.
  • Configuration and management of Service Catalog, including the definition of services and associated dashboards, monitors, SLOs, synthetic tests, metrics, and logs.
  • Configuration of incident management and service-based analytics, including integration with Jira and / or ServiceNow.
  • Maintain code repositories and versioning of any scripting or automation.
  • Provide technical leadership, oversight, governance, and direction for integrating with, and reporting on, observability pipelines.
  • Provide consultative services to support the application integrations required to be observed / monitored, such as Hadoop HDFS, Hadoop MapReduce, and Hive.
  • Identify opportunities for monitoring improvement, including incorporating APM and RUM monitoring.
  • Update documentation and user guides as needed.
  • Collaborate with cross-functional teams.
  • Configure monitors & alerts to integrate with Incident Management tools.

Education & Experience

  • Undergraduate degree in an engineering or computer science discipline and / or equivalent experience / certification.
  • 7+ years of experience in information technology with hands-on technical / engineering roles, including:
  • 2+ years of experience working with Datadog, including hands-on experience administering and supporting a Datadog migration or implementation.
  • 3+ years of experience with AWS.
  • 3+ years of data onboarding experience within a large-scale enterprise environment.
  • Experience in Datadog, including building dashboards, reports, and alerts to meet customer requirements.
  • Experience with Infrastructure & Monitoring as Code tools.
  • Experience configuring and supporting additional Datadog modules.
  • Solid understanding of networking and device configuration.
  • Experience with migrating from other monitoring platforms to Datadog.
  • Experience with Incident Response tools.
  • Knowledge of Agile and continuous integration practices.
  • Collaborative mindset that thrives in fast-paced environments.
  • Excellent verbal and written communication skills including the ability to author and present materials ranging from detailed technical specifications to high-level concepts for senior audiences.

Certifications

Preference given to candidates with Datadog, Cribl, and AWS certifications.

Security Clearance

  • Public Trust
  • Must be a U.S. citizen

Other (Travel, Work Environment, DoD 8570 Requirements, Administrative Notes, etc.)

This is a remote / work from home role.

Computer World Services is an affirmative action and equal employment opportunity employer. Current employees and / or qualified applicants will receive consideration for employment without regard to race, color, religion, sex, disability, age, sexual orientation, gender identity, national origin, protected veteran status, genetic information, or any other characteristic protected by local, state, or federal laws, rules, or regulations.

Computer World Services is committed to the full inclusion of all qualified individuals. As part of this commitment, Computer World Services will ensure that individuals with disabilities (IWD) are provided reasonable accommodations.

If reasonable accommodation is needed to participate in the job application or interview process, to perform essential job functions, and / or to receive other benefits and privileges of employment, please contact Aaron McClellan in Human Resources at 314.952.5138.
