Job Title : Open Telemetry (SME) Consultant
Location : Remote
Duration : Long Term Contract
Job Description :
We are seeking an experienced monitoring tools and Open Telemetry Subject Matter Expert (SME) who will be responsible for designing, implementing, and optimizing monitoring solutions and leveraging Open Telemetry to enhance observability within the Enterprise Command Center (ECC).
The SME should collaborate with the Incident Management team to troubleshoot and resolve incidents.
Key Job Functions
Lead the design and implementation of monitoring solutions using industry-standard tools such as Splunk.
Customize monitoring configurations to align with organizational requirements.
Implement and integrate Open Telemetry across various applications and services for enhanced observability.
Optimize monitoring solutions for efficiency and accuracy, ensuring minimal impact on system performance.
Design and implement application and infrastructure performance monitoring in the AWS Cloud environment.
Create monitors and dashboards to track application and infrastructure performance.
Perform deep statistical analysis of performance data to help identify capacity and performance bottlenecks.
Configure alerting mechanisms within monitoring tools to proactively identify and address potential issues.
Develop comprehensive documentation for monitoring tool configurations, Open Telemetry implementations, and best practices.
Provide training to incident management teams on utilizing monitoring tools and interpreting Open Telemetry data effectively.
Set up monitoring dashboards for incident detection and alerting.
Perform end-to-end analysis of transactions in an observability environment.
Troubleshoot incidents and identify root causes quickly using wire data analytics, application performance management, and event correlation monitoring tools.
Diagnose and resolve incidents by providing factual data from the various monitoring and instrumentation systems.
Job Requirements :
A good understanding of IT cloud infrastructure, including AWS Cloud, middleware, database, storage, and/or network infrastructure.
Strong understanding of IT infrastructure, networking, security concepts, and application architecture.
Hands-on experience with Open Telemetry instrumentation and telemetry data collection.
Proven experience as a Splunk SME with in-depth knowledge of Splunk architecture and components.
Excellent troubleshooting and problem-solving skills.
Strong documentation skills and attention to detail.
Proactive monitoring of hardware, software, and environmental alerts or malfunctions.
Analyze dashboards and monitoring tools to look for trends and patterns in application / infrastructure health and performance.
Monitor applications and infrastructure using tools like Splunk, Dynatrace, Catchpoint, Moogsoft, xMatters, SignalFx, SolarWinds, ExtraHop, etc.
Expert understanding of microservice-based applications deployed in the cloud using Lambda, ECS, Fargate, etc.
Proficiency in AWS services like IAM roles, security groups, EC2, S3, Lambda, ALB, ECS, etc.
Experience working with AWS tools like ELB, RDS, Redshift, DynamoDB, Aurora, Route 53, Lambda, S3, Batch, CloudWatch, CloudTrail, WAF, etc.
Hands-on experience with transaction-level monitoring using Dynatrace and Splunk.
Create Splunk search queries and dashboards.
Be the SME in helping recognize and onboard new data sources into Splunk and other tools, analyze the data for anomalies and trends, and build dashboards highlighting the key trends in the data.
Implement best-in-class engineering strategies to support a distributed, clustered Splunk environment consisting of Search Heads, Indexers, Forwarders, and the Splunk Enterprise Security (ES) app, spanning security, performance engineering, and operational roles.
Use the open-source observability framework Open Telemetry for instrumenting, generating, collecting, and exporting telemetry data such as traces, metrics, and logs to help analyze application performance and behavior.
Use distributed tracing in an end-to-end visibility environment that consists of microservices, containers, serverless, and Lambda.
Work closely with application teams and business stakeholders to perform troubleshooting and aid in incident triage.
Influence other technical teams on incident calls and articulate troubleshooting steps effectively.
Follow up on items that could negatively impact production operations, assist with postmortem-related activities, and support various efforts related to operational improvements.
Strong relationship management skills and the aptitude to multitask and work well in a high-stress environment, both within teams and independently.
Preferred Qualifications
Familiarity with distributed tracing and logging solutions.
Knowledge of cloud platforms (AWS, Azure) and their integration with monitoring tools.
AWS Solutions Architect Associate or higher certification.
Exposure to working in an incident management environment.
Triage incidents to resolution in a 24 / 7 / 365 environment: effectively guide incident triage calls from a technical perspective, share technical details obtained from monitoring tools and dashboards to aid troubleshooting, outline details of resolution activities, provide timely status updates to stakeholders, assist with postmortem-related activities, and support various efforts related to operational improvements.
Ability to report incident details and metrics to senior leadership.
Perform analysis of data, evaluating multiple application protocols including web, database, storage, and supporting infrastructure such as UNIX, DNS, LDAP, SSL, SMTP, and FTP.
Proficient in scripting: UNIX / Linux shell scripting and Python. Working knowledge of JavaScript, Perl, etc. for customizing monitoring configurations.
Certification in relevant monitoring tools or Open Telemetry is a plus.
AWS, SHELL SCRIPTING, PERL, MIDDLEWARE, IT INFRASTRUCTURE, AZURE, JAVASCRIPT, SPLUNK, PYTHON, UNIX, SMTP