Sr Monitoring & Observability Engineer, Los Angeles (On-Site)
DAI Companies · Los Angeles, CA, United States
Los Angeles, CA, United States · Exp: 3+ yrs · 115,000-125,000 USD/yearly · Onsite
Remuneration
115,000-125,000 USD/yearly
Location
Los Angeles, CA, United States
Visa sponsorship
Not specified
Job summary
The Senior Monitoring & Observability Engineer is an infrastructure and reliability engineering role responsible for designing, implementing, optimizing, and supporting enterprise monitoring and observability platforms across networks, systems, cloud environments, and critical business applications. The position combines observability engineering, cloud and infrastructure operations, automation, and incident management, and includes ownership of monitoring tools. The role partners with Infrastructure, Security, DevOps, and IT Operations teams to improve system reliability, alert quality, operational efficiency, and service availability while supporting SRE-aligned practices.
Benefits
10% yearly bonus target
Qualifications
- Bachelor’s degree in IT, Computer Science, Networking, or a related field, or equivalent work experience.
- 3+ years of experience in IT operations, network monitoring, or system administration.
- Hands-on experience implementing and tuning enterprise monitoring/observability platforms.
- Demonstrated experience building or implementing one or more of: Datadog, Dynatrace, AppDynamics, Splunk, SolarWinds Orion, Orion DPA, Nagios, PRTG, or Zabbix.
- Advanced understanding of network protocols (TCP/IP, BGP, OSPF, VLANs, VPN, DNS, DHCP).
- Proficiency in Windows/Linux environments.
- Proficiency in at least one major cloud platform (AWS, Azure, or GCP).
- Familiarity with ITIL best practices for incident, problem, and change management.
- Scripting and automation experience using Python, PowerShell, Bash, Ansible, or similar tools.
- Working knowledge of cybersecurity best practices, firewall configurations, and SIEM tools.
- Strong leadership, communication, and collaboration skills.
- Ability to translate monitoring data into clear operational action across cross-functional teams.
- Ability to work in a high-stress, dynamic environment while handling multiple high-priority incidents.
Responsibilities
- Monitor and manage IT infrastructure, network systems, and business applications using enterprise monitoring tools.
- Serve as the first point of escalation for TOC Engineers, providing advanced troubleshooting, guidance, and root cause analysis.
- Lead or support incident response, root cause analysis, escalation, and post-incident review processes.
- Ensure issues are properly classified, escalated, and resolved efficiently.
- Take key roles in ITIL Incident, Problem, and Change Management processes.
- Build and tune monitoring and observability tooling, including instrumentation, integrations, dashboards, alert logic, synthetic checks, log pipelines, and APM configuration.
- Develop and implement automation scripts and tooling to improve operational efficiency, alerting quality, and response times.
- Analyze system logs, network traffic, event data, and performance metrics to identify trends, reduce alert noise, and prevent outages.
- Document monitoring standards, troubleshooting steps, system configurations, dashboards, and runbooks for knowledge sharing.
- Collaborate with IT, Security, and DevOps teams to maintain system reliability and security posture.
- Work with vendors and service providers to resolve tool, platform, and infrastructure issues.
- Participate in 24/7 on-call rotations and provide leadership during major incidents, coordinating cross-functional resolution efforts.
- Mentor junior TOC/NOC engineers on monitoring tools, dashboards, alert handling, and incident response practices.
Skills
Ansible, AppDynamics, AWS, Azure, Bash, Datadog, Dynatrace, GCP, Linux, PowerShell, Python, Splunk, Windows, Windows Server
Certifications
- CompTIA A+
- CompTIA Network+
- CompTIA Security+
- Microsoft Fundamentals Certifications (Azure, M365, or Windows Server)
- AWS Cloud Practitioner
- Azure Fundamentals
- ITIL Foundation certification
- Datadog certification
- Splunk certification
- Dynatrace certification
- AppDynamics certification
- SolarWinds certification
Work schedule
24/7 on-call rotations
Industry
Global equity markets, Health care, Financial services, Digital news, Insurance
Relocation
No