Infrastructure & Cloud

Medior Site Reliability Engineer

Would you like to dig into the backbone we rely on?

We are looking for an experienced Site Reliability Engineer (SRE) to join the Engineering Chapter team and help ensure the reliability, scalability, and performance of critical on-premises services within the ERA product organization.

In this role, you'll focus on building and maintaining a modern observability platform, implementing monitoring best practices, and automating operational processes. Working closely with cross-functional engineering teams, you'll help improve system resilience, reduce incident response times, and ensure the availability of business-critical services.

If you're passionate about observability, automation, and operational excellence, this opportunity is for you.


Role

Observability & Monitoring

  • Design, implement, and maintain enterprise monitoring solutions.
  • Build intuitive Grafana dashboards and visualizations.
  • Configure meaningful alerts to proactively detect issues.
  • Implement distributed tracing and centralized log aggregation.
  • Define and maintain Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
  • Continuously improve monitoring coverage and platform visibility.

Infrastructure & Reliability

  • Manage and optimize on-premises monitoring infrastructure.
  • Ensure platform reliability, scalability, and high availability.
  • Support Linux-based environments and troubleshoot infrastructure issues.
  • Participate in 24/7 on-call rotations for incident response.
  • Contribute to reducing Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR).

Automation & DevOps

  • Automate deployment, configuration, and operational tasks.
  • Develop automation scripts using Python, Bash, or Go.
  • Improve infrastructure management through automation and standardization.
  • Support Infrastructure as Code and operational best practices.

Collaboration

  • Work closely with development teams to improve application instrumentation.
  • Promote observability best practices across engineering teams.
  • Balance technical improvements with business priorities.
  • Contribute to continuous improvement initiatives within the Engineering Chapter.

Security & Compliance

  • Ensure monitoring solutions comply with enterprise security standards.
  • Maintain secure on-premises monitoring environments.
  • Support compliance and governance requirements.

Profile

Core Technical Skills

  • Advanced experience with Grafana
  • Strong expertise in Prometheus and PromQL
  • Hands-on experience with OpenTelemetry
  • Experience with Elasticsearch
  • Strong Linux system administration skills
  • Good understanding of networking fundamentals
  • Experience securing on-premises infrastructure

Programming & Automation

Experience with one or more of:

  • Python
  • Bash
  • Go

Experience

  • 3+ years of experience in monitoring, observability, or Site Reliability Engineering.
  • At least 2 years of hands-on experience with Grafana and Prometheus in production environments.
  • Strong experience supporting Linux-based production systems.
  • Proven experience managing enterprise on-premises infrastructure.
  • Experience participating in 24/7 operational support or on-call rotations.

Security

  • Understanding of enterprise security practices.
  • Experience working within compliance-driven environments.

Who You Are

  • Passionate about reliability, automation, and operational excellence.
  • Analytical with strong troubleshooting skills.
  • Comfortable working in production-critical environments.
  • Able to prioritize effectively and balance technical improvements with business needs.
  • Collaborative and proactive in working with cross-functional teams.
  • Committed to continuous improvement and knowledge sharing.


Offer

Long-term freelance contract


What You'll Help Deliver

As a Site Reliability Engineer, you'll contribute directly to:

  • Improved platform reliability and system availability.
  • Reduced MTTD (Mean Time to Detect) and MTTR (Mean Time to Recover).
  • Comprehensive observability across critical services.
  • Automated deployment, monitoring, and operational processes.
  • Secure and compliant monitoring infrastructure supporting business-critical applications.

Benefits

  • 3 days of remote work

At Sander, we treat every application in strict confidence.