Site Reliability Engineer (SRE)
Job Details
| Industry | Information Technology and Services | Location | Remote |
|---|---|---|---|
| Function | Engineering | Employment Type | Full-time |
| Experience Required | Mid-senior level | Education Required | Master's Degree |
Role Overview
We are seeking a Site Reliability Engineer (SRE) to join our advanced engineering division, where innovation, resilience, and precision drive everything we build. In this role, you will architect and optimise highly available, cloud-native systems that support complex AI, data, and cybersecurity workloads.
The ideal candidate is a systems thinker who combines the depth of backend engineering with a reliability and automation mindset. You will work across multi-cloud environments, Kubernetes clusters, and observability pipelines to ensure zero-downtime operations and self-healing infrastructure for enterprise-scale applications.
Responsibilities
- Design, implement, and manage resilient, scalable infrastructure across AWS, GCP, or Azure using Infrastructure-as-Code (IaC) tools such as Terraform or Pulumi.
- Build and enhance monitoring, alerting, and observability frameworks using Prometheus, Grafana, ELK, or OpenTelemetry to achieve proactive fault detection.
- Collaborate with development, AI/ML, and DevOps teams to automate reliability workflows, improve CI/CD pipelines, and eliminate repetitive operational tasks.
- Define and maintain SLAs, SLOs, and SLIs for critical systems to ensure consistent performance and reliability.
- Drive incident response and root cause analysis (RCA), using automation and predictive analytics to prevent recurrence.
- Optimise resource utilisation and cost efficiency while maintaining system integrity and performance.
- Contribute to long-term architectural decisions that advance our autonomous and intelligent infrastructure goals.
Requirements
- 4+ years of proven experience in SRE, DevOps, or systems engineering within complex, high-traffic environments.
- Strong expertise in Kubernetes, Docker, CI/CD pipelines, and infrastructure automation tools (Terraform, Ansible, etc.).
- Advanced knowledge of Linux systems, networking, and cloud platforms (AWS/GCP/Azure).
- Hands-on experience with observability tools (Grafana, Prometheus, ELK, Datadog, etc.).
- Proficiency in scripting or automation languages such as Go, Python, or Bash.
- Understanding of distributed systems design, performance tuning, and incident management.
- Strong analytical mindset with an emphasis on preventing incidents at their root cause rather than reacting to them.
- Excellent communication skills and the ability to collaborate effectively with cross-functional teams.
Benefits
- Competitive salary with equity options.
- Opportunity to work on cutting-edge, high-impact projects at the intersection of AI, security, and infrastructure.
- Professional development allowance and access to enterprise-grade cloud resources.
- Inclusive, innovation-driven engineering culture with a focus on long-term growth and autonomy.