Site Reliability Engineer (SRE)

Job Details

Industry: Information Technology and Services
Location: Remote
Function: Engineering
Employment Type: Full-time
Work Experience Required: Mid-senior level
Education Required: Master's Degree

Role Overview

We are seeking a Site Reliability Engineer (SRE) to join our advanced engineering division, where innovation, resilience, and precision drive everything we build. In this role, you will architect and optimise highly available, cloud-native systems that support complex AI, data, and cybersecurity workloads.

The ideal candidate is a systems thinker who combines the depth of backend engineering with a reliability and automation mindset. You will work across multi-cloud environments, Kubernetes clusters, and observability pipelines to ensure zero-downtime operations and self-healing infrastructure for enterprise-scale applications.

Responsibilities

Design, implement, and manage resilient, scalable infrastructure across AWS, GCP, or Azure using Infrastructure-as-Code (IaC) tools such as Terraform or Pulumi.
Build and enhance monitoring, alerting, and observability frameworks using Prometheus, Grafana, ELK, or OpenTelemetry to achieve proactive fault detection.
Collaborate with development, AI/ML, and DevOps teams to automate reliability workflows, improve CI/CD pipelines, and eliminate repetitive operational tasks.
Define and maintain SLAs, SLOs, and SLIs for critical systems to ensure consistent performance and reliability.
Drive incident response and root cause analysis (RCA), using automation and predictive analytics to prevent recurrence.
Optimise resource utilisation and cost efficiency while maintaining system integrity and performance.
Contribute to long-term architectural decisions that advance our autonomous and intelligent infrastructure goals.

Requirements

4+ years of proven experience in SRE, DevOps, or Systems Engineering within complex, high-traffic environments.
Strong expertise in Kubernetes, Docker, CI/CD pipelines, and infrastructure automation tools (Terraform, Ansible, etc.).
Advanced knowledge of Linux systems, networking, and cloud platforms (AWS/GCP/Azure).
Hands-on experience with observability tools (Grafana, Prometheus, ELK, Datadog, etc.).
Proficiency in scripting or automation languages such as Go, Python, or Bash.
Understanding of distributed systems design, performance tuning, and incident management.
Strong analytical mindset with an emphasis on preventing root causes rather than reacting to incidents.
Excellent communication skills and the ability to collaborate effectively with cross-functional teams.

Benefits

Competitive salary with equity options.
Opportunity to work on cutting-edge, high-impact projects at the intersection of AI, security, and infrastructure.
Professional development allowance and access to enterprise-grade cloud resources.
Inclusive, innovation-driven engineering culture with a focus on long-term growth and autonomy.
