Site Reliability Engineer (SRE)

Job Details

Industry: Information Technology and Services
Location: Remote
Function: Engineering
Employment Type: Full-time
Work Experience Required: Mid-senior level
Education Required: Master's Degree

Role Overview

We are seeking a Site Reliability Engineer (SRE) to join our advanced engineering division, where innovation, resilience, and precision drive everything we build. In this role, you will architect and optimise highly available, cloud-native systems that support complex AI, data, and cybersecurity workloads.

The ideal candidate is a systems thinker who combines the depth of backend engineering with a reliability and automation mindset. You will work across multi-cloud environments, Kubernetes clusters, and observability pipelines to ensure zero-downtime operations and self-healing infrastructure for enterprise-scale applications.

Responsibilities

  • Design, implement, and manage resilient, scalable infrastructure across AWS, GCP, or Azure using Infrastructure-as-Code (IaC) tools such as Terraform or Pulumi.
  • Build and enhance monitoring, alerting, and observability frameworks using Prometheus, Grafana, ELK, or OpenTelemetry to achieve proactive fault detection.
  • Collaborate with development, AI/ML, and DevOps teams to automate reliability workflows, improve CI/CD pipelines, and eliminate repetitive operational tasks.
  • Define and maintain SLAs, SLOs, and SLIs for critical systems to ensure consistent performance and reliability.
  • Drive incident response and root cause analysis (RCA), using automation and predictive analytics to prevent recurrence.
  • Optimise resource utilisation and cost efficiency while maintaining system integrity and performance.
  • Contribute to long-term architectural decisions that advance our autonomous and intelligent infrastructure goals.

Requirements

  • Proven experience (4+ years) in SRE, DevOps, or Systems Engineering within complex, high-traffic environments.
  • Strong expertise in Kubernetes, Docker, CI/CD pipelines, and infrastructure automation tools (Terraform, Ansible, etc.).
  • Advanced knowledge of Linux systems, networking, and cloud platforms (AWS/GCP/Azure).
  • Hands-on experience with observability tools (Grafana, Prometheus, ELK, Datadog, etc.).
  • Proficiency in scripting or automation languages such as Go, Python, or Bash.
  • Understanding of distributed systems design, performance tuning, and incident management.
  • Strong analytical mindset, with an emphasis on addressing root causes to prevent recurrence rather than reacting to incidents.
  • Excellent communication skills and the ability to collaborate effectively with cross-functional teams.

Benefits

  • Competitive salary with equity options.
  • Opportunity to work on cutting-edge, high-impact projects at the intersection of AI, security, and infrastructure.
  • Professional development allowance and access to enterprise-grade cloud resources.
  • Inclusive, innovation-driven engineering culture with a focus on long-term growth and autonomy.