The role
We are now looking for a Site Reliability Engineer to ensure our systems run smoothly and reliably at scale. Your expertise in monitoring, observability, and system automation will help maintain the high availability and performance our customers depend on. You will work at the intersection of development and operations, using your technical skills to build robust infrastructure and streamline deployment processes.
Your mission will be to proactively identify and resolve system issues before they impact our customers. You will collaborate closely with development teams to implement monitoring solutions, create comprehensive alerting systems, and develop the tools needed to maintain system reliability. Initially, you will focus on enhancing our existing monitoring and alerting infrastructure, then gradually build self-healing systems and self-service capabilities that empower teams to diagnose and resolve issues independently.
As part of this role, you will:
- Design and implement comprehensive alerting systems that detect issues early and provide actionable insights to speed up their resolution.
- Collaborate with our development teams to ensure our observability stack provides clear visibility into system health and performance.
- Optimise on-call processes, including creating and maintaining detailed runbooks that enable efficient incident response and knowledge sharing across teams.
- Build self-healing systems using AI tools that automatically resolve common issues before they require human intervention.
- Develop automation tools and diagnostic capabilities that help teams quickly identify and resolve issues when manual investigation is required.
- Ensure secure and reliable code deployment processes through robust CI/CD pipelines and infrastructure automation.
- Join our 24/7 support rotation, which provides first-level platform support to ensure a great customer experience.
Requirements
You
We are looking for someone who is excited about building innovative solutions and wants to have a large impact in a smaller company; you will be a key part of defining Unitary’s future during this early stage of our new product strategy. We need versatile people who are happy to get stuck into whatever needs doing, and are ready to learn and grow with the company.
For this particular role, we need a collaborative engineer who excels at working across teams and can translate complex technical concepts into actionable solutions. You should be comfortable balancing your time between fixing urgent issues and investing in proactive system improvements. Communication is crucial, as you'll be working closely with multiple engineers and may need to coordinate during high-stress incident situations.
We would love to hear from you if you:
- Have worked with visualisation tools such as Grafana for creating and maintaining dashboards that provide meaningful insights into system performance
- Are proficient with metrics and telemetry tools such as Prometheus, InfluxDB, or OpenTelemetry for collecting and analysing system data
- Have experience with incident management tools such as Incident.io for coordinating response efforts and recording follow-up learnings and actions
- Can demonstrate strong problem-solving skills and the ability to work autonomously
- Are confident writing production code in languages such as Go or Python
- Thrive in a collaborative environment where group output and team achievements matter more than individual contributions
It would be even better, but not essential, if you:
- Have experience working in a fully remote, international team
- Have previous startup experience
- Have built Slack bots or similar automation tools to streamline team workflows
- Have experience with CI/CD platforms for building reliable deployment pipelines (e.g. GitLab CI, ArgoCD)
- Have worked with Kubernetes and infrastructure as code tools such as Terraform for scalable system deployment
- Are familiar with MLOps practices and tools, and with monitoring machine learning systems in production
This role reports to the VP of Engineering and can be based anywhere within three hours of the UK time zone.