Lead Site Reliability Engineer

🌐 Global IT

About Fintech Farm

We are a UK fintech creating successful neobanks in emerging markets in partnership with local traditional banks.
Our mission is to make banking services accessible, simple and fun to use worldwide. Our goal is to launch neobanks in 50+ markets, serving 100m+ customers.

Our success builds upon a best-in-class product, customer experience, emotional engagement, viral marketing and deep credit-decisioning expertise across our product suite covering credit, payments, savings and investments.
One of our founders also previously co-founded a highly successful Eastern European neobank with a multi-million customer base.

We launched our first market with Leobank in Azerbaijan in 2021, where we’ve already taken a leading market position.
Our next market was Vietnam, where we launched Liobank in early 2023 and have also gained strong traction.
We have several more markets on the roadmap in the next 12 months and are starting to build out teams there.

Why Fintech Farm is a Great Place to Be

Our Ambition

We aim to become a leading consumer digital bank brand in each market we operate in, making it easy for consumers to interact with their money.
You could be a part of this exciting journey.

Our Culture

Customers.
We always go above and beyond to provide an amazing customer experience.
We serve our customers the way we would want our mom to be served.
And who said that banking has to be boring? We make our apps not just easy but fun to use.

People.
We are all business partners in our company. Each of us thinks big, acts as if we own the place and never takes “no” for an answer.
We work with strong individuals whom we empower and trust rather than micromanage.
Common sense rather than formal policies prevails in all that we do.
We always stay curious and open-minded. We embrace a ‘we over me’ culture.

Your Role

As a Lead SRE, you will drive the reliability, scalability, and performance of our multi-market microservices infrastructure.
You’ll lead a team of engineers focused on automating operations, improving observability, and ensuring zero-downtime service delivery across our cloud and on-prem environments.
Your mission is to build resilient systems and empower development teams with the tools and practices needed to operate safely and efficiently at scale.

What You Will Be Doing

Build and define the SRE function, establishing best practices for reliability, observability, and incident management across the platform

Manage and optimize Kubernetes clusters (AWS EKS and on-prem), ensuring scalability, cost efficiency, and resilience

Oversee the observability and alerting stack, including Prometheus, Grafana, Alertmanager, ELK, and VictoriaMetrics

Implement and refine monitoring and alerting strategies, establishing actionable SLIs/SLOs and effective on-call processes

Drive improvements in infrastructure as code using Terraform/Terragrunt

Collaborate closely with software and DevOps teams to ensure production readiness and reliable CI/CD delivery pipelines

Participate in and enhance incident management processes, including post-mortems and continuous improvement initiatives

Lead efforts in security hardening, compliance, and cost optimization across environments

Contribute to the strategic planning of the infrastructure roadmap and technology evolution

Who You Are

A leader who takes ownership and inspires a reliability-focused culture

Obsessed with system stability, scalability, and measurable performance

Strong communicator who can translate technical concepts into clear direction

Calm under pressure, analytical in incident response, and proactive in prevention

Passionate about mentoring engineers and driving operational excellence

Your Experience

6+ years in DevOps/SRE roles, with at least 2 years in a technical leadership position

Deep expertise in Kubernetes (EKS and on-prem), Prometheus, Grafana, and alerting systems

Strong background in AWS and Infrastructure as Code (Terraform/Terragrunt)

Experience designing and maintaining CI/CD pipelines (GitLab CI/CD or GitHub Actions)

Proficiency in scripting languages (Python, Bash) and automation tooling (Ansible, Helm)

Familiarity with GitOps principles and tooling (Flux, Argo CD)

Solid understanding of networking, security, and observability practices

Proven ability to lead incident response and drive cross-functional reliability improvements

Exposure to DevSecOps standards, compliance, and audit processes (ISO 27001, SOC 2, PCI DSS)

What We Are Offering

Competitive salary (negotiable based on seniority and leadership scope)

Share options

Opportunity to shape the SRE function in a fast-scaling fintech start-up

A collaborative environment that values autonomy, innovation, and impact