Senior Site Reliability Engineer

BillingPlatform

Serbia Contract 5-10

Job description

BillingPlatform is an industry-leading, fast-growing SaaS company. Our award-winning, cloud-based revenue lifecycle management platform is leveraged by leading global enterprises to automate and streamline the entire quote-to-cash process. At BillingPlatform, our employees are our most valuable asset, and we believe deeply in a culture of collaboration, accountability, innovation, and transparency. We seek bright, enthusiastic, and creative professionals looking to be part of our incredible team focused on challenging the status quo and driving transformational value to customers.

Backed by leading private equity firms FTV Capital and Columbia Capital, we have achieved remarkable industry recognition for growth, including being listed for the fifth consecutive year on Deloitte’s Technology Fast 500™ list of fastest-growing technology companies and ranked on the Inc 5000 list for four years running. Our ability to innovate market-leading solutions has been validated by all major industry analyst firms, including being named a Leader in the first-ever Gartner® Magic Quadrant™ for Recurring Billing Applications, and being recognized as the Leader in Forrester Research’s “The Forrester Wave™: SaaS Recurring Billing Solutions.”

To learn more about us, visit billingplatform.com.

Responsibilities

- Own and improve on-call processes, incident response playbooks, and post-mortem culture
- Define, track, and manage SLOs, SLIs, and error budgets for critical services
- Lead blameless post-mortems and drive systematic reliability improvements
- Respond to production incidents and coordinate cross-functional resolution
- Design, build, and maintain scalable AWS infrastructure using IaC (Terraform, Pulumi)
- Manage Kubernetes clusters and containerized workloads in production
- Build and maintain CI/CD pipelines to improve deployment speed and reliability
- Evaluate and implement tooling to enhance developer productivity and system stability
- Implement monitoring, alerting, and distributed tracing (Prometheus, Grafana, Datadog, Jaeger)
- Identify and resolve performance bottlenecks across services, networks, and databases
- Build dashboards and runbooks for self-service operational insights
- Partner with engineering teams to embed reliability practices (load testing, capacity planning, chaos engineering)
- Conduct architecture reviews with a focus on reliability and operability

Qualifications

- 5+ years of experience in SRE, DevOps, or infrastructure engineering
- Deep expertise with AWS and cloud-native architectures
- Strong experience with Kubernetes and container orchestration at scale
- Hands-on experience with infrastructure-as-code tools (Terraform or Pulumi)
- Proficiency in Python, Go, or Bash
- Experience with observability tools (Prometheus, Grafana, Datadog, or similar)
- Strong understanding of SLOs, SLIs, and error budgets
- Experience with service mesh technologies (Istio, Linkerd)
- Familiarity with chaos engineering tools (Chaos Monkey, Gremlin, LitmusChaos)
- Background in Oracle database reliability and administration
- Contributions to open-source infrastructure projects
- Experience in a high-growth SaaS or product-led environment
- Excellent English communication skills (written and spoken)

Incentives

- Become a part of the team on global initiatives
- A high-impact role at a growing SaaS company that values personal growth, accountability, and teamwork
- A culture of open collaboration and problem-solving
- 100% remote
- Competitive pay

This position is based in Serbia and is not eligible for relocation.

BillingPlatform provides equal employment opportunity (EEO) to all persons regardless of age, color, national origin, citizenship status, physical or mental disability, race, religion, creed, gender, sex, pregnancy, sexual orientation, gender identity and/or expression, genetic information, marital status, status with regard to public assistance, veteran status, or any other characteristic protected by federal, state, or local law.