Systems Reliability Engineer
Tain
Responsibilities:
- Enhance and maintain monitoring of metrics, logs and tracing (Grafana, Prometheus, ELK, OpenTelemetry & more)
- Build automation scripts for component restarts (Jenkins, Ansible & more)
- Proactively monitor system performance, identify potential issues, and implement preventive measures.
- Act as a mentor to Technical Support Engineers in these specialized areas.
- Gain a solid understanding of the live casino platform to assist with deployments, troubleshooting and business-as-usual (BAU) tasks.
- Join the 24/7 shift rota – Day, night, rest, off, repeat.
- Communicate effectively with customers and internal stakeholders such as DevOps, studio techs, Corporate IT and Customer account management.
- Respond to and resolve incidents, minimizing downtime and ensuring system stability.
- Collaborate with other IT departments to ensure seamless integration of new systems and services.
- Participate in the evaluation and adoption of new SRE tools
Requirements:
- 2+ years of experience in SRE.
- Strong understanding of Linux/Unix operating systems
- Familiarity with scripting languages such as Python.
- Experience with automation tools such as Ansible and Terraform.
- Familiarity with CI/CD concepts and tools such as Jenkins or GitLab CI/CD
- Strong problem-solving and troubleshooting skills
- Experience with hyperconverged systems and hypervisors such as VMware, including end-to-end planning, execution, monitoring and troubleshooting
- Excellent communication and teamwork skills
- Eager to learn and adapt to new technologies and approaches
- Passion for the iGaming industry and understanding of its unique challenges and opportunities