Site Reliability Engineer - Bengaluru, India - Tata Technologies
Description
Requirements:
• Proven experience as a Site Reliability Engineer or in a similar role, with a strong focus on troubleshooting production issues.
• Proficiency in programming languages like Python, Java, or Go, with the ability to write clean, efficient, and maintainable code.
• Familiarity with observability tools and technologies such as Prometheus, Grafana, ELK stack, or similar.
• Knowledge of cloud platforms like AWS or Azure, and hands-on experience with containerization technologies like Docker and Kubernetes.
• Ability to work in a fast-paced environment and handle multiple incidents effectively.
• Excellent communication and collaboration skills to work closely with product teams and other stakeholders.
Preferred (Beneficial) Skills:
• Previous experience in Reliability Engineering and SRE practices, and in implementing error budgets.
• Familiarity with Chaos Engineering and conducting game days for proactive resilience testing.
• Knowledge of CI/CD pipelines and experience with infrastructure-as-code tools like Terraform or CloudFormation.
Key Responsibilities:
• Responding to and resolving production issues promptly, ensuring minimal disruption to services.
• Collaborating with cross-functional teams, including product, engineering, and operations, to identify and resolve critical issues.
• Leveraging your strong programming skills to develop tools and automation for monitoring, logging, and incident response.
• Implementing best practices in observability, such as setting up monitoring, alerting, and logging to proactively detect and mitigate potential issues.
• Contributing to the improvement of incident management processes and post-incident reviews to enhance system reliability.
• Participating in on-call rotations and being proactive in preventing incidents through continuous improvement.