Site Reliability Engineer - Bengaluru, India - Tata Technologies



    Description

    Requirements:

    • Proven experience as a Site Reliability Engineer or in a similar role, with a strong focus on troubleshooting production issues.


    • Proficiency in programming languages like Python, Java, or Go, with the ability to write clean, efficient, and maintainable code.


    • Familiarity with observability tools and technologies such as Prometheus, Grafana, ELK stack, or similar.


    • Knowledge of cloud platforms such as AWS or Azure, and hands-on experience with containerization technologies such as Docker and Kubernetes.


    • Ability to work in a fast-paced environment and handle multiple incidents effectively.


    • Excellent communication and collaboration skills to work closely with product teams and other stakeholders.

    Preferred Skills:


    • Previous experience in Reliability Engineering, SRE practices, and implementing error budgets.


    • Familiarity with Chaos Engineering and conducting game days for proactive resilience testing.


    • Knowledge of CI/CD pipelines and experience with infrastructure-as-code tools like Terraform or CloudFormation.

    Key Responsibilities:


    • Responding to and resolving production issues promptly, ensuring minimal disruption to services.


    • Collaborating with cross-functional teams, including product, engineering, and operations, to identify and resolve critical issues.


    • Using strong programming skills to develop tools and automation for monitoring, logging, and incident response.


    • Implementing best practices in observability, such as setting up monitoring, alerting, and logging to proactively detect and mitigate potential issues.


    • Contributing to the improvement of incident management processes and post-incident reviews to enhance system reliability.


    • Participating in on-call rotations and proactively preventing incidents through continuous improvement.