Site Reliability Lead - Bangalore, India - iimjobs

    Full time
    Description

    Purpose:

    As a Site Reliability Engineering Lead, you will bridge the gap between Development, Cloud Platform Engineering Teams and Product Owners of different Digital Offerings.

    Defining and implementing SRE concepts with our teams, and aligning service quality with business objectives and user expectations, will be at the core of your responsibilities.


    Responsibilities:

    • Define and measure the reliability of the service using SLIs and SLOs, and minimize the risk of service degradation.
    • Enable the development team to bring new software or new features (Digital Offerings) to production as quickly as possible, while ensuring an agreed-upon level of IT operations performance and error risk in line with the service level agreements (SLAs).
    • Closely cooperate with different Product Owners, Site Reliability Engineers, and the Cloud Platform Teams to define processes to migrate between different Cloud Platforms while ensuring reliability for business offerings.
    • Work with multiple Site Reliability Engineers on operations and system administration tasks such as analyzing logs, performance tuning, applying patches, and testing production environments. Identify opportunities for, and drive the design and implementation of, end-to-end observability, alerting, self-healing, and automation capabilities to improve service health, manageability, and reliability.
    • Work with different stakeholders (POs, SREs, and the Platform Team) to define the Incident Management Process for responding to incidents, and drive postmortem reviews to improve service quality.
    • Work closely with the Dev and SRE teams to select appropriate observability and reliability metrics and to define SLIs and SLOs.
    • Define and drive observability for self-developed software and the managed cloud components by collecting appropriate observability data for insights and alerting, including setting up proper alerting for critical components.
    • Ensure application availability and responsiveness by setting up and maintaining the required documentation methods and tools. Build playbooks of troubleshooting techniques that SREs can use to identify and investigate issues effectively.
    • Handle resolution of blockers, escalation to stakeholders, and provisioning of resources.
    • Own availability, performance, and supportability targets for the service.
    • Author functional and technical documentation and remain current on relevant technologies and procedures.

    What you should have:

    • 8-12 years of relevant industry experience.
    • Minimum of 3 years as a Site Reliability Engineering Lead.
    • Minimum of 5 years' experience as a Site Reliability Engineer.
    • Minimum of 8 years' experience with cloud computing platforms like Azure and related services.
    • In-depth knowledge of system architecture, networking, and microservice-based distributed systems.
    • Expertise in designing and implementing reliable, scalable, and fault-tolerant systems using container and orchestration technologies like Docker and Kubernetes.
    • Proficiency in setting up and managing monitoring, alerting, and logging systems for early detection and resolution of issues in container orchestrators like Kubernetes, using tools like Prometheus, Grafana, OpenTelemetry Collector, or similar.
    • Hands-on experience in incident management, including incident response, troubleshooting, and postmortem analysis.
    • Proficiency in coding/scripting languages and tools commonly used in infrastructure automation and monitoring (such as Terraform).
    • Knowledge of best practices in disaster recovery planning and execution for cloud-based systems.
    • Ability to lead and mentor a team of SREs, providing guidance, support, and coaching.
    • Capability to advocate for SRE best practices and principles within the organization and drive cultural changes as needed.
    • Willingness to stay updated with the latest trends, tools, and technologies in the field of site reliability engineering.
    • Strong communication skills to effectively collaborate with cross-functional teams, including Software Developers, Product Owners, and Cloud Platform Engineers.