Site Reliability Engineering Manager - Hyderabad, India - SID Global Solutions

    Description

    Dear Candidate,

    Greetings

    We are looking to fill an SRE Manager role immediately. Candidates must have experience in SRE, GCP, Kubernetes, and AWS. Please find the details below. If interested, please send me your updated resume. We are looking for an immediate joiner.

    Responsibilities:

    Team Leadership:

    Lead and mentor a team of Site Reliability Engineers to achieve operational excellence and reliability goals.

    Foster a culture of collaboration, innovation, and accountability within the team.

    Set clear goals and expectations and provide regular feedback and performance evaluations.

    Reliability Engineering:

    Design, implement, and maintain monitoring, alerting, and incident response systems to ensure the reliability and availability of services.

    Develop and implement best practices for incident management, post-mortem analysis, and proactive issue resolution.

    Collaborate with development teams to ensure that reliability and scalability are built into the design and architecture of new systems.

    Infrastructure Automation:

    Drive the automation of infrastructure provisioning, configuration management, and deployment processes to improve efficiency and reduce manual effort.

    Implement infrastructure as code practices using tools such as Terraform, Ansible, or similar technologies.

    Continuously evaluate and adopt new technologies and tools to streamline operations and improve reliability.

    Performance Optimization:

    Identify performance bottlenecks and optimization opportunities in our systems and services.

    Work closely with development teams to optimize application performance and resource utilization.

    Conduct capacity planning and scalability testing to ensure that our systems can handle projected growth and traffic spikes.

    Incident Management:

    Lead incident response efforts during service outages and critical incidents.

    Coordinate with cross-functional teams to quickly diagnose and resolve issues, minimize downtime, and prevent recurrence.

    Conduct thorough post-mortem analyses to identify root causes and implement preventative measures.

    Qualifications:

    Strong background in Linux/Unix systems administration and networking fundamentals.

    Proficiency in at least one programming language and experience with scripting and automation.

    Experience with cloud computing platforms (GCP) and container orchestration technologies (Kubernetes).

    Deep understanding of reliability engineering principles, including monitoring, alerting, incident response, and post-mortem analysis.

    Excellent communication and interpersonal skills, with the ability to collaborate effectively with cross-functional teams.

    Strong analytical and problem-solving skills, with a focus on continuous improvement and operational efficiency.

    Preferred Qualifications:

    Certification in relevant technologies (e.g., Google Cloud Professional Cloud DevOps Engineer, Certified Kubernetes Administrator).

    Experience with CI/CD pipelines and continuous deployment practices.

    Knowledge of infrastructure security best practices and compliance standards.

    Experience with microservices architecture and distributed systems.

    Familiarity with agile software development methodologies and practices.