    Senior Site Reliability Engineer - Bengaluru, India - NVIDIA

    NVIDIA
    NVIDIA Bengaluru, India

    Found in: Talent IN C2 - 3 days ago

    Full time
    Description

    NVIDIA has been redefining computer graphics, PC gaming, and accelerated computing for more than 25 years. It's a unique legacy of innovation that's motivated by outstanding technology and amazing people. Today, we're tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. NVIDIA is at the forefront of generative AI models, from language to images. Doing what's never been done before takes vision, innovation, and the world's best talent. As an NVIDIAN, you'll be immersed in a diverse, encouraging environment where everyone is inspired to do their best work.

    NVIDIA is looking for a Senior Site Reliability Engineer (SRE) to join its cloud service team to support, triage, and build generative AI-powered visual applications. Because SREs are responsible for the big picture of how our systems relate to each other, we use a breadth of tools and approaches to tackle a broad spectrum of problems. We live the SRE practices that are key to product quality, such as limiting time spent on reactive operational work, blameless postmortems, proactive identification of potential outages, and iterative improvements, all of which make for exciting and multifaceted day-to-day work. The person in this position will be responsible for service response and workflow and will drive tools and service development to maintain and improve service SLOs. We partner with Service Owners to drive the reliability of the service.

    What you will be doing:

    • Support and work on groundbreaking Generative AI inferencing workloads running in a globally-distributed heterogeneous environment spanning 60+ edge locations plus all major cloud service providers. Ensure the best possible performance and availability on current and next-generation GPU architectures.
    • Collaborate closely with the service owner, architecture, research, and tools teams at NVIDIA to achieve ideal results for AI problems at hand.
    • Monitor and support critical high-performance, large-scale services running across multiple clouds.
    • Participate in the triage & resolution of sophisticated infra-related issues.
    • Maintain services once live by measuring and monitoring availability, latency, and overall system health using metrics, logs, and traces.
    • Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
    • Practice balanced incident response and blameless postmortems.
    • Be part of an on-call rotation to support production systems and lead significant production improvement around tooling, automation, and process.
    • Architect, design, and code using your expertise to optimize, deploy and productize services.

    What we need to see:

    • 8+ years of experience operating & owning end-to-end availability and performance of mission-critical services in a live-site production environment, either as an SRE or Service Owner.
    • BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
    • Solid understanding of containerization and microservices architecture, plus an excellent understanding of the Kubernetes (K8s) ecosystem and its best practices.
    • Ability to dissect complex problems into simple sub-problems and use available solutions to resolve them.
    • Technical leadership beyond development that includes scoping, requirements capturing, leading and influencing multiple teams of engineers on broad development initiatives.
    • Lead significant production activities, including change management, post-mortem reviews, workflow processes, software design, and delivering software automation in various languages (Python or Go) and technologies (CI/CD auto-remediation, alert correlation).
    • Excellent understanding of SLOs/SLIs, error budgeting, and KPIs, and of configuring them for highly sophisticated services.
    • Experience with the ELK and Prometheus stacks as a power user and administrator.
    • Excellent understanding of cloud environments and technologies, especially AWS, Azure, GCP, or OCI.
    • Proven strengths in identifying, mitigating, and root-causing issues while continuously seeking ways to drive optimization, efficiency, and the bottom line.

    Ways to stand out from the crowd:

    • Exposure to containerization and cloud-based deployments for AI models.
    • Excellent coding skills in Python, Go, or a similar language.
    • Prior experience driving production issues and helping with on-call support and understanding of Deep Learning / Machine Learning / AI.
    • Experience with CUDA, PyTorch, TensorRT, TensorFlow, and/or Triton, as well as experience with StackStorm and similar automation platforms, is a bonus.
    • Understanding of observability instrumentation techniques and best practices, including OpenTelemetry.

    NVIDIA is widely considered to be one of the technology world's most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you.


  • ViewSonic

    Site Reliability Engineer

    Found in: Talent IN 2A C2 - 3 days ago


    ViewSonic Bengaluru, India

    Job Requirements: · Bachelor's degree in Computer Science, Engineering, or a related field. · 3+ years of experience as a Site Reliability Engineer, DevOps Engineer, or similar role. · Proficient in AWS solutions including but not limited to EC2, S3, CloudWatch, Lambda, and RDS. ...

  • MethodHub

    Database Reliability Engineer

    Found in: Appcast Linkedin IN C2 - 3 days ago


    MethodHub Bengaluru, India

    Database Reliability Engineer (DBRE) · Location: Bengaluru, Noida · Looking for strong DB Reliability Engineering candidates (4-10 yrs band) · Must have strong skillset on MySQL DBA + Linux OS + Automation tools (Chef/Ansible/Shell scripting) · Hands on experience on High Availab ...

  • Integra Connect

    Site Reliability Engineer

    Found in: Talent IN 2A C2 - 3 days ago


    Integra Connect Bangalore Urban, India

    About IntegraConnect · Integra Connect delivers a comprehensive, integrated suite of cloud-based technologies and services that enable specialty groups to optimize clinical and financial performance as reimbursement shifts to value-based models. Connected by the IntegraCloud plat ...

  • Ensono

    Site Reliability Engineer

    Found in: Talent IN 2A C2 - 2 days ago


    Ensono Bengaluru, India

    About Role · Ensono is continuing its growth and building a cloud-native managed service offering for our clients. We are looking for energetic and skilled remote Site Reliability Engineers to join us on this exciting new journey. As a Site Reliability Engineer, you and your team ...

  • Cyitechsearch

    Site Reliability Engineer

    Found in: Talent IN 2A C2 - 4 days ago


    Cyitechsearch Bengaluru, India

    We are hiring for Site Reliability Engineer · Skills : · - Develop and provide operational support for full-stack software applications. · - Relevant industry certifications, such as through the Site Reliability Engineering (SRE) Foundation. · - Five years' experience as a site ...

  • TERRAGIG LLP

    Site Reliability Engineer

    Found in: Talent IN 2A C2 - 3 days ago


    TERRAGIG LLP Bengaluru, India

    Role : Site Reliability Engineer · Experience : 5+ Years · Work Model : Remote / Contract 3 years · Skills : · - Develop and provide operational support for full-stack software applications. · - Relevant industry certifications, such as through the Site Reliability Engineering ...

  • ViewSonic

    Site Reliability Engineer

    Found in: Appcast Linkedin IN C2 - 3 days ago


    ViewSonic Bengaluru, India

    Job Requirements: · Bachelor's degree in Computer Science, Engineering, or a related field. · 1+ year of experience in a relevant role, such as Site Reliability Engineer, DevOps Engineer, or similar, is preferred but not mandatory. · Basic understanding of AWS solutions including ...

  • PhonePe

    Site Reliability Engineer

    Found in: Appcast Linkedin IN C2 - 3 days ago


    PhonePe Bangalore Urban, India

    JOB DESCRIPTION: We are looking for engineers who are passionate about reliability, performance, and efficiency, and with experience in building tools, services, and automation to manage and improve production services. · Systems internals/security, Linux, Network, and Monitorin ...

  • Larsen & Toubro

    Site Reliability Engineer

    Found in: Talent IN 2A C2 - 3 days ago


    Larsen & Toubro Bengaluru, India

    EXP: 5 to 8 Years · Location: Pune, Bangalore, Hyderabad, Chennai · Primary Skills: · Site Reliability Engineering (SRE) · Application Support on Middleware tools like Apache, WebSphere, Tibco, JMS, RabbitMQ, etc. · Automation using tools like Ansible, Chef, etc.; familiarity with ...

  • Solugenix

    Site Reliability Engineer

    Found in: Talent IN 2A C2 - 1 day ago


    Solugenix Bengaluru, India

    Job Title: SRE Cloud Engineer · Location: Hyderabad / Bangalore · Shifts: 24/5 · Exp: 5+ Years · Job Summary: Cloud Engineer is primarily responsible for working hands-on with various AWS services like EC2, RDS, EKS, ECS, S3, VPC, Route53, Lambda, CodePipeline, etc., and should ha ...

  • Zyoin group

    Database Reliability Engineer

    Found in: Talent IN 2A C2 - 3 days ago


    Zyoin group Bangalore/Hyderabad, India permanent

    We are seeking a highly skilled and experienced Database Reliability Engineer (DBRE) to join our team and play a crucial role in ensuring the performance, scalability, and high availability of our customer database services on the Tessell Platform. · Minimum Requirements : · year ...

  • Waytogo Consultants

    Site Reliability Engineer

    Found in: Talent IN 2A C2 - 4 days ago


    Waytogo Consultants Bangalore, India permanent

    Job Description : · As an SRE Lead (Site Reliability Engineering Lead), you will play a crucial role in ensuring the reliability, scalability, and performance of our systems and services. · He/She will lead a team of SREs (Site Reliability Engineers) and collaborate closely wit ...

  • Talent500

    Site Reliability Engineer

    Found in: Talent IN 2A C2 - 3 days ago


    Talent500 Bengaluru, India

    Job Description : · Cloud Engineer - Site Reliability Engineering for Ford Credit Tech · We're passionate about building software that solves problems. We count on our Site Reliability Engineers (SREs) to empower our users with a rich feature set, high availability, and stellar pe ...

  • Kunato AI

    Site Reliability Engineer

    Found in: Talent IN 2A C2 - 4 days ago


    Kunato AI Bangalore, India permanent

    Site Reliability Engineer (SRE) Python/Golang. · Job Description : · We are seeking a highly skilled and passionate Site Reliability Engineer (SRE) to join our technology team. The ideal candidate will possess strong programming skills with expertise in Python, Golang, or both. ...

  • RENUZA TECHNOLOGIES PRIVATE LIMITED

    Site Reliability Engineer

    Found in: Talent IN 2A C2 - 4 days ago


    RENUZA TECHNOLOGIES PRIVATE LIMITED Bangalore, India permanent

    Job Description : · Site Reliability Engineers (SREs) are responsible for ensuring the reliability and performance of production systems at Renuza Technologies. · They wear many hats, encompassing troubleshooting, software development, system administration, infrastructure manag ...

  • Cyitechsearch

    Site Reliability Engineer

    Found in: Talent IN 2A C2 - 4 days ago


    Cyitechsearch Bangalore, India permanent

    About the job : · We are hiring for Site Reliability Engineer · Experience : 5+ Years · Work Model : Remote / Contract 3 years · Skills : · - Develop and provide operational support for full-stack software applications. · - Relevant industry certifications, such as through the ...

  • Microsoft

    Site Reliability Engineering

    Found in: Talent IN C2 - 3 days ago


    Microsoft Bengaluru, India Full time

    Overview · Looking to join an exciting industry and organization at the forefront of the next Tech industry transformation? Are you ready to join a team of the world's best technical experts to enable the success of Microsoft solutions for our commercial & enterprise customers? ...

  • Zyoin group

    Database Reliability Engineer

    Found in: Talent IN 2A C2 - 4 days ago


    Zyoin group Bangalore, India permanent

    Job Description : · We are looking for a Database Reliability Engineer / Sr. Database Reliability Engineer to help us build and enhance our database platforms to achieve availability, scalability, and operational effectiveness. The right individual will embrace the opportunity to ta ...

  • Lam Research

    Reliability Engineer, Sr

    Found in: Talent IN C2 - 4 days ago


    Lam Research Bengaluru, India

    Job Responsibilities · Determines reliability requirements of components and systems to achieve company, customer and any industry standard reliability objectives. · Perform analysis for reliability and tool availability, and help identify associated costs with deviation from t ...

  • Squareroot Consulting Pvt Ltd.

    Site Reliability Engineer

    Found in: Talent IN 2A C2 - 4 days ago


    Squareroot Consulting Pvt Ltd. Bangalore, India permanent

    Site Reliability Engineer · Location : Bangalore, India · Domain : Cybersecurity · Budget : 30 to 50 Lakhs · - We are looking for a hands-on DevOps engineer leading the design, implementation of DevOps/SRE practice for our infrastructure for data privacy. · - The successful cand ...