Eli Lilly and Company

Principal Site Reliability Engineer

Reposted Yesterday

In-Office

Hyderabad, Telangana

Senior level

The Principal Site Reliability Engineer will lead a team focused on application reliability, establish best practices, and collaborate on automated systems, while shaping the SRE function to support Lilly's mission.

The summary above was generated by AI

At Lilly, we unite caring with discovery to make life better for people around the world. We are a global healthcare leader headquartered in Indianapolis, Indiana. Our employees around the world work to discover and bring life-changing medicines to those who need them, improve the understanding and management of disease, and give back to our communities through philanthropy and volunteerism. We give our best effort to our work, and we put people first. We’re looking for people who are determined to make life better for people around the world.

About the Technology Organization

Like most prominent tech companies, Technology at Lilly builds and maintains capabilities using pioneering technologies. What differentiates Technology at Lilly is that we create new possibilities through tech, such as data-driven drug discovery and connected clinical trials, to advance our purpose: creating medicines that make life better for people around the world. We hire the best technology professionals from a variety of backgrounds, so they can bring an assortment of knowledge, skills, and diverse thinking to deliver solutions in every area of our business.

About the Business Function

The Software Product Engineering (SPE) team is a specialised engineering group that delivers strategic solutions and differentiated capabilities. We take a forward-thinking approach, focusing on an enterprise platform and product mindset, ensuring that the solutions we build can be leveraged across Technology teams for broader impact and efficiency.

Job Title: Principal Site Reliability Engineer

Role Summary

As a Principal Site Reliability Engineer, you will drive reliability, scalability, and operational excellence across a portfolio of applications deployed on a modern internal platform. You will lead and mentor a team of site reliability engineers, establish best practices, and collaborate closely with product and development teams to ensure robust, automated, and self-healing systems. Your leadership will be critical in shaping the SRE function and enabling the team to deliver high-impact solutions that support Lilly’s mission.

What You’ll Be Doing

Lead the SRE team responsible for the reliability and performance of applications deployed on a cloud-native internal platform.

Design, implement, and maintain automation frameworks, self-service tooling, and auto-healing systems to eliminate manual toil.

Build and enhance end-to-end observability, monitoring, logging, and alerting systems for proactive issue detection and resolution.

Ensure Uptime: Take ultimate ownership of our production environment's stability. Lead end-to-end incident management, from escalation to Root Cause Analysis (RCA). Manage patching, upgrades, and disaster recovery processes.

Champion Infrastructure as Code (IaC) and CI/CD best practices to ensure consistent, repeatable, and secure deployments.

Collaborate with development and product teams to embed reliability and scalability into application design and architecture.

Continuously evaluate and introduce emerging tools and technologies to keep the SRE stack modern and efficient.

Mentor and guide site reliability engineers, fostering a culture of ownership, innovation, and continuous improvement.

Implement AIOps frameworks to improve operational tasks and enhance system self-healing capabilities.

Participate in and optimise the on-call rotation, striving to minimise human intervention through automation.

Drive capacity planning, disaster recovery, and business continuity initiatives.

Support onboarding, documentation, and knowledge sharing for platform services and operational best practices.

How You Will Succeed

Demonstrate technical leadership and strategic thinking in SRE practices.

Proactively identify and resolve reliability risks and bottlenecks.

Foster strong cross-functional relationships with engineering, product, and operations teams.

Lead by example in incident management, troubleshooting, and performance optimisation.

Promote a culture of blameless postmortems and continuous learning.

Effectively communicate complex technical concepts to both technical and non-technical stakeholders.

What You Should Bring

Proven experience leading SRE or DevOps teams in a complex, cloud-native environment.

Deep expertise in at least one major cloud platform (AWS, Azure, or GCP).

Advanced knowledge of Linux/Unix systems, networking, and distributed systems.

Proficiency in programming/scripting (Python, Go, or similar).

Hands-on experience with containers and orchestration (Docker, Kubernetes at scale).

Strong background in CI/CD pipelines and Infrastructure as Code (Terraform, Ansible, Helm, etc.).

Expertise with observability platforms (Prometheus, Grafana, ELK, Datadog, Splunk).

Experience with SRE practices (SLIs, SLOs, error budgets, blameless postmortems).

Excellent problem-solving, debugging, and performance optimisation skills.

Experience with security engineering, IAM, secrets management, and vulnerability scanning is a plus.

Exposure to cloud cost optimisation strategies is desirable.

Experience mentoring and developing engineers.

Basic Qualifications and Experience Requirement

Bachelor’s degree in Computer Science, Engineering, or related field.

8+ years of hands-on experience in SRE, DevOps, or related roles, with at least 2 years in a technical leadership capacity.

Demonstrated success in managing reliability for large-scale, distributed systems.

Relevant certifications (e.g., AWS Certified DevOps Engineer, CKA) are a plus.

Additional Skills/Preferences

Experience with AI/ML in operations (AIOps) for anomaly detection, predictive scaling, or automated incident triage.

Contribution to open-source projects or thought leadership in SRE/DevOps communities.

Knowledge of Agile principles and frameworks (e.g., Scrum, SAFe), including related tools (such as Jira).

Excellent analytical, problem-solving, and investigative skills.

Strong communication and collaboration skills.

Additional Information

Availability to work flexible hours may be required. This team supports continuous operations across two shifts, so this role will require non-standard work hours and some work on weekends and holidays. Appropriate adjustments in benefits will be provided for employees working non-standard hours, where applicable.

Lilly is dedicated to helping individuals with disabilities to actively engage in the workforce, ensuring equal opportunities when vying for positions. If you require accommodation to submit a resume for a position at Lilly, please complete the accommodation request form (https://careers.lilly.com/us/en/workplace-accommodation) for further assistance. Please note this is for individuals to request an accommodation as part of the application process and any other correspondence will not receive a response.

Lilly does not discriminate on the basis of age, race, color, religion, gender, sexual orientation, gender identity, gender expression, national origin, protected veteran status, disability or any other legally protected status.

#WeAreLilly

Top Skills

Ansible

AWS

Azure

Datadog

Docker

ELK

GCP

Grafana

Helm

Kubernetes

Linux

Prometheus

Python

Splunk

Terraform

Unix
