SRE Infrastructure Lead (Terraform) - Hyderabad Onsite
We are seeking an SRE Infrastructure Lead (Terraform) to join our team.
You will lead and work with the developers and teammates to ensure the technical goals of the project are met on behalf of the client. In this role, you will be responsible for managing your team, functional requirements, execution, and performance for a core set of features and/or components.
Roles and Responsibilities:
● Work with service teams and define the runtime architecture of the applications.
● Define the architecture and design of the resources needed to run the services, e.g. containers, EC2 instances, Lambda functions, virtual machines, load balancers, and CDNs, to name a few.
● Architect and design infrastructure that scales up and down dynamically, both horizontally and vertically, to support the load on the application.
● Recommend the infrastructure needed to run services based on business and technical needs, taking into account application performance (concurrency, latency, throughput) and the cost of running in-house and in the cloud.
● Provide thought leadership at all levels of the organization, from engineers to CxOs.
● Own the infrastructure and APM, and work with DevOps teams to build, release, monitor, and run the services to improve service reliability.
● Write code and provide solutions hands-on.
● Define and accelerate the implementation of support processes, tools and best practices
● Handle cross-team performance issues end to end: identify the cause, determine the areas for improvement, and drive the resulting actions to closure.
● Baseline the performance and maturity of DevOps processes, tool maturity and coverage, metrics, technology, and engineering practices.
● Define, measure, and improve reliability metrics (SLOs/SLIs), observability (monitoring, logging, and tracing solutions), and ops processes (incident and problem management); streamline and automate release management. Build dashboards to provide visibility into the performance of the applications.
● Mentor and coach other SREs in the organization
● Provide written and verbal updates to executives and the stakeholders of the application in the organization.
● Understand the current processes and system setup, and propose the process and technology improvements needed for the application to exceed its desired Service Level Objective.
● Be a strong believer in automation as a driver of sustained continuous improvement: automate toil and runbooks, and improve the applications' ability to auto-heal, leading to improved reliability.
Must Have Skills:
● Should have 10+ years of overall experience in system administration.
● Strong hands-on coding experience, with 3+ years in one or more of the following: Terraform/CloudFormation, Python, Bash, Perl, etc.
● Should have 5+ years of experience with AWS.
● Strong hands-on experience with configuration management tools (such as Chef, Puppet, Salt, Ansible, API GW), with 3+ years of experience.
● Strong knowledge of operating systems, particularly Linux-based ones, with 5+ years of experience.
● Should have 3+ years of experience with container technologies such as Docker and container management systems such as Kubernetes
● Very good understanding of the compute, network, and storage layers.
● Strong understanding of observability (monitoring, logging, tracing, metrics) and chaos engineering concepts.
● Proficiency in Infrastructure as Code (IaC) using Terraform.
● Strong experience with version control and workflow tooling such as Git (GitLab, GitHub).
● Strong experience implementing CI/CD and a solid understanding of DevOps methodology (design, deploy, optimize).
● Should have a good understanding of and experience with GitOps.
● Experience designing and building highly available, resilient, large-scale, distributed systems that utilize load balancing, horizontal scalability and automated disaster recovery.
● Strong communication skills and the ability to explain protocols and processes to the team and management.
● Strong documentation skills
● Strong troubleshooting skills with the ability to spot issues before they become problems.
● Experience with project management and workflow tools such as Agile, Jira, Scrum/Kanban, etc.
Nice to Have Skills:
● Certifications in Linux, Kubernetes, AWS are preferred.
● Experience with a programming language such as Go.
● Contributions to the open-source community.
● Experience managing large-scale production environments with hundreds of servers and thousands of running containers.
● Experience with caching solutions, database administration, immutable infrastructure.
Qualification: Master’s or Bachelor’s degree in Computer Science Engineering, or a related technical degree with 10+ years of relevant experience.
Location: Hyderabad / Bangalore