Site Reliability Engineer (DC/NY/SF/Remote)

What you'll do:

Internal-facing

Implement monitoring, logging and alerting for legacy systems we've inherited from a past team

Provide guidance to teams as they prepare new systems for production launch

Manage a Root Cause Analysis process used by multiple scrum teams across dozens of systems

Help teams set up and run fire drill exercises on a quarterly cadence

Help define an approach for bringing Chaos Engineering to the agency

External-facing

Maintain a status page displaying service availability through a mix of automated monitoring and manual updates

Document SRE best practices for use by other application teams throughout the agency

Develop a training process that helps application teams new to SRE practices and cloud infrastructure build reliable, scalable, and secure applications

Consult with application teams on an as-needed basis on how to properly configure monitoring, logging and alerting

Consult with application teams on how to set up processes such as incident response and Root Cause Analysis (RCA)

What we're looking for:

At least 1 year of production on-call experience

Previous experience maintaining a medium- or large-scale production system, especially with regard to monitoring, logging, and alerting

Experience debugging issues across a complex system architecture

Ability to make changes to an existing codebase, such as installing a new monitoring agent (not required, but helpful)

Ability to perform light scripting, such as writing a simple Bash/Python/Ruby/Go script (not required, but helpful)

Experience configuring Selenium scripts, New Relic Synthetics, or other automated functional testing

Excellent written and verbal communication skills, technical and otherwise

Ability to communicate complex technical topics to a range of audiences, from highly technical to non-technical

Ability to develop clear, repeatable processes, and to produce documentation and runbooks that are accessible to a range of audiences

Experience with the following systems is a plus: AWS, Azure, New Relic, Splunk, ELK, CloudWatch

Education requirements: Bachelor’s degree

Position

DevOps Engineer

Must have Skills

Shell Scripting: Beginner
Python: Beginner
Selenium: Beginner
AWS: Beginner

Active a month ago

Job Type

Client Payroll

Languages

English - Fluent

Up to 450K USD/year (Annual salary)

Long-term (Duration)

Fully Remote
