IT Site Reliability Engineer
At GitLab, the IT Infrastructure team is responsible for Site Reliability Engineering for our tech stack applications and cloud infrastructure that supports corporate initiatives across many of our departments. In addition to traditional AWS and GCP administration, we also provide escalation engineering support for departments that manage their respective SaaS tech stack applications (vendor hosted). Another of our functions is to provide DevOps Engineering for several internally built applications that power our business operations and automation.
The IT team collaborates closely with the Engineering Infrastructure Reliability team that is responsible for our GitLab.com SaaS platform (our product infrastructure). The IT, Engineering, and Infrastructure Security teams collaborate to architect, implement, and manage our AWS and GCP infrastructure policies and collectively manage all related services.
- Lead the handling of the ticket queue (GitLab issues) for AWS and GCP corporate infrastructure requests from team members. This ranges from simple IAM and DNS requests to designing and deploying new scalable application infrastructure.
- Design, build and maintain core infrastructure that enables GitLab to scale to support 2,000+ team members and the applications and services that they use day-to-day.
- Implement and maintain system logging and monitoring to alert on problems and prevent outages, and get ahead of customer needs.
- Maintain the corporate AWS and GCP infrastructure utilizing Ansible, Terraform, GitLab CI/CD, and Kubernetes
- Gather and analyze operating system and application metrics to assist in performance tuning and fault finding
- Create sustainable systems and services through patching, automation, and upgrades
- Document every action so your findings turn into repeatable actions and then into automation.
- Provide mentorship to IT System Administrators and IT Analysts who have an interest in infrastructure and IaC.
- Collaborate with other teams to improve services and help with system design, platform management, and capacity planning
The IT Site Reliability Engineer is a grade 6.
- 5+ years of experience in IT in a high growth Software as a service (SaaS) environment
- Knowledge of configuration management tools like Ansible, Chef, or Terraform
- Hands-on experience working in GCP and AWS environments
- Experience working with CI/CD tools and Git
- Ability to use GitLab
The IT Site Reliability Engineers share the same responsibilities outlined above.
- AWS and GCP - At least 2 years managing applications in AWS and/or GCP. An AWS and/or GCP professional certification is nice to have; however, practical experience is more important, in conjunction with Terraform experience for deploying applications and services using infrastructure-as-code with security best practices.
- Security - Strong understanding of security best practices, network design, and how AWS/GCP roles should be used for IAM/RBAC least privilege.
- Infrastructure-as-Code - Configuration management experience with Terraform and/or Ansible to effectively manage our infrastructure. Previous experience with AWS CloudFormation, Chef, Pulumi, Puppet, etc. is acceptable; however, strong Terraform experience is a requirement.
- Kubernetes - Experience with managing Kubernetes clusters and using kubectl, k9s, etc. for managing Helm chart deployments, ingress services, and troubleshooting pods. Previous experience with Docker and related technologies is acceptable since container concepts are transferable.
- Operating Systems - Experience with managing Alpine, Debian, or Ubuntu Linux systems. We do not use Windows at GitLab. Many services are deployed in containers.
- Cloud Services - Manage, configure and troubleshoot Linux operating system issues, storage (block and object), networking (VPCs, proxies and CDNs), and administer high-availability PostgreSQL and Redis clusters
- Monitoring and instrumentation - Implement metrics in Prometheus, Grafana, Elastic, log management and related systems, and Slack/PagerDuty/Sentry integrations
- Engineering practices - High availability, data security, reliability and scalability, as well as disaster recovery
The Senior IT Site Reliability Engineer is a grade 7.
The Senior IT Site Reliability Engineer has all the same responsibilities as the ones outlined above plus the following:
- 7+ years of experience in IT in a high growth SaaS environment
- Advanced knowledge of identity and access management
- Advanced knowledge in one of the following scripting languages - Python or Ruby
- Advanced knowledge of container and microservice technologies
The Senior IT Site Reliability Engineer has all the same responsibilities as the intermediate position plus the following:
- AWS and GCP - At least 5 years managing applications in AWS and/or GCP. An AWS and/or GCP professional certification is nice to have; however, practical experience is more important, in conjunction with Terraform experience for deploying applications and services using infrastructure-as-code with security best practices.
- Security - The current infrastructure and DevOps landscape requires a strong security background to design hardened environments using a variety of cloud services beyond the traditional firewall rules of VPCs. It is helpful to have a working knowledge of how different security vendor point solutions can be used to create a robust architecture.
- Software Languages and Frameworks (beyond simple scripts) - We work in a variety of languages including: PHP (Laravel), Ruby on Rails, GoLang, Python and Shell.
- CI/CD - Experience with Terraform and GitLab CI/CD for automated build, test and deployments. Previous experience with CI/CD platforms, GitHub Actions, Jenkins, etc is acceptable, however
- Build or implement open source automation and systems to manage AWS and GCP infrastructure and business applications and related services.
- Systems architecture design - In a DevOps ecosystem, your systems thinking will allow you to see automation efficiencies in areas outside of infrastructure. At GitLab, everyone can contribute and the IT Operations team welcomes automation and efficiency contributions from all roles.
- Mean Time between Failures (MTBF)
- Mean Time to Repair (MTTR)
- Number of days since last environment audit
- Cycle Time for IT Support Issue Resolution
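As an illustration of the first two indicators, here is a minimal sketch using the common textbook definitions: MTTR as total repair time divided by the number of incidents, and MTBF as total uptime in an observation window divided by the number of failures. The incident log, function names, and observation window are hypothetical; this posting does not specify how GitLab actually measures these indicators.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (outage start, service restored) pairs.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 10, 0)),
    (datetime(2024, 1, 13, 14, 0), datetime(2024, 1, 13, 14, 30)),
    (datetime(2024, 1, 23, 8, 0), datetime(2024, 1, 23, 9, 30)),
]


def total_downtime(incidents):
    """Sum the duration of every outage in the log."""
    return sum((end - start for start, end in incidents), timedelta())


def mean_time_to_repair(incidents):
    """MTTR: total repair time divided by the number of incidents."""
    if not incidents:
        raise ValueError("MTTR is undefined without incidents")
    return total_downtime(incidents) / len(incidents)


def mean_time_between_failures(incidents, window_start, window_end):
    """MTBF: uptime within the window divided by the number of failures."""
    if not incidents:
        raise ValueError("MTBF is undefined without failures")
    uptime = (window_end - window_start) - total_downtime(incidents)
    return uptime / len(incidents)


# Assumed observation window of 30 days.
window_start = datetime(2024, 1, 1)
window_end = datetime(2024, 1, 31)

mttr = mean_time_to_repair(incidents)
mtbf = mean_time_between_failures(incidents, window_start, window_end)
print(f"MTTR: {mttr}, MTBF: {mtbf}")
```

A higher MTBF and a lower MTTR both indicate a more reliable environment; teams may define the window and what counts as a failure differently.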
The next step in the IT Site Reliability Engineer job family is to move to the IT Manager job family.
Candidates for this position can expect the hiring process to follow the order below. Please keep in mind that candidates can be declined from the position at any stage of the process. To learn more about someone who may be conducting the interview, find their job title on our team page.
- Qualified candidates will be invited to schedule a 30 minute screening call with one of our Global Recruiters
- Candidates will be invited to complete a ’take home assessment’. This is to be completed in your own time and returned within 3-5 working days
- Next, candidates will be invited to schedule an interview with the Hiring Manager
- Candidates will then be invited to schedule a Team panel interview with two members of the IT Systems Engineering team
- Candidates will also be invited to schedule a Technical interview with two other team members
- Finally, candidates will interview with our Director of IT Operations
Additional details about our process can be found on our hiring page.
GitLab Inc. is a company based on the GitLab open-source project. GitLab is a community project to which over 2,200 people worldwide have contributed. We are an active participant in this community, trying to serve its needs and lead by example. We have one vision: everyone can contribute to all digital content, and our mission is to change all creative work from read-only to read-write so that everyone can contribute.
We value results, transparency, sharing, freedom, efficiency, self-learning, frugality, collaboration, directness, kindness, diversity, inclusion and belonging, boring solutions, and quirkiness. If these values match your personality, work ethic, and personal goals, we encourage you to visit our primer to learn more. Open source is our culture, our way of life, our story, and what makes us truly unique.
Top 10 Reasons to Work for GitLab:
- Mission: Everyone can contribute
- Results: Fast growth, ambitious vision
- Flexible Work Hours: Plan your day so you are there for other people & have time for personal interests
- Transparency: Over 2,000 webpages in GitLab handbook, GitLab Unfiltered YouTube channel
- Iteration: Empower people to be effective & have an impact, Merge Request rate, We dogfood our own product, Directly responsible individuals
- Diversity, Inclusion & Belonging: A focus on gender parity, Team Member Resource Groups, other initiatives
- Collaboration: Kindness, saying thanks, intentionally organize informal communication, no ego
- Total Rewards: Competitive market rates for compensation, Equity compensation, global benefits (inclusive of office equipment)
- Work/Life Harmony: Flexible workday, Family and Friends days
- Remote Done Right: One of the world's largest all-remote companies, prolific inventor of remote best practices
See our culture page for more!
Work remotely from anywhere in the world. Curious to see what that looks like? Check out our remote manifesto and guides.