# Future of CI Pipeline Processing
| Status | Authors | Coach | DRIs | Owning Stage | Created |
|---|---|---|---|---|---|
| proposed | furkanayhan | ayufan | jreporter, cheryl.li | devops verify | 2023-05-15 |
## Summary
GitLab CI is one of the oldest and most complex features in GitLab. Over the years, its YAML syntax has grown considerably in size and complexity. To keep the syntax highly stable, we have primarily been making additive changes on top of the existing design and patterns. Meanwhile, our user base has grown exponentially, and with it the need to support their use cases and workflow customization.
While delivering huge value over the years, these additive changes have also caused some surprising behaviors in the pipeline processing logic. Some keywords accumulated a number of responsibilities, ambiguous overlaps between keywords were discovered, and subtle differences in behavior were introduced over time. The current implementation and YAML syntax also make it challenging to implement new features.
In this design document, we will outline a streamlined approach to improve pipeline behavior predictability and reduce the configuration burden on users, ultimately strengthening GitLab’s product competitiveness.
## Goals

### Business Goals
- Enhance Product Competitiveness: By reducing configuration complexity and improving pipeline predictability, GitLab will offer a more intuitive and robust CI/CD experience. This positions GitLab as the preferred choice for both new and existing users, helping to attract and retain customers, including those with highly complex workflows.
- Increase Development Efficiency: Clarifying keyword responsibilities and simplifying the pipeline model reduces code complexity, which improves maintainability and decreases the time and resources needed for future enhancements. The development teams will have greater agility to implement new features and address issues quickly.
### Product Goals
- Provide a clear, consistent pipeline configuration model that reduces ambiguity and allows users to more accurately control pipeline behavior.
- Create a cohesive, predictable model for DAG and STAGE configurations, enabling users to seamlessly integrate both without risk of unexpected behavior.
- Simplify GitLab CI’s codebase to make future improvements more manageable and reduce the maintenance burden on GitLab’s engineering team.
- Facilitate a migration path for existing customers without introducing any breaking changes.
## Problem Statement
- **Ambiguity and Overlapping Keyword Roles**: Some keywords, like `when` and `allow_failure`, have multiple roles that overlap, leading to unpredictable behavior. Users find it difficult to anticipate outcomes, especially in complex pipelines. This ambiguity increases support cases and frustrates users, who may seek alternative solutions.
- **Inconsistent Pipeline Models**: The STAGE and DAG models do not always behave consistently, making it challenging for users to configure pipelines that use both models without unintended side effects. This inconsistency adds a learning curve and reduces GitLab's appeal for complex pipeline needs.
## Non-Goals
We will not discuss how to avoid breaking changes for now.
## Motivation
The list of problems is the main motivation for this design document. Most of these problems have been discussed before in the “Restructure CI job when keyword” epic.
### Problem 1: The responsibility of the `when` keyword
Right now, the `when` keyword has many responsibilities:

- `on_success` (default): Run the job only when no jobs in earlier stages fail or have `allow_failure: true`.
- `on_failure`: Run the job only when at least one job in an earlier stage fails. A job in an earlier stage with `allow_failure: true` is always considered successful.
- `never`: Don't run the job regardless of the status of jobs in earlier stages. Can only be used in a `rules` section or `workflow: rules`.
- `always`: Run the job regardless of the status of jobs in earlier stages. Can also be used in `workflow: rules`.
- `manual`: Run the job only when triggered manually.
- `delayed`: Delay the execution of a job for a specified duration.
It answers three questions:

- What's required to run? => `on_success`, `on_failure`, `always`
- How to run? => `manual`, `delayed`
- Add to the pipeline? => `never`
As a result, for example, we cannot create a manual job with `when: on_failure`. This can be useful when a user wants a job that is only available on failure but needs to be manually played, for example to publish failures to a dedicated page or a dedicated external service (see the sketch below).
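A sketch of the configuration users would want to write here; the job name and script are illustrative, and this combination is exactly what the current syntax cannot express:

```yaml
publish_failure_report:
  stage: report
  script: ./publish-report.sh # illustrative script
  when: on_failure # desired: only available when something failed...
  # ...while also requiring a manual play, which `when` cannot express today
```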
### Problem 2: Abuse of the `allow_failure` keyword
We control the blocker behavior of a manual job with the `allow_failure` keyword. In fact, it has another responsibility: to "determine whether a pipeline should continue running when a job fails".
Currently, a manual job:

- is not a blocker when it has `allow_failure: true` (the default);
- is a blocker when it has `allow_failure: false`.
As a result, for example, we cannot create a manual job that is `allow_failure: false` and not a blocker.
```yaml
job1:
  stage: test
  when: manual
  allow_failure: true # default
  script: exit 0

job2:
  stage: deploy
  script: exit 0
```
Currently:

- `job1` is skipped.
- `job2` runs because `job1` is ignored since it has `allow_failure: true`.
- When we run/play `job1`, if it fails, it's marked as "success with warning".
#### `allow_failure` with `rules`

`allow_failure` becomes more confusing when used with `rules`.
From the docs:

> The default behavior of `allow_failure` changes to `true` with `when: manual`. However, if you use `when: manual` with `rules`, `allow_failure` defaults to `false`.
From the docs:

> The default value for `allow_failure` is:
>
> - `true` for manual jobs.
> - `false` for jobs that use `when: manual` inside `rules`.
> - `false` in all other cases.
For example:

```yaml
job1:
  stage: build
  script: ls
  when: manual

next_job1:
  stage: test
  script: exit 0

job2:
  stage: test
  script: ls
  rules:
    - if: $ALWAYS_TRUE != "asdsad"
      when: manual

next_job2:
  stage: deploy
  script: exit 0
```
`job1` and `job2` behave differently:

- `job1` is not a blocker because it has `allow_failure: true` by default.
- `job2` is a blocker because `rules` with `when: manual` does not return `allow_failure: true` by default.
### Problem 3: Different behaviors in DAG/`needs`
The main behavioral difference between DAG and STAGE is about the “skipped” and “ignored” states.
Background information:

- skipped:
  - When a job is `when: on_success` and its previous status is failed, it's skipped.
  - When a job is `when: on_failure` and its previous status is not "failed", it's skipped.
- ignored:
  - When a job is `when: manual` with `allow_failure: true`, it's ignored.
Problem:
The skipped and ignored states are considered successful in the STAGE processing but not in the DAG processing.
#### Problem 3.1. Handling of ignored status with manual jobs
Example 1:
```yaml
build:
  stage: build
  script: exit 0
  when: manual
  allow_failure: true # by default

test:
  stage: test
  script: exit 0
  needs: [build]
```
- `build` is ignored (skipped) because it's `when: manual` with `allow_failure: true`.
- `test` is skipped because "ignored" is not a successful state in the DAG processing.
Example 2:
```yaml
build:
  stage: build
  script: exit 0
  when: manual
  allow_failure: true # by default

test:
  stage: test
  script: exit 0
```
- `build` is ignored (skipped) because it's `when: manual` with `allow_failure: true`.
- `test` runs and succeeds.
#### Problem 3.2. Handling of skipped status with `when: on_failure`
Example 1:
```yaml
build_job:
  stage: build
  script: exit 1

test_job:
  stage: test
  script: exit 0

rollback_job:
  stage: deploy
  needs: [build_job, test_job]
  script: exit 0
  when: on_failure
```
- `build_job` runs and fails.
- `test_job` is skipped.
- Even though `rollback_job` is `when: on_failure` and there is a failed job, it is skipped because the `needs` list has a "skipped" job.
Example 2:
```yaml
build_job:
  stage: build
  script: exit 1

test_job:
  stage: test
  script: exit 0

rollback_job:
  stage: deploy
  script: exit 0
  when: on_failure
```
- `build_job` runs and fails.
- `test_job` is skipped.
- `rollback_job` runs because there is a failed job before it.
### Problem 4: The skipped and ignored states

Let's assume that we solved Problem 3 and the "skipped" and "ignored" states no longer differ between DAG and STAGE. How should they behave in general? Are they successful or not? Should "skipped" and "ignored" be different? Let's examine some examples:
#### Example 4.1. The ignored status with manual jobs
```yaml
build:
  stage: build
  script: exit 0
  when: manual
  allow_failure: true # by default

test:
  stage: test
  script: exit 0
```
- `build` is in the "manual" state but considered "skipped" (ignored) for the pipeline processing.
- `test` runs because "skipped" is a successful state.
Alternatively:

```yaml
build1:
  stage: build
  script: exit 0
  when: manual
  allow_failure: true # by default

build2:
  stage: build
  script: exit 0

test:
  stage: test
  script: exit 0
```
- `build1` is in the "manual" state but considered "skipped" (ignored) for the pipeline processing.
- `build2` runs and succeeds.
- `test` runs because "success" + "skipped" is a successful state.
#### Example 4.2. The skipped status with `when: on_failure`
```yaml
build:
  stage: build
  script: exit 0
  when: on_failure

test:
  stage: test
  script: exit 0
```
- `build` is skipped because it's `when: on_failure` and its previous status is not "failed".
- `test` runs because "skipped" is a successful state.
Alternatively:

```yaml
build1:
  stage: build
  script: exit 0
  when: on_failure

build2:
  stage: build
  script: exit 0

test:
  stage: test
  script: exit 0
```
- `build1` is skipped because it's `when: on_failure` and its previous status is not "failed".
- `build2` runs and succeeds.
- `test` runs because "success" + "skipped" is a successful state.
### Problem 5: The `dependencies` keyword
The `dependencies` keyword is used to define a list of jobs to fetch artifacts from. This responsibility is shared with the `needs` keyword, and the two can even be used together in the same job. We don't need to walk through every possible scenario; this example is enough to show the confusion:
```yaml
test2:
  script: exit 0
  dependencies: [test1]
  needs:
    - job: test1
      artifacts: false
```
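For comparison, the artifact behavior can be expressed unambiguously with `needs` alone; a minimal sketch:

```yaml
test2:
  script: exit 0
  needs:
    - job: test1
      artifacts: true # fetch artifacts from test1 only
```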
### Information 1: Canceled jobs
Are a canceled job and a failed job the same? They have many differences, so we could easily say "no". However, they have one similarity: both can be "allowed to fail".
Let's define their differences first:

- A canceled job:
  - It is not a finished job.
  - It is a user-requested interruption of the job. The intent is to abort the job or stop pipeline processing as soon as possible.
  - We don't know the result, there are no artifacts, etc.
  - Its eventual state is "canceled", so no job can run after it.
  - There is no `when: on_canceled`.
  - Even `when: always` is not run.
- A failed job:
  - It is a machine response of the CI system to executing the job content, indicating that execution failed for some reason.
  - It is the system's counterpart to success. The fact that something failed is relative and might even be the desired outcome of a CI execution, for example when executing tests, some of which fail.
  - We know the result, and there can be artifacts.
  - Its eventual state is "failed", so subsequent jobs can run depending on their `when` values: `when: on_failure` and `when: always` are run.
The one similarity is that both can be "allowed to fail".
```yaml
build:
  stage: build
  script:
    - sleep 10
    - exit 1
  allow_failure: true

test:
  stage: test
  script: exit 0
  when: on_success # default
```

- If `build` runs and gets canceled, then `test` runs.
- If `build` runs and fails, then `test` runs.
#### An idea on using canceled instead of failed for some cases
There is another aspect. We often drop jobs with a `failure_reason` before they get executed, for example when the namespace has run out of compute minutes or when limits are exceeded. Dropping jobs in the failed state has been handy because we could communicate the `failure_reason` to the user for better feedback. When canceling jobs for various reasons, we don't have a way to indicate that. We cancel jobs because the user ran out of compute minutes while the pipeline was running, because the pipeline was auto-canceled by another pipeline, or for other reasons. If we had a `stop_reason` instead of `failure_reason`, we could use it for both canceled and failed jobs, and we could also use the canceled status more appropriately.
### Information 2: Empty state
We recently updated the documentation of the `when` keyword for clarification:

- `on_success`: Run the job only when no jobs in earlier stages fail or have `allow_failure: true`.
- `on_failure`: Run the job only when at least one job in an earlier stage fails.
For example:

```yaml
test1:
  when: on_success
  script: exit 0
  # needs: [] would lead to the same result

test2:
  when: on_failure
  script: exit 0
  # needs: [] would lead to the same result
```
- `test1` runs because no job failed in the previous stages.
- `test2` does not run because no job failed in the previous stages.
`on_success` means that "nothing failed"; it does not mean that everything succeeded. The same goes for `on_failure`: it does not mean that everything failed, but it does mean that "something failed". This semantic follows the expectation that your pipeline succeeds, which is the happy path, rather than fails, which would require user intervention to fix.
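To illustrate that "nothing failed" is not the same as "everything succeeded", consider this minimal sketch (job names are illustrative): `flaky_job` fails but is allowed to fail, so the default `when: on_success` job still runs:

```yaml
flaky_job:
  stage: test
  script: exit 1
  allow_failure: true # fails, but does not count as "failed"

deploy:
  stage: deploy
  script: exit 0
  when: on_success # default; runs because nothing counts as failed
```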
## Technical expectations
All proposals or future decisions must follow these goals:

- The `allow_failure` keyword must only be responsible for marking failed jobs as "success with warning".
  - Why: It should not have another responsibility, such as determining whether a manual job is a blocker or not.
  - How: Another keyword will be introduced to control the blocker behavior of a manual job.
- With `allow_failure`, canceled jobs must not be marked as "success with warning".
  - Why: "canceled" is a different state than "failed".
  - How: Jobs canceled with `allow_failure: true` will not be marked as "success with warning".
- The `when` keyword must only answer the question "What's required to run?", and it must be the only source of truth for deciding if a job should run or not.
- The `when` keyword must not control whether a job is added to the pipeline.
  - Why: It is not its responsibility.
  - How: Another keyword will be introduced to control whether a job is added to the pipeline.
- The "skipped" and "ignored" states must be reconsidered.
  - TODO: We need to discuss this more.
- A new keyword structure must be introduced to specify whether a job is an "automatic", "manual", or "delayed" job.
  - Why: It is not the responsibility of the `when` keyword.
  - How: A new keyword will be introduced to control the behavior of a job (see the illustrative sketch after this list).
- The `needs` keyword must only control the order of the jobs. It must not be used to control the behavior of the jobs or to decide if a job should run or not. The DAG and STAGE behaviors must be the same.
  - Why: It leads to different behaviors and confuses users.
  - How: The `needs` keyword will only define previous jobs, like `stage` does.
- The `needs` and `dependencies` keywords must not be used together in the same job.
  - Why: It is confusing.
  - How: The `needs` and `dependencies` keywords will be mutually exclusive.
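Purely as illustration, the sketch below shows how a single job might look once these expectations are met. The keyword names `trigger_mode` and `blocking` are hypothetical placeholders invented for this example, not proposed syntax:

```yaml
# Hypothetical syntax, for illustration only; no keyword names have been decided.
publish_failure_report:
  stage: report
  script: ./publish-report.sh # illustrative script
  when: on_failure     # only answers "what's required to run?"
  trigger_mode: manual # hypothetical: automatic | manual | delayed
  blocking: false      # hypothetical: takes over allow_failure's blocker role
  allow_failure: true  # only marks a failed job as "success with warning"
```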
## Proposal
N/A
## Design and implementation details
This will be determined after the proposal is approved. Breaking changes, implementation details, and migration paths will be discussed in this phase.
## Feedback
Please share your feedback at the feedback issue.
