When a Load Fails at 2 A.M., Does It Restart From Zero?

BimlCatalog resumes a failed batch from the failure point, not zero.

A load fails at 2 a.m. The on-call engineer is asleep. The question that decides how the next hour goes is not "what broke." It is "what happens when this runs again."

If the answer is "it restarts from zero," then a fifty-table batch that died on table forty-eight re-copies forty-seven tables that already landed cleanly. The compute bill for that load roughly doubles. The maintenance window blows out. And the engineer who finally wakes up has to reason about whether the partial run left anything half-written.

The better answer is "it resumes from the failure point." The forty-seven completed tables are skipped. The copy work that already succeeded is not repeated. The batch picks up where it stopped. But "resume from the failure point" is not a property you get for free. Something has to remember, per task, what already finished, what failed, and what kind of restart each one needs. That memory is a state machine. The interesting question for anyone evaluating a warehouse automation tool is: where does that state machine live, and do you own it?

In BimlFlex, it lives in the BimlCatalog: the operational database you deploy from a DACPAC and run on your own SQL Server or PostgreSQL instance. No vendor scheduler sits on the path. The restart logic is in a table you can SELECT from and a stored procedure you can read. This post is a teardown of that schema, presented as what it is: an artifact you own.

We have written before about the BimlCatalog as the place you achieve lineage visibility without manual mapping, and about how to embed runtime checks and observability as code. Those cover what the catalog records and how you watch it. This goes one layer deeper, into the part that does not just log a failure but decides how the next run recovers from it.

The execution table is a tree, not a list

Open bfx.Execution in your BimlCatalog and you will not find a flat log of runs. You will find a hierarchy. Every row carries the identifiers that place it in a tree:

CREATE TABLE [bfx].[Execution](
    [ExecutionID] BIGINT IDENTITY (1, 1) NOT NULL,
    [ParentExecutionID] BIGINT NULL,
    [OriginalExecutionID] BIGINT NULL,
    [ServerExecutionID] BIGINT NULL,
    [ExternalExecutionID] NVARCHAR(100) NULL,
    ...
    [ExecutionStatus] CHAR(1) NULL,
    [NextLoadStatus] CHAR(1) NULL,
    ...
);

A batch is a parent. Each task inside it is a child whose ParentExecutionID points back at the batch. That single relationship is what lets recovery operate at the right granularity. The batch did not fail as one opaque unit. Forty-seven of its children succeeded and one failed, and the schema knows which is which, because each child is its own row with its own status.
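Because each task is its own row, the per-task picture for one batch is a single query away. A minimal sketch, using only the columns shown in the excerpt above; @BatchExecutionID is a placeholder introduced here for the batch row's ExecutionID, not a name from the catalog:

```sql
-- Placeholder: set this to the ExecutionID of the batch row you are inspecting.
DECLARE @BatchExecutionID BIGINT = 0;

-- One row per child task of the batch, with its current status
-- and the instruction recorded for the next run.
SELECT child.[ExecutionID],
       child.[ExecutionStatus],
       child.[NextLoadStatus]
FROM [bfx].[Execution] AS child
WHERE child.[ParentExecutionID] = @BatchExecutionID
ORDER BY child.[ExecutionID];
```

For the 2 a.m. batch, a query like this would show the failed task as its own row, apart from the children that succeeded.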

OriginalExecutionID is the column that makes resumption coherent across attempts. When a batch reruns, the new execution can trace back to the original run it is recovering, so "this is attempt two of the load that started at 2 a.m." is a fact the data carries, not a guess you reconstruct from timestamps.
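As a sketch of what that tracing can look like, the query below lists every attempt tied to one original run. It assumes a rerun row stores the first attempt's ExecutionID in OriginalExecutionID, which is how the description above reads but is worth confirming against your own catalog. @OriginalExecutionID is a placeholder introduced for illustration:

```sql
-- Placeholder: the ExecutionID of the first attempt (the run that started at 2 a.m.).
DECLARE @OriginalExecutionID BIGINT = 0;

-- The original run plus any later executions that point back at it.
-- Assumption: reruns carry the first attempt's ExecutionID in OriginalExecutionID.
SELECT [ExecutionID],
       [ParentExecutionID],
       [OriginalExecutionID],
       [ExecutionStatus],
       [NextLoadStatus]
FROM [bfx].[Execution]
WHERE [ExecutionID] = @OriginalExecutionID
   OR [OriginalExecutionID] = @OriginalExecutionID
ORDER BY [ExecutionID];
```

Ordering by ExecutionID is used here as a stand-in for attempt order, since the column is an identity that increases as rows are inserted.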

Two columns carry the actual state. ExecutionStatus is where a task is now. NextLoadStatus is the instruction for what the next run should do with it. The recovery behavior is entirely a function of how those two get set when something fails.

NextLoadStatus is the restart instruction

NextLoadStatus is a single character, and the small set of values it takes is the whole vocabulary of recovery. The product documents them in the orchestration guide, so this is not an internal secret. It is the contract:

| Value | Status | Description |
|-------|--------|-------------|
| R | Retry | Full retry of the previous failed execution |
| D | Databricks Restart | Restart only the compute step; reuse the data the copy step already landed |
| C | Canceled / skip-completed | This task already finished under the failing batch, so skip it next time |
| P | Pending | Awaiting the next scheduled execution |

Read that table as a recovery policy rather than a list of codes. When a batch fails, you do not want one uniform response across every task in it. You want the tasks that already succeeded marked so they are skipped. You want the task that actually failed marked for the right kind of restart. The difference between R and D is the difference between redoing all of a task's work and redoing only the part that broke.
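To see that policy for a single batch, you can group its children by the instruction each one carries. This is a hedged sketch: the labels in the CASE expression paraphrase the documented meanings above, and @BatchExecutionID is a placeholder introduced here, not a catalog name:

```sql
-- Placeholder: the ExecutionID of the failed batch.
DECLARE @BatchExecutionID BIGINT = 0;

-- Count the batch's child tasks by the instruction recorded for the next run.
SELECT [NextLoadStatus],
       CASE [NextLoadStatus]
            WHEN 'R' THEN 'Retry (full rerun)'
            WHEN 'D' THEN 'Databricks restart (compute only)'
            WHEN 'C' THEN 'Skip (already completed)'
            WHEN 'P' THEN 'Pending'
            ELSE 'Other or not set'
       END AS [NextLoadAction],
       COUNT(*) AS [TaskCount]
FROM [bfx].[Execution]
WHERE [ParentExecutionID] = @BatchExecutionID
GROUP BY [NextLoadStatus]
ORDER BY [TaskCount] DESC;
```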

What happens the instant a batch fails

The decisions get written when the failure is logged. The error-logging procedure in your BimlCatalog walks the batch's children and stamps each one. The logic is plain enough to read as ordinary SQL.

First, every child that already finished successfully under this batch is marked to be skipped on the next run:

UPDATE [bfx].[Execution]
SET [NextLoadStatus] = 'C' -- skip: already completed under this batch
WHERE [ParentExecutionID] = @ExecutionID
  AND [ExecutionStatus] = 'S';

That single statement is what makes "resume" mean something. The forty-seven tables that landed are not eligible to be re-copied. They carry the C instruction. The next run reads it and moves past them.
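Since the outcome is plain data, you can verify it after a failure has been logged. The check below is a sketch rather than anything shipped with the catalog: it looks for children of the failed batch that succeeded but do not carry the C instruction. @ExecutionID mirrors the parameter name in the statement above and is declared here only as a placeholder:

```sql
-- Placeholder: the ExecutionID of the batch whose failure was logged.
DECLARE @ExecutionID BIGINT = 0;

-- Succeeded children that were NOT marked to be skipped.
-- If the update described above ran for this batch, this should return no rows.
SELECT [ExecutionID],
       [ExecutionStatus],
       [NextLoadStatus]
FROM [bfx].[Execution]
WHERE [ParentExecutionID] = @ExecutionID
  AND [ExecutionStatus] = 'S'
  AND ([NextLoadStatus] IS NULL OR [NextLoadStatus] <> 'C');
```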

Then the task that was mid-flight when the failure hit gets a restart mode chosen for it. This is where the copy-versus-compute distinction earns its place. If the data copy had already completed and only the downstream compute step failed, there is no reason to re-copy. The procedure takes a flag for exactly that case and chooses the restart accordingly:

-- Excerpt: the SET clause applied to the failed task's row
SET [ExecutionStatus] = 'F', -- failed
    [NextLoadStatus] = CASE WHEN @SkipCopyOnRestart = 1
                            THEN 'D' -- compute-only restart, reuse landed data
                            ELSE 'R' -- full rerun
                       END

D is the economical path. The copy succeeded, so on the next run the pipeline skips the copy activities and hands the original audit identifier of the already-landed data to the compute step. The expensive ingestion is not repeated; only the part that actually failed runs again. R is the safe full rerun for cases where the copy itself is suspect.

This is also why the public Databricks orchestration guide describes pushdown restart the way it does: when the copy activities completed and the Databricks job failed, the run is set to D, and the next run reuses the already-landed data instead of re-extracting it. No manual configuration. The state machine made the call and wrote it to a row you can read.

Why "you own it" is the load-bearing word

Plenty of tools restart failed loads. The distinction worth pressing is not that BimlFlex has restart logic. It is where that logic lives and who controls it.

Some automation platforms put restart, scheduling, and run history behind a bundled scheduler or a hosted control plane. The recovery semantics are real, but they are the vendor's. You run inside their orchestrator, you read run state through their interface, and if you ever stop paying, the part that decides how a failed batch resumes leaves with the subscription. The audit trail and the restart behavior were never yours to query directly.

BimlFlex puts the state machine in schema you deploy and own. bfx.Execution is a table in your database. NextLoadStatus is a column you can index, query, and report on with the same SQL you use for everything else. You can answer "which tasks are pending a compute-only restart right now" with a SELECT, not a support ticket. The recovery policy is not narrated to you through a dashboard you rent. It is data you hold. This is the same principle that runs through everything BimlFlex generates: own the code you generate, and own the operational state that runs it.
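The compute-only restart question, for instance, could be asked like this. It is a minimal sketch that relies only on the columns shown earlier. Whether a D row still represents outstanding work depends on how the catalog updates that row once the rerun completes, which this post does not cover, so treat the result as a starting point for inspection:

```sql
-- Tasks whose recorded instruction for the next run is a compute-only restart.
SELECT [ExecutionID],
       [ParentExecutionID],
       [OriginalExecutionID],
       [ExecutionStatus]
FROM [bfx].[Execution]
WHERE [NextLoadStatus] = 'D'
ORDER BY [ExecutionID] DESC;
```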

That ownership is what turns restart-and-recovery from a feature you trust into a system you can inspect. When the load fails at 2 a.m., the next run does not start from zero, and you do not have to take anyone's word for why. The answer is in a table you can read.

If you want to see how this ties into broader pipeline economics, we covered the cost side in building cost-aware orchestration into your pipelines, and the change-data-capture restart patterns in automating CDC restarts on Azure Data Factory.

A failed load is not a question of whether it recovers. It is a question of whether the thing deciding how it recovers belongs to you. With BimlCatalog, it does.