Nov 28, 2020 - Restoring individual postgres table

In a scenario where a production database backup becomes too large to be handled nimbly, selective restoration becomes quite useful for recovering pre-determined sets of data.

PostgreSQL allows restoring individual tables from dump files, which can be used, for instance, to query a particular table for data from a previous point in time, say, when investigating a bug or recovering accidentally deleted data.

Generating the Backup File

Naturally, the first step here is to generate your database backups. Ideally you should set up an automated job that recurrently generates backups and stores them in a safe place.

The pg_dump utility is used for generating the entire PostgreSQL database backup:

pg_dump.exe --host localhost --port 5432 --username "postgres" --schema "public" --format custom --blobs --verbose --file "my_database.dump" "my_database"

In my case I have a recurring task scheduled to run daily (at dawn) that generates the backup file and uploads it to AWS S3 for storage.
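
For illustration, on a Linux host such a job could be as simple as a cron entry (the schedule, file paths and bucket name here are hypothetical):

0 4 * * * pg_dump --host localhost --port 5432 --username postgres --format custom --blobs --file /backups/my_database.dump my_database && aws s3 cp /backups/my_database.dump s3://my-backups/my_database.dump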

Restoring a Single Table

Now, whenever I need to restore a specific table for data recovery or bug investigation purposes, here’s my go-to recipe:

1) Download the daily backup file of interest from S3

2) Create an empty local database my_database_restored

3) Create the table that needs to be restored, my_table, in the empty database

Some tips:

  • The table name, column names and column types should match those of the original database
  • It’s not necessary to recreate any of the table’s indexes, so use this to intelligently reduce restoration time and database disk usage 🤓
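
For illustration, steps 2 and 3 can be performed from the command line as follows (the table name and columns here are hypothetical placeholders; use your actual table definition, minus indexes):

createdb --host localhost --username postgres my_database_restored

psql --host localhost --username postgres --dbname my_database_restored --command "CREATE TABLE my_table (id integer, description text, created_at timestamp)"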

4) Finally, run pg_restore to selectively import the desired table’s data:

pg_restore.exe -U postgres --data-only -d "my_database_restored" -t "my_table" "my_database.dump"

That’s it. After the command completes, all table data will be available in your single-table database for querying.
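
As a quick sanity check you can count the restored rows, e.g.:

psql --username postgres --dbname my_database_restored --command "SELECT count(*) FROM my_table"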



Oct 11, 2020 - A review of consensus protocols

The consensus problem is a fundamental problem in multi-agent systems which requires a group of processes (or agents) to reliably agree on a single data value in a timely manner. Although extensively discussed in the context of distributed computing, it’s not exclusive to that field, also being present in our society in a variety of situations such as democratic elections, the legislative process, jury trial proceedings, and so forth.

It’s solved through the employment of a consensus protocol governing how processes (agents) interact with one another. It may seem redundant but, to solve the consensus problem, all processes must first agree to follow the same consensus protocol.

Some of these processes may fail or be unreliable in other ways (such as in a conflict of interest situation) so consensus protocols must be fault tolerant or resilient. The processes must somehow propose their candidate values, communicate with one another, and decide on a single consensus value.

In this post I review four major consensus protocols for solving the consensus problem based on my implementation of them, namely:

  1. Chandra–Toueg
  2. Ben-Or
  3. Basic Paxos
  4. Nakamoto Consensus

You can find supporting code for this analysis in this GitHub repository: ConsensusKit

Before getting started let’s recap the three key properties of consensus protocols and take a quick look at relevant terminology for this discussion.

Properties of a consensus protocol

Formally a consensus protocol must satisfy the following three properties:

Termination

  • Eventually, every correct process decides some value.

Integrity

  • If all the correct processes proposed the same value “v”, then any correct process must decide “v”.

Agreement

  • Every correct process must agree on the same value.

These requirements are rather straightforward. “Termination” certifies the protocol is resilient to halting failures. “Agreement” prevents any two correct processes from deciding on different values, which would break consensus. Lastly, “Integrity”, which is in fact flexible and may vary depending on application requirements, assures the protocol behaves in an expected and unbiased way.

Terminology

These terms are used throughout this post as well as within the provided support source code.

Protocol

The official set of rules (“algorithm”) governing the behavior of processes for solving the consensus problem.

Process

An individual agent belonging to a larger group of agents collectively interested in reaching consensus. Processes can be “correct”, meaning they strictly adhere to the protocol and are not subject to failures, or “faulty”, meaning we cannot rely on them following the protocol at all times.

Instance

A well-defined round of interactions among processes under a consensus protocol. Depending on the underlying protocol, one or multiple instances are required for reaching consensus.

Proposer

A process role. Processes under this role are entitled to propose values. Any value agreed upon by all correct processes must have originated in a proposer.

Decider

A process role. Processes under this role are entitled to decide on a value. Any value agreed upon by all correct processes must have been voted for by at least one decider.

Archiver

An application or a subsystem that is responsible for persisting decided values and implementing a protocol’s specific data constraints.

Failure detector

An application or a subsystem that is responsible for the detection of node failures or crashes.

Majority

A protocol-specific threshold requiring more than half of the available votes, or more when faulty processes are taken into account. For example, among n = 7 processes a simple majority requires 4 votes.

Quorum

A protocol-specific threshold which ensures that a minimum number of participants (or votes) is met before taking an action.

With these definitions in place let’s move on to the implementation and review of each consensus protocol.

Protocol structure

My main goal has been to implement these four protocols in code to be able to run and evaluate them in a controlled environment. I was also interested in trying to spot structural similarities among them that could be extracted into a common base structure, promoting code reuse.

The result of my first coding iteration is displayed in the simplified class diagram below:

Main Classes

Three main classes are depicted in the diagram:

  • The Protocol class has a collection of processes that will interact among each other according to the protocol rules for reaching consensus. Each execution of the protocol, through the Execute method, creates a new Instance (not to be confused with the conventional meaning of “instance” in object oriented programming).
  • The Instance class represents one round of interactions among processes potentially resulting in a consensus, as indicated by the Consensus and Value properties. It’s responsible for managing the interaction between processes exercising proposer and/or decider roles, providing methods for message delivery (Send) and retrieval (WaitMessage and WaitQuorum).
  • The Process class is responsible for implementing proposer and/or decider roles. Behavior is defined in the Bind method just before the instance starts since, as we’ll see, some protocols require processes to play different roles according to the current instance.
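
Since the diagram doesn’t carry over to text, here’s a rough C# skeleton of the three classes as just described. This is a simplified sketch based on my reading of the diagram, not the repository’s exact code (the protocol snippets below presumably call WaitMessage/WaitQuorum through thin wrappers on Process that pass the instance along):

using System;
using System.Collections.Generic;

public enum MessageType { Propose, Select, Ack, Nack, Accept, Decide }

public class Message
{
    public Process Source;  // sending process
    public object Value;    // payload (proposal, ack payload, decision, ...)
}

public abstract class Protocol
{
    // Processes that interact according to the protocol rules
    protected List<Process> Processes { get; } = new List<Process>();

    // Each execution runs one round of interactions and yields a new Instance
    public abstract Instance Execute();
}

public abstract class Instance
{
    public bool Consensus { get; protected set; }  // was consensus reached?
    public object Value { get; protected set; }    // the agreed value, if any

    public List<Process> Proposers { get; } = new List<Process>();

    // Message delivery and retrieval primitives used by processes
    public abstract void Send(Message msg);
    public abstract void WaitMessage(MessageType type, Action<Message> handler);
    public abstract void WaitQuorum(MessageType type, Action<IEnumerable<Message>> handler);

    // How many messages of a given type constitute a quorum (protocol-specific)
    protected abstract int GetQuorumSize(MessageType type);
}

public abstract class Process
{
    // Wires up proposer and/or decider behavior just before the instance starts
    public abstract void Bind(Instance r);

    // Called at instance start for proposers (default implementation shown below)
    protected virtual void Propose(Instance r) { }
}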

The Process.Propose virtual method is called at the start of an instance execution for all proposer processes in that instance, kicking off communication. It has the following default implementation:


protected virtual void Propose(Instance r)
{
    // Generate a candidate value from the underlying proposer logic
    var v = Proposer.GetProposal();

    // Broadcast it only if it passes application-specific validation
    if (Archiver.CanCommit(v))
        Broadcast(r, MessageType.Propose, v);
}

It first generates a proposal based on the underlying proposer logic (e.g. a random boolean proposer), then verifies with an archiver that the generated value passes custom application requirements, and finally broadcasts it to all sibling processes.

Processes that bind to the Instance.WaitQuorum event will wait for the protocol-specific number of messages (for each message type) required for taking an action. The specific counts are defined in each protocol implementation when deriving the Instance class.
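
As an illustration of the idea, a derived Instance might declare its quorum sizes per message type along these lines (a hypothetical sketch reusing the skeleton above, not necessarily how the repository does it):

// Declared abstract because the messaging members are omitted for brevity
public abstract class ChandraTouegInstance : Instance
{
    protected override int GetQuorumSize(MessageType type)
    {
        switch (type)
        {
            case MessageType.Propose:  // proposals gathered by the coordinator
            case MessageType.Ack:      // acks gathered by the coordinator
                return Proposers.Count / 2 + 1;  // simple majority
            default:
                return 1;  // Select/Decide originate from a single coordinator
        }
    }
}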

All four consensus protocols were implemented using this base structure, as presented in the following sections. Rather than performing an extensive code analysis I will focus on presenting each theoretical algorithm and key points of my implementation.

1) Chandra–Toueg

The Chandra–Toueg consensus algorithm was first published by Tushar Deepak Chandra and Sam Toueg in 1996 and introduced the concept of failure detectors as a means for solving the consensus problem. The algorithm assumes that the number of faulty processes, denoted by f, is less than n/2 (i.e. less than a simple majority).

In each instance one process acts as the decider (rotating coordinator) and all other processes act as proposers. The actions carried out in each instance are[1]:

  1. All processes send (r, preference, timestamp) to the coordinator.
  2. The coordinator waits to receive messages from at least half of the processes (including itself).
    1. It then chooses as its preference a value with the most recent timestamp among those sent.
  3. The coordinator sends (r, preference) to all processes.
  4. Each process waits (1) to receive (r, preference) from the coordinator, or (2) for its failure detector to identify the coordinator as crashed.
    1. In the first case, it sets its own preference to the coordinator’s preference and responds with ack(r).
    2. In the second case, it sends nack(r) to the coordinator.
  5. The coordinator waits to receive ack(r) or nack(r) from a majority of processes.
    1. If it receives ack(r) from a majority, it sends decide(preference) to all processes.
  6. Any process that receives decide(preference) for the first time relays decide(preference) to all processes, then decides preference and terminates.

Let’s take a look at the code. The following code snippet defines the proposer process behavior:


private void BindAsProposer(Instance r)
{
    WaitQuorum(r, MessageType.Select, msgs =>
    {
        var m = msgs.Single();
        SendTo(m.Source, r, MessageType.Ack, m.Value);
    });

    WaitQuorum(r, MessageType.Decide, msgs =>
    {
        var v = msgs.Single().Value;
        Terminate(r, v);
    });
}

On instance start proposers broadcast their values. Then they wait for a quorum to select a value from the pool of proposals. Since the protocol defines only a single decider per round, this quorum is met when the decider process broadcasts its selection. Upon receiving this selection the proposer node will acknowledge it and wait for the decider’s decision, then decide on the same value and terminate execution.

Now let’s see the code defining the complementary decider process behavior:


private void BindAsCoordinator(Instance r)
{
    WaitQuorum(r, MessageType.Propose, msgs =>
    {
        var v = PickMostRecentValue(
            msgs.Where(m => Archiver.CanCommit(m.Value))
        );

        Broadcast(r, MessageType.Select, v);
    });

    WaitQuorum(r, MessageType.Ack, msgs =>
    {
        var v = PickMostRecentValue(msgs);
        Broadcast(r, MessageType.Decide, v);
        Terminate(r, v);
    });
}

It first waits for a quorum of proposals, i.e., for a simple majority of proposers to submit their values. Then the decider picks the most recent message value, filtering out values rejected by the archiver, and broadcasts the selected value back to all processes.

After broadcasting the selected value the decider waits for a quorum of proposers to acknowledge it, in which case it decides on that value and broadcasts its decision to all processes before terminating.

Notice that this simplified implementation doesn’t take failures into account, even though the Chandra–Toueg protocol is fault-resilient.

2) Ben-Or

Ben-Or is a decentralized consensus protocol, i.e., it doesn’t assign the decider role to any specific process. Curiously, the algorithm’s correctness proof was only provided in a paper 15 years after its original publication[2].

Because it lacks a decider for resolving tie-break situations, Ben-Or is only capable of handling binary consensus (“true” or “false”) rather than arbitrary value consensus. Even then, it employs randomization to converge to consensus.

The algorithm works in phases within instances and is resilient to up to f crashing processes among a total of n processes, given that n > 2f. Here are the actions performed by every process in an instance:

  1. Assign the process an initial preference value x (0 or 1)
  2. Set the phase number k to zero
  3. Loop:
    1. Increment the current phase number k
    2. Send (R, k, x) to all processes
    3. Wait for messages of the form (R, k, *) from n − f processes (“∗” can be 0 or 1)
      • If received more than n/2 (R, k, v) with the same v Then
        • Send (P, k, v) to all processes
      • Else
        • Send (P, k, ?) to all processes
    4. Wait for messages of the form (P, k, ∗) from n − f processes (“∗” can be 0, 1 or ?)
      • If received at least f + 1 (P, k, v) with the same v ≠ ? Then
        • Decide(v) and terminate execution
      • Else if at least one (P, k, v) with v ≠ ? Then
        • Set x to v
      • Else
        • Set x to 0 or 1 randomly

It’s essentially a loop that ends when the process is able to decide on a value; otherwise it keeps going, with each loop iteration being a new phase consisting of two asynchronous rounds.

In the algorithm, every message contains a tag (R or P), a phase number, and a value which is either 0 or 1 (for messages tagged P, it could also be “?”). Messages tagged “R” are called reports and those tagged with “P” are called proposals.

This protocol was the easiest to implement but, don’t fool yourself, its behavior is not trivial to analyze, which is why it took so long for the correctness proof to come along. Below is my implementation of the Ben-Or process behavior:


public override void Bind(Instance r)
{
    // First round: wait for a quorum of reports ("R" messages)
    WaitQuorum(r, MessageType.Propose, msgs =>
    {
        var x = PickMostFrequentValue(
            msgs.Where(m => Archiver.CanCommit(m.Value))
        );

        // Propose v if more than half reported the same value, otherwise "?" (null)
        var v = x.Count > r.Proposers.Count / 2 ? x.Value : null;

        Broadcast(r, MessageType.Select, v);
    });

    // Second round: wait for a quorum of proposals ("P" messages)
    WaitQuorum(r, MessageType.Select, msgs =>
    {
        var x = PickMostFrequentValue(msgs.Where(m => m.Value != null));

        // f is the maximum number of crashing processes tolerated (n > 2f)
        if (x.Count >= f + 1)
        {
            // At least f + 1 identical proposals: safe to decide
            Terminate(r, x.Value);
            Proposer.Reset();
        }
        else
        {
            if (x.Count > 0)
                Proposer.Set(x.Value); // adopt the observed value
            else
                Proposer.Reset();      // fall back to a random 0 or 1
        }
    });
}

I have replaced the “R” tag with the MessageType.Propose enum value and the “P” tag with MessageType.Select to be more consistent with the other protocol implementations within the repository. Aside from that, it’s pretty much a copy of the algorithm described earlier, with the actual loop implemented in an event-based manner, i.e., as a result of cyclic message passing between processes.

3) Basic Paxos

So I just learned that the Paxos family of protocols is as old as I am, having first been submitted in 1989 (my year of birth) by Leslie Lamport. Its name derives from a fictional legislative consensus system used on the Paxos island in Greece, where Lamport wrote that the parliament had to function “even though legislators continually wandered in and out of the parliamentary Chamber”.

Paxos was by far the toughest one to reason about and implement in code. In it, processes are classified as proposers, accepters, and learners (a single process may have all three roles). The idea is that a proposer attempts to ratify a proposed decision value (from an arbitrary input set) by collecting acceptances from a majority of the accepters, and this ratification is observed by the learners. Agreement is enforced by guaranteeing that only one proposal can get the votes of a majority of accepters, and validity follows from only allowing input values to be proposed[3].

The basic Paxos algorithm is briefly presented below:

  1. The proposer sends a message prepare(n) to all accepters. (Sending to only a majority of the accepters is enough, assuming they will all respond.)
  2. Each accepter compares n to the highest-numbered proposal for which it has responded to a prepare message. If n is greater, it responds with ack(n, v, nv) where v is the highest-numbered proposal it has accepted and nv is the number of that proposal (or ⊥, 0 if there is no such proposal).
  3. The proposer waits (possibly forever) to receive ack from a majority of accepters. If any ack contained a value, it sets v to the most recent (in proposal number ordering) value that it received. It then sends accept(n, v) to all accepters (or just a majority).
  4. Upon receiving accept(n, v), an accepter accepts v unless it has already received prepare(n’) for some n’ > n. If a majority of accepters accept the value of a given proposal, that value becomes the decision value of the protocol.

Note that acceptance is a purely local phenomenon, so I’ll be including an additional message passing step in my implementation intended to detect which proposals, if any, have been accepted by a majority of accepters before processes can reliably decide on a value.

Now let’s look at the code implementation, starting with the proposer role behavior:


protected override void Propose(Instance r)
{
    proposalNumber = minNumber + 1;
    Broadcast(r, MessageType.Propose, proposalNumber);
}

private void BindAsProposer(Instance r)
{
    WaitQuorum(r, MessageType.Ack, msgs =>
    {
        var v = PickHighestNumberedValue(msgs)?.Value ?? Proposer.GetProposal();

        if (Archiver.CanCommit(v))
        {
            var x = new NumberedValue(v, proposalNumber);
            Broadcast(r, MessageType.Select, x);
        }
    });

    WaitMessage(r, MessageType.Nack, msg =>
    {
        if (msg.Value != null)
        {
            var n = (long)msg.Value;
            if (n > minNumber)
            {
                minNumber = Math.Max(n, minNumber);
                if (RandomExtensions.Tryout(0.5))
                    Propose(r);
            }
        }
    });

    WaitQuorum(r, MessageType.Accept, msgs =>
    {
        // Distinct values among the quorum of Accept messages
        // (renamed from "m" to avoid shadowing the lambda parameter below)
        var values = msgs.Select(m => m.Value).Distinct();

        if (values.Count() == 1)
        {
            var x = values.Single() as NumberedValue;
            Terminate(r, x);
            Broadcast(r, MessageType.Decide, x);
        }
    });
}

Note: usage of Math.Max in this snippet may seem redundant, but it was actually employed to solve a race condition without resorting to a mutex.

You can see that in this case the Process.Propose method was overridden to implement the Paxos protocol proposal numbering logic, and that I’ve used the MessageType.Propose enum member to represent the “prepare” message.

After broadcasting its proposal number the process will wait for either MessageType.Ack or MessageType.Nack responses. In case it receives a negative response, its proposal number is outdated, so the process will update it and submit a new proposal (with some randomness added for debugging purposes). However, in case it receives enough positive responses to form a quorum, the process will broadcast a MessageType.Select message representing the proposer’s accept vote (the algorithm’s third step) for the highest numbered value it’s seen so far, or its own preferred value if it hasn’t seen one yet.

Lastly, the proposer will be waiting for a quorum of consistent MessageType.Accept messages from accepters to learn which value has been accepted by a majority of them, in which case it decides on that value and terminates.

On the other side of the picture we have accepters, whose behavior code is presented below:


private void BindAsAccepter(Instance r)
{
    WaitMessage(r, MessageType.Propose, msg =>
    {
        var n = (long)msg.Value;

        if (n > minNumber)
        {
            minNumber = n;
            SendTo(msg.Source, r, MessageType.Ack, accepted);
        }
        else
        {
            SendTo(msg.Source, r, MessageType.Nack, minNumber);
        }
    });

    WaitMessage(r, MessageType.Select, msg =>
    {
        var x = msg.Value as NumberedValue;

        if (x.Number >= minNumber && Archiver.CanCommit(x.Value))
        {
            accepted = x;
            SendTo(msg.Source, r, MessageType.Accept, x);
        }
    });

    WaitMessage(r, MessageType.Decide, msg =>
    {
        var x = msg.Value as NumberedValue;
        Terminate(r, x);
    });
}

Upon receiving MessageType.Propose messages the accepter will “ack” as long as the proposal number is greater than the highest proposal number it’s seen so far (minNumber), updating that number, or “nack” it when it’s smaller than or equal.

It then waits for MessageType.Select messages, accepting one as long as its proposal number is greater than or equal to minNumber, in which case it updates its last accepted value and sends a MessageType.Accept message back to the proposer.

In closing, the accepter simply waits for a MessageType.Decide message to decide on its value and terminate execution.

4) Nakamoto Consensus

In the Bitcoin white paper, Nakamoto proposed a very simple Byzantine fault tolerant consensus algorithm that is also known as Nakamoto consensus. It’s fundamentally a randomized consensus protocol that employs the concept of “Proof of Work” for defining which process is entitled to legitimately decide on a value.

This proof of work operation is asymmetric: hard to produce but easy to verify. This is necessary to achieve the functional randomized behavior that is the basis of the protocol; otherwise it would be very difficult for processes to agree on the validity of proposed values.
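
To make this asymmetry concrete, here’s a minimal hash-based proof of work sketch in C#, in the spirit of Bitcoin’s scheme (illustrative only; the repository’s actual mining code differs):

using System.Security.Cryptography;
using System.Text;

public static class ProofOfWork
{
    // Verifying is cheap: a single hash computation
    public static bool Verify(string blockData, long nonce, int difficulty)
    {
        using (var sha = SHA256.Create())
        {
            var hash = sha.ComputeHash(Encoding.UTF8.GetBytes(blockData + nonce));

            // The proof is valid iff the hash starts with `difficulty` zero bytes
            for (int i = 0; i < difficulty; i++)
                if (hash[i] != 0)
                    return false;

            return true;
        }
    }

    // Mining is expensive: a brute-force search for a valid nonce
    public static long Mine(string blockData, int difficulty)
    {
        for (long nonce = 0; ; nonce++)
            if (Verify(blockData, nonce, difficulty))
                return nonce;
    }
}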

Nakamoto consensus has its own terminology. Instead of “values”, processes seek to reach consensus on data “blocks”. As the protocol execution advances, the set of decided data blocks forms a “chain”, with each new block reinforcing the previous chain block. The act of delivering a proof of work on a block, in order for a process to propose that block to the network, is called “mining”.

The algorithm can be briefly described in two sentences as follows[4]:

  • At each round r, an honest process attempts to mine new blocks on top of the longest chain it observes by the end of round r − 1 (where ties can be broken arbitrarily). This is often referred to as the longest chain rule.
  • At each round r, an honest process confirms a block if the longest chain it adopts contains the block as well as at least k other blocks of larger heights. This is sometimes referred to as the k-deep confirmation rule.

The purpose of the k-deep confirmation rule is to reduce the chances of a process deciding on an uncertain/volatile data block to a negligible probability.

Even though the Nakamoto consensus protocol has a simple form, it requires a reasonable amount of code to implement the block mining and block chain management logic. Here’s my implementation’s top level code:


protected override void Propose(Instance r)
{
    ThreadManager.Start(() =>
    {
        var b = MineBlock(r);
        if (b != null)
        {
            ProcessBlock(r, b);
            Broadcast(r, MessageType.Propose, b);
        }
    });
}

public override void Bind(Instance r)
{
    WaitMessage(r, MessageType.Propose, msg =>
    {
        var b = msg.Value as Block;
        ProcessBlock(r, b);
    });
}

private void ProcessBlock(Instance r, Block b)
{
    // Only consider blocks whose proof of work can be verified
    if (b.VerifyPoW())
    {
        var newChain = AddNewChain(b);

        // k-deep confirmation rule: decide once the block is buried deep enough
        if (IsKDeep(newChain) && !IsTerminated(r))
        {
            Terminate(r, GetBlockAt(newChain, k).Value);
            CommitChain(newChain);
        }
    }
}

Since mining a block is a time consuming operation it’s performed within a thread and, unlike in the other protocols described earlier, the proposal phase becomes asynchronous. Upon receiving a proposal the process evaluates whether the proposed block is valid, i.e., whether its proof of work can be verified. The block is then added to a compatible chain found in the set of unconfirmed chains. Finally, if the resulting unconfirmed chain segment becomes k-deep, the process decides on its oldest unconfirmed block.

Conclusion

Using the components and terminology described in the literature as a base for coding these four major consensus protocols allowed structural and message routing logic to be abstracted away from each specific protocol implementation in a consistent manner. The end result highlights each protocol’s logic and behavior, making the code more readable and succinct.

Alternatives for improving the current architecture and continuing this study include i) refactoring the Instance class to extract event binding and message passing responsibilities into new components and ii) implementing supporting components for evaluating protocol performance in the presence of process failures/crashes.

This exercise has been extremely helpful for me to better understand the consensus problem and get a glimpse into different approaches for solving it. I hope it may serve as a starting point for others interested in learning the basics of consensus protocols in a more practical way.


Notes

  • Revised on Oct 29, 2020. For reference you can find the original article here.

Sources

[1] Wikipedia. Chandra–Toueg consensus algorithm.

[2] Marcos Kawazoe Aguilera, Sam Toueg. Correctness Proof of Ben-Or’s Randomized Consensus Algorithm. 1998.

[3] James Aspnes. Paxos - Class notes. 2003 to 2012.

[4] Jianyu Niu, Chen Feng, Hoang Dau, Yu-Chih Huang, and Jingge Zhu. Analysis of Nakamoto Consensus, Revisited. 2019.

Jul 6, 2020 - A framework for investigating bugs

As software developers, fixing bugs is part of our job. Over the course of the years we end up investigating and solving hundreds, even thousands of them, and almost inevitably we become increasingly skilled at it.

So why is it that we become much better at solving bugs, and how can we leverage this knowledge to guide new developers into becoming better bug solvers (and also problem solvers)? One answer, based on my experience and observations, is that experienced software developers are really fast at determining the root cause of bugs.

If we look at an established problem solving process such as Toyota’s eight step process[1] we can see that most uncertainty is contained in steps “1” through “4”, which are concerned with searching for the root cause of the problem:

  1. Clarify the problem
  2. Break down the problem
  3. Target setting
  4. Root cause analysis
  5. Develop countermeasures
  6. See countermeasures through
  7. Monitor results and processes
  8. Standardize successful processes

It’s arguably really difficult to accurately estimate how long it will take to find the root cause of a bug: some take only a few minutes, others a couple of hours, and some even days! Nonetheless, once the root cause is found and confirmed, the next steps can be quite easily estimated, planned and carried out.

Hence, my proposition is that in order to become more effective bug solvers we inevitably need to become better bug investigators, i.e., better at searching for and confirming the root cause of bugs. In the next sections I derive a simple framework for root cause analysis inspired by my own observations and by the analysis of competing hypotheses (ACH) methodology[2], which is aimed at evaluating multiple competing hypotheses for observed data in an unbiased way.

The Framework

The proposed framework has five steps, presented below. It’s centered on hypothesis generation and evaluation through gathered evidence, and the way these steps are performed, as we’ll see, varies greatly according to the experience level of the developer:

Root cause flowchart

1. Clarify the Problem

The first step is to describe the problem, or bug in our case, clearly, so as to remove any ambiguity from its definition. All parties involved in sorting a bug out must agree on what the problem is, and what it isn’t. Asking a few questions about the conditions in which the bug surfaced already goes a long way here, helping bring up relevant application use cases that have failed:

  • What were you trying to achieve?
  • What steps did you take?
  • What should have happened?
  • What happened instead?

Junior developers are more prone to take this step for granted and jump right into the next steps without making sure all parties involved share the same definition of the problem as he/she does. Senior developers, however, will only proceed after reaching out to all parties involved to ensure everyone is on the same page in regard to what the problem is and how to reproduce it.

By not performing this simple step appropriately a developer risks spending valuable hours solving the wrong problem, and over the past ten years I can say that I’ve seen this happen several times.

2. Generate Hypotheses

Once everyone agrees on what the problem is, it’s time to brainstorm hypotheses on why it’s happening, i.e., what’s causing the bug. Based solely on a clear problem definition and technical knowledge of a product it’s already possible to come up with several candidate hypotheses.

Tougher bugs generally yield less specific hypotheses at first, but as we iterate over the framework’s steps and our comprehension of the problem improves, we’ll be able to refine them and narrow down a probable root cause.

Let’s use an example to illustrate this step. Suppose we are dealing with a bug in a GPS navigation application:

  • Route calculation from address A to address B produces a non-optimal more costly route when using the “find cheapest route” feature

Out of the box a developer may propose the following hypotheses:

  • The “find cheapest route” feature is not being activated, and the standard route calculation algorithm is being used instead
  • Road pricing (tolls) information is outdated in the application, and the “find cheapest route” feature relies on it for calculating the right route
  • The “find cheapest route” feature algorithm is faulty, and the bug definition describes a failing test case
  • The more costly route is the only one available due to ongoing circumstances (“not a bug” hypothesis)

Each one of these hypotheses will go through an evidence gathering step and an evaluation step in order to invalidate most of them and find the one that eventually holds true.

In this step Senior developers have a clear advantage, usually due to having i) an extensive record of past bugs solved and their root causes, ii) deep knowledge of the product and business rules, and iii) deep knowledge of the product’s technology stack. Under these circumstances, given what he/she’s seen and knows, a Senior developer is able to enumerate a much larger number of highly likely root cause hypotheses in a short time when compared to Junior developers.

3. Gather Evidence

Evidence should be gathered with the goal of invalidating hypotheses, and may include: system logs, system metrics, analytics, screenshots, debugging sessions, etc. A skeptical mindset for gathering evidence helps us overcome common cognitive biases and be more effective in this process. Below I list a few cognitive biases that may affect our judgement of what evidence to seek when validating a bug’s root cause hypothesis:

  • Availability bias: the tendency to think that examples of things that come readily to mind are more representative than they actually are.
  • Confirmation bias: the tendency to search for, interpret, favor, and recall information that confirms or supports one’s prior personal beliefs or values.
  • Information bias: the tendency to seek information even when it cannot affect action.
  • Illusion of validity: the tendency to overestimate one’s ability to interpret and predict accurately the outcome when analyzing a set of data, in particular when the data analyzed show a very consistent pattern.

Besides being more susceptible to cognitive biases, it’s not uncommon for Junior developers to not use all available information sources when collecting evidence, eventually leaving out key evidence in a first iteration, only to come back to it again in future iterations, misspending valuable time. Senior developers, however, will in general know where to look for evidence, and what to look for, hardly leaving anything important out of the analysis.

4. Evaluate Hypotheses

Despite the flowchart given above, in which this step is presented as coming only after one has completed generating hypotheses and gathering evidence, it’s actually performed somewhat simultaneously with steps “2” and “3”.

I see this step as an ongoing mental process that happens while hypotheses are being generated, so as to evaluate their feasibility, and while evidence is being collected, so as to evaluate the impact of new evidence on the set of candidate hypotheses being considered.

The cyclical procedure formed by steps “2”, “3” and “4” is similar to a feedback loop that continuously refines the set of candidate hypotheses based on a growing context of “diagnostic” evidence against which these hypotheses are evaluated.

This cycle is performed until a single hypothesis remains valid, in which case we proceed to the confirmation step. Otherwise, if more than one hypothesis holds, one should keep seeking evidence to invalidate them. In case no hypothesis holds, we’ll have to go back to step “2” to brainstorm new hypotheses.

Critical judgement and logical thinking play a central role in this step. Developers are required to analyze facts (evidence), extrapolate them and make effective decisions. Oftentimes we are faced with incomplete evidence, having to evaluate under time constraints whether it’s strong enough to make a decision. Junior developers may not yet have a fully developed mental framework for making effective decisions in the context of an application/product they’ve just recently started working on, hence the importance of more experienced developers assisting them while they develop these skills.

5. Confirm Hypothesis

Finally, when there’s a single hypothesis left, it goes through a confirmation step similar to the previous evaluation step, but entirely focused on proving that the bug’s root cause was indeed found. Two general strategies are provided below:

  1. The tracing strategy: Since the problem is clarified and a probable root cause established, in this strategy we employ tracing techniques and verbose logging to create a track record that will allow for confirmation upon reproducing the bug.
  2. The exploitation strategy: Complementary to the first, in this strategy we exploit the probable root cause for defining new test cases whose outcomes shall be consistent with said root cause, corroborating it.

It’s possible that while trying to confirm the hypothesis we end up invalidating it or discovering it’s not yet actionable, i.e., not specific enough for a developer to start working on a solution; if that’s the case we’ll have to go back to step “2”.

Again, Junior developers often rush into coding a bug solution before properly confirming its root cause. More experienced developers know that it’s much cheaper in the long run to always confirm the root cause before starting to code a fix.

Conclusion

Wrapping up, the framework proposed in this article tries to capture a functional mental model for investigating a problem’s root cause. A few key characteristics seem to differentiate senior developers from junior developers in regard to their speed in determining the root cause of bugs, namely:

  • Extensive knowledge of the product and technology stack
  • The ability to produce highly likely hypotheses
  • Critical thinking mindset for seeking and evaluating evidence
  • Meticulousness upon which each step is carried out

One can improve in these matters up to a point simply by becoming aware of their role and importance. Proper guidance can take a developer even further. However, only years of practice solving a generous amount of bugs will eventually lead to proficiency.


Sources

[1] Phillip Marksberry, PhD, PE. The Modern Theory of the Toyota Production System: A Systems Inquiry of the World’s Most Emulated and Profitable Management System. Productivity Press, 2012.

[2] Richards J. Heuer Jr. Psychology of Intelligence Analysis. Center for the Study of Intelligence, 1999.