Jul 6, 2020 - A framework for investigating bugs

As software developers, fixing bugs is part of our jobs. Over the course of the years we end up investigating and solving hundreds, thousands of them, and mostly inevitably we become increasingly skilled at it.

So why is that we become much better at solving bugs, and how can we leverage this knowledge to guide new developers for becoming better bug solvers (and also problem solvers) as well? One answer, based on my experience and observations, is that experienced software developers are really fast at determining the root cause of bugs.

If we look at an established problem solving process such as Toyota’s eight step process1 we can see that most uncertainty is contained in steps “1” through “4”, which are concerned with searching the root cause of the problem:

  1. Clarify the problem
  2. Breakdown the problem
  3. Target setting
  4. Root cause analysis
  5. Develop countermeasures
  6. See countermeasures through
  7. Monitor results and processes
  8. Standardize successful processes

It’s arguably really difficult to accurately estimate how long it will take to find out the root cause of a bug: there are those that take only a few minutes, others a couple of hours and some even days! Nonetheless, once the root cause is found and confirmed the next steps can be quite easily estimated, planned and carried out.

Hence, my proposition is that in order to become more effective bug solvers we inevitably need to become better bug investigators, i.e., better at searching and confirming the root cause of bugs. In the next sections I derive a simple framework for root cause analysis inspired by my own observations and by the analysis of competing hypotheses (ACH) methodology2, which is aimed at evaluating multiple competing hypotheses for observed data in an unbiased way.

The Framework

The proposed framework has five steps and is presented below. It’s centered at hypotheses generation and evaluation through gathered evidence, and the way these steps are performed, as we’ll see, varies greatly according to the experience level of the developer:

Root cause flowchart

1. Clarify the Problem

The first step is to describe the problem, or bug in our case, clearly so to remove any ambiguity from its definition. All parties involved in sorting a bug out must agree on what the problem is, and what it isn’t. Asking a few questions about the conditions in which the bug surfaced already go a long way in this matter, helping bringing up relevant application use cases that have failed:

  • What were you trying to achieve?
  • What steps did you take?
  • What should have happened?
  • What happened instead?

Junior developers are more prone to take this step for granted and jump right into the next steps without making sure all parties involved share the same definition of the problem as he/she does. Senior developers, however, will only proceed after reaching out all parties involved to ensure everyone is on the same page in regard to what the problem is and how to reproduce it.

By not performing this simple step appropriately a developer is risking spending valuable hours solving the wrong problem, and over the past ten years I can say that I’ve seem this happen several times.

2. Generate Hypotheses

Once everyone agrees on what the problem is it’s time to brainstorm hypotheses on why it’s happened, i.e., what’s causing the bug. Based solely on a clear problem definition and technical knowledge of a product it’s already possible to come up with several candidate hypotheses.

Tougher bugs generally yield to less specific hypotheses at first, but as we iterate over the framework’s steps and our comprehension of the problem improves we’ll be able to refine them for narrowing down a probable root cause.

Let’s use an example to illustrate this step, suppose we are dealing with a bug in a GPS navigation application:

  • Route calculation from address A to address B produces a non-optimal more costly route when using the “find cheapest route” feature

Out of the box a developer may propose the following hypotheses:

  • The “find cheapest route” feature is not being activated, and the standard route calculation algorithm is being used instead
  • Road pricing (tolls) information is outdated in the application, and the “find cheapest route” feature relies on it for calculating the right route
  • The “find cheapest route” feature algorithm is faulty, and the bug definition describes a failing test case
  • The more costly route is the only one available due to ongoing circumstances (“not a bug” hypothesis)

Each one of these hypotheses will go through an evidence gathering step and an evaluation step in order to invalidate the majority of them, and finding out the one that eventually holds true.

In this step Senior developers have a clear advantage usually due to having i) an extensive past record of bugs solved and their root causes ii) a deep knowledge of the product and business rules iii) a deep knowledge of the product’s technology stack. Under these circumstances, for what he/she’s seen and knows, a Senior developer is able to enumerate a much larger number of highly likely root cause hypotheses in a short time, when compared to Junior developers.

3. Gather Evidence

Evidence should be gathered with the goal of invalidating hypotheses, including: system logs, system metrics, analytics, screenshots, debugging sessions, etc. A skeptical mindset for gathering evidence helps us overcome common cognitive biases and being more effective in this process. Below I list a few cognitive biases that may affect our judgement on what evidence to seek for validiting a bug’s root cause hypothesis:

  • Availability bias: the tendency to think that examples of things that come readily to mind are more representative than they actually are.
  • Confirmation bias: the tendency to search for, interpret, favor, and recall information that confirms or supports one’s prior personal beliefs or values.
  • Information bias: the tendency to seek information even when it cannot affect action.
  • Illusion of validity: the tendency to overestimate one’s ability to interpret and predict accurately the outcome when analyzing a set of data, in particular when the data analyzed show a very consistent pattern.

Besides being more susceptible to cognitive biases, it’s not uncommon for Junior developers to not use all available information sources for collecting evidence, eventually leaving out key evidence in a first iteration, only to come back to it again in future iterations, misspending valuable time. Senior developers, however, will in general know where to look for evidence, and what to look for, hardly leaving anything important out of the analysis.

4. Evaluate Hypotheses

Despite the more illustrative flowchart given above in which this step is presented as coming up only after one has completed generating hypothesis and gathering evidence, it’s actually performed somewhat simultaneously to steps “2” and “3”.

I see this step as an ongoing mental process that happens while hypotheses are being generated, as to evaluate their feasibility, and while evidence is being collected, as to evaluate the impact of new evidence on the set of candidate hypotheses being considered.

The cyclical procedure formed by steps “2”, “3” and “4” is similar to a feedback loop that continuosly refines the set of candidate hypothesis based on a growing context of “diagnostical” evidence against which these hypotheses are evaluated.

This cycle is performed until a single hypothesis remains valid, in which case we proceed to the confirmation step. Otherwise, if more than one hypotheses hold, one should still seek for evidence to invalidate them. In case none hypothesis hold, we’ll have to go back to step “2” for brainstorming new hypotheses.

Critical judgement and logical thinking play a central role in this step. Developers are required to analyze facts (evidence), extrapolate them and make effective decisions. Oftentimes we are faced with incomplete evidence having to evaluate in a timely constrained situation whether it’s strong enough to make a decision. Junior developers may not yet have a fully developed mental framework for making effective decisions in the context of an application/product they’ve just recently started working in, hence the importance of more experienced developers to assist them while they develop these skills.

5. Confirm Hypothesis

Finally, when there’s a single hypothesis left it goes through a confirmation step similar to the previous evaluation step, but entirely focused in proving that the bug root cause was indeed found. Two general strategies are provided below:

  1. The tracing strategy: Since the problem is clarified and a probable root cause established, in this strategy we employ tracing techniques and verbose logging to create a track record that will allow for confirmation upon reproducing the bug.
  2. The exploitation strategy: Complementary to the first, in this strategy we exploit the probable root cause for defining new test cases whose outcomes shall be consistent with said root cause, corroborating it.

It’s possible that while trying to confirm the hypothesis we end up invalidating it or discovering it’s not yet actionable, i.e., not specific enough for a developer to start working in a solution, and if that’s the case we’ll have to go back to step “2”.

Again, Junior developers often rush into coding a bug solution before properly confirming it’s root cause. More experienced developers know that it’s much cheaper in the long run to always confirm the root cause before starting to code a fix.


Wrapping up, the framework proposed in this article tries to capture a functional mental model for investigating a problem’s root cause. Key characteristics seem to differentiate senior developers from junior developers in regard to their speed in determining the root cause of bugs, namely:

  • Extensive knowledge of the product and technology stack
  • The ability to produce highly likely hypotheses
  • Critical thinking mindset for seeking and evaluating evidence
  • Meticulousness upon which each step is carried out

One can improve himself in these matters up to a point by simply becoming aware of their role and importance. Proper guidance can take a developer even further. However, only years of practice solving a generous amount of bugs will eventually lead to proficiency.


[1] Phillip Marksberry, PhD, PE. The Modern Theory of the Toyota Production System: A Systems Inquiry of the World’s Most Emulated and Profitable Management System. Productivity Press, 2012.

[2] Richards J. Heuer Jr. Psychology of Intelligence Analysis. Center for the Study of Intelligence, 1999.

Jun 27, 2020 - Disabling TLS 1.0 and TLS 1.1

Following recommendations from RFC-7525 this year all major browsers are disabling TLS versions 1.0 and 1.1. The table below presents approximate deadline for that to happen1,2,3,4:

Browser Name Date
Google Chrome July 2020
Microsoft Edge July 2020
Microsoft IE11 September 2020
Mozilla Firefox March 2020
Safari/Webkit March 2020§

In light of current global circumstances, this planned change has been postponed — originally scheduled for the first half of 2020.

Due to the pandemic, Mozilla reverted the disable of TLS 1.0 and TLS 1.1 for an undetermined amount of time to better enable access to critical government sites sharing Covid-19 information.

§ Release date for Safari Technology Preview.

So why is this happening? Well, these protocol versions were designed around two decades ago and they don’t support a lot of the recent developments in cryptography. Below is a copy of the rationale included in RFC-7525:

  • TLS 1.0 (published in 1999) does not support many modern, strong cipher suites. In addition, TLS 1.0 lacks a per-record Initialization Vector (IV) for CBC-based cipher suites and does not warn against common padding errors.

  • TLS 1.1 (published in 2006) is a security improvement over TLS 1.0 but still does not support certain stronger cipher suites.

In the next sections I provide more information on how this may affect your web applications, and how to handle this transition, based on my recent experience.

Security Scan

To be honest I only found out about browsers dropping support to TLS 1.0 and TLS 1.1 after running a SSL Server Test against one of the web applications I manage and seeing a drop in its overall rating:

SSL Scan grade B

Since this application is required by contract to achieve grade “A” in this particular test I started digging on this subject to understand the difficulty involved in complying with this requirement. Fortunately in my case it was a no brainer, as you’ll see.

Evaluating Impact

My first action was to evaluate how many of my web application users would be impacted by this change. Given my application supported TLS 1.2, and fewer than 0.5% of users were making connections using older TLS versions, the impact was evaluated as minimum.

Overall, the following table shows for each browser the percentage of connections made to SSL/TLS servers in the world wide web using protocol TLS 1.0 and TLS 1.1 from an analysis conducted in late 2018 5:

Browser/Client Name Percentage (%) – Both TLS 1.0 and 1.1
Google Chrome 0.5%
Microsoft IE and Edge 0.72%
Mozilla Firefox 1.2%
Safari/Webkit 0.36%
SSL Pulse Nov. 20186 5.84%

These figures are from two years ago and by now we can expect them to be even lower, meaning the transition should be transparent to most web application users on the internet.

Third-Party Integrations

Moving forward, I checked all of my web application’s third-party integrations in order to confirm that they also supported TLS 1.2, just to name a few:

  • Amazon DynamoDB (data storage)
  • Amazon Elasticsearch Service (text indexing and search)
  • SendGrid (e-mail delivery)

During testing I experienced the following error in some of them, which apparently was caused due to having TLS protocol versions 1.0 and 1.1 enabled in the source code (C#) and at the same time disabled at the Web Server level:

‘ConnectionError’ —> The client and server cannot communicate, because they do not possess a common algorithm.

This error was easily solvable by forcing the TLS 1.2 version in my application source code, replacing this:

ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls | SecurityProtocolType.Tls11 | SecurityProtocolType.Tls12;

By this:

ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;

With this configuration in place I rebuilt my application and ran integration tests for each external service to guarantee the issue had been solved.

Luckily all of my application’s integrations supported TLS 1.2. If that’s not your case and your Web Server doesn’t give you the option to make exceptions for these external services you’ll have to reach out their technical support and possibly open a request ticket for adding TLS 1.2 support. Another (less secure) option is to create a proxy between your application and the external service that will talk TLS 1.2 with your app and TLS 1.1 or TLS 1.0 with the external service.

Remote Connection

If you’re solely using SSH to connect to remote Virtual Machines (VMs) you can skip this section. However, if like me you also manage some Windows Server VMs connecting to them using the RDP protocol, before disabling TLS 1.0 and 1.1 in these machines make sure RDP is running on TLS 1.2 or above, otherwise you will lock yourself out of the VM!

To be on the safe side besides confirming that the RDP Host in the server was configured to accept TLS 1.2 connections I also disabled older TLS versions in my notebook, connected to the remote machine and sniffed packets using Wireshark:

Wireshark sniffing

This result gave me enough confidence to proceed, knowing that I was indeed connecting using TLS 1.2 and I’d not lose access to the remote server.

Disabling the Protocol

The exact steps you’ll have to follow for disabling TLS 1.0 and TLS 1.1 in your web application will obviously depend on your technology stack. Below I provide references for the three most used web servers (Apache, IIS and Nginx):

If you’ve followed the link with instructions for disabling the protocol in IIS you may have notice it’s the least straightforward of them all! It requires a system wide configuration by modifying a few Registry Keys.

Since I had not one but multiple Windows Server instances to configure I decided to make my life easier and write a simple console app for querying the Windows Registry and making required changes. I made this tool available on GitHub in case you’re interested: TLSConfig.


As with all Windows Registry changes you will have to restart your VM for them to take effect 😒

Important: In case you follow the steps for disabling the protocol only to realise nothing happened, it could be that your application in sitting behind a load balancer or a proxy server which you will have to configure as well.


After using the TLSConfig tool to disable TLS versions 1.0 and 1.1 in all of my application servers I ran the security scan again and… success!

SSL Scan grade A

DISCLAIMER: This blog post is intended to provide information and resources on this subject only. Most of the recommendations and suggestions here are based on my own experience and research. Follow them at your own will and responsibility, evaluate impact at all times before proceeding, test changes in a separate environment before applying them to production, schedule maintenance to low usage hours and be prepared for things to go wrong.


[1] Google Developers. Deprecations and removals in Chrome 84. Retrieved 2020-06-27.

[2] Windows Blogs. Plan for change: TLS 1.0 and TLS 1.1 soon to be disabled by default. Retrieved 2020-06-27.

[3] Mozilla. Firefox 74.0 Release Notes. Retrieved 2020-06-27.

[4] WebKit Blog. Release Notes for Safari Technology Preview 98. Retrieved 2020-06-27.

[5] Qualys Blog. SSL Labs Grade Change for TLS 1.0 and TLS 1.1 Protocols. Retrieved 2020-06-27.

[6] Qualys. SSL Pulse is a continuous and global dashboard for monitoring the quality of SSL / TLS support over time across 150,000 SSL- and TLS-enabled websites, based on Alexa’s list of the most popular sites in the world.

May 29, 2020 - Data replication in random regular graphs

Graph theory is extensively studied, experimented on and applied to communications networks 📡. Depending on a communication network’s requirements it may benefit from adopting one or another network topology: Point to point, Ring, Star, Tree, Mesh, and so forth.

In this post I analyze a network topology based on unweighted random regular graphs, and evaluate its robustness for data replication amid partial network disruption. First I present the definition of this kind of graph and describe its properties. Then I implement and validate a random regular graph generation algorithm from the literature. Finally I simulate data replication for varying degrees of partial network disruption and assess this topology effectiveness.

Supporting source code for this article can be found in this GitHub repository.


A regular graph is a graph where each vertex has the same number of neighbors; i.e. every vertex has the same degree or valency. The image below shows a few examples:


These sample graphs are regular since we can confirm that every vertex has exactly the same number of edges. The first one is 2-regular (two edges per vertex) and the following two are 3-regular (three edges per vertex).

Even though 0-regular (disconnected), 1-regular (two vertices connected by single edge) and 2-regular (circular) graphs take only one form each, r-regular graphs of the third degree and upwards take multiple distinct forms by combining their vertices in a variety of different ways.

More broadly we can denote Gn,r as the probability space of all r-regular graphs on n vertices, where 3 ≤ r < n. Then, we define a random r-regular graph as the result of randomly sampling Gn,r.

Properties of random regular graphs

There are at least two main properties that are worth exploring in this article. It is possible to prove that as the size of the graph grows the following holds asymptotically almost surely:

  • A random r-regular graph is almost surely r-connected; thus, maximally connected
  • A random r-regular graph has diameter at most d, where d has an upper bound of Θ(logr−1(nlogn)) [1]

The connectivity of a graph is an important measure of its resilience as a network. Qualitatively we can think of it as how tolerant the topology is to vertices failures. In this case, the graph being maximally connected means it’s as fault tolerant as it can be in regard to its degree.

Also relevant for communication networks is the graph diameter, which is the greatest distance between any pair of vertices, and hence is qualitatively related to the complexity of messaging routing within the network. In this case, the graph diameter grows slowly (somewhat logarithmically) as the graph size gets larger.

Graph generation algorithm

A widely known algorithm for generating random regular graphs is the pairing model, which was first given by Bollobas2. It’s a simple algorithm to implement and works fast enough for small degrees, but it becomes slow when the degree is large, because with high probability we will get a multigraph with loops and multiple edges, so we have to abandon this multigraph and start the algorithm again3.

The pairing model algorithm is as follows:

  1. Start with a set of n vertices.
  2. Create a new set of n*k points, distributing them across n buckets, such that each bucket contains k points.
  3. Take each point and pair it randomly with another one, until (n*k)/2 pairs are obtained (a perfect matching).
  4. Collapse the points, so that each bucket (and thus the points it contains) maps onto a single vertex of the original graph. Retain all edges between points as the edges of the corresponding vertices.
  5. Verify if the resulting graph is simple, i.e., make sure that none of the vertices have loops (self-connections) or multiple edges (more than one connection to the same vertex). If the graph is not simple, restart.

You can find my implementation of this algorithm in the following source file: RandomRegularGraphBuilder.cs.

Now that the algorithm is implemented, let’s evaluate that the properties described earlier truly hold.

Evaluating graph connectivity

In order to calculate a graph’s connectivity we need to ask for the minimum number of elements (vertices or edges) that need to be removed to separate the remaining vertices into isolated subgraphs.

We start from the observation that by selecting one vertex at will and then pairing it with all other remaining vertices in the graph, one at a time, to calculate their edge connectivity (i.e., the minimum number of cuts that partitions these vertices into two disjoint subsets) we are guaranteed to eventually stumble across the graphs own edge connectivity:

public int GetConnectivity()
    var min = -1;
    for (int i = 1; i < Vertices.Length; i++)
        var c = GetConnectivity(Vertices[0], Vertices[i]);
        if (min < 0 || c < min)
            min = c;
    return min;

The code above was taken from my Graph class and does just that from the array of vertices that compose the graph. The algorithm used in this code sample for calculating the edge connectivity between two vertices is straightforward, but a bit more extensive. Here’s what it does on a higher level:

  1. Initialize a counter at zero.
  2. Search for the shortest path between source vertex and destination vertex (using Depth First Search).
  3. If a valid path is found increment the counter, remove all path edges from the graph and go back to step 2.
  4. Otherwise finish. The counter will hold this pair of vertices’ edge connectivity.

It’s important to emphasize that this algorithm is intended for symmetric directed graphs, where all edges are bidirected (that is, for every arrow that belongs to the graph, the corresponding inversed arrow also belongs to it).

Once we are able to calculate a graph’s connectivity, it’s easy to define a unit-test for verifying that random regular graphs are indeed maximally connected:

public void RRG_Connectivity_Test()
    var n = (int)1e3;
    Assert.AreEqual(3, BuildRRG(n, r: 3).GetConnectivity());
    Assert.AreEqual(4, BuildRRG(n, r: 4).GetConnectivity());
    Assert.AreEqual(5, BuildRRG(n, r: 5).GetConnectivity());
    Assert.AreEqual(6, BuildRRG(n, r: 6).GetConnectivity());
    Assert.AreEqual(7, BuildRRG(n, r: 7).GetConnectivity());
    Assert.AreEqual(8, BuildRRG(n, r: 8).GetConnectivity());

This unit test is defined in the RandomRegularGraphBuilderTests.cs source file and, as theorized, it passes ✓

Evaluating graph diameter

The algorithm for calculating the diameter is much easier, particularly in the case of unweighted graphs. It boils down to calculating the maximum shortest path length from all vertices, and then taking the maximum value among them:

public int GetDiameter()
    return Vertices.Max(
        v => v.GetMaxShortestPathLength(Vertices.Length)

This maximum shortest path length method receives an expectedVerticesCount integer parameter, which is the total number of vertices in the graph. This way the method is able to compare the number of vertices traversed while searching for the maximum shortest path with the graph’s size, and in case they differ, return a special value indicating that certain vertices are unreachable from the source vertex.

So after implementing this method I ran it against random regular graphs of varying sizes and degrees. The results are plotted below:


We can clearly confirm that the graph’s diameter flattens somewhat logarithmically as we increase the graph’s size and degree.

Since we are dealing with random graphs, instead of simply calculating the diameter for a single sample I generated one hundred samples per size and degree, and took their average. It’s worth noting that the inferred diameter standard deviation started reasonably small (less than 0.5) and diminished to insignificant values as the graph size and degree increased.

Data replication

In this study I simulated the propagation of information across random regular graphs accounting for disruption percentages starting from 0% (no disruption) up to 50% of the graph’s vertices. The simulation runs in steps, and at each iteration active vertices propagate their newly produced/received data to their neighbors.

At the beginning of each simulation vertices are randomly selected and marked as inactive up to the desired disruption percentage. Inactive vertices are able to receive data, but they don’t propagate it. The simulation runs until 100% of vertices successfully receive data originated from a source vertex chosen arbitrarily, or until the number of iterations becomes greater than the number of vertices, indicating that the level of disruption has effectively separated the graph into two isolated subgraphs.

Here’s the method that I implemented for running it for any given graph size and degree:

private static void SimulateReplications(int n, int r)

    for (double perc = 0; perc < 0.5; perc += 0.05)
        var reached = -1;
        int iterations = 1;
        int disabled = (int)(perc * n);

        for (; iterations < n; iterations++)
            var graph = new RandomRegularGraphBuilder().Build(n, r);

            var sourceVertex = 50; // Any
            var item = Guid.NewGuid().GetHashCode();

            graph.DisableRandomVertices(disabled, sourceVertex);

            reached = graph.Vertices.Count(n => n.Items.Contains(item));
            if (reached == n || iterations > n)


The results for running it with parameters n = 1000 and r = [4, 8, 12] are given in the chart that follows:

simulation iterations

We can verify that the larger the graph’s degree, the less significant the effects of disruption levels are, which makes sense since intuitively there are much more path options available for the information to replicate.

In the simulation ran with parameters [n=1000, r=12] it was only required two additional iterations (6 in total) at 50% disruption for the source data to reach all graph vertices when compared with the base case in which all vertices were active.

For the lowest graph degree tested with [n=1000, r=4], however, the effects of disruption were quite noticeable, spiking to a total of 42 iterations for 25% disruption and not being able to reach the entirety of vertices for disruption levels above that (that’s why it’s not even plotted in the chart).

After some consideration, I realized that the spike in the number of iterations required for reaching all graph’s vertices in the simulation seems to occur when the chances of accidentally cutting the graph into two isolated subgraphs (while deactivating vertices) increase considerably. This probability can be roughly estimated as (1 - (1 - disruptionr)n), i.e., the probability of deactivating all neighbors of at least one graph vertex.

The following contour plot displays these probability estimates for graphs of size n=1000 for given disruption levels and graph degrees:

disruption contour

Now by analyzing the simulation iteration spikes on top of these probabilities we find that they started occurring when p neared 0.9. It’s important to highlight that as the probability of cutting off graph vertices increases, the number of simulation iterations required for reaching the totality of graph vertices becomes more volatile since, the way my code was devised, a new random regular graph is sampled and the current disruption level is randomly applied at each retrial. Nonetheless, as p nears 1.0, we are certain to end up with at least one disconnected vertex, meaning that we won’t be able to assess a valid number of simulation iterations for the replication to reach the entire graph.


This analysis has shown that from a theoretical point of view random regular graph topologies are particularly well suitable for data replication in applications where communication agents are faulty or cannot be fully trusted to relay messages, such as sensor networks, or reaching consensus in multi-agent systems.

As a result of being maximally connected, with proper scaling high levels of network disruption can be tolerated without significantly affecting the propagation of data among healthy network agents.

Lastly, the topology’s compact diameter favors fast and effective communication even for networks comprised of a large number of participating agents.


[1] B. Bollobás & W. Fernandez de la Vega, The diameter of random regular graphs, Combinatorica 2, 125–134 (1982).

[2] B. Bollobás, A probabilistic proof of an asymptotic formula for the number of labelled regular graphs, Preprint Series, Matematisk Institut, Aarhus Universitet (1979).

[3] Pu Gao, Models of generating random regular graphs and their short cycle distribution, University of Waterloo (2006).