Nov 3, 2019 - System design coherence

Six years (so far) developing and managing the same application has taught me a few lessons, one of them being the value of pursuing system design coherence. It’s a collective, rather than an individual, responsibility, requiring the commitment of the entire development team.

Coherence is defined as the quality of being logical, consistent and forming a unified whole, which from my point of view directly relates to how system design should be handled:

  • Logical: System design decisions should be justified, following a clear line of thought.
  • Consistent: System design decisions should be compatible, in agreement, with the system’s current state.
  • Unified whole: System components should fit together, seamlessly working alongside each other.

Below are listed five major practical guidelines on how to manage system design coherence in software projects.

1 - Create and follow codebase conventions

This is one of the most basic yet beneficial measures you can adopt to improve code quality. It deals with how source code files are written and organized within your codebase.

We developers spend most of our time reading, not writing, code, hence it’s extremely important to define and enforce coding conventions up front in your product development life cycle targeting improved code readability.

I personally adopt a combination of organizational and clean coding conventions such as:

  • Segment Files/Directories/Namespaces by domain
  • Avoid multiple languages in one source file
  • Class/Interface names should be nouns or noun phrases
  • Function names should say what they do
  • Avoid too many arguments in functions
  • Avoid functions with too many lines
  • Replace magic numbers with named constants
  • Don’t comment intuitive code
  • Discard dead code
  • Where to declare instance variables
  • Where to put braces
  • Tabs vs spaces

And the list goes on…

As a result of following conventions, source code will be uniform throughout the codebase, reducing the cognitive effort of searching and reading code files.
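To make a few of these conventions concrete, here is a minimal sketch in Python. The names and the 30-day value are hypothetical, chosen only to illustrate named constants and a function name that says what it does:

```python
# Hypothetical example: the constant values and names are illustrative only.
SECONDS_PER_DAY = 24 * 60 * 60
SESSION_LIFETIME_DAYS = 30  # assumed policy value, not taken from a real system


def is_session_expired(created_at: float, now: float) -> bool:
    """Return True when a session is older than the allowed lifetime."""
    return now - created_at > SESSION_LIFETIME_DAYS * SECONDS_PER_DAY
```

Compare this with an unnamed `if now - t > 2592000:` check, where the reader has to work out what the number means.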

2 - Implement clear software architectures

The definition of software architecture is still a topic of debate, but there’s a general understanding that it deals with how software is developed both in terms of its physical and logical structure.

The codebase of a software project that doesn’t follow a clear architectural style, whatever it may be, deteriorates gradually as new, unstructured code is added to it, becoming harder to modify. Hence the importance of putting in the hours for the design and conservation of an adequate software architecture.

Unfortunately there isn’t a magic architecture that fits all use cases. You need to take into account several factors from your project for choosing the right path to follow.

To provide a couple of examples, the monolithic architecture was standard a decade ago, before the microservices architecture gained traction. Like any other, it has benefits and drawbacks. To name a few:

Pros:

  • Shared Components: Monoliths share a single code base, so infrastructure and business components can be reused across applications, reducing development time.
  • Performance: The code execution flow is usually constrained to a single process, making it faster and simpler when compared to distributed code execution.

Cons:

  • Tight Coupling: Code changes to shared components can potentially affect the whole system, so they have to be coordinated meticulously.
  • Scalability: You cannot scale components separately due to interdependencies, only the whole application.

Monolith

A software team working on a product that deals with complex data models and needs processing operations to be fast, performant and integrated may prefer to go for a monolithic application.

On the other hand the microservices architecture addresses many of the situations in which monoliths fail, being a great fit for distributed, large scale web applications:

Pros:

  • Decoupled: The application can remain mostly unaffected by the failure of a single module. Also, code changes in one microservice won’t impact others, providing more flexibility.
  • Scalability: Different microservices can scale at different rates, independently.

Cons:

  • DevOps: Deploying and maintaining microservices can be complex, requiring coordination among multiple services.
  • Testing: You can effectively test a single microservice, but testing a distributed operation involving multiple microservices is more challenging.

Microservices

Some architectural patterns are more concerned with the physical disposition of an application and how it’s deployed than with its logical structure. That’s why it’s also important to define a clear logical architecture to guide developers on how to structure code, so that everyone in your team understands how components talk to each other, how responsibility is segregated between modules, how to manage dependencies and what the code execution flow looks like.

3 - Fewer is better: Languages, Frameworks and Tools

With each additional language, framework and tool you introduce into your system comes an additional development and operational cost. This cost comes in different forms, which are illustrated in the following examples:

a) You are a member of a development team highly experienced in Nginx + Python + PostgreSQL web applications. The team is fluent in this stack, the development pipeline is tidy and new features are delivered frequently. Then one day a developer decides to implement a new strategic feature using a different stack, say Apache + Java + MySQL, in which he is also highly experienced, but his colleagues aren’t. Now whenever this developer is busy and his colleagues have to implement a feature using the different stack, they do so more carefully, since they aren’t yet as familiar with its programming language features, web server and database modes of operation, etc., as they are with the original stack. Thus, development time increases.

b) You have been assigned to manage the production environment of an application facing a considerable growth rate. Your goal is to deliver an SLA of 99.9% availability, which allows for less than 9 hours of downtime per year (0.1% of 8,760 hours is about 8.76 hours). You gather the team to evaluate all technologies and plan the infrastructure required to support the growth rate: health checks, operational metrics, failure recovery, autoscaling, continuous integration, security updates. The plan is implemented and you start fine tuning the production environment, dealing with unforeseen events and issues. After much effort the production environment is stable and on its way to delivering that SLA, but you discover that a different tech stack was introduced and needs to be deployed. You’ll need to reevaluate the infrastructure. Also, if the stack isn’t compatible with your current hosting environment it will potentially incur additional operational expenses.
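As a quick check on the availability figure in (b), the yearly downtime budget follows directly from the target percentage. This is a small sketch that assumes a 365-day year and ignores maintenance windows or any other SLA exclusions:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Yearly downtime budget, in hours, for a given availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_hours(target):.2f} h/year")
```

For 99.9% this gives about 8.76 hours per year, and each extra "nine" cuts the budget by a factor of ten.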

These are just two illustrative situations, showing the impact of adopting additional technologies on development productivity, infrastructure complexity and operational expenses.

Of course, different technologies bring different possibilities. If we were to use only a limited set of technologies, life as a developer would be much harder. For instance, there are scenarios where graph databases outperform relational databases immensely. In these scenarios the choice is easy, since the benefits outweigh the costs. The point is, you should always evaluate the long-term costs of a technological decision before making it to solve a short-term problem.

All right, but how does this relate to system design coherence?

Well, I believe that a system that is designed to avoid redundant technologies, that makes the most of its current stack, that has a stable production environment, and whose team carefully evaluates structural changes and is able to sustain development productivity in the long run is a system that clearly follows the definition of “coherence”.

4 - Involve your team in system design decisions

As I stated at the beginning of this article, system design is a collective, shared responsibility. Individual, local actions have the potential to affect the system globally, so it’s essential that the development team is on the same page regarding the codebase conventions, employed architectures and the technology stack.

The most effective way to build this shared knowledge environment is to involve the team in all system design decisions. The benefits are plenty:

  • Individuals feel valued and part of the team
  • Important decisions are challenged by the entire team before being made
  • System design strengths and weaknesses are more clearly understood by everyone
  • Creates a sense of collective accountability and trust

This doesn’t mean that every developer on the team should have equal decision power. Senior roles should certainly have more influence in decision making than junior roles. But it’s vital that everyone has the opportunity to give their opinion and participate. Less experienced developers will definitely grow from these proceedings.

5 - The All-in rule

Efforts to refactor a system design should be conducted to completion (all-in), rather than being partially concluded. There’s a great risk of eroding your system design if developers feel free to apply different coding styles and architectural patterns locally whenever they see fit. Before too long you will end up with a disconnected, sometimes conflicting, system design.

By preserving your system design you’re also preserving the validity of your team’s shared knowledge about the system, which is extremely valuable. During development we make several assumptions about the behavior of the system based on this shared knowledge. Once it starts to lose validity, unexpected issues start occurring, and developers become justifiably less confident in the system design and implement features more cautiously, losing productivity.

The challenge here is being open to improve your system design knowing that it can be exceptionally expensive to conduct a large system refactor up to completion. An approach I have used in a similar situation was to isolate refactored services behind an integration interface. The result was two independent system designs seamlessly working alongside each other, rather than having them mixed together:

integrated-designs
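The idea of isolating a refactored service behind an integration interface can be sketched as follows. This is not the code from my project: every class and method name here is hypothetical, and the sample data exists only to make the example runnable.

```python
from typing import Protocol


class InvoiceGateway(Protocol):
    """Integration interface: the only surface the legacy code may call."""

    def total_due(self, customer_id: int) -> int: ...


class RefactoredInvoicing:
    """Refactored service, free to follow the new design internally."""

    def __init__(self) -> None:
        # Sample data (amounts in cents), standing in for real storage.
        self._open_invoices = {1: [1200, 800], 2: []}

    def total_due(self, customer_id: int) -> int:
        return sum(self._open_invoices.get(customer_id, []))


class LegacyAccountScreen:
    """Legacy code: depends on the interface, not on the new internals."""

    def __init__(self, invoices: InvoiceGateway) -> None:
        self._invoices = invoices

    def summary(self, customer_id: int) -> str:
        total = self._invoices.total_due(customer_id)
        return f"customer {customer_id} owes {total} cents"


screen = LegacyAccountScreen(RefactoredInvoicing())
print(screen.summary(1))
```

Because the legacy side only knows `InvoiceGateway`, the two designs can evolve separately without their conventions leaking into each other.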


These five guidelines have served me well over the past years, helping to keep productivity high, optimize resources and deliver up to the standards. It’s more a mindset than an actual process. Like all mindsets it should be constantly challenged and subject to improvement.

Oct 2, 2019 - Why is it hard to name classes?

When following the Single Responsibility Principle (SRP) we are frequently required to encapsulate code into new classes, segregating responsibility from one “bigger” class into smaller, granular classes. Clean code guidelines state that class names should be meaningful and describe the intent of the class, i.e., by reading a class name one should have a close idea of what it does.

As much as we’re constantly discouraged from using generic suffixes in class names such as Manager, Handler, Verifier, etc., we often can’t figure out a great name for a class and end up making use of them. So the question hangs: why is it hard to name classes?

Here’s one unusual answer: Vocabulary.

There are “only” so many nouns in the English language (the de facto working language in computing), and when modeling real objects in code, class names actually come quite naturally. We’ve all seen the “animals” example for explaining inheritance:

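// Base class: behavior shared by every animal.
public abstract class Animal
{
    public abstract void Eat();

    public abstract void Sleep();

    public abstract void WakeUp();
}

// Fish and Bird inherit Animal's members and add their own.
public abstract class Fish : Animal
{
    public abstract void Swim();
}

public abstract class Bird : Animal
{
    public abstract void Fly();
}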

Naming animal classes is easy because it’s within our basic vocabulary. However, naming classes whose purposes are either too specific or not relatable to real things is hard because we either have to use a more sophisticated vocabulary, or invent names ourselves, since there may not be a noun in the English language for it!

For instance, try naming the following classes:

1) A class responsible for holding a user’s financial information, such as credit cards, social security number, bank account access keys, etc.

2) A class responsible for evaluating the risk associated with an offshore IP Address trying to connect to a website with rigorous security requirements.

The first one is straightforward: Wallet. The second one not so much, leading us to those poorly regarded naming approaches: IPRiskManager, IPVerifier, IPChecker, and so forth.

Sep 22, 2019 - The short transaction trap

There’s a general piece of wisdom in database administration that goes like this:

In order to reduce lock contention in the database, a database transaction has to be as short as possible.

And it holds, 99% of the time. However, I recently found myself in that 1% when I tried to solve a database locking problem by blindly optimizing a transaction’s duration, only to realize later that I was actually adding fuel to the fire 🔥

The Problem

In our mobile application users recurrently receive tasks, mostly surveys, which they complete by sending us back their answers. On average a task is sent to between 500 and 1,000 users, but it’s not unusual for us to submit a task to 10k or more users at a time.

Tasks are delivered to mobile users in a single batch, causing spikes in the number of active users throughout the day. The chart below displays the number of online users at one of our servers on a day when the problem occurred:


online-users

Notice three major activity spikes at 2:25 PM, 3:28 PM and 9:11 PM regarding three tasks that were submitted to our userbase. Now let’s analyze the chart of active database connections for that server on the same day:


database-connections

There were two worrying active database connection spikes right before the first two user activity spikes, one at 2:05 PM and the other at 3:12 PM.

While the number of active users grew 2x, the number of active database connections grew 6x, which for me was a red flag worth dealing with.

Investigation

From the database administration logs it was easy to spot that this was a locking escalation problem. There are at least three database tables relevant for delivering tasks to our mobile users. A simplified representation is provided below:

tasks-schema

  • Tasks: This table contains all tasks and their details such as title, start date, end date, etc
  • UserTasks: This is a many-to-many relationship table between users and tasks, defining which tasks each user is requested to perform
  • TaskStatuses: This is an aggregation table for summarizing the statuses of each task without running a “group by” query on the “UserTasks” table

The database logs showed that a large number of queries against the “TaskStatuses” table were blocked by the task submission transaction, which often runs for a couple of minutes, and performs the following procedure:

  1. Creates a task by inserting it into the “Tasks” table
  2. Inserts empty status rows in the “TaskStatuses” aggregation table
  3. Selects eligible users for receiving the task
  4. Inserts an entry in the “UserTasks” table for each eligible user
  5. Updates the aggregation table “TaskStatuses” accordingly (using a database trigger)

The blocked queries were trying to update the “TaskStatuses” table after a user submits their answers to the task (in a serializable transaction), decrementing the “PENDING” count and incrementing the “DONE” count.
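To make the two competing write paths easier to follow, here is a minimal, runnable sketch of them. It is not the production schema: the column names are assumptions based on the simplified description above, and SQLite is used only so the example runs anywhere. It reproduces the write pattern, not the row-level locking or the serializable isolation level of the real database.

```python
import sqlite3

# Assumed, simplified schema; only the table names come from the article.
SCHEMA = """
CREATE TABLE Tasks (ID INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE UserTasks (
    UserID INTEGER, TaskID INTEGER, Status TEXT DEFAULT 'PENDING',
    PRIMARY KEY (UserID, TaskID)
);
CREATE TABLE TaskStatuses (
    TaskID INTEGER, Status TEXT, Count INTEGER DEFAULT 0,
    UNIQUE (TaskID, Status)
);
-- Step 5: keep the aggregation table in sync on every new assignment.
CREATE TRIGGER UserTasksInserted AFTER INSERT ON UserTasks
BEGIN
    UPDATE TaskStatuses SET Count = Count + 1
    WHERE TaskID = NEW.TaskID AND Status = 'PENDING';
END;
"""


def submit_task(db, task_id, title, user_ids):
    """Task submission: steps 1-5 inside one (potentially long) transaction."""
    with db:
        db.execute("INSERT INTO Tasks (ID, Title) VALUES (?, ?)", (task_id, title))
        db.executemany(
            "INSERT INTO TaskStatuses (TaskID, Status, Count) VALUES (?, ?, 0)",
            [(task_id, "PENDING"), (task_id, "DONE")],
        )
        # Step 3 (selecting eligible users) is assumed to have produced user_ids.
        db.executemany(
            "INSERT INTO UserTasks (UserID, TaskID) VALUES (?, ?)",
            [(user_id, task_id) for user_id in user_ids],
        )


def submit_answers(db, user_id, task_id):
    """User path: updates the same TaskStatuses rows the submission touches."""
    with db:
        db.execute(
            "UPDATE UserTasks SET Status = 'DONE' WHERE UserID = ? AND TaskID = ?",
            (user_id, task_id),
        )
        db.execute(
            "UPDATE TaskStatuses SET Count = Count - 1 "
            "WHERE TaskID = ? AND Status = 'PENDING'", (task_id,))
        db.execute(
            "UPDATE TaskStatuses SET Count = Count + 1 "
            "WHERE TaskID = ? AND Status = 'DONE'", (task_id,))
```

Both functions write to the same two “TaskStatuses” rows per task, which is where the contention comes from when they run concurrently.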

The (wrong) Solution

A straightforward approach that I tried was to break down the task submission into smaller batches, instead of submitting the task to all users at once. After all, several short transactions are better than one long running transaction, right?

Wrong! Well, at least for my specific use case 🧐. Even though staging tests showed that there was no significant change in the overall duration of task submission to mobile users, on the production environment this solution backfired:


database-connections-wrong-solution

Spikes became much more frequent and “taller”. Needless to say, I had to revert this deployment shortly after applying it. Somehow splitting the longer transaction into several short transactions resulted in more frequent and vigorous locking.

Back to Investigation

What I failed to realize while investigating this problem was that locking was escalating only for tasks that already existed and were being submitted again to a new group of users, usually because it wasn’t possible to reach the desired number of answers from the first submission alone.

In short, the task resubmission acquired a lock on all “TaskStatuses” rows for that task, the same rows that are updated when users individually submit their answers to the task:

tasks-schema-locks

The row-level locks due to the task resubmission are represented in red, and the blocked queries from users submitting their answers to the task, waiting to update these locked rows, are represented in dark orange.

Each user waiting to submit their answers to the task holds one active database connection, resulting in the spikes seen previously. So why didn’t splitting the transaction solve the problem, and why did it make things worse instead?

Well, that’s because splitting the transaction actually increased the chances of collisions in the “TaskStatuses” table! Initially collisions were only possible when resubmitting tasks. Then, with splitting, collisions became possible even in the first submission of a task, since users from the first batch could already be sending their answers before all task submission batches were entirely processed.

The (effective) Solution

To solve this problem and prevent blocking of the “TaskStatuses” table I implemented a mechanism in which a single agent is responsible for updating the “TaskStatuses” table, more specifically a queue worker, and everyone else is only allowed to perform insertions in this table.

Additionally, I had to drop the TaskID-Status unique constraint and add an “ID” primary key column to the “TaskStatuses” table, whose purpose I explain below.

When submitting a new task, or resubmitting an existing task, one row with PENDING status and “1” count is inserted for each user that received this task. Then a queue message is sent to the updating agent to aggregate rows for this task.

When a user submits their answers, one row with PENDING status and “-1” count is inserted and another row with DONE status and “1” count is also inserted. Then a queue message is sent to the updating agent to aggregate rows for this task.

When the updating agent receives a message it first fetches all row IDs for the specified task:
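The two insert-only write paths described above can be sketched like this. Again, this is a hedged illustration rather than the production code: the column layout is assumed, SQLite stands in for the real database, and `notify_worker` is a hypothetical stand-in for whatever queue client you use.

```python
import sqlite3

# Assumed layout after the change: no unique (TaskID, Status) constraint,
# and a surrogate ID primary key.
SCHEMA = """
CREATE TABLE TaskStatuses (
    ID INTEGER PRIMARY KEY AUTOINCREMENT,
    TaskID INTEGER NOT NULL,
    Status TEXT NOT NULL,
    Count INTEGER NOT NULL
);
"""

INSERT_DELTA = "INSERT INTO TaskStatuses (TaskID, Status, Count) VALUES (?, ?, ?)"


def notify_worker(queue, task_id):
    """Hypothetical stand-in for sending a queue message to the updating agent."""
    queue.append(task_id)


def submit_task(db, queue, task_id, user_ids):
    """(Re)submission: one PENDING +1 row per user that received the task."""
    with db:
        db.executemany(INSERT_DELTA, [(task_id, "PENDING", 1) for _ in user_ids])
    notify_worker(queue, task_id)


def submit_answers(db, queue, task_id):
    """User answers: a PENDING -1 row and a DONE +1 row, never an UPDATE."""
    with db:
        db.executemany(
            INSERT_DELTA, [(task_id, "PENDING", -1), (task_id, "DONE", 1)]
        )
    notify_worker(queue, task_id)
```

Since neither path updates an existing row, a user submitting answers no longer has to wait on rows held by a task submission.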

SELECT ID
FROM TaskStatuses
WHERE TaskID={TaskID};

Then it executes an aggregation query based on the resulting IDs:

SELECT Status, SUM(Count)
FROM TaskStatuses
WHERE ID IN ({RowsIDs})
GROUP BY Status;

And finally it deletes all of these rows and inserts the results from the aggregation query, all within a database transaction, thus effectively keeping the “TaskStatuses” table updated and removing duplicate status entries for the same task.

Of course, since this process isn’t atomic, it’s possible, and quite easy actually, to spot duplicate status rows for a task which was not yet processed by the updating agent. However, the system can handle this transitional table state by simply sticking to the following aggregation query whenever reading this table:

SELECT Status, SUM(Count)
FROM TaskStatuses
WHERE TaskID={TaskID}
GROUP BY Status;
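Putting the agent's steps and the read query together, a minimal sketch of the worker could look like the following. It assumes the same simplified column layout as the queries above and uses SQLite purely to stay runnable; the function names are mine, not part of the original system.

```python
import sqlite3

INSERT_ROW = "INSERT INTO TaskStatuses (TaskID, Status, Count) VALUES (?, ?, ?)"


def read_counts(db, task_id):
    """Aggregation query every reader uses, valid before or after compaction."""
    rows = db.execute(
        "SELECT Status, SUM(Count) FROM TaskStatuses "
        "WHERE TaskID = ? GROUP BY Status", (task_id,))
    return dict(rows)


def compact_task(db, task_id):
    """What the updating agent does when it receives a message for a task."""
    ids = [row[0] for row in db.execute(
        "SELECT ID FROM TaskStatuses WHERE TaskID = ?", (task_id,))]
    if not ids:
        return
    # One placeholder per ID; very large ID lists would need chunking.
    marks = ",".join("?" * len(ids))
    with db:  # the delete and the inserts are committed together
        totals = db.execute(
            f"SELECT Status, SUM(Count) FROM TaskStatuses "
            f"WHERE ID IN ({marks}) GROUP BY Status", ids).fetchall()
        db.execute(f"DELETE FROM TaskStatuses WHERE ID IN ({marks})", ids)
        db.executemany(
            INSERT_ROW, [(task_id, status, total) for status, total in totals])


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE TaskStatuses ("
    "ID INTEGER PRIMARY KEY, TaskID INTEGER, Status TEXT, Count INTEGER)")
db.executemany(INSERT_ROW, [
    (7, "PENDING", 1), (7, "PENDING", 1), (7, "PENDING", -1), (7, "DONE", 1)])
db.commit()

before = read_counts(db, 7)
compact_task(db, 7)
after = read_counts(db, 7)
print(before == after)
```

Working from the list of IDs fetched up front means that rows inserted while the agent is running are left untouched and simply picked up by the next message, which is presumably why the surrogate “ID” column is needed.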

This solution proved quite successful, reducing the “heights” of the active database connection spikes to half of what they used to be, on average:


database-connections-right-solution


In this article I presented a strategy for dealing with aggregation table locking problems. It was effective for my case, and can also be effective for you if you’re dealing with a similar problem. To apply it you will have to change how you write to and read from the aggregation table, and also create an agent responsible for aggregating and cleaning up rows, which I saw fit to implement as a queue worker.