Sep 22, 2019 - The short transaction trap

There’s a general piece of wisdom in database administration that goes like this:

In order to reduce lock contention, database transactions should be kept as short as possible.

And it holds, 99% of the time. However, I recently found myself in that 1% when I tried to solve a database locking problem by blindly optimizing a transaction’s duration, only to realize later that I was actually adding fuel to the fire 🔥

The Problem

In our mobile application, users regularly receive tasks, mostly surveys, in which they participate by sending us back their answers. On average, tasks target 500 to 1,000 users, but it’s not unusual for us to submit a task to 10k or more users at a time.

Tasks are delivered to mobile users in a single batch, causing spikes in the number of active users throughout the day. The chart below displays the number of online users on one of our servers on a day when the problem occurred:


online-users

Notice the three major activity spikes at 2:25 PM, 3:28 PM and 9:11 PM, corresponding to three tasks that were submitted to our user base. Now let’s analyze the chart of active database connections for that server on the same day:


database-connections

There were two worrying spikes in active database connections right before the first two user activity spikes, one at 2:05 PM and the other at 3:12 PM.

While the number of active users grew 2x, the number of active database connections grew 6x, which for me was a red flag worth dealing with.

Investigation

From the database administration logs it was easy to spot that this was a lock contention problem. There are at least three database tables relevant to delivering tasks to our mobile users. A simplified representation is provided below:

tasks-schema

  • Tasks: This table contains all tasks and their details such as title, start date, end date, etc
  • UserTasks: This is a many-to-many relationship table between users and tasks, defining which tasks each user is requested to perform
  • TaskStatuses: This is an aggregation table for summarizing the statuses of each task without running a “group by” query on the “UserTasks” table

The database logs showed that a large number of queries against the “TaskStatuses” table were blocked by the task submission transaction, which often runs for a couple of minutes and performs the following procedure (sketched in code after the list):

  1. Creates a task by inserting it into the “Tasks” table
  2. Inserts empty status rows into the “TaskStatuses” aggregation table
  3. Selects the users eligible to receive the task
  4. Inserts an entry in the “UserTasks” table for each eligible user
  5. Updates the “TaskStatuses” aggregation table accordingly (via a database trigger)
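
To make the source of contention concrete, below is a minimal sketch of what that submission transaction could look like. The helper methods (BeginTransaction, ExecuteQuery, ExecuteNonQuery), the RETURNING clause and the exact column names are illustrative assumptions, not our actual implementation:

void SubmitTask(TaskDetails details)
{
    using (var transaction = BeginTransaction())
    {
        // 1. Create the task and get its generated ID back
        int taskId = ExecuteQuery<int>(
            "INSERT INTO Tasks (Title, StartDate, EndDate) VALUES ({0}, {1}, {2}) RETURNING ID;",
            details.Title, details.StartDate, details.EndDate
        );

        // 2. Insert empty status rows in the aggregation table
        ExecuteNonQuery(
            "INSERT INTO TaskStatuses (TaskID, Status, Count) VALUES ({0}, 'PENDING', 0), ({0}, 'DONE', 0);",
            taskId
        );

        // 3. Select the users eligible to receive the task
        int[] userIds = ExecuteQuery<int[]>(
            "SELECT ID FROM Users WHERE /* eligibility criteria */;"
        );

        // 4. Insert one UserTasks entry per eligible user; on each insert the
        // 5. database trigger updates the TaskStatuses counts, holding row-level
        //    locks on them until the whole transaction commits
        foreach (int userId in userIds)
            ExecuteNonQuery(
                "INSERT INTO UserTasks (TaskID, UserID, Status) VALUES ({0}, {1}, 'PENDING');",
                taskId, userId
            );

        transaction.Commit();
    }
}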

The blocked queries were trying to update the “TaskStatuses” table after a user submits his answers to the task (in a serializable transaction), decrementing the “PENDING” count and incrementing the “DONE” count.
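
For contrast, a sketch of the per-user update path that was being blocked is shown below; again, the helper methods and the exact SQL are assumptions on my part, not the real code:

void SubmitAnswers(int taskId, int userId, Answers answers)
{
    // Runs in a serializable transaction, as mentioned above
    using (var transaction = BeginTransaction(IsolationLevel.Serializable))
    {
        SaveAnswers(taskId, userId, answers); // persist the answers themselves (details omitted)

        ExecuteNonQuery(
            "UPDATE UserTasks SET Status='DONE' WHERE TaskID={0} AND UserID={1};",
            taskId, userId
        );

        // These two updates wait on the row-level locks held by the
        // long-running task submission transaction
        ExecuteNonQuery(
            "UPDATE TaskStatuses SET Count=Count-1 WHERE TaskID={0} AND Status='PENDING';",
            taskId
        );
        ExecuteNonQuery(
            "UPDATE TaskStatuses SET Count=Count+1 WHERE TaskID={0} AND Status='DONE';",
            taskId
        );

        transaction.Commit();
    }
}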

The (wrong) Solution

A straightforward approach that I tried was to break down the task submission into smaller batches, instead of submitting the task to all users at once. After all, several short transactions are better than one long-running transaction, right?

Wrong! Well, at least for my specific use case 🧐. Even though staging tests showed no significant change in the overall duration of task submission to mobile users, in the production environment this solution backfired:


database-connections-wrong-solution

Spikes became much more frequent and “taller”. Needless to say, I had to revert this deployment shortly after applying it. Somehow, splitting the long transaction into several short transactions resulted in more frequent and more severe locking.

Back to Investigation

What I had failed to realize while investigating this problem was that locking escalated only for tasks that already existed and were being submitted again to a new group of users, usually because it wasn’t possible to reach the desired number of answers from the first submission alone.

In short, the task resubmission acquired a lock on all “TaskStatuses” rows for that task, the very same rows that are updated when users individually submit their answers to the task:

tasks-schema-locks

The row-level locks due to the task resubmission are represented in red, and the blocked queries from users submitting their answers to the task, waiting to update these locked rows, are represented in dark orange.

Each user waiting to submit his answers to the task holds one active database connection, resulting in the spikes seen previously. So why didn’t splitting the transaction solve the problem, and why did it make things worse?

Well, that’s because splitting the transaction actually increased the chances of collisions in the “TaskStatuses” table! Initially, collisions were only possible when resubmitting tasks. With splitting, collisions became possible even during the first submission of a task, since users from the first batch could already be sending their answers before all of the submission batches had been processed.

The (effective) Solution

To solve this problem and prevent blocking on the “TaskStatuses” table, I implemented a mechanism in which a single agent, more specifically a queue worker, is responsible for updating the “TaskStatuses” table, while everyone else is only allowed to perform insertions into it.

Additionally, I had to drop the TaskID-Status unique constraint and add an “ID” primary key column to the “TaskStatuses” table, whose purpose I explain below.

When submitting a new task, or resubmitting an existing task, one row with PENDING status and a count of “1” is inserted for each user that received the task. Then a queue message is sent to the updating agent to aggregate the rows for this task.

When a user submits his answers, one row with PENDING status and a count of “-1” is inserted, and another row with DONE status and a count of “1” is inserted. Then a queue message is sent to the updating agent to aggregate the rows for this task.
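
In code, both write paths become plain inserts followed by a queue message; the helpers and the queue call below are illustrative, not our actual implementation:

// On (re)submission of a task: one "+1" PENDING row per targeted user
foreach (int userId in targetedUserIds)
    ExecuteNonQuery(
        "INSERT INTO TaskStatuses (TaskID, Status, Count) VALUES ({0}, 'PENDING', 1);",
        taskId
    );
EnqueueStatusAggregation(taskId); // queue message for the updating agent

// On a user submitting answers: cancel one PENDING and add one DONE
ExecuteNonQuery(
    "INSERT INTO TaskStatuses (TaskID, Status, Count) VALUES ({0}, 'PENDING', -1), ({0}, 'DONE', 1);",
    taskId
);
EnqueueStatusAggregation(taskId);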

When the updating agent receives a message, it first fetches all row IDs for the specified task:

SELECT ID
FROM TaskStatuses
WHERE TaskID={TaskID};

Then it executes an aggregation query based on the resulting IDs:

SELECT Status, SUM(Count)
FROM TaskStatuses
WHERE ID IN ({RowsIDs})
GROUP BY Status;

And finally it deletes all of these rows and inserts the results of the aggregation query, all within a database transaction, effectively keeping the “TaskStatuses” table up to date and removing duplicate status entries for the same task.
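
Putting the three steps together, the updating agent could be implemented roughly as sketched below, with the same illustrative helpers as before:

void AggregateTaskStatuses(int taskId)
{
    // 1. Fetch the IDs of the rows that currently exist for this task
    int[] rowIds = ExecuteQuery<int[]>(
        "SELECT ID FROM TaskStatuses WHERE TaskID={0};",
        taskId
    );

    // 2. Aggregate only those rows
    var totals = ExecuteQuery<List<(string Status, int Count)>>(
        "SELECT Status, SUM(Count) FROM TaskStatuses WHERE ID IN ({0}) GROUP BY Status;",
        rowIds
    );

    // 3. Replace the fetched rows with one consolidated row per status;
    //    rows inserted after step 1 are left untouched for the next run
    using (var transaction = BeginTransaction())
    {
        ExecuteNonQuery("DELETE FROM TaskStatuses WHERE ID IN ({0});", rowIds);

        foreach (var total in totals)
            ExecuteNonQuery(
                "INSERT INTO TaskStatuses (TaskID, Status, Count) VALUES ({0}, {1}, {2});",
                taskId, total.Status, total.Count
            );

        transaction.Commit();
    }
}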

Of course, since this process isn’t atomic, it’s possible, and actually quite easy, to spot duplicate status rows for a task that hasn’t yet been processed by the updating agent. However, the system can handle this transitional table state by simply sticking to the following aggregation query whenever reading this table:

SELECT Status, SUM(Count)
FROM TaskStatuses
WHERE TaskID={TaskID}
GROUP BY Status;

This solution proved quite successful, reducing the height of the active database connection spikes to about half of what it used to be, on average:


database-connections-right-solution


In this article I presented a strategy for dealing with locking problems on aggregation tables. It was effective in my case, and it can also be effective for you if you’re dealing with a similar problem. To apply it you will have to change how you write to and read from the aggregation table, and also create an agent responsible for aggregating and cleaning up rows, which I chose to implement as a queue worker.

Aug 24, 2019 - Index-free adjacency

Graph databases are much more efficient than traditional relational databases at traversing interconnected data. The way they store data and execute queries is heavily optimized for this use case, and one property in particular is key to achieving this higher performance: index-free adjacency.

This short article first analyzes the complexity of data traversal in relational databases, then explains how graph databases take advantage of index-free adjacency for traversing data more efficiently.

Relational Databases

Consider a social application in which users can follow each other, and its simplified database schema presented below:

relational-indexing

In this illustrative schema, a Users table stores application user profiles, and a Followers table stores unidirectional “follower” relationships between users. An index (represented in yellow) is also used to speed up queries against the Followers table.

Suppose we want to know whether or not it’s possible to reach a specific “Target User” starting from a “Source User” in at most “N” steps. How do we implement this in SQL?

Since SQL is designed as a declarative language for querying data in a relational database, we need the help of an imperative programming language to implement this traversal algorithm. Fortunately, most database systems support structured programming out of the box for writing functions and stored procedures.

Regardless of the specific database and programming language used, the traversal code should look similar to the code example provided below:

bool CanReachTarget(int[] sourceIds, int targetId, HashSet<int> visitedIds, int steps)
{
    if (steps == 0)
        return false;

    int[] followersIds = ExecuteQuery<int[]>(
        "SELECT Follower_ID FROM Followers WHERE User_ID in ({0});",
        sourceIds
    );

    if (followersIds.Contains(targetId))
    {
        return true;
    }
    else
    {
        visitedIds.UnionWith(sourceIds);
        sourceIds = followersIds.Except(visitedIds).ToArray();
        return CanReachTarget(sourceIds, targetId, visitedIds, steps - 1);
    }
}

At each level of the recursion the function fetches all follower IDs at that level and checks whether the target ID is present, returning true if it is, or going one level deeper if it isn’t, while keeping track of already visited elements. If the function runs out of steps during the recursion it returns false, since it failed to reach the target ID.

Let’s exercise this algorithm with data from figure 1, starting from “Willian Johnson” (ID = 2) and trying to reach “Mary Miller” (ID = 5) with at most 3 steps:

     sourceIds   targetId   visitedIds   steps   followersIds      contains?
#1   [2]         5          [ ]          3       [1, 4]            false
#2   [1, 4]      5          [2]          2       [1, 2, 3, 4, 5]   true

Notice that “Mary Miller” was found at the second level, and the function will return true in this case.

Now, let’s analyze the time complexity of this algorithm. You may have already realized that it performs a Breadth-First Search (BFS), offloading most of the work to the database query engine. The BFS part executes O(V + E) operations, where V is the number of users whose followers are fetched from the database and E is the aggregate number of followers across this group of users. The cost of fetching the followers of each user through a B-Tree index is O(log(n)), where n is the number of rows in the Followers table. Hence, the resulting time complexity is O(V * log(n) + E).

Without the index things would be much worse, since a full table scan would be necessary to fetch followers, yielding O(V * n + E) time complexity in the worst case.

Graph Databases

As I said at the beginning of this article, graph databases take advantage of index-free adjacency to traverse data more efficiently. Figure 2 presents a logical representation of how the data from figure 1 would be organized in a graph database:

index-free-adjacency

Instead of relying on a global index for accessing connected nodes, each node directly references its adjacent nodes. Even though the lack of a global index on node relationships is what gives the index-free adjacency property its name, you can think of each node as holding a mini-index of all nearby nodes, making traversal operations extremely cheap.
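
Conceptually, and ignoring how any particular graph database actually lays out its records, you can picture each node as an object holding direct references to its neighbors, as in the illustrative sketch below:

class UserNode
{
    public int Id;
    public string Name;

    // Adjacent nodes are referenced directly, so no global index lookup is
    // needed to expand a node's neighborhood
    public List<UserNode> Followers = new List<UserNode>();
}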

To leverage this structure, graph databases perform traversal queries natively, executing the BFS algorithm from the previous section entirely within the database query engine. The BFS part still executes O(V + E) operations, but the cost of fetching a user node’s followers drops to O(1), running in constant time, since all followers are directly referenced at the node level.
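
The reachability check from the previous section could then be expressed as a traversal over these direct references, which is roughly what the query engine does internally; the sketch below is purely illustrative:

bool CanReachTarget(UserNode source, int targetId, int steps)
{
    var visited = new HashSet<int> { source.Id };
    var frontier = new List<UserNode> { source };

    while (steps-- > 0 && frontier.Count > 0)
    {
        var next = new List<UserNode>();

        foreach (var node in frontier)
            foreach (var follower in node.Followers) // O(1) per neighbor, no index lookup
            {
                if (follower.Id == targetId)
                    return true;

                if (visited.Add(follower.Id))
                    next.Add(follower);
            }

        frontier = next;
    }

    return false;
}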

You may be thinking:

What about accessing the query starting node in the first place, wouldn’t that require a global index?

And you’re completely right: in order to get to the starting point of the traversal efficiently, we will have to rely on an index lookup. Again, considering a B-Tree index, this runs in O(log(n)) time in addition to the traversal.

The resulting time complexity of the query is simply O(V + E + log(n)), much better than for relational databases, especially as the number of users (n) gets larger.


A lot of factors that affect database performance were left out of this analysis for simplicity, such as file system caching, data storage sparsity, memory requirements, and schema and query optimizations.

There’s no single universal database that performs well in all scenarios. Different kinds of problems may be better solved by different kinds of databases, if you can afford the additional operational burden. As with most decisions in software development, a reasonable approach is to lay down your requirements and analyze whether or not the tools available to you meet them before jumping ship to a new one.

Aug 18, 2019 - Business logic

In software architecture, the business logic layer contains classes that implement application-specific business rules, typically made available in the form of high-level operations, also known as use cases. Use cases encapsulate interactions between entities, which are reusable, lower-level logical models of the real-world business domain.

Code Examples

Use cases have a unidirectional many-to-many relationship to entities. For instance, consider the two use cases below for an illustrative banking application:

public class GetAccountBalanceUseCase : UseCase
{
    public GetAccountBalanceUseCase(
        BankAccountIdentityProvider identityProvider,
        BankAccountIndex accounts)
    {
        this.identityProvider = identityProvider;
        this.accounts = accounts;
    }

    public GetAccountBalanceOutput Execute(GetAccountBalanceInput input)
    {
        identityProvider.Authenticate(
            input.AuthenticationToken,
            input.AccountNumber
        );

        BankAccount account = accounts.Get(input.AccountNumber);

        return new GetAccountBalanceOutput
        {
            Balance = account.GetBalance()
        };
    }

    private BankAccountIdentityProvider identityProvider;

    private BankAccountIndex accounts;
}

public class TransferFundsUseCase : UseCase
{
    public TransferFundsUseCase(
        TransactionScopeProvider transactionScope,
        BankAccountIdentityProvider identityProvider,
        BankAccountIndex accounts,
        Ledger ledger)
    {
        this.transactionScope = transactionScope;
        this.identityProvider = identityProvider;
        this.accounts = accounts;
        this.ledger = ledger;
    }

    public TransferFundsOutput Execute(TransferFundsInput input)
    {
        identityProvider.Authenticate(
            input.AuthenticationToken,
            input.SendingAccountNumber
        );

        using (transactionScope.Begin())
        {
            BankAccount sender = accounts.Get(input.SendingAccountNumber);
            BankAccount receiver = accounts.Get(input.ReceivingAccountNumber);

            sender.Withdraw(input.Amount);
            receiver.Deposit(input.Amount);

            ledger.Insert(from: sender, to: receiver, amount: input.Amount);

            return new TransferFundsOutput
            {
                SenderBalance = sender.GetBalance(),
                ReceiverBalance = receiver.GetBalance()
            };
        }
    }

    private TransactionScopeProvider transactionScope;

    private BankAccountIdentityProvider identityProvider;

    private BankAccountIndex accounts;

    private Ledger ledger;
}

Both of them depend on up to four business entities to implement the workflows for querying an account’s balance and transferring funds from one account to another:

  • BankAccountIdentityProvider: Responsible for ensuring that the agent executing the operation has adequate permissions for managing the bank account

  • BankAccountIndex: Responsible for fetching bank account entities according to their account number

  • BankAccount: Responsible for providing functionalities on the level of an individual bank account

  • Ledger: Responsible for recording economic transactions

Notice that a TransactionScopeProvider class is also present in the TransferFundsUseCase implementation. However, it’s not a business entity but an infrastructure component, determining the behavior of the underlying data retrieval and persistence operations; its specific implementation is irrelevant in this scope.
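
For illustration only, such an abstraction could be as small as the interface below; the concrete class that implements it would live in the infrastructure layer:

public interface TransactionScopeProvider
{
    // Opens a transactional scope around the enclosed data operations;
    // commit and rollback semantics are an infrastructure concern
    IDisposable Begin();
}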

Layered Architecture

The business logic layer has no knowledge of infrastructure and operational details, such as which specific web server, message broker or database system are chosen for deploying the application. For greater flexibility and cohesion, all of these decisions are made at the outermost layer of the application, and required functionality is injected into the business logic layer with the assistance of a dependency injection strategy.

clean-architecture

The figure borrows from Uncle Bob’s clean architecture and demonstrates how business logic should be isolated from external dependencies to prevent leaking infrastructure logic to the business layer, achieving a clear separation of concerns that greatly benefits software development quality and productivity in the long-term.

The conversion logic layer is where Models, Views, Controllers, Presenters and integrations lie. It is responsible for converting data to the most convenient format inwards and outwards.

Finally, the infrastructure layer is where the application is hosted, specific UI and web frameworks are employed, and database and messaging systems are configured and interfaced.
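
As an illustration of the dependency injection mentioned above, a hypothetical composition root at the infrastructure layer could wire a use case as sketched below; all of the concrete class names are made up for the example, assuming the entity types used by the use cases are abstractions with infrastructure-backed implementations:

// Outermost layer: concrete infrastructure implementations are chosen here
// and injected into the business logic layer (class names are hypothetical)
var transferFunds = new TransferFundsUseCase(
    new SqlTransactionScopeProvider(connectionString),
    new TokenBankAccountIdentityProvider(signingKey),
    new SqlBankAccountIndex(connectionString),
    new SqlLedger(connectionString)
);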


In this short article I provide code examples for implementing use cases with the sole purpose of illustrating business logic concepts. I employ one of many different use case formats, a format which I have successfully applied in professional projects.