Aug 11, 2019 - Complexity classes of problems

In computational complexity theory, problems are classified into classes according to the algorithmic complexity of solving them. Confusion often arises because the naming of these classes is not intuitive, and can even be misleading. For instance, many believe that the complexity class “NP” stands for “non-polynomial time” when it actually stands for “non-deterministic polynomial time”. Another common misconception is that problems belonging to the “NP-hard” class are also elements of “NP”, which is not necessarily true.

The following table summarizes the most important complexity classes and their properties according to the current understanding in the field:

                                            P     NP    NP-complete   NP-hard
Solvable in polynomial time                 ✓
Solution verifiable in polynomial time      ✓     ✓     ✓
Reduces any NP problem in polynomial time               ✓             ✓

Each column lists the prerequisites a problem must fulfill to be considered a member of that complexity class. Notice that, trivially, all elements of “P” are also elements of “NP”; however, the converse cannot be asserted (at least while the P versus NP problem remains a major unsolved problem in computer science).

The next sections provide more details on these properties.

Solvable in polynomial time

Defines decision problems that can be solved by a deterministic Turing machine (DTM) using a polynomial amount of computation time, i.e., their running time is upper bounded by a polynomial expression in the size of the input to the algorithm. Using Big-O notation this time complexity is expressed as O(n ^ k), where n is the size of the input and k a constant exponent.
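
To make this concrete, here's a toy example of my own (not part of any formal definition): deciding whether an array contains a duplicate element can be done with two nested loops over the n elements, which is O(n ^ 2) time, i.e., k = 2:

public static class PolynomialTimeExample
{
    // Toy decision problem solved in polynomial time: does the input contain a duplicate?
    // The two nested loops over n elements give O(n ^ 2) running time, i.e. k = 2.
    public static bool ContainsDuplicate(int[] input)
    {
        for (int i = 0; i < input.Length; i++)
            for (int j = i + 1; j < input.Length; j++)
                if (input[i] == input[j])
                    return true;

        return false;
    }
}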

To put it briefly, a DTM executes algorithms the same way our modern computers do. It follows a set of pre-defined instructions (program), executing one instruction at a time, changing state and resolving the next instruction. We can imagine that at any given point in time there will be a history of executed operations and a set of predictable operations to follow based solely on the machine’s current state:

[Figure 1: deterministic machine]

As long as there’s no external stimulus (randomness) involved, systems of this kind are deterministic, always producing the same output from a given initial state.

Solution verifiable in polynomial time

Defines decision problems for which a given solution can be verified by a DTM using a polynomial amount of computation time, even though obtaining the correct solution may require higher amounts of time.

There’s a straightforward brute force search approach for solving this kind of problem (a sketch in C# follows the list below):

  1. Generate a candidate solution Si from the space of all feasible solutions (in constant time)
  2. Verify the correctness of the candidate Si (in polynomial time)
  3. If Si correctly solves the problem, end execution
  4. Otherwise move to the next unvisited candidate and repeat
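
A minimal sketch of this procedure in C#, assuming a hypothetical enumeration of candidate solutions and a caller-supplied polynomial-time verification function (both placeholders of mine, not from the original text):

using System;
using System.Collections.Generic;

public static class BruteForceSearch
{
    public static bool Solve<TProblem, TSolution>(
        TProblem problem,
        IEnumerable<TSolution> candidates,       // step 1: the space of all feasible solutions
        Func<TProblem, TSolution, bool> verify)  // step 2: polynomial-time verification
    {
        foreach (var candidate in candidates)    // step 4: move on to the next candidate
        {
            if (verify(problem, candidate))      // step 3: halt once a correct solution is found
                return true;
        }

        return false;                            // no candidate solves the problem
    }
}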

It’s easy to see that this algorithm iterates over the solution space and, for each candidate solution, performs a verification that takes polynomial time, so its time complexity is O(s) * O(w ^ c), where s is the size of the solution space, w the size of each candidate solution and c a constant. Solution spaces are typically not polynomial in size, yielding exponential (O(k ^ n)) or factorial (O(n!)) time complexity for this naive approach.

Here we introduce the non-deterministic Turing machine (NTM), a theoretical computer that can take this naive algorithm and run it in a polynomial amount of time:

[Figure 2: non-deterministic machine]

As figure 2 exemplifies, the NTM, by design, can be thought of as capable of resolving multiple instructions simultaneously, branching into several execution flows, each one verifying a different solution to the decision problem, until one of them finds the correct solution and halts the NTM.

NTMs only exist in theory, and it’s easy to understand why: their ability to branch indefinitely into simultaneous execution flows would require an indefinitely large amount of physical resources.

Reduces any NP problem in polynomial time

Defines decision problems whose solving algorithms can be used to solve any NP problem after a polynomial time translation step. For instance, if we have a solver for an NP-hard problem, we can then build a solver for an NP problem as follows:

[Figure 3: NP solver built on top of an NP-hard problem solver]

Hence, we are effectively reducing the NP problem to the NP-hard problem in order to solve it. Intuitively, problems that allow this are at least as hard as the hardest problems in NP; otherwise they couldn’t be used to solve any NP problem.

As a consequence, if we were to find a polynomial time algorithm for solving an NP-hard problem (which is considered unlikely), then we could use it to solve any NP problem in polynomial time as well, implying that P = NP and settling the P versus NP problem.

Demonstration

For this demonstration we will use two well-known NP-complete problems:

  • Knapsack problem: Given a set of items, each with a weight and a value, determine the number of each item to include in a collection so that the total weight is less than or equal to a given limit and the total value is as large as possible.
  • Partition problem: Given a multiset S of positive integers, determine if it can be partitioned into two subsets S1 and S2 such that the sum of the numbers in S1 equals the sum of the numbers in S2.

First, consider the following solver for the knapsack problem in its decision form:

public class KnapsackProblemSolver
{
    public bool Solve(Item[] items, int capacity, int value)
    {
      /* ... */
    }
}

This solver answers whether or not it’s possible to achieve a given value by combining the available items without exceeding the knapsack weight capacity, returning true in case it’s possible, false otherwise.

The Item data structure is presented below:

public class Item
{
    public int Value { get; set; } // value gained by placing this item in the knapsack

    public int Weight { get; set; } // weight added by placing this item in the knapsack

    public int Limit { get; set; } // maximum number of copies of this item that can be used
}

Now, we want to implement a solver for the partition problem that answers whether or not a given multiset can be partitioned:

public class PartitionProblemSolver
{
    public bool Solve(int[] multiset)
    {
      /* ... */
    }
}

In order to do that we evaluate both problems’ objectives, translate the partition problem input accordingly, and feed it into the knapsack problem solver:

using System.Linq;

public class PartitionProblemSolver
{
    public bool Solve(int[] multiset)
    {
        var sum = multiset.Sum(n => n);
        if (sum % 2 == 1) // Odd sum base case for integer multiset
            return false;

        /* translate partition input into knapsack input */
        var items = multiset.Select(n => new Item { Value = n, Weight = n, Limit = 1 }).ToArray();
        var capacity = sum / 2;
        var value = capacity;

        /* feed translated input into knapsack solver */
        return new KnapsackProblemSolver().Solve(items, capacity, value);
    }
}

Let’s analyze the translation step (a quick usage example follows the analysis):

  1. It creates a set of items with a single item for each multiset element, assigning the item’s value and weight from the element’s own value.
  2. It defines the knapsack capacity as the expected sum for each partition. Hence, we know that if the capacity is completely filled, the combined weight of items inside and outside of the knapsack will be the same.
  3. It defines the knapsack target value also as the expected sum for each partition. Since the defined capacity prevents a value higher than this from being achieved, the solver will return true if, and only if, the exact expected sum for each partition can be achieved, thus solving the partition problem.
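
As a quick sanity check, here’s a hypothetical usage of the solver above (the input values are made up for illustration, and a working KnapsackProblemSolver implementation is assumed):

var solver = new PartitionProblemSolver();

var canPartition = solver.Solve(new[] { 3, 1, 1, 2, 2, 1 }); // true: {3, 2} and {1, 1, 2, 1} both sum to 5
var cannotPartition = solver.Solve(new[] { 2, 3, 4 });       // false: the total sum (9) is odd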

If you found that fun, I suggest you try to reduce another NP problem to the knapsack problem. You’ll find that each problem needs a slightly different approach, sometimes resorting to heuristics to make it work.


In the demonstration at the end of the article you may have noticed that I didn’t provide an actual implementation for solving the knapsack problem; however, you can find one at the linked Wikipedia page, or in one of my GitHub repositories: https://github.com/TCGV/Knapsack

This is my take on the topic of complexity classes of problems, which I find of major importance for becoming a more effective, skillful software developer, since it gives us tools for better analyzing and solving technical challenges. I hope you enjoyed it!

Aug 5, 2019 - Data structures complexity reference

A quick complexity reference for the most common data structures, considering the efficient implementations usually available in framework class libraries.
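
For context, in .NET these structures map roughly to the following class library types (an illustrative, non-exhaustive sketch):

using System.Collections.Generic;

var array      = new int[16];                          // Array
var linkedList = new LinkedList<int>();                // Linked List (doubly-linked)
var stack      = new Stack<int>();                     // Stack
var queue      = new Queue<int>();                     // Queue
var hashTable  = new Dictionary<string, int>();        // Hash Table
var heap       = new PriorityQueue<string, int>();     // Heap (min-heap, available in .NET 6+)
var sortedMap  = new SortedDictionary<string, int>();  // Self-balancing Binary Search Tree (red-black tree)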

Array

Time Complexity:

            Average   Worst
Indexing    O(1)      O(1)
Search      O(n)      O(n)
Insert      O(n)      O(n)
Delete      O(n)      O(n)

Space Complexity: O(n)

Linked List

Time Complexity:

            Average   Worst
Indexing    O(n)      O(n)
Search      O(n)      O(n)
Insert      O(1)      O(1)
Delete      O(1)      O(1)

Space Complexity: O(n)

For both singly-linked and doubly-linked lists

Stack

Time Complexity:

        Average   Worst
Push    O(1)      O(1)
Pop     O(1)      O(1)

Space Complexity: O(n)

Queue

Time Complexity:

           Average   Worst
Enqueue    O(1)      O(1)
Dequeue    O(1)      O(1)

Space Complexity: O(n)

Hash Table

Time Complexity:

          Average   Worst
Search    O(1)      O(n)
Insert    O(1)      O(n)
Delete    O(1)      O(n)

Space Complexity: O(n)

Heap

Time Complexity:

            Average     Worst
Heapify     O(n)        O(n)
Find Max    O(1)        O(1)
Insert      O(log(n))   O(log(n))
Delete      O(log(n))   O(log(n))

Space Complexity: O(n)

Binary Search Tree

Time Complexity:

            Average     Worst
Indexing    O(log(n))   O(n)
Search      O(log(n))   O(n)
Insert      O(log(n))   O(n)
Delete      O(log(n))   O(n)

Space Complexity: O(n)

Self-balancing Binary Search Tree

Time Complexity:

            Average     Worst
Indexing    O(log(n))   O(log(n))
Search      O(log(n))   O(log(n))
Insert      O(log(n))   O(log(n))
Delete      O(log(n))   O(log(n))

Space Complexity: O(n)

Ex: AVL Tree, Red-Black Tree

Trie

Time Complexity:

          Average   Worst
Search    O(m)      O(m)
Insert    O(m)      O(m)
Delete    O(m)      O(m)

Space Complexity: O(m * l)

Where:

  • m is the length of the search string
  • l is the size of the character alphabet

Aug 2, 2019 - A successful deployment model

Deployment is arguably the most critical software development life cycle (SDLC) phase. It takes an application from a battle tested, generally stable version to a newer version that introduces new features, improvements and bug fixes. Even if we employ automated tests and set up a staging environment for quality assurance (QA), it is unfeasible to cover all use cases before deploying to production and guarantee that an application version is free of defects. Hence, a successful deployment model needs to handle not only software release, but also application monitoring and restoration.

The following sections discuss the deployment process in more detail, proposing a handful of rules for successfully managing the risks associated with it.

Deployable Image

In this context the deployable image is an executable package built from the application’s source code for distribution to test, staging and production environments. Two common formats are:

  • Container image: includes the application’s binaries built from source code, the runtime, system tools, system libraries and settings, everything required for running the application in a virtualization platform.
  • Binary package: includes only the application’s binaries built from source code, usually targeting a pre-determined hosting platform, such as a pre-configured virtual machine running a specific OS version.

Generating the deployable image is the first step of deployment, and as a first rule, you should:

Use the same deployable image for test, staging and production environments

This means that whenever a problem is detected and a code fix is required, validation of the newly generated deployable image should start back at the test environment before moving forward through the pipeline. This helps reduce the degree of uncertainty in the deployable image as it goes through each environment, since even rebuilding an application from source code can introduce behavior changes due to a different build machine configuration or accidental package dependency updates.

System Downtime

From the application user’s perspective deployment should happen seamlessly, without any sudden interruptions or problems. Any amount of downtime can result in loss of confidence in the application and even lead to financial implications. With this in mind, as a second rule, you should:

Update systems without downtime

If you’re adopting container images in your deployment process this becomes a lot easier, since it’s possible to spin up application service containers without much effort, having the new version and the old version available side by side, and then switch networking traffic to the new version:

[Figure 1: containers bridge, with old and new version containers side by side behind a traffic switch]

From here there are two approaches:

  • Hard switch: You simply flip the switch, disconnecting the old version from the network and exposing the entirety of your users to the new version.
  • Canary release: A small set of users is initially exposed to the new version (ex: 5%), and as you become more confident that this version is stable, you gradually increase the percentage of users routed to the new version until reaching 100%.

Canary release is better for mitigating deployment risks, but usually requires more work to set up. Either way, once the new version container is online and healthy, the old version container can be safely removed from the host.
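
For illustration only, the routing decision behind a canary release can be as simple as bucketing users by a hash and comparing against the rollout percentage (the CanaryRouter name and the hashing choice below are my own assumptions, not taken from any specific tool):

using System;

public static class CanaryRouter
{
    // Routes a user to the new version when their bucket falls below the
    // configured canary percentage (e.g. 5 routes roughly 5% of users).
    public static bool RouteToNewVersion(string userId, int canaryPercentage)
    {
        var bucket = Math.Abs(userId.GetHashCode() % 100); // bucket in [0, 99]; note string hash codes
                                                           // vary across processes in .NET Core, so a
                                                           // persistent hash would be used in practice
        return bucket < canaryPercentage;
    }
}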

Automated Execution

We humans are prone to error. The more a deployment process relies on manual steps, the riskier it gets. A simple honest mistake in one manual step can be enough to bring your application down. For instance, someone could mistakenly apply staging configuration to the production environment, or inadvertently invert the order in which interdependent services are deployed, causing them to crash.

Therefore, as a third rule, you should:

Fully automate the deployment process

It shouldn’t take you more than one click to deploy your application. More sophisticated applications even employ continuous deployment, in which a single commit to the main repository’s master branch is enough to trigger an automated deployment to production. However, these applications rely heavily on automation for testing, canary releasing and monitoring.

Monitoring

The main goal of monitoring is to automatically detect problems as soon as they occur, so we can fix them as quickly as possible. Ideally it should be comprehensive enough that we never feel the need to check something manually. Instead, we rely on system administrators being automatically notified when a problem is detected.

Then, as a fourth rule you should:

Set up and rely on automatic monitoring for early problem detection

There are at least two lines of monitoring which are effective for closing the feedback loop in the deployment process: health monitoring and error monitoring.

Health Monitoring

In this line of monitoring we are interested in ensuring that our application is performing as expected. First we define a set of system and business metrics that adequately represent the application’s behavior. Then we track these metrics, triggering an alert whenever one of them falls outside its expected operational range (a minimal sketch of such a check follows the examples below).

A few examples of system metrics are:

  • Number of active database connections
  • Web server requests per second
  • Available memory per host machine
  • IOPS per host machine

As for business metrics, a few examples are:

  • User registration rate
  • Number of online users
  • Image upload rate (for a media sharing application)
  • Content like rate (for a social application)
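
A minimal sketch of such a health check, assuming a hypothetical IAlertService abstraction and illustrative thresholds (none of these names come from a specific monitoring product):

public interface IAlertService
{
    void Trigger(string message);
}

public class MetricMonitor
{
    private readonly IAlertService alerts;

    public MetricMonitor(IAlertService alerts)
    {
        this.alerts = alerts;
    }

    // Triggers an alert whenever a tracked metric falls outside its expected operational range
    public void Check(string metricName, double value, double expectedMin, double expectedMax)
    {
        if (value < expectedMin || value > expectedMax)
            alerts.Trigger($"Metric '{metricName}' out of range: {value} (expected {expectedMin} to {expectedMax})");
    }
}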

Error Monitoring

Errors occur continuously in applications, but most of them are known and expected, such as authentication token expiration errors or database query timeout errors. In this line of monitoring we are interested in discovering deviations, or anomalies, from the expected occurrence of application errors, triggering an alert when that happens. An anomaly can be identified as (a minimal detection sketch follows the list):

  • A significant and unexpected increase (or decrease) in the rate of a known error
  • A consistent manifestation of a new, unexpected, type of error
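
A minimal sketch covering both cases, with an illustrative deviation threshold (the class name and the 3x factor are assumptions, not prescribed values):

public class ErrorRateAnomalyDetector
{
    private const double DeviationThreshold = 3.0; // e.g. alert when the rate triples or drops to a third

    // currentRate and baselineRate are occurrences per minute for a given error type
    public bool IsAnomalous(double currentRate, double baselineRate)
    {
        if (baselineRate == 0)
            return currentRate > 0; // a new, unexpected type of error is manifesting

        var ratio = currentRate / baselineRate;
        return ratio >= DeviationThreshold || ratio <= 1 / DeviationThreshold; // significant increase or decrease
    }
}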

Monitoring is valuable not only during deployment, but also long after it’s been completed. The faster we are able to identify and solve issues, the less users will be negatively affected by them.

Reversibility

Sooner or later, no matter what, deployment problems will happen. That’s a fact, and if you’re not prepared to handle them quickly you’re taking a high risk, counting on the odds staying in your favor.

As we’ve learned in the previous section, monitoring should give us enough information to detect and evaluate problems as they happen. Problems can be manageable, meaning that we can act on them while they’re occurring in the production environment, or they can be showstoppers, meaning that we need to solve them immediately, bringing the production environment back to a stable state.

Fortunately there’s a straightforward way for dealing with the latter case: reverting the application back to its earlier, stable version. Then, as a fifth rule you should:

Support rollback to earlier application versions

From figure 1, this can be as simple as flipping the switch back to the original container on all hosts, but that only handles the case in which there’s a need to revert to the immediately previous version. You’ll have to keep a rollout history of deployable images to be able to execute a rollback deployment that reverts your application to older versions.

There is a special case to be careful with, though. As the application evolves, so do the database schema and stored data. If you design the database to evolve while preserving backwards compatibility with earlier application versions, you should be fine. But sometimes this may not be possible and we are required to release a version with breaking changes to the database. A strategy to handle this scenario is to incorporate migration scripts into the deployment, one for committing changes to the database, and one for reverting them (a sketch follows below).
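
Sketching that idea, the paired scripts could be modeled as a single migration with an apply and a revert step (the IMigration interface and the example migration below are illustrative assumptions, not a specific migration framework’s API):

public interface IMigration
{
    void Up();   // commits the breaking change during deployment
    void Down(); // reverts the change during a rollback deployment
}

public class SplitUserNameColumnMigration : IMigration
{
    public void Up()
    {
        // e.g. add FirstName/LastName columns, backfill them from Name, then drop Name
    }

    public void Down()
    {
        // e.g. re-create Name, backfill it from FirstName/LastName, then drop the new columns
    }
}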


This article is intended to provide guidelines and best practices to help reduce the risks associated with the deployment process. It takes reasonable effort and time to achieve a mature and reliable deployment process that works for your application, so don’t feel rushed to build a full-fledged process right from the start; instead, commit yourself to gradually and constantly improving it.