Jan 27, 2020 - Mail submission under the hood

A couple of years ago I implemented a mail submission component capable of interacting with SMTP servers over the internet for delivering e-mail messages. Developing it helped me better understand the inner workings of e-mail transmission, which I share in this article.

I named this component “Mail Broker” so as not to confuse it with the “Message Submission Agent” already defined in RFC 6409 along with the other participating agents of the electronic mail infrastructure:

Message User Agent (MUA): A process that acts (often on behalf of a user and with a user interface) to compose and submit new messages, and to process delivered messages.

Message Submission Agent (MSA): A process that conforms to this specification. An MSA acts as a submission server to accept messages from MUAs, and it either delivers them or acts as an SMTP client to relay them to an MTA.

Message Transfer Agent (MTA): A process that conforms to SMTP-MTA. An MTA acts as an SMTP server to accept messages from an MSA or another MTA, and it either delivers them or acts as an SMTP client to relay them to another MTA.

This Mail Broker is intended for sending e-mail messages directly from within .NET applications, with support for alternate views and file attachments. Let’s take a look at some sample usage code (C#) to make this clearer:


using (var msg = new MailMessage())
{
    msg.From = new MailAddress("admin@mydomain.com", "System Administrator");
    msg.To.Add(new MailAddress("john@yahoo.com"));
    msg.CC.Add(new MailAddress("bob@outlook.com"));

    msg.Subject = "Hello World!";
    msg.Body = "<div>Hello,</div><br><br>" +
        "<div>This is a test e-mail with a file attached.</div><br><br>" +
        "<div>Regards,</div><br><div>System Administrator</div>";
    msg.IsBodyHtml = true;

    var plain = "Hello,\n\nThis is a test e-mail with a file attached.\n\n" +
        "Regards,\nSystem Administrator";
    var alt = AlternateView.CreateAlternateViewFromString(plain, new ContentType("text/plain"));
    msg.AlternateViews.Add(alt);

    var filePath = @"...\some-folder\pretty-image.png";
    var att = new Attachment(filePath);
    msg.Attachments.Add(att);
		
    var config = new BrokerConfig { HostName = "mydomain.com" };
    using (var broker = new MailBroker(config))
    {
        var result = broker.Send(msg);
        Console.WriteLine($"Delivered Messages: {result.Count(r => r.Delivered)}");
    }
}

The key difference is that messages can be sent from any source e-mail address directly to the destination SMTP servers, as long as you perform the required sender domain and IP address authentication steps to avoid being blacklisted for misuse (more on that below).

I put it together by integrating a couple of GitHub projects, fixing bugs, and implementing my own code as well. You can access its repository on GitHub here.

A lot of networking, encoding and security concepts were involved in this experiment, which I share in the following sections. I believe they are valuable for anyone trying to get a picture of what happens under the hood when we send an ordinary e-mail.

Internal Structure

The diagram below illustrates the Mail Broker internal structure:

mail broker

It is implemented on top of networking and message encoding components. Native framework components / resources such as TcpClient, RSACryptoServiceProvider, SHA256CryptoServiceProvider, Convert.ToBase64String and others were omitted from the diagram for simplicity.

I also took advantage of the .NET MailMessage class, which can be sent using the now obsolete System.Net.Mail.SmtpClient, since it already represents all fields and views of a typical e-mail message.

Sending a Message

Internally the message delivery is performed by the following piece of code:


var mxDomain = ResolveMX(group);

if (!string.IsNullOrWhiteSpace(mxDomain))
{
    Begin(mxDomain);
    Helo();
    if (StartTls())
        Helo();
    foreach (var addr in group)
        results.Add(Send(message, addr));
    Quit();
}
else
{
    foreach (var addr in group)
        results.Add(new Result(delivered: false, recipient: addr));
}

This piece is executed for each group of destination addresses (To, Cc, Bcc) sharing the same domain. First it resolves the mail exchanger (MX) record for that domain, to find out where to connect for submitting the message. If no MX domain is found, the code marks delivery to these recipients as failed. If the MX domain is found, it proceeds by opening an SMTP session with this external mail server.
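
For illustration, MX resolution could be done in .NET with a third-party DNS library; below is a minimal sketch assuming the DnsClient.NET NuGet package (an assumption on my part; the repository's actual ResolveMX may be implemented differently):


// Hedged sketch only: resolve the highest-priority MX host for a domain
// using the DnsClient.NET package (not necessarily how ResolveMX works).
using System.Linq;
using DnsClient;
using DnsClient.Protocol;

static string ResolveMxSketch(string domain)
{
    var lookup = new LookupClient();
    var response = lookup.Query(domain, QueryType.MX);

    // Lower Preference means higher priority
    var mx = response.Answers.MxRecords()
        .OrderBy(r => r.Preference)
        .FirstOrDefault();

    return mx?.Exchange.Value; // e.g. "gmail-smtp-in.l.google.com."
}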

The Mail Broker will connect to the SMTP server on port 25, wait for a “ServiceReady” status code, and start the conversation by sending a HELO command. Then it tries to upgrade this plain-text connection to an encrypted one by issuing a STARTTLS command. If the mail server doesn’t support that, it falls back to the plain-text connection for delivering the message, which is really bad security-wise, but is the only way to deliver the message in this case 😕
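
The upgrade itself maps to standard .NET primitives: once the server accepts STARTTLS, the existing TcpClient stream can be wrapped in an SslStream, which then carries the rest of the SMTP conversation. A minimal sketch (not the Mail Broker's exact code):


// Hedged sketch: upgrade an already-connected plain-text SMTP session to TLS.
// Assumes the server has just replied positively to the STARTTLS command.
using System.Net.Security;
using System.Net.Sockets;

static SslStream UpgradeToTls(TcpClient client, string serverHost)
{
    var ssl = new SslStream(client.GetStream(), leaveInnerStreamOpen: false);
    ssl.AuthenticateAsClient(serverHost); // performs the TLS handshake
    return ssl; // subsequent SMTP commands and responses flow through this stream
}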

We all know that communication on the internet should always be encrypted, but I was amazed to find out how many SMTP servers still don’t support it and communicate in plain text, just hoping that no one is watching! Google’s Transparency Report indicates that today more than 10% of all e-mail messages sent by Gmail are still unencrypted, usually because the receiving end doesn’t support TLS.

Moving forward, after upgrading the connection, the code issues SMTP commands for individually sending the message to each destination address in this domain group, and then QUITs the session. The sequence for sending the message is contained in the method shown below:


private Result Send(MailMessage message, MailAddress addr)
{
    WriteLine("MAIL FROM: " + "<" + message.From.Address + ">");
    Read(SmtpStatusCode.Ok);

    WriteLine("RCPT TO: " + "<" + addr.Address + ">");
    SmtpResponse response = Read();

    if (response.Status == SmtpStatusCode.Ok)
    {
        WriteLine("DATA ");
        Read(SmtpStatusCode.StartMailInput);
        WritePayload(message);
        WriteLine(".");
        response = Read();
    }
    else
    {
        WriteLine("RSET");
        Read(SmtpStatusCode.ServiceReady, SmtpStatusCode.Ok);
    }

    return new Result(
        delivered: response.Status == SmtpStatusCode.Ok,
        recipient: addr,
        channel: channel,
        response: response
    );
}

This code reproduces the typical SMTP transaction scenario described in RFC5321:

1.    C: MAIL FROM:<Smith@bar.com>
2.    S: 250 OK
3.    C: RCPT TO:<Jones@foo.com>
4.    S: 250 OK
5.    C: DATA
6.    S: 354 Start mail input; end with <CRLF>.<CRLF>
7.    C: Blah blah blah...
8.    C: ...etc. etc. etc.
9.    C: .
10.   S: 250 OK
11.   C: QUIT
12.   S: 221 foo.com Service closing transmission channel

Lines 7 through 9 perform the transmission of the encoded mail message payload. This is handled by the WritePayload(message) call and the subsequent WriteLine(".") call in the code snippet above, which we will cover in more detail in the next section.
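
One subtle detail of this phase is transparency, or “dot-stuffing” (RFC 5321, section 4.5.2): since a line containing a single “.” terminates the DATA phase, any payload line that starts with a period must have an extra period prepended by the client. A hedged illustration of the idea (the Mail Broker's internal handling isn't shown here):


// Illustrative only: escape payload lines that begin with a period so the
// server does not mistake them for the end-of-data marker.
static string DotStuff(string payloadLine)
{
    return payloadLine.StartsWith(".") ? "." + payloadLine : payloadLine;
}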

Encoding the Message

Preparing the message for transmission is tiresome, with lots of minor details to consider. It can be broken down into two parts, the header and the content. Let’s start with the header. The code below was extracted from the MailPayload class and is responsible for generating the encoded message headers for transmission:


WriteHeader(new MailHeader(HeaderName.From, GetMessage().From));

if (GetMessage().To.Count > 0)
    WriteHeader(new MailHeader(HeaderName.To, GetMessage().To));

if (GetMessage().CC.Count > 0)
    WriteHeader(new MailHeader(HeaderName.Cc, GetMessage().CC));

WriteHeader(new MailHeader(HeaderName.Subject, GetMessage().Subject));

WriteHeader(new MailHeader(HeaderName.MimeVersion, "1.0"));

if (IsMultipart())
{
    WriteHeader(HeaderName.ContentType, "multipart/mixed; boundary=\"" + GetMainBoundary() + "\"");
    WriteLine();

    WriteMainBoundary();
}

These are pretty much straightforward:

  • From: Specifies the author(s) of the message
  • To: Contains the address(es) of the primary recipient(s) of the message
  • Cc: Contains the addresses of others who are to receive the message, though the content of the message may not be directed at them
  • Subject: Contains a short string identifying the topic of the message
  • MimeVersion: An indicator that this message is formatted according to the MIME standard, and an indication of which version of MIME is used
  • ContentType: Format of content (character set, etc.)

If the message is multipart, i.e., contains any alternate view or attachment, then there’s a need to define a content boundary for transmitting all content parts, which can have different content types and encodings.
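
To give an idea of what this looks like on the wire, here is a simplified, hand-written skeleton of a multipart payload with an HTML body and one attachment (illustrative only; the exact output produced by MailPayload may differ):


Content-Type: multipart/mixed; boundary="MAIN_BOUNDARY"

--MAIN_BOUNDARY
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: quoted-printable

<div>Hello,</div>...

--MAIN_BOUNDARY
Content-Type: image/png; name=pretty-image.png
Content-Disposition: attachment; filename=pretty-image.png
Content-Transfer-Encoding: base64

iVBORw0KGgoAAAANSUhEUgAA... (base64 data continues)

--MAIN_BOUNDARY--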

Notice that I’m not using the Bcc header, as RFC 822 leaves it open to the system implementer:

Some systems may choose to include the text of the “Bcc” field only in the author(s)’s copy, while others may also include it in the text sent to all those indicated in the “Bcc” list

There are dozens of other headers available providing more elaborate functionality. The initial IANA registration for permanent mail and MIME message header fields can be found in RFC4021, and is overwhelming. You can see I only covered the most basic headers in this experiment.

Now let’s look at how the message content is being handled:


private void WriteBody()
{
    var contentType = (GetMessage().IsBodyHtml ? "text/html" : "text/plain");
    WriteHeader(HeaderName.ContentType, contentType + "; charset=utf-8");
    WriteHeader(HeaderName.ContentTransferEncoding, "quoted-printable");
    WriteLine();
    WriteLine(QuotedPrintable.Encode(GetMessage().Body));
    WriteLine();
}

The body content can be sent as plain text or as HTML, and an additional ContentType header indicates which one is used. There’s also a ContentTransferEncoding header, which in this case is always quoted-printable, a binary-to-text encoding that uses printable ASCII characters to transmit 8-bit data over a 7-bit data path. It also limits line length to 76 characters for legacy reasons (RFC 2822 line length limits):

There are two limits that this standard places on the number of characters in a line. Each line of characters MUST be no more than 998 characters, and SHOULD be no more than 78 characters, excluding the CRLF
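
As a quick illustrative example (assuming UTF-8 text), each non-ASCII byte is replaced by “=” followed by its hexadecimal value, and lines longer than the limit are broken with a trailing “=” soft line break:


Original text:      Olá, café!
Quoted-printable:   Ol=C3=A1, caf=C3=A9!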

Attachment binary data is encoded to Base64, yet another binary-to-text encoding, before being written to the underlying SMTP channel:


private void WriteAttachment(Attachment attachment)
{
    WriteHeader(HeaderName.ContentType, attachment.ContentType.ToString());
    WriteHeader(HeaderName.ContentDisposition, GetContentDisposition(attachment));

    if (!string.IsNullOrWhiteSpace(attachment.ContentId))
        WriteHeader(HeaderName.ContentID, "<" + attachment.ContentId + ">");

    WriteHeader(HeaderName.ContentTransferEncoding, "base64");
    WriteLine();

    WriteBase64(attachment.ContentStream);
    WriteLine();
}

Even though Base64 encoding adds about 37% to the original file length, it is still widely adopted today. It may seem strange and wasteful but, for historical reasons, it became the standard for transmitting e-mail attachments, and it would require a huge effort to change this.
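
In .NET the framework does most of the heavy lifting here. A minimal sketch (not necessarily how the Mail Broker's WriteBase64 is implemented):


// Hedged sketch: read an attachment stream and emit base64 text wrapped at
// 76 characters per line, as expected for the base64 transfer encoding.
using System;
using System.IO;

static string EncodeAttachment(Stream contentStream)
{
    using var buffer = new MemoryStream();
    contentStream.CopyTo(buffer);
    return Convert.ToBase64String(
        buffer.ToArray(),
        Base64FormattingOptions.InsertLineBreaks); // line break every 76 chars
}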

The code for writing headers and content related to alternative views was omitted for simplicity. If you’re curious you can find it in the MailPayload class source code on GitHub.

Securing the Message

Suppose the message was encoded and transmitted as described above. How can we trust the destination mail server not to tamper with the message’s headers and contents? The answer is that we can’t! So far there’s nothing preventing it from changing the message before delivering it to the end user.

Fortunately, to address this issue we can employ the DomainKeys Identified Mail (DKIM) authentication method:

DKIM allows the receiver to check that an email claimed to have come from a specific domain was indeed authorized by the owner of that domain. It achieves this by affixing a digital signature, linked to a domain name, to each outgoing email message. The recipient system can verify this by looking up the sender’s public key published in the DNS. A valid signature also guarantees that some parts of the email (possibly including attachments) have not been modified since the signature was affixed.
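
In practice the signature travels in a DKIM-Signature header prepended to the message. An illustrative example (tag values made up for this article) showing the domain (d=), selector (s=), signed header list (h=), body hash (bh=) and signature (b=) tags:


DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=mydomain.com;
    s=selector1; h=from:to:subject:mime-version;
    bh=Base64BodyHashGoesHere=; b=Base64SignatureGoesHere=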

I’ve integrated a DKIM Signer component to the Mail Broker for signing encoded messages before sending them. To use it we need to provide a DkimConfig object to the Mail Broker constructor:


public class DkimConfig
{
    public string Domain { get; set; }

    public string Selector { get; set; }

    public RSAParameters PrivateKey { get; set; }
}

The domain and selector parameters are used by receiving SMTP servers to verify the DKIM signature header: they perform a DNS lookup on <selector>._domainkey.<domain> to retrieve a TXT record containing the signature’s public key information, for instance:

"v=DKIM1; k=rsa; t=s; p=MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDDmzRmJRQxLEuyYiyMg4suA2Sy
MwR5MGHpP9diNT1hRiwUd/mZp1ro7kIDTKS8ttkI6z6eTRW9e9dDOxzSxNuXmume60Cjbu08gOyhPG3
GfWdg7QkdN6kR4V75MFlw624VY35DaXBvnlTJTgRg/EW72O1DiYVThkyCgpSYS8nmEQIDAQAB"
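
Generating the key material for this configuration is straightforward with the framework's RSA APIs. A hedged sketch (how the repository expects the key to be produced, and how the public part ends up in the DNS record, are simplified assumptions here):


// Hedged sketch: create a 2048-bit RSA key pair and build the DkimConfig.
using System.Security.Cryptography;

var rsa = new RSACryptoServiceProvider(2048);

var dkimConfig = new DkimConfig
{
    Domain = "mydomain.com",
    Selector = "selector1",
    PrivateKey = rsa.ExportParameters(true) // include the private parameters
};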

This configuration object is passed to the MailPayload class for writing signed content:


private void WriteContent()
{
    if (config != null)
        WriteSignedContent();
    else
        WriteUnsignedContent();
}

private void WriteSignedContent()
{
    var content = new MailPayload(GetMessage());

    var dkim = new DkimSigner(config);
    var header = dkim.CreateHeader(content.Headers, content.Body, signatureDate);

    WriteHeader(header);
    Write(content);
}

As you can see, WriteSignedContent instantiates another MailPayload instance without passing the DkimConfig constructor parameter. This results in the mail message being encoded using the WriteUnsignedContent method. Once headers and body are encoded, it instantiates a DkimSigner and creates a signature header that precedes the content in the mail data transmission.

Making this DKIM signing component work was the most difficult task in this project. You can take a closer look at how it works from MailPayload unit tests here.

Authenticating Senders

At the beginning of this post I wrote that the Mail Broker can send messages from any source e-mail address, just like Mailchimp can send messages on your behalf without you ever giving it your e-mail password. How is that possible, you may ask? The answer is that e-mail delivery is a reputation-based system.

But keep in mind that sending the message doesn’t mean it will be accepted, especially if the sender doesn’t have a good reputation. Here’s what happens when I try to send a message from no-reply@thomasvilhena.com to a Gmail mailbox from a rogue server (IP obfuscated):

421-4.7.0 [XX.XXX.XXX.XXX XX] Our system has detected that this message is
421-4.7.0 suspicious due to the very low reputation of the sending IP address.
421-4.7.0 To protect our users from Spam, mail sent from your IP address has
421-4.7.0 been temporarily rate limited. Please visit
421 4.7.0 https://support.google.com/mail/answer/188131 for more information. e6si8481552qkg.297 - gsmtp

You see, Gmail found the message suspicious, and as I checked, it didn’t even reach the destination mailbox’s spam folder. The sender reputation was so low the message wasn’t even spam worthy!

Here’s what Gmail suggests to prevent messages from being marked as Spam, or not being delivered to the end user at all:

  • Verify the sending server PTR record. The sending IP address must match the IP address of the hostname specified in the Pointer (PTR) record.
  • Publish an SPF record for your domain. SPF prevents Spammers from sending unauthorized messages that appear to be from your domain.
  • Turn on DKIM signing for your messages. Receiving servers use DKIM to verify that the domain owner actually sent the message. Important: Gmail requires a DKIM key of 1024 bits or longer.
  • Publish a DMARC record for your domain. DMARC helps senders protect their domain against email spoofing.

Besides the DKIM authentication method already discussed, there are three more DNS-based authentication methods specifically designed for IP address and domain validation on the receiving end, which increase the likelihood that the destination SMTP server trusts that you are who you say you are, and not a spammer / scammer.
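
For reference, SPF and DMARC policies are also published as DNS TXT records. Simplified, illustrative examples (the actual policy values depend on your setup):


mydomain.com.          TXT   "v=spf1 ip4:203.0.113.10 -all"
_dmarc.mydomain.com.   TXT   "v=DMARC1; p=quarantine; rua=mailto:postmaster@mydomain.com"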

These authentication methods, along with strictly following mailing best practices (e.g., sticking to a consistent sending schedule and frequency, implementing mailing list opt-in / unsubscribe, constantly checking blacklists, carefully building messages, etc.), will help, bit by bit, to build a good sender reputation and have e-mail messages accepted and delivered to the destination users’ inboxes.

Delivery Test

This article wouldn’t be complete without a successful delivery test, so I followed the steps presented in the previous sections and:

  1. Created an EC2 instance running on AWS
  2. Created an SPF record for that instance’s public IP Address
  3. Generated a private RSA key for signing messages
  4. Turned on DKIM by creating a DNS record publishing the public key
  5. Ran the Mail Broker from the EC2 instance sending a “Hello World” message

This time the response from Gmail server was much more friendly:

250 2.0.0 OK 1580057800 f15si8371297qtg.4 - gsmtp

Nevertheless, since my domain has no mailing reputation yet, Gmail directed the message to the Spam folder:

Gmail Spam

Here are the delivery details:

Gmail summary

Notice that standard encryption (TLS) was used for delivering the message, so the connection upgrade worked as expected.

What about SPF authentication and DKIM signature, did they work? Using Gmail’s “show original” feature allows us to inspect the received message details quite easily:

Gmail headers

Success! Both SPF and DKIM verifications passed the test ✔️


Building a mail delivery system from scratch is no simple feat. There are tons of specifications to follow, edge cases to implement, and different external SMTP servers that your system will need to handle. In this experiment I only implemented the basic cases, and interacted mainly with Gmail’s server, which is outstanding, giving great feedback even in failure cases. Most SMTP servers out there won’t do that.

Mail transmission security has improved a lot over the years, but still has a long way to go to reach acceptable standards.

It was really fun running this experiment, but I must emphasize it’s not intended for production use. I guess I will just stick to a managed mail sending API for now 😉

Jan 15, 2020 - Likelihood of autoregression in stock returns

The standard rule for calculating a stock’s monthly return volatility is to multiply the daily volatility by the square root of the number of business days in that month, for instance:

Monthly volatility
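
In symbols, for a month with n business days, the rule above is simply:


\sigma_{month} = \sigma_{day} \sqrt{n}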

This rule is derived from a model of independent log-normal daily returns. However, if the stochastic process is considered autoregressive, i.e., having its output variable linearly dependent on its own previous values, then this rule is likely to give us inaccurate volatilities.

Autoregressive behaviour can express itself as “momentum”, perpetuating the effect of an upward or downward movement, or as “reversal”, acting against it. The first case can lead to a higher volatility than that calculated by the standard rule, and the latter case can lead to a lower volatility.

In this post I evaluate whether an autoregressive Gaussian model better describes the behavior of observed daily stock returns, when compared to a non-autoregressive control model.

The code I share in this article is written in the R programming language, using the following packages: plotly, ggplot2, sfsmisc.

Data Set

Just like in the previous article, I picked three years’ worth of PETR4 historical log returns to work on; PETR4 is the most actively traded stock in B3, the largest stock exchange in Brazil. You can find the source data in the links below:

To generate these files I used a simple parser I built for B3 historical quotes, freely available at their website from 1986 up to the current year, which you can find here.

The Autoregressive (AR) Model

Our AR model is defined as:

AR model

Where:

  • μ → is the Gaussian distribution mean
  • σ → is the Gaussian distribution standard deviation
  • ki → is the AR parameter of order “i”
  • p → is the AR model order

Let’s take a moment to analyze it. It is composed of two terms, a stochastic term and an AR term. If p, the model order, is zero, then it becomes a simple non-autoregressive Gaussian model, similar to the one I used in my previous post. If p is greater than zero, then past outputs are taken into account when calculating the current output, weighted by the ki parameters.
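
For readers following the code, my transcription of the model (matching the likelihood function below rather than copied from the figure above) is:


x_t = \sum_{i=1}^{p} k_i \, x_{t-i} + \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(\mu, \sigma^2)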

Model Fitting

Since my last post I have sharpened my R skills a bit 🤓 and learned how to use optimization functions for fitting parameters more efficiently (instead of using a brute-force approach). First let me define the AR model’s negative log-likelihood function, which will be minimized:


# calculates the log-likelihood for the AR model and returns its negative (for minimization)
# k: model parameters ([σ, μ, kp, kp-1, ... , k1 ])
# x: stock returns samples
likelihood <- function(k, x) {
  lke <- 0
  offset <- 21  # start past the longest lag so every order is scored on the same samples
  order <- length(k) - 2
  for(i in offset:length(x)) {
    xt <- x[i]
    if (order > 0) {
      xt <- xt - sum(k[3:length(k)] * x[(i-order):(i-1)])
    }
    p <- dnorm(xt, sd=k[1], mean=k[2])
    lke <- lke + log(p)
  }
  return (-lke)
}

Now let’s read the data set and optimize our model’s parameters for varying orders p, starting at zero (the non-autoregressive control case) and going up to a twenty-day regression:


x <- read.table("PETR4_2019.txt")[["V1"]]

max_order <- 20

res <- lapply(0:max_order, function(n) {
  return (optim(c(0.01, 0.0, rep(-0.01, n)), function(k) likelihood(k, x), method="BFGS"))
})

The resulting res list will hold, for each model order analyzed, the optimized parameter set and the corresponding (negative) log-likelihood value. Let’s check the results. Below is the code for plotting each model order’s log-likelihood ratio relative to the order-zero control, which will help us find the best performing one:


df <- data.frame(
  x = seq(0, max_order), y = sapply(res, function(e) (res[[1]][[2]]-e[[2]]))
)

p <- ggplot() +
  ggtitle("Order Analysis") +
  xlab("Order (p)") +
  ylab("Log-likelihood ratio") + 
  geom_line(data = df, aes(x=x, y=y), color='red')
ggplotly(p)

p analysis

Interestingly enough, it seems that AR models of higher orders yielded a much higher likelihood than lower orders, possibly indicating that daily stock returns are under the influence of an AR effect, but let’s not jump to any conclusions just yet!

So the best performing model was found to have p=20, a log-likelihood ratio of 7.0685184, and the following set of parameters:

  • σ → 0.0178999192
  • μ → 0.0010709814
  • [kp, kp-1, … , k1] → [-0.0210373765, 0.0038455320, -0.0233608796, 0.0329653224, -0.0824597966, 0.0177617412, 0.0142196315, 0.0313809394, 0.0519001996, -0.0562944497, 0.0034231849, 0.0255535975, 0.0824235508, -0.0832489175, -0.0863261198, -0.0008716017, -0.0996273827, -0.0920698729, -0.0613492897, -0.0640396245]

Which raises the question:

Is this result statistically significant to confirm an AR effect in the analyzed data?

Well, we will find that out in the next section.

Probability Value

In statistical hypothesis testing, the probability value or p-value is the probability of obtaining test results at least as extreme as the results actually observed during the test, assuming, in our case, that there is no autoregressive effect in place. If the p-value is low enough, we can determine, with reasonable confidence, that the non-autoregressive model hypothesis, as it stands, is false.

In order to calculate the p-value we first need to estimate the probability distribution of the observed log-likelihood ratio under the hypothesis that our process follows a non-autoregressive Gaussian model (the control process). So let’s replicate the analysis above under this assumption:


iterations <- 1000   # number of analysis per order
max_order <- 20      # same number of orders as before
num_samples <- 247   # replicating observed data length (1 year)
wiener_std <- 0.018  # Control model standard deviation

res <- lapply(1:iterations, function(z) {
  
  # generates independent normally distributed samples
  x <- rnorm(num_samples, 0, wiener_std)
  
  y <- lapply(0:max_order, function(n) {
    return (optim(c(0.01, 0.0, rep(-0.01, n)), function(k) likelihood(k, x), method="BFGS"))
  })
  
  return (sapply(y, function(e) c(-e[[2]], e[[4]])))
})

Beware that this code may take a couple of hours to finish running, or more, depending on your machine’s configuration. It generates 1000 log-likelihood ratio samples for each model order. The box plot below gives us a qualitative understanding of this output:


df <- lapply(0:max_order, function(k) {
  vec <- t(sapply(1:length(res), function(l) {
    return (c(k, res[[l]][1, k + 1] - res[[l]][1, 1]))
  }))
  return (data.frame(
    period = vec[,1],
    ratio = vec[,2]
  ))
})

df <- do.call(rbind, df)
df$period <- as.factor(df$period)

ggplot(df, aes(x=period, y=ratio)) +
  ggtitle("Control process analysis") +
  xlab("Order (p)") +
  ylab("Log-likelihood ratio") + 
  geom_boxplot()

control process likelihood

This result took me by surprise: the higher the order of the AR model, the better it becomes at fitting samples generated by a non-AR model! It sure looks counterintuitive at first, but on second thought it’s not that surprising at all, since adding more parameters to the AR model makes it easier for the model to mimic this particular set of generated samples.

Moving forward, I defined a function that calculates the one-tailed p-value from this resulting res list. A margin of error of 3% (95% confidence interval) is expected since 1000 samples were used:


# calculates the one-tailed p-value
# res: pre-processed log-likelihood ratios
# n: the model order
# lke: log-likelihood ratio whose p-value should be obtained

p_value <- function (res, n, lke) {
  vec <- sapply(1:length(res), function(l){
       return (res[[l]][1, n + 1] - res[[l]][1, 1])
  })
  d_lke <- density(vec)
  i <- findInterval(lke, d_lke[["x"]])
  l <- length(d_lke[["x"]])
  if (i == l) {
    return (0.00)
  }
  return (integrate.xy(d_lke[["x"]][i:l], d_lke[["y"]][i:l]))
}

Thus, calling this function as p_value(res, 20, 7.0685184) for our original “best fit” yields a p-value of 0.814. This figure is nowhere near the typical 0.05 threshold required for rejecting the non-AR model hypothesis.

Also notice that the log-likelihood ratio alone is not sufficient for determining the AR model order that best outperforms the control model, as we need to look at the p-value instead. I ran the analysis for the previous years as well and the complete p-value results (2017, 2018, 2019) are displayed below:

Order (p) 2017 2018 2019
1 0.000 0.069 0.399
2 0.002 0.195 0.502
3 0.005 0.005 0.420
4 0.000 0.011 0.336
5 0.001 0.000 0.465
6 0.001 0.000 0.338
7 0.003 0.000 0.295
8 0.005 0.000 0.296
9 0.008 0.001 0.360
10 0.014 0.002 0.452
11 0.011 0.001 0.456
12 0.019 0.001 0.479
13 0.024 0.002 0.540
14 0.028 0.001 0.614
15 0.043 0.002 0.661
16 0.014 0.003 0.617
17 0.013 0.004 0.670
18 0.011 0.005 0.725
19 0.016 0.005 0.775
20 0.007 0.005 0.814

Despite the weak results for 2019, after taking the margin of error into account, 17 out of 20 model orders passed the significance test for the year 2017 and 18 out of 20 model orders passed the significance test for the year 2018.

Conclusion

I got mixed results in this analysis. If I were to look at the stock’s daily returns for the year 2019 alone, I might conclude that the control model hypothesis couldn’t be rejected in favor of the AR model. However, the results for the years 2017 and 2018 seem strong enough to conclude otherwise. This may indicate an intermittent characteristic of the AR behavior’s intensity. Further investigation is required.

Dec 23, 2019 - Stock option pricing inference

I’m a stock market enthusiast 📈, so lately I have spent my spare time on the topic of stock option pricing, and I was curious to find out whether or not I could build a script that accurately reflected real market option prices. It turns out I came close, and learned a lot along the way, which I share in this article.

My analysis uses the Monte Carlo method for pricing options, which, of all the pricing models available in the literature, is the most straightforward and practical one for a software developer to start working with, when compared to closed-form analytic models such as the Black–Scholes model.

The code I share in this article is written in the R programming language, using the following packages: sfsmisc, plotly, ggplot2.

Data Set

I had to pick a stock to work on as the starting point, so I decided to use PETR4 which is the most actively traded stock in B3, the largest stock exchange in Brazil, and also the stock with the most actively traded options.

On December 17, 2019, while PETR4 was priced at R$ 29.71, I spotted the following European call option prices:

Symbol Strike Bid Ask
PETRA232 22.81 6.92 7.09
PETRA239 23.81 5.93 6.09
PETRA252 24.81 4.98 5.09
PETRA257 25.31 4.48 4.64
PETRA261 25.81 4.01 4.15
PETRA266 26.31 3.57 3.68
PETRA271 26.81 3.12 3.22
PETRA277 27.31 2.67 2.76
PETRA281 27.81 2.24 2.33
PETRA285 28.31 1.88 1.90
PETRA289 28.81 1.52 1.54
PETRA296 29.31 1.19 1.21
PETRA301 29.81 0.92 0.93
PETRA307 30.31 0.69 0.71
PETRA311 30.81 0.52 0.53
PETRA317 31.31 0.38 0.39
PETRA325 32.31 0.20 0.22
PETRA329 32.81 0.16 0.17
PETRA333 33.31 0.12 0.13
PETRA342 33.81 0.10 0.11
PETRA343 34.31 0.07 0.09
PETRA351 34.81 0.06 0.07
PETRA358 35.81 0.04 0.05
PETRA363 36.31 0.03 0.04
PETRA368 36.81 0.02 0.04
PETRA373 37.31 0.02 0.03
PETRA378 37.81 0.02 0.03
PETRA383 38.31 0.01 0.02
PETRA388 38.81 0.01 0.02
PETRA393 39.31 0.01 0.02
PETRA398 39.81 0.01 0.02

The maturity of these options is the same, January 20, 2020, i.e., 20 business days from the date the prices were recorded. I saved this data into a CSV formatted file that you can find here.

Besides that, I also used six months’ worth of daily PETR4 log returns, from 17/Jun/2019 to 16/Dec/2019 (126 samples), which you can find here.

Distribution of Stock Returns

Under the Monte Carlo method, in order to calculate the price of a stock option, first we need a probability distribution model of the stock returns. Let’s take a look at the probability density estimate for our stock:


x <- read.table("PETR4_Returns.txt")[["V1"]]

d_real <- density(x)
df <- data.frame(x = d_real[["x"]], y = d_real[["y"]])

p <- ggplot() +
  ggtitle("PETR4 daily returns density estimate") +
  geom_line(data = df, aes(x=x, y=y), colour="black") 
ggplotly(p)

PETR4 density

Because the number of samples is small, we don’t get the fine-looking symmetric density estimate we would like. Instead, what we get is a density estimate somewhat sensitive to the trends and macroeconomic events of this six-month period. If we were to look at the previous six-month period, this observed density estimate would certainly show differences.

The next step is to find a model that reasonably fits this observed density. While there isn’t an agreement on which probability density function best describes stock returns, I decided to turn to two notable, widely adopted, basic functions: the Gaussian distribution and the Cauchy distribution.

Gaussian Model

Let’s try to fit a Gaussian density function to our market data, and see what it looks like. First I will define a couple of auxiliary functions:


# generates a truncated gaussian density
dgauss <- function(std, mean, min, max) {
  d.x <- (0:511) * (max - min) / 511 + min
  d.y <- dnorm(d.x, mean, sd = std)
  d.y <- d.y / integrate.xy(d.x, d.y)
  f <- data.frame(x = d.x, y = d.y)
  return (f)
}

# interpolates data frame
interpolate <- function(df, min, max, n) {
  step <- (max - min) / n
  pol <- data.frame(
    with(df, 
         approx(x, y, xout = seq(min, max, by = step))
    )
  )
  return (pol)
}

# simple numeric maximum likelihood estimation
# (resolution refers to that of daily log returns)
mle <- function(dist, x) {
  min_x <- min(x)
  max_x <- max(x)
  resolution <- 10000
  n = (max_x - min_x) * resolution
  rev <- interpolate(data.frame(x = dist$x, y = dist$y), min_x, max_x, n)
  indexes = sapply(x, function (x) { return ((x - min_x) * resolution + 1) })
  return (log(prod(rev$y[indexes])))
}

With these functions I developed a simple brute force script to find the Gaussian distribution standard deviation that best fits the observed density estimate:


max_mle <- -Inf
d_norm <- c()

for(k in seq(0.001, 0.1, by = 0.0001)) {
  dist <- dgauss(k, mean(x), -6 * sd(x), 6 * sd(x))
  val <- mle(dist, x)
  if (val > max_mle) {
    d_norm <- dist
    max_mle <- val
  }
}

df1 <- data.frame(x = d_real[["x"]], y = d_real[["y"]])
df2 <- data.frame(x = d_norm[["x"]], y = d_norm[["y"]])

p <- ggplot() +
  ggtitle("Probability densities") +
  scale_colour_manual("", values = c("red", "blue")) +
  geom_line(data = df1, aes(x=x, y=y, colour="Observed")) +
  geom_line(data = df2, aes(x=x, y=y, colour="Gaussian"))
ggplotly(p)

Gaussian density

Both densities display a similar shape, but it’s perceptible that the Gaussian density decreases marginally faster than the observed density.

Despite the fact that the Gaussian distribution is widely used in financial models, it has some well-known pitfalls, namely its inability to encompass the fat tails observed in historical market data. As a result it will fail to accurately describe extreme volatility events, which can lead to possibly underpriced options.

Cauchy Model

The Cauchy distribution, just like the Gaussian distribution, is a stable distribution, but one with fat tails. Over the years several academic papers have looked into it as an alternative to best describe these extreme volatility events.
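
For reference, the (truncated) density fitted by the code below is the Cauchy probability density function with location t and scale s:


f(x) = \frac{1}{\pi s \left[ 1 + \left( \frac{x - t}{s} \right)^{2} \right]}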

So let’s see how it performs. Initially, I defined the following auxiliary function in order to fit it to the observed density.


# generates a truncated cauchy density
dcauchy <- function(s, t, min, max) {
  d.x <- (0:511) * (max - min) / 511 + min
  d.y <- 1 / (s * pi * (1 + ((d.x - t) / s) ^ 2))
  d.y <- d.y / integrate.xy(d.x, d.y)
  f <- data.frame(x = d.x, y = d.y)
  return (f)
}

The brute force fitting script is the same as the previous one, replacing the call to dgauss with dcauchy:


max_mle <- -Inf
d_cauchy <- c()

for(k in seq(0.001, 0.1, by = 0.0001)) {
  dist <- dcauchy(k * (sqrt(2/pi)), mean(x), -6*sd(x), 6*sd(x))
  val <- mle(dist, x)
  if (val > max_mle) {
    d_cauchy <- dist
    max_mle <- val
  }
}

df1 <- data.frame(x = d_real[["x"]], y = d_real[["y"]])
df2 <- data.frame(x = d_cauchy[["x"]], y = d_cauchy[["y"]])

p <- ggplot() +
  ggtitle("Probability densities") +
  scale_colour_manual("", values = c("red", "blue")) +
  geom_line(data = df1, aes(x=x, y=y, colour="Observed")) +
  geom_line(data = df2, aes(x=x, y=y, colour="Cauchy"))
ggplotly(p)

Cauchy density

Intuitively, by looking at this plot, it feels that the Cauchy density isn’t as good a fit as the Gaussian density. Indeed, the log-likelihood ratio of the Gaussian model to the Cauchy model is 7.5107, which is quite significant.

The general consensus on the Cauchy distribution is that its tails are too fat, which in our case may lead to overpriced options. Nevertheless, it can be very useful for defining a high volatility scenario for risk analysis.

Now that we have our probability distribution models we can go ahead and calculate option prices based on them.

Option Pricing

The technique applied to calculate the fair price of an option under the Monte Carlo method is:

(1) to generate a large number of possible, but random, price paths for the underlying stock via simulation, and (2) to then calculate the associated exercise value (i.e. “payoff”) of the option for each path. (3) These payoffs are then averaged and (4) discounted to today. This result is the value of the option.1
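
In symbols, for a European call with strike K and N simulated end-of-period prices, the Monte Carlo estimate described above is (hedged notation; note that the scripts below compute the expected payoff directly from the fitted density and omit the discount factor):


C \approx e^{-rT} \cdot \frac{1}{N} \sum_{i=1}^{N} \max\!\left(S_T^{(i)} - K,\; 0\right)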

To accomplish that I defined, yet again, more auxiliary functions:


# generates random samples for a probability density
dist_samples <- function(dist, n) {
  samples <- (approx(
    cumsum(dist$y)/sum(dist$y),
    dist$x,
    runif(n)
  )$y)
  samples[is.na(samples)] <- dist$x[1]
  return (samples)
}

# calculates the expected value of a probability density
exp_val <- function(dist) {
  v <- dist[["x"]] * dist[["y"]]
  return (sum(v) / sum(dist[["y"]]))
}

# converts a single-day probability density to an n-day period
# (removes any drift from the input density)
dist_convert <- function(dist, n) {
  dd <- data.frame(x = dist$x - exp_val(dist), y = dist$y)
  dn <- density(sapply(1:100000, function(x) (sum(dist_samples(dd, n)))))
  return (dn)
}

# calculates the price of a call option from a probability density
call_price <- function(dist, strike, quote) {
  v <- (((exp(1) ^ dist[["x"]]) * quote) - strike) * dist[["y"]]
  v <- pmax(v, 0)
  return (sum(v) / sum(dist[["y"]]))
}

And here’s the final script for calculating option prices based on the fitted Gaussian and Cauchy distributions, for each strike price, and then plotting the result against actual (Bid) prices:


stock_quote <- 29.71
days_to_maturity <- 20

df1 <- read.csv("PETR4_Calls.txt", header = TRUE, sep = ";")
strikes <- df1[["Strike"]]

dn <- dist_convert(d_norm, days_to_maturity)
y2 <- sapply(strikes, function(x) call_price(dn, x, stock_quote))
df2 <- data.frame(x = strikes, y = y2)

dn <- dist_convert(d_cauchy, days_to_maturity)
y3 <- sapply(strikes, function(x) call_price(dn, x, stock_quote))
df3 <- data.frame(x = strikes, y = y3)

p <- ggplot() +
  ggtitle("Option Prices") +
  xlab("Strike Price") +
  ylab("Option Price") +
  scale_colour_manual("", values = c( "black", "red", "blue")) +
  geom_line(data = df1, aes(x=Strike, y=Bid, colour="Actual")) +
  geom_line(data = df2, aes(x=x, y=y, colour="Gaussian")) +
  geom_line(data = df3, aes(x=x, y=y, colour="Cauchy"))
ggplotly(p)

Pricing models

As speculated, actual market prices are slightly above the Gaussian model prices (underpriced) and below the Cauchy model prices (overpriced). The Gaussian model came really close to the actual option prices, which makes sense, since its likelihood against the observed density was much higher.

So after analysing this plot I had an idea: since this final script can take any probability density to calculate option prices, what if I used the observed density estimate instead? Would it yield better, more accurate option prices?


stock_quote <- 29.71
days_to_maturity <- 20

df1 <- read.csv("PETR4_Calls.txt", header = TRUE, sep = ";")
strikes <- df1[["Strike"]]

dn <- dist_convert(d_real, days_to_maturity)
y2 <- sapply(strikes, function(x) call_price(dn, x, stock_quote))
df2 <- data.frame(x = strikes, y = y2)

p <- ggplot() +
  ggtitle("Option Prices") +
  xlab("Strike Price") +
  ylab("Option Price") +
  scale_colour_manual("", values = c( "black", "red")) +
  geom_line(data = df1, aes(x=Strike, y=Bid, colour="Actual")) +
  geom_line(data = df2, aes(x=x, y=y, colour="Empirical"))
ggplotly(p)

Empirical model

In fact, to my surprise, it did. Visually the lines appear to be right on top of each other. The table below displays the standard deviation of each model’s estimates from the actual market prices:

Model σ
Gaussian 0.0463
Cauchy 0.1369
Empirical 0.0290

Conclusion

The “empirical” model turned out to be the one that best approximates market prices, for this particular stock (PETR4) in the B3 stock exchange. Market option prices are determined by automated trading systems, usually employing much more sophisticated techniques than the one presented here, in real-time, almost instantaneously reflecting price shifts of the underlying stock. Even so, this simplified approach yielded great results, worth exploring more deeply.


Notes

  • I submitted this article on Hacker News and it received a lot of attention there. One comment correctly pointed out that the standard technique for fitting probability distributions is to use maximum likelihood estimation (MLE), instead of the least squares regression approach that I originally employed. So I ran the analysis again using MLE, observed improved results for both models, and updated the code and plots.
  • I also fixed a minor issue in the function that generates random samples for probability densities, and recalculated the “empirical” model standard deviation.
  • For reference you can find the original article here.

Sources

[1] Don Chance: Teaching Note 96-03: Monte Carlo Simulation