Introduction

Security is an ever-increasing risk. As we add more components to our infrastructure, the number of attack vectors increases. When building an authorization system such as Udaru, it is just as important to think about the security of Udaru itself as about the security of the systems that Udaru protects.

Creating an intrusion-detection system for Udaru poses some challenges. First, Udaru itself doesn’t place any restrictions on the type of application or domain it can be used for, so it is very difficult to create feature- or application-specific rules. Secondly, there is often little or no data about known intrusion attempts, especially when we want to detect intrusion patterns that haven’t been considered before.

Therefore, our intrusion-detection system dynamically creates rules based on a training dataset of previously seen valid requests. This kind of data modelling is known as zero-negative classification. Typical modern approaches such as machine learning are not well suited to this kind of problem. Instead, we use frequentist and Bayesian statistics for the models, as those approaches are better suited to this type of problem.

Udaru

Udaru is a Policy Based Access Control (PBAC) authorization system. Let’s take a user in an organisation. This user asks to perform an action on a certain resource. Udaru policies can control if this action is allowed or not. For example, we can validate if a user can write to a specific file on a server.

Udaru information flow
Udaru information flow: Udaru decouples the user login from the access control itself. This means when Udaru gets an authorization request, it assumes that the user login has already been validated.

It is important to understand that Udaru itself is not responsible for the user login. Intrusion-detection for attack vectors that are meant to obtain user credentials, such as brute-force attacks, timing attacks, or SQL injections, is therefore not within the scope of intrusion-detection for Udaru.

Intrusion Strategies

The intrusion-detection system looks at IP addresses and timestamps to see if a user moves too fast. This may indicate that an intruder has gained another person’s user credentials.

Secondly, the intrusion-detection system looks at the resource string to see if it is similar to previously seen resource strings. If not, this might indicate that an intruder is trying to gain access to resources through obscure means, such as special characters (resource:server:/public/server/../../../etc/passwd) or unusual syntax (resource:server:/public/server:/etc/passwd). In these cases, a poorly written policy may allow access because resource:server:/public/server matches the poorly written policy. Even if the policies are fine, it is still important to know that someone is attempting to break the authorization system.

Udaru poses some challenges in the second case because it doesn’t assume anything about the application. By default, it does not have any restrictions on how resource strings are supposed to look. In the example above, the resources are directories on servers, but they could just as well be physical doors in a corporate building or personal information used by human resources.

Transaction audit log

To facilitate logging the authorization data, nearForm has recently developed a new audit-log service called Trail. Trail lets you index the authorization data and make it easily accessible to the intrusion-detection system. To use Trail with Udaru, there is also an integration module called Udaru-Trail.

Geolocation Anomalies

For detecting users who “move” too fast, the IP address is translated to longitude and latitude using the MaxMind GeoLite2 database. Using the longitude and latitude of the last login and the new login, the distance between them can be calculated with the Haversine formula:

d = r \cdot \mathrm{archav}\left( \mathrm{hav}(\phi_2 - \phi_1) + \cos(\phi_1)\cos(\phi_2)\,\mathrm{hav}(\lambda_2 - \lambda_1) \right)

Here, r is the Earth’s radius, which is approximately 6,371 km.

With the traveled distance calculated, it is straightforward to compute the velocity assuming infinite acceleration:

v = \frac{d}{t_2 - t_1}

These equations require the following parameters:

  • (\lambda_1, \phi_1, t_1): longitude, latitude, and timestamp for the last login
  • (\lambda_2, \phi_2, t_2): longitude, latitude, and timestamp for the new login

For the Udaru intrusion-detection, a velocity above the speed of sound (343 m/s) is considered an intrusion.

The Haversine functions are not part of the standard Python math library, but they can easily be implemented using the following trigonometric identities:

\begin{aligned} \mathrm{hav}(\theta) &= \sin^2\left(\frac{\theta}{2}\right) \\ \mathrm{archav}(h) &= 2 \arcsin\left(\sqrt{h}\right) = 2 \mathrm{arctan2}\left(\sqrt{h}, \sqrt{1 - h}\right)\end{aligned}
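
For illustration, here is a minimal Python sketch of the geolocation check. The hav/archav helpers follow the identities above, and the IP lookup assumes the geoip2 package with a local copy of the GeoLite2 City database; the names and thresholds mirror the description above, and this is not the tool’s actual code.

import math

EARTH_RADIUS_M = 6371 * 1000   # mean Earth radius in metres
SPEED_OF_SOUND = 343           # m/s, the velocity threshold described above

def hav(theta):
    # hav(theta) = sin^2(theta / 2)
    return math.sin(theta / 2) ** 2

def archav(h):
    # archav(h) = 2 * arctan2(sqrt(h), sqrt(1 - h))
    return 2 * math.atan2(math.sqrt(h), math.sqrt(1 - h))

def haversine_distance(lat1, lon1, lat2, lon2):
    # Great-circle distance in metres between two (latitude, longitude)
    # points given in degrees.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    lam1, lam2 = math.radians(lon1), math.radians(lon2)
    h = hav(phi2 - phi1) + math.cos(phi1) * math.cos(phi2) * hav(lam2 - lam1)
    return EARTH_RADIUS_M * archav(min(h, 1.0))

def is_velocity_anomaly(last, new):
    # `last` and `new` are (latitude, longitude, unix timestamp) tuples for
    # the previous and the current request.
    lat1, lon1, t1 = last
    lat2, lon2, t2 = new
    distance = haversine_distance(lat1, lon1, lat2, lon2)
    elapsed = t2 - t1
    if elapsed <= 0:
        # Two requests at the same time from different places are suspicious.
        return distance > 0
    return distance / elapsed > SPEED_OF_SOUND

# IP -> (latitude, longitude) with the MaxMind GeoLite2 database, for example
# using the geoip2 package:
#   import geoip2.database
#   reader = geoip2.database.Reader('GeoLite2-City.mmdb')
#   location = reader.city('203.0.113.7').location
#   latitude, longitude = location.latitude, location.longitude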

Resource Anomalies

To detect anomalies in the resource strings, 3 different models are used:

  1. Check if the string length matches what was previously seen.
  2. Check if the characters used match what was previously seen.
  3. Check if the grammatical structure matches what was previously seen.

These models are based on the paper “Anomaly Detection of Web-based Attacks”, but with some modifications, especially in the case of the grammatical structure model.

When validating a resource string, the fastest approach is to use the length model first, as it is very cheap while the other two models have a computational cost that scales with the string length. Next, one should use the character model, leaving the grammatical model for last. However, for the purpose of demonstration, the results of all 3 models are presented here.
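
As a structural illustration of that ordering, the three checks can be chained so the cheap models short-circuit the expensive one. The three callables here are hypothetical stand-ins for the models described below, not the tool’s actual interfaces.

def validate_resource(resource, length_valid, characters_valid, grammar_valid):
    # Cheapest check first: reject on string length before doing any
    # per-character work.
    if not length_valid(resource):
        return False
    # The character distribution check scales linearly with the string length.
    if not characters_valid(resource):
        return False
    # The grammatical model is by far the most expensive check, so it runs last.
    return grammar_valid(resource)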

1. String Length Model

A big problem in creating a statistical model based on string length is that it is very hard to assume anything about the length. The same instance of Udaru can be used for handling resources of different types, such as door numbers and file paths. Therefore the empirical distribution of the resource string length can easily look something like this:

Hypothetical string length histogram
Hypothetical string length histogram: bimodal distributions or worse can be expected. In general, it is very hard to assume anything about the distribution of the resource string length.

Because it is very hard to assume anything about the distribution, a good idea is to use the Chebyshev inequality, as it makes no assumptions about the distribution:

P(|x - \mu| > t) < \frac{\sigma^2}{t^2}

The Chebyshev inequality says that the probability that some value x deviates from the mean by more than a threshold t is less than \frac{\sigma^2}{t^2}. This is reformulated by setting the threshold to the current deviation |l - \mu| from the mean, where l is the length of the resource string being validated:

P(|x - \mu| > |l - \mu|) < \frac{\sigma^2}{\left(l - \mu\right)^2}

To validate a string length, the bound on P(|x - \mu| > |l - \mu|) is then compared against a fixed threshold; in this case, 10% is used. If the bound is greater than 10%, the string length is considered valid.

10\% \le P(|x - \mu| > |l - \mu|) < \frac{\sigma^2}{\left(l - \mu\right)^2}
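
A minimal sketch of this check, assuming the training data is simply a list of previously seen resource strings; the class and method names are illustrative, not the tool’s actual API.

import statistics

class LengthModel:
    def __init__(self, training_strings):
        lengths = [len(s) for s in training_strings]
        self.mean = statistics.mean(lengths)
        # Population variance, matching sigma^2 in the Chebyshev bound.
        self.variance = statistics.pvariance(lengths)

    def validate(self, resource, threshold=0.10):
        deviation = abs(len(resource) - self.mean)
        if deviation == 0:
            # A string of exactly the mean length is always accepted.
            return True
        # Chebyshev bound on P(|x - mean| > deviation); accept the length if
        # the bound is above the 10% threshold.
        return self.variance / deviation ** 2 > threshold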

2. Character Distribution Model

The character distribution model computes the empirical character distribution from the training dataset. For validation, it then compares this with the character distribution of a new resource string using a \chi^2-test. If the p-value is greater than 10%, the string is considered valid.

To increase the degrees of freedom, some characters are grouped into the same symbol. The character groupings are 0-9, a-f, g-z, A-F, and G-Z. The letters are split on F and G to allow the model to easily treat hexadecimal strings differently from strings that use all English letters.
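
A sketch of this model, assuming SciPy for the \chi^2-test. Characters outside the listed groups (such as “:” and “/”) are collected into an extra “other” bucket here, which is an assumption on top of the groupings described above.

from collections import Counter
from scipy.stats import chisquare

GROUPS = ['0-9', 'a-f', 'g-z', 'A-F', 'G-Z', 'other']

def group(ch):
    # Map a character to its group symbol. Everything outside the listed
    # ranges (":", "/", "-", ...) goes into an extra "other" bucket.
    if ch.isdigit():
        return '0-9'
    if 'a' <= ch <= 'f':
        return 'a-f'
    if 'g' <= ch <= 'z':
        return 'g-z'
    if 'A' <= ch <= 'F':
        return 'A-F'
    if 'G' <= ch <= 'Z':
        return 'G-Z'
    return 'other'

class CharacterModel:
    def __init__(self, training_strings):
        counts = Counter(group(ch) for s in training_strings for ch in s)
        total = sum(counts.values())
        # Empirical group frequencies from the training dataset.
        self.expected_freq = {g: counts[g] / total for g in GROUPS}

    def validate(self, resource, threshold=0.10):
        observed = Counter(group(ch) for ch in resource)
        observed_counts, expected_counts = [], []
        for g in GROUPS:
            if self.expected_freq[g] == 0:
                # A character class never seen in training is an anomaly.
                if observed[g] > 0:
                    return False
                continue
            observed_counts.append(observed[g])
            expected_counts.append(self.expected_freq[g] * len(resource))
        # chi^2 goodness-of-fit test against the training distribution;
        # the string is valid if the p-value is greater than 10%.
        _, p_value = chisquare(observed_counts, f_exp=expected_counts)
        return p_value > threshold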

3. Grammatical Model

The Grammatical model is by far the most complex model. The idea is to dynamically build a finite automaton that represents all valid resource strings. The difficult part is creating a finite automaton that doesn’t just fit the training dataset but generalizes to the actual pattern. However, the finite automaton must not end up being so general that it fits everything.

Grammatical Model
Grammatical Model: This describes resource strings such as resource:server:/public/server and rejects resource strings such as resource:server:/public/server:/etc/passwd.

This balance between fitting the training dataset and generalizing is accomplished using Bayesian statistics.

\begin{aligned} P(Model|Dataset) &= \frac{P(Dataset|Model)P(Model)}{P(Dataset)} \\ &\propto P(Dataset|Model)P(Model) \end{aligned}

When building the model the finite automaton is augmented with emission and transition probabilities, thereby making it a Markov Chain. The process of dynamically building this Markov Chain is called Bayesian Model Merging (see Inducing Probabilistic Grammars by Bayesian Model Merging).

As Markov Chains inherently make it theoretically trivial to compute the probability of a sequence, computing P(Dataset|Model) is straightforward. For the prior P(Model), however, there is no universally correct choice. In the udaru-anomaly-detection tool the following prior is used:

P(Model) = \sum_{s \in States} N^{-3 \cdot (e_s - 1) + 1} N^{-t_s}

Where e_s and t_s are the numbers of emissions and transitions for the state s.
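
As a small illustration, the prior exactly as written above can be evaluated in log-space (as recommended in the observations further below). The (e_s, t_s) counts are assumed to be available from the Markov Chain, and N is left as a parameter since its value is not pinned down here; this is a sketch, not the tool’s implementation.

import math

def log_prior(states, N):
    # log P(Model) for the prior above, computed in log-space. `states` is an
    # iterable of (e_s, t_s) pairs -- the emission and transition counts for
    # every state s -- and N is the base used in the prior.
    log_terms = [(-3 * (e_s - 1) + 1 - t_s) * math.log(N)
                 for e_s, t_s in states]
    # Stable log-sum-exp over the per-state terms.
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))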

Now that the unnormalized P(Model|Dataset) can be calculated, the model merging strategy can be described.


For each sequence in the dataset, do the following greedily, sequence by sequence:

    1. Extend the Markov Model by creating a linked list between the start and the end. Mark each of these states as unmergeable, meaning another state cannot be merged into an unmergeable state. However, an unmergeable state can be merged into a mergeable state.
    2. Attempt to merge each unmergeable state into all other mergeable states, do this greedily from start to end.
    3. If any of these merges caused P(Model|Dataset) to increase, take the merged model as the new best model. If P(Model|Dataset) wasn't increased by merging, mark the unmergeable state as mergeable. (A sketch of this loop is shown after the list.)
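
The loop itself can be sketched as follows. The model operations (extending the chain with a new sequence, enumerating merge targets, applying a merge) and the scoring function are passed in as callables, because their concrete implementations make up the bulk of the tool and are not reproduced here; this is a structural sketch, not the tool’s actual code.

def greedy_model_merging(model, sequences, extend, mergeable_states,
                         apply_merge, mark_mergeable, log_posterior):
    # Assumed interfaces:
    #   extend(model, seq)        -> (model, new_states): adds the sequence as
    #                                a linked list of unmergeable states
    #   mergeable_states(model)   -> iterable of merge targets
    #   apply_merge(model, a, b)  -> model with state a merged into state b
    #   mark_mergeable(model, s)  -> model with state s marked as mergeable
    #   log_posterior(model)      -> unnormalized log P(Model|Dataset)
    for seq in sequences:
        model, new_states = extend(model, seq)              # step 1
        for state in new_states:                            # step 2, greedy
            best_model, best_score = model, log_posterior(model)
            for target in mergeable_states(model):
                candidate = apply_merge(model, state, target)
                score = log_posterior(candidate)
                if score > best_score:
                    best_model, best_score = candidate, score
            if best_model is model:
                # Step 3: no merge improved P(Model|Dataset), so this state
                # becomes a merge target for later states.
                model = mark_mergeable(model, state)
            else:
                model = best_model
    return model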

To make Bayesian Model Merging work in practice, you have to go well beyond what the original paper describes. Some key observations:

  • Use BeamSearch to approximate P(Dataset|Model). The final grammatical model often has a polynomial complexity. However, when exploring possible models it is almost unavoidable to encounter Markov Models that contain cycles that cause an exponential computational complexity when calculating P(Dataset|Model).
  • If a string already fits the model, simply increase the probabilities along the most likely path.
  • Compute all probabilities in log-space. This is much more numerically stable and it makes the Bayesian prior very cheap to compute.
  • Consider checking for model pruning opportunities. If two states are self-linking and link to each other, attempt to merge those two states.
  • Before merging the new sequence into the model, merge equivalent subsequent emissions together. This dramatically shortens the new sequence and therefore the number of merges to check (see the short sketch after this list).
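
For the last point, merging equivalent subsequent emissions can be as simple as collapsing runs of identical symbols before the sequence is turned into new states. This is a simplification of whatever the tool actually does.

from itertools import groupby

def collapse_runs(sequence):
    # Collapse runs of identical emission symbols into (symbol, count) pairs,
    # e.g. "aaab" -> [('a', 3), ('b', 1)], so a run becomes one (potentially
    # self-looping) state instead of one state per character.
    return [(symbol, sum(1 for _ in run)) for symbol, run in groupby(sequence)]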

Together these optimizations bring the training time down from a couple of days to less than 5 minutes on 100 resource strings.

Running The Tool

The udaru-anomaly-detection tool is an experiment which is why there is no UI and the CLI is limited in terms of configuration.

To install the udaru-anomaly-detection tool, run:


You can then set up Udaru with Trail integration using the following code:


Alternatively, if you just want to try it out, you can insert some fake data into the Trail server. In this case you still need the Trail server but not the Udaru server. To start a standalone Trail server, use npx trail-hapi-server. With the Trail server running, you can insert some fake data with:

It is then possible to train and test the model with:

Demo


Training: Shows training on all observations between 2017-01-01 and 2017-12-31; there are 100 resource strings in this interval. The trained models are stored in ./cli_model. Training the length model and the character distribution model is very fast. The grammatical model is much more expensive but still only takes about 3 minutes. It appears to hang a few times, especially at resource string number 47. This is because that string contains a “-” character, which is not part of the existing grammatical model. The training algorithm therefore has to do a lot of work to extend the grammatical model appropriately.
