Using Selectors For Open Source Intelligence

A “selector” is not a generally defined term in enterprise security, but selectors are important for understanding open source intelligence and investigations in the digital realm. Building on our previous technical blog defining a selector, we will be diving deeper into selectors and how they enable external threat hunting, attribution, and open source intelligence analysis.

Criticality of Uniqueness

Using selectors, we can identify specific entities that are unique in a digital realm and that help us understand the “how,” “why,” and potentially “who” of suspicious activity. With this context, organizations can decide whether to dedicate resources to chasing the threat.

Selectors are entities such as MAC addresses, email addresses, phone numbers, and IP addresses. We can analyze each one to determine how unique it is, and then potentially correlate it with other selectors and with the entity where we found it.

We can then cross reference that selector in other data for greater refinement and enrichment.

Data Engineering Considerations for Selectors

The first consideration for aggregating data and selectors is mapping to a common data model. An organization needs to be able to identify the elements of a particular data set that are shared amongst the other data they have.

They need to consider the fields to capture in the data and how they want to standardize the data, all in the context of the investigative problem they are trying to solve.
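As a minimal sketch of what mapping to a common data model can look like, the example below normalizes rows from two differently shaped sources into one record type. The `SelectorRecord` type, its field names, and both source layouts are assumptions made for illustration, not a standard or the model used in any particular product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SelectorRecord:
    """One observation of a selector in a shared shape (hypothetical model)."""
    kind: str                          # e.g. "email", "ip", "mac", "phone"
    value: str                         # normalized selector value
    source: str                        # data set the observation came from
    observed_at: Optional[str] = None  # ISO 8601 timestamp, if the source has one


def from_breach_row(row: dict, source: str) -> SelectorRecord:
    # Assumed source layout: {"Email": "..."}.
    # Lowercasing is a common normalization choice, not a universal rule.
    return SelectorRecord("email", row["Email"].strip().lower(), source)


def from_netflow_row(row: dict, source: str) -> SelectorRecord:
    # Assumed source layout: {"src_ip": "...", "ts": "..."}; "ts" may be absent.
    return SelectorRecord("ip", row["src_ip"].strip(), source, row.get("ts"))
```

Keeping `source` and `observed_at` on every record means later questions about where a selector came from, and when, can be answered from the model itself.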


The availability of the data is an added component in the data aggregation and engineering process.

Key questions and considerations include:

  • How many data sources can you find that include a specific selector?
  • How expensive are those data sources?
  • How reliable are they?
  • How well documented are they?

The quality of documentation is often a sign of how much community knowledge exists about that type of selector.


When aggregating data for analysis, the provenance of data – that is, the ability to identify the source of the data – is probably the most critical element from a data engineering standpoint.

For example, during an investigation, if we discovered an email address associated with a password, and another email address associated with the same password, we might conclude that both addresses likely belong to the same individual.

However, it’s critical to know where those two sets of usernames and passwords came from. If we can’t track back to the specific data set, then we can’t know whether this shared password is coincidence or correlation.
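The sketch below illustrates the point with invented data: it groups email addresses that share a password and reports whether every hit can be traced to a named data set. The records, the source names, and the `shared_passwords` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (email, password, source data set or None if unknown)
records = [
    ("a@example.com", "Tr0ub4dor&3", "breach_2019"),
    ("b@example.org", "Tr0ub4dor&3", "breach_2021"),
    ("c@example.net", "123456", "breach_2019"),
    ("d@example.net", "123456", None),  # provenance lost
]


def shared_passwords(rows):
    """Map each password seen with 2+ emails to those emails and a traceability flag."""
    by_password = defaultdict(list)
    for email, password, source in rows:
        by_password[password].append((email, source))

    results = {}
    for password, hits in by_password.items():
        if len(hits) < 2:
            continue
        results[password] = {
            "emails": sorted(email for email, _ in hits),
            # True only if every hit can be tied to a named data set.
            "traceable": all(source is not None for _, source in hits),
        }
    return results
```

A match flagged as not traceable is a lead to verify, not a conclusion. Even a traceable match on a very common password is weak evidence that two accounts share an owner.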

Usability of Data

Finally, the usability of the data sets as a whole contributes to our ability to use the selector individually. Questions to ask include:

  • What is the volume of data that an organization is going to have to absorb?
  • What is the density of the information or significant information?

For example, if an organization has to review multiple petabytes from a single data source to find one or two useful selectors, it is probably consuming the wrong data set.

Operationalizing Data Engineering for Open Source Intelligence

As an analyst, it’s important to understand not only what data is available, but how that data is relevant to an investigative problem being solved. Ultimately, an analyst must be able to use data in a time sensitive manner.

For example, a phone number without a country code or an area code would be very easy to confuse with multiple other phone numbers that look exactly the same. If the analyst does not have a good sense of where that data came from, they could be using the wrong selector for their use case.
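A small sketch of the phone number problem, assuming a hypothetical `to_e164` helper: the same national digits produce different selectors depending on the country code, so the function refuses to guess when the country is unknown. It only concatenates digits and does not handle national trunk prefixes or validate number lengths.

```python
import re
from typing import Optional


def to_e164(raw: str, country_code: Optional[str] = None) -> Optional[str]:
    """Return an E.164-style string, or None if the country cannot be determined."""
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+"):
        # The number already carries its own country code.
        return "+" + digits
    if country_code is None:
        # Ambiguous: without provenance we cannot know which country applies.
        return None
    return "+" + country_code + digits


# The same digits from two differently sourced data sets are different selectors.
uk = to_e164("20 7946 0958", "44")
au = to_e164("20 7946 0958", "61")
```

In practice the `country_code` argument would come from what is known about the data set, which is why the provenance of the source matters here.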

Further, if an analyst does not want to make multiple queries, the data engineers may have to store multiple variants of the data in the organization’s data model, thus increasing the amount of data stored.

For example, an organization might want to store password data from third-party breaches to alert their security team if employees have non-service credentials that match corporate account credentials.

They might also want to store hashes of those passwords so that they can automatically find correlations and matches against them.
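One way to picture storing multiple variants is an index keyed by several digests of each password, so a hash seen elsewhere can be matched with a single lookup. This is a sketch under assumptions: `hash_variants` and `build_hash_index` are hypothetical helpers, and the choice of unsalted MD5, SHA-1, and SHA-256 is illustrative only.

```python
import hashlib


def hash_variants(password: str) -> dict:
    """Return several unsalted digests of one password (illustrative choice)."""
    data = password.encode("utf-8")
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def build_hash_index(passwords):
    """Map every digest variant back to (algorithm, original password)."""
    index = {}
    for password in passwords:
        for algorithm, digest in hash_variants(password).items():
            index[digest] = (algorithm, password)
    return index


index = build_hash_index(["password", "hunter2"])
match = index.get("5f4dcc3b5aa765d61d8327deb882cf99")  # MD5 of "password"
```

The trade-off is the one described in the text: each extra variant multiplies what has to be stored. Salted or deliberately slow hashes would not match this kind of precomputed index at all.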

If they’re trying to automate discovery and detection across their system, having data stored in specific ways is required because they need the system to be able to look across all elements and all variations necessary.

However, if an organization relies on on-demand analysis to minimize data storage requirements, that choice prevents it from performing this kind of automated discovery.

Importance of Context

A key element for an analyst to understand is the context of a data set used for a query. This context is largely dependent on the elements of the data described above.

Since selectors are often the basic building blocks of analytical conclusions, bad context could lead to inaccurate intelligence analysis.

Therefore, it’s critical for analysts to understand what goes into the backend data engineering regarding the availability, usability, and provenance of selectors.

For example, it’s very helpful to know the Organizationally Unique Identifier (OUI) tied to a MAC address, because the OUI identifies the vendor to which that block of addresses was registered. If an analyst is familiar with the data model they are using, they can request that the engineers tag any MAC address with its OUI.
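Tagging a MAC address with its OUI can be sketched as follows. The OUI is the first three octets of a 48-bit address. The `parse_mac` helper and its output field names are hypothetical, and resolving an OUI to a vendor name would need a lookup against the IEEE registry, which is not shown.

```python
import re


def parse_mac(mac: str) -> dict:
    """Split a 48-bit MAC address into its OUI and flag local administration."""
    hex_digits = re.sub(r"[^0-9A-Fa-f]", "", mac).upper()
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    first_octet = int(hex_digits[:2], 16)
    return {
        "mac": ":".join(hex_digits[i:i + 2] for i in range(0, 12, 2)),
        "oui": "-".join(hex_digits[i:i + 2] for i in range(0, 6, 2)),
        # Bit 0x02 of the first octet marks a locally administered address,
        # in which case the first three octets are not a vendor assignment.
        "locally_administered": bool(first_octet & 0x02),
    }


tagged = parse_mac("00:1a:2b:3c:4d:5e")
```

The `locally_administered` flag is worth keeping alongside the OUI because addresses set in software, such as randomized ones, do not point back to a hardware vendor.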

In addition to understanding the data model itself, it’s also important for the analyst to know there might be data that isn’t being returned during a query.

This is often where enrichment falls short.

If an analyst is querying a specific IP address, knowing the timestamp associated with that IP address is very important. If an analyst doesn’t understand the data model, they might not know whether the IP address records they are querying carry a timestamp at all.

If the query doesn’t return that element, the analyst may miss relevant data simply because the query tool isn’t properly tuned for the query being run.
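A sketch of a query that makes this gap visible rather than hiding it: hits for an IP address are split into those with a timestamp and those without, so the analyst can see what a time-filtered query would silently drop. The observations, field names, and `query_ip` helper are invented for illustration.

```python
# Hypothetical observations; "observed_at" may be missing in some sources.
observations = [
    {"ip": "192.0.2.10", "observed_at": "2021-03-01T12:00:00Z", "source": "proxy_logs"},
    {"ip": "192.0.2.10", "observed_at": None, "source": "paste_site"},
    {"ip": "198.51.100.7", "observed_at": "2021-03-02T08:30:00Z", "source": "proxy_logs"},
]


def query_ip(rows, ip):
    """Return hits for an IP, separating records that lack a timestamp."""
    hits = [row for row in rows if row["ip"] == ip]
    return {
        "timestamped": [row for row in hits if row["observed_at"]],
        "missing_timestamp": [row for row in hits if not row["observed_at"]],
    }


result = query_ip(observations, "192.0.2.10")
```

Returning the untimestamped hits as their own bucket, instead of filtering them out, lets the analyst decide how much weight to give them.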

Bringing Together the Framework During an Investigation

In a recent investigation, we were tasked with investigating a mobile application that was being used for fraud. We were able to identify the developer certificate within the application. The certificate was a very valuable selector when judged against the following criteria:

Scope: Scope explains which systems can access the identifier. Developer certificates were single scoped to a very specific realm.

Uniqueness: Uniqueness establishes the likelihood that identical identifiers exist within the associated scope. This selector was unique because one developer certificate was assigned per developer.

Reset-ability and Persistence: Reset-ability and persistence define the lifespan of the identifier and explain how it can be reset. In this instance, the certificate cannot be reset, so it is one hundred percent persistent.

Integrity: A selector that is difficult to spoof or replay can be used to prove that the associated device or account has certain properties. We know the developer certificate can’t be spoofed because it’s defined by the development platform within its own unique systems.

Non-Repudiation: Non-repudiation means the service provides proof of the integrity and origin of data, creating an authentication that can be considered genuine with high confidence. There isn’t necessarily proof of non-repudiation here, but developer certificates are single sourced by the provider, which in this case was a trusted provider.

Provenance: Provenance provides the ability to trace information to the source of the data. Similar to non-repudiation, the development platform is the source.
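The criteria above can be recorded as a simple checklist so that weaker points stay visible next to the selector. This is a sketch only: the `SelectorAssessment` type and the three-level rating scale are assumptions, and the ratings given to the developer certificate are one reading of the evaluation above rather than a formal scoring method.

```python
from dataclasses import dataclass

CRITERIA = ("scope", "uniqueness", "persistence", "integrity",
            "non_repudiation", "provenance")
RATINGS = ("weak", "partial", "strong")


@dataclass
class SelectorAssessment:
    selector: str
    ratings: dict  # criterion name -> one of RATINGS

    def __post_init__(self):
        # Every criterion must be rated, and only with a known rating.
        missing = set(CRITERIA) - set(self.ratings)
        if missing:
            raise ValueError(f"unrated criteria: {sorted(missing)}")
        invalid = set(self.ratings.values()) - set(RATINGS)
        if invalid:
            raise ValueError(f"unknown ratings: {sorted(invalid)}")

    def caveats(self):
        """Criteria that are not rated strong, in checklist order."""
        return [name for name in CRITERIA if self.ratings[name] != "strong"]


cert = SelectorAssessment(
    "developer certificate",
    {
        "scope": "strong",
        "uniqueness": "strong",
        "persistence": "strong",
        "integrity": "strong",
        "non_repudiation": "partial",  # no strict proof, but single sourced
        "provenance": "strong",
    },
)
```

Calling `cert.caveats()` returns the criteria an analyst should mention when reporting conclusions built on this selector.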

Based on this evaluation, the developer certificate was an excellent starting point to obtain additional context about the actor group that represents a threat, in this instance app-based fraud. Ultimately, we were able to work with the developer platform to remove the account being used for fraud, which not only shut down this fraudulent app, but prevented the developer from using the same infrastructure to develop other apps as well.