Five Critical Data Source Considerations for Adversary Attribution
Strong intelligence is the foundation of adversary attribution; nothing replaces the holistic picture created by combining technical indicators with HUMINT and OSINT sources.
While many cyber threat intelligence teams focus on technical events and indicators that lead to creating signatures and detecting behaviors from those with malicious intent, other intelligence units within the organization are interested in identifying the human aspect behind those malicious actions.
The basic difference between cyber threat intelligence and e-crime, insider threat attribution, and trust and safety work is that, for the latter, the organization already knows an event has occurred. Cyber threat hunting and intelligence take a more proactive stance, searching for events that may or may not be occurring not only within an organization’s own network but also within similarly sectored organizations around the world.
Because it is not feasible to acquire internal datasets from every organization around the world, analysts rely on various external datasets (e.g., netflow outside the firewalls, DNS, adtech data).
For adversary attribution, there is often no associated selector (IP address, date/time, email address, phone number) to use as a jumping-off point for the investigation. While host-based artifacts typically provide the highest fidelity, legal issues often constrain the investigation. Further, host-based artifacts often do not exist in the first place because the organization did not know to collect them.
Many assume that without host-based artifacts an investigation is dead in the water. Below are several critical datasets that can help solve a range of e-crime, insider threat, and trust and safety investigations.
Darknet and Clearweb Forums
While darknet data can produce a lot of noise in cyber threat intelligence investigations, gathering this type of data is critical when targeting a very specific requirement. Darknet data is typically most useful after an incident has taken place, generating leads from observing specific actors in the forums. Conducting backstopped persona management or having established sources is often critical to finding a starting point.
A collection of legally acquired datasets from the clear and dark web containing publicly available user account information is often critical. Equally critical are the holder’s obligations to maintain such datasets under legally compliant rules of engagement, including but not limited to off-site storage in encrypted form and restricting access to active engagements under controlled circumstances.
When properly and legally accessed, this data adds value by proactively preventing unauthorized personnel from accessing a network; combined with a successful two-factor authentication rollout, it provides a solid layer of defense.
To assist adversary attribution, credential collections can be used to identify accounts that share passwords. This allows for the potential identification of additional selectors and, in some cases, the threat actor directly.
Online fraudsters, disinformation actors, and scammers often slip up and reuse passwords on numerous different accounts and make mistakes that reveal their own true identity.
Under the right circumstances, breach datasets can be a critical component in defeating bad actors at their own game.
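The password-reuse pivot described above can be sketched in a few lines. Everything here is illustrative: the records, email addresses, and hashes are hypothetical, and real work would run against legally acquired credential collections with normalized hash formats.

```python
from collections import defaultdict

# Hypothetical breach records: (account, password hash) pairs drawn from
# legally acquired credential collections. All values are invented.
records = [
    ("fraud.persona@example.com", "5f4dcc3b5aa765d61d8327deb882cf99"),
    ("jane.doe@example.org",      "e99a18c428cb38d5f260853678922e03"),
    ("real.identity@example.net", "5f4dcc3b5aa765d61d8327deb882cf99"),
    ("unrelated@example.com",     "098f6bcd4621d373cade4e832627b4f6"),
]

def shared_password_accounts(records):
    """Group accounts by password hash; return only groups with more than one account."""
    by_hash = defaultdict(set)
    for account, pw_hash in records:
        by_hash[pw_hash].add(account)
    return {h: sorted(accts) for h, accts in by_hash.items() if len(accts) > 1}

pivots = shared_password_accounts(records)
```

Each surviving group links a throwaway persona to other accounts reusing the same password, each of which may carry additional selectors (recovery email, phone number) worth pursuing.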
Open Source Media
Open source financial and geopolitical data not only adds valuable context to adversary attribution; oftentimes it is the only place to start.
Data sources that curate foreign press, sanctions lists, litigation records, NGO information, and corporate registries are critical starting points. Also often critical are blockchain analysis tools for tracing cryptocurrency scams, public information lookup sites such as Intelius (typically useful only for the US), host-country court records that may disclose additional PII, business registries, and internet records such as domain registrations.
Even performing open source analysis, such as querying persona selectors in potential local university records, can yield valuable insights.
Technical Techniques in Social Media Data
The most fundamental aspect of advanced analysis and targeting of social media data and its technical elements is a defined data model and schema. This allows acquired data to be normalized so that disparate sources can be leveraged together, creating temporal, contextual, and discoverable content.
The bulk acquisition of platform data brings the data into a stored and normalized schema which allows both strategic and tactical analysis and targeting of specific content and events, without the limits of the platform. It also allows temporal analysis of critical events and time windows related to the investigation.
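The normalization step above can be sketched as follows. The two platform record layouts, field names, and values are hypothetical stand-ins for whatever raw formats a real acquisition pipeline produces; the point is mapping them onto one schema so events from different sources sort on a single timeline.

```python
from datetime import datetime, timezone

# Hypothetical raw records from two platforms with different field names.
raw_a = {"user": "actor1", "text": "meet at the usual place", "ts": 1699990000}
raw_b = {"handle": "actor2", "body": "payment sent", "created_at": "2023-11-14T22:13:20+00:00"}

def normalize(record):
    """Map platform-specific fields onto one schema: author, content, timestamp (UTC)."""
    if "ts" in record:  # platform A stores epoch seconds
        return {
            "author": record["user"],
            "content": record["text"],
            "timestamp": datetime.fromtimestamp(record["ts"], tz=timezone.utc),
        }
    # platform B stores an ISO-8601 string
    return {
        "author": record["handle"],
        "content": record["body"],
        "timestamp": datetime.fromisoformat(record["created_at"]),
    }

# Once normalized, events from both platforms sort on one timeline.
events = sorted((normalize(r) for r in (raw_b, raw_a)), key=lambda e: e["timestamp"])
```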
For instance, techniques such as bulk-processing social media videos to build a lexicon from audio-to-text and screen-to-text transcription allow data-driven discovery against large, unordered data. This creates a more complete understanding of the content through keyword discovery, sentiment analysis, and latent semantic indexing.
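Once the audio and on-screen text have been transcribed (by whatever speech-to-text or OCR pipeline is available), keyword discovery can run over the resulting corpus. A minimal frequency-based sketch, with invented transcripts and a deliberately tiny stopword list:

```python
from collections import Counter
import re

# Illustrative stopword list; production work would use a full list.
STOPWORDS = {"the", "a", "to", "of", "and", "in", "is", "on", "for", "at", "so"}

def discover_keywords(transcripts, top_n=3):
    """Tokenize transcribed text, drop stopwords, return the most frequent terms."""
    tokens = []
    for text in transcripts:
        tokens += [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]
    return Counter(tokens).most_common(top_n)

# Hypothetical transcripts produced by a video-transcription stage.
transcripts = [
    "transfer the funds to the exchange before friday",
    "the exchange flagged the transfer so use a new wallet",
    "wallet address is in the usual channel",
]
top_terms = discover_keywords(transcripts)
```

Real pipelines would layer sentiment analysis and latent semantic indexing on top of the same normalized token stream; the frequency count is just the simplest form of keyword discovery.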
Other important techniques include leveraging the accessibility features within a site’s code, such as image alt text, allowing textual analysis of social media image posts.
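Extracting that accessibility text can be done with a standard HTML parser. The snippet and sample markup below are illustrative; real pages would come from the acquisition pipeline.

```python
from html.parser import HTMLParser

class AltTextExtractor(HTMLParser):
    """Collect the alt attribute of every <img> tag for downstream textual analysis."""
    def __init__(self):
        super().__init__()
        self.alt_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.alt_texts.append(alt)

# Hypothetical fragment of a social media page.
html = '<div><img src="p.jpg" alt="Screenshot of a wire transfer receipt"><img src="q.jpg"></div>'
parser = AltTextExtractor()
parser.feed(html)
```

The collected strings can then feed the same keyword and sentiment analysis applied to transcribed video content.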
Mobile and adtech data are collected to target advertisements to users through mobile apps and browser data. This data can sometimes contain personal information, but more often it carries a unique advertising identifier that identifies a device by its attributes rather than an individual by name.
Attributes associated with adtech data include Wi-Fi networks the device has connected to, IP addresses the device has been assigned, geographic location, phone/computer model, browser version, and in some cases deeper historical data centered on purchasing interests.
Using this data, an analyst can identify a single device by IP address or location and follow it chronologically to determine what that device did from different addresses and networks.
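That chronological follow can be sketched as a filter-and-sort over observation records. The advertising identifier, timestamps, documentation-range IPs, and locations below are all hypothetical:

```python
from datetime import datetime

# Hypothetical adtech observations keyed by advertising identifier.
observations = [
    {"ad_id": "AAAA-1111", "seen": datetime(2024, 3, 1, 9, 0),   "ip": "203.0.113.7",  "location": "Vienna"},
    {"ad_id": "BBBB-2222", "seen": datetime(2024, 3, 1, 10, 0),  "ip": "198.51.100.4", "location": "Oslo"},
    {"ad_id": "AAAA-1111", "seen": datetime(2024, 3, 2, 18, 30), "ip": "198.51.100.9", "location": "Prague"},
]

def device_trajectory(observations, ad_id):
    """Return the chronological (time, IP, location) path of one advertising ID."""
    hits = [o for o in observations if o["ad_id"] == ad_id]
    return [(o["seen"], o["ip"], o["location"])
            for o in sorted(hits, key=lambda o: o["seen"])]

path = device_trajectory(observations, "AAAA-1111")
```

Each entry in the resulting path places the same device at a new network or location, letting the analyst reconstruct its movement over time.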