Ask the Analyst: Nisos Anti-scraping Expert Scott Tessier

May 30, 2024 | Blog

The global market for web scraping surpassed $600 million in 2023, with some estimates projecting the market to balloon to $1.8 billion by 2031. The growth of web scraping is driven, in part, by the movement toward data-driven decision-making across various industries. This thirst for data has spawned an ecosystem of legitimate businesses and illicit operations focused on scraping and reselling data for profit.

While web scraping often violates online platforms’ terms of service, it can be challenging for even the most sophisticated security teams to combat. In addition to exposing users’ sensitive personal data, online platforms have suffered server overloads and service disruptions due to unauthorized web scraping of their properties.

In this blog, we interview Nisos Intelligence Analyst and anti-scraping expert Scott Tessier about the rise in web scraping. Scott is a seasoned intelligence analyst who helps leading online platforms uncover and unmask threats. His work has given him unique insight into the marketplace for legitimate and underground web scraping services.

1. Who are the players in the scraping ecosystem? And how do their motivations differ?

There are a lot of different groups that all coalesce around the idea of web scraping. The biggest and most important is what we call the commercial scraping-as-a-service industry. These range from prominent companies to smaller one- or two-person outfits that scrape for profit, like a small business operation that just provides scraping tools, apps, or services to customers for money. The offerings range from very sophisticated scraping products, partnered with web unblockers and proxy management services that allow people to scrape without fear of being blocked, all the way down to very rudimentary scraping code. So they’re definitely the biggest player in the ecosystem.

There’s a large freelancer or sole proprietor population. Anyone with an account can go on sites like Fiverr, Freelancer, or Upwork and post a job to hire somebody to scrape a website. They can very often hire somebody for under $100, sometimes even significantly cheaper, and there are thousands of people doing this on these freelancing sites.

There are also large cohorts of academics who scrape for very different reasons, generally to answer some sort of social science question or to train an AI or ML model, which requires a huge data set. So they’ll go out and scrape massive data sets to help train their models.

Then there are the marketing and business intelligence ecosystems. There are vast amounts of insights to be gained from harvesting data on social media platforms, which hold information about people’s likes, dislikes, brand resonance, advertising campaigns, and all sorts of other things that can be gleaned from that data.

Finally, there are cyber criminals. They scrape data to harvest and sell personally identifiable information. And they will buy scraped data to advance social engineering campaigns or other illicit activities within the underground economy.

2. How big do you think the scraping-as-a-service industry is? Who does it generally serve, and who is paying for those services?

We’ve found it’s large. There are tens of thousands of people involved in the ecosystem who are actually doing the scraping. There are a lot of different growth forecasts for the industry, and the specifics differ widely, but it’s generally forecast for double-digit compound annual growth for the foreseeable future. So it’s a problem that’s really only going to get bigger.
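As a rough sanity check on the "double-digit compound growth" forecast, the short sketch below computes the growth rate implied by the two market figures cited in the introduction (roughly $600 million in 2023 and one estimate of $1.8 billion by 2031). The function name and the choice of an eight-year horizon are assumptions made for illustration, and both inputs are approximate estimates rather than precise data.

```python
# Implied compound annual growth rate (CAGR) from two market-size estimates.
# Inputs are the approximate figures cited in the introduction; the helper
# name `implied_cagr` is hypothetical and used only for this illustration.


def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Return the constant yearly growth rate that turns start_value into end_value."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


START_USD = 600e6        # about $600 million in 2023
END_USD = 1.8e9          # one estimate: $1.8 billion by 2031
YEARS = 2031 - 2023      # assumed 8-year horizon

rate = implied_cagr(START_USD, END_USD, YEARS)
print(f"Implied CAGR: {rate:.1%}")
```

Under these assumptions, tripling the market over eight years works out to a rate in the mid-teens per year, which is consistent with the double-digit forecasts mentioned above.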

In terms of who’s buying it, it’s a really diverse set of actors. The biggest consumer of this data is the marketing, business intelligence, lead generation, and traffic generation industry, because you have the ability to harvest hundreds of thousands or millions of emails. This is obviously very valuable for them. But we’ve found people from all different walks of life going out and searching for this data, from cyber criminals to small businesses like dance studios. It’s really anybody who’s looking for any sort of insights that can help them grow their business.

3. How are cybercriminals using scraped data today? And do they typically buy or try to steal the data themselves?

There’s a pretty robust ecosystem of threat actors on a large number of cybercrime forums. These actors really focus more on hacked rather than scraped data, but there is still a large amount of scraped data being trafficked on these sites.

Very often, we’ll see scraped data sets with tens of millions or even hundreds of millions of lines of user data for sale. For these actors, the motivation is a bit different. People posting the data sets are doing it essentially exclusively for profit. They know there’s value in the data, so they scrape it and sell it. As for the people buying the scraped data, we ultimately don’t have ground truth on why a cybercriminal might buy a given data set, but we can certainly infer that the data is being used for social engineering campaigns, fraud, and other illicit activities at scale, where a large number of people’s data can be exploited at once.

4. What can online platforms do to combat scraping from their platform?

The most important thing that we see that deters people is just having really robust anti-bot protocols, browser fingerprinting, and physical defenses against scrapers in particular.

If you’re trying to deter cyber criminals, their sophistication varies widely, and there are a lot of people out there who I think are kind of easily deterred. If you have pretty robust anti-bot mechanisms and browser fingerprinting and things of that nature, it is going to be a significant deterrent for many people, particularly more casual and recreational scrapers. We’ve also seen a lot of companies try and go the legal route, and certainly, this can be successful. But there have been some changes in legal precedent in the last year that I think are making this route a little bit more difficult.

Companies can also work with threat intelligence providers that can offer insights into who is scraping their data, how they are scraping it, and why they are doing it. Intelligence providers can engage with the threat actors and obtain insights into their infrastructure and methods. This data can help companies better hone their defenses or conduct targeted outreach to web hosts or third-party websites that are hosting or enabling the scrapers’ activities, to try to mitigate the threat further upstream.
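To make the anti-bot and browser fingerprinting defenses mentioned in this answer a little more concrete, here is a minimal sketch in Python. It is not how any particular platform or vendor implements these controls. It simply hashes a few request headers into a crude fingerprint and flags any fingerprint that exceeds a request budget within a sliding time window. The header list, the class name `SlidingWindowDetector`, and the thresholds are all assumptions chosen for illustration. Production systems draw on far richer signals and are much harder to evade.

```python
import hashlib
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional

# Hypothetical set of signals. Real browser fingerprinting uses many more
# (TLS parameters, canvas rendering, fonts, and so on). Header lookup here
# is case-sensitive, so callers must pass these exact key names.
FINGERPRINT_HEADERS = ("User-Agent", "Accept-Language", "Accept-Encoding")


def fingerprint(headers: Dict[str, str]) -> str:
    """Hash the selected header values into a stable identifier."""
    parts = [headers.get(name, "") for name in FINGERPRINT_HEADERS]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


class SlidingWindowDetector:
    """Flag fingerprints that send too many requests in a time window."""

    def __init__(self, max_requests: int = 100, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def is_suspicious(self, headers: Dict[str, str],
                      now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[fingerprint(headers)]
        hits.append(now)
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


# Usage: a client that sends 5 requests in 5 seconds against a budget of 3.
detector = SlidingWindowDetector(max_requests=3, window_seconds=10)
bot = {"User-Agent": "python-requests/2.31"}
flags = [detector.is_suspicious(bot, now=t) for t in range(5)]
print(flags)  # [False, False, False, True, True]
```

A check like this mostly deters the casual scrapers described above. The more sophisticated operators mentioned earlier, who pair scraping tools with proxy management and web unblockers, can rotate headers and IP addresses to defeat it.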

5. What’s next? How will AI potentially evolve scraping and the attacks that follow?

I think we’re really only starting to see the beginning of AI-enabled scraping, and it’s been interesting to see how much it has evolved, even just within the last year. If you’re a really sophisticated scraper, or a marketing company with a third-party commercial provider that pairs its scraping technology with proxy management, headless browsers, and the other things you need to avoid detection, then there’s not a whole lot that AI-enabled scraping is doing for you right now.

What it is doing, however, is allowing anybody without any real degree of technical sophistication or coding experience to go into their AI platform of choice and come up with pretty sophisticated scraping code. In a lot of instances the code is very good, and many models are good at troubleshooting code as well. So it’s probably not changing the most sophisticated threats, but it is broadening the pool of people and lowering the barriers to entry into the market.

About Nisos®

Nisos is the Managed Intelligence Company. We are a trusted digital investigations partner, specializing in unmasking threats to protect people, organizations, and their digital ecosystems in the commercial and public sectors. Our open source intelligence services help security, intelligence, legal, and trust and safety teams make critical decisions, impose real world consequences, and increase adversary costs.