The Role Of Ethical Web Scraping In The Development Of The Modern Internet

A decade ago barely anyone knew what web scraping was. Today, the process of automated public web data collection is the foundation of many business models. Some digital companies could not even exist without proxies or web scraping.

We’re delving deeper into the history and importance of web scraping with the Lead of Commercial Product Owners at Oxylabs, Nedas Višniauskas. He is currently working with some of the leading automated publicly available data acquisition solutions and has seen them go from basic prototypes to highly specialized scraping giants.

How did you get familiar with web scraping, proxies, and other data-gathering solutions?

I was familiar with the concept of proxies and their use way before I joined Oxylabs. Of course, I knew them mostly as a consumer or layman rather than from a business perspective.

At Oxylabs, I got to see and experience the scraping industry first-hand. As I began my career here as an account manager, I wasn’t directly involved with public web data acquisition, but I got to take care of a lot of businesses that were. That was a great introduction to the industry, as I got to see the development, challenges, and victories of web scraping from many different angles. I got to see how valuable publicly accessible data could be and the potential business benefits (such as competitor analysis) it could bring.

What do you think has been the impact of web scraping on the internet at large? How has it changed the way regular users perceive the internet?

I would say that web scraping has liberated the accessibility of goods and services. What I mean by this is that automated data acquisition has opened the door to aggregation, which makes information more accessible to the regular internet user or consumer.

Customers will always look for the best deals, whether in quality, price, or any other measure of worth. Previously we had to manually go through search results (or, even earlier, catalogs). Web scraping has made it possible for aggregators (e.g., Idealo, Skyscanner) to exist, where consumers can compare thousands if not millions of products at once.

That has changed how businesses have to compete. Providing just the best product or service is no longer enough. Now the “best deal” includes a lot of conveniences such as shipping speed, guarantees, customer service, etc. Of course, that causes the competition to become increasingly fierce.

On the other hand, businesses have benefited from web scraping as well. Data is, at least nowadays, the lifeblood of all business. Automated collection has created a new type of data, external data, that can be used to validate or generate insights.

Let’s take an e-commerce store as an example. Online retail has always been data-hungry but was mostly limited to internal sources. The predictions made according to those sources were always partly incomplete as the data is somewhat biased. With external data, however, businesses can access information previously unavailable. It’s also much “closer” to the consumer.

External data is being used everywhere. Companies predict market trends, perform research, improve products and services all by scraping and analyzing public data from the internet. It has been a treasure trove of information for anyone who can get over the barrier to entry.

Finally, I’d say that web scraping is the reason the modern internet exists at all. Most people overlook the fact that Google, Bing, Yahoo, and every other search engine are built on crawling practices. Page indexing follows much the same process as any other scraping venture. So, without scraping, there would be no search engines, which are the core of the current iteration of the internet.
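
Editor's illustration: the link between crawling and indexing can be shown with a toy example. The sketch below is not how any real search engine works. It walks a hypothetical in-memory "web" (the `PAGES` dictionary and the `example.test` URLs are made up for illustration), follows links breadth-first, and builds a simple word-to-page index. A real crawler would fetch pages over HTTP and respect each site's crawling rules.

```python
from collections import defaultdict, deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML. These pages are invented
# for illustration; a real crawler would download them over HTTP.
PAGES = {
    "https://example.test/": '<a href="https://example.test/shoes">Shoes</a> Welcome to the shop',
    "https://example.test/shoes": '<a href="https://example.test/">Home</a> Running shoes on sale',
}


class LinkAndTextParser(HTMLParser):
    """Collects outgoing links and lowercased words from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.words.extend(word.lower() for word in data.split())


def crawl_and_index(start_url, pages):
    """Breadth-first crawl from start_url; returns word -> set of URLs."""
    index = defaultdict(set)
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:  # page not available, skip it
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        parser.close()
        for word in parser.words:
            index[word].add(url)
        for link in parser.links:
            if link not in seen:  # visit every page only once
                seen.add(link)
                queue.append(link)
    return index


index = crawl_and_index("https://example.test/", PAGES)
print(sorted(index["shoes"]))  # both pages mention "shoes"
```

Looking up a word in `index` returns the pages that contain it, which is the basic idea behind answering a search query from previously crawled pages.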

What do you think lies in the future for web scraping, proxies, and other, similar tools?

Tech-wise the most likely scenario, I think, is the accelerated evolution of scrapers and solution providers. Datacenter and residential proxies should remain the backbone of the business. But we can already see the shift towards the provision of scraping solutions rather than the resources for them.

Getting into automated large-scale data acquisition is prohibitively expensive and difficult. Smaller projects are viable and provide a lot of insight, yes, but scaling scraping is a completely different beast. Highly skilled in-house developers and teams are required, powerful infrastructure has to be maintained, and lots of know-how has to be collected before you can really get into the groove of things.

Small businesses rarely have a good shot at getting started in web scraping. Even if a business were to meet all of the above requirements, we have to keep in mind that websites are constantly changing in many ways. They change scraper detection methods, redesign layouts, and reinvent data loading practices (such as dynamic loading with JavaScript).

That means scraping tools will break frequently, which will require knowledge, time, and resources to be dedicated to fixing them. As such, the barrier to entry is so high that it pushes out some businesses before they can even begin considering doing scraping.
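
Editor's illustration: one common way to soften this kind of breakage is to try several extraction rules in order, so that a redesign has to defeat all of them before the scraper fails. The sketch below is a minimal example under stated assumptions: the HTML snippets, the `RULES` list, and the helper names are all hypothetical, and it is not a description of any Oxylabs product.

```python
from html.parser import HTMLParser


class FirstMatchText(HTMLParser):
    """Collects the text of the first element whose attribute equals a value.

    Simplification: attribute values are compared exactly, and void
    elements (such as <br>) inside the match are not handled.
    """

    def __init__(self, attr, value):
        super().__init__()
        self.attr = attr
        self.value = value
        self.depth = 0      # greater than 0 while inside the matched element
        self.done = False   # True once the first match has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get(self.attr) == self.value:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_price(html, rules):
    """Tries each (attribute, value) rule in order; returns text or None."""
    for attr, value in rules:
        parser = FirstMatchText(attr, value)
        parser.feed(html)
        parser.close()
        text = "".join(parser.chunks).strip()
        if text:
            return text
    return None  # every rule failed: the layout probably changed again


# Hypothetical rules, ordered from the oldest known layout to the newest.
RULES = [("class", "price"), ("itemprop", "price"), ("data-testid", "product-price")]

old_layout = '<div><span class="price">19.99</span></div>'
new_layout = '<div><b itemprop="price">19.99</b></div>'
print(extract_price(old_layout, RULES), extract_price(new_layout, RULES))
```

When `extract_price` returns `None`, none of the known rules matched. That is a useful signal to alert a developer rather than silently store empty data, although someone still has to write the next rule.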

How has Oxylabs approached these challenges and the future of web scraping as a whole?

At Oxylabs, we’ve been looking to help businesses avoid the barrier altogether. After all, nearly every business truly wants the data, not the scraping process. So, we give them the tools they need to access it and take care of all the intricacies and annoyances on our side.

Another shift that we have noticed, and that has led us to the current iteration of our Scraper API solutions, is that data and its extraction are becoming highly specialized. Businesses usually want data from some set of sources (say, e-commerce websites) instead of something generic. To meet these needs, we’ve separated our single endpoint into three APIs, which has provided more flexibility for our partners and made the solutions easier for us to manage.

Our previous solution has been separated into a Web Scraper API, which is used for generic websites, E-commerce Scraper API, dedicated to e-commerce websites, and SERP Scraper API, one made for search engines. Each of our scrapers has unique features that are useful for those particular tasks. E-commerce, for example, has a proprietary machine learning-driven parsing tool. All of our scrapers are intended to make data as accessible and as cheap as possible.

Have there been any important legal developments for web scraping? How do you think it might change in the future?

Legality is a tricky topic for me, but I do see two trends I think are likely. There has been a buzz about publicly accessible and privately owned data. While I cannot comment on all of the legal details, there is a clear distinction between the two. Unfortunately, social media platforms and others seem to be highly protective of such data, even if it’s publicly accessible. There have been and still are highly important cases that will shape the future of web scraping.

Currently, most things are decided by previous case law or on a case-by-case basis. However, over time, the decisions made in either direction will build the foundation of the legitimacy of automated data collection.

Additionally, ethical residential proxy acquisition has been in the limelight. We took charge to turn the tide towards ethical acquisition by using software that acquires informed consent, provides analytics, and even provides a monetary reward to people who intentionally turn their devices into proxies.

Finally, I’d add that the legal side of web scraping is so tricky that you should always have a lawyer at hand. There are so many intricacies involved that only a legal professional can truly know them.

Which business models or industries would you recommend spend more time and effort on web scraping?

My answer is quite ironic: e-commerce businesses. While they have been by far the most focused on external data of all industries, the remaining potential is still immense. What we’re currently seeing is just one small grain out of the entire bag of possibilities.

On the other hand, I would say that all digital businesses should invest more in external data acquisition. Competing in the digital sphere means competing on information. Dynamic pricing, intent and sentiment scraping, etc., are all things that every business can benefit from. In the not-so-distant future, I see most, if not all, digital businesses engaging with web scraping and data analysis. Data is simply the future of competition.