So you want to scrape like the big boys (2021)

incolumitas.com

321 points by aragonite 15 days ago


KieranMac - 15 days ago

I'm a lawyer who works in the web-scraping space, and I always chuckle when I read threads like this. Almost every company that we now consider a monopolist (or their affiliates) in the tech space used scraping as part of their process to build their business, and almost every one of those same monopolists now prohibits startups and competitors from scraping their data (which, invariably, is not actually "their" data in any sort of legally cognizable sense). And so perhaps the ethics of web scraping are not so straightforward. And neither are the legal issues associated with it.

I wrote an article about that last fall that got some attention here.

https://news.ycombinator.com/item?id=37264676

anyfactor - 15 days ago

I was a professional web scraper. I still keep up to date with the industry.

These days, you do not make money by doing web scraping; you make money selling services to web scrapers. There are tons of web scraping SaaS products and services out there, as well as dozens of residential proxy providers.

Most anti-bot mechanisms evolve so quickly that you can make a decent income just by working in a traditional software engineering role dedicated entirely to engineering anti-anti-bot solutions. As these mechanisms evolve rapidly, working for a web scraping company is more stable than pursuing web scraping as a profession.

Web scrapers get paid by the project, making it an unstable job in the long run. High-level web scraping requires operational investments in residential proxies and rented servers. Additionally, low-end jobs pay very little. Brightdata is hosting a conference on web scraping, which should indicate how profitable it is to sell services for large-scale web scraping.

gwittel - 15 days ago

I’m really mixed on this. Anti-bot stuff is increasingly a pain point for security research. Working in this space, I have to work against these systems.

Threat actors use Cloudflare and other services to gate their payloads. That’s a problem for our customers who are trying to find/detect things like brand impersonation and credential phish. Cloudflare has been completely unhelpful. They just don’t care.

qweqwe14 - 14 days ago

> Those companies employ ill-adjusted individuals that do nothing else than look for the most recent techniques to fingerprint browsers [...] When normal people are out drinking beers in the pub on Friday night, these individuals invent increasingly bizarre ways to fingerprint browsers and detect bots ;)

What's the deal with "ill-adjusted" and "normal people"? I'm gonna say it right now, the reason why these individuals do this is because it's way more interesting and fun than building some bullshit React website for some boring business for the 20th time (this is just an example, not attacking React here, no need to freak out)

It's fun because you get to solve an actual real-world challenge and find new ways to do something. Same with things like developing exploits. Those who do this are not "ill-adjusted", they are in fact normal people that do what they are passionate about.

The whole mentality of "anyone who does something I don't like is ill-adjusted" is just absolutely insane.

graemep - 15 days ago

Anti-bot stuff also seems to be a security threat and a privacy threat: preventing users from accessing your site if they use VMs, port scanning, various forms of fingerprinting.

dang - 15 days ago

Discussed at the time:

Scrape like the big boys - https://news.ycombinator.com/item?id=29117022 - Nov 2021 (189 comments)

lyu07282 - 14 days ago

> Every website can access rotation and velocity data from Android data without asking for permission.

What????!!! That's nuts

Havoc - 12 days ago

Interesting. I'm busy building a project that requires scraping (at a pretty low rate).

I have been puzzling over what to do about the rejection cases. A single cheap Android device might just fill the gap.

blantonl - 15 days ago

This tends to be a very unpopular opinion around here, but in almost all cases I find Internet scraping to be unethical and downright malicious. I'm not saying all cases, but I'm saying almost.

A lot of the actors involved tend to be hustle culture types who think they are OWED your data, regardless of the ethics, laws, being a good citizen, whatever. They will blatantly disregard terms of service and hide behind massive setups such as these to circumvent protection etc.

And the problem is, if you run any sort of business or service that is data oriented, there will be thousands of people that will do this, which will cause you to devote enormous amounts of time, effort, money, and infrastructure just to mitigate the issues involved with data scraping. That's before you are even addressing whether or not these people are "stealing" your data. People who feel they are entitled to the crux of your business aren't bothered by being nice in the way they take it - they'll launch services that will cripple infrastructure.

Whenever I deal with a scraping process that decides it wants my entire business, and it wants all of it RIGHT NOW, or in 5 minutes, I want to find the person and sit them down in a room and tell them "hey, develop your own ideas and business. Ok? Thanks"

And if you think this was a problem before, it has gotten exponentially worse over the past few months, with every Tom, Susan, and Harry deciding they must have all your data to train their new LLM AI model. By the thousands.