So you want to scrape like the big boys (2021)
incolumitas.com | 321 points by aragonite 15 days ago
I'm a lawyer that works in the web-scraping space, and I always chuckle when I read threads like this. Almost every company that we now consider a monopolist (or their affiliates) in the tech space used scraping as part of their process to build their business, and almost every one of those same monopolists now prohibits startups and competitors from scraping their data (which, invariably, is not actually "their" data in any sort of legally cognizable sense). And so perhaps the ethics of web scraping are not so straightforward. And neither are the legal issues associated with it.
I wrote an article about that last fall that got some attention here.
Same thing with Facebook and identity. IIRC they leveraged Google’s address book to get traction, but will go after you if you try to store FB social graph data long term for anything outside their garden.
You try to block the tricks you used to get growth, basically.
> And so perhaps the ethics of web scraping are not so straightforward.
It strikes me that the _ethics_ of web scraping are extremely straightforward and cognizable with a terse analysis:
* You can respond however you like to my HTTP request, and I can parse your response however I like.
Simple, traditional, common. This is the way that conversations have occurred since the dawn of human communication, no?
> the legal issues associated with it.
But aren't these, without exception, fabrics spun out of the cloth that shields established players with the threat of state violence? This is not particularly new, and seems to fit in the pathetic-and-predictable file.
Moreover, the broader cheap attempt to cast this in "intellectual" property terms, and to attach that to protection of artists and creators, warrants a very particular eye-roll for its illogic.
Do you apply these ethics to web scraping only, or to all other network communications too?
Because if those are your general principles, you are making the internet much shittier. I still remember the old internet with open SMTP servers, easy-to-use comment forms, and forums which did not require emails and captchas. But people with "You can respond however you like to my HTTP request" attitude ruined it with spam, scam and SEO.
If you only apply this to web scraping, then where do you draw the line, and why? Can you scrape at the maximum rate the server can support? Can you scrape if it requires an active action (like account creation)? As long as you scrape, can you also post some links to improve your SEO?
> But people with "You can respond however you like to my HTTP request" attitude ruined it with spam, scam and SEO.
I don’t see how those things relate. They all have separate ethical issues. You can believe it’s ok to scrape whatever info you can find online at the same time as believing it’s not ok to scam people.
> Do you apply these ethics to web scraping only, or to all other network communications too?
I mean... if you're keying in at 20MHz and blasting a gigawatt of noise, then yeah you've certainly run afoul of decency and just law. You're changing the physical shape of the network environment.
But if the concern is just that we don't like the bytes to which your signal decodes, or we don't like what you're doing with the response we give you, then it seems more like a speech/press issue.
The internet needs to grow resilience such that annoyances in the logical layers are easy to ignore if you have the will. But that almost certainly means that you don't get to police what people do with the content you willingly hand over, pursuant to the protocol in use.
If I say, “Hey, please don’t text me anymore. I’m going to block this number,” and you respond by buying 500 phones in five cities and text me nonstop, is that ethical?
Not sure the metaphor works here. For example most sites let Google scrape them as much as it likes, but go out of their way to block other robots. By doing so they are effectively forcing the whole world to use (or support, since smaller search engines have to piggyback on the big ones with special status, and pay them) proprietary spyware.
In your analogy, most websites block everyone except the biggest pervert known to man.
Isn’t that a choice the website owner should be able to make?
Of course it's your choice to make.
Is someone forcing you to respond to requests you'd prefer to ignore?
Yes, people like the OP who run farms of scrapers.
The website owners make their preferences clear with robots.txt, IP blocks and other antibot technology. Scrapers intentionally ignore owners' desires and force them to respond.
If crawlers are stealth DDoSing my site then I lose the ability to respond entirely.
It's your job to separate the wheat from the chaff at the boundary of your network interface. In fact, personal boundaries of all sorts, from informational to emotional to physical to economic, are of paramount importance in the information age.
Nobody (and certainly not the state) is going to erect your personal boundaries for you by ensuring justice in the face of spammy text messages (or, for that matter, hypnotic and manipulative social media). This is your job - maybe your most important job.
Just as it's your job to protect your personal health and safety. Nobody (and certainly not the state) is going to do that for you.
Is there something about the trajectory of evolution of the internet that suggests to you that this is incorrect?
I observe continually (seemingly perpetually) increasing traffic, and continually (seemingly perpetually) increasing capacity for general purpose computing. I also observe enormous empathy and cyberpunk traditions in our communities, protecting each other. Do my eyes and ears deceive me?
Restraining orders are a thing for a reason. It's cheaper to harass someone out of business (intentionally or otherwise) than to compete on a level playing field.
Being a good neighbor requires restraining oneself and making requests with consideration for the other party.
Full disclosure: I worked for a price monitoring service that prided itself on crawling up to every 3 hours. Steps were always taken to mitigate the impact. Sometimes even asking hosts to allow-list the crawlers.
> Restraining orders are a thing for a reason.
Sure, but for the purposes of this conversation, saying "for a reason" regarding a function which is presently delegated to the state is fraught with all sorts of future-proofing concerns.
It seems to me that, as a baseline, we have to agree to observe the apparent trend of the internet to supplant the state - to resist its censorship and influence almost entirely - as an indicator that our long-term thinking needs to put those relatively few state functions which are essential to a peaceful society (such as restraining orders) in the purview of the internet... somehow. Maybe that will prove to be unnecessary, but in the case that the state fades, we'll be happy we had the foresight.
Internet traffic is barely (and arguably, already not) under human control as it is. And in another century, it will almost certainly be impossible to tell the machines 'enhance your calm or else'. Or else what?
I agree wholeheartedly with your points about being a good neighbor. But I don't think they extrapolate the way you think they do.
Consider this: at every moment, your house - your literal dwelling - is bombarded with high-level, semantic radio traffic, from way down where the messages bounce off the ionosphere all the way up to 10GHz and beyond. But this doesn't bother you. You ignore what you don't need! You draw boundaries and personally work on strengthening them - with the help of your friends and neighbors.
The internet needs help taking this shape at the application layer (and really, at all layers). And that part is up to us. We can't just throw our hands up and say "<legacy state function> exists for some reason, doesn't it?"
The government is our tool for regulating society when self regulation fails. It may be a blunt instrument and a last resort. Yet there is a place for it. We cannot entirely outsource all boundaries to individuals and private institutions.
I agree it would be ideal if the Internet could be as opt-in and benign as you suggest. Though I'm not even sure such an architecture is possible. How do you drive down the cost of listening and filtering to near zero whilst still allowing the desired signal?
And even if it were possible, consider that we do rely on governments to regulate the limited radio spectrum that we all have to share. Otherwise it wouldn't be an option to opt in to. The signal would be drowned out by whoever has the strongest transmitters.
> The government is our tool for regulating society when self regulation fails. It may be a blunt instrument and a last resort. Yet there is a place for it. We cannot entirely outsource all boundaries to individuals and private institutions.
I don't know who "our" refers to here, but if humans are evolving into "the internet", or however you want to think of this creature which is emerging over the course of this century (and appears wont to accelerate over the next few centuries), then I don't think the state is "ours". We can't just cover our eyes when presented with the proclivity of the internet not to tolerate the state.
> I agree it would be ideal if the Internet could be as opt-in and benign as you suggest. Though I'm not even sure such an architecture is possible. How do you drive down the cost of listening and filtering to near zero whilst still allowing the desired signal?
Cryptography.
> And even if it were possible, consider that we do rely on governments to regulate the limited radio spectrum that we all have to share. Otherwise it wouldn't be an option to opt in to. The signal would be drowned out by whoever has the strongest transmitters.
...really? Do you really believe that the state is a force for coordination and openness in radio?
The only bands which reliably continue to have these characteristics are the amateur bands, which have been defended by users for decades against constant encroachment by a state which, if it had its druthers, would've sold these bands to AT&T a long time ago.
My sense is that, if the government thought we weren't watching, they'd simply cancel the amateur radio license program. It is people standing to be counted (by taking the test) that keeps these bands viable _despite_ the FCC, not the other way around.
I was a professional web scraper. I still keep up to date with the industry.
These days, you do not make money by doing web scraping; you make money selling services to web scrapers. There are tons of web-scraping SaaS products and services out there, as well as dozens of residential proxy providers.
Most anti-bot mechanisms evolve so quickly that you can make a decent income just by working in a traditional software engineering role dedicated entirely to engineering anti-anti-bot solutions. As these mechanisms evolve rapidly, working for a web scraping company is more stable than pursuing web scraping as a profession.
Web scrapers get paid by project, making it an unstable job in the long run. High-level web scraping requires operational investment in residential proxies and rented servers. Additionally, low-end jobs pay very little. Brightdata is hosting a conference on web scraping, which should indicate the profitability of selling services for large-scale web scraping.
I've long thought that the use of residential proxies for things like scraping and operating large-scale bot networks is a necessity, but I've never really dabbled in using them, so I've never confirmed my suspicions about how residential proxies are used at a scale like this. Do you know if insecure IoT devices and malware-infected consumer hardware are as common as one might think for this? I can't imagine it would be either profitable or even possible to work with an ISP to acquire residential IPs, which kinda leaves me thinking that the only option for a residential proxy service would be pretty clandestine.
If you just search for "residential proxy" you'll find a lot of them are basically Raspberry Pis or similar shipped to people who are then paid for the amount of traffic that goes through them. Others are agents running on users' computers; I suspect at least some of these proxy providers aren't overly thorough about due diligence on how that agent got installed.
Is there a conference you would suggest that is the closest to scraping, generally speaking? As far as I know there isn't a scraping conference or strong community anywhere, and I'd like to learn and improve my skills.
The scientific aspects (algorithms, incl. implementations, performance evaluation) of Web crawling (including focused crawling) are covered by conferences like WWW, ACM SIGIR, BCS ECIR, ACM WSDM and ACM CIKM.
But you may refer to informal MeetUps or trade fairs; if so, google "Web Data Extraction Summit", "OxyCon Web Scraping Conference", "ScrapeCon 2024" (all past) or the forthcoming: https://www.ipxo.com/events/web-data-extraction-summit-2024/
The edge that every web scraper has is the knowledge they possess. In my opinion, conference presentations are usually too generalized or geared towards pitching services related to web scraping solutions.
There are some communities you can find in Discord, Telegram and most professional web scrapers are pretty active in LinkedIn and Twitter. The fun communities are in fact small groups of people with shared values and interests.
I've been writing scrapers on Upwork for many years. I'm sick of doing project based work and want to work at/start a scraping SaaS. Any advice?
I would recommend checking Google to see if you can find any job openings. Please remember that it is a niche industry, so there may not be many companies currently hiring. But honestly, if you are looking to make a full-time living, consider choosing another niche as web scraping jobs require you to consistently stay on top of your game. Most full-time jobs involve scraping data from big tech companies, and you are on your own to find solutions in bypassing anti-bot measures.
The irony is that before I realized it was so easy, I would just open source the code - not on Github, mind you, since the likes of Akamai would DMCA pretty quickly, but playing a little bit of jurisdictional arbitrage I put it on Gitee - the Chinese copycat of Github. I don't have a background in any of this, but companies like to brag and it's not hard to put two and two together. It also was a practical way to enable me to place wagers on sports automatically - which was more or less my actual day job - and was pretty good for learning programming quickly in your late 20s.
Instead almost immediately I got inundated by sneaker botters in China, and in English from somewhere that doesn't use it as a native language, judging from the idiosyncratic use. I kept the code up for a bit but took it down not because of any legal threats (good luck with DMCA-ing a platform endorsed by the CCP; even though I have no love for the party, I also find the American attitude that places intellectual property over real property in practice - from my experience as a defense attorney - to be just as screwed up in terms of priorities, just a matter of degrees). What made me take it down was the fact that I did not want to work in a customer service job, or really for anyone, and judging by the requests, it mostly consisted of "you do the work but we'll split the profits", which I can't believe anyone would fall for.
But since the internet is forever, some parts of code that specifically worked to emulate Cyberfed-Akamai from 0.8 to 2.3 are probably still floating around. My bad. I don't wear shoes normally - flip flops or nothing, after having to wear a suit to work for a decade - and have no idea about sneakers beyond what happens in NBA2K. Although cybersecurity firms should be pretty ashamed of how much they charge, considering their products were beaten by someone who learned how to program in their mid 20s, put code online within 3 years, and had it work - and I haven't even taken a math course since 11th grade and had too much of an ADHD problem to watch videos or even read more than blog posts or documentation. Everything I learned, I learned by copying from Github and similar services until it worked. There must be a lot of snake oil being sold out there, maybe most of it, since the insidiousness of the whole thing is that selling bunk solutions seldom gets you in trouble anyway, while actual crime - rape, murder, robbery and the like - largely goes unaddressed because the police simply prefer to complain about culture war bs instead of actually, you know, doing their jobs. Who knew Judith Butler was THIS spot on.
Thank you very much for sharing your story. From what I know these days, sneaker bots as an industry have pretty much gone downhill. Not because of anti-bot measures, but because the entire industry has essentially shifted from retail stores to eBay resellers. Everyone is competing to buy the first batch, to the point that it is not worth building a sneaker bot anymore.
How do you keep up with the industry?
It is kind of like Fight Club. There are 2-3 good communities that I lurk in. The people won't walk you through your scraping problems, but if you ask the questions to the right person politely, they often help.
Many residential proxy and scraping experts are pretty active on LinkedIn. But they do not talk about scraping data, just news around web scraping.
I’m really mixed on this. Anti bot stuff is increasingly a pain point for security research. Working in this space, I have to work against these systems.
Threat actors use Cloudflare and other services to gate their payloads. That’s a problem for our customers who are trying to find/detect things like brand impersonation and credential phish. Cloudflare has been completely unhelpful. They just don’t care.
Seconding this. Evading detection has become a real cake-walk since threat actors are able to sign up for a free Cloudflare account and then put their phishing site on their 2-hours old domain behind a level of protection backed by a $20B company. Funny that you almost never see phishing on Akamai ;)
Disclaimer: We operate in this space so we obviously have an interest in being able to detect these threats going forward.
Other than being the cheapest & easiest to use, is Cloudflare doing a particular evil here?
As a webmaster I don’t want non-user traffic except search engines. It’s a waste of money and often entails security, privacy and commercial risk.
Without Cloudflare I’d achieve only slightly less effective results using an AWS WAF, another CDN, or hand rolling solutions out of ipinfo etc.
Excuse my bias, as I work for IPinfo. Rolling your own bot detection service is something you should explore if you want near-absolute coverage.
We intentionally do not provide an IP reputation service, as many sophisticated bots mimic the "good reputation" aspects of IP addresses. Use of residential connections, or essentially being vetted by CDN/cloud services, makes bot detection ambiguous.
That is why we provide accurate IP metadata information. Whenever you detect patterns of bot-like behavior, look up the metadata such as privacy service usage, ASN, or assigned company, and then start blocking them via the firewall.
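For illustration, the detect-then-lookup-then-block flow could look something like this. This is my own sketch, not IPinfo code: the endpoint shape and the `privacy`/`org` field names are assumptions based on their public API (and may differ by plan), and the blocklist heuristic is entirely made up and would need tuning per site.

```python
import json
import urllib.request

IPINFO_TOKEN = "YOUR_TOKEN"  # hypothetical placeholder

def lookup(ip: str) -> dict:
    """Fetch metadata for one IP (endpoint and fields assumed from IPinfo's docs)."""
    url = f"https://ipinfo.io/{ip}/json?token={IPINFO_TOKEN}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)

def should_block(meta: dict) -> bool:
    """Decide whether to firewall an IP that already showed bot-like behavior."""
    privacy = meta.get("privacy", {})  # privacy-detection add-on fields, if present
    if any(privacy.get(k) for k in ("vpn", "proxy", "tor", "relay", "hosting")):
        return True
    # Crude, made-up heuristic: treat big cloud providers in the org string as non-human
    org = meta.get("org", "").lower()
    return any(name in org for name in ("amazon", "google cloud", "digitalocean"))
```

The key design point from the parent comment survives the sketch: you don't block on metadata alone, you block on behavior first and use the metadata to decide how aggressively.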
They could police their content. Or if they don’t want to, they could meaningfully partner with the security industry - create a “security bots” program, respond to takedown requests in days not months, etc.
I suppose that Cloudflare scanning payloads for known malware could potentially be effective if they could make the performance work.
Closed partnerships programs are a bit concerning though. Once they’re up and running there’s an enormous economic incentive for CF to squeeze members with fees that capture the economic upside.
Cloudflare is the ultimate example of creating the problem and selling the solution.
I was under the (naive?) impression that Cloudflare was a SaaS startup poster child. Do you mind expanding on your comment?
Among other things, cloudflare hosts DoS services while selling DoS protection.
I think you can get a bot allowed by all of Cloudflare at https://docs.google.com/forms/d/e/1FAIpQLSdqYNuULEypMnp4i5pR.... The blog post I read didn't make it clear if it would apply to all of Cloudflare or just customer sites though.
You can. Sort of. The good bots list is basically driven by a fixed user agent. And customers can set their preference to not allow “good bots”.
Not so good for security work.
It’s similar to their abuse reporting. They give your info to the site owner. Gee thanks, that’s just what I want to do.
I feel like we'll eventually arrive at some kind of micro-payment mechanism to solve this issue.
> Those companies employ ill-adjusted individuals that do nothing else than look for the most recent techniques to fingerprint browsers [...] When normal people are out drinking beers in the pub on Friday night, these individuals invent increasingly bizarre ways to fingerprint browsers and detect bots ;)
What's the deal with "ill-adjusted" and "normal people"? I'm gonna say it right now, the reason why these individuals do this is because it's way more interesting and fun than building some bullshit React website for some boring business for the 20th time (this is just an example, not attacking React here, no need to freak out)
It's fun because you get to solve an actual real-world challenge and find new ways to do something. Same with things like developing exploits. Those who do this are not "ill-adjusted", they are in fact normal people that do what they are passionate about.
The whole mentality of "anyone who does something I don't like is ill-adjusted" is just absolutely insane.
That entire paragraph is a joke. That’s why there is a little wink at the end.
It's not clear if it's a joke specifically because of the addition of "ill-adjusted"
Anti-bot stuff also seems to be a security and privacy threat: preventing users from accessing your site if using VMs, port scanning, various forms of fingerprinting.
I prefer the approach of an algorithmic challenge that forces the "new visitor" to spend some CPU cycles.
It's a clear process, doesn't involve privacy risks or strange sneaky games, and tends to fail in ways that a human can at least see and report, as opposed to mysterious outages.
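A hashcash-style proof-of-work is one common way to build such a challenge. A minimal sketch - the SHA-256 scheme, nonce format, and 20-bit difficulty here are my own illustrative choices, not anything from the thread:

```python
import hashlib
import itertools
import os

def make_challenge(bits: int = 20) -> tuple[str, int]:
    """Server: issue a random nonce plus a difficulty in leading zero bits."""
    return os.urandom(8).hex(), bits  # 20 bits ~ a million hashes on average

def solve(nonce: str, bits: int) -> int:
    """Client: burn CPU until some counter hashes below the target."""
    target = 1 << (256 - bits)
    for counter in itertools.count():
        digest = hashlib.sha256(f"{nonce}:{counter}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return counter

def verify(nonce: str, bits: int, counter: int) -> bool:
    """Server: checking a submitted solution costs exactly one hash."""
    digest = hashlib.sha256(f"{nonce}:{counter}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - bits))
```

The asymmetry is the point: a visitor pays roughly 2^bits hashes once per session and the server verifies in microseconds, while a scraper fetching thousands of pages pays the cost over and over.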
.. and also annoys people with slow hardware while costing very little to serious scrapers?
How much CPU time can you burn so people on 3 year old phones can see it, and how much will it cost scrapers?
Even a very slight challenge is a problem for scrapers: they have to do it far more frequently.
It's better than captchas and whatever Cloudflare does in terms of overall nuisance.
Discussed at the time:
Scrape like the big boys - https://news.ycombinator.com/item?id=29117022 - Nov 2021 (189 comments)
> Every website can access rotation and velocity data from Android devices without asking for permission.
What????!!! That's nuts
Interesting. I'm busy building a project that requires scraping (pretty low rate).
I've been puzzling over what to do about the rejection cases. A single cheap Android might just fill the gap.
This tends to be a very unpopular opinion around here, but in almost all cases I find Internet scraping to be unethical and downright malicious. I'm not saying all cases, but I'm saying almost.
A lot of the actors involved tend to be hustle culture types who think they are OWED your data, regardless of the ethics, laws, being a good citizen, whatever. They will blatantly disregard terms of service and hide behind massive setups such as these to circumvent protection etc.
And the problem is, if you run any sort of business or service that is data oriented, there will be thousands of people that will do this, which will cause you to devote enormous amounts of time, effort, money, and infrastructure just to mitigate the issues involved with data scraping. That's before you are even addressing whether or not these people are "stealing" your data. People who feel they are entitled to the crux of your business aren't bothered by being nice in the way they take it - they'll launch services that will cripple infrastructure.
Whenever I deal with a scraping process that decides it wants my entire business, and it wants all of it RIGHT NOW, or in 5 minutes, I want to find the person and sit them down in a room and tell them "hey, develop your own ideas and business. Ok? Thanks"
And if you think this was a problem before, it's exponentially worse over the past few months with every Tom, Susan, and Harry deciding they must have all your data to train their new LLM AI model. By the thousands.
I use web scraping to identify and monitor fraud.
Exhibit A: https://archive.ph/0ZUA8
This website is used to recruit people to set up "lead generation" Google Business Profiles and leave paid reviews.
Exhibit B: https://archive.ph/WWZuw
This is an example of the Craigslist ad used to initially attract people to the website above.
Exhibit C: https://archive.ph/wip/7Xig4
This is one of the Google Maps contributors which left paid reviews.
If you start with the reviews on that profile, you'll find a network of Google Business Profiles for fake service-area businesses connected through paid reviews.
Web scraping allows me to collect this type of data at scale.
I also use scraping to monitor the status of fake listings. If they are removed, the actor behind them will often get them reinstated. This allows me to report them again.
I don't care if you use Web scraping to solve the Israeli / Palestinian conflict. You're not entitled to anyone's data, computers, services, etc because you've decided for altruistic reasons that it is appropriate.
Cool use case. Love it. Fascinating stuff. But if Google told you to stop, would you? Or would you instead decide to build a 5 server cluster of 200 4G modems spread across continents to continue your work? Because if you did I would assume that you've decided to move on from a cute little altruistic process into a commercial use of someone else's data to make a profit.
Wait - so you are saying that information on the public internet isn’t public? Man, I wish people would remember the origin of the web and the entire reason it exists. If you don’t want information public, protect it - otherwise, I say it’s fair game.
Remember the OP article is about a system that is designed to completely and directly circumvent protections.
If an organization puts a series of processes in place to prevent scrapers from wholesale taking data in violation of terms of service, and you develop a 5 server cluster of 200x 4G modems it's no longer "fair game" and you're directly being unethical in your use of someone else's services.
Yeah, I think it's fair to say that in the presence of anti-bot measures (whether they work or not) that the content on the website isn't public anymore.
Available to someone meeting certain criteria (student discount, senior discount) doesn't mean available to anyone. I see no reason that "not available to be consumed by autonomous agents" is somehow less valid than "unlimited refills are available only to humans, not robots."
I agree that there is a line at using someone else’s data to make a profit, but it is kind of ironic that you mention Google, because their exact business model is scraping websites to feed their search results and litter it with ads to make a profit. For me there is a big line between aggregating publicly available data (search results, reviews, news, job postings, etc.) and intentionally violating terms of service like signing up for fake accounts and harvesting user data. So entitled maybe not (sites can try to prevent you from scraping), but if you make something publicly available you shouldn’t be surprised when people use it in ways you may not originally have intended (within legal boundaries of course).
>I don't care if you use Web scraping to solve the Israeli / Palestinian conflict.
Maybe you should though. It's always worth it to think about which giant's shoulder you're standing on. It's giants all the way down.
> cute little altruistic process
Maybe it is not the opinion which is unpopular, but the way it is being presented.
> Whenever I deal with a scraping process that decides it wants my entire business, and it wants all of it RIGHT NOW, or in 5 minutes, I want to find the person and sit them down in a room and tell them "hey, develop your own ideas and business. Ok? Thanks"
That's a lot of righteous anger for somebody building a business on top of other people's data.
"Broadcastify is the world's largest source of public safety, aircraft, rail, and marine radio live audio streams."
I have no sympathy whatsoever. You're just complaining about the very thing you're doing. If it's fair for you to do that, it's fair for others to do it to you.
They volunteer to provide the data to us. Every single last one of them. Nowhere in our business model did we make the conscious decision to say "hey, look at that business, they have something, and I'm going to take it."
Reading public website data is not "taking it". It is still there.
Observing publicly available information is not theft, nor is it illegal.
Of course copyright rules apply, but that is for if you reproduce something.
reproduce something
No one is developing a 5 server cluster with 200+ 4g modems to observe publicly available information. They are using said cluster to deliberately work around blocks, rate limits, and restrictions on scrapers who are scraping content solely to reproduce the data and use it for commercial purposes (make money)
Aren't you also volunteering your data? Don't browsers just talk to your webserver and say "Hey, what do you have?" and your site responds in kind.
There's a lot of local history locked up in facebook's nostalgia groups. I want to archive it in an open format.
I want to grab new rental listings and put them in an RSS feed, so I only look at each one once.
That's my uses for data scraping right now. If that destroys someone's business, I don't actually care. Maybe it's selfish, but my right to re-format data for my own convenience outweighs their right to make a profit.
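That rental-feed idea fits in a few lines of stdlib Python. A sketch under my own assumptions - listings as `{"title", "url"}` dicts from whatever scraper feeds it, deduplicated by URL in a local JSON file (not the commenter's actual setup):

```python
import json
import pathlib
import xml.etree.ElementTree as ET

SEEN_FILE = pathlib.Path("seen.json")  # remembers URLs across runs

def new_listings(listings: list[dict]) -> list[dict]:
    """Return only the listings not seen on a previous run (dedup by URL)."""
    seen = set(json.loads(SEEN_FILE.read_text())) if SEEN_FILE.exists() else set()
    fresh = [item for item in listings if item["url"] not in seen]
    SEEN_FILE.write_text(json.dumps(sorted(seen | {i["url"] for i in listings})))
    return fresh

def to_rss(items: list[dict]) -> str:
    """Render items as a minimal RSS 2.0 feed."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "New rental listings"
    ET.SubElement(channel, "link").text = "https://example.invalid/rentals"
    ET.SubElement(channel, "description").text = "Each listing appears exactly once"
    for item in items:
        el = ET.SubElement(channel, "item")
        ET.SubElement(el, "title").text = item["title"]
        ET.SubElement(el, "link").text = item["url"]
        ET.SubElement(el, "guid").text = item["url"]  # URL doubles as a stable id
    return ET.tostring(rss, encoding="unicode")
```

Run it on a schedule, point a feed reader at the output, and each listing shows up exactly once.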
Not that I think you shouldn't do it or that you're doing something wrong, but describing it as a right rubs me the wrong way. You don't have any right to expect someone else's computers to work for you.
I'm not sure how to phrase it except in terms of competing rights, but I take your point.
At the point where I'm scraping, the data's on my computer though.
You could call them interests.
It's often in a business's interest to format data in a specific way to make money, for example interlacing it with ads.
If that destroys someone's business, I don't actually care. Maybe it's selfish, but my right to re-format data for my own convenience outweighs their right to make a profit.
Exhibit A
Yeah, it's as unsympathetic framing of my position as I can offer.
But it's basically the same question as adblockers: Can I do what I want with the 1's and 0's on my own machine?
I'm not going to accept that I owe anyone a business model.
I'm not going to disagree with your use case here.
But I'm going to assume that you have some level of a conscience and you don't really mean that you don't give 3 shits about someone else's hard work so you can have some satisfaction at home. Because at face value that's exactly what you said.
No, I think that's fair. Unsympathetic framing, but not inaccurate. It's that whole "information wants to be free" thing.
BTW, kudos for presenting your point of view in a hostile forum and holding your own. I should have said that up front.
Is it unethical for a mouse to eat the cheese without triggering the trap?
> hustle culture types
It seems like you have this imaginary strawman that you hate and it seems like that's the foundation of why you dislike this.
No. The foundation of why I dislike it is simple. If I own some data, then I get to dictate the terms of how that data is used. Period.
“Hustle culture types” is simply a little anecdote about the types that would look you in the eye and tell you they are entitled to disregard what I said above. They’ll usually wrap it in some altruistic bs to justify as well.
Why do you put it on the open internet if you don't want machines to find and read it?
A ToS is nice, but you can't expect it to apply - the user (of the machine doing the scraping) might be a child, which makes the potential contract automatically void, for example. Also, there are people under jurisdictions where such things have no power, or that don't recognize your rights to the data.
And the whole thing of putting data out publicly and then just expecting machines to see the pile of data and go "oh so where do I sign the ToS?" is weird...
Just put it behind a rate limited API key...
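A minimal sketch of that rate-limiting idea: a token bucket per API key, where each request spends a token and tokens refill over time. The rates and key names are illustrative:

```python
# Token-bucket rate limiter keyed by API key (illustrative sketch).
# Each key gets a bucket of up to `burst` tokens, refilled at `rate`
# tokens per second; a request is allowed if a token is available.
import time
from collections import defaultdict

class RateLimiter:
    def __init__(self, rate=5.0, burst=10):
        self.rate = rate    # tokens refilled per second
        self.burst = burst  # maximum bucket size
        # bucket state per key: (tokens remaining, last refill time)
        self.buckets = defaultdict(lambda: (burst, time.monotonic()))

    def allow(self, api_key):
        tokens, last = self.buckets[api_key]
        now = time.monotonic()
        # Refill based on elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[api_key] = (tokens - 1, now)
            return True
        self.buckets[api_key] = (tokens, now)
        return False
```

In a real service this would sit in middleware and return 429 when `allow()` is False, which makes the site's expectations explicit instead of relying on scrapers' goodwill.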
As an analogy, imagine that a gardener builds a beautiful flower garden, bisected by a cute stone path, which she invites the public to view freely, save for a single restriction: a sign reading "keep off the flower beds."
There is a well-understood social contract here. I should not drive my car along the path, even if I don't crush the flowers. I shouldn't walk on the flower beds, even if that sign isn't legally enforceable. And if a runaway lawnmower, RC car, or some other machine of mine does end up in the garden, I am responsible, because it was my machine.
With websites, there is even a TOS specifically for scrapers - robots.txt. The fact that it is easy to bypass or ignore is no excuse for actually bypassing or ignoring it.
The anonymity of the Internet functions as a ring of Gyges, where since people don't face consequences (even social ones), they feel entitled to do as they will. However, just because you can do something does not mean you have a right to do something.
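For what it's worth, honoring that sign programmatically is cheap: Python's stdlib can parse a robots.txt policy and answer "may I fetch this?" The rules and user-agent string below are made up:

```python
# Checking a robots.txt policy with Python's stdlib urllib.robotparser.
# The policy is parsed from a local string here; a real scraper would
# fetch https://<site>/robots.txt instead.
from urllib import robotparser

rules = """
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

rp.can_fetch("my-scraper", "https://example.com/listings")      # allowed
rp.can_fetch("my-scraper", "https://example.com/private/page")  # disallowed
```

A polite scraper calls `can_fetch()` before every request and skips anything the policy disallows.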
I think this analogy would be improved if the sign said "Please don't take any pictures." This is far more restrictive than a sign saying "Please don't take any seeds or cuttings." The latter is more understandable because such activity damages the flower garden (particularly if everyone starts taking seeds and cuttings).
Now let's say a photographer visits the flower garden, takes pictures, and sells them online as postcards. As long as the photographer is not hindering other people (flooding the site with repeat requests, in the analogy), it doesn't seem to be a problem.
On the other hand, let's say we don't have a flower garden, we have an art gallery or a street artist's display - or the pages of a recently published book. Now the issue is distributing copyrighted material without paying the creator... but what if there's a broad social consensus that copyright is out of control and should have been radically shortened decades ago?
The vast majority of data being scraped is not copyrightable creative work, however, so as long as you're not obnoxiously hammering a site, scraping seems perfectly ethical.
Robots.txt is definitely not any kind of ToS - some people (Google) have said they will respect it. There's no reason to expect people to even know about the concept - practically nobody does, not even most developers.
And again - there are countries where any ToS without explicit signature or other kind of legal agreement don't apply at all.
Just like writing "by using the toilet you agree to transfer your soul for infinity" on a piece of toilet paper taped somewhere in the vicinity of a toilet gives you nothing - even if it was a more reasonable contract, nobody agreed to anything.
As for your other point, I think this is more like standing next to a highway with a sign that reads "don't drive cars here" and expecting people to stop and turn around. They didn't even see your sign at their speed and it's kinda unreasonable to expect they would be checking for that kind of a sign on a highway. At least make it properly - big, red, reflective (e.g. a Connection Reset, or at least 403 Forbidden).
Yes, there is no legal enforcement mechanism behind robots.txt. Nor do I particularly want there to be. However, most people agree that reasonable requests made regarding the use of someone's property should be followed. The capability to do something without consequences is not the same as the right to do something.
Our gardener should not need to build a brick wall around their public garden to keep your lawnmower out.
[flagged]
Is it? Just ask around. I have web app devs around me, and they don't know about it. Only those who actually specialize in websites (for presentation) do.
I couldn't set up a web server to save my life and I know what robots.txt is.