A lost decade chasing distributed architectures for data analytics?

duckdb.org

213 points by andreasha 5 days ago


braza - 2 days ago

This has the same energy as the article "Command-line Tools can be 235x Faster than your Hadoop Cluster" [1]

[1] - https://adamdrake.com/command-line-tools-can-be-235x-faster-...

rr808 - 2 days ago

Ugh, I have joined a big data team. 99% of the feeds are less than a few GB, yet we have to use Scala and Spark. It's so slow to develop and slow to run.

Mortiffer - 2 days ago

The R community has been hard at work on small data. I still highly prefer working on in-memory data in R; dplyr and data.table are elegant and fast.

The CRAN packages are all high quality: if the maintainer stops responding to emails for 2 months, your package is automatically removed. Most packages come from university professors who have been doing this their whole career.

zkmon - 2 days ago

A database is not only about disk size and query performance. A database reflects the company's culture, processes, workflows, collaboration, etc. It has an entire ecosystem around it: master data, business processes, transactions, distributed applications, regulatory requirements, resiliency, ops, reports, tooling, and so on.

The role of a database is not just to deliver query performance. It needs to fit into the ecosystem, serve its overall role across multiple facets, and deliver on a wide range of expectations, technical and non-technical.

While the useful dataset itself may not outpace hardware advancements, the ecosystem complexity will definitely outpace any hardware or AI advancements. Overall fit with the ecosystem will dictate the database choice, not query performance. Technologies do not operate in isolation.

braza - 2 days ago

> History is full of “what if”s, what if something like DuckDB had existed in 2012? The main ingredients were there, vectorized query processing had already been invented in 2005. Would the now somewhat-silly-looking move to distributed systems for data analysis have ever happened?

I like the gist of the article, but the conclusion sounds like 20/20 hindsight.

All the elements were there, and the author nails it, but maybe the right incentive structure wasn't in place to make it happen.

Between 2010 and 2015, there was a genuine feeling across almost the entire industry that we would converge on massive amounts of data, because until then the industry had never seen such an abundance of data capture and such ease of placing sensors everywhere.

The natural step in that scenario is usually not "let's find efficient ways to do it with the same capacity" but rather "let's invest so we can process this in a distributed manner, independent of the volume we might end up with."

It's the same thing with OpenAI/ChatGPT and DeepSeek: one can say that the math was always there, but the first mover was OpenAI, with something less efficient but a different set of incentive structures.

twic - 2 days ago

This feels like a companion to the classic 2015 paper "Scalability! But at what COST?":

https://www.usenix.org/system/files/conference/hotos15/hotos...

bhouston - 2 days ago

I have a large analytics dataset in BigQuery, and I wrote an interactive exploratory UI on top of it; any query I ran generally finished in 2 s or less. This made for a very simple app with endless analytics refinement that was also fast.

I would definitely not trade that for a pre-computed analytics approach. The freedom to explore in real time is enlightening and freeing.

I think you have restricted yourself to pre-computed, fixed analytics, but real-time interactive analytics is also an interesting area.

culebron21 - 2 days ago

A tangential story. I remember, back in 2010, contemplating the idea of completely distributed DBs, inspired by the then-popular torrent technology. In such a system, a client would differ from a server only in the amount of data it holds, and it would probably receive data in a torrent-like manner.

What puzzled me was that a client would want others to execute its queries, but would not want to load all the data and run queries for others. And how would you prevent conflicting update queries sent to different seeds?

I also thought that Crockford's distributed web idea (where every page is hosted torrent-style) was a good one, even though I didn't think deeply about it.

That was until I saw a discussion on web3, where someone pointed out that uploading any data to one server would force a lot of hosts to do the job of hosting part of it, so every small change would cause a tremendous amount of work for the entire web.

roenxi - 2 days ago

> As recently shown, the median scan in Amazon Redshift and Snowflake reads a doable 100 MB of data, and the 99.9-percentile reads less than 300 GB. So the singularity might be closer than we think.

This isn't really saying much. It is a bit like saying the 1-in-1,000-year storm levee is overbuilt for 99.9% of storms. Those aren't the storms the levee was built for, y'know; it wasn't set up with them anywhere near the top of mind. The database might do 1,000 queries in a day.

The focus for design purposes is really the queries that live out on the tail: can they be done on a smaller database? How much value do they add? What capabilities does the database need to handle them? Etc. That is what should justify a Redshift database. Or you can provision one to hold your 1 TB of data because red things go fast and we all know it :/

willvarfar - 2 days ago

I only retired my 2014 MBP ... last week! It started transiently not booting and then, after just a few weeks, switched to only transiently booting. Figured it was time. My new laptop is actually a very budget buy, and not a Mac, and in many ways a bit slower than the old MBP.

Anyway, the old laptop is about on par with the 'big' VMs that I use for work to analyse really big BQ datasets. My current flow is to run the 0.001% of queries that don't fit on a box in BigQuery, with just enough prep to make the intermediate result fit on a box. Then I extract that to Parquet stored on the VM and do the analysis on the VM using DuckDB from Python notebooks.

DuckDB has revolutionised not what I can do but how I can do it. All the ingredients were around before, but DuckDB brings them together and makes the ergonomics completely different. Life is so much easier with joins and the like than trying to do the same in, say, pandas.
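
A rough sketch of that last step (file names and columns are made up, not from the post above): DuckDB reads the Parquet extracts directly and lets you join them with plain SQL from a notebook, handing back a pandas DataFrame at the end.

    # Hypothetical Parquet extracts sitting on the VM: join and aggregate them
    # with SQL instead of pandas merges, then get a DataFrame for plotting.
    import duckdb

    con = duckdb.connect()  # in-memory database, nothing to administer

    df = con.sql("""
        SELECT u.country,
               count(*)           AS events,
               avg(e.duration_ms) AS avg_duration_ms
        FROM read_parquet('events_*.parquet') AS e
        JOIN read_parquet('users.parquet')    AS u USING (user_id)
        GROUP BY u.country
        ORDER BY events DESC
    """).df()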

simlevesque - 2 days ago

I'm working on a big research project that uses DuckDB. I need a lot of compute resources to develop my idea, but I don't have a lot of money.

I'm throwing a bottle into the ocean: if anyone has spare compute with good specs they could lend me for a non-commercial project it would help me a lot.

My email is in my profile. Thank you.

fulafel - 2 days ago

Related in the big-data-benchmarks-on-old-laptop department: https://www.frankmcsherry.org/graph/scalability/cost/2015/01...

npalli - 2 days ago

DuckDB works well if

* you have small datasets (total, not just what a single user is scanning)

* no real-time updates, just a static dataset that you can analyze at leisure

* only a few users, and only one doing any writes

* several seconds is an OK response time, and it gets worse if you have to load your scanned segment into the DuckDB node

* generally read-only workloads

So yeah, not convinced we lost a decade.

hodgesrm - 2 days ago

> If we look at the time a bit closer, we see the queries take anywhere between a minute and half an hour. Those are not unreasonable waiting times for analytical queries on that sort of data in any way.

I'm really skeptical of arguments that say it's OK to be slow. Even in the modern laptop example, queries still take up to 47 seconds.

Granted, I'm not looking at the queries, but the fact is that there are a lot of applications where users need results back in less than a second. [0] If the results are feeding automated processes like page rendering, they need them back in tens of milliseconds at most. That takes hardware to accomplish consistently, especially if the datasets are large.

The small data argument becomes even weaker when you consider that analytic databases don't just do queries on static datasets. Large datasets got that way by absorbing a lot of data very quickly. They therefore do ingest, compaction, and transformations. These require resources, especially if they run in parallel with query on the same data. Scaling them independently requires distributed systems. There isn't another solution.

[0] SIEM, log management, trace management, monitoring dashboards, ... All potentially large datasets where people sift through data very quickly and repeatedly. Nobody wants to wait more than a couple seconds for results to come back.

PotatoNinja - 2 days ago

Krazam did a brilliant video on Small Data: https://youtu.be/eDr6_cMtfdA?si=izuCAgk_YeWBqfqN

steveBK123 - 2 days ago

Maybe it was all VC funded solutions looking for problems?

It's a lot easier to monetize data analytics solutions if users' code & data are captive in your hosted infra/cloud environment than it is to sell people a binary they can run on their own kit...

All the better if it's an entire ecosystem of ... stuff ... living in "the cloud", leaving end users writing checks to 6 different portfolio companies.

jandrewrogers - 2 days ago

> As recently shown, the median scan in Amazon Redshift and Snowflake reads a doable 100 MB of data, and the 99.9-percentile reads less than 300 GB. So the singularity might be closer than we think.

There is some circular reasoning embedded here. I've seen many, many cases of people finding ways to cut up their workloads into small chunks because the performance and efficiency of these platforms is far from optimal if you actually tried to run your workload at its native scale. To some extent, these "small reads" reflect the inadequacy of the platform, not the desire of a user to run a particular workload.

A better interpretation may be that the existing distributed architectures for data analytics don't scale well except for relatively trivial workloads. There has been an awareness of this for over a decade but a dearth of platform architectures that address it.

mangecoeur - 2 days ago

Did my PhD around that time and did a project “scaling” my work on a Spark cluster. Huge PITA, and no better than my local setup, which was an MBP 15 with pandas and Postgres (actually, I wrote and contributed a big chunk of pandas read_sql at that time to make it Postgres-compatible using SQLAlchemy).

mehulashah - 2 days ago

For those of you from the AI world, this is the equivalent of the bitter lesson, and of DeWitt's argument about database machines from the early 80s. That is, if you wait a bit, then with the exponential pace of Moore's law (or its modern equivalents), improvements in “general purpose” hardware will obviate DB-specific improvements. The problem is that back in 2012, we had customers who wanted to query terabytes of logs for observability, analyze adtech streams, etc. So I feel like this is a pointless argument. If your data fit on an old MacBook Pro, sure, you should've built for that.

godber - 2 days ago

This makes a completely valid point when you constrain the meaning of Big Data to “the largest dataset one can fit on a single computer”.

drewm1980 - 2 days ago

I mean, not everyone spent their decade on distributed computing. Some devs with a retrogrouch inclination kept writing single-threaded code in native languages on a single node. Single-core clock speed stagnated, but it was still worth buying new CPUs with more cores, because they also had more cache and all the extra cores are useful for running ~other people's bloated code.

carlineng - 2 days ago

This is really a question of economics. The biggest organizations with the most ability to hire engineers have need for technologies that can solve their existing problems in incremental ways, and thus we end up with horrible technologies like Hadoop and Iceberg. They end up hiring talented engineers to work on niche problems, and a lot of the technical discourse ends up revolving around technologies that don't apply to the majority of organizations, but still cause FOMO amongst them. I, for one, am extremely happy to see technologies like DuckDB come along to serve the long tail.

querez - 2 days ago

> The geometric mean of the timings improved from 218 to 12, a ca. 20× improvement.

Why do they use the geometric mean to average execution times?
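
(For reference: the geometric mean is the n-th root of the product of the n timings. Benchmark write-ups often use it because it weights each query's relative change equally, so a single very slow query doesn't dominate the average. A toy illustration with made-up timings, not the article's data:)

    # Made-up timings: the arithmetic mean is dominated by the slowest query,
    # while the geometric mean weights each query's relative speedup equally.
    from statistics import geometric_mean, mean

    before = [1.0, 2.0, 4.0, 800.0]   # seconds per query
    after  = [0.5, 1.0, 2.0, 8.0]     # slow query improved 100x, the rest 2x

    print(mean(before) / mean(after))                      # ~70x, driven by one query
    print(geometric_mean(before) / geometric_mean(after))  # ~5.3x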

tonyhart7 - 2 days ago

Is there an open-source analytics project built on top of DuckDB yet?

I mostly see ClickHouse, Postgres, etc.

mediumsmart - 2 days ago

I am on the late-2015 version, and I have an eBay body stashed for when the time comes to refurbish that small-data machine.

hobs - 2 days ago

I have worked for half a dozen companies, all swearing up and down they had big data. Meaningfully, one customer had 100 TB of logs and another 10 TB of stuff; everyone else, once you thought about it properly and removed the utter trash, was really under 10 TB.

Also - SQLite would have been totally fine for these queries a decade ago or more (just slower) - I messed with 10 GB+ datasets with it more than 10 years ago.
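
A minimal sketch of the kind of thing I mean (hypothetical database file, table, and columns, assuming ISO-8601 timestamps stored as text): SQLite copes with an ad-hoc rollup over a multi-GB file fine, it just scans row-oriented pages, so it's slower than a columnar engine.

    # Hypothetical ~10 GB SQLite file with an `events` table: a monthly rollup
    # runs fine, just more slowly than a column store would manage.
    import sqlite3

    con = sqlite3.connect("events.db")
    rows = con.execute("""
        SELECT strftime('%Y-%m', ts) AS month,
               count(*)              AS events,
               sum(amount)           AS total_amount
        FROM events
        GROUP BY month
        ORDER BY month
    """).fetchall()
    for month, events, total_amount in rows:
        print(month, events, total_amount)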

bobchadwick - 2 days ago

It's not the point of the blog post, but I love the fact that the author's 2012 MacBook Pro is still usable. I can't imagine there are too many Dell laptops from that era still alive and kicking.
