Adopt · 1. Apache Spark
The mature distributed engine for large-scale batch and streaming, with DataFrame, SQL, and Structured Streaming APIs. Spark Connect (a thin client/server protocol, introduced in 3.4) matures in Spark 4.x, which also enables ANSI SQL mode by default, and Spark remains the backbone of most managed lakehouse platforms.
Why here: A proven, predictable default for petabyte-scale ETL and analytics, with the widest ecosystem and operational track record.
Adopt · 2. Apache Flink
A stateful stream processor with event-time semantics, exactly-once state consistency, and a mature SQL layer. It is the standard for low-latency, event-driven pipelines and continuous ETL where correctness under out-of-order data matters.
Why here: The reference engine for record-at-a-time streaming (as opposed to micro-batching): battle-tested at scale and hard to match for event-time, stateful workloads.
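To make "correctness under out-of-order data" concrete, here is a minimal sketch of event-time tumbling windows driven by a watermark. It is plain Python, not the Flink API; the names `WINDOW`, `MAX_LATENESS`, and `tumbling_counts` are hypothetical, and dropping late events is just one possible policy chosen for the illustration.

```python
from collections import defaultdict

WINDOW = 10        # window size, in event-time seconds (assumed for this sketch)
MAX_LATENESS = 5   # bounded out-of-orderness used to derive the watermark


def tumbling_counts(events):
    """Yield (window_start, count) once the watermark passes a window's end.

    `events` is an iterable of (event_time, payload) pairs that may arrive
    out of order. Events for an already-emitted window are dropped.
    """
    open_windows = defaultdict(int)
    watermark = float("-inf")
    for ts, _payload in events:
        start = ts - ts % WINDOW
        if start + WINDOW <= watermark:
            continue  # too late: this window has already been emitted
        open_windows[start] += 1
        # The watermark trails the largest event time seen so far.
        watermark = max(watermark, ts - MAX_LATENESS)
        # Emit every window whose end is at or before the watermark.
        for s in sorted(w for w in open_windows if w + WINDOW <= watermark):
            yield s, open_windows.pop(s)
    # End of input: flush whatever is still open.
    for s in sorted(open_windows):
        yield s, open_windows.pop(s)


events = [(1, "a"), (4, "b"), (12, "c"), (3, "d"), (16, "e"), (2, "f"), (27, "g")]
results = list(tumbling_counts(events))
```

In this run the event at time 3 arrives after the event at time 12 but is still counted in the first window, because the watermark has not yet passed that window's end. The event at time 2 arrives after the window was emitted and is dropped.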
Adopt · 3. DuckDB
An in-process columnar OLAP engine — "SQLite for analytics" — that reads Parquet, Arrow, and lakehouse tables directly and runs vectorised queries on a single node. The 1.x line has a stable storage format and an established extension ecosystem.
Why here: The default for single-node analytics and embedded query: zero-ops, fast, and now stable enough to standardise on.
Trial · 4. Polars
A Rust DataFrame library with a lazy, query-optimised API and strong multi-threaded performance on a single machine; the 1.x API is stable and Python adoption is climbing fast. A distributed cloud engine is still maturing.
Why here: Excellent for single-node DataFrame work and a strong pandas alternative — trial it where the in-memory model fits before standardising.
Trial · 5. Ray
A distributed runtime for Python that spans data loading, training, and serving, with Ray Data providing streaming execution for batch inference and large-scale transforms. Powerful for ML-adjacent data work, but operating a cluster is a real commitment.
Why here: Compelling where data and ML pipelines converge; pilot it with clear ownership rather than adopting it as a general-purpose ETL engine.
Assess · 6. Daft
A Rust-backed distributed DataFrame aimed at multimodal data (text, images, embeddings, URLs) alongside tabular columns, with native lakehouse and object-store integration. Young but advancing quickly.
Why here: A promising take on multimodal and ML-feature pipelines; assess it on bounded workloads while the ecosystem settles.
Hold · 7. Hadoop MapReduce
The original disk-based batch model of the Hadoop era. Superseded for new work by in-memory and vectorised engines, and effectively in maintenance — most platforms have moved compute off it entirely.
Why here: Don't start new pipelines on it; keep legacy jobs running and migrate them to Spark, Flink, or a warehouse engine.