Strata

Mapping the maturity of the modern data stack.

Strata is a vendor-neutral technology radar for data engineering: a working catalog of the engines, storage and table formats, transformation and orchestration tools, and query and serving systems that teams use to move and model data at scale. We are not tied to any single platform — open-source and managed offerings sit side by side, batch next to streaming, mature defaults next to promising newcomers. Every entry is placed in one of four quadrants and one of four maturity rings. The aim is to give a data engineer a fast, honest bearing: what you can already build on, what is worth piloting, what to keep an eye on, and what to migrate away from.

Adopt

A mature default. Proven under real production load, predictable to operate, with stable interfaces — safe to standardise on for new and existing systems.

Trial

Ready for pilots. Production-viable and clearly valuable, but still maturing or still demanding real adoption effort and workload-specific tuning. Trial it on bounded workloads with good checks before it becomes a default.

Assess

Worth watching. A promising direction with a fast-moving ecosystem or a fresh release; operational and governance patterns are still forming — explore it in isolated scenarios, not yet in critical paths.

Hold

Hold off. A stalled, wound-down, or superseded option. Not a sensible pick for new systems; keep existing deployments running and migrate off in an orderly way.

25 entries · 4 quadrants · 4 maturity rings
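As a rough illustration of the placement scheme described above, each entry can be modelled as a record carrying one quadrant and one ring. This is only a sketch: the class and function names (`Ring`, `Quadrant`, `Entry`, `by_ring`) are hypothetical, and the list holds a handful of entries from this page rather than all 25.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Ring(Enum):
    ADOPT = "Adopt"
    TRIAL = "Trial"
    ASSESS = "Assess"
    HOLD = "Hold"


class Quadrant(Enum):
    ENGINES = "Engines"
    STORAGE = "Storage & Formats"
    TRANSFORM = "Transform & Orchestration"
    QUERY = "Query & Serving"


@dataclass(frozen=True)
class Entry:
    number: int
    name: str
    quadrant: Quadrant
    ring: Ring


# A small excerpt of the radar, not the full catalog.
ENTRIES = [
    Entry(1, "Apache Spark", Quadrant.ENGINES, Ring.ADOPT),
    Entry(4, "Polars", Quadrant.ENGINES, Ring.TRIAL),
    Entry(7, "Hadoop MapReduce", Quadrant.ENGINES, Ring.HOLD),
    Entry(10, "Apache Iceberg", Quadrant.STORAGE, Ring.ADOPT),
    Entry(18, "SQLMesh", Quadrant.TRANSFORM, Ring.ASSESS),
    Entry(25, "Presto", Quadrant.QUERY, Ring.HOLD),
]


def by_ring(entries, ring):
    """Return the names of all entries placed in the given ring."""
    return [e.name for e in entries if e.ring is ring]


# How many of the excerpted entries sit in each ring.
ring_counts = Counter(e.ring for e in ENTRIES)
```

Using enums for the two axes means a misspelled ring or quadrant fails at construction time instead of silently creating a fifth category.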

Engines

distributed and in-process compute

Adopt · 1. Apache Spark

The mature distributed engine for large-scale batch and streaming, with DataFrame, SQL, and Structured Streaming APIs. Spark 4.x makes ANSI SQL mode the default and matures Spark Connect (the thin client/server protocol introduced in 3.4), and Spark remains the backbone of most managed lakehouse platforms.

Why here: A proven, predictable default for petabyte-scale ETL and analytics, with the widest ecosystem and operational track record.

Adopt · 2. Apache Flink

A true stateful stream processor with event-time semantics, exactly-once state consistency, and a mature SQL layer. It is the standard for low-latency, event-driven pipelines and continuous ETL where correctness under out-of-order data matters.

Why here: The reference engine for real streaming — battle-tested at scale and unmatched for event-time, stateful workloads.
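To make "correctness under out-of-order data" concrete, the sketch below counts events in tumbling event-time windows and closes a window only once a watermark has passed its end. It is plain Python, not Flink's API; the names `WINDOW_MS`, `MAX_OUT_OF_ORDER_MS`, and `tumbling_counts` are hypothetical, and the watermark rule (largest timestamp seen minus a fixed bound) is one simple assumed strategy.

```python
from collections import defaultdict

WINDOW_MS = 10_000           # tumbling window size (assumed)
MAX_OUT_OF_ORDER_MS = 5_000  # how far the watermark trails the max timestamp (assumed)


def tumbling_counts(events):
    """Count events per (window, key) by event time.

    `events` is an iterable of (event_time_ms, key) in arrival order.
    Returns (emitted, dropped): `emitted` is a list of
    (window_start, key, count) in emission order, `dropped` holds events
    that arrived after their window had already been closed.
    """
    open_windows = defaultdict(int)  # (window_start, key) -> count
    emitted, dropped = [], []
    watermark = float("-inf")

    for ts, key in events:
        window_start = ts - ts % WINDOW_MS
        if window_start + WINDOW_MS <= watermark:
            dropped.append((ts, key))  # too late: its window was already emitted
        else:
            open_windows[(window_start, key)] += 1

        # The watermark trails the largest event time seen so far.
        watermark = max(watermark, ts - MAX_OUT_OF_ORDER_MS)

        # Emit every window whose end is at or before the watermark.
        closable = sorted(w for w in open_windows if w[0] + WINDOW_MS <= watermark)
        for start, k in closable:
            emitted.append((start, k, open_windows.pop((start, k))))

    # End of input: flush whatever is still open.
    for start, k in sorted(open_windows):
        emitted.append((start, k, open_windows[(start, k)]))
    return emitted, dropped


events = [
    (1_000, "a"),
    (12_000, "a"),
    (4_000, "a"),   # out of order, but window [0, 10000) is still open
    (16_000, "a"),  # watermark reaches 11000 and closes [0, 10000)
    (3_000, "a"),   # arrives after its window closed
]
emitted, dropped = tumbling_counts(events)
```

The out-of-order event at 4000 ms is still counted in the first window, while the one at 3000 ms arrives too late and is dropped. A processing-time approach would have assigned both to whichever window happened to be current on arrival.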

Adopt · 3. DuckDB

An in-process columnar OLAP engine — "SQLite for analytics" — that reads Parquet, Arrow, and lakehouse tables directly and runs vectorised queries on a single node. The 1.x line has a stable storage format and an established extension ecosystem.

Why here: The default for single-node analytics and embedded query: zero-ops, fast, and now stable enough to standardise on.

Trial · 4. Polars

A Rust DataFrame library with a lazy, query-optimised API and strong multi-threaded performance on a single machine; the 1.x API is stable and Python adoption is climbing fast. A distributed cloud engine is still maturing.

Why here: Excellent for single-node DataFrame work and a strong pandas alternative — trial it where the in-memory model fits before standardising.

Trial · 5. Ray

A distributed runtime for Python that spans data loading, training, and serving, with Ray Data for streaming batch inference and large-scale transforms. Powerful for ML-adjacent data work, but operating a cluster is a real commitment.

Why here: Compelling where data and ML pipelines converge; pilot it with clear ownership rather than adopting it as a general-purpose ETL engine.

Assess · 6. Daft

A Rust-backed distributed DataFrame aimed at multimodal data (text, images, embeddings, URLs) alongside tabular columns, with native lakehouse and object-store integration. Young but advancing quickly.

Why here: A promising take on multimodal and ML-feature pipelines; assess it on bounded workloads while the ecosystem settles.

Hold · 7. Hadoop MapReduce

The original disk-based batch model of the Hadoop era. Superseded for new work by in-memory and vectorised engines, and effectively in maintenance mode; most platforms have moved compute off it entirely.

Why here: Don't start new pipelines on it; keep legacy jobs running and migrate them to Spark, Flink, or a warehouse engine.

Storage & Formats

table formats, file formats, and the lakehouse

Adopt · 8. Apache Parquet

The de-facto columnar file format for analytics: efficient encodings, predicate push-down, and universal engine support. It underpins essentially every open table format in use today.

Why here: The safe, universal default for analytical storage on object stores — interoperable everywhere and unlikely to be displaced soon.

Adopt · 9. Apache Arrow

A language-agnostic in-memory columnar standard plus zero-copy IPC and the Arrow Flight transport. It is the lingua franca that lets engines and libraries exchange data without serialisation overhead.

Why here: A foundational interchange layer baked into most modern tools — adopt it as the default for cross-engine and cross-language data movement.

Adopt · 10. Apache Iceberg

An open table format with hidden partitioning, snapshot isolation, schema and partition evolution, and a REST catalog standard. Engine support is broad and most managed catalogs now speak Iceberg natively.

Why here: The table format with the widest, most neutral engine and catalog support — a sound default for a new open lakehouse.
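Hidden partitioning is easier to see in miniature. In the toy model below, the table derives a partition value from a timestamp column on write, and a scan filtered on that same column skips partitions that cannot match. This is an assumed simplification in plain Python, not Iceberg's API or metadata layout; `Table`, `day_transform`, and the `event_ts` column are hypothetical names.

```python
from datetime import date, datetime


def day_transform(ts: datetime) -> date:
    """Partition transform: derive the partition value from the timestamp."""
    return ts.date()


class Table:
    """Toy table partitioned by day(event_ts), with no separate partition
    column exposed to writers or readers."""

    def __init__(self):
        self.partitions = {}  # date -> list of rows

    def append(self, row):
        # Writers supply only event_ts; the partition value is derived.
        key = day_transform(row["event_ts"])
        self.partitions.setdefault(key, []).append(row)

    def scan(self, start: datetime, end: datetime):
        """Return rows with start <= event_ts < end, plus the number of
        partitions that had to be read."""
        wanted = [d for d in self.partitions if start.date() <= d <= end.date()]
        rows = [
            r
            for d in sorted(wanted)
            for r in self.partitions[d]
            if start <= r["event_ts"] < end
        ]
        return rows, len(wanted)


table = Table()
for i, ts in enumerate([
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 2, 9, 0),
    datetime(2024, 1, 2, 15, 0),
    datetime(2024, 1, 3, 8, 0),
]):
    table.append({"id": i, "event_ts": ts})

# The query filters on event_ts only, yet just one of three partitions is read.
rows, partitions_read = table.scan(
    datetime(2024, 1, 2, 0, 0), datetime(2024, 1, 2, 12, 0)
)
```

Because the reader never names a partition column, the partitioning scheme could change later without rewriting queries, which is the idea behind partition evolution.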

Trial · 11. Delta Lake

An ACID table format built around a transaction log, with strong support across the Spark ecosystem and a UniForm mode that exposes Iceberg-compatible metadata. Mature, though historically engine-centric.

Why here: A solid choice inside a Spark-centric stack; trial it against Iceberg where multi-engine neutrality and catalog choice are priorities.
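The idea of a table "built around a transaction log" can be sketched as follows: the current state of the table is whatever results from replaying an ordered list of commits, each of which adds or removes data files. The action shape used here (`{"add": path}` and `{"remove": path}`) is a deliberately simplified assumption, not the real Delta log schema, and `replay` is a hypothetical helper.

```python
def replay(log):
    """Replay an ordered list of commits and return the set of live data files.

    Each commit is a list of actions, either {"add": path} or {"remove": path}.
    """
    live = set()
    for commit in log:
        for action in commit:
            if "add" in action:
                live.add(action["add"])
            elif "remove" in action:
                live.discard(action["remove"])
    return live


log = [
    # version 0: initial write
    [{"add": "part-0.parquet"}, {"add": "part-1.parquet"}],
    # version 1: append
    [{"add": "part-2.parquet"}],
    # version 2: compaction rewrites two small files into one
    [
        {"remove": "part-0.parquet"},
        {"remove": "part-1.parquet"},
        {"add": "part-3.parquet"},
    ],
]

latest = replay(log)          # files visible at the newest version
as_of_v1 = replay(log[:2])    # "time travel": replay only versions 0 and 1
```

Since readers only see files named by a fully replayed commit, a half-finished write is invisible to them. Older versions also stay readable as long as their files are retained.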

Assess · 12. Apache Hudi

A table format specialised for incremental ingestion, upserts, and change streams, with copy-on-write and merge-on-read modes. Capable, but the open lakehouse ecosystem is increasingly converging on Iceberg.

Why here: Assess it for upsert-heavy, CDC-style ingestion; weigh the convergence of catalogs and engines on Iceberg before committing.
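The trade-off between the two write modes can be shown with a toy key-value table. Copy-on-write pays the merge cost at write time; merge-on-read appends cheap deltas and pays at read time until compaction. This is a conceptual sketch in plain Python, not Hudi's API or file layout; `cow_upsert` and `MorTable` are hypothetical names.

```python
def cow_upsert(base, updates):
    """Copy-on-write: produce a rewritten base at write time, so a read is
    just a scan of the base."""
    new_base = dict(base)
    new_base.update(updates)
    return new_base


class MorTable:
    """Merge-on-read: append updates to a delta log cheaply and merge them
    with the base when reading."""

    def __init__(self, base):
        self.base = dict(base)
        self.log = []

    def upsert(self, updates):
        self.log.append(dict(updates))  # cheap write, no rewrite of the base

    def read(self):
        merged = dict(self.base)
        for delta in self.log:          # later deltas win
            merged.update(delta)
        return merged

    def compact(self):
        # Fold the log into the base so later reads are plain scans again.
        self.base = self.read()
        self.log = []


base = {"u1": "free", "u2": "free"}
batch_1 = {"u2": "pro"}
batch_2 = {"u3": "free", "u1": "pro"}

cow = cow_upsert(cow_upsert(base, batch_1), batch_2)

mor = MorTable(base)
mor.upsert(batch_1)
mor.upsert(batch_2)
before_compaction = mor.read()
mor.compact()
```

Both paths end in the same logical table. They differ only in when the merge work happens, which is why merge-on-read tends to suit write-heavy CDC ingestion and copy-on-write suits read-heavy tables.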

Assess · 13. Lance

A modern columnar format optimised for random access and vector search, with versioning and a fast path for embeddings and multimodal data. Young, with a fast-moving spec and tooling.

Why here: Worth assessing for embedding and feature stores where random access and vector queries dominate; not yet a general analytical format.

Transform & Orchestration

modelling, pipelines, and schedulers

Adopt · 14. dbt

The standard for SQL-based transformation: versioned models, tests, documentation, and lineage over your warehouse or lakehouse engine. A large ecosystem and well-understood operating model.

Why here: The default transformation layer for analytics engineering — proven, widely hired-for, and stable enough to build a practice on.

Adopt · 15. Apache Airflow

The widely deployed Python-defined workflow scheduler for batch orchestration. Airflow 3.x modernises the scheduler, adds DAG versioning, and decouples task execution — a meaningful step up in operability.

Why here: A mature, ubiquitous orchestrator with deep integrations; a safe default where general-purpose batch scheduling is the need.
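At its core, batch orchestration of this kind means running tasks in an order that respects their dependencies. The sketch below shows that idea with Python's standard-library `graphlib`, not Airflow's own API; the task names and the `run_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join": {"extract_orders", "extract_customers"},
    "aggregate": {"join"},
    "publish": {"aggregate"},
}


def run_order(deps):
    """Return one valid execution order for the dependency graph.

    Raises graphlib.CycleError if the graph contains a cycle.
    """
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
```

The two extract tasks have no ordering constraint between them, so a scheduler is free to run them in parallel. Only the relative order of dependent tasks is guaranteed.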

Trial · 16. Dagster

An asset-oriented orchestrator that models data assets and their lineage rather than bare tasks, with strong typing, partitions, and built-in observability. Production-ready and growing.

Why here: A strong fit for asset- and lineage-first teams; trial it where data-asset semantics and testability matter more than raw task scheduling.

Trial · 17. Prefect

A Pythonic orchestrator with a dynamic, code-first flow model and a managed control plane option. Pleasant developer experience and flexible, with a smaller ecosystem than Airflow.

Why here: Worth trialling where developer ergonomics and dynamic workflows win out; validate the operating model against your scale.

Assess · 18. SQLMesh

A transformation framework with column-level lineage, virtual data environments, and automatic incremental/backfill handling — positioned as a more stateful alternative to plain SQL transforms. Promising and fast-moving.

Why here: Assess it where backfills and environment isolation are pain points; the ideas are strong but the ecosystem is still young.

Hold · 19. Apache Oozie

The XML-configured workflow scheduler from the classic Hadoop stack. Effectively legacy, with little new investment and a poor fit for modern, code-defined pipelines.

Why here: Don't adopt for new work; migrate existing Oozie workflows to Airflow, Dagster, or a managed orchestrator.

Query & Serving

interactive SQL and real-time analytics

Adopt · 20. Trino

A distributed SQL query engine for federated, interactive analytics across lakehouse tables and dozens of connectors, without moving data. Active development and broad production use.

Why here: The default for interactive, federated SQL over a lake: fast, well supported, and the actively maintained continuation of the PrestoSQL line.

Adopt · 21. ClickHouse

A columnar OLAP database built for very fast aggregation over large tables, with strong ingestion throughput and a rich SQL dialect. Widely used for analytics and observability backends.

Why here: A proven default for high-throughput, low-latency analytical queries; predictable and operationally mature at scale.

Trial · 22. Apache Druid

A real-time analytics database for high-concurrency time-series and event queries, with streaming ingestion and sub-second slice-and-dice. Powerful, but its multi-process architecture is complex to operate.

Why here: A good fit for real-time dashboards over event streams; trial it where its operational footprint is justified by query latency needs.

Trial · 23. StarRocks

An MPP analytical database with a vectorised engine and a strong lakehouse query path (including direct Iceberg/Hudi/Delta access) plus materialised views for acceleration. Maturing quickly.

Why here: Compelling for fast lakehouse queries and BI acceleration; trial it against Trino and ClickHouse for your specific query mix.

Assess · 24. Apache Pinot

A real-time OLAP store designed for ultra-low-latency, user-facing analytics at high concurrency, with pluggable indexes and streaming ingestion. Capable but operationally specialised.

Why here: Assess it for user-facing, high-QPS analytics; the niche is real but the operating model is demanding relative to the alternatives.

Hold · 25. Presto

The original Facebook-born query engine (PrestoDB). The bulk of community momentum and connector development has moved to the Trino fork, leaving PrestoDB comparatively quiet.

Why here: For new deployments choose Trino; keep existing PrestoDB clusters running and plan a migration.