{
  "intro": "Strata is a vendor-neutral technology radar for data engineering: a working catalog of the engines, storage and table formats, transformation and orchestration tools, and query and serving systems that teams use to move and model data at scale. We are not tied to any single platform — open-source and managed offerings sit side by side, batch next to streaming, mature defaults next to promising newcomers. Every entry is placed in one of four quadrants and one of four maturity rings. The aim is to give a data engineer a fast, honest bearing: what you can already build on, what is worth piloting, what to keep an eye on, and what to migrate away from.",
  "methodology": "A ring reflects not abstract \"quality\" but how ready a technology is for responsible adoption: how well proven it is under real production load, how predictable it is to operate, how stable its interfaces are, and how well it fits the surrounding stack. We weigh maturity and support, availability (open-source vs managed), reproducible benchmarks, total cost of ownership, and the health of the surrounding ecosystem. The radar is deliberately vendor-neutral: neighbours in the same ring are alternatives to compare, not a better/worse ranking. Placement reflects the state of the field on the revision date and is revisited as releases ship, projects change status, and operational experience accumulates; stalled or superseded tools drift to Hold, and ones that prove themselves are promoted inward.",
  "title": "Strata",
  "tagline": "Mapping the maturity of the modern data stack.",
  "ring_defs": [
    {
      "ring": "Adopt",
      "def": "A mature default. Proven under real production load, predictable to operate, with stable interfaces — safe to standardise on for new and existing systems."
    },
    {
      "ring": "Trial",
      "def": "Ready for pilots. Production-viable and clearly valuable, but still demands real adoption effort, profile-specific tuning, or is maturing — trial it on bounded workloads with good checks before it becomes a default."
    },
    {
      "ring": "Assess",
      "def": "Worth watching. A promising direction with a fast-moving ecosystem or a fresh release; operational and governance patterns are still forming — explore it in isolated scenarios, not yet in critical paths."
    },
    {
      "ring": "Hold",
      "def": "Hold off. A stalled, wound-down, or superseded option. Not a sensible pick for new systems; keep existing deployments running and migrate off in an orderly way."
    }
  ],
  "quadrants": [
    {
      "quadrant": "Engines",
      "entries": [
        {
          "name": "Apache Spark",
          "ring": "Adopt",
          "description": "The mature distributed engine for large-scale batch and streaming, with DataFrame, SQL, and structured-streaming APIs. Spark 4.x adds Spark Connect (a thin client/server protocol) and broad ANSI-SQL alignment, and it remains the backbone of most managed lakehouse platforms.",
          "rationale": "A proven, predictable default for petabyte-scale ETL and analytics, with the widest ecosystem and operational track record."
        },
        {
          "name": "Apache Flink",
          "ring": "Adopt",
          "description": "A true stateful stream processor with event-time semantics, exactly-once state, and a mature SQL layer. It is the standard for low-latency, event-driven pipelines and continuous ETL where correctness under out-of-order data matters.",
          "rationale": "The reference engine for real streaming — battle-tested at scale and unmatched for event-time, stateful workloads."
        },
        {
          "name": "DuckDB",
          "ring": "Adopt",
          "description": "An in-process columnar OLAP engine — \"SQLite for analytics\" — that reads Parquet, Arrow, and lakehouse tables directly and runs vectorised queries on a single node. The 1.x line has a stable storage format and an established extension ecosystem.",
          "rationale": "The default for single-node analytics and embedded query: zero-ops, fast, and now stable enough to standardise on."
        },
        {
          "name": "Polars",
          "ring": "Trial",
          "description": "A Rust DataFrame library with a lazy, query-optimised API and strong multi-threaded performance on a single machine; the 1.x API is stable and Python adoption is climbing fast. A distributed cloud engine is still maturing.",
          "rationale": "Excellent for single-node DataFrame work and a strong pandas alternative — trial it where the in-memory model fits before standardising."
        },
        {
          "name": "Ray",
          "ring": "Trial",
          "description": "A distributed runtime for Python that spans data loading, training, and serving, with Ray Data for streaming batch inference and large-scale transforms. Powerful for ML-adjacent data work, but operating a cluster is a real commitment.",
          "rationale": "Compelling where data and ML pipelines converge; pilot it with clear ownership rather than adopting it as a general-purpose ETL engine."
        },
        {
          "name": "Daft",
          "ring": "Assess",
          "description": "A Rust-backed distributed DataFrame aimed at multimodal data (text, images, embeddings, URLs) alongside tabular columns, with native lakehouse and object-store integration. Young but advancing quickly.",
          "rationale": "A promising take on multimodal and ML-feature pipelines; assess it on bounded workloads while the ecosystem settles."
        },
        {
          "name": "Hadoop MapReduce",
          "ring": "Hold",
          "description": "The original disk-based batch model of the Hadoop era. Superseded for new work by in-memory and vectorised engines, and effectively in maintenance — most platforms have moved compute off it entirely.",
          "rationale": "Don't start new pipelines on it; keep legacy jobs running and migrate them to Spark, Flink, or a warehouse engine."
        }
      ]
    },
    {
      "quadrant": "Storage & Formats",
      "entries": [
        {
          "name": "Apache Parquet",
          "ring": "Adopt",
          "description": "The de-facto columnar file format for analytics: efficient encodings, predicate push-down, and universal engine support. It underpins essentially every open table format in use today.",
          "rationale": "The safe, universal default for analytical storage on object stores — interoperable everywhere and unlikely to be displaced soon."
        },
        {
          "name": "Apache Arrow",
          "ring": "Adopt",
          "description": "A language-agnostic in-memory columnar standard plus zero-copy IPC and the Arrow Flight transport. It is the lingua franca that lets engines and libraries exchange data without serialisation overhead.",
          "rationale": "A foundational interchange layer baked into most modern tools — adopt it as the default for cross-engine and cross-language data movement."
        },
        {
          "name": "Apache Iceberg",
          "ring": "Adopt",
          "description": "An open table format with hidden partitioning, snapshot isolation, schema and partition evolution, and a REST catalog standard. Engine support is broad and most managed catalogs now speak Iceberg natively.",
          "rationale": "The table format with the widest, most neutral engine and catalog support — a sound default for a new open lakehouse."
        },
        {
          "name": "Delta Lake",
          "ring": "Trial",
          "description": "An ACID table format built around a transaction log, with strong support across the Spark ecosystem and a UniForm mode that exposes Iceberg-compatible metadata. Mature, though historically engine-centric.",
          "rationale": "A solid choice inside a Spark-centric stack; trial it against Iceberg where multi-engine neutrality and catalog choice are priorities."
        },
        {
          "name": "Apache Hudi",
          "ring": "Assess",
          "description": "A table format specialised for incremental ingestion, upserts, and change streams, with copy-on-write and merge-on-read modes. Capable, but the open lakehouse ecosystem is increasingly converging on Iceberg.",
          "rationale": "Assess it for upsert-heavy, CDC-style ingestion; weigh the convergence of catalogs and engines on Iceberg before committing."
        },
        {
          "name": "Lance",
          "ring": "Assess",
          "description": "A modern columnar format optimised for random access and vector search, with versioning and a fast path for embeddings and multimodal data. Young, with a fast-moving spec and tooling.",
          "rationale": "Worth assessing for embedding and feature stores where random access and vector queries dominate; not yet a general analytical format."
        }
      ]
    },
    {
      "quadrant": "Transform & Orchestration",
      "entries": [
        {
          "name": "dbt",
          "ring": "Adopt",
          "description": "The standard for SQL-based transformation: versioned models, tests, documentation, and lineage over your warehouse or lakehouse engine. A large ecosystem and well-understood operating model.",
          "rationale": "The default transformation layer for analytics engineering — proven, widely hired-for, and stable enough to build a practice on."
        },
        {
          "name": "Apache Airflow",
          "ring": "Adopt",
          "description": "The widely deployed Python-defined workflow scheduler for batch orchestration. Airflow 3.x modernises the scheduler, adds DAG versioning, and decouples task execution — a meaningful step up in operability.",
          "rationale": "A mature, ubiquitous orchestrator with deep integrations; a safe default where general-purpose batch scheduling is the need."
        },
        {
          "name": "Dagster",
          "ring": "Trial",
          "description": "An asset-oriented orchestrator that models data assets and their lineage rather than bare tasks, with strong typing, partitions, and built-in observability. Production-ready and growing.",
          "rationale": "A strong fit for asset- and lineage-first teams; trial it where data-asset semantics and testability matter more than raw task scheduling."
        },
        {
          "name": "Prefect",
          "ring": "Trial",
          "description": "A Pythonic orchestrator with a dynamic, code-first flow model and a managed control plane option. Pleasant developer experience and flexible, with a smaller ecosystem than Airflow.",
          "rationale": "Worth trialing where developer ergonomics and dynamic workflows win out; validate the operating model against your scale."
        },
        {
          "name": "SQLMesh",
          "ring": "Assess",
          "description": "A transformation framework with column-level lineage, virtual data environments, and automatic incremental/backfill handling — positioned as a more stateful alternative to plain SQL transforms. Promising and fast-moving.",
          "rationale": "Assess it where backfills and environment isolation are pain points; the ideas are strong but the ecosystem is still young."
        },
        {
          "name": "Apache Oozie",
          "ring": "Hold",
          "description": "The XML-configured workflow scheduler from the classic Hadoop stack. Effectively legacy, with little new investment and a poor fit for modern, code-defined pipelines.",
          "rationale": "Don't adopt for new work; migrate existing Oozie workflows to Airflow, Dagster, or a managed orchestrator."
        }
      ]
    },
    {
      "quadrant": "Query & Serving",
      "entries": [
        {
          "name": "Trino",
          "ring": "Adopt",
          "description": "A distributed SQL query engine for federated, interactive analytics across lakehouse tables and dozens of connectors, without moving data. Active development and broad production use.",
          "rationale": "The default for interactive, federated SQL over a lake — fast, well-supported, and the actively-maintained line of the PrestoSQL lineage."
        },
        {
          "name": "ClickHouse",
          "ring": "Adopt",
          "description": "A columnar OLAP database built for very fast aggregation over large tables, with strong ingestion throughput and a rich SQL dialect. Widely used for analytics and observability backends.",
          "rationale": "A proven default for high-throughput, low-latency analytical queries; predictable and operationally mature at scale."
        },
        {
          "name": "Apache Druid",
          "ring": "Trial",
          "description": "A real-time analytics database for high-concurrency, time-series and event queries, with streaming ingestion and sub-second slice-and-dice. Powerful, but the multi-process architecture is involved to operate.",
          "rationale": "A good fit for real-time dashboards over event streams; trial it where its operational footprint is justified by query latency needs."
        },
        {
          "name": "StarRocks",
          "ring": "Trial",
          "description": "An MPP analytical database with a vectorised engine and a strong lakehouse query path (including direct Iceberg/Hudi/Delta access) plus materialised views for acceleration. Maturing quickly.",
          "rationale": "Compelling for fast lakehouse queries and BI acceleration; trial it against Trino and ClickHouse for your specific query mix."
        },
        {
          "name": "Apache Pinot",
          "ring": "Assess",
          "description": "A real-time OLAP store designed for ultra-low-latency, user-facing analytics at high concurrency, with pluggable indexes and streaming ingestion. Capable but operationally specialised.",
          "rationale": "Assess it for user-facing, high-QPS analytics; the niche is real but the operating model is demanding relative to the alternatives."
        },
        {
          "name": "Presto",
          "ring": "Hold",
          "description": "The original Facebook-born query engine (PrestoDB). The bulk of community momentum and connector development has moved to the Trino fork, leaving PrestoDB comparatively quiet.",
          "rationale": "For new deployments choose Trino; keep existing PrestoDB clusters running and plan a migration."
        }
      ]
    }
  ]
}
