DataSet

DataSet manifests describe derived tables built using SQL or PRQL (Pipelined Relational Query Language). Datasets typically select from DataSource views and may depend on other datasets.

apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: sales_summary
spec:
  query: |
    SELECT ...
  dependencies:
    - sales_csv
    - exchange_rates

Fields:

  • spec.query – SQL query, either inline or loaded from an external file. Required if spec.prql and spec.source are not set.
  • spec.prql – PRQL query, either inline or loaded from an external file. When set, it takes precedence over spec.query. The PRQL is sent directly to the query engine, which compiles it via the prql extension.
  • spec.source – direct pass-through to a DataSource without transformation. Mutually exclusive with query and prql.
  • spec.dependencies – optional list of dependencies on DataSources or other DataSets. Entries can be string names or inline DataSource definitions; inline definitions are referenced via the @inline(N) syntax in queries.

For complex queries, you can store them in separate .sql or .prql files and reference them using the $file syntax. This keeps your YAML manifests clean and enables syntax highlighting in your editor.

apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: sales_summary
spec:
  query:
    $file: ./queries/sales_summary.sql
  dependencies:
    - sales_csv

The same pattern works for PRQL files:

apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: customer_orders
spec:
  prql:
    $file: ./queries/customer_orders.prql
  dependencies:
    - orders
    - customers

Path resolution for $file references:

  • Paths are resolved relative to the manifest file containing the $file reference
  • Both ./relative/path.sql and relative/path.sql formats are supported
  • Absolute paths are also supported but discouraged, since they make projects less portable
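
For example, given the sales_summary manifest above, these two references resolve to the same file (an illustrative sketch of the two equivalent relative forms):

spec:
  query:
    $file: ./queries/sales_summary.sql    # explicit relative form
    # equivalent: $file: queries/sales_summary.sql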

External query files are fully integrated with bino’s caching and hot-reload system:

  • Cache invalidation: When an external SQL or PRQL file changes, the dataset cache is automatically invalidated
  • Hot-reload in preview: Changes to external query files trigger automatic refresh in bino preview mode
  • Dependency graph: External files appear in bino graph output for visibility
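
For example, while a preview session is running, editing a referenced query file is enough to refresh the affected dataset (a hypothetical session; actual output omitted):

bino preview &                                      # serve the report with hot-reload
echo "-- refine the aggregation" >> queries/sales_summary.sql
# the sales_summary dataset cache is invalidated and the preview refreshes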

A recommended project structure for external queries:

my-report/
├── bino.toml
├── manifests/
│   ├── datasources.yaml
│   └── datasets.yaml
├── queries/
│   ├── sales_by_region.sql
│   ├── monthly_revenue.sql
│   └── customer_analysis.prql
└── data/
    └── sales.csv
A complete example pairing DataSources with DataSets:

---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: sales_csv
spec:
  type: csv
  path: ./data/sales.csv
---
apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: sales_by_region
spec:
  query: |
    SELECT
      region,
      SUM(amount) AS total_amount
    FROM sales_csv
    GROUP BY region
  dependencies:
    - sales_csv
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: sales_fact
spec:
  type: parquet
  path: ./warehouse/fact_sales/*.parquet
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: currency_rates
spec:
  type: csv
  path: ./data/currency_rates.csv
---
apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: revenue_eur
spec:
  query: |
    SELECT
      f.region,
      f.booking_date,
      f.amount * r.rate_to_eur AS revenue_eur
    FROM sales_fact f
    JOIN currency_rates r
      ON r.currency = f.currency
     AND r.valid_on = f.booking_date;
  dependencies:
    - sales_fact
    - currency_rates

Datasets are referenced from layouts via their metadata.name using the dataset field of charts, tables, and text components.
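
For example, a chart component might reference sales_by_region by name (a minimal sketch; the exact component fields depend on the layout kind you use):

apiVersion: bino.bi/v1alpha1
kind: ChartStructure
metadata:
  name: regional_sales_chart
spec:
  dataset: sales_by_region   # the DataSet's metadata.name
  chartTitle: Sales by Region
  chartType: bar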

PRQL is a modern language for transforming data that compiles to SQL. It offers a more readable, pipeline-based syntax compared to traditional SQL.

When you specify spec.prql, bino automatically loads the PRQL extension and sends your PRQL query directly to the query engine for compilation and execution.

---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: sales_csv
spec:
  type: csv
  path: ./data/sales.csv
---
apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: sales_by_region
spec:
  prql: |
    from sales_csv
    filter amount > 0
    group {region} (
      aggregate {
        total_amount = sum amount,
        order_count = count this
      }
    )
    sort {-total_amount}
  dependencies:
    - sales_csv
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: orders
spec:
  type: csv
  path: ./data/orders.csv
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: customers
spec:
  type: csv
  path: ./data/customers.csv
---
apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: customer_orders
spec:
  prql: |
    from orders
    join customers (==customer_id)
    derive full_name = f"{customers.first_name} {customers.last_name}"
    select {
      order_id,
      full_name,
      order_date,
      total
    }
    sort {-order_date}
    take 100
  dependencies:
    - orders
    - customers

For more PRQL syntax and examples, see the PRQL documentation.

To work correctly with visualization components like Charts and Tables, your dataset queries should return rows that conform to the standard schema. While not all fields are required for every component, following this structure ensures proper aggregation, drill-down, and filtering behavior.

The schema supports four parallel sets of scenario columns for comparative analysis (e.g., Actual vs Plan).

Column     Description
ac1–ac4    Actual values (current measurements)
pp1–pp4    Previous Period values (Year-over-Year, Month-over-Month)
fc1–fc4    Forecast values (predictions)
pl1–pl4    Plan/Budget values (targets)

These fields determine how data is hierarchically organized and sorted.

Dimension     Index Column        Description
rowGroup      rowGroupIndex       Top-level row grouping (e.g., “Revenue”, “Costs”)
category      categoryIndex       Primary data dimension (e.g., “Product A”, “Region North”)
subCategory   subCategoryIndex    Detail dimension for drill-down (creates “thereOf” rows)
columnGroup   columnGroupIndex    Column-level grouping for breaking down measures

Note: When providing a grouping dimension string (e.g., rowGroup), you should likewise provide the corresponding index (e.g., rowGroupIndex) to ensure consistent sort ordering.

Column      Type      Description
date        string    Required for TimeCharts. ISO 8601 date (e.g., 2024-01-15). Used for time-series axes and normalization.
operation   string    Aggregation sign: '+' (add) or '-' (subtract). Defaults to '+'. Useful for P&L structures where costs subtract from totals.
setname     string    Dataset identifier. Used by some charts to distinguish multiple query results.

Here is an example SQL query producing a compliant dataset:

SELECT
  'Revenue' as rowGroup,
  1 as rowGroupIndex,
  product_name as category,
  product_rank as categoryIndex,
  '2024-01-15' as date,
  sum(sales_amount) as ac1,
  sum(budget_amount) as pl1
FROM sales_data
GROUP BY product_name, product_rank
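
The same shape can be produced in PRQL (a sketch, assuming the same sales_data columns as the SQL version above):

from sales_data
group {product_name, product_rank} (
  aggregate {
    ac1 = sum sales_amount,    # Actual values
    pl1 = sum budget_amount    # Plan/Budget values
  }
)
derive {
  rowGroup = "Revenue",
  rowGroupIndex = 1,
  category = product_name,
  categoryIndex = product_rank,
  date = "2024-01-15"
}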

bino can validate query results against the standard schema at build time. This catches common data issues early:

Check               Description
Type validation     String fields (category, rowGroup, etc.) must be strings
Type validation     Number fields (ac1, categoryIndex, etc.) must be numbers
Enum validation     operation must be "+" or "-"
Date format         date must be ISO 8601 format (YYYY-MM-DD)
Dependent required  rowGroup requires rowGroupIndex, category requires categoryIndex, etc.

Data validation is enabled by default in warn mode. Configure via CLI flags:

# Log warnings and continue (default)
bino build --data-validation=warn

# Treat validation errors as fatal
bino build --data-validation=fail

# Skip validation entirely
bino build --data-validation=off

For lint-only validation (without building), use --execute-queries:

bino lint --execute-queries

For efficiency, bino validates only the first N rows per dataset. Configure via environment variable:

BNR_DATA_VALIDATION_SAMPLE_SIZE=100 bino build

Default sample size is 10 rows.

DataSet documents support metadata.constraints to conditionally include them for specific artefacts, modes, or environments.

Include debug or diagnostic datasets only during development:

apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: debug_metrics
  constraints:
    - mode==preview
spec:
  query: |
    SELECT * FROM raw_data LIMIT 100
  dependencies:
    - raw_data

Use different datasets for different output formats:

# Detailed dataset for print (A4/Letter)
apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: sales_summary
  constraints:
    - spec.format in [a4,letter]
spec:
  query: |
    SELECT region, product, month, revenue, costs, margin
    FROM sales_fact
    GROUP BY region, product, month

---
# Summarized dataset for screen (XGA)
apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: sales_summary
  constraints:
    - field: spec.format
      operator: "=="
      value: xga
spec:
  query: |
    SELECT region, SUM(revenue) as total_revenue
    FROM sales_fact
    GROUP BY region

For the full constraint syntax and operators, see Constraints and Scoped Names.

Instead of defining DataSets as separate documents, you can define them inline directly within components. This is useful for simple, component-specific data transformations.

apiVersion: bino.bi/v1alpha1
kind: ChartStructure
metadata:
  name: sales_chart
spec:
  dataset:
    query: |
      SELECT region, SUM(amount) as total
      FROM sales_csv
      GROUP BY region
    dependencies:
      - sales_csv
  chartTitle: Sales by Region
  chartType: bar
  # ... chart configuration

You can nest inline DataSource definitions within an inline DataSet. Use the @inline(N) syntax to reference them by index:

apiVersion: bino.bi/v1alpha1
kind: ChartStructure
metadata:
  name: quick_chart
spec:
  dataset:
    dependencies:
      - type: csv
        path: ./data/sales.csv
    query: |
      SELECT region, SUM(amount) as total
      FROM @inline(0)
      GROUP BY region
  chartTitle: Sales Overview
  # ... chart configuration

The @inline(0) reference points to the first item in the dependencies array (0-indexed). When you have multiple inline DataSources:

spec:
  dataset:
    dependencies:
      - type: csv
        path: ./data/orders.csv
      - type: csv
        path: ./data/customers.csv
    query: |
      SELECT c.name, COUNT(*) as order_count
      FROM @inline(0) o
      JOIN @inline(1) c ON o.customer_id = c.id
      GROUP BY c.name

You can mix inline DataSource definitions with named references:

spec:
  dataset:
    dependencies:
      - sales_csv # Named reference to existing DataSource
      - type: csv # Inline definition
        path: ./data/rates.csv
    query: |
      SELECT s.*, r.rate
      FROM sales_csv s
      JOIN @inline(1) r ON s.currency = r.code

Note: Named references use their document name directly in SQL, while inline definitions use @inline(N).

For simple cases where you want to use a DataSource directly without transformation, use the source field:

apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: sales_passthrough
spec:
  source: sales_csv

This is equivalent to SELECT * FROM sales_csv but more explicit and efficient. The source field is mutually exclusive with query and prql.
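
For reference, the passthrough above behaves like this explicit form, minus the extra SELECT (illustrative):

apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: sales_passthrough
spec:
  query: |
    SELECT * FROM sales_csv
  dependencies:
    - sales_csv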

You can also use inline DataSource definitions with source:

apiVersion: bino.bi/v1alpha1
kind: Table
metadata:
  name: raw_data_table
spec:
  dataset:
    source:
      type: csv
      path: ./data/report.csv
  tableTitle: Raw Data
  # ... table configuration

When bino processes inline definitions:

  1. Materialization: Inline definitions are converted to synthetic documents during YAML loading
  2. Hash-based naming: Generated documents get unique names like _inline_datasource_a1b2c3d4
  3. Deduplication: Identical inline definitions share the same generated document
  4. Labels: Generated documents have bino.bi/generated: "true" and bino.bi/inline: "true" labels

This means inline definitions are fully compatible with caching, the dependency graph, and all other bino features.
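
Conceptually, the inline CSV definition from the earlier chart example might materialize into a synthetic document like this (illustrative; the hash suffix is derived from the definition’s content, and the labels are assumed to sit under metadata):

apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: _inline_datasource_a1b2c3d4   # hash-based generated name
  labels:
    bino.bi/generated: "true"
    bino.bi/inline: "true"
spec:
  type: csv
  path: ./data/sales.csv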

User-defined document names must not start with _inline_, as this prefix is reserved for generated inline definitions. The lint rule inline-naming-conflict enforces this.
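
For example, a manifest like this (hypothetical) would be flagged:

apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: _inline_rates   # invalid: the _inline_ prefix is reserved
spec:
  type: csv
  path: ./data/rates.csv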