DataSet

DataSet manifests describe derived tables built using SQL or PRQL (Pipelined Relational Query Language). Datasets typically select from DataSource views and may depend on other datasets.

apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: sales_summary
spec:
  query: |
    SELECT ...
  dependencies:
    - sales_csv
    - exchange_rates

Fields:

  • spec.query – SQL query, either inline or loaded from an external file. Required if spec.prql and spec.source are not set.
  • spec.prql – PRQL query, either inline or loaded from an external file. When set, it takes precedence over spec.query. The PRQL is sent directly to the query engine, which compiles it via the prql extension.
  • spec.source – direct pass-through to a DataSource without transformation. Mutually exclusive with query and prql.
  • spec.dependencies – optional list of dependencies on DataSources or other DataSets. Entries can be string names or inline DataSource definitions; inline definitions are referenced via the @inline(N) syntax in queries.

For complex queries, you can store them in separate .sql or .prql files and reference them using the $file syntax. This keeps your YAML manifests clean and enables syntax highlighting in your editor.

apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: sales_summary
spec:
  query:
    $file: ./queries/sales_summary.sql
  dependencies:
    - sales_csv

The same pattern works for PRQL files:

apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: customer_orders
spec:
  prql:
    $file: ./queries/customer_orders.prql
  dependencies:
    - orders
    - customers

Path resolution for $file references:

  • Paths are resolved relative to the manifest file containing the $file reference
  • Both ./relative/path.sql and relative/path.sql formats are supported
  • Absolute paths are also supported but discouraged, since they make projects less portable
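
For example, given the sales_summary manifest above, these two references resolve to the same file (an illustrative sketch of the two equivalent relative forms):

spec:
  query:
    $file: ./queries/sales_summary.sql    # explicit relative form
    # equivalent: $file: queries/sales_summary.sql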

External query files are fully integrated with bino’s caching and hot-reload system:

  • Cache invalidation: When an external SQL or PRQL file changes, the dataset cache is automatically invalidated
  • Hot-reload in preview: Changes to external query files trigger automatic refresh in bino preview mode
  • Dependency graph: External files appear in bino graph output for visibility
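
For example, while a preview session is running, editing a referenced query file is enough to refresh the affected dataset (a hypothetical session; actual output omitted):

bino preview &                                      # serve the report with hot-reload
echo "-- refine the aggregation" >> queries/sales_summary.sql
# the sales_summary dataset cache is invalidated and the preview refreshes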

A recommended project structure for external queries:

my-report/
├── bino.toml
├── manifests/
│   ├── datasources.yaml
│   └── datasets.yaml
├── queries/
│   ├── sales_by_region.sql
│   ├── monthly_revenue.sql
│   └── customer_analysis.prql
└── data/
    └── sales.csv
A complete example pairing DataSources with DataSets:

---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: sales_csv
spec:
  type: csv
  path: ./data/sales.csv
---
apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: sales_by_region
spec:
  query: |
    SELECT
      region,
      SUM(amount) AS total_amount
    FROM sales_csv
    GROUP BY region
  dependencies:
    - sales_csv
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: sales_fact
spec:
  type: parquet
  path: ./warehouse/fact_sales/*.parquet
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: currency_rates
spec:
  type: csv
  path: ./data/currency_rates.csv
---
apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: revenue_eur
spec:
  query: |
    SELECT
      f.region,
      f.booking_date,
      f.amount * r.rate_to_eur AS revenue_eur
    FROM sales_fact f
    JOIN currency_rates r
      ON r.currency = f.currency
     AND r.valid_on = f.booking_date;
  dependencies:
    - sales_fact
    - currency_rates

Datasets are referenced from layouts via their metadata.name using the dataset field of charts, tables, and text components.
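
For example, a chart component might reference sales_by_region by name (a minimal sketch; the exact component fields depend on the layout kind you use):

apiVersion: bino.bi/v1alpha1
kind: ChartStructure
metadata:
  name: regional_sales_chart
spec:
  dataset: sales_by_region   # the DataSet's metadata.name
  chartTitle: Sales by Region
  chartType: bar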

PRQL is a modern language for transforming data that compiles to SQL. It offers a more readable, pipeline-based syntax compared to traditional SQL.

When you specify spec.prql, bino automatically loads the PRQL extension and sends your PRQL query directly to the query engine for compilation and execution.

---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: sales_csv
spec:
  type: csv
  path: ./data/sales.csv
---
apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: sales_by_region
spec:
  prql: |
    from sales_csv
    filter amount > 0
    group {region} (
      aggregate {
        total_amount = sum amount,
        order_count = count this
      }
    )
    sort {-total_amount}
  dependencies:
    - sales_csv
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: orders
spec:
  type: csv
  path: ./data/orders.csv
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: customers
spec:
  type: csv
  path: ./data/customers.csv
---
apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: customer_orders
spec:
  prql: |
    from orders
    join customers (==customer_id)
    derive full_name = f"{customers.first_name} {customers.last_name}"
    select {
      order_id,
      full_name,
      order_date,
      total
    }
    sort {-order_date}
    take 100
  dependencies:
    - orders
    - customers

For more PRQL syntax and examples, see the PRQL documentation.

To work correctly with visualization components like Charts and Tables, your dataset queries should return rows that conform to the standard schema. While not all fields are required for every component, following this structure ensures proper aggregation, drill-down, and filtering behavior.

The schema supports four parallel sets of scenario columns for comparative analysis (e.g., Actual vs Plan).

Column     Description
ac1–ac4    Actual values (current measurements)
pp1–pp4    Previous Period values (Year-over-Year, Month-over-Month)
fc1–fc4    Forecast values (predictions)
pl1–pl4    Plan/Budget values (targets)

These fields determine how data is hierarchically organized and sorted.

Dimension     Index Column        Description
rowGroup      rowGroupIndex       Top-level row grouping (e.g., “Revenue”, “Costs”)
category      categoryIndex       Primary data dimension (e.g., “Product A”, “Region North”)
subCategory   subCategoryIndex    Detail dimension for drill-down (creates “thereOf” rows)
columnGroup   columnGroupIndex    Column-level grouping for breaking down measures

Note: When providing a grouping dimension string (e.g., rowGroup), you should likewise provide the corresponding index (e.g., rowGroupIndex) to ensure consistent sort ordering.

Column      Type      Description
date        string    Required for TimeCharts. ISO 8601 date (e.g., 2024-01-15). Used for time-series axes and normalization.
operation   string    Aggregation sign: '+' (add) or '-' (subtract). Defaults to '+'. Useful for P&L structures where costs subtract from totals.
setname     string    Dataset identifier. Used by some charts to distinguish multiple query results.

Here is an example SQL query producing a compliant dataset:

SELECT
  'Revenue' as rowGroup,
  1 as rowGroupIndex,
  product_name as category,
  product_rank as categoryIndex,
  '2024-01-15' as date,
  sum(sales_amount) as ac1,
  sum(budget_amount) as pl1
FROM sales_data
GROUP BY product_name, product_rank
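
The same shape can be produced in PRQL (a sketch, assuming the same sales_data columns as the SQL version above):

from sales_data
group {product_name, product_rank} (
  aggregate {
    ac1 = sum sales_amount,    # Actual values
    pl1 = sum budget_amount    # Plan/Budget values
  }
)
derive {
  rowGroup = "Revenue",
  rowGroupIndex = 1,
  category = product_name,
  categoryIndex = product_rank,
  date = "2024-01-15"
}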

bino can validate query results against the standard schema at build time. This catches common data issues early:

Check               Description
Type validation     String fields (category, rowGroup, etc.) must be strings
Type validation     Number fields (ac1, categoryIndex, etc.) must be numbers
Enum validation     operation must be "+" or "-"
Date format         date must be ISO 8601 format (YYYY-MM-DD)
Dependent required  rowGroup requires rowGroupIndex, category requires categoryIndex, etc.

Data validation is enabled by default in warn mode. Configure via CLI flags:

# Log warnings and continue (default)
bino build --data-validation=warn

# Treat validation errors as fatal
bino build --data-validation=fail

# Skip validation entirely
bino build --data-validation=off

For lint-only validation (without building), use --execute-queries:

bino lint --execute-queries

For efficiency, bino validates only the first N rows per dataset. Configure via environment variable:

BNR_DATA_VALIDATION_SAMPLE_SIZE=100 bino build

Default sample size is 10 rows.

DataSet documents support metadata.constraints to conditionally include them for specific artefacts, modes, or environments.

Include debug or diagnostic datasets only during development:

apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: debug_metrics
  constraints:
    - mode==preview
spec:
  query: |
    SELECT * FROM raw_data LIMIT 100
  dependencies:
    - raw_data

Use different datasets for different output formats:

# Detailed dataset for print (A4/Letter)
apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: sales_summary
  constraints:
    - spec.format in [a4,letter]
spec:
  query: |
    SELECT region, product, month, revenue, costs, margin
    FROM sales_fact
    GROUP BY region, product, month

---
# Summarized dataset for screen (XGA)
apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: sales_summary
  constraints:
    - field: spec.format
      operator: "=="
      value: xga
spec:
  query: |
    SELECT region, SUM(revenue) as total_revenue
    FROM sales_fact
    GROUP BY region

For the full constraint syntax and operators, see Constraints and Scoped Names.

Instead of defining DataSets as separate documents, you can define them inline directly within components. This is useful for simple, component-specific data transformations.

apiVersion: bino.bi/v1alpha1
kind: ChartStructure
metadata:
  name: sales_chart
spec:
  dataset:
    query: |
      SELECT region, SUM(amount) as total
      FROM sales_csv
      GROUP BY region
    dependencies:
      - sales_csv
  chartTitle: Sales by Region
  chartType: bar
  # ... chart configuration

You can nest inline DataSource definitions within an inline DataSet. Use the @inline(N) syntax to reference them by index:

apiVersion: bino.bi/v1alpha1
kind: ChartStructure
metadata:
  name: quick_chart
spec:
  dataset:
    dependencies:
      - type: csv
        path: ./data/sales.csv
    query: |
      SELECT region, SUM(amount) as total
      FROM @inline(0)
      GROUP BY region
  chartTitle: Sales Overview
  # ... chart configuration

The @inline(0) reference points to the first item in the dependencies array (0-indexed). When you have multiple inline DataSources:

spec:
  dataset:
    dependencies:
      - type: csv
        path: ./data/orders.csv
      - type: csv
        path: ./data/customers.csv
    query: |
      SELECT c.name, COUNT(*) as order_count
      FROM @inline(0) o
      JOIN @inline(1) c ON o.customer_id = c.id
      GROUP BY c.name

You can mix inline DataSource definitions with named references:

spec:
  dataset:
    dependencies:
      - sales_csv # Named reference to existing DataSource
      - type: csv # Inline definition
        path: ./data/rates.csv
    query: |
      SELECT s.*, r.rate
      FROM sales_csv s
      JOIN @inline(1) r ON s.currency = r.code

Note: Named references use their document name directly in SQL, while inline definitions use @inline(N).

For simple cases where you want to use a DataSource directly without transformation, use the source field:

apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: sales_passthrough
spec:
  source: sales_csv

This is equivalent to SELECT * FROM sales_csv but more explicit and efficient. The source field is mutually exclusive with query and prql.
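
For reference, the passthrough above behaves like this explicit form, minus the extra SELECT (illustrative):

apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: sales_passthrough
spec:
  query: |
    SELECT * FROM sales_csv
  dependencies:
    - sales_csv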

You can also use inline DataSource definitions with source:

apiVersion: bino.bi/v1alpha1
kind: Table
metadata:
  name: raw_data_table
spec:
  dataset:
    source:
      type: csv
      path: ./data/report.csv
  tableTitle: Raw Data
  # ... table configuration

When bino processes inline definitions:

  1. Materialization: Inline definitions are converted to synthetic documents during YAML loading
  2. Hash-based naming: Generated documents get unique names like _inline_datasource_a1b2c3d4
  3. Deduplication: Identical inline definitions share the same generated document
  4. Labels: Generated documents have bino.bi/generated: "true" and bino.bi/inline: "true" labels

This means inline definitions are fully compatible with caching, the dependency graph, and all other bino features.
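
Conceptually, the inline CSV definition from the earlier chart example might materialize into a synthetic document like this (illustrative; the hash suffix is derived from the definition’s content, and the labels are assumed to sit under metadata):

apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: _inline_datasource_a1b2c3d4   # hash-based generated name
  labels:
    bino.bi/generated: "true"
    bino.bi/inline: "true"
spec:
  type: csv
  path: ./data/sales.csv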

User-defined document names must not start with _inline_, as this prefix is reserved for generated inline definitions. The lint rule inline-naming-conflict enforces this.
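
For example, a manifest like this (hypothetical) would be flagged:

apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: _inline_rates   # invalid: the _inline_ prefix is reserved
spec:
  type: csv
  path: ./data/rates.csv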