
DataSource

DataSource manifests describe how bino loads raw data into the query engine. Each datasource becomes a view named after metadata.name.

metadata.name for DataSource must match the sqlIdentifier pattern:

  • ^[a-z_][a-z0-9_]*$
  • Lowercase letters, digits, and underscores only
  • Must start with a letter or underscore

Use these names directly in DataSet.spec.query.
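The naming rule above can be checked with an ordinary regular expression. A minimal sketch in Python (the helper name is illustrative, not part of bino):

```python
import re

# The sqlIdentifier pattern quoted above: lowercase letters, digits, and
# underscores only, starting with a letter or underscore.
SQL_IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")

def is_valid_datasource_name(name: str) -> bool:
    """Return True if the name is usable as a DataSource view name."""
    return SQL_IDENTIFIER.fullmatch(name) is not None

print(is_valid_datasource_name("sales_csv"))   # valid
print(is_valid_datasource_name("Sales-CSV"))   # invalid: uppercase and hyphen
print(is_valid_datasource_name("2024_sales"))  # invalid: starts with a digit
```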

apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: sales_csv
spec:
  type: csv # inline | excel | csv | parquet | postgres_query | mysql_query
  inline: {} # for type: inline
  content: [] # alternative inline content
  path: ./data/*.csv # for file-based types
  connection: {} # for database queries
  query: "" # SQL for postgres_query / mysql_query
  ephemeral: false # optional caching hint
  sample: 1000 # optional row sampling (number, string, or object)

  # CSV reader options (type: csv only)
  delimiter: ";" # field delimiter
  header: true # first row is header
  skipRows: 0 # lines to skip before data
  thousands: "." # thousands separator
  decimalSeparator: "," # decimal point character
  dateFormat: "%d/%m/%Y" # date parsing format
  columnNames: [a, b, c] # explicit column names
  columns: # column name → DuckDB type
    amount: "DECIMAL(10,2)"

Type-specific rules (simplified from the schema):

  • type: inline – requires either inline (object with content) or content (array or JSON string).
  • type: excel | csv | parquet – requires path.
  • type: postgres_query | mysql_query – requires connection and query.

See the JSON schema for precise conditions.
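The three rules above can be sketched as a simple check. This is a hypothetical illustration of the simplified conditions, not bino's actual validation, which lives in the JSON schema:

```python
# Hypothetical sketch of the per-type requirements listed above.
def check_required_fields(spec: dict) -> list[str]:
    """Return missing-field errors for a DataSource spec."""
    t = spec.get("type")
    errors = []
    if t == "inline":
        # inline requires either `inline` or `content`
        if "inline" not in spec and "content" not in spec:
            errors.append("type: inline requires 'inline' or 'content'")
    elif t in ("excel", "csv", "parquet"):
        if "path" not in spec:
            errors.append(f"type: {t} requires 'path'")
    elif t in ("postgres_query", "mysql_query"):
        for field in ("connection", "query"):
            if field not in spec:
                errors.append(f"type: {t} requires '{field}'")
    return errors

print(check_required_fields({"type": "csv"}))  # missing 'path'
```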

---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: kpi_inline
spec:
  type: inline
  inline:
    content:
      - { label: "Revenue", value: 123.45 }
      - { label: "EBIT", value: 12.34 }
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: sales_daily
spec:
  type: csv
  path: ./data/sales_daily/*.csv

When the default auto-detection does not produce the right result, add CSV reader options to spec. Any option that is set switches bino from read_csv_auto to read_csv with explicit parameters.

| Field | Type | DuckDB parameter | Description |
| --- | --- | --- | --- |
| delimiter | string | delim | Field delimiter character (for example ";" or "\|"). |
| header | boolean | header | Whether the first row defines column names. Default true. |
| skipRows | integer | skip | Number of lines to skip before reading data. |
| thousands | string | thousands | Thousands separator in numeric values (for example "."). |
| decimalSeparator | string | decimal_separator | Decimal point character (for example ","). |
| dateFormat | string | dateformat | Date format using DuckDB strftime specifiers (for example "%d/%m/%Y"). |
| columns | object | columns | Map of column name to DuckDB type. Mutually exclusive with columnNames. |
| columnNames | string[] | names | Explicit column names. Mutually exclusive with columns. |
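As a rough illustration of the option-to-parameter mapping above, the following sketch assembles a read_csv() call string. The field-to-parameter names come from the documented options; the assembly function itself is hypothetical, not bino's actual code, and omits columns/columnNames for brevity:

```python
# Field name in spec -> DuckDB read_csv parameter (per the options above).
PARAM_NAMES = {
    "delimiter": "delim",
    "header": "header",
    "skipRows": "skip",
    "thousands": "thousands",
    "decimalSeparator": "decimal_separator",
    "dateFormat": "dateformat",
}

def build_read_csv(path: str, spec: dict) -> str:
    """Build an illustrative read_csv()/read_csv_auto() call from a spec."""
    params = [f"'{path}'"]
    for field, param in PARAM_NAMES.items():
        if field in spec:
            value = spec[field]
            if isinstance(value, bool):
                params.append(f"{param}={str(value).lower()}")
            else:
                params.append(f"{param}='{value}'")
    if len(params) == 1:
        # No options set: auto-detection stays in effect.
        return f"read_csv_auto('{path}')"
    return f"read_csv({', '.join(params)})"

print(build_read_csv("./data/eu_sales.csv", {"delimiter": ";", "decimalSeparator": ","}))
```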
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: eu_sales
spec:
  type: csv
  path: ./data/eu_sales.csv
  delimiter: ";"
  thousands: "."
  decimalSeparator: ","
  dateFormat: "%d/%m/%Y"
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: sensor_data
spec:
  type: csv
  path: ./data/sensors.csv
  header: false
  columns:
    ts: "TIMESTAMP"
    device_id: "INTEGER"
    reading: "DECIMAL(8,3)"
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: raw_export
spec:
  type: csv
  path: ./data/export.csv
  header: false
  columnNames: [date, region, amount]
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: fact_sales_parquet
spec:
  type: parquet
  path: ./warehouse/fact_sales/*.parquet
  ephemeral: false # allow caching between builds
---
apiVersion: bino.bi/v1alpha1
kind: ConnectionSecret
metadata:
  name: postgresCredentials
spec:
  type: postgres
  postgres:
    passwordFromEnv: POSTGRES_PASSWORD
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: sales_from_postgres
spec:
  type: postgres_query
  connection:
    host: ${DB_HOST:db.example.com}
    port: 5432
    database: analytics
    schema: public
    user: reporting
    secret: postgresCredentials
  query: |
    SELECT *
    FROM fact_sales
    WHERE booking_date >= DATE '2024-01-01';
---
apiVersion: bino.bi/v1alpha1
kind: ConnectionSecret
metadata:
  name: mysqlCredentials
spec:
  type: mysql
  mysql:
    passwordFromEnv: MYSQL_PASSWORD
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: sales_from_mysql
spec:
  type: mysql_query
  connection:
    host: ${DB_HOST:db.example.com}
    port: 3306
    database: analytics
    user: reporting
    secret: mysqlCredentials
  query: |
    SELECT * FROM fact_sales WHERE year = 2024;

For more on secrets and object storage, see ConnectionSecret.

The sample property lets you load only a subset of rows from a datasource using DuckDB’s USING SAMPLE clause. This is ideal for working with large datasets during preview and development, where you need fast iteration without waiting for millions of rows to load.

sample accepts three forms:

| Form | Example | DuckDB clause |
| --- | --- | --- |
| number | sample: 1000 | USING SAMPLE 1000 |
| string | sample: "10%" | USING SAMPLE 10% |
| object | sample: { size: 1000, method: reservoir } | USING SAMPLE 1000 (reservoir) |

When using the object form you can specify a method:

  • bernoulli – evaluates each row independently with the given probability. Good accuracy, works well for most datasets.
  • system – samples entire vector chunks. Faster than bernoulli but higher variance, not recommended for small datasets (< 10k rows).
  • reservoir – returns an exact number of rows. Only method that guarantees an exact count.

When no method is specified, DuckDB uses its default (system for percentages, reservoir for row counts).
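The mapping from the three sample forms to the generated clause can be sketched as follows. The helper is illustrative, not bino's implementation:

```python
def sample_clause(sample) -> str:
    """Translate a spec.sample value into a DuckDB USING SAMPLE clause."""
    if isinstance(sample, dict):
        # Object form: size plus an optional method.
        size = sample["size"]
        method = sample.get("method")
        if method:
            return f"USING SAMPLE {size} ({method})"
        return f"USING SAMPLE {size}"
    # Number or string form is emitted as-is.
    return f"USING SAMPLE {sample}"

print(sample_clause(1000))                                   # USING SAMPLE 1000
print(sample_clause("10%"))                                  # USING SAMPLE 10%
print(sample_clause({"size": 1000, "method": "reservoir"}))  # USING SAMPLE 1000 (reservoir)
```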

---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: fact_sales
spec:
  type: parquet
  path: ./warehouse/fact_sales/*.parquet
  sample: 5000 # load only 5000 rows
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: events
spec:
  type: csv
  path: ./data/events_2024.csv
  sample: "10%" # load roughly 10% of rows
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: transactions
spec:
  type: postgres_query
  connection:
    host: ${DB_HOST}
    port: 5432
    database: analytics
    user: reporting
    secret: pgCredentials
  query: |
    SELECT * FROM transactions WHERE year = 2024
  sample:
    size: 10000
    method: reservoir # guarantees exactly 10000 rows

Sampling with constraints for development only


Combine sample with constraints so the sampled datasource is used during preview and development, while the full dataset is used for production builds:

# Development: sampled data for fast iteration
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: fact_sales
  constraints:
    - mode!=build
spec:
  type: parquet
  path: ./warehouse/fact_sales/*.parquet
  sample: "5%"

# Production: full dataset
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: fact_sales
  constraints:
    - mode==build
spec:
  type: parquet
  path: ./warehouse/fact_sales/*.parquet

This pattern keeps bino preview fast and responsive while ensuring bino build always produces reports from the complete dataset.

DataSource documents support metadata.constraints to conditionally include them for specific artefacts, modes, or environments.

Use different data sources for development vs production:

# Mock data for development
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: sales
  constraints:
    - labels.env==dev
spec:
  type: inline
  content:
    - { region: "Test", amount: 100 }

---
# Production database
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: sales
  constraints:
    - labels.env==prod
spec:
  type: postgres_query
  connection:
    host: prod-db.example.com
    # ...

Match multiple environments at once using either format:

# String format
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: staging_data
  constraints:
    - labels.env in [dev,staging,qa]
spec:
  type: postgres_query
  connection:
    host: staging-db.example.com

---
# Structured format (IDE-friendly)
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: staging_data
  constraints:
    - field: labels.env
      operator: in
      value: [dev, staging, qa]
spec:
  type: postgres_query
  connection:
    host: staging-db.example.com

For the full constraint syntax and operators, see Constraints and Scoped Names.

DataSources can also be defined inline within DataSet dependencies arrays, eliminating the need for separate documents. This is useful for simple, single-use data sources.

apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: sales_summary
spec:
  dependencies:
    - type: csv
      path: ./data/sales.csv
  query: |
    SELECT region, SUM(amount) as total
    FROM @inline(0)
    GROUP BY region

The @inline(0) syntax references the inline DataSource by its position (0-indexed) in the dependencies array.
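Conceptually, each @inline(N) reference resolves to the dependency at index N. A minimal sketch of that positional substitution (the view names passed in are hypothetical; bino's actual rewriting may differ):

```python
import re

def resolve_inline_refs(query: str, view_names: list[str]) -> str:
    """Replace each @inline(N) with the view name at index N."""
    return re.sub(
        r"@inline\((\d+)\)",
        lambda m: view_names[int(m.group(1))],
        query,
    )

query = "SELECT region, SUM(amount) FROM @inline(0) GROUP BY region"
print(resolve_inline_refs(query, ["sales_view"]))
```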

All DataSource types can be used inline:

# CSV
- type: csv
  path: ./data/sales.csv

# Excel
- type: excel
  path: ./data/report.xlsx

# Parquet
- type: parquet
  path: ./warehouse/*.parquet

# Inline data
- type: inline
  content:
    - { region: "US", amount: 100 }
    - { region: "EU", amount: 200 }

Reference multiple inline DataSources by their index:

spec:
  dependencies:
    - type: csv
      path: ./data/orders.csv
    - type: csv
      path: ./data/customers.csv
  query: |
    SELECT c.name, o.total
    FROM @inline(0) o
    JOIN @inline(1) c ON o.customer_id = c.id

You can combine inline definitions with references to standalone DataSource documents:

spec:
  dependencies:
    - existing_datasource     # Named reference
    - type: csv               # Inline definition
      path: ./data/extra.csv
  query: |
    SELECT * FROM existing_datasource
    UNION ALL
    SELECT * FROM @inline(1)

For more details on inline definitions and the @inline(N) syntax, see DataSet - Inline DataSet definitions.

All document kinds share these metadata fields.

| Attribute | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| apiVersion | string | yes | | Must be bino.bi/v1alpha1. |
| kind | string | yes | | Must be DataSource. |
| metadata.name | string | yes | | Unique identifier. For DataSource must match ^[a-z_][a-z0-9_]*$ (SQL identifier). |
| metadata.labels | object | no | | Key-value pairs for categorization and constraint matching. |
| metadata.annotations | object | no | | Arbitrary key-value metadata, not used by the system. |
| metadata.description | string | no | | Free-form description. |
| metadata.constraints | array | no | | Conditional inclusion rules. See Constraints. |
| Attribute | Type | Required | Default | Description | Sample |
| --- | --- | --- | --- | --- | --- |
| spec.type | string | yes | | Data source type. Values: inline, excel, csv, parquet, postgres_query, mysql_query. | type: csv |
| spec.inline | object | conditional | | Inline data container. Required when type: inline and content is not set. | see below |
| spec.content | array or string | conditional | | Inline data as array or JSON string. Required when type: inline and inline is not set. | see below |
| spec.path | string | conditional | | File path, directory, or glob pattern. Required when type is excel, csv, or parquet. | path: ./data/*.csv |
| spec.connection | object | conditional | | Database connection details. Required when type is postgres_query or mysql_query. | see below |
| spec.query | string | conditional | | SQL query string. Required when type is postgres_query or mysql_query. | query: SELECT * FROM sales |
| spec.ephemeral | boolean | no | varies | Controls caching. Defaults depend on source type. | ephemeral: false |
| spec.sample | number, string, or object | no | | Row sampling via DuckDB USING SAMPLE. | sample: 1000 |
| spec.delimiter | string | no | auto | CSV field delimiter (max 4 chars). | delimiter: ";" |
| spec.header | boolean | no | true | Whether the first row defines column names (CSV only). | header: false |
| spec.skipRows | integer | no | 0 | Number of lines to skip before reading data. | skipRows: 2 |
| spec.thousands | string | no | | Thousands separator in numeric values (max 1 char). | thousands: "." |
| spec.decimalSeparator | string | no | | Decimal point character (max 1 char). | decimalSeparator: "," |
| spec.dateFormat | string | no | | Date format using DuckDB strftime specifiers. | dateFormat: "%d/%m/%Y" |
| spec.columnNames | string[] | no | | Explicit column names. Mutually exclusive with columns. | columnNames: [date, region, amount] |
| spec.columns | object | no | | Column name to DuckDB type mapping. Mutually exclusive with columnNames. | see below |
# spec.inline / spec.content sample
spec:
  type: inline
  inline:
    content:
      - { label: "Revenue", value: 123.45 }

# spec.connection sample
spec:
  connection:
    host: db.example.com
    port: 5432
    database: analytics
    schema: public
    user: reporting
    secret: myConnectionSecret

# spec.columns sample
spec:
  columns:
    ts: "TIMESTAMP"
    device_id: "INTEGER"
    reading: "DECIMAL(8,3)"

# spec.sample sample
spec:
  sample:
    size: 10000
    method: reservoir  # bernoulli | system | reservoir