# DataSource

DataSource manifests describe how bino loads raw data into the query engine. Each datasource becomes a view named after `metadata.name`.
## Name and SQL identifier

`metadata.name` for a DataSource must match the `sqlIdentifier` pattern `^[a-z_][a-z0-9_]*$`:

- Lowercase letters, digits, and underscores only
- Must start with a letter or underscore

Use these names directly in `DataSet.spec.query`.
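As a sketch of what that looks like in practice (the `revenue_by_region` DataSet and its query are hypothetical; the referenced `sales_csv` datasource follows the examples on this page):

```yaml
apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: revenue_by_region   # hypothetical DataSet name
spec:
  dependencies:
    - sales_csv             # named DataSource reference
  query: |
    SELECT region, SUM(amount) AS revenue
    FROM sales_csv          -- the view named after metadata.name
    GROUP BY region
```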
## Spec overview

```yaml
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: sales_csv
spec:
  type: csv                # inline | excel | csv | parquet | postgres_query | mysql_query
  inline: {}               # for type: inline
  content: []              # alternative inline content
  path: ./data/*.csv       # for file-based types
  connection: {}           # for database queries
  query: ""                # SQL for postgres_query / mysql_query
  ephemeral: false         # optional caching hint
  sample: 1000             # optional row sampling (number, string, or object)
  # CSV reader options (type: csv only)
  delimiter: ";"           # field delimiter
  header: true             # first row is header
  skipRows: 0              # lines to skip before data
  thousands: "."           # thousands separator
  decimalSeparator: ","    # decimal point character
  dateFormat: "%d/%m/%Y"   # date parsing format
  columnNames: [a, b, c]   # explicit column names
  columns:                 # column name → DuckDB type
    amount: "DECIMAL(10,2)"
```

Type-specific rules (simplified from the schema):

- `type: inline` – requires either `inline` (an object with `content`) or `content` (an array or JSON string).
- `type: excel | csv | parquet` – requires `path`.
- `type: postgres_query | mysql_query` – requires `connection` and `query`.

See the JSON schema for the precise conditions.
## Inline datasource

```yaml
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: kpi_inline
spec:
  type: inline
  inline:
    content:
      - { label: "Revenue", value: 123.45 }
      - { label: "EBIT", value: 12.34 }
```

## CSV files
```yaml
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: sales_daily
spec:
  type: csv
  path: ./data/sales_daily/*.csv
```

## CSV reader options
When the default auto-detection does not produce the right result, add CSV reader options to `spec`. Setting any of these options switches bino from `read_csv_auto` to `read_csv` with explicit parameters.

| Field | Type | DuckDB parameter | Description |
|---|---|---|---|
| `delimiter` | string | `delim` | Field delimiter character (for example `";"` or `"\|"`). |
| `header` | boolean | `header` | Whether the first row defines column names. Default `true`. |
| `skipRows` | integer | `skip` | Number of lines to skip before reading data. |
| `thousands` | string | `thousands` | Thousands separator in numeric values (for example `"."`). |
| `decimalSeparator` | string | `decimal_separator` | Decimal point character (for example `","`). |
| `dateFormat` | string | `dateformat` | Date format using DuckDB strftime specifiers (for example `"%d/%m/%Y"`). |
| `columns` | object | `columns` | Map of column name to DuckDB type. Mutually exclusive with `columnNames`. |
| `columnNames` | string[] | `names` | Explicit column names. Mutually exclusive with `columns`. |
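As a rough illustration of the mapping (this is a sketch of the kind of `read_csv` call these options translate into, not the exact SQL bino generates):

```sql
-- Illustrative only: explicit read_csv with the DuckDB parameters
-- named in the table above.
SELECT *
FROM read_csv('./data/eu_sales.csv',
    delim             = ';',
    header            = true,
    skip              = 0,
    thousands         = '.',
    decimal_separator = ',',
    dateformat        = '%d/%m/%Y');
```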
## European number format

```yaml
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: eu_sales
spec:
  type: csv
  path: ./data/eu_sales.csv
  delimiter: ";"
  thousands: "."
  decimalSeparator: ","
  dateFormat: "%d/%m/%Y"
```

## Headerless CSV with typed columns
```yaml
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: sensor_data
spec:
  type: csv
  path: ./data/sensors.csv
  header: false
  columns:
    ts: "TIMESTAMP"
    device_id: "INTEGER"
    reading: "DECIMAL(8,3)"
```

## Custom column names
```yaml
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: raw_export
spec:
  type: csv
  path: ./data/export.csv
  header: false
  columnNames: [date, region, amount]
```

## Parquet files
```yaml
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: fact_sales_parquet
spec:
  type: parquet
  path: ./warehouse/fact_sales/*.parquet
  ephemeral: false   # allow caching between builds
```

## PostgreSQL query
```yaml
---
apiVersion: bino.bi/v1alpha1
kind: ConnectionSecret
metadata:
  name: postgresCredentials
spec:
  type: postgres
  postgres:
    passwordFromEnv: POSTGRES_PASSWORD
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: sales_from_postgres
spec:
  type: postgres_query
  connection:
    host: ${DB_HOST:db.example.com}
    port: 5432
    database: analytics
    schema: public
    user: reporting
    secret: postgresCredentials
  query: |
    SELECT *
    FROM fact_sales
    WHERE booking_date >= DATE '2024-01-01';
```

## MySQL query
```yaml
---
apiVersion: bino.bi/v1alpha1
kind: ConnectionSecret
metadata:
  name: mysqlCredentials
spec:
  type: mysql
  mysql:
    passwordFromEnv: MYSQL_PASSWORD
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: sales_from_mysql
spec:
  type: mysql_query
  connection:
    host: ${DB_HOST:db.example.com}
    port: 3306
    database: analytics
    user: reporting
    secret: mysqlCredentials
  query: |
    SELECT * FROM fact_sales WHERE year = 2024;
```

For more on secrets and object storage, see ConnectionSecret.

## Sampling
The `sample` property lets you load only a subset of rows from a datasource using DuckDB’s `USING SAMPLE` clause. This is ideal for working with large datasets during preview and development, where you need fast iteration without waiting for millions of rows to load.

`sample` accepts three forms:

| Form | Example | DuckDB clause |
|---|---|---|
| number | `sample: 1000` | `USING SAMPLE 1000` |
| string | `sample: "10%"` | `USING SAMPLE 10%` |
| object | `sample: { size: 1000, method: reservoir }` | `USING SAMPLE 1000 (reservoir)` |
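Conceptually, the clause is appended to the query that loads the source. An illustrative DuckDB query for the object form (a sketch, not the exact SQL bino emits):

```sql
-- Illustrative: sample: { size: 1000, method: reservoir }
SELECT *
FROM read_parquet('./warehouse/fact_sales/*.parquet')
USING SAMPLE 1000 (reservoir);
```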
## Sampling methods

When using the object form you can specify a `method`:

- `bernoulli` – evaluates each row independently with the given probability. Good accuracy; works well for most datasets.
- `system` – samples entire vector chunks. Faster than bernoulli but higher variance; not recommended for small datasets (< 10k rows).
- `reservoir` – returns an exact number of rows. The only method that guarantees an exact count.

When no method is specified, DuckDB uses its default (`system` for percentages, `reservoir` for row counts).

## Basic usage
```yaml
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: fact_sales
spec:
  type: parquet
  path: ./warehouse/fact_sales/*.parquet
  sample: 5000   # load only 5000 rows
```

## Percentage sampling
```yaml
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: events
spec:
  type: csv
  path: ./data/events_2024.csv
  sample: "10%"   # load roughly 10% of rows
```

## Sampling with a specific method
```yaml
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: transactions
spec:
  type: postgres_query
  connection:
    host: ${DB_HOST}
    port: 5432
    database: analytics
    user: reporting
    secret: pgCredentials
  query: |
    SELECT * FROM transactions WHERE year = 2024
  sample:
    size: 10000
    method: reservoir   # guarantees exactly 10000 rows
```

## Sampling with constraints for development only
Combine `sample` with constraints so the sampled datasource is used during preview and development, while the full dataset is used for production builds:

```yaml
# Development: sampled data for fast iteration
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: fact_sales
  constraints:
    - mode!=build
spec:
  type: parquet
  path: ./warehouse/fact_sales/*.parquet
  sample: "5%"

# Production: full dataset
---
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: fact_sales
  constraints:
    - mode==build
spec:
  type: parquet
  path: ./warehouse/fact_sales/*.parquet
```

This pattern keeps `bino preview` fast and responsive while ensuring `bino build` always produces reports from the complete dataset.
## Conditional inclusion with constraints

DataSource documents support `metadata.constraints` to conditionally include them for specific artefacts, modes, or environments.

## Environment-specific data sources

Use different data sources for development vs production:
```yaml
# Mock data for development
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: sales
  constraints:
    - labels.env==dev
spec:
  type: inline
  content:
    - { region: "Test", amount: 100 }
---
# Production database
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: sales
  constraints:
    - labels.env==prod
spec:
  type: postgres_query
  connection:
    host: prod-db.example.com
    # ...
```

## Multiple environments with in operator
Match multiple environments at once using either format:

```yaml
# String format
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: staging_data
  constraints:
    - labels.env in [dev,staging,qa]
spec:
  type: postgres_query
  connection:
    host: staging-db.example.com
---
# Structured format (IDE-friendly)
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: staging_data
  constraints:
    - field: labels.env
      operator: in
      value: [dev, staging, qa]
spec:
  type: postgres_query
  connection:
    host: staging-db.example.com
```

For the full constraint syntax and operators, see Constraints and Scoped Names.
## Inline DataSource definitions

DataSources can also be defined inline within DataSet `dependencies` arrays, eliminating the need for separate documents. This is useful for simple, single-use data sources.

## Example: Inline DataSource in a DataSet

```yaml
apiVersion: bino.bi/v1alpha1
kind: DataSet
metadata:
  name: sales_summary
spec:
  dependencies:
    - type: csv
      path: ./data/sales.csv
  query: |
    SELECT region, SUM(amount) as total
    FROM @inline(0)
    GROUP BY region
```

The `@inline(0)` syntax references the inline DataSource by its position (0-indexed) in the `dependencies` array.
## Supported types for inline definitions

All DataSource types can be used inline:

```yaml
# CSV
- type: csv
  path: ./data/sales.csv

# Excel
- type: excel
  path: ./data/report.xlsx

# Parquet
- type: parquet
  path: ./warehouse/*.parquet

# Inline data
- type: inline
  content:
    - { region: "US", amount: 100 }
    - { region: "EU", amount: 200 }
```

## Multiple inline DataSources
Reference multiple inline DataSources by their index:

```yaml
spec:
  dependencies:
    - type: csv
      path: ./data/orders.csv
    - type: csv
      path: ./data/customers.csv
  query: |
    SELECT c.name, o.total
    FROM @inline(0) o
    JOIN @inline(1) c ON o.customer_id = c.id
```

## Mixing inline and named references
You can combine inline definitions with references to standalone DataSource documents:

```yaml
spec:
  dependencies:
    - existing_datasource   # named reference
    - type: csv             # inline definition
      path: ./data/extra.csv
  query: |
    SELECT * FROM existing_datasource
    UNION ALL
    SELECT * FROM @inline(1)
```

For more details on inline definitions and the `@inline(N)` syntax, see DataSet – Inline DataSet definitions.
## Attribute Reference

### Common Metadata

All document kinds share these metadata fields.

| Attribute | Type | Required | Default | Description |
|---|---|---|---|---|
| `apiVersion` | string | yes | — | Must be `bino.bi/v1alpha1`. |
| `kind` | string | yes | — | Must be `DataSource`. |
| `metadata.name` | string | yes | — | Unique identifier. For DataSource, must match `^[a-z_][a-z0-9_]*$` (SQL identifier). |
| `metadata.labels` | object | no | — | Key-value pairs for categorization and constraint matching. |
| `metadata.annotations` | object | no | — | Arbitrary key-value metadata, not used by the system. |
| `metadata.description` | string | no | — | Free-form description. |
| `metadata.constraints` | array | no | — | Conditional inclusion rules. See Constraints. |
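A minimal sketch combining these metadata fields (the label, annotation, and description values are illustrative, not required by the schema):

```yaml
apiVersion: bino.bi/v1alpha1
kind: DataSource
metadata:
  name: sales_csv
  description: Daily sales exports   # free-form
  labels:
    env: dev                         # available for constraint matching
  annotations:
    owner: data-team                 # ignored by the system
  constraints:
    - labels.env==dev
spec:
  type: csv
  path: ./data/*.csv
```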
### Spec Attributes

| Attribute | Type | Required | Default | Description | Sample |
|---|---|---|---|---|---|
| `spec.type` | string | yes | — | Data source type. Values: `inline`, `excel`, `csv`, `parquet`, `postgres_query`, `mysql_query`. | `type: csv` |
| `spec.inline` | object | conditional | — | Inline data container. Required when `type: inline` and `content` is not set. | see below |
| `spec.content` | array or string | conditional | — | Inline data as array or JSON string. Required when `type: inline` and `inline` is not set. | see below |
| `spec.path` | string | conditional | — | File path, directory, or glob pattern. Required when `type` is `excel`, `csv`, or `parquet`. | `path: ./data/*.csv` |
| `spec.connection` | object | conditional | — | Database connection details. Required when `type` is `postgres_query` or `mysql_query`. | see below |
| `spec.query` | string | conditional | — | SQL query string. Required when `type` is `postgres_query` or `mysql_query`. | `query: SELECT * FROM sales` |
| `spec.ephemeral` | boolean | no | varies | Controls caching. Defaults depend on source type. | `ephemeral: false` |
| `spec.sample` | number, string, or object | no | — | Row sampling via DuckDB `USING SAMPLE`. | `sample: 1000` |
| `spec.delimiter` | string | no | auto | CSV field delimiter (max 4 chars). | `delimiter: ";"` |
| `spec.header` | boolean | no | `true` | Whether the first row defines column names (CSV only). | `header: false` |
| `spec.skipRows` | integer | no | 0 | Number of lines to skip before reading data. | `skipRows: 2` |
| `spec.thousands` | string | no | — | Thousands separator in numeric values (max 1 char). | `thousands: "."` |
| `spec.decimalSeparator` | string | no | — | Decimal point character (max 1 char). | `decimalSeparator: ","` |
| `spec.dateFormat` | string | no | — | Date format using DuckDB strftime specifiers. | `dateFormat: "%d/%m/%Y"` |
| `spec.columnNames` | string[] | no | — | Explicit column names. Mutually exclusive with `columns`. | `columnNames: [date, region, amount]` |
| `spec.columns` | object | no | — | Column name to DuckDB type mapping. Mutually exclusive with `columnNames`. | see below |
### Inline content

```yaml
spec:
  type: inline
  inline:
    content:
      - { label: "Revenue", value: 123.45 }
```

### Connection object
```yaml
spec:
  connection:
    host: db.example.com
    port: 5432
    database: analytics
    schema: public
    user: reporting
    secret: myConnectionSecret
```

### Column types
```yaml
spec:
  columns:
    ts: "TIMESTAMP"
    device_id: "INTEGER"
    reading: "DECIMAL(8,3)"
```

### Sample object form
```yaml
spec:
  sample:
    size: 10000
    method: reservoir   # bernoulli | system | reservoir
```