Bespoke Data

Bespoke

A commandline tool for DataOps

Release 0.1.10

Getting Started

Install bespoke following the instructions on the downloads page.

> bespoke --help
A commandline tool for DataOps.

Usage: bespoke <COMMAND>

Commands:
  convert   Convert the data from one file format into another.
  merge     Merge a set of input files into a single output file.
  sift      Sift through a data set, validating records against a manifest. Records that
                successfully validate against the manifest will be placed in the output file while
                invalid records will be placed in the quarantine file.
  validate  Validate that the input data matches the types and rules of the specified data
                manifest.
  manifest  Manage data manifests.
  help      Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

Bespoke is built around one goal: driving data toward a well-documented interface. Within bespoke, these interfaces are defined as manifest files. A data manifest describes every field expected in a data set, the type of each field, and any rules the field is expected to adhere to.

# ./sparrow_flights.yaml
name: sparrow_flights
description: >
  A collection of flight tests for swallows. Each flight test is a record of a
  particular location, at a particular time, and under particular weather conditions.
  The collection is used to track the performance of swallows in flight and to
  identify the fastest and most enduring swallows.
fields:
  swallow_id:
    type: string
    description: >
      The identifier of the participating swallow in the flight test.
      These values are expected to be kebab case and start with "swl-".
    rules:
    - starts_with: swl-
    - kebab_case
  flight_at:
    type: timestamp
    description: >
      The local time at which the flight occurred. Note that flights may be conducted
      in multiple timezones. This field provides information about the local
      conditions that may have impacted performance.
  flight_site:
    type: string
    description: >
      An identifier for the testing site at which the flight occurred. Currently active
      sites include Doune Castle, Glen Coe, and Castle Stalker.
    rules:
    - enumeration:
      - doune_castle
      - glen_coe
      - castle_stalker
  distance:
    type: number
    description: >
      The distance in miles that the swallow flew during the test.
    rules:
    - minimum: 0
    - maximum: 100  # Anything over 100 is certainly errant.
  duration:
    type: number
    description: >
      The duration of the flight in minutes.
    rules:
    - minimum: 0
    - maximum: 180
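
To make the rule semantics concrete, here is a minimal Python sketch of how the rules in this manifest could be checked against a single record. It is not bespoke's implementation: the function names are hypothetical, the definition of kebab case (lowercase alphanumeric words joined by single hyphens) is an assumption, type checks are left out, and fields missing from the record are simply skipped.

```python
import re

# ASSUMPTION: kebab case means lowercase alphanumeric words joined by single hyphens.
KEBAB_CASE = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")

# Rules copied from the manifest above, in the shape a YAML parser would load them.
RULES = {
    "swallow_id": [{"starts_with": "swl-"}, "kebab_case"],
    "flight_site": [{"enumeration": ["doune_castle", "glen_coe", "castle_stalker"]}],
    "distance": [{"minimum": 0}, {"maximum": 100}],
    "duration": [{"minimum": 0}, {"maximum": 180}],
}


def check_rule(rule, value) -> bool:
    """Return True when `value` satisfies one entry of a field's `rules` list."""
    if rule == "kebab_case":
        return isinstance(value, str) and KEBAB_CASE.fullmatch(value) is not None
    if isinstance(rule, dict) and len(rule) == 1:
        [(name, arg)] = rule.items()
        if name == "starts_with":
            return isinstance(value, str) and value.startswith(arg)
        if name == "enumeration":
            return value in arg
        if name == "minimum":
            return value >= arg
        if name == "maximum":
            return value <= arg
    raise ValueError(f"rule not covered by this sketch: {rule!r}")


def invalid_fields(record: dict) -> list[str]:
    """Names of the fields in `record` that break at least one rule."""
    return [
        field
        for field, rules in RULES.items()
        if field in record and not all(check_rule(r, record[field]) for r in rules)
    ]
```

A record such as `{"swallow_id": "swl-african-01", "distance": 12.5}` passes every check, while a `distance` of 250 is reported because it exceeds the `maximum: 100` rule.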

This manifest is then used to operate on data files.

> bespoke validate -m sparrow_flights.yaml flight_data/
🟢 Validated flight_data/file_1.parquet
🟢 Validated flight_data/file_2.jsonl
🟢 Validated flight_data/file_3.csv.gz
🟢 Validated flight_data/file_4.csv
🟢 Validated flight_data/file_5.jsonl.gz
> bespoke merge -m sparrow_flights.yaml --in flight_data --out flight_data.parquet
Processing flight_data/file_1.parquet
Processing flight_data/file_2.jsonl
Processing flight_data/file_3.csv.gz
Processing flight_data/file_4.csv
Processing flight_data/file_5.jsonl.gz

All files merged into flight_data.parquet

Command Index

convert

> bespoke convert --help
Convert the data from one file format into another.

Usage: convert [OPTIONS] --manifest <manifest> <input> <output>

Arguments:
  <input>   An input file to convert into the output, validating against the given data manifest.
            Supported formats include CSV, line delimited JSON, and parquet. Files can be gzipped.
  <output>  The file to which to output the data, its extension indicating the file format.
            Supported formats include CSV, JSON record, and parquet. Output files can include a
            `.gz` extension.

Options:
  -m, --manifest <manifest>            Manifest file defining the data set.
      --flatten                        If the input file is of a format that supports nested
                                       structures, this flag will cause the leaf values to be
                                       flattened into the top level of the record. For example, if
                                       the input file is JSON and contains the record `{"foo":
                                       {"bar": 1}}`, then it would be treated as equivalent to
                                       `{"foo.bar": 1}`. This flag is ignored for formats that do
                                       not support nested structures.
      --lift                           If the input file is of a format that supports nested
                                       structures, this flag will cause the leaf values to be lifted
                                       into the top level of the record. For example, if the input
                                       file is JSON and contains the record `{"foo": {"bar": 1}}`,
                                       then it would be treated as equivalent to `{"bar": 1}`. If
                                       multiple leaf values have the same name, an error will be
                                       returned. This flag is ignored for formats that do not
                                       support nested structures.
      --csv-delimiter <csv_delimiter>  The character to use as the csv delimiter. This is usually a
                                       comma (`,`), but other common values include tab (`\t`) and
                                       pipe (`|`). The words `tab`, `pipe`, and `comma` can be used
                                       as aliases for the corresponding characters. [default: ,]
      --skip-invalid                   Skip records that do not match the manifest. This will yield
                                       the same output as if the invalid records were not present in
                                       the input file.
  -h, --help                           Print help
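
The difference between `--flatten` and `--lift` is easiest to see side by side. The Python sketch below reproduces the behavior described in the help text for nested JSON-like records. It is an illustration only, not bespoke's code: the function names are hypothetical, and arrays are treated as leaf values, which is an assumption the help text does not cover.

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Move every leaf to the top level, keyed by its dotted path."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def lift(record: dict) -> dict:
    """Move every leaf to the top level, keyed by its own name only."""
    lifted = {}

    def walk(node: dict) -> None:
        for key, value in node.items():
            if isinstance(value, dict):
                walk(value)
            elif key in lifted:
                # The help text says duplicate leaf names are an error.
                raise ValueError(f"duplicate leaf name: {key!r}")
            else:
                lifted[key] = value

    walk(record)
    return lifted


record = {"foo": {"bar": 1}}
print(flatten(record))  # {'foo.bar': 1}
print(lift(record))     # {'bar': 1}
```

Flattening never collides because the full path is kept, whereas lifting drops the path and therefore has to reject records such as `{"a": {"x": 1}, "b": {"x": 2}}`.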

introspect

> bespoke introspect --help
Introspect the bespoke CLI's commands

Usage: introspect

Options:
  -h, --help  Print help

This command is NOT part of the public interface and may change without notice. It is used by
bespoke's maintainers to generate parts of the documentation.

merge

> bespoke merge --help
Merge a set of input files into a single output file.

Usage: merge [OPTIONS] --in <input> --out <output> --manifest <manifest>

Options:
  -i, --in <input>                     An input file or directory to merge into the output,
                                       validating against the given data manifest. Supported formats
                                       include CSV, line delimited JSON, and parquet. Files can be
                                       gzipped.
  -o, --out <output>                   The file to which to output the data, its extension
                                       indicating the file format. Supported formats include CSV,
                                       JSON record, and parquet. Output files can include a `.gz`
                                       extension.
  -m, --manifest <manifest>            Manifest file defining the data set.
      --flatten                        If the input file is of a format that supports nested
                                       structures, this flag will cause the leaf values to be
                                       flattened into the top level of the record. For example, if
                                       the input file is JSON and contains the record `{"foo":
                                       {"bar": 1}}`, then it would be treated as equivalent to
                                       `{"foo.bar": 1}`. This flag is ignored for formats that do
                                       not support nested structures.
      --lift                           If the input file is of a format that supports nested
                                       structures, this flag will cause the leaf values to be lifted
                                       into the top level of the record. For example, if the input
                                       file is JSON and contains the record `{"foo": {"bar": 1}}`,
                                       then it would be treated as equivalent to `{"bar": 1}`. If
                                       multiple leaf values have the same name, an error will be
                                       returned. This flag is ignored for formats that do not
                                       support nested structures.
      --csv-delimiter <csv_delimiter>  The character to use as the csv delimiter. This is usually a
                                       comma (`,`), but other common values include tab (`\t`) and
                                       pipe (`|`). The words `tab`, `pipe`, and `comma` can be used
                                       as aliases for the corresponding characters. [default: ,]
      --skip-invalid                   Skip records that do not match the manifest. This will yield
                                       the same output as if the invalid records were not present in
                                       the input file.
  -h, --help                           Print help
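
The `--csv-delimiter` aliases can be pictured as a small lookup performed before the CSV is parsed. The Python sketch below shows that idea using only the standard library. The function names are hypothetical, and accepting the two-character text `\t` as a tab is an assumption about how the value arrives from the shell, not documented bespoke behavior.

```python
import csv
import io

# Aliases named in the help text. ASSUMPTION: a literal backslash-t also means tab.
DELIMITER_ALIASES = {"tab": "\t", "pipe": "|", "comma": ",", "\\t": "\t"}


def resolve_delimiter(value: str = ",") -> str:
    """Translate an alias to its character; any other value is used as given."""
    delimiter = DELIMITER_ALIASES.get(value, value)
    if len(delimiter) != 1:
        raise ValueError(f"delimiter must be a single character, got {value!r}")
    return delimiter


def read_csv_records(text: str, delimiter: str = ",") -> list[dict]:
    """Parse CSV text into one dict per row, keyed by the header line."""
    reader = csv.DictReader(io.StringIO(text), delimiter=resolve_delimiter(delimiter))
    return [dict(row) for row in reader]


rows = read_csv_records("swallow_id|distance\nswl-1|12\n", delimiter="pipe")
print(rows)  # [{'swallow_id': 'swl-1', 'distance': '12'}]
```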

sift

> bespoke sift --help
Sift through a data set, validating records against a manifest. Records that successfully validate
against the manifest will be placed in the output file while invalid records will be placed in the
quarantine file.

Usage: sift [OPTIONS] --in <input> --out <output> --quarantine <quarantine> --manifest <manifest>

Options:
  -i, --in <input>                     Location of the input data. Supported formats include CSV and
                                       line delimited JSON, both gzipped and non-gzipped. This can
                                       be a file or directory.
  -o, --out <output>                   The location at which to store matching records. Supported
                                       formats include CSV, JSON record, and parquet.
  -q, --quarantine <quarantine>        The location to store non-matching records. The format is
                                       required to match the format of the input.
  -m, --manifest <manifest>            Manifest file defining the data set.
      --flatten                        If the input file is of a format that supports nested
                                       structures, this flag will cause the leaf values to be
                                       flattened into the top level of the record. For example, if
                                       the input file is JSON and contains the record `{"foo":
                                       {"bar": 1}}`, then it would be treated as equivalent to
                                       `{"foo.bar": 1}`. This flag is ignored for formats that do
                                       not support nested structures.
      --lift                           If the input file is of a format that supports nested
                                       structures, this flag will cause the leaf values to be lifted
                                       into the top level of the record. For example, if the input
                                       file is JSON and contains the record `{"foo": {"bar": 1}}`,
                                       then it would be treated as equivalent to `{"bar": 1}`. If
                                       multiple leaf values have the same name, an error will be
                                       returned. This flag is ignored for formats that do not
                                       support nested structures.
      --csv-delimiter <csv_delimiter>  The character to use as the csv delimiter. This is usually a
                                       comma (`,`), but other common values include tab (`\t`) and
                                       pipe (`|`). The words `tab`, `pipe`, and `comma` can be used
                                       as aliases for the corresponding characters. [default: ,]
  -h, --help                           Print help
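
Conceptually, sifting is a single pass that routes each record to one of two destinations. The Python sketch below shows that partitioning step in isolation. The names `sift` and `distance_in_range` are hypothetical, the predicate stands in for full manifest validation, and reading or writing files is left out.

```python
from typing import Callable, Iterable


def sift(
    records: Iterable[dict], is_valid: Callable[[dict], bool]
) -> tuple[list[dict], list[dict]]:
    """Split records into (accepted, quarantined), preserving input order."""
    accepted: list[dict] = []
    quarantined: list[dict] = []
    for record in records:
        (accepted if is_valid(record) else quarantined).append(record)
    return accepted, quarantined


def distance_in_range(record: dict) -> bool:
    """Stand-in check mirroring the manifest's minimum and maximum on `distance`."""
    return 0 <= record["distance"] <= 100


accepted, quarantined = sift(
    [{"distance": 12}, {"distance": 250}, {"distance": -1}], distance_in_range
)
print(accepted)     # [{'distance': 12}]
print(quarantined)  # [{'distance': 250}, {'distance': -1}]
```

Every input record ends up in exactly one of the two lists, which is what allows the quarantine file to be corrected and fed back in later.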

validate

> bespoke validate --help
Validate that the input data matches the types and rules of the specified data manifest.

Usage: validate [OPTIONS] --manifest <manifest> <input>

Arguments:
  <input>  A file or directory of data to validate against the given data manifest. Supported
           formats include CSV, line delimited JSON, and parquet. Files may be gzipped. If a
           directory is specified, files with unrecognized extensions will be ignored. If a
           directory is specified, then the --out argument, if given, must also be a directory.

Options:
  -m, --manifest <manifest>            A data manifest defining the fields, types and rules of valid
                                       input data.
  -o, --out <error_output_file>        A file in which to place structured invalidations of the
                                       input file. Supported formats include CSV, line delimited
                                       JSON, and parquet. Files may be gzipped. If a directory is
                                       specified, the output file will be named after the input file
                                       and placed in the directory.
  -a, --annotate <annotate>            Annotate the output file with a string column containing the
                                       value given by the --annotate argument. This is useful for
                                       tracking the source of invalidations in a data pipeline. As
                                       an example the argument `--annotate foo=bar` will add a
                                       column named foo with the value bar to the output file.
      --verbose                        Prints invalid rows to stdout as they are found.
      --flatten                        If the input file is of a format that supports nested
                                       structures, this flag will cause the leaf values to be
                                       flattened into the top level of the record. For example, if
                                       the input file is JSON and contains the record `{"foo":
                                       {"bar": 1}}`, then it would be treated as equivalent to
                                       `{"foo.bar": 1}`. This flag is ignored for formats that do
                                       not support nested structures.
      --lift                           If the input file is of a format that supports nested
                                       structures, this flag will cause the leaf values to be lifted
                                       into the top level of the record. For example, if the input
                                       file is JSON and contains the record `{"foo": {"bar": 1}}`,
                                       then it would be treated as equivalent to `{"bar": 1}`. If
                                       multiple leaf values have the same name, an error will be
                                       returned. This flag is ignored for formats that do not
                                       support nested structures.
      --csv-delimiter <csv_delimiter>  The character to use as the csv delimiter. This is usually a
                                       comma (`,`), but other common values include tab (`\t`) and
                                       pipe (`|`). The words `tab`, `pipe`, and `comma` can be used
                                       as aliases for the corresponding characters. [default: ,]
  -h, --help                           Print help
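
The effect of `--annotate foo=bar` can be sketched as parsing a `name=value` pair and adding it as a constant column to every invalidation row. The Python below illustrates that idea only. The function names and the shape of the invalidation rows are hypothetical, and splitting on the first `=` is an assumption.

```python
def parse_annotation(argument: str) -> tuple[str, str]:
    """Split 'name=value' on the first '=' (ASSUMPTION: later '=' belong to the value)."""
    name, separator, value = argument.partition("=")
    if not separator or not name:
        raise ValueError(f"expected NAME=VALUE, got {argument!r}")
    return name, value


def annotate(rows: list[dict], argument: str) -> list[dict]:
    """Return copies of `rows` with the annotation added as a string column."""
    name, value = parse_annotation(argument)
    return [{**row, name: value} for row in rows]


# Hypothetical invalidation rows; the real output schema is not documented here.
rows = [{"field": "distance", "rule": "maximum"}]
print(annotate(rows, "foo=bar"))
# [{'field': 'distance', 'rule': 'maximum', 'foo': 'bar'}]
```

Because the same value is attached to every row, invalidations from different pipeline stages can be concatenated later and still be traced back to their source.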

manifest

> bespoke manifest --help
Manage data manifests.

Usage: manifest <COMMAND>

Commands:
  example   Print an example manifest file.
  lint      Lint the data manifest file.
  rules     Describe the rules that can be used in a manifest file.
  strip     Strip the data manifest file.
  validate  Validate the data manifest file.
  help      Print this message or the help of the given subcommand(s)

Options:
  -h, --help  Print help

manifest example

> bespoke manifest example --help
Print an example manifest file.

Usage: example

Options:
  -h, --help  Print help

manifest lint

> bespoke manifest lint --help
Lint the data manifest file.

Usage: lint [OPTIONS] <manifest_file_or_directory>...

Arguments:
  <manifest_file_or_directory>...  The manifest file to lint.

Options:
  -i, --ignore <ignore>     Ignore the specified rule.
  -m, --include-motivation  Include the motivation for the rule in the output.
  -h, --help                Print help

manifest rules

> bespoke manifest rules --help
Describe the rules that can be used in a manifest file.

Usage: rules [OPTIONS]

Options:
      --json  Output the available rules as a JSON object.
  -h, --help  Print help

manifest strip

> bespoke manifest strip --help
Strip the data manifest file.

Usage: strip [OPTIONS] <manifest_file>

Arguments:
  <manifest_file>  The manifest file to strip.

Options:
  -o, --output <output_file>  The file to write the stripped manifest to.
  -h, --help                  Print help

Strip the given manifest file or set of manifest files of all non-essential fields. Only fields that
define functional behavior are kept.
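
As a rough illustration of what stripping could look like, the Python sketch below removes `description` entries from a manifest that has already been loaded into a dictionary. Which keys bespoke actually treats as non-essential is not stated here, so the choice of `description` is an assumption, as is the function name `strip_manifest`.

```python
# ASSUMPTION: descriptions are the only keys without functional behavior.
NON_FUNCTIONAL_KEYS = {"description"}


def strip_manifest(manifest: dict) -> dict:
    """Return a copy of the manifest without descriptive keys."""
    stripped = {
        key: value
        for key, value in manifest.items()
        if key not in NON_FUNCTIONAL_KEYS and key != "fields"
    }
    # Strip inside each field definition, but never drop a field by its name,
    # so a data field that happens to be called "description" survives.
    stripped["fields"] = {
        name: {k: v for k, v in spec.items() if k not in NON_FUNCTIONAL_KEYS}
        for name, spec in manifest.get("fields", {}).items()
    }
    return stripped


manifest = {
    "name": "sparrow_flights",
    "description": "A collection of flight tests for swallows.",
    "fields": {
        "distance": {
            "type": "number",
            "description": "The distance in miles.",
            "rules": [{"minimum": 0}, {"maximum": 100}],
        }
    },
}
print(strip_manifest(manifest))
```

The types and rules survive untouched, so under this assumption validating against the stripped manifest would accept and reject the same records as the original.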

manifest validate

> bespoke manifest validate --help
Validate the data manifest file.

Usage: validate <manifest_file_or_directory>...

Arguments:
  <manifest_file_or_directory>...  The manifest file to validate.

Options:
  -h, --help  Print help

Check that the given manifest file or set of manifest files are valid.