Bespoke

A commandline tool for DataOps

Release 0.1.10


Getting Started

Install bespoke following the instructions on the downloads page.

> bespoke --help
A commandline tool for DataOps.

Usage: bespoke <COMMAND>

Commands:
  convert   Convert the data from one file format into another.
  merge     Merge a set of input files into a single output file.
  sift      Sift through a data set, validating records against a manifest. Records that
                successfully validate against the manifest will be placed in the output file while
                invalid records will be placed in the quarantine file.
  validate  Validate that the input data matches the types and rules of the specified data
                manifest.
  manifest  Manage data manifests.
  help      Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

Bespoke is built around the goal of driving data toward a well-documented interface. Within bespoke, these interfaces are defined as manifest files. A data manifest describes the fields expected within a data set, the types of those fields, and any rules the fields are expected to adhere to.

# ./sparrow_flights.yaml
name: sparrow_flights
description: >
  A collection of flight tests for swallows. Each flight test is a record of a
  particular location, at a particular time, and under particular weather conditions.
  The collection is used to track the performance of swallows in flight and to
  identify the fastest and most enduring swallows.
fields:
  swallow_id:
    type: string
    description: >
      The identifier of the participating swallow in the flight test.
      These values are expected to be kebab case and start with "swl-".
    rules:
    - starts_with: swl-
    - kebab_case
  flight_at:
    type: timestamp
    description: >
      The local time at which the flight occurred. Note that flights may be conducted
      in multiple timezones. This field provides information about the local
      conditions that may have impacted performance.
  flight_site:
    type: string
    description: >
      An identifier for the testing site at which the flight occurred. Currently active
      sites include Doune Castle, Glen Coe, and Castle Stalker.
    rules:
    - enumeration:
      - doune_castle
      - glen_coe
      - castle_stalker
  distance:
    type: number
    description: >
      The distance in miles that the swallow flew during the test.
    rules:
    - minimum: 0
    - maximum: 100  # Anything over 100 is certainly errant.
  duration:
    type: number
    description: >
      The duration of the flight in minutes.
    rules:
    - minimum: 0
    - maximum: 180
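
To make the rule semantics concrete, the following is a minimal Python sketch of how the rules in this manifest could be checked against a single value. It is not bespoke's implementation: the function name `check_field`, the kebab-case regular expression, and the way rules are represented after loading the YAML are all assumptions made for illustration.

```python
import re

# Hypothetical kebab-case pattern: lowercase alphanumeric words joined by single hyphens.
KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def check_field(value, rules):
    """Return the names of the rules that `value` violates.

    `rules` mirrors the manifest's YAML: each entry is either a bare
    rule name (str) or a single-key mapping of rule name to argument.
    """
    failures = []
    for rule in rules:
        name, arg = (rule, None) if isinstance(rule, str) else next(iter(rule.items()))
        if name == "starts_with":
            ok = isinstance(value, str) and value.startswith(arg)
        elif name == "kebab_case":
            ok = isinstance(value, str) and KEBAB_CASE.match(value) is not None
        elif name == "enumeration":
            ok = value in arg
        elif name == "minimum":
            ok = value >= arg
        elif name == "maximum":
            ok = value <= arg
        else:
            raise ValueError(f"unknown rule: {name}")
        if not ok:
            failures.append(name)
    return failures


swallow_id_rules = [{"starts_with": "swl-"}, "kebab_case"]
distance_rules = [{"minimum": 0}, {"maximum": 100}]

print(check_field("swl-african-01", swallow_id_rules))  # []
print(check_field("SWL_European", swallow_id_rules))    # ['starts_with', 'kebab_case']
print(check_field(250, distance_rules))                 # ['maximum']
```

Run `bespoke manifest rules` for the authoritative list of rules and their exact behavior.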

This manifest is then used to operate on data files.

> bespoke validate -m sparrow_flights.yaml flight_data/
🟢 Validated flight_data/file_1.parquet
🟢 Validated flight_data/file_2.jsonl
🟢 Validated flight_data/file_3.csv.gz
🟢 Validated flight_data/file_4.csv
🟢 Validated flight_data/file_5.jsonl.gz
> bespoke merge -m sparrow_flights.yaml --in flight_data --out flight_data.parquet
Processing flight_data/file_1.parquet
Processing flight_data/file_2.jsonl
Processing flight_data/file_3.csv.gz
Processing flight_data/file_4.csv
Processing flight_data/file_5.jsonl.gz

All files merged into flight_data.parquet

Command Index

convert
> bespoke convert --help
Convert the data from one file format into another.

Usage: convert [OPTIONS] --manifest <manifest> <input> <output>

Arguments:
  <input>   An input file to convert into the output, validating against the given data manifest.
            Supported formats include CSV, line delimited JSON, and parquet. Files can be gzipped.
  <output>  The file to which to output the data, its extension indicating the file format.
            Supported formats include CSV, JSON record, and parquet. Output files can include a
            `.gz` extension.

Options:
  -m, --manifest <manifest>            Manifest file defining the data set.
      --flatten                        If the input file is of a format that supports nested
                                       structures, this flag will cause the leaf values to be
                                       flattened into the top level of the record. For example, if
                                       the input file is JSON and contains the record `{"foo":
                                       {"bar": 1}}`, then it would be treated as equivalent to
                                       `{"foo.bar": 1}`. This flag is ignored for formats that do
                                       not support nested structures.
      --lift                           If the input file is of a format that supports nested
                                       structures, this flag will cause the leaf values to be lifted
                                       into the top level of the record. For example, if the input
                                       file is JSON and contains the record `{"foo": {"bar": 1}}`,
                                       then it would be treated as equivalent to `{"bar": 1}`. If
                                       multiple leaf values have the same name, an error will be
                                       returned. This flag is ignored for formats that do not
                                       support nested structures.
      --csv-delimiter <csv_delimiter>  The character to use as the csv delimiter. This is usually a
                                       comma (`,`), but other common values include tab (`\t`) and
                                       pipe (`|`). The words `tab`, `pipe`, and `comma` can be used
                                       as aliases for the corresponding characters. [default: ,]
      --skip-invalid                   Skip records that do not match the manifest. This will yield
                                       the same output as if the invalid records were not present in
                                       the input file.
  -h, --help                           Print help
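
The difference between `--flatten` and `--lift` is easiest to see side by side. The sketch below reproduces the behavior described in the help text using plain Python dictionaries. The function names `flatten` and `lift` are illustrative, and the handling of edge cases (such as empty nested objects) is an assumption, not a statement about bespoke's internals.

```python
def flatten(record, prefix=""):
    """Join nested keys with dots: {"foo": {"bar": 1}} -> {"foo.bar": 1}."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def lift(record):
    """Keep only leaf names: {"foo": {"bar": 1}} -> {"bar": 1}.

    Raises ValueError when two leaves share a name, mirroring the error
    that the help text describes.
    """
    lifted = {}
    for key, value in record.items():
        leaves = lift(value) if isinstance(value, dict) else {key: value}
        for name, leaf in leaves.items():
            if name in lifted:
                raise ValueError(f"duplicate leaf name: {name}")
            lifted[name] = leaf
    return lifted


record = {"foo": {"bar": 1}, "baz": 2}
print(flatten(record))  # {'foo.bar': 1, 'baz': 2}
print(lift(record))     # {'bar': 1, 'baz': 2}
```

Flattening never loses information because the full path is kept in the key; lifting yields shorter names but fails when two leaves collide.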

introspect
> bespoke introspect --help
Introspect the bespoke CLI's commands

Usage: introspect

Options:
  -h, --help  Print help

This command is NOT part of the public interface and may change without notice. It is used by
bespoke's maintainers to generate parts of the documentation.

merge
> bespoke merge --help
Merge a set of input files into a single output file.

Usage: merge [OPTIONS] --in <input> --out <output> --manifest <manifest>

Options:
  -i, --in <input>                     An input file or directory to merge into the output,
                                       validating against the given data manifest. Supported formats
                                       include CSV, line delimited JSON, and parquet. Files can be
                                       gzipped.
  -o, --out <output>                   The file to which to output the data, its extension
                                       indicating the file format. Supported formats include CSV,
                                       JSON record, and parquet. Output files can include a `.gz`
                                       extension.
  -m, --manifest <manifest>            Manifest file defining the data set.
      --flatten                        If the input file is of a format that supports nested
                                       structures, this flag will cause the leaf values to be
                                       flattened into the top level of the record. For example, if
                                       the input file is JSON and contains the record `{"foo":
                                       {"bar": 1}}`, then it would be treated as equivalent to
                                       `{"foo.bar": 1}`. This flag is ignored for formats that do
                                       not support nested structures.
      --lift                           If the input file is of a format that supports nested
                                       structures, this flag will cause the leaf values to be lifted
                                       into the top level of the record. For example, if the input
                                       file is JSON and contains the record `{"foo": {"bar": 1}}`,
                                       then it would be treated as equivalent to `{"bar": 1}`. If
                                       multiple leaf values have the same name, an error will be
                                       returned. This flag is ignored for formats that do not
                                       support nested structures.
      --csv-delimiter <csv_delimiter>  The character to use as the csv delimiter. This is usually a
                                       comma (`,`), but other common values include tab (`\t`) and
                                       pipe (`|`). The words `tab`, `pipe`, and `comma` can be used
                                       as aliases for the corresponding characters. [default: ,]
      --skip-invalid                   Skip records that do not match the manifest. This will yield
                                       the same output as if the invalid records were not present in
                                       the input file.
  -h, --help                           Print help
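
The `--csv-delimiter` aliases can be thought of as a small lookup applied before the CSV reader is configured. The following sketch shows the idea with Python's standard `csv` module. The `DELIMITER_ALIASES` table and `resolve_delimiter` function are hypothetical names, and accepting the two-character text `\t` is an assumption based on the help text listing it as a value.

```python
import csv
import io

# Aliases named in the help text; the mapping itself is an illustrative assumption.
DELIMITER_ALIASES = {"tab": "\t", "pipe": "|", "comma": ",", "\\t": "\t"}


def resolve_delimiter(value=","):
    """Translate an alias such as `tab` into the single character it stands for."""
    delimiter = DELIMITER_ALIASES.get(value, value)
    if len(delimiter) != 1:
        raise ValueError(f"delimiter must be a single character, got {value!r}")
    return delimiter


data = "swallow_id|distance\nswl-african-01|42\n"
rows = list(csv.reader(io.StringIO(data), delimiter=resolve_delimiter("pipe")))
print(rows)  # [['swallow_id', 'distance'], ['swl-african-01', '42']]
```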

sift
> bespoke sift --help
Sift through a data set, validating records against a manifest. Records that successfully validate
against the manifest will be placed in the output file while invalid records will be placed in the
quarantine file.

Usage: sift [OPTIONS] --in <input> --out <output> --quarantine <quarantine> --manifest <manifest>

Options:
  -i, --in <input>                     Location of the input data. Supported formats include CSV and
                                       line delimited JSON, both gzipped and non-gzipped. This can
                                       be a file or directory.
  -o, --out <output>                   The location at which to store matching records. Supported
                                       formats include CSV, JSON record, and parquet.
  -q, --quarantine <quarantine>        The location to store non-matching records. The format is
                                       required to match the format of the input.
  -m, --manifest <manifest>            Manifest file defining the data set.
      --flatten                        If the input file is of a format that supports nested
                                       structures, this flag will cause the leaf values to be
                                       flattened into the top level of the record. For example, if
                                       the input file is JSON and contains the record `{"foo":
                                       {"bar": 1}}`, then it would be treated as equivalent to
                                       `{"foo.bar": 1}`. This flag is ignored for formats that do
                                       not support nested structures.
      --lift                           If the input file is of a format that supports nested
                                       structures, this flag will cause the leaf values to be lifted
                                       into the top level of the record. For example, if the input
                                       file is JSON and contains the record `{"foo": {"bar": 1}}`,
                                       then it would be treated as equivalent to `{"bar": 1}`. If
                                       multiple leaf values have the same name, an error will be
                                       returned. This flag is ignored for formats that do not
                                       support nested structures.
      --csv-delimiter <csv_delimiter>  The character to use as the csv delimiter. This is usually a
                                       comma (`,`), but other common values include tab (`\t`) and
                                       pipe (`|`). The words `tab`, `pipe`, and `comma` can be used
                                       as aliases for the corresponding characters. [default: ,]
  -h, --help                           Print help
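
Conceptually, sifting partitions the input into two streams. The sketch below shows that partition in Python, with a simple predicate standing in for manifest validation. The names `sift` and `distance_in_range` are hypothetical, and the predicate only checks the `distance` bounds from the sparrow_flights manifest in Getting Started; it is not how bespoke validates records.

```python
def sift(records, is_valid):
    """Split records into (accepted, quarantined) using the predicate `is_valid`."""
    accepted, quarantined = [], []
    for record in records:
        (accepted if is_valid(record) else quarantined).append(record)
    return accepted, quarantined


# Hypothetical stand-in for manifest validation: the `distance` bounds
# (minimum 0, maximum 100) from the sparrow_flights manifest.
def distance_in_range(record):
    distance = record.get("distance")
    return isinstance(distance, (int, float)) and 0 <= distance <= 100


flights = [
    {"swallow_id": "swl-african-01", "distance": 42},
    {"swallow_id": "swl-european-07", "distance": 250},
    {"swallow_id": "swl-european-09"},
]
accepted, quarantined = sift(flights, distance_in_range)
print(len(accepted), len(quarantined))  # 1 2
```

Every input record ends up in exactly one of the two outputs, so the quarantine file can be repaired and re-sifted without losing data.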

validate
> bespoke validate --help
Validate that the input data matches the types and rules of the specified data manifest.

Usage: validate [OPTIONS] --manifest <manifest> <input>

Arguments:
  <input>  A file or directory of data to validate against the given data manifest. Supported
           formats include CSV, line delimited JSON, and parquet. Files may be gzipped. If a
           directory is specified, files with unrecognized extensions will be ignored. If a
           directory is specified, then the --out argument must also be a directory if specified.

Options:
  -m, --manifest <manifest>            A data manifest defining the fields, types and rules of valid
                                       input data.
  -o, --out <error_output_file>        A file in which to place structured invalidations of the
                                       input file. Supported formats include CSV, line delimited
                                       JSON, and parquet. Files may be gzipped. If a directory is
                                       specified, the output file will be named after the input file
                                       and placed in the directory.
  -a, --annotate <annotate>            Annotate the output file with a string column containing the
                                       value given by the --annotate argument. This is useful for
                                       tracking the source of invalidations in a data pipeline. As
                                       an example the argument `--annotate foo=bar` will add a
                                       column named foo with the value bar to the output file.
      --verbose                        Prints invalid rows to stdout as they are found.
      --flatten                        If the input file is of a format that supports nested
                                       structures, this flag will cause the leaf values to be
                                       flattened into the top level of the record. For example, if
                                       the input file is JSON and contains the record `{"foo":
                                       {"bar": 1}}`, then it would be treated as equivalent to
                                       `{"foo.bar": 1}`. This flag is ignored for formats that do
                                       not support nested structures.
      --lift                           If the input file is of a format that supports nested
                                       structures, this flag will cause the leaf values to be lifted
                                       into the top level of the record. For example, if the input
                                       file is JSON and contains the record `{"foo": {"bar": 1}}`,
                                       then it would be treated as equivalent to `{"bar": 1}`. If
                                       multiple leaf values have the same name, an error will be
                                       returned. This flag is ignored for formats that do not
                                       support nested structures.
      --csv-delimiter <csv_delimiter>  The character to use as the csv delimiter. This is usually a
                                       comma (`,`), but other common values include tab (`\t`) and
                                       pipe (`|`). The words `tab`, `pipe`, and `comma` can be used
                                       as aliases for the corresponding characters. [default: ,]
  -h, --help                           Print help
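
The effect of `--annotate foo=bar` can be pictured as adding one constant column to every row of the invalidation output. The Python sketch below illustrates this. The helper names and the shape of the invalidation rows (`field`, `rule`) are assumptions for the example and are not taken from bespoke's actual output schema.

```python
def parse_annotation(argument):
    """Split a `name=value` argument into a (column, value) pair."""
    name, separator, value = argument.partition("=")
    if not separator or not name:
        raise ValueError(f"expected NAME=VALUE, got {argument!r}")
    return name, value


def annotate(rows, argument):
    """Return copies of `rows` with the annotation column added to each."""
    name, value = parse_annotation(argument)
    return [{**row, name: value} for row in rows]


# Hypothetical invalidation rows; the real output columns may differ.
invalidations = [{"field": "distance", "rule": "maximum"}]
print(annotate(invalidations, "foo=bar"))
# [{'field': 'distance', 'rule': 'maximum', 'foo': 'bar'}]
```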

manifest
> bespoke manifest --help
Manage data manifests.

Usage: manifest <COMMAND>

Commands:
  example   Print an example manifest file.
  lint      Lint the data manifest file.
  rules     Describe the rules that can be used in a manifest file.
  strip     Strip the data manifest file.
  validate  Validate the data manifest file.
  help      Print this message or the help of the given subcommand(s)

Options:
  -h, --help  Print help

manifest example
> bespoke manifest example --help
Print an example manifest file.

Usage: example

Options:
  -h, --help  Print help

manifest lint
> bespoke manifest lint --help
Lint the data manifest file.

Usage: lint [OPTIONS] <manifest_file_or_directory>...

Arguments:
  <manifest_file_or_directory>...  The manifest file to lint.

Options:
  -i, --ignore <ignore>     Ignore the specified rule.
  -m, --include-motivation  Include the motivation for the rule in the output.
  -h, --help                Print help

manifest rules
> bespoke manifest rules --help
Describe the rules that can be used in a manifest file.

Usage: rules [OPTIONS]

Options:
      --json  Output the available rules as a JSON object.
  -h, --help  Print help

manifest strip
> bespoke manifest strip --help
Strip the data manifest file.

Usage: strip [OPTIONS] <manifest_file>

Arguments:
  <manifest_file>  The manifest file to strip.

Options:
  -o, --output <output_file>  The file to write the stripped manifest to.
  -h, --help                  Print help

Strip the given manifest file or set of manifest files of all non-essential fields. Only fields that
define functional behavior are kept.
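
As a rough illustration of stripping, the sketch below removes documentation-only keys from a manifest that has already been loaded into a Python dictionary. It assumes that `description` is the only non-essential key; that assumption is not confirmed by the help text, and bespoke may treat other keys as non-essential as well. The function name `strip_manifest` is hypothetical.

```python
# Assumption: `description` is the only key without functional behavior.
NON_ESSENTIAL_KEYS = {"description"}


def strip_manifest(manifest):
    """Return a copy of `manifest` without non-essential keys.

    Keys are removed at the top level and inside each field definition.
    Field names themselves are never removed, so a data field that happens
    to be called `description` is preserved.
    """
    stripped = {k: v for k, v in manifest.items() if k not in NON_ESSENTIAL_KEYS}
    stripped["fields"] = {
        field_name: {k: v for k, v in spec.items() if k not in NON_ESSENTIAL_KEYS}
        for field_name, spec in manifest.get("fields", {}).items()
    }
    return stripped


manifest = {
    "name": "sparrow_flights",
    "description": "A collection of flight tests for swallows.",
    "fields": {
        "distance": {
            "type": "number",
            "description": "The distance in miles.",
            "rules": [{"minimum": 0}, {"maximum": 100}],
        }
    },
}
print(strip_manifest(manifest))
# {'name': 'sparrow_flights', 'fields': {'distance': {'type': 'number', 'rules': [{'minimum': 0}, {'maximum': 100}]}}}
```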

manifest validate
> bespoke manifest validate --help
Validate the data manifest file.

Usage: validate <manifest_file_or_directory>...

Arguments:
  <manifest_file_or_directory>...  The manifest file to validate.

Options:
  -h, --help  Print help

Check that the given manifest file or set of manifest files are valid.