Install bespoke following the instructions on the downloads page.
> bespoke --help
A commandline tool for DataOps.

Usage: bespoke <COMMAND>

Commands:
  convert   Convert the data from one file format into another.
  merge     Merge a set of input files into a single output file.
  sift      Sift through a data set, validating records against a manifest. Records that
            successfully validate against the manifest will be placed in the output file
            while invalid records will be placed in the quarantine file.
  validate  Validate that the input data matches the types and rules of the specified data
            manifest.
  manifest  Manage data manifests.
  help      Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version
Bespoke is built around the goal of driving data towards a well-documented interface.
Within bespoke, these interfaces are defined as manifest files. A data manifest
describes all the fields that are expected within a data set, the types of these fields, and
any rules the fields are expected to adhere to.
# ./sparrow_flights.yaml
name: sparrow_flights
description: >
  A collection of flight tests for swallows. Each flight test is a record of a
  particular location, at a particular time, and under particular weather conditions.
  The collection is used to track the performance of swallows in flight and to
  identify the fastest and most enduring swallows.
fields:
  swallow_id:
    type: string
    description: >
      The identifier of the participating swallow in the flight test.
      These values are expected to be kebab case and start with "swl-".
    rules:
      - starts_with: swl-
      - kebab_case
  flight_at:
    type: timestamp
    description: >
      The local time at which the flight occurred. Note that flights may be conducted
      in multiple timezones. This field provides information about the local
      conditions that may have impacted performance.
  flight_site:
    type: string
    description: >
      An identifier for which testing site the flight occurred at. Current active
      sites include Doune Castle, Glen Coe, and Castle Stalker.
    rules:
      - enumeration:
          - doune_castle
          - glen_coe
          - castle_stalker
  distance:
    type: number
    description: >
      The distance in miles that the swallow flew during the test.
    rules:
      - minimum: 0
      - maximum: 100 # Anything over 100 is certainly errant.
  duration:
    type: number
    description: >
      The duration of the flight in minutes.
    rules:
      - minimum: 0
      - maximum: 180
This manifest is then used to operate on data files.
> bespoke validate -m sparrow_flights.yaml flight_data/
🟢 Validated flight_data/file_1.parquet
🟢 Validated flight_data/file_2.jsonl
🟢 Validated flight_data/file_3.csv.gz
🟢 Validated flight_data/file_4.csv
🟢 Validated flight_data/file_5.jsonl.gz

> bespoke merge -m sparrow_flights.yaml --in flight_data --out flight_data.parquet/
Processing flight_data/file_1.parquet
Processing flight_data/file_2.jsonl
Processing flight_data/file_3.csv.gz
Processing flight_data/file_4.csv
Processing flight_data/file_5.jsonl.gz
All files merged into flight_data.parquet
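To make the manifest rules more concrete, the following Python sketch checks a single record against a hand-written subset of the rules from `sparrow_flights.yaml`. It is an illustration only, not bespoke's implementation: the `RULES` table, the `failing_fields` helper, the regular expression used for kebab case, the inclusive treatment of `minimum` and `maximum`, and the decision to treat a missing field as a failure are all assumptions. Type checks and the `flight_at` timestamp are left out.

```python
import re

# Assumed definition of kebab case: lowercase alphanumeric words joined by
# single hyphens. bespoke's own definition may differ.
KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

# Hypothetical hand-translation of some rules from sparrow_flights.yaml.
RULES = {
    "swallow_id": [
        lambda v: v.startswith("swl-"),
        lambda v: KEBAB_CASE.match(v) is not None,
    ],
    "flight_site": [
        lambda v: v in {"doune_castle", "glen_coe", "castle_stalker"},
    ],
    # minimum/maximum are assumed to be inclusive bounds.
    "distance": [lambda v: 0 <= v <= 100],
    "duration": [lambda v: 0 <= v <= 180],
}


def failing_fields(record):
    """Return the sorted names of fields that are missing or break a rule."""
    failed = []
    for field, checks in RULES.items():
        if field not in record or not all(check(record[field]) for check in checks):
            failed.append(field)
    return sorted(failed)


good = {"swallow_id": "swl-042", "flight_site": "glen_coe", "distance": 12.5, "duration": 30}
bad = {"swallow_id": "SWL_042", "flight_site": "camelot", "distance": 250, "duration": 30}
print(failing_fields(good))
print(failing_fields(bad))
```

Keeping the rules as data, separate from the code that applies them, mirrors the role the manifest plays: the same description of the data set can be reused by every command that reads it.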
> bespoke convert --help
Convert the data from one file format into another.

Usage: convert [OPTIONS] --manifest <manifest> <input> <output>

Arguments:
  <input>
          An input file to convert into the output, validating against the given data
          manifest. Supported formats include CSV, line delimited JSON, and parquet. Files
          can be gzipped.

  <output>
          The file to which to output the data, its extension indicating the file format.
          Supported formats include CSV, JSON record, and parquet. Output files can include
          a `.gz` extension.

Options:
  -m, --manifest <manifest>
          Manifest file defining the data set.

      --flatten
          If the input file is of a format that supports nested structures, this flag will
          cause the leaf values to be flattened into the top level of the record. For
          example, if the input file is JSON and contains the record `{"foo": {"bar": 1}}`,
          then it would be treated as equivalent to `{"foo.bar": 1}`. This flag is ignored
          for formats that do not support nested structures.

      --lift
          If the input file is of a format that supports nested structures, this flag will
          cause the leaf values to be lifted into the top level of the record. For example,
          if the input file is JSON and contains the record `{"foo": {"bar": 1}}`, then it
          would be treated as equivalent to `{"bar": 1}`. If multiple leaf values have the
          same name, an error will be returned. This flag is ignored for formats that do not
          support nested structures.

      --csv-delimiter <csv_delimiter>
          The character to use as the csv delimiter. This is usually a comma (`,`), but
          other common values include tab (`\t`) and pipe (`|`). The words `tab`, `pipe`,
          and `comma` can be used as aliases for the corresponding characters.

          [default: ,]

      --skip-invalid
          Skip records that do not match the manifest. This will yield the same output as if
          the invalid records were not present in the input file.

  -h, --help
          Print help
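The difference between `--flatten` and `--lift` is easiest to see on a small record. The Python sketch below reproduces the two transformations as the help text describes them; it is not bespoke's code. The function names are hypothetical, and treating lists and other non-object values as leaves is an assumption the help text does not settle.

```python
def flatten(record, prefix=""):
    """Join nested keys with dots: {"foo": {"bar": 1}} -> {"foo.bar": 1}."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            # Assumption: anything that is not a nested object is a leaf.
            flat[path] = value
    return flat


def lift(record):
    """Move leaf values to the top level: {"foo": {"bar": 1}} -> {"bar": 1}."""
    lifted = {}
    for key, value in record.items():
        leaves = lift(value) if isinstance(value, dict) else {key: value}
        for name, leaf in leaves.items():
            if name in lifted:
                # The help text says duplicate leaf names are an error.
                raise ValueError(f"duplicate leaf name: {name}")
            lifted[name] = leaf
    return lifted


print(flatten({"foo": {"bar": 1}}))
print(lift({"foo": {"bar": 1}}))
```

Flattening keeps the full path in the field name, so it cannot produce collisions from distinct paths; lifting gives shorter names but fails when two branches share a leaf name, for example `{"a": {"x": 1}, "b": {"x": 2}}`.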
> bespoke introspect --help
Introspect the bespoke CLI's commands

Usage: introspect

Options:
  -h, --help  Print help

This command is NOT part of the public interface and may change without notice. It is used
by bespoke's maintainers to generate parts of the documentation.
> bespoke merge --help
Merge a set of input files into a single output file.

Usage: merge [OPTIONS] --in <input> --out <output> --manifest <manifest>

Options:
  -i, --in <input>
          An input file or directory to merge into the output, validating against the given
          data manifest. Supported formats include CSV, line delimited JSON, and parquet.
          Files can be gzipped.

  -o, --out <output>
          The file to which to output the data, its extension indicating the file format.
          Supported formats include CSV, JSON record, and parquet. Output files can include
          a `.gz` extension.

  -m, --manifest <manifest>
          Manifest file defining the data set.

      --flatten
          If the input file is of a format that supports nested structures, this flag will
          cause the leaf values to be flattened into the top level of the record. For
          example, if the input file is JSON and contains the record `{"foo": {"bar": 1}}`,
          then it would be treated as equivalent to `{"foo.bar": 1}`. This flag is ignored
          for formats that do not support nested structures.

      --lift
          If the input file is of a format that supports nested structures, this flag will
          cause the leaf values to be lifted into the top level of the record. For example,
          if the input file is JSON and contains the record `{"foo": {"bar": 1}}`, then it
          would be treated as equivalent to `{"bar": 1}`. If multiple leaf values have the
          same name, an error will be returned. This flag is ignored for formats that do not
          support nested structures.

      --csv-delimiter <csv_delimiter>
          The character to use as the csv delimiter. This is usually a comma (`,`), but
          other common values include tab (`\t`) and pipe (`|`). The words `tab`, `pipe`,
          and `comma` can be used as aliases for the corresponding characters.

          [default: ,]

      --skip-invalid
          Skip records that do not match the manifest. This will yield the same output as if
          the invalid records were not present in the input file.

  -h, --help
          Print help
> bespoke sift --help
Sift through a data set, validating records against a manifest. Records that successfully
validate against the manifest will be placed in the output file while invalid records will
be placed in the quarantine file.

Usage: sift [OPTIONS] --in <input> --out <output> --quarantine <quarantine> --manifest <manifest>

Options:
  -i, --in <input>
          Location of the input data. Supported formats include CSV and line delimited JSON,
          both gzipped and non-gzipped. This can be a file or directory.

  -o, --out <output>
          The location at which to store matching records. Supported formats include CSV,
          JSON record, and parquet.

  -q, --quarantine <quarantine>
          The location to store non-matching records. The format is required to match the
          format of the input.

  -m, --manifest <manifest>
          Manifest file defining the data set.

      --flatten
          If the input file is of a format that supports nested structures, this flag will
          cause the leaf values to be flattened into the top level of the record. For
          example, if the input file is JSON and contains the record `{"foo": {"bar": 1}}`,
          then it would be treated as equivalent to `{"foo.bar": 1}`. This flag is ignored
          for formats that do not support nested structures.

      --lift
          If the input file is of a format that supports nested structures, this flag will
          cause the leaf values to be lifted into the top level of the record. For example,
          if the input file is JSON and contains the record `{"foo": {"bar": 1}}`, then it
          would be treated as equivalent to `{"bar": 1}`. If multiple leaf values have the
          same name, an error will be returned. This flag is ignored for formats that do not
          support nested structures.

      --csv-delimiter <csv_delimiter>
          The character to use as the csv delimiter. This is usually a comma (`,`), but
          other common values include tab (`\t`) and pipe (`|`). The words `tab`, `pipe`,
          and `comma` can be used as aliases for the corresponding characters.

          [default: ,]

  -h, --help
          Print help
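Conceptually, `sift` partitions its input: every record ends up in exactly one of the two destinations. The Python sketch below shows that partitioning on in-memory records. The `sift` function and the `is_valid` predicate are hypothetical stand-ins; in bespoke the decision comes from the manifest and the results are written to the `--out` and `--quarantine` locations rather than returned as lists.

```python
def sift(records, is_valid):
    """Split records into (output, quarantine) using the is_valid predicate."""
    output, quarantine = [], []
    for record in records:
        # Each record goes to exactly one destination, preserving input order.
        (output if is_valid(record) else quarantine).append(record)
    return output, quarantine


# Hypothetical predicate standing in for a manifest: distance within 0..100.
def distance_in_range(record):
    return 0 <= record["distance"] <= 100


flights = [{"distance": 12}, {"distance": 250}, {"distance": 99}]
kept, quarantined = sift(flights, distance_in_range)
print(kept)
print(quarantined)
```

Because nothing is dropped, the output and quarantine together always account for every input record, which makes it possible to inspect and repair quarantined records later.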
> bespoke validate --help
Validate that the input data matches the types and rules of the specified data manifest.

Usage: validate [OPTIONS] --manifest <manifest> <input>

Arguments:
  <input>
          A file or directory of data to validate against the given data manifest. Supported
          formats include CSV, line delimited JSON, and parquet. Files may be gzipped. If a
          directory is specified, files with unrecognized extensions will be ignored. If a
          directory is specified, then the --output argument must also be a directory if
          specified.

Options:
  -m, --manifest <manifest>
          A data manifest defining the fields, types and rules of valid input data.

  -o, --out <error_output_file>
          A file in which to place structured invalidations of the input file. Supported
          formats include CSV, line delimited JSON, and parquet. Files may be gzipped. If a
          directory is specified, the output file will be named after the input file and
          placed in the directory.

  -a, --annotate <annotate>
          Annotate the output file with a string column containing the value given by the
          --annotate argument. This is useful for tracking the source of invalidations in a
          data pipeline. As an example the argument `--annotate foo=bar` will add a column
          named foo with the value bar to the output file.

      --verbose
          Prints invalid rows to stdout as they are found.

      --flatten
          If the input file is of a format that supports nested structures, this flag will
          cause the leaf values to be flattened into the top level of the record. For
          example, if the input file is JSON and contains the record `{"foo": {"bar": 1}}`,
          then it would be treated as equivalent to `{"foo.bar": 1}`. This flag is ignored
          for formats that do not support nested structures.

      --lift
          If the input file is of a format that supports nested structures, this flag will
          cause the leaf values to be lifted into the top level of the record. For example,
          if the input file is JSON and contains the record `{"foo": {"bar": 1}}`, then it
          would be treated as equivalent to `{"bar": 1}`. If multiple leaf values have the
          same name, an error will be returned. This flag is ignored for formats that do not
          support nested structures.

      --csv-delimiter <csv_delimiter>
          The character to use as the csv delimiter. This is usually a comma (`,`), but
          other common values include tab (`\t`) and pipe (`|`). The words `tab`, `pipe`,
          and `comma` can be used as aliases for the corresponding characters.

          [default: ,]

  -h, --help
          Print help
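The effect of `--annotate foo=bar` can be pictured as adding one constant column to every row of the invalidation output. The Python sketch below illustrates that idea only. The helper names and the shape of the example row are hypothetical (the help text does not describe the columns of the invalidation file), and splitting the argument at the first `=` is an assumption.

```python
def parse_annotation(argument):
    """Split 'name=value' at the first '=' into a (column, value) pair."""
    column, sep, value = argument.partition("=")
    if not sep or not column:
        raise ValueError(f"expected NAME=VALUE, got {argument!r}")
    return column, value


def annotate(rows, argument):
    """Return copies of rows with the annotation column added to each."""
    column, value = parse_annotation(argument)
    return [{**row, column: value} for row in rows]


# Hypothetical invalidation row; the real output columns are not documented here.
rows = [{"field": "distance", "reason": "maximum"}]
print(annotate(rows, "foo=bar"))
```

In a pipeline with several sources, running each validation with a different annotation value keeps the origin of every invalidation visible after the outputs are combined.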
> bespoke manifest --help
Manage data manifests.

Usage: manifest <COMMAND>

Commands:
  example   Print an example manifest file.
  lint      Lint the data manifest file.
  rules     Describe the rules that can be used in a manifest file.
  strip     Strip the data manifest file.
  validate  Validate the data manifest file.
  help      Print this message or the help of the given subcommand(s)

Options:
  -h, --help  Print help
> bespoke manifest example --help
Print an example manifest file.

Usage: example

Options:
  -h, --help  Print help
> bespoke manifest lint --help
Lint the data manifest file.

Usage: lint [OPTIONS] <manifest_file_or_directory>...

Arguments:
  <manifest_file_or_directory>...  The manifest file to lint.

Options:
  -i, --ignore <ignore>     Ignore the specified rule.
  -m, --include-motivation  Include the motivation for the rule in the output.
  -h, --help                Print help
> bespoke manifest rules --help
Describe the rules that can be used in a manifest file.

Usage: rules [OPTIONS]

Options:
      --json  Output the available rules as a json object.
  -h, --help  Print help
> bespoke manifest strip --help
Strip the data manifest file.

Usage: strip [OPTIONS] <manifest_file>

Arguments:
  <manifest_file>  The manifest file to strip.

Options:
  -o, --output <output_file>  The file to write the stripped manifest to.
  -h, --help                  Print help

Strip the given manifest file or set of manifest files of all non-essential fields. Only
fields that define functional behavior are kept.
> bespoke manifest validate --help
Validate the data manifest file.

Usage: validate <manifest_file_or_directory>...

Arguments:
  <manifest_file_or_directory>...  The manifest file to validate.

Options:
  -h, --help  Print help

Check that the given manifest file or set of manifest files are valid.