watchful.attributes module

This script provides the functions required for data enrichment.

watchful.attributes.adjust_span_offsets_from_char_to_byte(cell: str, enriched_cell: List[Tuple[List[Tuple[int]] | Dict[str, List[str]] | str | None]]) List[Tuple[List[Tuple[int]] | Dict[str, List[str]] | str | None]][source]

This function adjusts all the spans of an enriched cell from character offsets to byte offsets, since Watchful’s data enrichment API takes in byte offsets. This is useful if your data enrichment functions and models create character offsets.

Parameters:
  • cell (str) – The string value contained in the cell.

  • enriched_cell (EnrichedCell) – A list of attributes for the cell.

Returns:

The list of attributes for the cell whose span offsets have been adjusted.

Return type:

EnrichedCell
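The difference between character and byte offsets only shows up when a cell contains multi-byte characters. The following is a minimal sketch of the conversion, not the library's implementation: it assumes UTF-8 encoding (the source does not name the encoding), and `char_to_byte_spans` is a hypothetical helper that works on a plain list of spans rather than a full EnrichedCell.

```python
from typing import List, Tuple


def char_to_byte_spans(cell: str, spans: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Convert (start, end) character offsets into byte offsets (UTF-8 assumed)."""
    # byte_at[i] is the number of UTF-8 bytes that precede character i.
    byte_at = [0]
    for ch in cell:
        byte_at.append(byte_at[-1] + len(ch.encode("utf-8")))
    return [(byte_at[start], byte_at[end]) for start, end in spans]


# "\u00e9" (e with acute accent) is one character but two bytes in UTF-8.
cell = "caf\u00e9 au lait"
char_spans = [(0, 4), (5, 7), (8, 12)]
byte_spans = char_to_byte_spans(cell, char_spans)
```

Every span that starts after the accented character is shifted by one byte, which is why character offsets cannot be passed through unchanged.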

watchful.attributes.atterize_values_in_cell(cell: str, attribute_name: str, values: List[Pattern]) List[Tuple[List[Tuple[int]] | Dict[str, List[str]] | str | None]][source]

This is a helper function to create_attribute_for_values() for finding the spans for each value in values.

Parameters:
  • cell (str) – The original cell.

  • attribute_name (str) – The attribute name.

  • values (List[Pattern]) – The list of values to find spans for.

Returns:

The enriched cell.

Return type:

EnrichedCell
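As a rough illustration of finding the spans for each value, the sketch below collects the match positions of compiled patterns in a cell. `find_value_spans` is a hypothetical helper; it returns only a sorted list of character spans and makes no claim about how the library assembles the full EnrichedCell.

```python
import re
from typing import List, Tuple


def find_value_spans(cell: str, values: List[re.Pattern]) -> List[Tuple[int, int]]:
    """Return the sorted (start, end) character spans of every pattern match."""
    spans: List[Tuple[int, int]] = []
    for pattern in values:
        # match.span() gives the (start, end) offsets of one match.
        spans.extend(match.span() for match in pattern.finditer(cell))
    return sorted(spans)


values = [re.compile(r"cat"), re.compile(r"dog")]
spans = find_value_spans("dog chases cat, cat runs", values)
```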

watchful.attributes.base64(num: int) str[source]

This function takes in an integer value and returns its encoded string value.

Parameters:

num (int) – The integer value.

Returns:

The encoded string value.

Return type:

str

watchful.attributes.base64str(list_of_integers: List[int]) str[source]

This function takes in a list of integers and returns its encoded string value with substrings representing those integers in base64.

Additional compression is done by concatenating consecutive base64 strings of the same length. A compressed string s is detected by its first character s[0], whose ASCII code lies in the range 35 to 42, inclusive. The rest of the string, s[1:], should be partitioned into substrings of length ascii_code(s[0]) - 34 each.

Examples:
  • “#1234” represents “1,2,3,4”

  • “$1234” represents “12,34”

  • “&1234” represents “1234” (which is never compressed since the original value is shorter than the compressed value)

The ASCII codes for the character prefixes in the examples are:
  • “#” => 35

  • “$” => 36

  • “&” => 38

Compression will not be done for a base64 encoded integer if it is preceded or succeeded by a base64 encoded integer of a different length.

The range 35 inclusive to 42 inclusive is chosen because it contains characters that do not need to be escaped in JSON, nor does the range contain comma (“,”) as it is used as a delimiter to concatenate all of the strings.

Parameters:

list_of_integers (List[int]) – The list of integers.

Returns:

The encoded string value.

Return type:

str
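The prefix rule can be checked with a small decoder. This is a sketch of the partitioning step only, based on the description above: `split_compressed` is a hypothetical name, and the function does not decode the base64 substrings back into integers because the alphabet is not specified here.

```python
from typing import List


def split_compressed(encoded: str) -> List[str]:
    """Expand an encoded string into its individual base64 substrings."""
    parts: List[str] = []
    for piece in encoded.split(","):
        if piece and 35 <= ord(piece[0]) <= 42:
            # The prefix character gives the width of each substring.
            width = ord(piece[0]) - 34
            body = piece[1:]
            parts.extend(body[i:i + width] for i in range(0, len(body), width))
        else:
            # No prefix: the piece is a single base64 substring.
            parts.append(piece)
    return parts
```

For instance, `split_compressed("$1234")` yields the two substrings "12" and "34", matching the second example above.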

watchful.attributes.contig_spans(spans: List[Tuple[int, int]]) List[int][source]

This function converts a list of spans, i.e. [(start_1, end_1), …, (start_N, end_N)], to a list of contiguous spans, i.e. [gap_len_1, span_len_1, …, gap_len_N, span_len_N].

Parameters:

spans (List[Tuple[int, int]]) – The list of spans.

Returns:

The list of contiguous spans.

Return type:

List[int]
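One plausible reading of the gap and span lengths is sketched below. It assumes the spans are sorted and non-overlapping, that each gap is measured from the end of the previous span (from offset 0 for the first span), and that a span's length is end minus start. These are assumptions, and `to_contiguous` is a hypothetical name rather than the library function.

```python
from typing import List, Tuple


def to_contiguous(spans: List[Tuple[int, int]]) -> List[int]:
    """Turn [(start, end), ...] into [gap_len, span_len, ...]."""
    out: List[int] = []
    prev_end = 0  # assumed: the first gap is measured from offset 0
    for start, end in spans:
        out.append(start - prev_end)  # gap since the previous span ended
        out.append(end - start)       # length of this span
        prev_end = end
    return out
```

Under these assumptions, the spans (2, 5) and (7, 10) become the list 2, 3, 2, 3.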

watchful.attributes.create_attribute_for_values(attribute_name: str, values: List[Pattern]) str[source]

This function takes an attribute name and a list of known values to create attributes for. The list of values will be looked up in each cell of the currently loaded dataset. An attributes file will be prepared to be loaded into the Watchful application. It returns the filename of the created attribute file, which can be used by the attributes action and function: api.load_attributes(dataset_id, attribute_filename).

Parameters:
  • attribute_name (str) – The attribute name.

  • values (List[Pattern]) – The list of known values to create attributes for.

Returns:

The used attributes filename.

Return type:

str

watchful.attributes.enrich(in_file: str, out_file: str, enrich_row_fn: Callable, enrichment_args: Tuple) None[source]

This function enriches a dataset row by row, using an enrichment function together with additional enrichment objects, and then produces the attributes.

Parameters:
  • in_file (str) – The filepath of the CSV-formatted original dataset, or of the dataset exported from Watchful. The latter is the former with the Watchful columns “Hints” and “HandLabels” appended. These columns are therefore reserved for Watchful and should not be present in the original dataset.

  • out_file (str) – The filepath to which the enriched attributes are written, in the Watchful custom format for ingestion by the Watchful application.

  • enrich_row_fn (Callable) – The user-defined function for enriching every row of the dataset.

  • enrichment_args (Tuple) – The additional enrichment objects to perform the data enrichment.

watchful.attributes.enrich_row(row: Dict[str | None, str | None]) List[List[Tuple[List[Tuple[int]] | Dict[str, List[str]] | str | None]]][source]

This function enriches one row. It takes the named cells of an input row and returns an enriched row. The global ENRICHMENT_ARGS must already have been set (see init_args()) so that it can be used here.

Parameters:

row (Dict[Optional[str], Optional[str]]) – The dictionary of named cell values in the row.

Returns:

The list of enriched cell values in the row.

Return type:

List[EnrichedCell]

watchful.attributes.enrich_row_with_attribute_data(row: Dict[str | None, str | None]) List[List[Tuple[List[Tuple[int]] | Dict[str, List[str]] | str | None]]][source]

This function extracts the attributes from a row of an attributes file. Attributes are associated with the entire text in each named cell of the input dataset row. The entire text in each cell of the input dataset row is identified by its byte start index and byte end index.

Parameters:

row (Dict[Optional[str], Optional[str]]) – The dictionary of named cell values in the row.

Returns:

The list of enriched cell values in the row.

Return type:

List[EnrichedCell]

watchful.attributes.flair_atterize(sent) List[Tuple[List[Tuple[int]] | Dict[str, List[str]] | str | None]][source]

This function creates an enriched cell from the cell inference derived by Flair NLP. It extracts attributes from a Flair paragraph. Attributes are associated with substrings, which are tokens, entities, sentences or noun chunks. Every substring is identified by its character start index and character end index.

Parameters:

sent (flair.data.Sentence) – Cell inference.

Returns:

The enriched cell.

Return type:

EnrichedCell

watchful.attributes.flair_atterize_fn(cell: str, flair_atterize_: Callable, tagger_pred: Callable, sent_fn: Callable) List[Tuple[List[Tuple[int]] | Dict[str, List[str]] | str | None]][source]

This function creates an enriched cell from the original cell using the Flair NLP enrichment objects.

Parameters:
  • cell (str) – The original cell.

  • flair_atterize_ (Callable) – The enrichment function that creates an enriched cell from the cell inference derived by Flair NLP.

  • tagger_pred (Callable) – A Flair NLP enrichment object.

  • sent_fn (Callable) – A Flair NLP enrichment object.

Returns:

The enriched cell.

Return type:

EnrichedCell

watchful.attributes.get_context(attribute_filename: str) Tuple[str, str, str][source]

This function takes in an attributes filename, finds the current dataset file loaded in Watchful and returns the context needed to enrich that dataset. This context includes the filename of the file used by the attributes action and function: client.load_attributes(dataset_id, attribute_filename).

Parameters:

attribute_filename (str) – The input attributes filename.

Returns:

The dataset filepath, used attributes filepath and used attributes filename.

Return type:

Tuple[str, str, str]

watchful.attributes.get_dataset_id_dir_filepath(summary: Dict, in_file: str | None = '', is_local: bool | None = True) Tuple[str, str, str][source]

This function returns the id, directory and filepath of the currently opened dataset.

Parameters:
  • summary (Dict) – The dictionary of the HTTP response from a connection request.

  • in_file (str, optional) – The dataset filepath, defaults to “”.

  • is_local (bool, optional) – Boolean indicating whether the Watchful application is local (otherwise hosted), defaults to True.

Returns:

The id, directory and filepath of the currently opened dataset.

Return type:

Tuple[str, str, str]

watchful.attributes.get_vars_for_enrich_row_with_attribute_data(attr_names: str, attr_filepath: str) Tuple[Callable, List[str], reader][source]

This function takes in a comma-delimited string of attribute names and the csv attributes filepath. It returns a function that takes in a full row of attributes and returns the desired attributes, the attribute names as a list, and the csv attribute reader.

Parameters:
  • attr_names (str) – The comma-delimited attribute names.

  • attr_filepath (str) – The attributes csv filepath.

Returns:

A function that takes in a full row of attributes and returns the desired attributes, the list of attribute names, and the csv attribute reader.

Return type:

Tuple[Callable, List[str], csv.reader]

watchful.attributes.init_args(*args) None[source]

In this function, we create variables that we will store in the global ENRICHMENT_ARGS. We then later use them in enrich_row() to enrich our data row by row.

This function initializes a per-process context with the user function that will be used in the multiprocessing.Pool.imap. This is not necessarily thread-safe but is multiprocess-safe.

Parameters:

args (Tuple) – A tuple of objects of any type, to be used for the data enrichment.
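The per-process initialization described above follows a common multiprocessing pattern, sketched here under stated assumptions. The names `_ENRICHMENT_ARGS`, `init_worker`, `enrich_one_row` and `enrich_rows` are illustrative stand-ins, and the sketch assumes the stored tuple holds a single per-cell function; it is not the module's actual code.

```python
import multiprocessing
from typing import Dict, List, Optional, Tuple

# Per-process storage for the enrichment objects (illustrative name).
_ENRICHMENT_ARGS: Optional[Tuple] = None


def init_worker(*args) -> None:
    """Store the enrichment objects once in the current process."""
    global _ENRICHMENT_ARGS
    _ENRICHMENT_ARGS = args


def enrich_one_row(row: Dict[str, str]) -> List[str]:
    """Apply the stored per-cell function to every cell of the row."""
    (cell_fn,) = _ENRICHMENT_ARGS  # assumed: exactly one stored object
    return [cell_fn(cell) for cell in row.values()]


def enrich_rows(rows: List[Dict[str, str]], args: Tuple, chunksize: int = 1) -> List[List[str]]:
    """Enrich rows in a pool whose workers are each initialized with args."""
    with multiprocessing.Pool(initializer=init_worker, initargs=args) as pool:
        return list(pool.imap(enrich_one_row, rows, chunksize))


# In-process use: initialize once, then enrich row by row.
init_worker(str.upper)
result = enrich_one_row({"a": "x", "b": "y"})
```

When using the pool variant, call `enrich_rows` from under an `if __name__ == "__main__":` guard, and pass only picklable objects in `args`. Because each worker process holds its own copy of the global, the pattern is safe across processes even though a shared global would not be safe across threads.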

watchful.attributes.load_flair() Tuple[source]

This function creates and returns the Flair NLP objects for data enrichment.

Returns:

The tuple of Flair NLP objects.

Return type:

Tuple

watchful.attributes.load_spacy() Tuple[source]

This function creates and returns the SpaCy NLP objects for data enrichment.

Returns:

The tuple of SpaCy NLP objects.

Return type:

Tuple

watchful.attributes.print_multiproc_params() None[source]

This function prints the multiprocessing flag and the multiprocessing chunk size for the data enrichment. This is still in internal alpha mode and is not expected to be used by users.

watchful.attributes.proc_enriched_cell(enriched_cell: List[Tuple[List[Tuple[int]] | Dict[str, List[str]] | str | None]]) None[source]

This function is called on every enriched cell. Optionally, you may add code if you wish to do something auxiliary with every cell.

Parameters:

enriched_cell (EnrichedCell) – An enriched cell.

watchful.attributes.proc_enriched_row(enriched_row: List[List[Tuple[List[Tuple[int]] | Dict[str, List[str]] | str | None]]]) None[source]

This function is called on every enriched row. Optionally, you may add code if you wish to do something auxiliary with every row.

Parameters:

enriched_row (List[EnrichedCell]) – A list of enriched cells.

watchful.attributes.set_multiproc_chunksize(multiproc_chunksize: int) None[source]

This function sets the multiprocessing chunk size for the data enrichment, if multiprocessing is used. This is still in internal alpha mode and is not expected to be used by users.

Parameters:

multiproc_chunksize (int) – The multiprocessing chunk size, at least 1.

watchful.attributes.set_multiprocessing(is_multiproc: bool) None[source]

This function sets whether multiprocessing is used for the data enrichment. This is still in internal alpha mode and is not expected to be used by users.

Parameters:

is_multiproc (bool) – The multiprocessing flag.

watchful.attributes.spacy_atterize(doc) List[Tuple[List[Tuple[int]] | Dict[str, List[str]] | str | None]][source]

This function creates an enriched cell from the cell inference derived by SpaCy NLP. It extracts attributes from a SpaCy document. Attributes are associated with substrings, which are tokens, entities, sentences or noun chunks. Every substring is identified by its character start index and character end index.

Parameters:

doc (spacy.tokens.doc.Doc) – Cell inference.

Returns:

The enriched cell.

Return type:

EnrichedCell

watchful.attributes.spacy_atterize_fn(cell: str, spacy_atterize_: Callable, nlp: Callable) List[Tuple[List[Tuple[int]] | Dict[str, List[str]] | str | None]][source]

This function creates an enriched cell from the original cell using the SpaCy NLP enrichment objects.

Parameters:
  • cell (str) – The original cell.

  • spacy_atterize_ (Callable) – The enrichment function that creates an enriched cell from the cell inference derived by SpaCy NLP.

  • nlp (Callable) – A SpaCy NLP enrichment object.

Returns:

The enriched cell.

Return type:

EnrichedCell

watchful.attributes.validate_attribute_names(attr_names: str, attr_filepath: str) bool[source]

This function checks that all attribute names are present in the attribute file. It returns False as soon as an attribute name is absent, or True when all attribute names match.

Parameters:
  • attr_names (str) – The comma-delimited attribute names.

  • attr_filepath (str) – The attributes filepath.

Returns:

The boolean indicating if all the attribute names are present in the attributes file.

Return type:

bool
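The early-exit check can be sketched as follows. This assumes the attributes file is a CSV whose first row is a header of attribute names, which the source does not confirm. `names_in_header` is a hypothetical helper that takes an open file object instead of a filepath so that it can be tried without touching disk.

```python
import csv
import io
from typing import TextIO


def names_in_header(attr_names: str, attr_file: TextIO) -> bool:
    """Return True only if every comma-delimited name is in the CSV header."""
    # Assumed: the first CSV row lists the available attribute names.
    header = next(csv.reader(attr_file), [])
    available = set(header)
    for name in attr_names.split(","):
        # Surrounding whitespace is ignored here (an assumption).
        if name.strip() not in available:
            return False  # stop at the first missing name
    return True


sample = "id,pos,lemma\n1,NOUN,cat\n"
all_present = names_in_header("pos,lemma", io.StringIO(sample))
one_missing = names_in_header("pos,tag", io.StringIO(sample))
```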

watchful.attributes.writer(output: TextIOWrapper, n_rows: int, n_cols: int) Callable[source]

This function takes in the output file object and the number of rows and columns of the dataset. It returns a write function that takes in all of the attributes for a cell in the dataset, where a cell is located on a row and a column pair.

The cells’ attributes should be in this shape (note that the following is in Rust idiom):

    [
        (
            spans: Vec<(int, int)>,
            attr_vals: Map<String, Vec<Any>>,
            name: Option<String>
        )
    ]

  • spans is a sorted vector of span (start, end) in the cell.

  • attr_vals is a map from attribute name to values for the spans. None means that the attribute has no value for that token defined by its span.

  • name is an optional parameter which can be used to give a name to the spans, where the attribute value of that name is the content of the spans themselves. Examples of this are sentences, noun_chunks, tokens or collage_names.

Parameters:
  • output (io.TextIOWrapper) – The output file for all the attributes of a dataset.

  • n_rows (int) – The number of rows of the original dataset.

  • n_cols (int) – The number of columns of the original dataset.

Returns:

The function that takes in all of the attributes for a cell in the dataset and writes an encoded representation for them onto output.

Return type:

Callable
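To make the shape of a cell's attributes concrete, here is a minimal Python sketch that builds one (spans, attr_vals, name) tuple for whitespace-separated tokens. The `enrich_tokens` helper, the "capitalized" attribute and its "yes" value are all hypothetical and chosen only for illustration. The spans are character offsets, so a real pipeline would still need to convert them to byte offsets before they are written.

```python
import re
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]
EnrichedCell = List[Tuple[List[Span], Dict[str, List[Optional[str]]], Optional[str]]]


def enrich_tokens(cell: str) -> EnrichedCell:
    """Build one attribute group whose spans are whitespace-separated tokens."""
    spans: List[Span] = []
    capitalized: List[Optional[str]] = []
    for match in re.finditer(r"\S+", cell):
        spans.append((match.start(), match.end()))
        # None marks a span that has no value for this attribute.
        capitalized.append("yes" if match.group()[0].isupper() else None)
    # The name "tokens" labels the spans themselves.
    return [(spans, {"capitalized": capitalized}, "tokens")]


enriched = enrich_tokens("Alice met bob")
```

Each list in attr_vals has one entry per span, in the same order as spans, which is what lets None stand for "no value for this token".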