| Title: | Embedding Vectorization and Distance-Based Scoring Workflows |
|---|---|
| Description: | R-first orchestration for text vectorization (embeddings), embedding distance computation, and distance-based scoring workflows. Supports backend-neutral embedding providers (HF, OpenAI, TEI), prototype cosine-distance scoring, reference-area distance scoring, and threshold calibration utilities. |
| Authors: | Rainer M Krug [aut, cre], ChatGPT Assistant [ctb] |
| Maintainer: | Rainer M Krug <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.3.3 |
| Built: | 2026-06-03 18:32:23 UTC |
| Source: | https://github.com/openalexPro/openalexVectorComp |
Creates a configuration object used by the embedding backend adapter.
backend_config( provider = c("hf", "openai", "tei"), base_url = NULL, model = NULL, max_batch_size = NULL, timeout = 60, retries = 3, tei_url = NULL )

| Argument | Description |
|---|---|
| provider | Backend provider: "hf", "openai", or "tei". |
| base_url | Provider base URL. If |
| model | Optional model id. If |
| max_batch_size | Optional max texts per HTTP request. |
| timeout | Request timeout (seconds) used by backends that support it. |
| retries | Number of retry attempts for transient failures. |
| tei_url | Optional TEI compatibility argument. If provided, it is treated as the full embedding endpoint URL and overrides |

A named list with backend configuration.
Uses the configured backend adapter to embed a character vector of texts.
For authenticated providers, set OVC_API_TOKEN in the environment. The
adapter sends it as a bearer token.
backend_embed_texts(texts, backend = backend_config())

| Argument | Description |
|---|---|
| texts | Character vector of input texts. |
| backend | Backend configuration from backend_config(). |

Numeric matrix with one row per text and columns V1..Vd.
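The bearer-token convention described above can be sketched in base R as follows. This is only an illustration: the helper name `auth_header_sketch` is hypothetical and not part of the package, and the adapter's actual request code may differ.

```r
# Hypothetical helper (not a package function): build an Authorization
# header from the OVC_API_TOKEN environment variable, if it is set.
auth_header_sketch <- function() {
  token <- Sys.getenv("OVC_API_TOKEN", unset = "")
  if (!nzchar(token)) {
    # No token configured: send no Authorization header.
    return(character(0))
  }
  c(Authorization = paste("Bearer", token))
}

# Inspect the header that would be sent with the current environment.
auth_header_sketch()
```

Reading the token from the environment at call time keeps the secret out of saved configuration files such as embed_model.yaml.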
Returns normalized backend metadata used by the pipeline.
backend_info(backend = backend_config())

| Argument | Description |
|---|---|
| backend | Backend configuration from backend_config(). |

A list with fields provider, model_id, dim, max_batch_size,
and raw.
Reads backend configuration from a YAML file and returns a normalized object
in the same format as backend_config().
backend_read(fn = "embed_model.yaml")

| Argument | Description |
|---|---|
| fn | Path to YAML file. Defaults to "embed_model.yaml". |

Supports both the current flat format and legacy nested metadata format.
A backend configuration list compatible with
backend_config().
Writes a backend configuration (same shape as returned by
backend_config()) to YAML.
backend_save(backend = backend_config(), fn = "embed_model.yaml")

| Argument | Description |
|---|---|
| backend | Backend configuration from backend_config(). |
| fn | Output YAML file path. Defaults to "embed_model.yaml". |

Invisibly returns fn.
Convenience wrapper around backend_config() for a SPECTER2 setup served by
a local TEI (text-embeddings-inference) server. The model itself must be
prepared and started externally (see inst/scripts/prepare_specter2_merged.py
and inst/scripts/start_tei_specter2.sh, and the specter2-setup vignette).
backend_specter2_tei( port = 8080L, host = "localhost", model = "allenai/specter2_proximity_merged" )

| Argument | Description |
|---|---|
| port | Port that TEI is listening on. Defaults to 8080. |
| host | Host for TEI. Defaults to "localhost". |
| model | Provenance label for the served model. Defaults to "allenai/specter2_proximity_merged". |

The model argument is metadata only and is recorded in embed_model.yaml
and parquet partition paths; TEI itself loads whichever model it was started
with.
A backend configuration list compatible with backend_config().
Collect completed OpenAI batch embedding jobs
batch_collect_openai( project_dir, backend = backend_config(provider = "openai"), label = "corpus", verbose = TRUE )

| Argument | Description |
|---|---|
| project_dir | Project root directory. |
| backend | Backend configuration from backend_config(). |
| label | Embedding label partition to collect into. |
| verbose | Logical; print progress messages. |

Invisibly returns a list with collection summary.
Inspect OpenAI batch state for a label
batch_status_openai(project_dir, label = "corpus", refresh_remote = TRUE)

| Argument | Description |
|---|---|
| project_dir | Project root directory. |
| label | Embedding label. |
| refresh_remote | Logical; if |

A data frame with one row per tracked job.
Preprocesses corpus text, performs preflight request-size checks, splits work into compliant OpenAI batch jobs, submits them, and returns immediately.
batch_submit_openai( project_dir, backend = backend_config(provider = "openai"), corpus_name = "corpus", label = corpus_name, batch_size = 5000, delete_existing = FALSE, text_preprocessor = clean_abstract_for_embedding, cleaner_args = list(), save_text = TRUE, max_requests_per_job = 20000L, max_job_bytes = 150 * 1024^2, completion_window = "24h", verbose = TRUE )

| Argument | Description |
|---|---|
| project_dir | Project root directory. |
| backend | Backend configuration from backend_config(). |
| corpus_name | Folder name under |
| label | Embedding label partition. Defaults to corpus_name. |
| batch_size | Number of corpus rows per Arrow scan batch while preparing requests. |
| delete_existing | If |
| text_preprocessor | Text-preparation function returning |
| cleaner_args | Additional named arguments passed to |
| save_text | Logical; whether to keep cleaned text for downstream parquet output. |
| max_requests_per_job | Max requests per submitted OpenAI job. Must be <= 50000. |
| max_job_bytes | Max JSONL bytes per submitted OpenAI job. Must be <= 200 MB. |
| completion_window | OpenAI batch completion window. Defaults to "24h". |
| verbose | Logical; print progress messages. |

Invisibly returns a list with state path and submission summary.
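The job-splitting step can be pictured as a greedy pass over per-request sizes, sketched below in base R. The function name `split_into_jobs_sketch` and the greedy strategy are assumptions made for illustration; the package's actual preflight and splitting logic is not documented here and may differ.

```r
# Hypothetical sketch (not the package implementation): assign each request
# to a job so that no job exceeds a request-count or byte-size limit.
split_into_jobs_sketch <- function(request_bytes,
                                   max_requests_per_job = 20000L,
                                   max_job_bytes = 150 * 1024^2) {
  # Preflight: a single request larger than the job limit can never fit.
  stopifnot(all(request_bytes <= max_job_bytes))
  job_id <- integer(length(request_bytes))
  current_job <- 1L
  n_in_job <- 0L
  bytes_in_job <- 0
  for (i in seq_along(request_bytes)) {
    over_count <- n_in_job >= max_requests_per_job
    over_bytes <- bytes_in_job + request_bytes[i] > max_job_bytes
    if (over_count || over_bytes) {
      # Start a new job when adding this request would break a limit.
      current_job <- current_job + 1L
      n_in_job <- 0L
      bytes_in_job <- 0
    }
    job_id[i] <- current_job
    n_in_job <- n_in_job + 1L
    bytes_in_job <- bytes_in_job + request_bytes[i]
  }
  job_id
}

# Four requests with a 100-byte and 3-request limit per job.
split_into_jobs_sketch(c(40, 40, 40, 10),
                       max_requests_per_job = 3L,
                       max_job_bytes = 100)
```

In this toy call the third request would push the first job past 100 bytes, so it opens a second job that also takes the fourth request.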
Sweeps candidate thresholds over scores stored in a Parquet dataset without loading all rows into memory. Uses two passes: first to determine the score range on the labeled subset; second to accumulate confusion counts across a fixed grid of thresholds. Returns the best threshold per the chosen metric.
calibrate_threshold( scores_parquet, score_col, labels_parquet, metric = c("f1", "precision_at_recall"), recall_min = 0.8, thresholds = NULL, n_thresholds = 1001, batch_size = 1e+05, verbose = TRUE )

| Argument | Description |
|---|---|
| scores_parquet | Path to a Parquet dataset (file or directory) with at least columns |
| score_col | Name of the score column to calibrate (e.g., "ensemble", "relevance_score", or "margin"). |
| labels_parquet | Parquet dataset path with columns |
| metric | Optimisation target: "f1" or "precision_at_recall". |
| recall_min | Minimum recall required when metric = "precision_at_recall". |
| thresholds | Optional numeric vector of thresholds to evaluate. If |
| n_thresholds | Number of thresholds to generate when thresholds = NULL. |
| batch_size | Approximate Arrow scan batch size. |
| verbose | Logical; print progress messages. |

List containing the selected threshold (th) and the associated
precision, recall, and f1 values.
## Not run:
best <- calibrate_threshold(
  scores_parquet = "output/scores/",
  score_col = "ensemble",
  labels_parquet = "output/labels/",
  batch_size = 200000
)
best$th
## End(Not run)
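To make the sweep concrete, here is a small in-memory sketch of an F1 threshold search in base R. It does not stream Parquet batches as calibrate_threshold() is described to do, and it assumes that a row is predicted positive when its score is at or above the threshold; that direction, and the name `sweep_f1_sketch`, are assumptions for illustration only.

```r
# Hypothetical in-memory sketch (not the package implementation).
# Assumption: score >= threshold means "predicted positive".
sweep_f1_sketch <- function(score, label, n_thresholds = 1001) {
  stopifnot(length(score) == length(label), is.logical(label))
  # Fixed grid of candidate thresholds across the observed score range.
  thresholds <- seq(min(score), max(score), length.out = n_thresholds)
  rows <- lapply(thresholds, function(th) {
    pred <- score >= th
    tp <- sum(pred & label)   # true positives
    fp <- sum(pred & !label)  # false positives
    fn <- sum(!pred & label)  # false negatives
    precision <- if (tp + fp > 0) tp / (tp + fp) else NA_real_
    recall <- if (tp + fn > 0) tp / (tp + fn) else NA_real_
    f1 <- if (tp > 0) 2 * precision * recall / (precision + recall) else NA_real_
    c(th = th, precision = precision, recall = recall, f1 = f1)
  })
  grid <- do.call(rbind, rows)
  # which.max() skips NA entries and returns the first maximum.
  as.list(grid[which.max(grid[, "f1"]), ])
}

best <- sweep_f1_sketch(
  score = c(0.1, 0.2, 0.8, 0.9),
  label = c(FALSE, FALSE, TRUE, TRUE)
)
best$th
```

A streaming version would accumulate the tp/fp/fn counts per threshold across batches instead of computing them from full vectors.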
Applies lightweight rule-based cleaning to title/abstract rows and
returns embedding-ready text plus a deterministic text_hash.
clean_abstract_for_embedding( df, mode = c("lenient", "balanced", "strict"), no_abstract_policy = c("keep_title_only", "discard", "conditional"), min_chars = NULL, min_alpha_ratio = NULL, placeholder_patterns = NULL, boilerplate_patterns = NULL, html_patterns = NULL, return_flags = TRUE )

| Argument | Description |
|---|---|
| df | Data frame with columns |
| mode | Cleaning intensity: "lenient", "balanced", or "strict". |
| no_abstract_policy | Policy when abstract is missing/invalid: "keep_title_only", "discard", or "conditional". |
| min_chars | Optional minimum abstract length in characters after cleaning. If |
| min_alpha_ratio | Optional minimum ratio of alphabetic characters in the cleaned abstract. If |
| placeholder_patterns | Optional regex vector for placeholder abstract detection. |
| boilerplate_patterns | Optional regex vector for publisher boilerplate detection. |
| html_patterns | Optional regex vector for HTML/XML artifact detection. |
| return_flags | If |

A data frame with at least columns id, text, text_hash.
When return_flags = TRUE, also includes text_quality,
abstract_raw_present, abstract_kept, discard_reason, and
cleaning_mode.
Runs OpenAI batch status/collect for a prepared demo workspace and, when
batch embeddings are available, computes direct-vs-batch vector comparison.
Comparison artifacts are written to:
project/openai_batch_comparison/label=<label>/.
demo_finalize_openai_batch( demo_dir, api_key = NULL, label = "corpus_batch", refresh_remote = TRUE, verbose = TRUE )

| Argument | Description |
|---|---|
| demo_dir | Demo workspace directory created by |
| api_key | Optional OpenAI API key. If provided, it is set in |
| label | Batch embedding label to finalize. Defaults to "corpus_batch". |
| refresh_remote | Logical; forwarded to |
| verbose | Logical; print progress messages. |

Invisibly returns a list containing status/collect summaries, comparison readiness, and output paths.
Computes cosine distance as 1 - similarity_cosine(a, b).
distance_cosine(a, b)

| Argument | Description |
|---|---|
| a | Numeric vector. |
| b | Numeric vector or numeric matrix with embeddings in rows. |

A single numeric distance value, or NA_real_ when undefined.
Reads embeddings from a model-specific dataset and computes cosine distances
between all vectors in corpus_label and all vectors in reference_label.
A centroid row/column is added to the matrix:
rows are corpus ids plus "centroid" (corpus centroid),
columns are reference ids plus "centroid" (reference centroid).
distance_reference_cosine( project_dir, embeddings_dir = "model_id=BAAI_bge-small-en-v1.5", corpus_label = "corpus", reference_label = "reference", batch_size = 1e+05, max_cells = 5e+07, verbose = TRUE )

| Argument | Description |
|---|---|
| project_dir | Project root directory containing |
| embeddings_dir | Model subfolder under |
| corpus_label | Label partition used as corpus side. Defaults to "corpus". |
| reference_label | Label partition used as reference side. Defaults to "reference". |
| batch_size | Unused placeholder for compatibility with planned streaming extension. |
| max_cells | Maximum allowed matrix size ( |
| verbose | Logical; print progress messages. |

Embeddings are expected under:
project_dir/embeddings/model_id=<...>/label=<label>/batch=<n>/...
Output file:
pairwise-cosine.parquet: wide table with first column id (corpus id or
"centroid"), reference-id columns, and a final centroid column.
Invisibly returns the output directory
project_dir/distance_reference_cosine/model_id=<...>/corpus_label=<...>/reference_label=<...>/.
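The matrix layout described above (corpus rows, reference columns, plus one centroid row and column) can be sketched in base R on small in-memory matrices. The function name `pairwise_cosine_distance_sketch` is hypothetical, and the sketch skips the Parquet reading, the max_cells guard, and the file output that the real function performs.

```r
# Hypothetical sketch (not the package implementation): cosine distances
# between all corpus rows and all reference rows, with centroids appended.
pairwise_cosine_distance_sketch <- function(corpus, reference) {
  # Append each side's centroid (column means) as an extra row.
  corpus_all <- rbind(corpus, centroid = colMeans(corpus))
  reference_all <- rbind(reference, centroid = colMeans(reference))
  # Scale every row to unit length so the cross product is the cosine.
  unit_rows <- function(m) m / sqrt(rowSums(m^2))
  1 - unit_rows(corpus_all) %*% t(unit_rows(reference_all))
}

corpus <- matrix(c(1, 0,
                   0, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("W1", "W2"), NULL))
reference <- matrix(c(1, 0,
                      1, 1),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("R1", "R2"), NULL))

d <- pairwise_cosine_distance_sketch(corpus, reference)
d["W1", "R1"]  # identical direction, distance 0
```

The result has one more row and one more column than the inputs, which is why a cell limit such as max_cells matters for large corpora.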
Fits (or loads) a reference-area model from reference_label embeddings and
computes squared Mahalanobis distance for rows in corpus_label.
distance_ridge( project_dir, reference_label = "reference", corpus_label = "corpus", fit_path = NULL, batch_size = 1e+05, regularization = 1e-06, verbose = TRUE )

| Argument | Description |
|---|---|
| project_dir | Project root containing |
| reference_label | Label partition used to fit the reference area. Defaults to "reference". |
| corpus_label | Label partition to score. Defaults to "corpus". |
| fit_path | Optional path to an existing reference-area fit ( |
| batch_size | Approximate number of rows per Arrow scan batch. |
| regularization | Diagonal covariance regularization added before inversion. |
| verbose | Logical; print progress messages. |

Invisibly returns the model output directory under
project_dir/distance_ridge/model_id=<...>/corpus_label=<...>/reference_label=<...>/.
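As a minimal sketch of the underlying computation, the following base-R code fits a centroid and a diagonally regularized covariance on simulated reference vectors and then computes squared Mahalanobis distances for a few corpus vectors. The data are random and the variable names are illustrative; the package's own fitting and streaming code may differ.

```r
# Illustrative sketch with simulated data (not the package implementation).
set.seed(1)
reference <- matrix(rnorm(200 * 4), ncol = 4)  # 200 reference embeddings
corpus <- matrix(rnorm(5 * 4), ncol = 4)       # 5 corpus embeddings

regularization <- 1e-06
centroid <- colMeans(reference)
# Add a small value to the diagonal before inversion for numerical stability.
cov_reg <- stats::cov(reference) + diag(regularization, ncol(reference))
cov_inv <- solve(cov_reg)

# stats::mahalanobis() returns the squared distance; inverted = TRUE tells it
# that the supplied matrix is already the inverse covariance.
area_distance <- stats::mahalanobis(corpus, center = centroid,
                                    cov = cov_inv, inverted = TRUE)
area_distance
```

Storing the centroid and the inverse covariance once is what allows a saved fit to be reused for scoring further batches.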
Opens two Parquet datasets (prototype margins and ridge scores) as Arrow
datasets and performs a lazy inner join on the common key (typically id).
The result is an Arrow-dplyr query that is not materialized until you call
dplyr::collect() or write it with arrow::write_dataset().
distances(prototype_distances, ridge_distance)

| Argument | Description |
|---|---|
| prototype_distances | Path to a Parquet dataset (file or directory) containing prototype distances, e.g., columns |
| ridge_distance | Path to a Parquet dataset (file or directory) containing ridge-based scores, e.g., columns |

A lazy Arrow dplyr query representing the joined datasets.
## Not run:
joined <- distances(
  prototype_distances = "path/to/prototype_distances/",
  ridge_distance = "path/to/ridge_scores/"
)
# Continue piping lazily and write without loading into memory
joined |>
  dplyr::mutate(ensemble = (margin + relevance_score) / 2) |>
  arrow::write_dataset(path = "path/to/output_scores/", format = "parquet")
# Or collect a small sample for inspection
head(dplyr::collect(joined))
## End(Not run)
Processes a Parquet dataset without loading it fully in memory. Reads Arrow
record batches, builds canonical text from title + abstract, calls the
configured embedding backend, and writes Parquet batch files.
embed_corpus( project_dir = NULL, backend = backend_config(), corpus_name = "corpus", batch_size = 5000, delete_existing = FALSE, text_preprocessor = clean_abstract_for_embedding, cleaner_args = list(), save_text = TRUE, label = corpus_name, dry_run = FALSE, verbose = TRUE )

| Argument | Description |
|---|---|
| project_dir | Project root directory. Must contain |
| backend | Backend configuration created with backend_config(). |
| corpus_name | Folder name under |
| batch_size | Number of corpus rows per Arrow scan batch. |
| delete_existing | If |
| text_preprocessor | Function that prepares embedding text from a batch data frame and returns at least columns |
| cleaner_args | Named list of additional arguments passed to |
| save_text | Logical; if |
| label | Partition label written under |
| dry_run | Logical; if |
| verbose | Logical; print progress and summary messages. |

Invisibly returns the model-specific embeddings directory under
project_dir/embeddings/model_id=<...>/.
Sends a character vector to the configured backend and returns embeddings as a numeric matrix.
embed_texts(texts, backend = backend_config())

| Argument | Description |
|---|---|
| texts | Character vector of texts to embed. Empty inputs return a 0-row matrix; missing values are not supported. |
| backend | Backend configuration created with backend_config(). |

A numeric matrix with one row per input text and one column per embedding dimension.
Fits a reference-area model (centroid + regularized covariance inverse)
using rows from reference_label.
fit_ridge( embeddings, reference_label = "reference", output, regularization = 1e-06, verbose = TRUE )

| Argument | Description |
|---|---|
| embeddings | Path to a Parquet dataset (file or directory opened by Arrow) with columns |
| reference_label | Label partition used to define the reference area. |
| output | Name of the |
| regularization | Positive numeric diagonal regularization added to covariance. |
| verbose | Logical; print progress messages. |

Invisibly returns output.
Reads an embeddings Parquet dataset (produced by embed_corpus()) with columns
id and V1..Vd, computes a PCA on the embedding matrix, and returns a
scatter plot of the first two principal components. Points are colored by
labels provided via labels. Rows not found in labels are shown as
"other".
plot_embeddings_pca( embeddings, labels, center = TRUE, scale. = FALSE, point_size = 2, alpha = 0.5 )

| Argument | Description |
|---|---|
| embeddings | Path to a Parquet file or dataset directory containing columns |
| labels | Label mapping for ids. Supported formats: |
| center, scale. | Passed to |
| point_size, alpha | Point size and transparency for points in the plot. Defaults: 2 and 0.5. |

A ggplot object with points mapped to PC1 vs PC2 and colored by
group.
## Not run:
p <- plot_embeddings_pca(
  embeddings = "inst/examples/embedings/",
  labels = data.frame(
    id = c("W1", "W2", "W10"),
    label = c("reference", "reference", "corpus")
  )
)
print(p)
## End(Not run)
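The projection and labelling steps can be sketched without Parquet or ggplot2, using simulated embeddings and stats::prcomp(). The data and object names below are illustrative assumptions; the sketch only shows how ids missing from the label mapping end up in the "other" group.

```r
# Illustrative sketch with simulated data (not the package implementation).
set.seed(42)
emb <- data.frame(
  id = paste0("W", 1:6),
  matrix(rnorm(6 * 4), ncol = 4, dimnames = list(NULL, paste0("V", 1:4)))
)
labels <- data.frame(
  id = c("W1", "W2"),
  label = c("reference", "reference")
)

# PCA on the embedding columns only.
pca <- stats::prcomp(emb[, paste0("V", 1:4)], center = TRUE, scale. = FALSE)

# Map ids to groups; ids without a label fall back to "other".
group <- labels$label[match(emb$id, labels$id)]
group[is.na(group)] <- "other"

pcs <- data.frame(id = emb$id, PC1 = pca$x[, 1], PC2 = pca$x[, 2], group = group)
pcs
```

A data frame of this shape is what a scatter plot of PC1 against PC2, colored by group, would be drawn from.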
Computes a 2D UMAP projection of V1..Vd and returns a scatter plot colored
by labels membership. Uses cosine distance by default to align
with common embedding similarity.
plot_embeddings_umap( embeddings, labels, n_neighbors = 15, min_dist = 0.1, metric = "cosine", n_epochs = 500, seed = 42, sample_n = NULL, point_size = 2, alpha = 0.5 )

| Argument | Description |
|---|---|
| embeddings | Path to a Parquet file or dataset directory containing columns |
| labels | Label mapping for ids. Supported formats: |
| n_neighbors, min_dist, metric, n_epochs | UMAP parameters passed to |
| seed | Random seed for reproducibility (set to |
| sample_n | Optional maximum number of rows to sample for plotting (applied before UMAP). If |
| point_size, alpha | Point size and transparency for points in the plot. Defaults: 2 and 0.5. |

A ggplot object of UMAP1 vs UMAP2 colored by group.
Uses the same demo structure as run_demo_openalex(), but configures
an OpenAI backend and requires an explicit API key argument. The key is set
in OVC_API_TOKEN for the duration of the call.
run_demo_openai( api_key, demo_dir = file.path(getwd(), "demos", "openai"), render = TRUE, model = "text-embedding-3-small", max_corpus = 100, max_reference = 10, overwrite = FALSE, quarto_file = "openai_demo_analysis.qmd", verbose = TRUE )

| Argument | Description |
|---|---|
| api_key | OpenAI API key. Must be a non-empty string. |
| demo_dir | Demo workspace directory. Defaults to file.path(getwd(), "demos", "openai"). |
| render | Logical; if |
| model | OpenAI embedding model id. Defaults to "text-embedding-3-small". |
| max_corpus | Maximum number of corpus fixture rows to copy. |
| max_reference | Maximum number of reference fixture rows to copy. |
| overwrite | Logical; if |
| quarto_file | Name of the analysis file created in demo_dir. |
| verbose | Logical; print progress messages. |

Invisibly returns a list with project paths and render status.
Sets up a demo workspace under demo_dir, creates a pipeline project under
demo_dir/project, copies small corpus and
reference fixtures and Quarto template from inst/ovc_demo, and optionally
renders the analysis.
run_demo_openalex( demo_dir = file.path(getwd(), "demos", "openalex"), render = TRUE, backend = NULL, max_corpus = 100, max_reference = 10, overwrite = FALSE, quarto_file = "openalex_demo_analysis.qmd", verbose = TRUE )

| Argument | Description |
|---|---|
| demo_dir | Demo workspace directory. Defaults to file.path(getwd(), "demos", "openalex"). |
| render | Logical; if |
| backend | Optional backend config from backend_config(). |
| max_corpus | Maximum number of corpus fixture rows to copy. |
| max_reference | Maximum number of reference fixture rows to copy. |
| overwrite | Logical; if |
| quarto_file | Name of the analysis file created in demo_dir. |
| verbose | Logical; print progress messages. |

The workspace keeps all generated directories and output artifacts. The
Quarto file is created in demo_dir, while embedding pipeline data is stored
in demo_dir/project.
Invisibly returns a list with project paths and render status.
Reads a wide reference-cosine distance matrix (as written by
distance_reference_cosine()) and converts all numeric distance columns to
scores.
score_reference_cosine( distance_parquet, output_dir = NULL, method = c("linear", "exponential"), alpha = 1, verbose = TRUE )

| Argument | Description |
|---|---|
| distance_parquet | Path to a Parquet dataset (file or directory) with first column |
| output_dir | Optional output directory. If |
| method | Scoring transform: "linear" or "exponential". |
| alpha | Positive numeric scaling factor used when |
| verbose | Logical; print progress messages. |

Invisibly returns output directory.
Reads a distance dataset with columns id and area_distance, computes
relevance_score = exp(-alpha * area_distance), and writes a scored Parquet
dataset.
score_ridge(distance_parquet, output_dir = NULL, alpha = 0.5, verbose = TRUE)

| Argument | Description |
|---|---|
| distance_parquet | Path to a Parquet dataset (file or directory) with at least columns id and area_distance. |
| output_dir | Optional output directory. If |
| alpha | Positive numeric scaling factor in exp(-alpha * area_distance). |
| verbose | Logical; print progress messages. |

Invisibly returns output directory.
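The transform relevance_score = exp(-alpha * area_distance) stated above can be tried on a few made-up distances in base R. The input values are illustrative only; the real function reads and writes Parquet datasets.

```r
# Illustrative values (not package data): squared distances for three rows.
scores <- data.frame(
  id = c("W1", "W2", "W3"),
  area_distance = c(0, 1, 4)
)
alpha <- 0.5

# A distance of 0 maps to a score of 1; larger distances decay towards 0.
scores$relevance_score <- exp(-alpha * scores$area_distance)
scores
```

A larger alpha makes the score fall off faster with distance, so it controls how sharply rows far from the reference area are penalized.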
Computes cosine similarity for two numeric vectors of equal length. Returns
NA_real_ if either vector has zero norm.
similarity_cosine(a, b)

| Argument | Description |
|---|---|
| a | Numeric vector. |
| b | Numeric vector or numeric matrix with embeddings in rows. |

A single numeric similarity value in [-1, 1], or NA_real_ when
undefined.
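For the vector-vector case, cosine similarity and the derived cosine distance can be sketched in base R as below. The function names carry a `_sketch` suffix to make clear they are hypothetical stand-ins, not the package functions, and the matrix form of b is not covered.

```r
# Hypothetical stand-ins for the vector-vector case (not package functions).
cosine_similarity_sketch <- function(a, b) {
  stopifnot(is.numeric(a), is.numeric(b), length(a) == length(b))
  norm_a <- sqrt(sum(a^2))
  norm_b <- sqrt(sum(b^2))
  # Similarity is undefined when either vector has zero norm.
  if (norm_a == 0 || norm_b == 0) {
    return(NA_real_)
  }
  sum(a * b) / (norm_a * norm_b)
}

# Cosine distance as 1 - similarity; NA propagates when undefined.
cosine_distance_sketch <- function(a, b) {
  1 - cosine_similarity_sketch(a, b)
}

cosine_similarity_sketch(c(1, 0), c(0, 1))  # orthogonal vectors
cosine_distance_sketch(c(1, 2), c(2, 4))    # same direction
cosine_similarity_sketch(c(0, 0), c(1, 1))  # zero-norm input
```

Because similarity lies in [-1, 1], the corresponding distance lies in [0, 2], with 0 for vectors pointing in the same direction.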