---
title: "package-overview"
vignette: >
  %\VignetteIndexEntry{package-overview}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
format:
  html:
    mermaid:
      theme: forest
knitr:
  opts_chunk:
    collapse: true
    comment: "#>"
    eval: false
---

# Purpose

This vignette provides a full package-level overview of `openalexVectorComp`:

- design philosophy,
- core function groups,
- end-to-end workflow,
- storage layout,
- practical usage patterns,
- extension points.

Read it for orientation before turning to the technical vignettes.

# Package Philosophy

The package follows five core principles:

1. **Pipeline-first orchestration**
   - Handle large corpora in batches with Arrow/Parquet.

2. **Backend-neutral embedding interface**
   - Use one config/dispatch API for Hugging Face (HF), OpenAI, or Text Embeddings Inference (TEI).

3. **Deterministic resume behavior**
   - Use `id + text_hash` to skip unchanged rows safely.

4. **Pluggable text preparation**
   - Let users inject custom cleaners while preserving contracts.

5. **Transparent scoring workflow**
   - Keep similarity, prototype distance, ridge scoring, and calibration
     separable and inspectable.

```{mermaid}
flowchart LR
  A[Large corpus] --> B[Deterministic preprocessing]
  B --> C[Backend-neutral embedding]
  C --> D[Reproducible storage]
  D --> E[Distance and classification]
  E --> F[Threshold calibration]
  F --> G[Operational decisions]
```

# Function Groups

The package API is organized into these groups.

## 1) Embedding backend abstraction

- `backend_config()`
- `backend_info()`
- `backend_embed_texts()`
- `backend_read()`
- `backend_save()`
- `embed_texts()`

These functions isolate provider-specific details and expose a stable interface.
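
The idea of one stable interface over provider-specific code can be sketched with S3 dispatch. This is a toy illustration only: `toy_backend()`, `toy_embed()`, and the `local`/`remote` providers are hypothetical names, and the sketch does not reproduce how `backend_config()` or `backend_embed_texts()` are implemented.

```r
# Hypothetical config constructor; NOT the package's backend_config().
toy_backend <- function(provider, dim = 4L) {
  structure(
    list(provider = provider, dim = dim),
    class = c(paste0("toy_", provider), "toy_backend")
  )
}

# Generic: callers only ever use this one function.
toy_embed <- function(backend, texts) UseMethod("toy_embed")

# Stand-in "providers": each returns a length(texts) x dim numeric matrix.
toy_embed.toy_local <- function(backend, texts) {
  matrix(nchar(texts), nrow = length(texts), ncol = backend$dim)
}
toy_embed.toy_remote <- function(backend, texts) {
  matrix(rev(seq_along(texts)), nrow = length(texts), ncol = backend$dim)
}

texts <- c("first abstract", "second, longer abstract")
emb_local  <- toy_embed(toy_backend("local"), texts)
emb_remote <- toy_embed(toy_backend("remote"), texts)
dim(emb_local)
```

Swapping providers changes only the config object; the calling code stays the same.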

## 2) Corpus embedding orchestration

- `embed_corpus()`

This is the main batch pipeline driver for production embedding runs.
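
Batch processing can be sketched in base R as follows. `make_batches()` is a hypothetical helper, and the sketch assumes batches are consecutive row ranges of at most `batch_size` rows; the package's actual batching logic is not shown in this vignette.

```r
# Split row indices into consecutive batches of at most `batch_size` rows.
make_batches <- function(n_rows, batch_size) {
  idx <- seq_len(n_rows)
  split(idx, ceiling(idx / batch_size))
}

batches <- make_batches(n_rows = 12, batch_size = 5)
lengths(batches)  # three batches with 5, 5 and 2 rows
```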

## 3) Text preparation

- `clean_abstract_for_embedding()`

This is the default implementation of title/abstract cleaning and canonical text creation.

## 4) Similarity and distance

- `similarity_cosine()`
- `distance_cosine()`
- `distance_reference_cosine()`
- `score_reference_cosine()`

These functions measure relevance geometrically, as cosine similarity or distance between embedding vectors.
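
As a reminder of the underlying arithmetic, here is a minimal base-R sketch of cosine similarity between two sets of row vectors. `cosine_sim()` is a hypothetical helper, and the sketch assumes the common convention that cosine distance is `1 - similarity`; the package functions may differ in details.

```r
# Cosine similarity between every row of `a` and every row of `b`.
cosine_sim <- function(a, b) {
  a_norm <- a / sqrt(rowSums(a^2))  # scale each row to unit length
  b_norm <- b / sqrt(rowSums(b^2))
  a_norm %*% t(b_norm)
}

corpus    <- rbind(doc1 = c(1, 0), doc2 = c(1, 1), doc3 = c(0, 2))
reference <- rbind(ref1 = c(2, 0))

sim      <- cosine_sim(corpus, reference)
cos_dist <- 1 - sim  # assumed convention: distance = 1 - similarity
```

Here `doc1` points in the same direction as `ref1` (similarity 1), while `doc3` is orthogonal to it (similarity 0).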

Note: `distances()` exists as a non-exported helper for joining distance datasets.

## 5) Supervised scoring and calibration

- `fit_ridge()`
- `distance_ridge()`
- `score_ridge()`
- `calibrate_threshold()`

These functions produce calibrated decision-ready scores.

## 6) Embedding-space visualization

- `plot_embeddings_pca()`
- `plot_embeddings_umap()`

These functions support diagnostics and qualitative checks.
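
A qualitative check of this kind can be sketched with base R's `prcomp()`. The toy embeddings and group labels below are made up for illustration, and the sketch does not reflect the arguments or output of `plot_embeddings_pca()`.

```r
set.seed(1)
# Toy embeddings: 20 rows in 8 dimensions, two shifted groups.
emb <- rbind(
  matrix(rnorm(10 * 8, mean = 0), nrow = 10),
  matrix(rnorm(10 * 8, mean = 3), nrow = 10)
)
label <- rep(c("corpus", "reference"), each = 10)

# Project onto the first two principal components.
pca <- prcomp(emb, center = TRUE, scale. = FALSE)
coords <- pca$x[, 1:2]

plot(coords,
     col = ifelse(label == "reference", "red", "grey40"),
     pch = 19, xlab = "PC1", ylab = "PC2")
```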

```{mermaid}
flowchart TB
  A[Backend Abstraction] --> B[embed_corpus]
  C[Text Preparation] --> B
  B --> D[Stored embeddings]
  D --> E[Similarity and distance]
  D --> F[Ridge scoring]
  E --> G[Threshold calibration]
  F --> G
  D --> H[PCA/UMAP diagnostics]
```

# Canonical Workflow

Typical workflow for one project:

1. Configure backend (`backend_config()`).
2. Embed corpus (`embed_corpus()`).
3. Compute distance signal (`distance_reference_cosine()` and/or `distance_ridge()`), then score (e.g. `score_reference_cosine()`, `score_ridge()`).
4. Calibrate operating threshold (`calibrate_threshold()`).
5. Validate with plots (`plot_embeddings_pca()` / `plot_embeddings_umap()`).

```{mermaid}
sequenceDiagram
  participant U as User
  participant BC as backend_config
  participant EC as embed_corpus
  participant ES as embedding_store
  participant DP as distance_reference_cosine
  participant SCP as score_reference_cosine
  participant DR as distance_ridge
  participant SR as score_ridge
  participant CT as calibrate_threshold

  U->>BC: backend_config(...)
  U->>EC: embed_corpus(project_dir, backend)
  EC->>ES: write embeddings parquet
  U->>DP: distance_reference_cosine(...)
  U->>SCP: score_reference_cosine(...)
  U->>DR: distance_ridge(...)
  U->>SR: score_ridge(...)
  U->>CT: calibrate_threshold(...)
```

# Data and Storage Model

## Corpus input

`embed_corpus()` expects:

- `project_dir/corpus` as Arrow dataset
- columns: `id`, `title`, `abstract`

## Embeddings output

Embedding batches are written to:

- `project_dir/embeddings/model_id=<model>/label=<label>/batch=*/embeddings-*.parquet`

Model metadata is written to:

- `project_dir/embeddings/model_id=<model>/embed_model.yaml`
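
The `key=value` directory names follow the Hive-style partitioning convention. The sketch below builds one such path and recovers the partition fields from it. All values (`my-model`, `corpus`, `1`) are placeholders, `parse_partitions()` is a hypothetical helper, and how the package encodes `<model>` in the directory name is not specified here.

```r
# Build a path following the layout above (all values are placeholders).
batch_dir <- file.path(
  "my_project", "embeddings",
  "model_id=my-model", "label=corpus", "batch=1"
)

# Recover key=value partition fields from the directory names.
parse_partitions <- function(path) {
  parts <- strsplit(path, "/", fixed = TRUE)[[1]]
  kv <- parts[grepl("=", parts, fixed = TRUE)]
  keys <- sub("=.*$", "", kv)
  values <- sub("^[^=]*=", "", kv)
  setNames(values, keys)
}

parse_partitions(batch_dir)
```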

## Dry-run output

When `dry_run = TRUE`, no embeddings are written. Instead, the run writes:

- `project_dir/<corpus_name>_dryrun.parquet`

This file supports auditing preprocessing behavior (including custom cleaners).

```{mermaid}
flowchart TB
  A[project_dir/corpus] --> B[embed_corpus]
  B --> C[model_id=.../embed_model.yaml]
  B --> D[model_id=.../label=.../batch=*/embeddings-*.parquet]
  B --> E["#lt;corpus_name#gt;_dryrun.parquet"]
```

# Text-Preparation Contract

`embed_corpus()` accepts:

- `text_preprocessor` (function)
- `cleaner_args` (named list)

The preprocessor must return a data frame with:

- `id`
- `text`
- `text_hash`

Additional columns (e.g., quality flags) are preserved and can be persisted.
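
A custom preprocessor that satisfies this contract might look like the sketch below. It is written under assumptions: `my_preprocessor()` and `hash_text()` are hypothetical, the input is assumed to be a data frame with `id`, `title`, and `abstract`, and MD5 is used only as an example hash. `min_chars` stands in for the kind of option that could be supplied through `cleaner_args`; how the package forwards those arguments is not shown here.

```r
# Deterministic MD5 fingerprint per string, using base R only
# (each string is written to a temp file and hashed with tools::md5sum()).
hash_text <- function(x) {
  vapply(x, function(s) {
    f <- tempfile()
    on.exit(unlink(f))
    writeLines(s, f, useBytes = TRUE)
    unname(tools::md5sum(f))
  }, character(1), USE.NAMES = FALSE)
}

# Hypothetical cleaner: assumes input columns id, title, abstract.
my_preprocessor <- function(corpus, min_chars = 20) {
  abstract <- ifelse(is.na(corpus$abstract), "", corpus$abstract)
  # Join title and abstract, then collapse runs of whitespace.
  text <- trimws(gsub("\\s+", " ", paste(corpus$title, abstract)))
  data.frame(
    id = corpus$id,
    text = text,
    text_hash = hash_text(text),
    too_short = nchar(text) < min_chars,  # optional quality flag
    stringsAsFactors = FALSE
  )
}

corpus <- data.frame(
  id = c("W1", "W2"),
  title = c("Vector search", "Short"),
  abstract = c("We compare  embedding\nbackends.", NA),
  stringsAsFactors = FALSE
)
prepared <- my_preprocessor(corpus)
names(prepared)
```

Because the hash depends only on the cleaned text, the same input always yields the same `text_hash`.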

## Why this contract matters

- ensures skip/resume determinism,
- enables provider-independent cleaning strategies,
- allows quality provenance without altering backend adapters.
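
The skip/resume behavior can be pictured as an anti-join on the composite key `id + text_hash`. The following base-R sketch uses made-up data and shortened hashes; it illustrates the idea rather than the package's internal implementation.

```r
# Rows already embedded in a previous run (hashes shortened for readability).
done <- data.frame(
  id = c("W1", "W2"),
  text_hash = c("aaa", "bbb"),
  stringsAsFactors = FALSE
)

# Freshly prepared rows: W1 unchanged, W2 has new text, W3 is new.
prepared <- data.frame(
  id = c("W1", "W2", "W3"),
  text_hash = c("aaa", "ccc", "ddd"),
  stringsAsFactors = FALSE
)

# Anti-join on the composite key id + text_hash.
key <- function(df) paste(df$id, df$text_hash, sep = "|")
todo <- prepared[!key(prepared) %in% key(done), , drop = FALSE]
todo$id  # "W2" "W3"
```

Only rows whose text changed or that were never seen need to be embedded again.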

# Decision Model: Geometric + Supervised

The package supports two complementary relevance styles:

1. **Prototype distance (`distance_reference_cosine` + `score_reference_cosine`)**
   - pairwise cosine distances between reference and corpus label partitions.

2. **Reference-area distance + score (`distance_ridge` + `score_ridge`)**
   - Mahalanobis-style distance to the reference label area (`area_distance`)
   - optional conversion to `relevance_score`.

Use one or both, then calibrate a threshold for the target operating point.
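
One plausible reading of a "Mahalanobis-style" distance with a ridge term is sketched below using `stats::mahalanobis()`. The penalty `lambda`, the regularized covariance, and the `1 / (1 + d)` score conversion are all assumptions made for illustration; the computation inside `distance_ridge()` and `score_ridge()` may differ.

```r
set.seed(42)
# Toy reference embeddings: 30 rows, 5 dimensions.
reference <- matrix(rnorm(30 * 5), nrow = 30)
# Two corpus rows: one near the reference centre, one far away.
corpus <- rbind(near = rep(0, 5), far = rep(4, 5))

lambda <- 0.1  # hypothetical ridge penalty
center <- colMeans(reference)
# Adding lambda * I keeps the covariance matrix well conditioned.
sigma <- cov(reference) + lambda * diag(ncol(reference))

# stats::mahalanobis() returns the squared distance, hence sqrt().
area_distance <- sqrt(mahalanobis(corpus, center, sigma))

# Hypothetical monotone conversion: smaller distance -> higher score.
relevance_score <- 1 / (1 + area_distance)
```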

```{mermaid}
flowchart LR
  A[Embeddings] --> B[Prototype distance]
  A --> C[Reference-area score]
  B --> D[Ranked candidates]
  C --> D
  D --> E[calibrate_threshold]
  E --> F[Precision/recall operating point]
```
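
The calibration step can be illustrated with a small grid search over candidate thresholds. This sketch assumes the goal is to maximize F1 on labeled examples; the criterion used by `calibrate_threshold()` is not stated in this vignette, and the scores and labels below are made up.

```r
# Toy scores with known labels (TRUE = relevant).
scores <- c(0.95, 0.80, 0.70, 0.55, 0.40, 0.20)
truth  <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

# Precision, recall and F1 when everything >= threshold is accepted.
evaluate <- function(threshold) {
  pred <- scores >= threshold
  tp <- sum(pred & truth)
  precision <- if (any(pred)) tp / sum(pred) else NA_real_
  recall <- tp / sum(truth)
  f1 <- if (tp > 0) 2 * precision * recall / (precision + recall) else 0
  c(threshold = threshold, precision = precision, recall = recall, f1 = f1)
}

# Evaluate each observed score as a candidate threshold.
grid <- do.call(rbind, lapply(sort(unique(scores)), evaluate))
best <- grid[which.max(grid[, "f1"]), ]
best
```

With these toy values the best F1 is reached at a threshold of 0.55, where recall is 1 and precision is 0.75.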

# Minimal End-to-End Example

```r
library(openalexVectorComp)

# Configure the embedding backend once and reuse it for every label.
backend <- backend_config(
  provider = "hf",
  model = "BAAI/bge-small-en-v1.5",
  max_batch_size = 64
)

# Embed the corpus partition; the return value is reused below to
# locate the embeddings directory.
model_dir <- embed_corpus(
  project_dir = "my_project",
  backend = backend,
  batch_size = 5000,
  delete_existing = FALSE,
  label = "corpus",
  save_text = TRUE
)

# Embed the reference partition with the same backend.
embed_corpus(
  project_dir = "my_project",
  backend = backend,
  batch_size = 5000,
  delete_existing = FALSE,
  label = "reference",
  save_text = TRUE
)

# Geometric path: cosine distances between corpus and reference, then scores.
pairwise_dir <- distance_reference_cosine(
  project_dir = "my_project",
  embeddings_dir = basename(model_dir),
  corpus_label = "corpus",
  reference_label = "reference"
)
pairwise_score_dir <- score_reference_cosine(
  distance_parquet = pairwise_dir,
  method = "linear"
)

# Optional supervised scoring path
scores_dir <- distance_ridge(
  project_dir = "my_project",
  reference_label = "reference",
  corpus_label = "corpus"
)
scores_dir <- score_ridge(scores_dir)

# Calibrate the decision threshold against labeled examples.
best <- calibrate_threshold(
  scores_parquet = scores_dir,
  score_col = "relevance_score",
  labels_parquet = "labels.parquet"
)
```

# Operational Guidance

- Start with HF defaults for initial setup.
- Keep `save_text = TRUE` in alpha/review phases for auditability.
- Use `dry_run = TRUE` to validate custom cleaners before API spend.
- Prefer `mode = "balanced"` in `clean_abstract_for_embedding()` unless you
  have measured reasons to go stricter.
- Re-calibrate thresholds whenever model, cleaner policy, or label set changes.

# Related Vignettes

- `backend-architecture`: provider/dispatch implementation details.
- `abstract-cleaning`: cleaning rules and examples in depth.
- `tei-server-operations`: local TEI operational handling.
- `simplestart`: quick start usage path.

# Summary

`openalexVectorComp` is a composable embedding-and-scoring pipeline package:

- backend-neutral for embeddings,
- deterministic for resume and reproducibility,
- pluggable for text cleaning,
- explicit for distance and calibration decisions.

Use this overview as the map, then dive into the specialized vignettes for
implementation-level details.