pro_request_parquet() refactored into small internal helpers
(.prr_prepare_output, .prr_discover_jsons, .prr_infer_schema,
.prr_apply_baseline, .prr_fix_json_types, .prr_output_paths,
.prr_convert_one) to drop cyclomatic complexity below the
goodpractice threshold. Behaviour is unchanged.
Added tests for find_oas_binary(), run_oas(),
pro_validate_credentials(), prepare_snapshot(), and
sample_parquet_n(). Package test coverage rose from ~73% to ~80%.
Tests that intentionally exercise the deprecated
pro_request_jsonl_R() / pro_request_jsonl_parquet() pipeline now
wrap those calls in suppressWarnings() to keep test output clean.
Added a .lintr configuration setting line_length_linter(100)
(modern tidyverse default) and reflowed remaining long lines in
package code.
pro_request_parquet() gains a schema parameter (default "auto") that
uses pre-built field-type schemas — inferred from the complete OpenAlex
snapshot — to resolve ambiguous DuckDB type inference on small API pages.
This eliminates VARCHAR[] → JSON type-conflict errors when unioning parquet
files from separate API calls (e.g. keypaper + cited + citing in a snowball
search).
New oa_cache_schema() function copies schemas from a snapshot metadata
directory (e.g. /Volumes/openalex/openalex-snapshot_metadata) into the
user-level cache so the correct types are available even when the volume is
not mounted.
Factory-default schemas for all 21 OpenAlex entity types are now bundled in
inst/extdata/schemata/ and used automatically when schema = "auto", so
the feature works out-of-the-box without any manual cache population.
oa_works_abstract_sql() now casts abstract_inverted_index through JSON
(::JSON::MAP(VARCHAR, BIGINT[])) before calling map_entries(). This
fixes abstract reconstruction when DuckDB infers the column as STRUCT
(which happens when a sampled API page contains no duplicate-cased keys)
rather than MAP. The expression now handles STRUCT, MAP, and VARCHAR
inputs uniformly (#XXX).
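To illustrate what "abstract reconstruction" means here, the following is a minimal base-R sketch of the same idea outside DuckDB. The helper `rebuild_abstract()` is hypothetical and not part of openalexPro. It assumes the inverted index has already been parsed into a named list mapping each word to its zero-based positions.

```r
# Hypothetical helper, not part of openalexPro: rebuild an abstract from an
# inverted index given as a named list (word -> integer positions).
rebuild_abstract <- function(inverted_index) {
  # Repeat each word once per position it occupies.
  words <- rep(names(inverted_index), lengths(inverted_index))
  positions <- unlist(inverted_index, use.names = FALSE)
  # Sort the words by position and join them with spaces.
  paste(words[order(positions)], collapse = " ")
}

# "As" and "as" are distinct keys, so a key-preserving container is needed.
idx <- list(As = 0L, good = c(1L, 3L), as = 2L)
abstract_text <- rebuild_abstract(idx)
```

The example deliberately keeps both "As" and "as" as keys, since case-sensitive keys are the reason the column is cast to MAP rather than read as a STRUCT.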
pro_request_parquet() now overrides the inferred DuckDB type of
abstract_inverted_index to MAP(VARCHAR, BIGINT[]) when building the
paginated read_json schema, so the fix applies even before the abstract SQL
runs.
snapshot_to_parquet(), build_corpus_index(), and lookup_by_id() have
moved to the openalexSnapshot package. Calling them in openalexPro now
raises an informative error. Their _R variants have been removed entirely.
pro_request_parquet_R() and pro_fetch_R() removed. pro_request_parquet()
and pro_fetch() are now the single pure-R/DuckDB implementations.
Rust/Cargo build dependency removed; openalexPro is now a pure-R package.
No Rust toolchain is required to install or use it.
oa_works_abstract_sql(), oa_works_citation_sql(), and
oa_normalize_duckdb_type() are now implemented in R (behaviour unchanged).
Rust backend via extendr. Core functions now delegate to a compiled
Rust library (openalex-core v0.5.0) for JSON→Parquet conversion, schema
inference, corpus indexing, and ID-based record lookup. Pure-R _R variants
remain as fallbacks. This eliminates the external openalex-snapshot binary
dependency for the main pipeline.
pro_rate_limit_status() — query your OpenAlex API rate-limit status
(daily budget, used, remaining, prepaid balance, reset time, per-endpoint
costs). Returns a list invisibly; prints a formatted summary when
verbose = TRUE.
New debug option openalexPro.ratelimit_check: when set to TRUE via
options(openalexPro.ratelimit_check = TRUE), every API call prints the
current rate-limit status (budget, usage, remaining, reset time) as a message
before the request is sent. Internally handled in api_call() using
pro_rate_limit_status(verbose = TRUE). A recursion guard temporarily
disables the option during the nested rate-limit request.
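The recursion guard can be pictured with a small base-R sketch. All names below (`fake_api_call()`, `fake_rate_limit_status()`, and the option `demo.ratelimit_check`) are hypothetical stand-ins, and the package's actual api_call() may be structured differently.

```r
# Hypothetical stand-ins for the package internals described above.
status_calls <- 0L

fake_rate_limit_status <- function() {
  # The status request itself goes through the same API wrapper.
  fake_api_call("rate-limit")
}

fake_api_call <- function(endpoint) {
  if (isTRUE(getOption("demo.ratelimit_check"))) {
    # Guard: switch the option off while the nested request runs,
    # and restore the previous value when this call exits.
    old <- options(demo.ratelimit_check = FALSE)
    on.exit(options(old), add = TRUE)
    status_calls <<- status_calls + 1L
    fake_rate_limit_status()
  }
  paste("requested", endpoint)
}

options(demo.ratelimit_check = TRUE)
result <- fake_api_call("works")
```

Without the guard, the nested status request would see the option still set and trigger another status request, recursing indefinitely.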
pro_request() list method now respects the overwrite parameter. Previously,
when query_url was a list, the top-level output directory was neither
checked nor deleted regardless of overwrite. It now errors if the directory
exists and overwrite = FALSE, and deletes it upfront if overwrite = TRUE.
pro_fetch() now deletes all three subdirectories (json, jsonl, parquet)
upfront before the pipeline starts when overwrite = TRUE, rather than
delegating deletion to each sub-function individually. If any of the
subdirectories exist and overwrite = FALSE, the function now errors
immediately with a clear message listing which directories already exist.
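A minimal sketch of such an upfront check, assuming a hypothetical helper `prepare_dirs()` and a project folder containing the three subdirectories named above. It is not the package's implementation.

```r
# Hypothetical helper illustrating the upfront overwrite check.
prepare_dirs <- function(project_folder, overwrite = FALSE,
                         subdirs = c("json", "jsonl", "parquet")) {
  paths <- file.path(project_folder, subdirs)
  existing <- paths[dir.exists(paths)]
  if (length(existing) > 0) {
    if (!overwrite) {
      # Fail early and name every directory that is in the way.
      stop(
        "Output directories already exist: ",
        paste(existing, collapse = ", "),
        call. = FALSE
      )
    }
    # overwrite = TRUE: remove everything before the pipeline starts.
    unlink(existing, recursive = TRUE)
  }
  invisible(paths)
}

project <- file.path(tempdir(), "overwrite-demo")
dir.create(file.path(project, "json"), recursive = TRUE, showWarnings = FALSE)

failed <- inherits(try(prepare_dirs(project), silent = TRUE), "try-error")
prepare_dirs(project, overwrite = TRUE)
json_gone <- !dir.exists(file.path(project, "json"))
```

Checking all directories before any work starts means a run either begins from a clean state or does not begin at all.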
pro_request(), pro_request_jsonl(), and pro_request_jsonl_parquet()
now accept nested lists of query URLs. Each nesting level is preserved
as a subdirectory in the output, and the parquet stage converts directory
depth into hive-style partition keys: depth 1 → query=<name>, depth 2 →
query_l2=<name>, depth 3 → query_l3=<name>, etc. The resulting dataset
is readable with arrow::open_dataset() and the partition columns appear
as regular columns. pro_fetch() inherits this behaviour automatically.
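The depth-to-partition-key mapping can be sketched in base R as follows. The helper `partition_dirs()` is hypothetical (it is not the package's internal flattening helper), and the URLs are placeholders.

```r
# Hypothetical helper: walk a nested list of query URLs and return one
# hive-style relative directory per leaf.
partition_dirs <- function(queries, depth = 1L, prefix = character()) {
  out <- character()
  for (name in names(queries)) {
    # Depth 1 uses "query", deeper levels use "query_l2", "query_l3", ...
    key <- if (depth == 1L) "query" else paste0("query_l", depth)
    path <- c(prefix, paste0(key, "=", name))
    item <- queries[[name]]
    if (is.list(item)) {
      out <- c(out, partition_dirs(item, depth + 1L, path))
    } else {
      out <- c(out, paste(path, collapse = "/"))
    }
  }
  out
}

queries <- list(
  climate = list(
    cited = "https://example.org/a",   # placeholder URL
    citing = "https://example.org/b"   # placeholder URL
  ),
  health = "https://example.org/c"     # placeholder URL
)
dirs <- partition_dirs(queries)
```

Here `dirs` contains `query=climate/query_l2=cited`, `query=climate/query_l2=citing`, and `query=health`, which is the directory shape that hive-aware readers turn into partition columns.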
A new internal helper collect_leaf_queries() performs the recursive list
flattening.
snapshot_to_parquet() has a new signature. The old snapshot_dir and
parquet_dir parameters are replaced by a single root_dir parameter that
matches the directory layout used by the companion openalex-snapshot Rust
binary. The function now delegates to the binary rather than performing
conversion in R. Migration: replace
snapshot_to_parquet(snapshot_dir = "...", parquet_dir = "...") with
snapshot_to_parquet(root_dir = "...").
build_corpus_index() has a new signature. The old corpus_dir parameter
is replaced by root_dir. The function now delegates to the
openalex-snapshot binary. Migration: replace
build_corpus_index(corpus_dir = "...") with
build_corpus_index(root_dir = "...").
lookup_by_id() has a new signature. The old index_file and output
parameters are replaced by root_dir and project_dir (consistent with the
project-folder convention used by pro_request() and pro_fetch()). The
function now delegates to the openalex-snapshot binary. Migration:
replace lookup_by_id(index_file = "...", output = "...") with
lookup_by_id(root_dir = "...", project_dir = "...").
Pure-R / DuckDB fallback variants are now exported as separate functions:
snapshot_to_parquet_R(): original R implementation of snapshot
conversion (uses DuckDB + arrow, no external binary required)
build_corpus_index_R(): original R implementation of index building
lookup_by_id_R(): original R implementation of ID-based record lookup
These retain the original parameter names and are useful when the
openalex-snapshot binary is unavailable.
find_oas_binary() and run_oas() are exported internal helpers for
resolving and invoking the openalex-snapshot binary. They support:
the oas_bin argument
options(openalexPro.oas_bin = "/path/to/binary")
Sys.which("openalex-snapshot")
inst/Makefile.snapshot updated to use the openalex-snapshot binary
directly (replacing Rscript invocations of the now-renamed R functions).
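As a rough sketch of how the three lookup sources for the binary could be combined, the hypothetical `resolve_binary()` below tries them in the order listed. That order, and the choice not to check that the file exists, are assumptions; the real find_oas_binary() may behave differently.

```r
# Hypothetical re-implementation of the lookup; the real find_oas_binary()
# may differ in precedence and error handling.
resolve_binary <- function(oas_bin = NULL) {
  candidates <- c(
    oas_bin,                                  # explicit argument
    getOption("openalexPro.oas_bin"),         # user option
    unname(Sys.which("openalex-snapshot"))    # search on PATH
  )
  # Drop unset sources: NULL vanishes in c(), Sys.which() yields "".
  candidates <- candidates[!is.na(candidates) & nzchar(candidates)]
  if (length(candidates) == 0) {
    stop("openalex-snapshot binary not found", call. = FALSE)
  }
  candidates[[1]]
}

old <- options(openalexPro.oas_bin = "/opt/oas/openalex-snapshot")
from_option <- resolve_binary()
from_argument <- resolve_binary("/usr/local/bin/openalex-snapshot")
options(old)
```

The paths in the usage lines are placeholders chosen for the example.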
openalex-snapshot Rust binary is now required for
snapshot_to_parquet(), build_corpus_index(), and lookup_by_id().
Download from https://github.com/rkrug/openalex-snapshot/releases or build
with cargo build --release. The pure-R *_R() variants have no binary
dependency.
Added the id field to opt_select_names(), as it is missing from the list returned
by OpenAlex.
Normalized api_key handling across API-calling functions:
pro_request(), pro_fetch(), pro_count(), and
pro_download_content() now accept api_key = NULL or api_key = "".
In that case, requests are sent without an API key (subject to OpenAlex's
unauthenticated limits).
Added explicit api_key type validation in API-calling functions.
Accepted inputs are now limited to NULL or a length-1 character string.
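The validation rule can be sketched with a hypothetical `check_api_key()` helper. Rejecting NA and mapping the empty string to NULL are assumptions made for this example, not confirmed package behaviour.

```r
# Hypothetical validator mirroring the rule stated above.
check_api_key <- function(api_key) {
  if (is.null(api_key)) {
    return(NULL)
  }
  # Assumption: NA is rejected along with non-character or multi-element input.
  if (!is.character(api_key) || length(api_key) != 1L || is.na(api_key)) {
    stop("`api_key` must be NULL or a single character string", call. = FALSE)
  }
  # Treat the empty string like NULL: no key is sent with the request.
  if (nzchar(api_key)) api_key else NULL
}

no_key <- check_api_key("")
real_key <- check_api_key("abc")
bad_type <- inherits(try(check_api_key(123), silent = TRUE), "try-error")
```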
Updated pro_rate_limit_status() to handle api_key = NULL safely
(informational message + FALSE return), and aligned documentation.
Added opt-in live API contract tests (tests/testthat/test-900-live_api_contracts.R)
gated by OPENALEXPRO_LIVE_TESTS=true and a non-dummy
openalexPro.apikey.
Added inst/scripts/record_cassettes.R and recording safeguards to prevent
accidental re-recording with invalid credentials.
Reduced warning noise in test runs by cleaning up deprecated-search warning handling and removing unused cassette hooks.
Added pro_rate_limit_status() to query the OpenAlex rate-limit endpoint
(GET /rate-limit). Returns the full rate-limit JSON invisibly (daily budget,
used, remaining, prepaid balance, per-endpoint costs, reset time). Prints a
human-readable summary via message() when verbose = TRUE (the default).
Returns FALSE for a missing or invalid API key, and NULL on a network
error, so callers can distinguish auth problems from transient failures.
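A caller might branch on the three possible return values roughly as follows. `describe_status()` is a hypothetical helper, and the `remaining` field in the usage line is an assumed name used only for illustration.

```r
# `status` stands for the value returned by pro_rate_limit_status().
describe_status <- function(status) {
  if (is.null(status)) {
    "network error: retry later"
  } else if (isFALSE(status)) {
    "missing or invalid API key"
  } else {
    "rate-limit data available"
  }
}

# Hypothetical inputs covering the three cases.
outcomes <- c(
  describe_status(NULL),
  describe_status(FALSE),
  describe_status(list(remaining = 10))
)
```

Testing for NULL first matters, because `isFALSE(NULL)` is FALSE and the NULL case would otherwise fall through to the data branch.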
pro_validate_credentials() refactored to use pro_rate_limit_status()
internally instead of making a separate pro_count() request. Behaviour and
return value are unchanged.
Added pro_download_content() to download full-text PDFs
(format = "pdf") or TEI XML (format = "grobid-xml") from the OpenAlex
content endpoint (content.openalex.org). Accepts a vector of work IDs,
supports parallel downloads via workers, and returns a data frame with
per-file status ("ok" / "not_found" / "error"). Note: content
downloads cost $0.01 per file.
Added search.exact and search.semantic parameters to pro_query(),
matching the new OpenAlex search API:
search.exact: searches without stemming or stop-word removal; supports
boolean operators, quoted phrases, proximity (~N), and wildcards.
search.semantic: AI embedding-based search that matches by conceptual
meaning rather than keywords (max 50 results, max 1 req/sec).
search: now documented to support the full boolean/phrase/wildcard
syntax in addition to its existing stemmed matching.
Exported infer_json_schema() for direct use. Infers a unified DuckDB columns
clause from a set of JSON/NDJSON files via per-file DESCRIBE queries with
type-widening and optional two-level disk caching (schema_cache_dir).
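The idea of widening per-file types into one schema can be sketched in base R. The rule used here (keep a type when all files agree, otherwise fall back to VARCHAR) is a simplifying assumption for illustration. The widening rules used by infer_json_schema() are not described here and may be richer. `unify_schemas()` and the example types are hypothetical.

```r
# Hypothetical widening rule: keep a type when all files agree, otherwise
# fall back to VARCHAR.
unify_schemas <- function(schemas) {
  columns <- unique(unlist(lapply(schemas, names)))
  vapply(columns, function(col) {
    # Collect the type each file reports for this column (NA if absent).
    types <- unique(unlist(lapply(schemas, function(s) s[col])))
    types <- types[!is.na(types)]
    if (length(types) == 1L) types else "VARCHAR"
  }, character(1))
}

# Per-file schemas as named character vectors (column -> DuckDB type).
file_a <- c(id = "VARCHAR", apc_paid = "JSON", cited_by_count = "BIGINT")
file_b <- c(id = "VARCHAR", apc_paid = "STRUCT(value BIGINT)")
unified <- unify_schemas(list(file_a, file_b))
```

Columns missing from some files keep the type seen elsewhere, while conflicting columns are widened so that every file can be read with the same columns clause.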
pro_rate_limit_status() and pro_download_content() now route their HTTP
requests through the internal api_call() helper, unifying retry logic and
error handling across all real API call sites. suppressMessages() is used
to suppress api_call()'s internal logging so each function emits its own
user-facing messages. pro_download_content() now also sends a User-Agent
header (previously omitted).
Filters with the .search suffix (e.g. title_and_abstract.search = "...")
are deprecated by the OpenAlex API. They still work but now emit a warning.
Use the search parameter of pro_query() instead:
pro_query(entity = "works", search = "your terms").
See https://developers.openalex.org/guides/searching for details.
Fixed Windows path-normalization failures in snapshot_to_parquet(),
build_corpus_index(), lookup_by_id(), and pro_request_jsonl_parquet().
On Windows, normalizePath() can return 8.3 short names (e.g. RUNNER~1)
for tempdir()-derived paths while list.files() and DuckDB resolve to long
names (runneradmin). Resume detection in snapshot_to_parquet() used
%in% on paths with mixed separators (\ vs /), causing already-converted
files to be reconverted. build_corpus_index() embedded snapshot_dir (with
\) inside a DuckDB regexp_replace pattern, which never matched — so the
full absolute path was stored in the index and later doubled by
lookup_by_id(). pro_request_jsonl_parquet() used normalizePath string
comparison to detect subdirectories, which always failed, placing every output
file in a spurious query=<dirname> subdirectory.
Fixes: (1) normalize separators to / with gsub("\\\\", "/", ...) on both
sides of %in% comparisons; (2) compute relative paths in R using path-depth
counting (strsplit(path, "/") then indexed extraction) rather than
string-matching absolute paths — immune to 8.3 vs long-name differences;
(3) pass the relative path as a SQL literal in build_corpus_index() instead
of computing it inside DuckDB with a regex.
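Fixes (1) and (2) can be illustrated with a short base-R sketch. The helpers `to_forward()` and `relative_tail()` are hypothetical, and the paths are made-up examples. The package's own code uses indexed extraction after strsplit(), of which `tail()` is a simple variant.

```r
# Normalise separators so that %in% compares like with like.
to_forward <- function(path) gsub("\\\\", "/", path)

# Hypothetical helper: keep the last `depth` components of a path, which
# avoids matching on the (possibly 8.3-shortened) absolute prefix.
relative_tail <- function(path, depth) {
  parts <- strsplit(to_forward(path), "/", fixed = TRUE)[[1]]
  paste(utils::tail(parts, depth), collapse = "/")
}

# Same file, written once with backslashes and once with forward slashes.
done <- "C:\\Users\\RUNNER~1\\snap\\works\\part_001.parquet"
todo <- "C:/Users/RUNNER~1/snap/works/part_001.parquet"
already_converted <- to_forward(todo) %in% to_forward(done)

# Same file again, but with the long directory name instead of the 8.3 name.
long_name <- "C:/Users/runneradmin/snap/works/part_001.parquet"
same_file <- relative_tail(done, 2) == relative_tail(long_name, 2)
```

Comparing only the trailing components sidesteps the short-name versus long-name mismatch, because the differing segment sits in the absolute prefix that is discarded.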
Renamed per-file schema cache files from %06d_<basename>.schema.csv to
<update_date>_<part_name>.csv (e.g. 2024-01-15_part_001.csv), making
each cache file directly traceable to its source .gz.
Removed the mailto parameter from all API functions (pro_request(), pro_fetch(),
pro_count(), pro_validate_credentials()). OpenAlex no longer uses email addresses
for polite-pool access.
api_key handling was tightened in 0.6.0 for
pro_request(), pro_fetch(), and pro_count().
api_key = NULL / "" and runs in unauthenticated mode.
openalexPro v[VERSION] (mailto:[EMAIL]) to
openalexPro/[VERSION].
prepare_snapshot() function for setting up a directory with Makefile and documentation
for managing OpenAlex snapshots.
Makefile.snapshot in inst/ for automating snapshot download, conversion, and indexing.
Includes targets for snapshot, parquet, parquet_index, and automatic renaming of existing
data with release dates.
snapshot_to_parquet() function for converting OpenAlex snapshot NDJSON files to Parquet format
using DuckDB. Processes each .gz file individually with per-file resume support. Supports parallel
processing via workers (using future_lapply()) and unified schema inference via sample_size.
build_corpus_index() function for creating memory-efficient Parquet indexes for fast ID lookups.
Handles 300M+ records by processing parquet files individually, with optional parallelization via
workers and progress reporting via progressr. The index file is auto-named and placed alongside
the corpus directory.
lookup_by_id() function for fast record retrieval from a parquet corpus using pre-built indexes.
Uses Arrow for index filtering with automatic ID normalization. Supports parallel reads
via workers and streaming to parquet via output for millions of IDs without loading into memory.
snapshot_filter_ids() function for filtering snapshot data by ID lists.
id_block() helper function for computing ID block partitions.
snapshot.qmd vignette with a comprehensive guide on downloading, converting, and querying
OpenAlex snapshots locally.
snapshot_to_parquet() now processes each .gz file individually instead of all at once.
This reduces memory usage, enables per-file resume on interruption, and shows progress with ETA.
The workers parameter now controls parallel future workers instead of DuckDB threads.
Added sample_size parameter for schema inference.
infer_json_schema() and convert_json_to_parquet() internal helpers, shared by
both snapshot_to_parquet() and pro_request_jsonl_parquet().
pro_request_jsonl_parquet() now uses per-file conversion with future_lapply()
parallelization. Removes hive partitioning by page; subfolder structure is preserved
directly. Added workers parameter. Removed progress parameter (replaced by progressr).
Fixed pro_query.qmd (malformed code block closings).
Fixed snapshot_to_parquet() when sample_size exceeded the
number of available files (e.g. sample_size = 10000 with 1981 works files). Schema
inference now processes one file at a time instead of a single bulk DuckDB query.
Fixed duplicate key "as" crash when converting the works dataset.
abstract_inverted_index is now stored as VARCHAR (raw JSON string) rather than a
STRUCT. DuckDB folds struct field names to lowercase, causing a collision between the
valid JSON keys "as" and "As" in this field. Storing as VARCHAR avoids struct
parsing entirely and preserves the data. Parse individual values with
jsonlite::fromJSON() when needed.
snapshot_to_parquet() by exposing a
TEMP_DIR variable in Makefile.snapshot (default /tmp).
snapshot_to_parquet() schema inference now runs one DuckDB DESCRIBE per file
instead of a single query across all sampled files. Results are cached in
<parquet_ds>/.schema_cache/: per-file CSVs (<update_date>_<part_name>.csv) enable
mid-run resume; a unified unified_schema.csv is loaded on subsequent runs to skip
inference entirely. Delete unified_schema.csv to force re-inference.
snapshot_to_parquet(), build_corpus_index(), and lookup_by_id().
abstract_inverted_index VARCHAR round-trip.
Removed load_sql_file() function as it is no longer needed.
pro_query and removed multiple_ids argument using Claude and expanded tests and added vignette.
00_completed in output directory of json, jsonl and parquet folders upon successful completion
openalexPro.email and openalexPro.apikey
pro_request_jsonl_parquet() to prevent schema conflicts when reading
combined Parquet datasets. New sample_size parameter controls schema inference sampling. This fixes
"Unsupported cast from string to struct" errors when fields have different types across JSONL files
(e.g., apc_paid being null in some files and a struct in others).
Removed harmonize_parquet_schemata() as it is no longer needed with the new unified schema inference.
request_json() from 1000 to 10000 to allow the initially planned 2,000,000
work download.
CI and coverage tweaks for CRAN readiness.
Split snowball functionality into openalexSnowball.
pro_fetch() with project_folder support for structured outputs.
pro_request_jsonl().
sample_parquet_n() random sampling utilities with select support.
count_only output to return a data frame with an error column.
count_only support for pro_request() and related helpers.
pro_query() as the package-native query builder with chunking.
extract_doi() helpers and compatibility reporting artifacts.