Changes in version 0.10.4

Internal / Code Quality

- pro_request_parquet() refactored into small internal helpers (.prr_prepare_output, .prr_discover_jsons, .prr_infer_schema, .prr_apply_baseline, .prr_fix_json_types, .prr_output_paths, .prr_convert_one) to drop cyclomatic complexity below the goodpractice threshold. Behaviour is unchanged.
- Added tests for find_oas_binary(), run_oas(), pro_validate_credentials(), prepare_snapshot(), and sample_parquet_n(). Package test coverage rose from ~73% to ~80%.
- Tests that intentionally exercise the deprecated pro_request_jsonl_R() / pro_request_jsonl_parquet() pipeline now wrap those calls in suppressWarnings() to keep test output clean.
- Added a .lintr configuration setting line_length_linter(100) (modern tidyverse default) and reflowed remaining long lines in package code.

New Features

- pro_request_parquet() gains a schema parameter (default "auto") that uses pre-built field-type schemas, inferred from the complete OpenAlex snapshot, to resolve ambiguous DuckDB type inference on small API pages. This eliminates VARCHAR[] → JSON type-conflict errors when unioning parquet files from separate API calls (e.g. keypaper + cited + citing in a snowball search).
- New oa_cache_schema() function copies schemas from a snapshot metadata directory (e.g. /Volumes/openalex/openalex-snapshot_metadata) into the user-level cache so the correct types are available even when the volume is not mounted.
- Factory-default schemas for all 21 OpenAlex entity types are now bundled in inst/extdata/schemata/ and used automatically when schema = "auto", so the feature works out of the box without any manual cache population.

Bug Fixes

- oa_works_abstract_sql() now casts abstract_inverted_index through JSON (::JSON::MAP(VARCHAR, BIGINT[])) before calling map_entries(). This fixes abstract reconstruction when DuckDB infers the column as STRUCT (which happens when a sampled API page contains no duplicate-cased keys) rather than MAP.
  The expression now handles STRUCT, MAP, and VARCHAR inputs uniformly (#XXX).
- pro_request_parquet() now overrides the inferred DuckDB type of abstract_inverted_index to MAP(VARCHAR, BIGINT[]) when building the paginated read_json schema, so the fix applies even before the abstract SQL runs.

Breaking Changes

- snapshot_to_parquet(), build_corpus_index(), and lookup_by_id() have moved to the openalexSnapshot package. Calling them in openalexPro now raises an informative error. Their _R variants have been removed entirely.
- pro_request_parquet_R() and pro_fetch_R() removed. pro_request_parquet() and pro_fetch() are now the single pure-R/DuckDB implementations.

Internal Changes

- Rust/Cargo build dependency removed; openalexPro is now a pure-R package. No Rust toolchain is required to install or use it.
- oa_works_abstract_sql(), oa_works_citation_sql(), and oa_normalize_duckdb_type() are now implemented in R (behaviour unchanged).

Changes in version 0.9.0

New Features

- Rust backend via extendr. Core functions now delegate to a compiled Rust library (openalex-core v0.5.0) for JSON→Parquet conversion, schema inference, corpus indexing, and ID-based record lookup. Pure-R _R variants remain as fallbacks. This eliminates the external openalex-snapshot binary dependency for the main pipeline.
- pro_rate_limit_status(): query your OpenAlex API rate-limit status (daily budget, used, remaining, prepaid balance, reset time, per-endpoint costs). Returns a list invisibly; prints a formatted summary when verbose = TRUE.
- New debug option openalexPro.ratelimit_check: when set to TRUE via options(openalexPro.ratelimit_check = TRUE), every API call prints the current rate-limit status (budget, usage, remaining, reset time) as a message before the request is sent. Internally handled in api_call() using pro_rate_limit_status(verbose = TRUE). A recursion guard temporarily disables the option during the nested rate-limit request.

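The recursion guard mentioned for openalexPro.ratelimit_check can be pictured with the minimal base-R sketch below. It is an illustration only, not the package's code: api_call_sketch(), rate_limit_status_sketch(), and the option name sketch.ratelimit_check are hypothetical stand-ins. The idea is to switch the option off for the duration of the nested status request and restore it with on.exit().

```r
# Counter so we can see how often the nested status request runs.
status_calls <- 0L

# Hypothetical stand-in for the rate-limit query; it goes through the same
# entry point (api_call_sketch) as every other request.
rate_limit_status_sketch <- function() {
  status_calls <<- status_calls + 1L
  api_call_sketch("rate-limit")
}

# Hypothetical stand-in for the central API helper.
api_call_sketch <- function(endpoint) {
  if (isTRUE(getOption("sketch.ratelimit_check", FALSE))) {
    # Guard: disable the option for the nested request and restore the
    # previous value when this call exits, even on error.
    old <- options(sketch.ratelimit_check = FALSE)
    on.exit(options(old), add = TRUE)
    rate_limit_status_sketch()
  }
  paste("response from", endpoint)
}

options(sketch.ratelimit_check = TRUE)
result <- api_call_sketch("works")
```

Without the guard, the nested request would still see the option set and trigger another status check, recursing indefinitely.
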
Changes in version 0.8.1

Bug Fixes

- pro_request() list method now respects the overwrite parameter. Previously, when query_url was a list, the top-level output directory was neither checked nor deleted regardless of overwrite. It now errors if the directory exists and overwrite = FALSE, and deletes it upfront if overwrite = TRUE.
- pro_fetch() now deletes all three subdirectories (json, jsonl, parquet) upfront before the pipeline starts when overwrite = TRUE, rather than delegating deletion to each sub-function individually. If any of the subdirectories exist and overwrite = FALSE, the function now errors immediately with a clear message listing which directories already exist.

Changes in version 0.8.0

New Features

- pro_request(), pro_request_jsonl(), and pro_request_jsonl_parquet() now accept nested lists of query URLs. Each nesting level is preserved as a subdirectory in the output, and the parquet stage converts directory depth into hive-style partition keys: depth 1 → query=, depth 2 → query_l2=, depth 3 → query_l3=, etc. The resulting dataset is readable with arrow::open_dataset() and the partition columns appear as regular columns. pro_fetch() inherits this behaviour automatically. A new internal helper collect_leaf_queries() performs the recursive list flattening.

Changes in version 0.7.0

Breaking Changes

- snapshot_to_parquet() has a new signature. The old snapshot_dir and parquet_dir parameters are replaced by a single root_dir parameter that matches the directory layout used by the companion openalex-snapshot Rust binary. The function now delegates to the binary rather than performing conversion in R.
  Migration: replace snapshot_to_parquet(snapshot_dir = "...", parquet_dir = "...") with snapshot_to_parquet(root_dir = "...").
- build_corpus_index() has a new signature. The old corpus_dir parameter is replaced by root_dir. The function now delegates to the openalex-snapshot binary.
  Migration: replace build_corpus_index(corpus_dir = "...") with build_corpus_index(root_dir = "...").
- lookup_by_id() has a new signature. The old index_file and output parameters are replaced by root_dir and project_dir (consistent with the project-folder convention used by pro_request() and pro_fetch()). The function now delegates to the openalex-snapshot binary.
  Migration: replace lookup_by_id(index_file = "...", output = "...") with lookup_by_id(root_dir = "...", project_dir = "...").

New Features

- Pure-R / DuckDB fallback variants are now exported as separate functions:
  - snapshot_to_parquet_R(): original R implementation of snapshot conversion (uses DuckDB + arrow, no external binary required)
  - build_corpus_index_R(): original R implementation of index building
  - lookup_by_id_R(): original R implementation of ID-based record lookup
  These retain the original parameter names and are useful when the openalex-snapshot binary is unavailable.
- find_oas_binary() and run_oas() are exported internal helpers for resolving and invoking the openalex-snapshot binary. They support:
  1. Explicit oas_bin argument
  2. options(openalexPro.oas_bin = "/path/to/binary")
  3. PATH search via Sys.which("openalex-snapshot")
- inst/Makefile.snapshot updated to use the openalex-snapshot binary directly (replacing Rscript invocations of the now-renamed R functions).

Dependencies

- The openalex-snapshot Rust binary is now required for snapshot_to_parquet(), build_corpus_index(), and lookup_by_id(). Download from https://github.com/rkrug/openalex-snapshot/releases or build with cargo build --release. The pure-R *_R() variants have no binary dependency.

Changes in version 0.6.1

Bug Fixes

- Manually add the id field to opt_select_names(), as it is missing from the list returned by OpenAlex.

Changes

- Normalized api_key handling across API-calling functions: pro_request(), pro_fetch(), pro_count(), and pro_download_content() now accept api_key = NULL or api_key = "".
  In that case, requests are sent without an API key (subject to OpenAlex's unauthenticated limits).
- Added explicit api_key type validation in API-calling functions. Accepted inputs are now limited to NULL or a length-1 character string.
- Updated pro_rate_limit_status() to handle api_key = NULL safely (informational message + FALSE return), and aligned documentation.

Testing and Tooling

- Added opt-in live API contract tests (tests/testthat/test-900-live_api_contracts.R) gated by OPENALEXPRO_LIVE_TESTS=true and a non-dummy openalexPro.apikey.
- Added inst/scripts/record_cassettes.R and recording safeguards to prevent accidental re-recording with invalid credentials.
- Reduced warning noise in test runs by cleaning up deprecated-search warning handling and removing unused cassette hooks.

Changes in version 0.6.0

New Features

- Added pro_rate_limit_status() to query the OpenAlex rate-limit endpoint (GET /rate-limit). Returns the full rate-limit JSON invisibly (daily budget, used, remaining, prepaid balance, per-endpoint costs, reset time). Prints a human-readable summary via message() when verbose = TRUE (the default). Returns FALSE for a missing or invalid API key, and NULL on a network error, so callers can distinguish auth problems from transient failures.
- pro_validate_credentials() refactored to use pro_rate_limit_status() internally instead of making a separate pro_count() request. Behaviour and return value are unchanged.
- Added pro_download_content() to download full-text PDFs (format = "pdf") or TEI XML (format = "grobid-xml") from the OpenAlex content endpoint (content.openalex.org). Accepts a vector of work IDs, supports parallel downloads via workers, and returns a data frame with per-file status ("ok" / "not_found" / "error"). Note: content downloads cost $0.01 per file.
- Added search.exact and search.semantic parameters to pro_query(), matching the new OpenAlex search API:
  - search.exact: searches without stemming or stop-word removal; supports boolean operators, quoted phrases, proximity (~N), and wildcards.
  - search.semantic: AI embedding-based search that matches by conceptual meaning rather than keywords (max 50 results, max 1 req/sec).
  - search: now documented to support the full boolean/phrase/wildcard syntax in addition to its existing stemmed matching.
- Exported infer_json_schema() for direct use. Infers a unified DuckDB columns clause from a set of JSON/NDJSON files via per-file DESCRIBE queries with type-widening and optional two-level disk caching (schema_cache_dir).

Internal Changes

- pro_rate_limit_status() and pro_download_content() now route their HTTP requests through the internal api_call() helper, unifying retry logic and error handling across all real API call sites. suppressMessages() is used to suppress api_call()'s internal logging so each function emits its own user-facing messages. pro_download_content() now also sends a User-Agent header (previously omitted).

Deprecations

- Filter arguments with a .search suffix (e.g. title_and_abstract.search = "...") are deprecated by the OpenAlex API. They still work but now emit a warning. Use the search parameter of pro_query() instead: pro_query(entity = "works", search = "your terms"). See https://developers.openalex.org/guides/searching for details.

Bug Fixes

- Fixed Windows path-normalization failures in snapshot_to_parquet(), build_corpus_index(), lookup_by_id(), and pro_request_jsonl_parquet(). On Windows, normalizePath() can return 8.3 short names (e.g. RUNNER~1) for tempdir()-derived paths while list.files() and DuckDB resolve to long names (runneradmin). Resume detection in snapshot_to_parquet() used %in% on paths with mixed separators (\ vs /), causing already-converted files to be reconverted.
  build_corpus_index() embedded snapshot_dir (with \) inside a DuckDB regexp_replace pattern, which never matched, so the full absolute path was stored in the index and later doubled by lookup_by_id(). pro_request_jsonl_parquet() used normalizePath string comparison to detect subdirectories, which always failed, placing every output file in a spurious query= subdirectory.
  Fixes: (1) normalize separators to / with gsub("\\\\", "/", ...) on both sides of %in% comparisons; (2) compute relative paths in R using path-depth counting (strsplit(path, "/") then indexed extraction) rather than string-matching absolute paths, which is immune to 8.3 vs long-name differences; (3) pass the relative path as a SQL literal in build_corpus_index() instead of computing it inside DuckDB with a regex.

Changes

- Schema cache per-file CSVs renamed from %06d_.schema.csv to _.csv (e.g. 2024-01-15_part_001.csv), making each cache file directly traceable to its source .gz.

Breaking Changes

- Removed mailto parameter from all API functions (pro_request(), pro_fetch(), pro_count(), pro_validate_credentials()). OpenAlex no longer uses email addresses for polite-pool access.
- api_key handling was tightened in 0.6.0 for pro_request(), pro_fetch(), and pro_count(). Note: this was later relaxed again in development; current development allows api_key = NULL / "" and runs in unauthenticated mode.
- Simplified User-Agent string from openalexPro v[VERSION] (mailto:[EMAIL]) to openalexPro/[VERSION].

Changes in version 0.5.0

New Features

Snapshot Handling

- Added prepare_snapshot() function for setting up a directory with Makefile and documentation for managing OpenAlex snapshots.
- Added Makefile.snapshot in inst/ for automating snapshot download, conversion, and indexing. Includes targets for snapshot, parquet, parquet_index, and automatic renaming of existing data with release dates.
- Added snapshot_to_parquet() function for converting OpenAlex snapshot NDJSON files to Parquet format using DuckDB.
  Processes each .gz file individually with per-file resume support. Supports parallel processing via workers (using future_lapply()) and unified schema inference via sample_size.
- Added build_corpus_index() function for creating memory-efficient Parquet indexes for fast ID lookups. Handles 300M+ records by processing parquet files individually, with optional parallelization via workers and progress reporting via progressr. The index file is auto-named and placed alongside the corpus directory.
- Added lookup_by_id() function for fast record retrieval from a parquet corpus using pre-built indexes. Uses Arrow for index filtering with automatic ID normalization. Supports parallel reads via workers and streaming to parquet via output for millions of IDs without loading into memory.
- Added snapshot_filter_ids() function for filtering snapshot data by ID lists.
- Added id_block() helper function for computing ID block partitions.

Documentation

- Added snapshot.qmd vignette with a comprehensive guide to downloading, converting, and querying OpenAlex snapshots locally.

Changes

- Refactored snapshot_to_parquet() to process each .gz file individually instead of all at once. This reduces memory usage, enables per-file resume on interruption, and shows progress with ETA. The workers parameter now controls parallel future workers instead of DuckDB threads. Added sample_size parameter for schema inference.
- Extracted infer_json_schema() and convert_json_to_parquet() internal helpers, shared by both snapshot_to_parquet() and pro_request_jsonl_parquet().
- Refactored pro_request_jsonl_parquet() to per-file conversion with future_lapply() parallelization. Removes hive partitioning by page; subfolder structure is preserved directly. Added workers parameter. Removed progress parameter (replaced by progressr).

Bug Fixes

- Fixed vignette parse errors in pro_query.qmd (malformed code block closings).
- Fixed out-of-memory crash in snapshot_to_parquet() when sample_size exceeded the number of available files (e.g. sample_size = 10000 with 1981 works files). Schema inference now processes one file at a time instead of a single bulk DuckDB query.
- Fixed duplicate key "as" crash when converting the works dataset. abstract_inverted_index is now stored as VARCHAR (raw JSON string) rather than a STRUCT. DuckDB folds struct field names to lowercase, causing a collision between the valid JSON keys "as" and "As" in this field. Storing as VARCHAR avoids struct parsing entirely and preserves the data. Parse individual values with jsonlite::fromJSON() when needed.
- Fixed DuckDB temp file IO errors during snapshot_to_parquet() by exposing a TEMP_DIR variable in Makefile.snapshot (default /tmp).

Changes

- snapshot_to_parquet() schema inference now runs one DuckDB DESCRIBE per file instead of a single query across all sampled files. Results are cached in /.schema_cache/: per-file CSVs (_.csv) enable mid-run resume; a unified unified_schema.csv is loaded on subsequent runs to skip inference entirely. Delete unified_schema.csv to force re-inference.

Tests

- Added comprehensive tests for snapshot_to_parquet(), build_corpus_index(), and lookup_by_id().
- Added tests for schema caching, unified schema reuse, and works abstract_inverted_index VARCHAR round-trip.

Changes in version 0.4.2

Breaking Changes

- Removed load_sql_file(), which is no longer needed.

Documentation

- Updated existing vignettes and added new ones.
- Updated README.md.

Tests

- Tests no longer require openalexR.

Changes in version 0.4.1

- Standardised progress bar handling.
- Changed default pages from 1,000 to 10,000.
- Refactored pro_query() and removed the multiple_ids argument (using Claude); expanded tests and added a vignette.
- 00_completed is now created in the json, jsonl and parquet output directories upon successful completion.
- Changed API key and email handling.
  Removed oap_mail() and oap_apikey() and simplified handling of API key and email to use only the environment variables openalexPro.email and openalexPro.apikey.
- Added unified schema inference to pro_request_jsonl_parquet() to prevent schema conflicts when reading combined Parquet datasets. New sample_size parameter controls schema inference sampling. This fixes "Unsupported cast from string to struct" errors when fields have different types across JSONL files (e.g., apc_paid being null in some files and a struct in others).
- Removed harmonize_parquet_schemata() as it is no longer needed with the new unified schema inference.
- Increased the default number of pages read by request_json() from 1000 to 10000 to allow the initially planned 2,000,000 work download.

Changes in version 0.4.0

- CI and coverage tweaks for CRAN readiness.
- Split snowball functionality out into openalexSnowball.

Changes in version 0.3.1

- Added pro_fetch() with project_folder support for structured outputs.
- Added progress reporting and parallelization for pro_request_jsonl().
- Added sample_parquet_n() random sampling utilities with select support.
- Improved count_only output to return a data frame with an error column.

Changes in version 0.3.0

- Added count_only support for pro_request() and related helpers.
- Added DOI handling improvements and API call fixes.

Changes in version 0.2.0

- Introduced pro_query() as the package-native query builder with chunking.
- Added snowball search utilities and citation edge extraction workflow.
- Expanded conversion pipeline tests and VCR-based API fixtures.
- Added extract_doi() helpers and compatibility reporting artifacts.
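
The unified schema inference added in version 0.4.1 addresses fields such as apc_paid that are null in some files and a struct in others. A minimal base-R sketch of that kind of type widening follows. The widening rule, the function names widen_type() and unify_schemas(), the example schemas, and the assumption that an all-null column is reported as JSON are all illustrative and not taken from the package.

```r
# Per-file schemas as a DESCRIBE-like query might report them (made-up data).
schema_a <- data.frame(
  column_name = c("id", "apc_paid"),
  column_type = c("VARCHAR", "JSON"),  # apc_paid is null throughout this file
  stringsAsFactors = FALSE
)
schema_b <- data.frame(
  column_name = c("id", "apc_paid"),
  column_type = c("VARCHAR", "STRUCT(value BIGINT, currency VARCHAR)"),
  stringsAsFactors = FALSE
)

# Hypothetical widening rule: the placeholder type of an all-null column
# yields to any concrete type; conflicting concrete types fall back to VARCHAR.
widen_type <- function(types, null_type = "JSON") {
  concrete <- unique(types[types != null_type])
  if (length(concrete) == 0L) return(null_type)
  if (length(concrete) == 1L) return(concrete)
  "VARCHAR"
}

# Stack the per-file schemas and widen the types column by column.
unify_schemas <- function(schemas) {
  all_rows <- do.call(rbind, schemas)
  widened <- tapply(all_rows$column_type, all_rows$column_name, widen_type)
  data.frame(
    column_name = names(widened),
    column_type = as.vector(widened),
    stringsAsFactors = FALSE
  )
}

unified <- unify_schemas(list(schema_a, schema_b))
```

Applying one unified type per column to every file is what keeps the resulting Parquet files compatible when they are later read as a single dataset.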