| Title: | Providing a more advanced access to OpenAlex for the power user |
|---|---|
| Description: | More about what it does (maybe more than one line). |
| Authors: | Rainer M Krug |
| Maintainer: | Rainer M Krug <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 0.10.4 |
| Built: | 2026-06-03 18:33:15 UTC |
| Source: | https://github.com/openalexPro/openalexPro |
**Moved to the openalexSnapshot package.**
This function has been removed from openalexPro. Please install the openalexSnapshot package and call 'openalexSnapshot::build_corpus_index()' instead.
build_corpus_index(...)
... |
Ignored. |
https://github.com/rkrug/openalexSnapshot
Renders the Quarto report at 'system.file("compatibility.qmd", package = "openalexPro")' and opens the resulting HTML in your default browser.
compatibility_report(output_dir = "Compatibility Report", open = TRUE, quiet = FALSE)
output_dir |
Directory to write the rendered HTML and the data into. Defaults to the folder './Compatibility Report'. |
open |
Logical; if 'TRUE' (default) opens the rendered HTML in the system browser. |
quiet |
Logical; suppress rendering output if 'TRUE'. Default: 'FALSE'. |
This report is designed to help you validate client–API compatibility in real time. During rendering, the report performs live requests against the OpenAlex API and compares the responses to the package's expected behavior. No cached data are used: every section issues fresh API calls so that the output reflects the current state of the upstream service. The report summarizes differences in fields, types, pagination and response shapes to surface potential regressions from upstream changes or local client updates.
Note: Because it depends on live API calls, rendering may take longer and requires network access. Be mindful of API rate limits when running the report repeatedly.
Invisibly returns the path to the rendered HTML file.
Extracts DOIs or specific DOI components (resolver, prefix, or suffix) from a character vector. Assumes that each element of 'x' contains at most one DOI (with or without resolver).
extract_doi(x, non_doi_value = "", normalize = TRUE, what = c("doi", "resolver", "prefix", "suffix"))
x |
A character vector potentially containing DOIs (e.g., raw DOIs, DOI URLs, or strings with embedded DOIs). |
non_doi_value |
Value to use for elements where no DOI or component is found. If 'NULL', only matched elements are returned. |
normalize |
Logical. If 'TRUE' (default), convert extracted DOIs and suffixes to lowercase and trim surrounding whitespace. Has no effect for 'what = "prefix"' or 'what = "resolver"'. |
what |
What to extract from each element. One of:
|
A character vector:
- If 'non_doi_value' is not 'NULL', a vector of the same length as 'x', with unmatched entries replaced.
- If 'non_doi_value' is 'NULL', a vector of only matched entries.
x <- c(
  "https://doi.org/10.5281/zenodo.1234567",
  " 10.1000/XYZ456 ",
  "no doi here",
  NA
)
extract_doi(x) # Full DOIs (default)
extract_doi(x, what = "resolver")
extract_doi(x, what = "prefix")
extract_doi(x, what = "suffix")
extract_doi(x, non_doi_value = NA_character_)
extract_doi(x, non_doi_value = NULL)
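To make the extraction step more concrete, here is a minimal base R sketch of pulling one DOI per element out of a character vector. It is not the package's implementation: the function name `extract_doi_sketch`, the regular expression, and the handling of `NA` input are assumptions made for illustration, and it covers only the full-DOI case with a scalar `non_doi_value`.

```r
# Hypothetical stand-in for the what = "doi" case; the real pattern may differ.
extract_doi_sketch <- function(x, non_doi_value = "") {
  # Assumed DOI shape: "10.", a 4-9 digit registrant code, "/", then non-space characters.
  pattern <- "10\\.[0-9]{4,9}/[^[:space:]]+"
  hit <- regexpr(pattern, x)
  found <- !is.na(hit) & as.integer(hit) > 0
  out <- rep(non_doi_value, length(x))
  # regmatches() returns only the matched elements, in order.
  out[found] <- tolower(regmatches(x, hit))
  out
}

x <- c("https://doi.org/10.5281/zenodo.1234567", " 10.1000/XYZ456 ", "no doi here", NA)
print(extract_doi_sketch(x))
```

Lowercasing mirrors the documented `normalize = TRUE` behaviour; surrounding whitespace never enters the match because the pattern stops at the first space.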
This function computes the ID block partition key from OpenAlex IDs. The ID block is calculated as the trailing numeric portion of the ID divided by 10,000.
id_block(ids)
ids |
Character vector of OpenAlex IDs in any format:
|
OpenAlex IDs come in several formats:
Standard: https://openalex.org/{type}{number} (e.g., W1234567890)
Path-based: https://openalex.org/{entity_type}/{number} (e.g., domains/2, subfields/2208)
The ID block is computed as floor(number / 10000), where number is
the trailing numeric portion of the ID. This groups approximately 10,000
IDs into each block, useful for partitioning large datasets.
Integer vector of ID blocks.
# Short form IDs
id_block(c("W2741809807", "W2741809808", "W1234567890"))
# Returns: c(274180, 274180, 123456)
# Long form IDs
id_block("https://openalex.org/W2741809807")
# Returns: 274180
# Path-based IDs
id_block("https://openalex.org/domains/2")
# Returns: 0
# Works with any entity type
id_block(c("A123456789", "I987654321"))
# Returns: c(12345, 98765)
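The arithmetic behind the block key can be reproduced in a few lines of base R. The sketch below is an illustration of the documented formula, floor(number / 10000), and not the package's code; the name `id_block_sketch` and the choice to return `NA` for IDs without trailing digits are assumptions.

```r
# Hypothetical re-implementation of the documented formula.
id_block_sketch <- function(ids) {
  # Keep only the trailing run of digits (lazy prefix match needs perl = TRUE).
  digits <- sub("^.*?([0-9]+)$", "\\1", ids, perl = TRUE)
  # Assumption: IDs that do not end in a digit (or are NA) yield NA.
  digits[!grepl("[0-9]$", ids)] <- NA_character_
  # Convert via double first: values such as 2741809807 exceed the integer range.
  as.integer(floor(as.numeric(digits) / 10000))
}

print(id_block_sketch(c("W2741809807", "https://openalex.org/domains/2", "A123456789")))
```

Going through `as.numeric()` before dividing matters because many work IDs are larger than R's maximum 32-bit integer.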
Infers the schema of each JSON/NDJSON file individually via DuckDB's 'read_json_auto()' and merges the results using type-widening rules. Processing files one at a time avoids the out-of-memory errors that occur when opening all files in a single DuckDB query.
infer_json_schema(con, files, sample_size = 20, extra_options = "", verbose = TRUE, schema_cache_dir = NULL)
con |
An active DuckDB connection ('DBI::dbConnect(duckdb::duckdb())') with the JSON extension loaded ('LOAD json'). |
files |
Character vector of paths to JSON or NDJSON ('.gz') files. |
sample_size |
Number of files to sample for schema inference. Higher values give more accurate schemas but take longer. Use '0' or 'NULL' to use all files. Default is '20'. |
extra_options |
Additional options appended to the 'read_json_auto' SQL call, e.g. '", maximum_object_size=1000000000"' for large JSON objects. Default is '""'. |
verbose |
If 'TRUE', print progress messages and a progress bar. Default is 'TRUE'. |
schema_cache_dir |
Path to a directory for caching per-file and unified schemas. The directory is created if it does not exist. 'NULL' (default) disables caching. |
The returned columns clause can be passed directly to 'read_json(..., columns = <result>)' or 'read_json_auto(..., columns = <result>)' in subsequent DuckDB queries to enforce a consistent schema across all files.
A DuckDB columns clause string (e.g.
"{'col1': 'VARCHAR', 'col2': 'BIGINT', ...}") suitable for use as
the 'columns' argument to 'read_json()'. Returns 'NULL' if schema
inference fails for all files.
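As a rough illustration of how the returned clause might be used, the sketch below splices a schema string into a `read_json()` query. The helper name `build_read_json_sql`, the file path, and the schema value are placeholders invented for this example; only the `columns` argument comes from the description above.

```r
# Hypothetical helper: builds a DuckDB query that reads one file with a fixed schema.
# Assumes the path contains no single quotes.
build_read_json_sql <- function(path, columns_clause) {
  sprintf("SELECT * FROM read_json('%s', columns = %s)", path, columns_clause)
}

schema <- "{'id': 'VARCHAR', 'cited_by_count': 'BIGINT'}"  # example value only
sql <- build_read_json_sql("path/to/part_000.gz", schema)
cat(sql, "\n")
# With an open connection, the query could then be run via DBI::dbGetQuery(con, sql).
```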
When 'schema_cache_dir' is provided, two levels of caching apply:
- **Unified schema** ('unified_schema.csv'): if present, it is loaded and returned immediately, with no DuckDB queries needed. Delete this file to force re-inference.
- **Per-file schemas** ('<update_date>_<part_name>.csv'): each file's schema is saved as it is inferred. On restart, already-cached files are skipped, enabling mid-run resume for large file sets.
When a column has different types across files, the unified type is chosen by these rules (in order):
1. All identical → keep as-is.
2. Any 'STRUCT'/'LIST'/'MAP' vs simpler type → complex type wins.
3. Multiple 'STRUCT' types → pick the one with the most fields.
4. Numeric conflicts → widest type wins ('TINYINT < SMALLINT < INTEGER < BIGINT < HUGEINT < FLOAT < DOUBLE').
5. Fallback → 'VARCHAR'.
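The rules above can be sketched in base R as follows. This is an illustrative reading of the list, not the package's code: the function names are hypothetical, list types are assumed to be written as `TYPE[]`, field counting looks only at top-level commas of a plain `STRUCT(...)`, and falling back to `VARCHAR` when different kinds of complex types collide is an assumption the rules do not spell out.

```r
numeric_ladder <- c("TINYINT", "SMALLINT", "INTEGER", "BIGINT", "HUGEINT", "FLOAT", "DOUBLE")

# Count top-level fields of a type written as STRUCT(...), ignoring nested commas.
count_struct_fields <- function(type) {
  inner <- sub("^STRUCT\\((.*)\\)$", "\\1", type)
  chars <- strsplit(inner, "")[[1]]
  depth <- cumsum(chars == "(") - cumsum(chars == ")")
  sum(chars == "," & depth == 0) + 1L
}

# Merge the types seen for one column across files into a single type.
widen_types_sketch <- function(types) {
  types <- unique(types)
  if (length(types) == 1) return(types)                     # rule 1
  is_complex <- grepl("^(STRUCT|MAP)\\(", types) | grepl("\\[\\]$", types)
  if (any(is_complex)) {
    complex <- types[is_complex]
    if (length(complex) == 1) return(complex)               # rule 2
    if (all(startsWith(complex, "STRUCT("))) {              # rule 3
      n_fields <- vapply(complex, count_struct_fields, integer(1))
      return(complex[which.max(n_fields)])
    }
    return("VARCHAR")                                       # assumption: mixed complex kinds
  }
  if (all(types %in% numeric_ladder)) {                     # rule 4
    return(numeric_ladder[max(match(types, numeric_ladder))])
  }
  "VARCHAR"                                                 # rule 5
}

print(widen_types_sketch(c("INTEGER", "DOUBLE", "BIGINT")))
```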
[snapshot_to_parquet()] which uses this function internally.
## Not run:
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbExecute(con, "LOAD json")
files <- list.files("path/to/snapshot/works", pattern = "\\.gz$", recursive = TRUE, full.names = TRUE)
schema <- infer_json_schema(con, files, sample_size = 50, schema_cache_dir = "path/to/cache")
DBI::dbDisconnect(con, shutdown = TRUE)
# schema is now a string like: {'id': 'VARCHAR', 'title': 'VARCHAR', ...}
## End(Not run)
This function runs a jq filter to extract records from the "results" array
(or from the root if type = "single"), reconstruct the abstract text,
generate a citation string, and optionally add a page field. It writes the
result as newline-delimited JSON (.jsonl), suitable for Arrow or DuckDB. For
details on the jq filter logic, see the vignette("jq", package
= "openalexPro").
jq_execute(input_json, output_jsonl, add_columns = list(), jq_filter = NULL, page = NULL, type = c("results", "single", "group_by"))
input_json |
Path to the input JSON file |
output_jsonl |
Path to the output .jsonl file |
add_columns |
List of additional fields to be added to the output. They have to be provided as a named list, e.g. 'list(column_1 = "value_1", column_2 = 2)'. Only scalar values are supported. |
jq_filter |
Optional custom jq filter string. If NULL, the default filter is used. |
page |
Optional integer to be added as a "page" field in each output record |
type |
Either "results" (default, expects a .results[] array) or "single" (treat input as array of records directly) |
Invisibly returns the output path
**Moved to the openalexSnapshot package.**
This function has been removed from openalexPro. Please install the openalexSnapshot package and call 'openalexSnapshot::lookup_by_id()' instead.
lookup_by_id(...)
... |
Ignored. |
https://github.com/rkrug/openalexSnapshot
Copies unified_schema.csv files from an OpenAlex snapshot metadata
directory (e.g. /Volumes/openalex/openalex-snapshot_metadata) into
the user-level cache used by pro_request_parquet(schema = "auto").
oa_cache_schema(source, entities = "all", overwrite = FALSE, verbose = TRUE)
source |
Path to the snapshot metadata directory, e.g.
|
entities |
Character vector of entity names to cache, or |
overwrite |
Logical. Overwrite an existing cached file? Default
|
verbose |
Logical. Print progress messages? Default |
Once cached, the schemas are used even when the source volume is not mounted.
Update the cache periodically to pick up new fields added by OpenAlex (run
with overwrite = TRUE).
The path to the schemata cache directory (invisibly).
pro_request_parquet for the schema parameter.
Uppercases SQL type keywords (BIGINT, VARCHAR,
STRUCT, ...) while preserving the case of struct field
identifiers.
oa_normalize_duckdb_type(t)
t |
A character scalar: a raw DuckDB type string, e.g.
|
A character scalar with normalised type keywords.
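One simple way to approximate this normalisation is a case-insensitive keyword substitution, sketched below. The function name and the keyword list are assumptions for illustration. Unlike the documented behaviour, this naive version would also uppercase a struct field whose name happens to equal a keyword, so it should be read as a first approximation only.

```r
# Assumed (incomplete) list of type keywords to uppercase.
sql_keywords <- c("bigint", "varchar", "struct", "double", "integer", "boolean", "map")

normalize_type_sketch <- function(t) {
  # Match whole words only, so identifiers such as "bigint_count" are left alone.
  pattern <- paste0("\\b(", paste(sql_keywords, collapse = "|"), ")\\b")
  # "\\U\\1" uppercases the matched keyword (requires perl = TRUE).
  gsub(pattern, "\\U\\1", t, perl = TRUE, ignore.case = TRUE)
}

print(normalize_type_sketch("struct(id varchar, Count bigint)"))
```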
abstract_inverted_index column in OpenAlex works data.
The expression walks the map, collects (position, word) pairs, sorts by
position ascending, and joins words with single spaces. Returns NULL when
abstract_inverted_index is NULL.
oa_works_abstract_sql()
abstract_inverted_index is normalised to MAP(VARCHAR, BIGINT[])
via a double JSON cast (::JSON::MAP(VARCHAR, BIGINT[])) before
map_entries() is called. This makes the expression safe regardless
of whether DuckDB inferred the column as MAP, STRUCT, or
VARCHAR (raw JSON text): all three round-trip through the JSON
representation identically.
A character scalar containing the SQL expression.
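The same reconstruction logic (collect position/word pairs, sort by position, join with spaces) can be illustrated in plain R on an already parsed inverted index. This is a sketch of the idea rather than a translation of the SQL expression; the function name and the toy index are made up for the example.

```r
# inverted_index: named list mapping each word to the positions where it occurs.
reconstruct_abstract_sketch <- function(inverted_index) {
  if (is.null(inverted_index)) return(NULL)
  # One entry per (word, position) pair.
  words <- rep(names(inverted_index), lengths(inverted_index))
  positions <- unlist(inverted_index, use.names = FALSE)
  # Sort by position ascending and join with single spaces.
  paste(words[order(positions)], collapse = " ")
}

idx <- list(the = c(0, 3), cat = 1, chased = 2, mouse = 4)
print(reconstruct_abstract_sketch(idx))
```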
authorships and publication_year columns in OpenAlex
works data. Format: "Author (year)" / "A & B (year)" /
"A et al. (year)".
Null year renders as "(n.d.)".
Null or empty authorships yields NULL.
oa_works_citation_sql()
A character scalar containing the SQL expression.
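The citation format described above can be mirrored in a small R function. The sketch below only illustrates the formatting rules; which part of the author name the SQL expression actually uses is not stated here, so the function simply takes whatever names it is given, and its name is hypothetical.

```r
format_citation_sketch <- function(authors, year = NULL) {
  # Null or empty authorships yield NULL.
  if (is.null(authors) || length(authors) == 0) return(NULL)
  # A missing year renders as "(n.d.)".
  year_txt <- if (is.null(year) || is.na(year)) "n.d." else as.character(year)
  who <- switch(
    min(length(authors), 3L),
    authors[1],                          # one author
    paste(authors[1], "&", authors[2]),  # two authors
    paste(authors[1], "et al.")          # three or more
  )
  sprintf("%s (%s)", who, year_txt)
}

print(format_citation_sketch(c("Krug", "Smith", "Lee"), 2021))
```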
Get API key for OpenAlex API
opt_api_key(api_key)
api_key |
character vector or NULL. If specified, value to assign to the api key option. Default is 'NULL'. |
The API key: the current one if 'api_key' is not specified, otherwise the previous one.
Get available filter names from OpenAlex API
opt_filter_names(update = FALSE)
update |
logical. If 'TRUE' update the existing value. Default is 'FALSE'. |
A character vector of available filter names
Get available select fields from OpenAlex API
opt_select_fields(update = FALSE)
update |
logical. If 'TRUE' update the existing value. Default is 'FALSE'. |
A character vector of available select fields
Copies the Makefile for snapshot management to the specified directory and provides instructions for creating and managing OpenAlex snapshots.
prepare_snapshot(path = ".", overwrite = FALSE)
path |
Character. The directory where the Makefile and documentation should be copied. Defaults to the current working directory. |
overwrite |
Logical. Whether to overwrite existing files. Defaults to FALSE. |
This function sets up a directory for managing OpenAlex snapshots by:
Copying a Makefile with targets for downloading and converting snapshots
Copying documentation about the snapshot process
The Makefile provides the following targets:
Show available make targets
Download/sync OpenAlex snapshot from S3
Convert snapshot to parquet format
Build ID indexes for fast lookups
Remove generated directories
Invisibly returns the path to the created Makefile.
## Not run:
# Prepare current directory
prepare_snapshot()
# Prepare a specific directory
prepare_snapshot("/path/to/openalex-data")
## End(Not run)
Retrieves the OpenAlex Pro API key from one of several locations, checked in the following order:
pro_api_key()
The R option openalexPro$api_key
The environment variable openalexPro.api_key
The system keyring via the keyring package (only if the package keyring is installed)
If no API key is found, NULL or an empty string may be returned,
depending on the environment variable state.
A character string containing the API key, or NULL if no key
could be found.
## Not run:
pro_api_key()
options(openalexPro = list(api_key = "my-key"))
pro_api_key()
## End(Not run)
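The lookup order can be pictured as a simple fallback chain. The sketch below covers only the first two documented sources (the R option and the environment variable) and leaves out the keyring step, since the keyring service name is not given here. The function name is hypothetical, and always returning `NULL` when nothing is found is a simplification of the behaviour described above.

```r
lookup_api_key_sketch <- function() {
  # 1. R option: options(openalexPro = list(api_key = ...))
  key <- getOption("openalexPro")$api_key
  if (!is.null(key) && nzchar(key)) return(key)
  # 2. Environment variable openalexPro.api_key
  key <- Sys.getenv("openalexPro.api_key", unset = "")
  if (nzchar(key)) return(key)
  # 3. The keyring lookup would go here (omitted in this sketch).
  NULL
}

print(lookup_api_key_sketch())
```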
Downloads full-text content from the OpenAlex content endpoint
(content.openalex.org) for a vector of work IDs. One file is written
per ID. Downloads can be parallelised via the workers argument.
pro_download_content(ids, format = c("pdf", "grobid-xml"), output = ".", workers = 1L, api_key = pro_api_key(), endpoint = "https://content.openalex.org")
ids |
Character vector of OpenAlex work IDs (e.g.
|
format |
File format to download. One of |
output |
Directory to save downloaded files into. Defaults to the current working directory. Created if it does not exist. |
workers |
Number of parallel download workers. Defaults to |
api_key |
OpenAlex API key (character string) or 'NULL'. Defaults to
the |
endpoint |
Base URL of the content endpoint. Defaults to
|
A data frame with one row per ID and columns:
id: The (normalised) work ID.
file: Full path to the saved file, or NA if not downloaded.
status: One of "ok", "not_found" (HTTP 404), or "error".
message: Error message, or NA on success.
Content downloads cost $0.01 per file — 10x the cost of a
metadata search query. Use has_content.pdf:true or
has_content.grobid-xml:true as filter arguments to pro_query()
to discover which works have downloadable content before downloading.
"pdf": Full-text PDF (~60 million files available).
"grobid-xml": Machine-readable TEI XML parsed by Grobid (~43 million files). Suitable for structured text extraction.
PDFs and XMLs retain their original copyright. OpenAlex does not grant
additional rights. Check the best_oa_location.license field of each
work for the applicable licence.
## Not run:
# Download a single PDF
result <- pro_download_content(
  ids = "W2741809807",
  format = "pdf",
  output = tempdir()
)
# Find works with PDFs available, then download them
urls <- pro_query(
  entity = "works",
  has_content.pdf = TRUE,
  from_publication_date = "2023-01-01",
  options = list(per_page = 10)
)
works <- pro_request(urls, output = tempdir())
# ... extract IDs from works data, then:
result <- pro_download_content(ids = work_ids, format = "pdf", workers = 4)
# XPAC works: discover via pro_query() with include_xpac = TRUE, then download
# (pro_download_content() works with any valid OpenAlex ID, including XPAC IDs)
urls_xpac <- pro_query(
  entity = "works",
  has_content.pdf = TRUE,
  from_publication_date = "2023-01-01",
  options = list(include_xpac = TRUE, per_page = 10)
)
works_xpac <- pro_request(urls_xpac, output = tempdir())
# ... extract IDs from works_xpac data, then:
result_xpac <- pro_download_content(ids = xpac_ids, format = "pdf", workers = 4)
## End(Not run)
Convenience wrapper that downloads records from OpenAlex via
pro_request() and converts them directly to an Apache Parquet
dataset via pro_request_parquet(). No intermediate JSONL
files are written.
pro_fetch(query_url, pages = 10000, project_folder = NULL, overwrite = FALSE, api_key = pro_api_key(), delete_input = TRUE, workers = 1, verbose = FALSE, progress = TRUE, enrich = TRUE, count_only, error_log = NULL)
query_url |
The URL of the API query or a list of URLs returned from |
pages |
The number of pages to be downloaded. The default is set to
10000, which would be 2,000,000 works. It is recommended to not increase it
beyond 100000 due to server load and to use the snapshot instead. If |
project_folder |
Directory where intermediate ( |
overwrite |
Logical. If |
api_key |
Character string API key or |
delete_input |
Logical. If |
workers |
Number of parallel workers to use if |
verbose |
Logical indicating whether to show verbose messages. |
progress |
Logical indicating whether to show a progress bar. Default |
enrich |
Logical. When |
count_only |
Do not use it here. The function will abort if set to
|
error_log |
location of error log of API calls. (default: |
The function
downloads records from OpenAlex via pro_request() into a
"json" subfolder of project_folder, and
converts the JSON files to an Apache Parquet dataset via
pro_request_parquet() into a "parquet" subfolder.
This function assumes count_only == FALSE
Invisibly, the normalised path of the parquet subfolder
inside project_folder.
pro_request() for the download step,
pro_request_parquet() for the conversion step.
Construct an httr2 request for the OpenAlex API. All filters must be
supplied as named ... arguments (e.g., from_publication_date = "2020-01-01").
pro_query(entity = c("works", "authors", "venues", "institutions", "concepts", "publishers", "funders"), id = NULL, doi = NULL, search = NULL, search.exact = NULL, search.semantic = NULL, group_by = NULL, select = NULL, options = NULL, endpoint = "https://api.openalex.org", chunk_limit = 50L, ...)
entity |
Character; one of |
id |
Optional ID or vector of IDs (e.g., |
doi |
Optional DOI or vector of DOIs (e.g., |
search |
Optional full-text search string. Applies stemming and
stop-word removal. Supports boolean operators ( |
search.exact |
Optional full-text search without stemming or stop-word
removal. Supports the same boolean/phrase/wildcard syntax as |
search.semantic |
Optional semantic (AI-powered) search string. Uses embeddings to match conceptual meaning rather than exact keywords. Limited to 1 request per second and returns at most 50 results per query. |
group_by |
Optional field to group by (facets), e.g. |
select |
Optional character vector of fields to return. |
options |
Optional named list of additional query parameters (e.g.,
|
endpoint |
Base API URL. Defaults to |
chunk_limit |
Number of DOIs or IDs per chunk if chunked. Default: 50 |
... |
Filters as named arguments. Values may be scalars or vectors (vectors
are collapsed with |
Filter names are validated via .validate_filter() using
opt_filter_names(). select fields are validated via
.validate_select() using opt_select_fields().
If more than 50 'doi' or OpenAlex 'id' values are provided, the request is automatically split into chunks of 50 and a named list of URLs is returned.
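The chunking step can be pictured with a short base R sketch that splits a vector of identifiers into groups of at most `chunk_limit` elements. The function name and the `Chunk_<n>` element names are assumptions for illustration; the actual names used for the returned list are not stated here.

```r
chunk_ids_sketch <- function(ids, chunk_limit = 50L) {
  # Assign each ID to a group: 1 for the first chunk_limit IDs, 2 for the next, ...
  groups <- ceiling(seq_along(ids) / chunk_limit)
  chunks <- split(ids, groups)
  names(chunks) <- paste0("Chunk_", names(chunks))  # assumed naming scheme
  chunks
}

ids <- sprintf("W%d", 1:120)
print(lengths(chunk_ids_sketch(ids)))
```

Each chunk would then be turned into its own query URL, which is why a named list rather than a single URL comes back for long ID vectors.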
An individual URL or a list of URLs.
All three search parameters (search, search.exact,
search.semantic) accept a query string. For search and
search.exact, the following syntax is supported:
Boolean: biodiversity AND finance, climate OR weather,
ocean NOT pollution (operators must be uppercase).
Exact phrase: "biodiversity finance" (double quotes).
Proximity: "biodiversity finance"~5 (words within 5 positions).
Wildcard: bio* (zero or more characters), organi?ation.
Fuzzy: biodiversty~1 (allows 1 character edit).
search.semantic does not use keyword syntax; pass a natural-language
phrase or even a full abstract. It returns at most 50 results per call.
Filter arguments with a .search suffix (e.g.
title_and_abstract.search = "biodiversity") are deprecated by the
OpenAlex API. They still work but emit a warning. Use the search,
search.exact, or search.semantic parameters instead. See
https://developers.openalex.org/guides/searching for details.
## Not run:
req <- oa_build_req(
  entity = "works",
  search = "biodiversity",
  from_publication_date = "2020-01-01",
  language = c("en","de"),
  select = c("id","title","publication_year"),
  options = list(per_page = 5)
)
# resp <- api_call(req)
# httr2::resp_body_json(resp)
## End(Not run)
Queries the OpenAlex rate-limit endpoint and returns current API usage and remaining budget as a parsed list.
pro_rate_limit_status(api_key = pro_api_key(), verbose = TRUE)
api_key |
API key (character string) or 'NULL'. Defaults to
|
verbose |
Logical. If |
Invisibly, the parsed JSON list with all rate limit fields; FALSE
if the API key is missing or invalid; or NULL if the request failed due
to a network error.
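Because the return value can take three shapes (a list, `FALSE`, or `NULL`), calling code may want to branch on it explicitly. The helper below is a hypothetical example of such a check and is not part of the package; the messages are made up.

```r
describe_rate_limit_result <- function(status) {
  if (is.null(status)) return("request failed (network error)")
  if (isFALSE(status)) return("API key missing or invalid")
  "rate limit information available"
}

# Intended use, assuming the package is installed and a key is configured:
# status <- pro_rate_limit_status(verbose = FALSE)
# message(describe_rate_limit_result(status))
print(describe_rate_limit_result(NULL))
```

Testing for `NULL` first matters, because `isFALSE(NULL)` is itself `FALSE` and would otherwise fall through to the success branch.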
All returned values from OpenAlex will be saved as
json files in the output directory and the return value is the directory of the
json files.
pro_request(query_url, pages = 1e+05, output = NULL, overwrite = FALSE, api_key = pro_api_key(), workers = 1, verbose = FALSE, progress = TRUE, count_only = FALSE, error_log = NULL)
query_url |
The URL of the API query or a list of URLs returned from |
pages |
The number of pages to be downloaded. The default is set to
10000, which would be 2,000,000 works. It is recommended to not increase it
beyond 100000 due to server load and to use the snapshot instead. If |
output |
directory where the JSON files are saved. Default is a temporary directory. Needs to be specified. |
overwrite |
Logical. If |
api_key |
Character string API key or |
workers |
Number of parallel workers to use if |
verbose |
Logical indicating whether to show verbose messages. |
progress |
Logical indicating whether to show a progress bar. Default |
count_only |
return count only as a data.frame. |
error_log |
location of error log of API calls. (default: |
If query_url is a list, the function is called for each element of the list in parallel
using a maximum of workers parallel R sessions. The results from the individual URLs
in the list are returned in a folder named after the names of the list elements in the
output folder.
When starting the download, a file 00_in.progress is created, which is deleted upon completion.
If count_only is FALSE (the default) the complete path to the expanded and
normalized output. If count_only is TRUE, a data.frame with metadata about
the query (count, db_response_time_ms, page, per_page, error). When query_url is
a list, an additional query column identifies each query.
The function takes a directory of JSONL files as written from a call to
pro_request_jsonl_R(...) and converts each file individually to a Parquet file.
The subfolder structure from the input is preserved in the output, so files
in Chunk_1/ will be written to Chunk_1/ in the output directory.
pro_request_jsonl_parquet(input_jsonl = NULL, output = NULL, overwrite = FALSE, verbose = TRUE, delete_input = FALSE, sample_size = 1000, workers = NULL)
input_jsonl |
The directory of JSON files returned from
|
output |
output directory for the parquet dataset; default: temporary directory. |
overwrite |
Logical indicating whether to overwrite |
verbose |
Logical indicating whether to show verbose information.
Defaults to |
delete_input |
Determines if the |
sample_size |
Number of records to sample from each file when inferring the unified schema. Higher values give more accurate schema inference but use more memory. Default is 1000. Set to -1 to read all records (may be slow for large files). |
workers |
Number of parallel workers for file conversion via
|
The page column (added by pro_request_jsonl_R()) is preserved as a regular
column in the Parquet data.
When starting the conversion, a file 00_in.progress is created which is
deleted upon completion.
The function uses DuckDB to read the JSON files and to create the
Apache Parquet files. Each JSON file is converted individually using its own
DuckDB connection, which enables parallel processing via
future.apply::future_lapply().
To ensure consistent schemas across all Parquet files, the function first
infers a unified schema by sampling records from all JSONL files. This
prevents type mismatches (e.g., a column being struct in one file but
string in another) that would cause errors when reading the combined
Parquet dataset.
The function returns the output path invisibly.
The function takes a directory of JSON files as written by a call to
pro_request(...) and prepares the JSON files for further
processing with DuckDB by converting them to JSONL files.
The subfolders in input_json are preserved in output, i.e.
results of a list of initial queries passed to pro_request() are maintained.
pro_request_jsonl_R(input_json = NULL, output = NULL, add_columns = list(), overwrite = FALSE, verbose = TRUE, progress = TRUE, delete_input = FALSE, workers = 1)
input_json |
The directory of JSON files returned from |
output |
output directory for the jsonl files as created by calls to 'jq_execute()'. |
add_columns |
List of additional fields to be added to the output. They
have to be provided as a named list, e.g. |
overwrite |
Logical indicating whether to overwrite |
verbose |
Logical indicating whether to show verbose information.
Defaults to |
progress |
Logical indicating whether to show a progress bar. Default |
delete_input |
Determines if the |
workers |
Number of parallel workers to use. Defaults to 1. |
See jq_execute or the vignette("jq",
package = "openalexPro") for more information on the conversion of the
JSON files.
The folder/filename is converted to a value named page.
As an example:
the subfolder in the output folder is called Chunk_1
the page the JSON file represents is 2
The resulting value for page will be Chunk_1_2
When starting the conversion, a file 00_in.progress is created, which is deleted upon completion.
The function uses DuckDB to read the JSON files and to create the Apache Parquet files. The function creates a DuckDB connection in memory and reads the JSON files into DuckDB when needed. Then it creates a SQL query to convert the JSON files to Apache Parquet files and to copy the result to the specified directory.
The function returns the output path invisibly.
## Not run:
source_to_parquet(
  input_json = "json",
  source_type = "snapshot",
  output = "parquet"
)
## End(Not run)
Single-step replacement for the two-step
pro_request_jsonl_R() + pro_request_jsonl_parquet() pipeline.
Reads the JSON files written by pro_request() and converts each one to a
Parquet file using DuckDB, with no intermediate JSONL on disk.
pro_request_parquet(input_json = NULL, output = NULL, add_columns = list(), overwrite = FALSE, verbose = TRUE, progress = TRUE, delete_input = FALSE, sample_size = 1000, workers = NULL, enrich = TRUE, schema = "auto")
input_json |
Directory of JSON files returned by |
output |
Output directory for the Parquet dataset. |
add_columns |
Named list of scalar constant columns to embed in every
output record (e.g. |
overwrite |
Logical. Overwrite |
verbose |
Logical. Show progress messages. Default |
progress |
Logical. Show a progress bar. Default |
delete_input |
Logical. Delete |
sample_size |
Integer. Number of records per file passed to DuckDB's
|
workers |
Integer. Number of parallel workers.
|
enrich |
Logical. When |
schema |
Controls use of a pre-built baseline schema for type resolution. Possible values:
|
For works entities the function detects the presence of
abstract_inverted_index, authorships, and publication_year in the
inferred schema and, when enrich = TRUE (the default), adds two computed
columns:
abstract — plain text reconstructed from abstract_inverted_index.
citation — "Author (year)" / "A & B (year)" / "A et al. (year)".
These expressions are identical to those used by the openalex-snapshot CLI
binary, so the Parquet output matches the snapshot pipeline column for column.
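The two computed columns can be illustrated in plain R. This is only a sketch of the idea, not the SQL expressions the package runs inside DuckDB. The helper names reconstruct_abstract() and format_citation() are hypothetical, word positions are assumed to be 0-based, and which part of an author's name enters the citation is an assumption.

```r
# Rebuild plain text from an inverted index: a named list that maps each
# word to the positions at which it occurs in the abstract.
reconstruct_abstract <- function(inverted_index) {
  if (length(inverted_index) == 0) {
    return(NA_character_)
  }
  # Repeat each word once per position, then sort the words by position.
  words <- rep(names(inverted_index), lengths(inverted_index))
  positions <- unlist(inverted_index, use.names = FALSE)
  paste(words[order(positions)], collapse = " ")
}

# Build a short citation label from a character vector of author names.
format_citation <- function(authors, year) {
  n <- length(authors)
  if (n == 0) {
    return(NA_character_)
  }
  label <- if (n == 1) {
    authors
  } else if (n == 2) {
    paste(authors, collapse = " & ")
  } else {
    paste(authors[1], "et al.")
  }
  paste0(label, " (", year, ")")
}

# Made-up inverted index; "science" occurs at positions 1 and 4.
index <- list(Open = 0L, science = c(1L, 4L), needs = 2L, open = 3L)
abstract_text <- reconstruct_abstract(index)
# "Open science needs open science"

citation_one <- format_citation("Smith", 2021)                       # "Smith (2021)"
citation_two <- format_citation(c("Smith", "Jones"), 2021)           # "Smith & Jones (2021)"
citation_many <- format_citation(c("Smith", "Jones", "Lee"), 2021)   # "Smith et al. (2021)"
```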
Output directory path (invisibly).
pro_request() writes one JSON file per API page. For paginated queries
each file has the structure {"results": [...], "meta": {...}}. For
group-by queries the array field is "group_by". For single-record lookups
the file is a bare JSON object. All three formats are handled automatically.
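The three file shapes can be told apart with a few checks on the parsed JSON. The sketch below works on already-parsed R lists (for example as returned by jsonlite::fromJSON(path, simplifyVector = FALSE)); the helper extract_records() is hypothetical, and the order of the checks is an assumption rather than the package's actual detection logic.

```r
# Return the list of records contained in one parsed JSON file.
extract_records <- function(page) {
  if (length(page[["group_by"]]) > 0) {
    # Group-by query: the records live in the "group_by" array.
    page[["group_by"]]
  } else if (!is.null(page[["results"]])) {
    # Paginated query: the records live in the "results" array.
    page[["results"]]
  } else {
    # Single-record lookup: the file itself is the record.
    list(page)
  }
}

# Made-up inputs, one per shape.
paginated <- list(
  results = list(list(id = "W1"), list(id = "W2")),
  meta = list(count = 2)
)
grouped <- list(
  group_by = list(list(key = "2020", count = 5)),
  meta = list(count = 1)
)
single <- list(id = "W1", display_name = "A single work")

records_paginated <- extract_records(paginated)
records_grouped <- extract_records(grouped)
records_single <- extract_records(single)
```

Wrapping the bare object in a one-element list means all three shapes yield the same kind of value, a list of records, so later steps do not need to distinguish them.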
The subdirectory structure of input_json is preserved, with hive-partition
naming (query=<name>/, query_l2=<name>/, …) so that Arrow/DuckDB can
read the result as a partitioned dataset. A page column is added to each
record with a value derived from the source filename (or subdirectory for
multi-query inputs).
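To show what hive-partition naming buys, the following sketch writes a small partitioned Parquet dataset with DuckDB and reads it back with the partition key restored as a column. The table contents, the directory name hive_demo, and the use of COPY ... PARTITION_BY are assumptions made for this illustration; pro_request_parquet() may lay out its files differently in detail.

```r
library(DBI)

con <- dbConnect(duckdb::duckdb())

# Fresh output directory (DuckDB expects forward slashes).
out_dir <- gsub("\\\\", "/", file.path(tempdir(), "hive_demo"))
unlink(out_dir, recursive = TRUE)

# Made-up records belonging to two queries.
dbExecute(con, "
  CREATE TABLE records AS
  SELECT * FROM (VALUES
    ('W1', 1, 'climate'),
    ('W2', 2, 'climate'),
    ('W3', 1, 'soil')
  ) AS t(id, page, query)
")

# Write one subdirectory per value of `query`, named query=<value>/.
dbExecute(con, sprintf(
  "COPY records TO '%s' (FORMAT PARQUET, PARTITION_BY (query))",
  out_dir
))
partition_dirs <- sort(list.files(out_dir))

# Reading with hive partitioning turns the directory names back into a column.
per_query <- dbGetQuery(con, sprintf(
  "SELECT query, count(*) AS n
   FROM read_parquet('%s/*/*.parquet', hive_partitioning = true)
   GROUP BY query ORDER BY query",
  out_dir
))
dbDisconnect(con, shutdown = TRUE)
```

A filter on the partition column can then skip whole subdirectories instead of scanning every file.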
pro_request() to download the JSON files,
pro_request_jsonl_R() and pro_request_jsonl_parquet() for the older
two-step pipeline (now deprecated).
Makes a minimal API request to verify that the api_key is valid.
pro_validate_credentials(api_key = pro_api_key(), show_credentials = FALSE)
| Argument | Description |
|---|---|
| api_key | API key to validate (character string) or 'NULL'. |
| show_credentials | Logical; if 'TRUE', shows the api_key using 'message()'. Default: 'FALSE'. USE WITH CAUTION! |
TRUE if the credentials work, FALSE otherwise.
This function reads a corpus in Apache Parquet format and returns either an
ArrowObject representing the corpus, which can be fed into a dplyr
pipeline, or a tibble containing all the data.
read_corpus(corpus, return_data = FALSE)
| Argument | Description |
|---|---|
| corpus | The directory of the Parquet files. |
| return_data | Logical indicating whether to return an |
An ArrowObject representing the corpus or a tibble.
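The difference between a lazy Arrow object and a materialised tibble can be sketched with arrow and dplyr directly. Whether read_corpus() is implemented on top of arrow::open_dataset() is an assumption; the directory corpus_demo and its three made-up records exist only for this example.

```r
library(arrow)
library(dplyr)

# Write a tiny Parquet "corpus" with made-up records to a temporary directory.
corpus_dir <- file.path(tempdir(), "corpus_demo")
dir.create(corpus_dir, showWarnings = FALSE)
write_parquet(
  data.frame(
    id = c("W1", "W2", "W3"),
    publication_year = c(2019L, 2021L, 2023L)
  ),
  file.path(corpus_dir, "part-0.parquet")
)

# Lazy handle on the dataset: no rows are loaded into R yet.
lazy_corpus <- open_dataset(corpus_dir)

# The dplyr verbs are evaluated by Arrow; collect() returns a tibble
# that contains only the rows and columns that survived the pipeline.
recent <- lazy_corpus |>
  filter(publication_year >= 2021L) |>
  select(id, publication_year) |>
  arrange(id) |>
  collect()
```

Keeping the object lazy until the final collect() is what makes this pattern usable for corpora that do not fit into memory.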
Draw a uniform random sample of n rows from one or more Parquet files
using DuckDB's SQL USING SAMPLE reservoir(n ROWS) clause.
The sampling is performed entirely inside DuckDB, so the full dataset
is never loaded into R.
This is well-suited for large Parquet corpora (e.g. OpenAlex works) where you want a random subset of rows without materialising the whole table.
sample_parquet_n(path, n, seed = NULL, con = NULL, select = NULL)
| Argument | Description |
|---|---|
| path | Character scalar. Path or glob pointing to one or more Parquet files, as understood by DuckDB's |
| n | Integer scalar. Number of rows to sample. If |
| seed | Optional integer scalar. If supplied, a |
| con | Optional |
| select | Optional character vector of column names to return. If |
The function delegates to the following SQL pattern (simplified):
SELECT [columns]
FROM parquet_scan('path/to/files/*.parquet')
USING SAMPLE reservoir(n ROWS)
[REPEATABLE (seed)]
Using reservoir(n ROWS) gives an exact uniform sample of size
n from all rows in the dataset (unless n exceeds the total
row count, in which case all rows are returned).
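The SQL pattern can be tried against a throwaway Parquet file using DBI and duckdb. This is a sketch of the pattern only, not the body of sample_parquet_n(); the file name, the column id, and the seed value are made up for the example.

```r
library(DBI)

con <- dbConnect(duckdb::duckdb())

# Create a Parquet file with 1,000 rows (ids 0 to 999).
parquet_file <- gsub("\\\\", "/", file.path(tempdir(), "sample_demo.parquet"))
dbExecute(con, sprintf(
  "COPY (SELECT i AS id FROM range(1000) t(i)) TO '%s' (FORMAT PARQUET)",
  parquet_file
))

# Draw exactly 5 rows; the sampling happens inside DuckDB.
sampled <- dbGetQuery(con, sprintf(
  "SELECT id FROM parquet_scan('%s')
   USING SAMPLE reservoir(5 ROWS) REPEATABLE (42)",
  parquet_file
))

# Ask for more rows than the file contains: all rows come back.
oversized <- dbGetQuery(con, sprintf(
  "SELECT id FROM parquet_scan('%s') USING SAMPLE reservoir(5000 ROWS)",
  parquet_file
))

dbDisconnect(con, shutdown = TRUE)
```

Only the sampled rows cross from DuckDB into R, which is why the approach scales to files far larger than available memory.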
Note that the path argument is passed directly to DuckDB's
parquet_scan() function, so you can use:
A single Parquet file:
"works.parquet"
A glob for many files:
"works/*.parquet"
A directory, depending on your DuckDB version/configuration.
When con is NULL, the function creates an in-memory DuckDB
database. If you want to reuse the same DuckDB instance for multiple queries
(for performance reasons or to control pragmas), you can create a DuckDB
connection yourself and pass it via con.
A data.frame with up to n rows, containing a uniform random
sample from the union of all Parquet files matched by path, restricted
to the columns specified in select (or all columns if select is
NULL).
    ## Not run:
    # Sample 10,000 rows from a directory of Parquet files
    sample_df <- sample_parquet_n(
      path = "spc_corpus/output/chapter_3/corpus/*.parquet",
      n = 10000L,
      seed = 1234
    )

    # Sample only a subset of columns
    sample_df_small <- sample_parquet_n(
      path = "spc_corpus/output/chapter_3/corpus/*.parquet",
      n = 10000L,
      seed = 1234,
      select = c("id", "doi", "citation", "author_abbr", "display_name", "ab")
    )

    # Reuse a DuckDB connection for multiple samples
    con <- DBI::dbConnect(duckdb::duckdb())
    s1 <- sample_parquet_n(
      path = "openalex_works/*.parquet",
      n = 500L,
      seed = 42,
      con = con
    )
    s2 <- sample_parquet_n(
      path = "openalex_works/*.parquet",
      n = 500L,
      seed = 777,
      con = con
    )
    # Close the shared connection once all samples have been drawn
    DBI::dbDisconnect(con, shutdown = TRUE)
    ## End(Not run)
**Moved to the openalexSnapshot package.**
This function has been removed from openalexPro. Please install the openalexSnapshot package and call 'openalexSnapshot::snapshot_to_parquet()' instead.
snapshot_to_parquet(...)
| Argument | Description |
|---|---|
| ... | Ignored. |
https://github.com/rkrug/openalexSnapshot