| Title: | OpenAlex Bulk Snapshot Conversion, Indexing, and Record Extraction |
|---|---|
| Description: | Provides tools for working with the OpenAlex bulk snapshot: converting .json.gz files to Parquet format, building ID-lookup indexes over the resulting corpus, and extracting records by OpenAlex ID. Large-scale operations delegate to a compiled Rust back-end (openalex-core via extendr); a pure-R/DuckDB fallback is included for environments without a Rust toolchain. |
| Authors: | Rainer M Krug |
| Maintainer: | Rainer M Krug <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 0.0.2 |
| Built: | 2026-06-02 18:43:37 UTC |
| Source: | https://github.com/openalexPro/openalexSnapshot |
Builds a <dataset>_id_idx.parquet index from the Parquet corpus produced
by snapshot_to_parquet(), enabling fast record retrieval by OpenAlex ID
using lookup_by_id().
build_corpus_index(
  root_dir = NULL,
  data_sets = NULL,
  workers = NULL,
  memory_limit = NULL,
  overwrite = FALSE,
  verbose = TRUE,
  corpus_dir = NULL
)
root_dir |
Root directory containing a |
data_sets |
Character vector of dataset names to index (e.g.
|
workers |
Number of parallel workers for Stage 1 indexing. Default is
|
memory_limit |
DuckDB memory limit (e.g., |
overwrite |
If |
verbose |
Print progress messages. Default is |
corpus_dir |
Explicit path to a single dataset Parquet directory (e.g.
|
The function uses a two-stage approach:
Stage 1: index each Parquet file individually (bounded memory, parallel, with resume support).
Stage 2: combine the per-file shard indexes into a single Parquet index.
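The two stages can be illustrated with a minimal base-R sketch. This is not the package's implementation: CSV files stand in for Parquet so that the example has no dependencies, the work runs sequentially rather than in parallel, and every name used here (`index_one_file()`, the `.shards` directory, the `id`/`file`/`row` column names) is hypothetical.

```r
# Illustrative two-stage indexing. CSV stands in for Parquet; all names
# below are hypothetical and not part of the package API.
corpus_dir <- file.path(tempdir(), "corpus_sketch")
unlink(corpus_dir, recursive = TRUE)
shard_dir <- file.path(corpus_dir, ".shards")
dir.create(shard_dir, recursive = TRUE, showWarnings = FALSE)

# Two tiny "corpus files".
write.csv(data.frame(id = c("W1", "W2")),
          file.path(corpus_dir, "part_0.csv"), row.names = FALSE)
write.csv(data.frame(id = "W3"),
          file.path(corpus_dir, "part_1.csv"), row.names = FALSE)

# Stage 1: index one file at a time, so memory use is bounded by a single
# file. A file whose shard already exists is skipped, which is what makes
# an interrupted run resumable.
index_one_file <- function(path) {
  shard <- file.path(shard_dir, paste0(basename(path), ".idx.csv"))
  if (file.exists(shard)) return(shard)
  ids <- read.csv(path, stringsAsFactors = FALSE)$id
  idx <- data.frame(
    id   = ids,
    file = basename(path),
    row  = seq_along(ids) - 1L,  # 0-indexed row number within the file
    stringsAsFactors = FALSE
  )
  write.csv(idx, shard, row.names = FALSE)
  shard
}

corpus_files <- sort(list.files(corpus_dir, pattern = "^part_.*\\.csv$",
                                full.names = TRUE))
shards <- vapply(corpus_files, index_one_file, character(1),
                 USE.NAMES = FALSE)

# Stage 2: combine the per-file shards into a single index.
index <- do.call(rbind, lapply(shards, read.csv, stringsAsFactors = FALSE))
rownames(index) <- NULL
print(index)
```

Because each shard depends only on its own input file, the Stage 1 loop could be handed to a parallel map without changing Stage 2.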
Paths can be supplied as a single root_dir (which iterates over all
requested data_sets) or as an explicit corpus_dir pointing to a single
dataset directory.
The index contains columns:
The OpenAlex ID
Block number computed as floor(numeric_id / 10000)
Relative path to the Parquet file in the corpus
Row number within the file (0-indexed)
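As a worked example of the block number, the sketch below applies floor(numeric_id / 10000) to two IDs. The helper `id_block()` is hypothetical, and the ID handling is an assumption: it treats an ID as one letter followed by digits, optionally preceded by the `https://openalex.org/` URL prefix.

```r
# Hypothetical helper, not part of the package API.
# Assumes IDs look like "W2741809807" or "https://openalex.org/W2741809807".
id_block <- function(ids, block_size = 10000) {
  short <- sub("^https?://openalex\\.org/", "", ids)      # long form -> short form
  numeric_id <- as.numeric(sub("^[A-Za-z]", "", short))   # drop the letter prefix
  floor(numeric_id / block_size)
}

blocks <- id_block(c("W2741809807", "https://openalex.org/W1234567890"))
print(blocks)
```

For these inputs the numeric IDs are 2741809807 and 1234567890, so the block numbers are 274180 and 123456.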
When corpus_dir is provided, invisibly returns the path to the
created index file. When root_dir is used, invisibly returns root_dir.
snapshot_to_parquet() for creating the Parquet corpus,
lookup_by_id() for ID-based record retrieval.
## Not run:
build_corpus_index(root_dir = "/Volumes/openalex")

build_corpus_index(
  root_dir = "/Volumes/openalex",
  data_sets = "works",
  workers = 4
)

# Single explicit directory:
build_corpus_index(
  corpus_dir = "/Volumes/openalex/parquet/works",
  memory_limit = "20GB"
)
## End(Not run)
Uses a pre-built index (created by build_corpus_index()) to locate records
efficiently and extract them from the Parquet corpus.
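The general idea of an index-driven lookup can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the index and the corpus are small in-memory data frames, and `lookup_sketch()` as well as the `id`, `file`, and `row` column names are hypothetical.

```r
# In-memory stand-ins for the index and the corpus files (hypothetical names).
index <- data.frame(
  id   = c("W1", "W2", "W3", "W4"),
  file = c("part_0", "part_0", "part_1", "part_1"),
  row  = c(0L, 1L, 0L, 1L),
  stringsAsFactors = FALSE
)
corpus <- list(
  part_0 = data.frame(id = c("W1", "W2"), title = c("Alpha", "Beta"),
                      stringsAsFactors = FALSE),
  part_1 = data.frame(id = c("W3", "W4"), title = c("Gamma", "Delta"),
                      stringsAsFactors = FALSE)
)

lookup_sketch <- function(ids, index, corpus) {
  # 1. Filter the index to the requested IDs; unknown IDs simply drop out.
  hits <- index[index$id %in% ids, , drop = FALSE]
  if (nrow(hits) == 0) return(corpus[[1]][0, , drop = FALSE])
  # 2. Group the hits by file so each file is touched only once.
  per_file <- split(hits, hits$file)
  # 3. Pull the matching rows; the index stores 0-indexed row numbers.
  parts <- lapply(names(per_file), function(f) {
    corpus[[f]][per_file[[f]]$row + 1L, , drop = FALSE]
  })
  out <- do.call(rbind, parts)
  rownames(out) <- NULL
  out
}

records <- lookup_sketch(c("W4", "W1", "W999"), index, corpus)
print(records)
```

Only the files that actually contain a requested ID are read, which is the point of building the index first.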
lookup_by_id(
  root_dir = NULL,
  ids,
  project_dir = NULL,
  data_sets = NULL,
  workers = NULL,
  progress = TRUE,
  verbose = TRUE,
  index_file = NULL,
  selected = NULL,
  output = NULL
)
root_dir |
Root directory containing |
ids |
Character vector of OpenAlex IDs to retrieve. Can be long form
(e.g. |
project_dir |
Project output directory. Extracted Parquet files are
written to |
data_sets |
Character vector of dataset names to search (e.g.
|
workers |
Number of parallel workers for reading corpus files. Default
is |
progress |
Ignored (kept for backward compatibility). |
verbose |
Print progress messages. Default is |
index_file |
Explicit path to an index Parquet file created by
|
selected |
Column selection passed to |
output |
Path to an output directory for writing results as Parquet
files when using |
Paths can be supplied as a root_dir + data_sets pair (which
automatically locates the correct index files and writes output into
project_dir) or as an explicit index_file for direct use.
index_file mode, output not NULL: invisibly returns output.
index_file mode, output is NULL: returns a data frame of matching
records.
root_dir mode: invisibly returns project_dir.
build_corpus_index() for building the required index,
snapshot_to_parquet() for creating the Parquet corpus.
## Not run:
# root_dir mode (searches multiple datasets)
lookup_by_id(
  root_dir = "/Volumes/openalex",
  ids = c("W2741809807", "W1234567890"),
  project_dir = "my_project",
  data_sets = "works"
)

# index_file mode (direct access, returns data frame)
records <- lookup_by_id(
  index_file = "works_id_index.parquet",
  ids = c("W2741809807", "W1234567890")
)

# index_file mode (write to parquet)
lookup_by_id(
  index_file = "works_id_index.parquet",
  ids = large_id_vector,
  output = "filtered_works",
  workers = 3
)
## End(Not run)
Stage 1: per-file shard indexes (parallel via rayon).
Stage 2: combine shards into <corpus_name>_id_idx.parquet.
oa_build_corpus_index(corpus_dir, workers, memory_limit, overwrite, verbose)
corpus_dir |
Path to a single dataset Parquet directory. |
workers |
Number of parallel workers for Stage 1. |
memory_limit |
DuckDB memory limit ( |
overwrite |
If |
verbose |
Print progress to stderr. |
Returns the path to the created index file as a character scalar.
Character scalar: path to the index file.
Reads the index file, filters to the requested IDs, and extracts matching
rows into the output directory (which must not already exist).
oa_lookup_by_id(index_file, ids, output, workers, verbose)
index_file |
Path to the index Parquet file (created by
|
ids |
Character vector of OpenAlex IDs (long or short form). |
output |
Output directory path. Must not already exist. |
workers |
Number of parallel workers for file extraction. |
verbose |
Print progress to stderr. |
Invisibly returns NULL.
Full pipeline: schema inference (per-dataset, cached in
<parquet_dir>/<dataset>/.schema_cache/unified_schema.csv) plus parallel
per-file COPY via rayon.
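A cached, unified schema can be sketched in base R as below. This is only an illustration: the per-file schemas are given as named character vectors, the rule of falling back to `VARCHAR` when sampled files disagree on a type is an assumption made for this example, and `unify_schemas()`, `cached_schema()`, and the `field`/`type` layout of the CSV are hypothetical.

```r
# Merge per-file schemas (field name -> type) into one unified schema.
# Assumption for this sketch: conflicting types fall back to "VARCHAR".
unify_schemas <- function(schemas) {
  fields <- unique(unlist(lapply(schemas, names)))
  types <- vapply(fields, function(f) {
    seen <- unique(unlist(lapply(schemas, function(s) {
      if (f %in% names(s)) s[[f]] else NULL
    })))
    if (length(seen) == 1) seen else "VARCHAR"
  }, character(1))
  data.frame(field = fields, type = unname(types), stringsAsFactors = FALSE)
}

# Reuse the cached schema when present; otherwise infer and cache it.
cached_schema <- function(schemas, cache_file) {
  if (file.exists(cache_file)) {
    return(read.csv(cache_file, stringsAsFactors = FALSE))
  }
  dir.create(dirname(cache_file), recursive = TRUE, showWarnings = FALSE)
  schema <- unify_schemas(schemas)
  write.csv(schema, cache_file, row.names = FALSE)
  schema
}

# Schemas from two sampled files (made-up fields and types).
sampled <- list(
  c(id = "VARCHAR", works_count = "BIGINT", cited_by_count = "BIGINT"),
  c(id = "VARCHAR", works_count = "BIGINT", cited_by_count = "DOUBLE",
    title = "VARCHAR")
)
cache_file <- file.path(tempdir(), "schema_sketch", ".schema_cache",
                        "unified_schema.csv")
unlink(cache_file)
schema <- cached_schema(sampled, cache_file)
print(schema)
```

Applying one unified schema to every file keeps the resulting Parquet files column-compatible even when a field is missing from some input files.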
oa_snapshot_to_parquet(
  snapshot_dir,
  parquet_dir,
  data_sets,
  workers,
  sample_size,
  memory_limit,
  temp_dir,
  verbose
)
snapshot_dir |
Path to the snapshot root (contains a |
parquet_dir |
Output directory for Parquet files. |
data_sets |
Character vector of dataset names, or |
workers |
Number of parallel workers ( |
sample_size |
Files to sample for schema inference ( |
memory_limit |
DuckDB memory limit, e.g. |
temp_dir |
DuckDB temp directory ( |
verbose |
Print progress to stderr. |
Invisibly returns NULL.
Converts OpenAlex snapshot .json.gz files to Parquet using schema
inference and parallel conversion. Paths can be supplied as a single
root_dir (which derives snapshot_dir and parquet_dir automatically)
or as explicit snapshot_dir and parquet_dir arguments.
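The two ways of supplying paths can be sketched as a small resolver. This is a hypothetical helper, not the package's code: `resolve_dirs()` does not exist in the package, and both subdirectory names are assumptions (`parquet` is suggested by the examples in this manual, while `openalex-snapshot` is a placeholder only).

```r
# Hypothetical resolver. The subdirectory names are assumptions and are
# exposed as arguments so they can be changed.
resolve_dirs <- function(root_dir = NULL, snapshot_dir = NULL,
                         parquet_dir = NULL,
                         snapshot_subdir = "openalex-snapshot",  # assumed
                         parquet_subdir = "parquet") {           # assumed
  if (!is.null(root_dir)) {
    # root_dir mode: derive both directories from the root.
    return(list(
      snapshot_dir = file.path(root_dir, snapshot_subdir),
      parquet_dir  = file.path(root_dir, parquet_subdir)
    ))
  }
  # Explicit mode: both paths must be given.
  if (is.null(snapshot_dir) || is.null(parquet_dir)) {
    stop("Supply either root_dir, or both snapshot_dir and parquet_dir.")
  }
  list(snapshot_dir = snapshot_dir, parquet_dir = parquet_dir)
}

dirs <- resolve_dirs(root_dir = "/Volumes/openalex")
print(dirs$parquet_dir)
```

Failing early when neither mode is fully specified avoids starting a long conversion with an ambiguous output location.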
snapshot_to_parquet(
  root_dir = NULL,
  data_sets = NULL,
  workers = NULL,
  sample_size = 20,
  memory_limit = NULL,
  temp_directory = NULL,
  progress = TRUE,
  verbose = TRUE,
  snapshot_dir = NULL,
  parquet_dir = NULL
)
root_dir |
Root directory. If provided, |
data_sets |
Character vector of dataset names to convert (e.g.
|
workers |
Number of parallel workers for file conversion. Default is
|
sample_size |
Number of |
memory_limit |
DuckDB memory limit per worker (e.g., |
temp_directory |
Location of the temporary directory for DuckDB.
Default is |
progress |
Ignored (kept for backward compatibility). |
verbose |
Print per-dataset progress messages. Default is |
snapshot_dir |
Explicit path to the snapshot data directory (the one
containing a |
parquet_dir |
Explicit path to the Parquet output directory. Required
when |
Invisibly returns NULL.
build_corpus_index() for indexing the resulting Parquet files,
lookup_by_id() for ID-based record retrieval.
## Not run:
snapshot_to_parquet(root_dir = "/Volumes/openalex")

snapshot_to_parquet(
  root_dir = "/Volumes/openalex",
  data_sets = c("authors", "works"),
  workers = 4,
  memory_limit = "8GB"
)

# Explicit paths (no root_dir):
snapshot_to_parquet(
  snapshot_dir = "/data/openalex-snapshot",
  parquet_dir = "/data/parquet",
  data_sets = "authors"
)
## End(Not run)