Package 'openalexConvert'

Title: Convert OpenAlex Parquet (from openalexPro) to other Formats
Description: Utilities to convert an OpenAlex parquet/Arrow corpus into other formats. Currently implemented are CSL JSON and, via Pandoc, BibTeX, BibLaTeX, Markdown, LaTeX, HTML, and PDF. Uses DuckDB over Arrow for efficient chunked CSL JSON conversion.
Authors: Rainer M Krug
Maintainer: Rainer M Krug <[email protected]>
License: GPL (>= 2)
Version: 0.0.3
Built: 2026-06-03 18:32:08 UTC
Source: https://github.com/openalexPro/openalexConvert


One-shot export via CSL JSON + Pandoc

Description

Convenience wrapper that maps a corpus to CSL JSON, then converts it to the desired output format via Pandoc.

Usage

corpus_export_via_pandoc(
  corpus,
  output,
  to = c("bibtex", "biblatex"),
  csl_tmp = NULL,
  ...
)

Arguments

corpus

Arrow Dataset/Table or data.frame/tibble of works.

output

Path to the final file (e.g., 'corpus.bib').

to

Target format passed to Pandoc (e.g., '"bibtex"', '"biblatex"').

csl_tmp

Optional path for a temporary CSL JSON directory. If 'NULL', a temporary directory is used and removed afterwards.

...

Additional arguments passed to 'corpus_to_csljson()' (e.g., 'chunk_size').

Value

Invisibly returns the normalized path to the created file.
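
As a rough sketch of what the description implies, the two documented steps can also be chained by hand. The helper name export_two_step is hypothetical and not part of the package; only corpus_to_csljson() and csljson_convert_pandoc(), with the arguments shown, come from this manual. For a chunk directory and to = "bibtex", csljson_convert_pandoc() is documented to write chunk_*.bib files into an output directory, so this sketch may not reproduce the single-file result of corpus_export_via_pandoc().

```r
# Hypothetical helper, not part of openalexConvert: chains the two
# documented steps manually. Requires the package only when it is called.
export_two_step <- function(corpus, out_dir, to = "bibtex") {
  # Temporary directory for the intermediate CSL JSON chunks
  csl_dir <- tempfile("csljson_")
  on.exit(unlink(csl_dir, recursive = TRUE), add = TRUE)

  # Step 1: corpus -> chunked CSL JSON (chunk_1.json, chunk_2.json, ...)
  openalexConvert::corpus_to_csljson(corpus = corpus, output = csl_dir)

  # Step 2: CSL JSON chunks -> target format via Pandoc.
  # With a directory input and a bib* target, out_dir receives chunk_*.bib.
  openalexConvert::csljson_convert_pandoc(
    csljson = csl_dir,
    output  = out_dir,
    to      = to
  )
}
```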


Convert a corpus to CSL JSON (chunked)

Description

Maps an OpenAlex-like corpus (Arrow Dataset/Table or data.frame/tibble) to CSL JSON items and writes them into chunked files. The function creates the directory output (if not present) and writes files chunk_1.json, chunk_2.json, ... inside that directory.

Usage

corpus_to_csljson(
  project_dir,
  corpus = file.path(project_dir, "parquet"),
  output = file.path(project_dir, "csljson"),
  chunk_size = 10000,
  overwrite = FALSE,
  verbose = TRUE
)

Arguments

project_dir

Optional path to project directory. If provided, used to set default values for corpus and output parameters. Can be omitted if corpus and output are specified explicitly.

corpus

Path to a parquet dataset, an Arrow Dataset/Table (e.g., from arrow::open_dataset()), or a data.frame/tibble (e.g., from dplyr::collect()).

output

Path to a directory to create and populate with chunked CSL JSON files (chunk_1.json, chunk_2.json, ...).

chunk_size

Rows processed per chunk via DuckDB. Default: 10000.

overwrite

Overwrite output if it exists. Default: FALSE.

verbose

Print progress messages. Default: TRUE.

Details

This converter targets the most common OpenAlex field layout and is resilient to missing columns by falling back to NULL/empty values in SQL. Mapping includes: title, year, DOI, container-title (venue), volume/issue/pages, authors (with basic given/family split and ORCID when present), URL/abstract, publisher and ISSN, language, keywords (collapsed to a single string), and an aggregated note with OA status and citation count. Records are processed in DuckDB-backed chunks for low memory usage.
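
To make the mapping more concrete, the following sketch builds a single CSL JSON item as an R list. It is an illustration only: the package performs the mapping in SQL, the helper split_name() is hypothetical, the naive "last word is the family name" rule is an assumption about what a "basic given/family split" could look like, and the record values and the note format are invented placeholders.

```r
# Hypothetical helper: naive given/family split on the last whitespace.
# This is an assumption for illustration, not the package's actual rule.
split_name <- function(display_name) {
  parts <- strsplit(trimws(display_name), "\\s+")[[1]]
  n <- length(parts)
  if (n <= 1) {
    return(list(family = display_name))
  }
  list(given = paste(parts[-n], collapse = " "), family = parts[n])
}

# One CSL JSON item expressed as an R list (all values are placeholders).
csl_item <- list(
  id                = "W0000000001",
  type              = "article-journal",
  title             = "An example title",
  issued            = list(`date-parts` = list(list(2020L))),
  DOI               = "10.0000/example",
  `container-title` = "Journal of Examples",
  author            = list(split_name("Ada M. Lovelace")),
  note              = "OA status: gold; cited by: 12"  # format is assumed
)

str(csl_item$author)
```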

Value

Invisibly returns normalizePath(output).
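
The chunk layout described above can be sketched in base R. The helper chunk_plan() is hypothetical and only computes which rows would land in which file for a given chunk_size; the package itself does the chunking through DuckDB, and the assumption here is that chunks are filled in row order.

```r
# Hypothetical helper, not part of openalexConvert: computes the file name
# and row range of each chunk for a corpus with n_rows records.
chunk_plan <- function(n_rows, chunk_size = 10000) {
  stopifnot(n_rows >= 0, chunk_size >= 1)
  n_chunks <- ceiling(n_rows / chunk_size)
  idx <- seq_len(n_chunks)
  data.frame(
    file      = sprintf("chunk_%d.json", idx),
    first_row = (idx - 1) * chunk_size + 1,
    last_row  = pmin(idx * chunk_size, n_rows),
    stringsAsFactors = FALSE
  )
}

# 25000 records with the default chunk size give three chunk files,
# the last one only partially filled.
plan <- chunk_plan(25000, chunk_size = 10000)
print(plan)
```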


Convert CSL JSON (file or directory) via Pandoc

Description

Converts CSL JSON with Pandoc into one of: BibTeX, BibLaTeX, Docx, Markdown, LaTeX, HTML, or PDF. Behavior depends on to:

  • bibtex/biblatex: creates bibliography files. For a directory of chunks, writes chunk_*.bib into the directory given by output. For a single file, writes the specified output (appends .bib if missing).

  • docx/markdown/latex/pdf: renders a formatted references document using citeproc. For a directory of chunks, writes references.<ext> inside output. For a single file, writes to output (appends extension if missing).
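
The "appends extension if missing" behavior could look roughly like the sketch below. The helper ensure_ext() is hypothetical, and how the package treats a path that already carries a different extension is not documented here; this sketch simply appends in that case.

```r
# Hypothetical helper, not part of openalexConvert: append the target
# extension unless the path already ends in it (case-insensitive).
ensure_ext <- function(path, ext) {
  if (tolower(tools::file_ext(path)) == tolower(ext)) {
    path
  } else {
    paste0(path, ".", ext)
  }
}

ensure_ext("corpus", "bib")      # extension is added
ensure_ext("corpus.bib", "bib")  # path is returned unchanged
```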

Usage

csljson_convert_pandoc(
  csljson,
  output,
  to = c("biblatex", "bibtex", "docx", "markdown", "latex", "html", "pdf"),
  from = "csljson",
  overwrite = FALSE,
  verbose = TRUE,
  references_csl = NULL,
  pdf_engine = "xelatex",
  pdf_mainfont = NULL,
  pdf_sansfont = NULL,
  pdf_monofont = NULL,
  pdf_cjk_mainfont = NULL,
  pdf_cjk_options = NULL
)

Arguments

csljson

Path to a CSL JSON file (array) or a directory created by corpus_to_csljson() containing chunk_*.json files.

output

Output path. For bib* with a file input, this is the target .bib file (extension added if missing). For bib* with a directory input, this is the output directory. For formatted references (docx, markdown, latex, pdf), this is the output file (file input) or the output directory (dir input; file will be references.<ext> within).

to

One of "biblatex", "bibtex", "docx", "markdown", "latex", "html", or "pdf".

from

Source format; defaults to "csljson".

overwrite

Logical; overwrite existing output file(s). Defaults to FALSE.

verbose

Print progress messages.

references_csl

Optional path to a CSL style file (e.g., apa.csl). If NULL, Pandoc's default style is used.

pdf_engine

LaTeX engine used when to = "pdf". Common values are "xelatex" (default, good Unicode support), "lualatex", or "pdflatex". Passed to Pandoc as --pdf-engine.

pdf_mainfont

Main text font name for PDF output (used with XeLaTeX/LuaLaTeX). Sets Pandoc variable mainfont (e.g., -V mainfont=Source Serif Pro).

pdf_sansfont

Sans‑serif font name for PDF output. Sets Pandoc variable sansfont.

pdf_monofont

Monospace font name for PDF output. Sets Pandoc variable monofont.

pdf_cjk_mainfont

Main CJK font name for PDF output. Sets Pandoc variable CJKmainfont for better East‑Asian typography.

pdf_cjk_options

Additional CJK options passed via Pandoc variable CJKoptions (e.g., feature flags accepted by xeCJK).

Details

Requires Pandoc to be available. In RStudio, a bundled Pandoc is usually available; otherwise install Pandoc and ensure it is on PATH.
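
Before converting, it can help to check whether a Pandoc binary is reachable. The sketch below uses only base R; find_pandoc() is a hypothetical helper, and falling back to the RSTUDIO_PANDOC environment variable is an assumption about how the RStudio-bundled Pandoc might be located, not a description of what the package does.

```r
# Hypothetical helper, not part of openalexConvert: returns the path to a
# Pandoc executable, or "" if none is found.
find_pandoc <- function() {
  # 1. Look on the PATH
  on_path <- unname(Sys.which("pandoc"))
  if (nzchar(on_path)) {
    return(on_path)
  }
  # 2. Fall back to the directory RStudio advertises, if any
  rstudio_dir <- Sys.getenv("RSTUDIO_PANDOC")
  if (nzchar(rstudio_dir)) {
    candidate <- file.path(rstudio_dir, c("pandoc", "pandoc.exe"))
    candidate <- candidate[file.exists(candidate)]
    if (length(candidate) > 0) {
      return(candidate[1])
    }
  }
  ""
}

pandoc <- find_pandoc()
if (nzchar(pandoc)) {
  message("Pandoc found at: ", pandoc)
} else {
  message("Pandoc not found; install it and make sure it is on PATH.")
}
```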

When rendering to = "pdf", this function maps the supplied PDF options to Pandoc command line flags and variables as follows:

  • pdf_engine → --pdf-engine=<engine>

  • pdf_mainfont, pdf_sansfont, pdf_monofont → -V mainfont=..., -V sansfont=..., -V monofont=...

  • pdf_cjk_mainfont, pdf_cjk_options → -V CJKmainfont=..., -V CJKoptions=...

Use these to ensure Unicode coverage and consistent typography, especially for multilingual bibliographies.
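
The mapping listed above can be sketched as a small function that assembles the Pandoc argument vector. pdf_pandoc_args() is a hypothetical helper written for illustration; the package's internal code may build the arguments differently, but the flags and variable names are the ones documented here.

```r
# Hypothetical helper, not part of openalexConvert: turns the PDF options
# into a character vector of Pandoc command line arguments.
pdf_pandoc_args <- function(pdf_engine = "xelatex",
                            pdf_mainfont = NULL,
                            pdf_sansfont = NULL,
                            pdf_monofont = NULL,
                            pdf_cjk_mainfont = NULL,
                            pdf_cjk_options = NULL) {
  vars <- list(
    mainfont    = pdf_mainfont,
    sansfont    = pdf_sansfont,
    monofont    = pdf_monofont,
    CJKmainfont = pdf_cjk_mainfont,
    CJKoptions  = pdf_cjk_options
  )
  # Drop options that were not supplied
  vars <- Filter(Negate(is.null), vars)

  args <- paste0("--pdf-engine=", pdf_engine)
  for (name in names(vars)) {
    args <- c(args, "-V", paste0(name, "=", vars[[name]]))
  }
  args
}

args <- pdf_pandoc_args(pdf_mainfont = "Source Serif Pro")
print(args)
```

Keeping each argument as a separate vector element, rather than pasting one command string, avoids quoting problems with font names that contain spaces when the vector is handed to something like system2().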

Value

Invisibly returns the created file path(s).


Upload CSL JSON batch files to a Zotero group library (with progress bar)

Description

Upload one or more CSL-JSON files to a Zotero group library using the Zotero Web API. Each file is assumed to contain an array of item objects suitable for POSTing to /groups/{group_id}/items.

Progress is reported via the progressr package, which integrates with future / future.apply progress handlers.

Usage

csljson_to_zotero_upload(
  files,
  group_id,
  api_key = Sys.getenv("ZOTERO_API_KEY"),
  pause = 0.5
)

Arguments

files

Character vector. Paths to CSL-JSON files. If a single element points to an existing directory, all *.json files in that directory (non-recursive) are used.

group_id

Character or numeric. Zotero group ID.

api_key

Character. Zotero API key with write access to the group. Defaults to Sys.getenv("ZOTERO_API_KEY").

pause

Numeric scalar. Number of seconds to wait between requests (to avoid rate limiting). Default is 0.5.
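
The directory handling described for files might be sketched as follows. resolve_json_files() is a hypothetical helper and the demo directory is created only for this illustration; the package's own resolution logic may differ in detail.

```r
# Hypothetical helper, not part of openalexConvert: expands a single
# directory path into the *.json files directly inside it.
resolve_json_files <- function(files) {
  if (length(files) == 1 && dir.exists(files)) {
    files <- list.files(
      files,
      pattern    = "\\.json$",
      full.names = TRUE,
      recursive  = FALSE
    )
  }
  files
}

# Demo with a throwaway directory containing two batches and one other file
demo_dir <- file.path(tempdir(), "csljson_batches_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(
  file.path(demo_dir, c("items_0001.json", "items_0002.json", "notes.txt"))
))

found <- resolve_json_files(demo_dir)
print(basename(found))
```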

Value

A data.frame with one row per file and columns:

file

Path to the CSL-JSON file.

status_code

HTTP status code returned by the Zotero API.

ok

Logical; TRUE if status_code is in 200–299.

message

Character; short message or error text (possibly truncated).

Invisibly returns this data.frame.
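
To illustrate the shape of the return value, the sketch below builds a mock result by hand and derives ok from status_code as described above. The file names, status codes, and messages are invented placeholders, not output from the Zotero API.

```r
# Mock result with the documented columns (all values are placeholders)
status_code <- c(200L, 403L, 204L)

res <- data.frame(
  file        = c("items_0001.json", "items_0002.json", "items_0003.json"),
  status_code = status_code,
  ok          = status_code >= 200L & status_code <= 299L,
  message     = c("OK", "Forbidden", "No Content"),
  stringsAsFactors = FALSE
)

# Files that would need to be retried
failed <- res$file[!res$ok]
print(failed)
```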

Examples

## Not run: 
library(progressr)
handlers(global = TRUE)
handlers("cli")  # or "progress", "txtprogressbar", etc.

res <- csljson_to_zotero_upload(
  files    = "csljson_batches",  # directory with items_0001.json, ...
  group_id = "123456",
  api_key  = Sys.getenv("ZOTERO_API_KEY")
)

subset(res, !ok)  # inspect failures

## End(Not run)