| Title: | Convert OpenAlex Parquet (from openalexPro) to other Formats |
|---|---|
| Description: | Utilities to convert an OpenAlex Parquet/Arrow corpus into other formats. Currently supported targets are CSL JSON and, via Pandoc, BibTeX, BibLaTeX, Markdown, LaTeX, HTML, and PDF. Uses DuckDB over Arrow for efficient chunked CSL JSON conversion. |
| Authors: | Rainer M Krug |
| Maintainer: | Rainer M Krug <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 0.0.3 |
| Built: | 2026-06-03 18:32:08 UTC |
| Source: | https://github.com/openalexPro/openalexConvert |
Convenience wrapper that maps a corpus to CSL JSON, then converts it to the desired output format via Pandoc.
corpus_export_via_pandoc( corpus, output, to = c("bibtex", "biblatex"), csl_tmp = NULL, ... )
corpus |
Arrow Dataset/Table or data.frame/tibble of works. |
output |
Path to the final file (e.g., 'corpus.bib'). |
to |
Target format passed to Pandoc (e.g., "bibtex", "biblatex"). |
csl_tmp |
Optional path for a temporary CSL JSON directory. If NULL, a temporary directory is used and removed afterwards. |
... |
Additional arguments passed to corpus_to_csljson() (e.g., chunk_size). |
Invisibly returns the normalized path to the created file.
Maps an OpenAlex-like corpus (Arrow Dataset/Table or data.frame/tibble) to
CSL JSON items and writes them into chunked files. The function creates the
directory output (if not present) and writes files chunk_1.json,
chunk_2.json, ... inside that directory.
corpus_to_csljson( project_dir, corpus = file.path(project_dir, "parquet"), output = file.path(project_dir, "csljson"), chunk_size = 10000, overwrite = FALSE, verbose = TRUE )
project_dir |
Optional path to project directory. If provided, used to
set default values for corpus and output. |
corpus |
Path to parquet dataset, parquet Dataset/Table (e.g., from
|
output |
Path to a directory to create and populate with chunked CSL
JSON files ( |
chunk_size |
Rows processed per chunk via DuckDB. Default: 10000. |
overwrite |
Overwrite |
verbose |
Print progress messages. Default: TRUE. |
This converter targets the most common OpenAlex field layout and is resilient
to missing columns by falling back to NULL/empty values in SQL. Mapping
includes: title, year, DOI, container-title (venue), volume/issue/pages,
authors (with basic given/family split and ORCID when present), URL/abstract,
publisher and ISSN, language, keywords (collapsed to a single string), and an
aggregated note with OA status and citation count. Records are processed in
DuckDB-backed chunks for low memory usage.
Invisibly returns normalizePath(output).
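The chunked layout described above can be sketched in base R. This is only an illustration of how row ranges might map to the chunk_1.json, chunk_2.json, ... files; the helper name plan_chunks() is hypothetical, and the use of offset/limit windows is an assumption, not a statement about how the package queries DuckDB internally.

```r
# Hypothetical helper (not part of the package): plan which rows would go
# into which chunk file for a corpus of n_rows records.
plan_chunks <- function(n_rows, chunk_size = 10000, output = "csljson") {
  stopifnot(n_rows >= 0, chunk_size >= 1)
  n_chunks <- ceiling(n_rows / chunk_size)
  idx <- seq_len(n_chunks)
  offset <- (idx - 1) * chunk_size
  data.frame(
    file = file.path(output, sprintf("chunk_%d.json", idx)),
    offset = offset,                               # rows skipped before this chunk
    limit = pmin(chunk_size, n_rows - offset),     # rows contained in this chunk
    stringsAsFactors = FALSE
  )
}

# 25000 rows with the default chunk size give three chunks;
# the last one holds the remaining 5000 rows.
plan <- plan_chunks(25000)
print(plan)
```

A smaller chunk_size lowers peak memory at the cost of more files in the output directory.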
Converts CSL JSON with Pandoc into one of: BibTeX, BibLaTeX, Docx, Markdown,
LaTeX, HTML, or PDF. Behavior depends on to:
bibtex/biblatex: creates bibliography files. For a directory of chunks,
writes chunk_*.bib into the directory given by output. For a single
file, writes the specified output (appends .bib if missing).
docx/markdown/latex/html/pdf: renders a formatted references document
using citeproc. For a directory of chunks, writes references.<ext> inside
output. For a single file, writes to output (appends extension if
missing).
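The "appends extension if missing" rule for single-file output can be pictured with a short base R sketch. The helper resolve_output() is hypothetical, and the extension chosen for each format (for example md for Markdown) is an assumption made for illustration; the package may pick different extensions.

```r
# Hypothetical helper (not part of the package): append the extension that
# matches the target format unless the output path already carries it.
resolve_output <- function(output, to) {
  ext <- switch(to,
    bibtex = ,
    biblatex = "bib",
    docx = "docx",
    markdown = "md",   # assumed extension
    latex = "tex",     # assumed extension
    html = "html",
    pdf = "pdf",
    stop("unsupported format: ", to)
  )
  if (tolower(tools::file_ext(output)) == ext) {
    output
  } else {
    paste0(output, ".", ext)
  }
}

print(resolve_output("refs", "bibtex"))
print(resolve_output("refs.bib", "biblatex"))
print(resolve_output("out/references", "pdf"))
```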
csljson_convert_pandoc( csljson, output, to = c("biblatex", "bibtex", "docx", "markdown", "latex", "html", "pdf"), from = "csljson", overwrite = FALSE, verbose = TRUE, references_csl = NULL, pdf_engine = "xelatex", pdf_mainfont = NULL, pdf_sansfont = NULL, pdf_monofont = NULL, pdf_cjk_mainfont = NULL, pdf_cjk_options = NULL )
csljson |
Path to a CSL JSON file (array) or a directory created by
|
output |
Output path. For |
to |
One of |
from |
Source format; defaults to "csljson". |
overwrite |
Logical; overwrite existing output file(s). Defaults to FALSE. |
verbose |
Print progress messages. |
references_csl |
Optional path to a CSL style file (e.g., apa.csl). If NULL, Pandoc's default style is used. |
pdf_engine |
LaTeX engine used when |
pdf_mainfont |
Main text font name for PDF output (used with
XeLaTeX/LuaLaTeX). Sets Pandoc variable |
pdf_sansfont |
Sans‑serif font name for PDF output. Sets Pandoc
variable |
pdf_monofont |
Monospace font name for PDF output. Sets Pandoc
variable |
pdf_cjk_mainfont |
Main CJK font name for PDF output. Sets Pandoc
variable |
pdf_cjk_options |
Additional CJK options passed via Pandoc variable
|
Requires Pandoc to be available. In RStudio, a bundled Pandoc is usually available; otherwise install Pandoc and ensure it is on PATH.
When rendering to = "pdf", this function maps the supplied PDF options to
Pandoc command line flags and variables as follows:
pdf_engine → --pdf-engine=<engine>
pdf_mainfont, pdf_sansfont, pdf_monofont → -V mainfont=...,
-V sansfont=..., -V monofont=...
pdf_cjk_mainfont, pdf_cjk_options → -V CJKmainfont=...,
-V CJKoptions=...
Use these to ensure Unicode coverage and consistent typography, especially for multilingual bibliographies.
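The mapping listed above can be sketched as a small base R function that assembles a Pandoc argument vector. The helper build_pdf_args() is hypothetical and only mirrors the documented mapping; it is not the package's implementation, and the font names in the usage lines are example values.

```r
# Hypothetical helper (not part of the package): turn the PDF options into
# Pandoc command line arguments following the documented mapping.
build_pdf_args <- function(pdf_engine = "xelatex",
                           pdf_mainfont = NULL,
                           pdf_sansfont = NULL,
                           pdf_monofont = NULL,
                           pdf_cjk_mainfont = NULL,
                           pdf_cjk_options = NULL) {
  vars <- list(
    mainfont = pdf_mainfont,
    sansfont = pdf_sansfont,
    monofont = pdf_monofont,
    CJKmainfont = pdf_cjk_mainfont,
    CJKoptions = pdf_cjk_options
  )
  vars <- Filter(Negate(is.null), vars)  # keep only the options that were set
  args <- paste0("--pdf-engine=", pdf_engine)
  for (name in names(vars)) {
    args <- c(args, "-V", paste0(name, "=", vars[[name]]))
  }
  args
}

# Example values only; any installed font names could be used here.
args <- build_pdf_args(
  pdf_mainfont = "Noto Serif",
  pdf_cjk_mainfont = "Noto Serif CJK SC"
)
print(args)
```

Options left at NULL produce no -V entry, so Pandoc falls back to its template defaults for those variables.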
Invisibly returns the created file path(s).
Upload one or more CSL-JSON files to a Zotero group library using the
Zotero Web API. Each file is assumed to contain an array of item objects
suitable for POSTing to /groups/{group_id}/items.
Progress is reported via the progressr package, which integrates with future / future.apply progress handlers.
csljson_to_zotero_upload( files, group_id, api_key = Sys.getenv("ZOTERO_API_KEY"), pause = 0.5 )
files |
Character vector. Paths to CSL-JSON files. If a single element
points to an existing directory, all |
group_id |
Character or numeric. Zotero group ID. |
api_key |
Character. Zotero API key with write access to the group.
Defaults to Sys.getenv("ZOTERO_API_KEY"). |
pause |
Numeric scalar. Number of seconds to wait between requests
(to avoid rate limiting). Default is 0.5. |
A data.frame with one row per file and columns:
Path to the CSL-JSON file.
HTTP status code returned by the Zotero API.
Logical; TRUE if status_code is in 200–299.
Character; short message or error text (possibly truncated).
Invisibly returns this data.frame.
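To make the shape of the result concrete, the following base R sketch builds a mock result table and derives the success flag from the status code. The rows are invented for illustration, and the column names file and message are assumptions (only status_code and ok are named in the text above); no request is sent to Zotero.

```r
# Mock result table with invented rows; column names `file` and `message`
# are assumed for illustration.
res <- data.frame(
  file = c("items_0001.json", "items_0002.json", "items_0003.json"),
  status_code = c(200L, 403L, 204L),
  message = c("OK", "Forbidden", "No Content"),
  stringsAsFactors = FALSE
)

# A request counts as successful when the status code is in 200-299.
res$ok <- res$status_code >= 200L & res$status_code <= 299L

# Keep only the failed uploads, e.g. to retry them later.
failed <- res[!res$ok, ]
print(failed)
```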
## Not run:
library(progressr)
handlers(global = TRUE)
handlers("cli") # or "progress", "txtprogressbar", etc.
res <- csljson_to_zotero_upload(
  files = "csljson_batches", # directory with items_0001.json, ...
  group_id = "123456",
  api_key = Sys.getenv("ZOTERO_API_KEY")
)
subset(res, !ok) # inspect failures
## End(Not run)