tosr: Create the Tree of Science from WoS and Scopus

The R package 'tosr' enables the construction of the Tree of Science (ToS), a metaphorical representation of scientific papers on a specific topic. The ToS's roots symbolize seminal works, the trunk stands for structural works, and the leaves represent the latest research.


INTRODUCTION
The digital age has fueled an exponential surge in the production and accessibility of academic literature. While reachability was an issue before the 20th century, the current challenge lies in managing the overwhelming volume of new research. In response, the Tree of Science (ToS) metaphor has proven instrumental in streamlining the identification of pertinent works within academic literature. [1] Leveraging graph theory, the ToS algorithm positions papers within a tree structure: classic or seminal works as roots, structural works as the trunk, and the latest research as leaves. [2,3] An evolved version of this algorithm, known as the SAP algorithm, has further improved the accuracy of results, especially within the leaves. [4] Building on this work, the 'tosr' package seeks to automate the SAP algorithm while facilitating the merger of Scopus and Web of Science data.
The SAP algorithm, extensively detailed by Valencia-Hernandez et al., [4] has expedited the creation of review papers. For instance, Duque et al. [5] and Tabares & Duque [6] have effectively applied it to identify significant literature on social economy and school cyberbullying, respectively. The integration of the SAP algorithm with machine learning techniques has enabled the testing of different software. [7] The ToS has also proved valuable in guiding early-career researchers toward essential academic papers on specific topics. Notable examples include Rubio et al.'s discourse on the governance of tourist destinations [8] and Uribe et al.'s work on blended learning in education. [9] Additionally, the ToS has enhanced the visibility of academic papers within the scientific community, as evidenced by the increased citation rates of works like Grisales-Aguirre et al., [10] Ariza-Colpas et al., [11] and Hernández-Leal. [12]
Figure 1 elucidates the overall workflow of 'tosr'. To generate a mixed-source Tree of Science, users should download two files, one from Scopus (.bib) and another from Web of Science (WoS, .txt), both containing the references of each paper. Subsequently, an RStudio Cloud session should be initiated, activating the 'tosr', 'bibliometrix', [13] and 'tidyverse' [14] libraries for data interaction.
The primary function of the package, tosR(), accepts the two files as input and transforms the data into a data frame that designates papers as roots, trunks, and leaves. The tosr_load() function creates three separate objects (a merged data frame, a citation network, and a table of node names) that are ready for in-depth analysis using other programs.

Overview
A stable version of tosr is available at CRAN (Comprehensive R Archive Network), and the current development version is on GitHub.

"tosr" architecture
The 'tosr' package comprises three core functions: tosR(), tosr_load(), and tosSAP(). Its operation requires two input files: a .txt file from WoS encompassing all records and cited references, and a .bib file from Scopus containing comprehensive data (Figure 1). Using the bibliometrix package's convert2df function, [13] 'tosr' converts these files into data frames. The mergeBD function [13] subsequently merges the data without the references, which makes it possible to create an ID_TOS derived from the reference data, specifically the first author's last name and the publication year. Notably, despite the distinct reference formats of WoS and Scopus, their ID_TOS data align.
In the concise path, the user submits the WoS and Scopus files to the tosR() function, which returns a data frame classifying papers into roots, trunks, and leaves. Conversely, the comprehensive path requires both the tosr_load() and tosSAP() functions. The former receives the Scopus and WoS files and generates a citation network, a merged data frame, and a data frame containing reference names (WoS and Scopus). This path benefits users seeking a refined citation network analysis. The subsequent tosSAP() function takes these three objects and yields a data frame that classifies papers into roots, trunks, and leaves.
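To make the shared identifier more concrete, the following is a minimal sketch in base R of how a key built from the first author's last name and the publication year could be derived from the two reference styles. The helper name make_id_tos(), the regular expressions, and the two sample reference strings are illustrative assumptions; they are not taken from the 'tosr' source code, and the package's own rules may differ.

```r
# Hypothetical helper (NOT part of 'tosr'): build a "LASTNAME YEAR" key
# from a single raw reference string in WoS-like or Scopus-like style.
make_id_tos <- function(reference) {
  # First author's last name: leading run of letters, apostrophes, hyphens
  last_name <- toupper(sub("^([[:alpha:]'-]+).*$", "\\1", reference))
  # Publication year: first four-digit number starting with 19 or 20
  year <- regmatches(reference, regexpr("(19|20)[0-9]{2}", reference))
  paste(last_name, year)
}

# Assumed sample strings imitating each database's reference layout
wos_ref    <- "HIRSCH JE, 2005, P NATL ACAD SCI USA, V102, P16569"
scopus_ref <- "Hirsch, J.E., An index to quantify an individual's scientific research output (2005) Proc. Natl. Acad. Sci."

id_wos    <- make_id_tos(wos_ref)
id_scopus <- make_id_tos(scopus_ref)

# Both styles collapse to the same key, so the records can be matched
identical(id_wos, id_scopus)
```

A key this coarse can collide when one author publishes several papers in the same year, so a real implementation would likely need additional disambiguation.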

"tosr" functionalities
The 'tosr' package generates a citation network from WoS and Scopus files. Because the two databases use different reference formats, 'tosr' establishes a common identifier (ID_TOS) drawn from both reference types, enabling their unification and merging. In this citation network, nodes represent papers, while links represent references connecting two papers. Thus, if paper A cites paper B, a link is created from A to B. The resulting graph is directed, and each link runs in one direction only, because the cited paper cannot reciprocate the citation.
The 'tosr' package ensures the citation network's integrity by extracting the giant component [21] and eliminating nodes (papers) exhibiting an in-degree of 0 and an out-degree of 1.
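The two cleaning steps described above can be sketched with the igraph package on a toy network. The edge list is assumed data, and the single pruning pass shown here is an assumption about how the filter is applied, not a reproduction of the 'tosr' internals.

```r
library(igraph)

# Toy edge list (assumed data): each row means "from cites to"
edges <- data.frame(
  from = c("A", "B", "C", "C", "D", "E", "X"),
  to   = c("R1", "R1", "A", "B", "C", "R1", "Y")
)
g <- graph_from_data_frame(edges, directed = TRUE)

# 1. Keep the giant (largest weakly connected) component
comp  <- components(g, mode = "weak")
giant <- induced_subgraph(g, which(comp$membership == which.max(comp$csize)))

# 2. Drop papers that nobody cites and that cite exactly one paper
in_deg  <- degree(giant, mode = "in")
out_deg <- degree(giant, mode = "out")
pruned  <- delete_vertices(giant, which(in_deg == 0 & out_deg == 1))

sort(V(pruned)$name)
```

In this toy network the isolated pair X and Y falls outside the giant component, and D and E are removed by the degree filter because each cites a single paper and is cited by none.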
The refined network is then ready for the SAP algorithm. Mirroring the flow of sap in a tree, the SAP algorithm first identifies the most frequently cited papers (those with a high citation count and zero out-degree), the papers with the highest number of references to others, and those with significant betweenness. Once these metrics are determined, the SAP algorithm identifies the shortest paths among the paper groups. Finally, the algorithm classifies papers into roots, trunks, and leaves (for a detailed explanation, refer to Valencia-Hernandez et al. [4]).
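The three metrics named above can be illustrated with standard igraph centrality calls. This is only a sketch on assumed toy data: it computes the metrics and picks simple candidates, but it does not reproduce the shortest-path step or the exact selection rules of the SAP algorithm, for which Valencia-Hernandez et al. [4] remain the reference.

```r
library(igraph)

# Toy citation network (assumed data): each row means "from cites to"
edges <- data.frame(
  from = c("L1", "L1", "L1", "L2", "T1", "T1", "L2"),
  to   = c("T1", "R1", "R2", "T1", "R1", "R2", "R2")
)
g <- graph_from_data_frame(edges, directed = TRUE)

metrics <- data.frame(
  paper       = V(g)$name,
  in_degree   = degree(g, mode = "in"),    # times cited within the network
  out_degree  = degree(g, mode = "out"),   # references made within the network
  betweenness = betweenness(g, directed = TRUE),
  stringsAsFactors = FALSE
)

# Most cited papers with zero out-degree, ordered by citation count
sinks           <- metrics[metrics$out_degree == 0, ]
root_candidates <- sinks$paper[order(-sinks$in_degree)]

# Paper with the highest number of references to others
most_citing <- metrics$paper[which.max(metrics$out_degree)]

# Paper with the highest betweenness (bridges citing and cited papers)
trunk_candidate <- metrics$paper[which.max(metrics$betweenness)]
```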

METHODOLOGY

Data Acquisition
This section illustrates the process of constructing the ToS using the 'scientometrics' topic as an example. To start, the user needs to download data from both the WoS and Scopus databases, ensuring that the references of the papers are included. Access to these databases was provided through licenses held by one of the institutions affiliated with the researchers, so the use of these data complies with all relevant copyright and licensing agreements. Figure 2 provides examples from both platforms. In WoS, the user should opt for a 'plain text file' configured to include 'Full Record and Cited References'. In Scopus, the user should select a BibTeX file encompassing all the relevant information.

Creating ToS - Concise Method
In the concise method, the user employs the tosR() function to construct the ToS for the 'scientometrics' topic using the files acquired from WoS and Scopus. The analysis requires the 'tosr', 'tidyverse', and 'bibliometrix' libraries to be loaded. Source Code 1 shows the code required to generate the ToS.

Source Code 1: Example code to create ToS with tosr in a short way.

    library(tosr)
    library(tidyverse)
    library(bibliometrix)

    # Data loading and creating ToS
    ToS_short <- tosr::tosR("scientometrics.txt", "scientometrics.bib")

The outcome of this process is a data frame classifying papers into roots, trunks, and leaves. Table 1 presents an example ToS, featuring the first three papers from each tree segment. The metaphor traces the narrative of 'scientometrics', beginning with Hirsch's seminal paper introducing the h-index, [22] followed in the trunk by the application of the HistCite software to evaluating the impact of scientometrics. [23] The tree concludes with a leaf representing a contemporary paper on the application of scientometrics to photocatalytic degradation. [24]

Creating ToS - Comprehensive Method
The comprehensive method of generating a ToS for a research topic serves users who want a citation network for more detailed data analysis (refer to section 3.4). The tosr_load() function takes the WoS and Scopus files as inputs and generates a list comprising three elements: a data frame merging the Scopus and WoS files (df), a graph depicting the citation network (graph), and a data frame listing the names of the papers (nodes) (Source Code 2).

Extended Scientometric Analysis - Citation Network
The citation network forms the core of the ToS process and can be created using the 'tosr' package.Figure 3 exhibits a scientometrics citation network, segmented into three clusters.
The directed citation network comprises 1,568 nodes (papers) and 3,585 edges (references). Figure 3a ranks the ten largest subfields identified through clustering analysis, [25] with only three selected for detailed study. Figure 3b illustrates the longitudinal evolution of these clusters; despite being the largest, cluster 1 has shown diminished production over the past four years. Figure 3c provides a visual representation of the citation network, where cluster 1 is the densest, primarily because it contains seminal papers such as [26] and [27], which are renowned and frequently cited in academic circles.
The initial phase of scientometric analysis involves data acquisition and preparation, given that WoS and Scopus export data in .txt and .bib formats, respectively, whereas software tools such as R and Python require a data frame or JSON structure for effective data analysis. Source Code 3 outlines the principal libraries ('tosr', 'tidyverse', 'bibliometrix', and 'tidygraph') required to transform the data. The tosr_load() function from the 'tosr' package generates three objects, one of which is the citation network in 'igraph' format. This is subsequently converted into a 'tidygraph' object. The 'tidygraph' package facilitates the addition of attributes and metrics to the network, employing 'tidyverse' syntax for ease of use.
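The conversion from an 'igraph' object to a 'tidygraph' object, and the addition of node metrics with 'tidyverse' syntax, can be sketched as follows. The toy edge list stands in for the real network and is assumed data; with actual files, the graph element returned by tosr_load() would take its place. The column names in_degree and out_degree are illustrative choices.

```r
library(igraph)
library(tidygraph)
library(dplyr)

# Stand-in for the igraph citation network (assumed toy data);
# with real data this would be the graph element returned by tosr_load()
edges <- data.frame(
  from = c("A", "B", "C", "C"),
  to   = c("R1", "R1", "A", "B")
)
citation_igraph <- graph_from_data_frame(edges, directed = TRUE)

# Convert to a tidygraph object and add node-level metrics
citation_tidy <- as_tbl_graph(citation_igraph) %>%
  activate(nodes) %>%
  mutate(
    in_degree  = centrality_degree(mode = "in"),   # times cited
    out_degree = centrality_degree(mode = "out")   # references made
  )

# Extract the node table as a tibble for further tidyverse analysis
node_table <- citation_tidy %>%
  activate(nodes) %>%
  as_tibble()
```

Keeping the metrics as node attributes means that later steps, such as filtering or clustering, can be chained with the same pipe-based syntax.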

Influence and Implications
The 'tosr' package helps researchers extract the Tree of Science (ToS) for a research topic from Web of Science (WoS) and Scopus datasets. It serves as a tool for addressing exploratory research questions about a topic, for instance, identifying the significant contributions from its inception to the contemporary literature.
One of the salient features of 'tosr' is its ability to construct a citation network from the two most widely used datasets, WoS and Scopus. This citation network enables researchers to conduct more complex analyses, such as clustering and topic modeling.
Written in R, an open-source language, and distributed via CRAN, 'tosr' requires only a working knowledge of R from its users. It aligns with the Tidyverse ethos, a collection of R packages for data analysis that adhere to a unified philosophy and grammar. Thus, researchers can leverage the advantages of the R language and its dynamic community to conduct data analysis.
The 'tosr' package allows researchers to gain an understanding of a research topic via the ToS by combining WoS and Scopus data, with few restrictions on the volume of data they wish to process. The package can be used either on a personal computer or in the cloud (via rstudio.cloud). Users can download up to 100,000 records per day from WoS and 2,000 from Scopus, integrate them via 'tosr', and subsequently perform statistical analysis. Moreover, users can export the data in Excel format for uploading to 'biblioshiny' (a Shiny application of bibliometrix), facilitating more intricate data analyses.
The 'tosr' package is a component of the Core of Science suite of products, employed in courses at various universities and institutions. The ToS concept was initiated with a web application exclusively for WoS data, [2] and subsequently expanded to Scopus data. [3] The 'tosr' package paves the way for the development of new web applications that allow users to combine both datasets, fostering the prospect of new courses with in-depth scientometric analyses. The impact of the ToS concept is evident, with over one hundred citations for the inaugural paper. [28]

CONCLUSION
In this paper, we have outlined the main functionalities of the 'tosr' package, demonstrated its application in a scientometrics investigation, and discussed its implications for academic research. The primary merits of the 'tosr' package lie in its capacity to combine Scopus and WoS datasets, thereby enabling more sophisticated data analysis, and its ability to construct the Tree of Science for a research topic.
Researchers stand to benefit from this package because it simplifies the process of tracing the evolution of a specific research topic. Furthermore, the package can be complemented with additional tools and software to augment scientometric analysis, including topic modeling and burst analysis.
Future work can concentrate on establishing a tidyverse environment for data preprocessing and on designing a web application capable of handling both WoS and Scopus. This development extends the scope of data analysis in scientometrics, enriching the process of knowledge discovery in various research fields.

Figure 1 outlines the 'tosr' package operation. The tosR() function employs Scopus and WoS files to generate a ToS pertinent to a specified research topic. There are two methods for creating the ToS: a concise path and a comprehensive one.

Figure 1: General overview to create the Tree of Science with tosr.

Figure 3: Citation network for scientometrics using tosr. a. Biggest subfields, ordered by size. b. Longitudinal analysis of the three biggest clusters since 2000. c. Citation network of the three main clusters.