Epigraphiology: A Hybrid Approach for Measuring and Analyzing Influence Diffusion in Article Networks

Identifying influential nodes in an article network is crucial for understanding the dynamics of information propagation and its impact on various applications. Traditional methods often rely on citation-based analysis or network structure, overlooking the intricate dynamics of diffusion and node linkages. In this research, we propose a novel scoring model, named "Epigraphiology," which combines these aspects to compute and analyze the elements contributing to the spread of influence in article networks. To evaluate the effectiveness of our approach, we employ a real network of around 904 published articles downloaded from the Web of Science (WOS), with a total of 32,084 cited references, in the field of cloud computing from 2010 to 2015. By leveraging the SIR (Susceptible-Infected-Removed) model, we compare the dynamics of articles in the network with the transition of states, highlighting the diffusion process. Additionally, we derive the reproduction number (R0) for our model, which serves as an indicator of the potential spread of influence. Our key contributions are as follows: (a) Epigraphiology introduces a novel methodology for measuring the diffusion capacity of an article's influence in a hybrid manner, combining diffusion dynamics and node linkages. (b) Contrary to traditional approaches that primarily consider the number of citations (in-degree), our results reveal that articles with lower citation counts can still act as super-spreaders, reflecting the ground-truth influence scores. Cross-validation of an article's influence diffusion score is performed, shedding light on the significant factors contributing to its spread within the network. By bridging the gap between diffusion dynamics, node linkages, and influence measurement, Epigraphiology offers a comprehensive approach to understanding and quantifying the spread of influence in article networks. This research holds implications for various fields and applications where the identification of influential spreaders is paramount in leveraging information dissemination and impact assessment.


INTRODUCTION
The ranking of an article node based on "influence" in a citation network is often based on the Centrality Measure (CM) and its variations. The common strategies range from graphical centrality measures such as degree, closeness, and betweenness to diffusion-based methods such as PageRank, LeaderRank, and epidemiological models [1]. Landherr et al. explained how the Social Network Analysis (SNA) literature offers a wide range of CMs to quantify the interlinking of individual entities in a social network [2]. These contributions from the SNA literature permit the general conclusion that different CMs frequently yield different outcomes for the centrality of individual entities. Everett and Borgatti discussed the limitations of CM and explained three ways to extend the basic concept of centrality [3]. In the first, centrality is applied to groups as well as individuals. In the second, the tools and concepts of centrality are applied to two-mode data. In the third, centrality is applied to the inner core and outer periphery structure of a network. Zeng and Zhang proposed a Mixed Degree Decomposition (MDD) procedure based on the K-shell decomposition method. The MDD approach is shown to outperform the existing degree-based ranking approaches [4]. They also introduced some interesting additions to the existing concepts of degree, closeness, and betweenness centrality as distinguished by [5,6]. Chen, Gao, et al. proposed a local ranking algorithm named ClusterRank, which considers not only the number of neighbors and the neighbors' influence but also the clustering coefficient [7]. Zhu et al. proposed a unique approach to ranking individual nodes of a real-world communication network based on their roles in such diffusion processes [8,9].
The existing evaluative methods explore the citation network of an article and try to trace the diffusion path using indirect citations or by exploring the similarity index [10]. Also, the existing Article Influence Score (AIS) as reported by the SCI-Web of Science is the average influence of a journal's articles over the first five years after publication. These metrics do not consider the spreading abilities of an article node. However, a large body of literature exists on the SIR spreading model, which is used to simulate spreading processes in networks and to evaluate the performance of ranking algorithms, as explained in [11]. Nevertheless, scientific literature should place some emphasis on the strength and spread of a node's influence propagation over a short time when scoring the "influentiality" of a node.
In this paper, we propose a hybrid approach called 'Epigraphiology' to identify the influential nodes in a citation network using the SIR model. A.G. McKendrick and W.O. Kermack formalized the well-known mathematical epidemic model, the Susceptible-Infected-Recovered (SIR) compartmental model [12], when studying the spreading pattern of plague. The mathematical model for the spread of infection was explained in a series of works by [13]. The concept of the reproduction number was first introduced by [14], where it was shown through clinical trials that keeping the mosquito population below a critical threshold would be sufficient to control malaria. These compartmental models are useful for estimating the diffusion or spread of an infection in a susceptible population. We use the reproduction number R0 as the key indicator of the spread of influence across citation networks.
The value of R0 is derived mathematically using the graphical parameters from a real-world citation network of articles. Weights are then assigned to each of the features, and an optimized score is calculated that reflects the influence of an article in a network based on its direct citations (primary infections) and indirect citations (secondary infections).

Objective and Problem Statement
The current scholastic evaluation methods lack a metric that can effectively capture an article's influence diffusion within a domain, independent of the ecosystem it grows in. Existing graph-based metrics, such as K-core and centrality metrics, do not adequately reflect the article's "influentiality" [15]. This poses a challenge for agencies seeking to evaluate a group of researchers' work and for top-tier universities looking for evidence of influential research capability during faculty recruitment or when awarding grants. Additionally, there is a need for a metric that funding, grant, and patent organizations can employ to assess the impact their funding had on a funded project. Therefore, the main problem addressed in this study is the absence of a comprehensive metric that purely quantifies an article's influence-spreading abilities, goes beyond citation count to measure impact, and allows for evaluating the impact of funding on a project. The research aims to develop a unique graph-based method that combines the positional and diffusion capabilities of nodes in the academic network. The proposed algorithm was tested using real data and then compared with existing methods to draw significant inferences and validate its effectiveness in measuring and evaluating academic influence.

We address the following questions in our experimental study:
Is there any metric other than the existing ones that will quantify the article purely based on its influence-spreading abilities in a domain?
Is there any metric that goes beyond citation count and quantifies an article's impact?
Is there a metric that funding/grant/patent organizations can use to evaluate the impact the funding had on a funded project?
The rest of the paper is organized as follows. The "Methods" identified for Epigraphiology are presented in the following section. The "Experimental setup" section discusses the complete setup for the experiment, covering the construction of the dataset and the network, the derivation of the parameters, and finally the influence score computation. The "Results and discussion" section provides insight into the algorithm and its highlights.

METHODOLOGY
The methodology for deriving the R0 for our model and developing the unique graph-based method to measure academic influence involved the following steps:

Data set construction
A data set of around 904 articles downloaded from the Web of Science (WOS), with a total of 32,084 cited references, was constructed. Next, a network representation was built using the collected data. Nodes in the network represented academic articles or researchers, and edges represented relationships such as citations.

Factor Identification
The resulting network was analyzed, and various factors that affect the spread of influence within the domain were identified. This involved examining the network structure, node characteristics, influence transmission mechanisms, and time dynamics.

Model Inception
A mathematical model, the Epigraphiology algorithm, was developed that incorporates the identified factors to quantify the influence-spreading abilities of academic articles or researchers.
The model was designed to capture the positional and diffusion capabilities of nodes in the network.

R0 Derivation
The developed model was then used to estimate the basic reproduction number (R0) for the academic network. R0 represents the average number of secondary influence transmissions caused by a single influential node. This estimate provides an objective measure of the influence-spreading abilities within the domain.

Validation and Comparison
The derived R0 value and the proposed graph-based method were validated by comparing the results with existing methods and metrics. Statistical analyses were conducted and inferences drawn to demonstrate the effectiveness and uniqueness of the developed approach.
By following this methodology, we could address the research objectives and problem statement, develop a unique graph-based method for measuring academic influence, and provide valuable insights into evaluating influential research credentials within a domain.

Graphical Methods for Network Structure and Node Characteristics
The underlying structure of Epigraphiology is a graph in which articles are nodes and edges represent citations. The first part of the algorithm computes the positional strength of an article node in the graph. Our graph G(V, E) had N vertices, which were the articles in a domain published within the years 2011-2015. The E edges represented citations from one article to another. The positional strength was then computed using the graphical measures defined below:

The In-degree of a node
The in-degree of a node, that is, the number of citations it receives, reflects the node's importance in a citation network. This is because a directed edge from article 'i' to article 'j' indicates that the citing article 'i' is influenced (infected) by the cited article 'j'.

Closeness Centrality
It is one of the most commonly used measures in citation networks: a node with high closeness centrality occupies a highly favorable position from which to spread influence.

Betweenness Centrality
Betweenness centrality measures the amount of influence a node has over the information flow in a graph. It is often used to find nodes that connect one part of a graph to another.

Eigen Vector Centrality
It is an important graphical measure in which a node's score depends on the scores of the nodes linked to it, so nodes with higher in-degree tend to have a high score. A connection from such a high-scoring node is therefore considered more important than one from a low-scoring node.
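As an illustration of two of the measures above, the following is a minimal, dependency-free sketch on a toy citation graph. The function names, the toy edges, and the choice of an undirected view for closeness are assumptions made here for illustration; the paper itself reports computing these measures with networkx.

```python
from collections import deque

def in_degree(edges):
    """Count incoming citations per article; edges are (citing, cited) pairs."""
    counts = {}
    for citing, cited in edges:
        counts.setdefault(citing, 0)          # make sure uncited articles appear
        counts[cited] = counts.get(cited, 0) + 1
    return counts

def closeness(edges, node):
    """Closeness of `node` on the undirected view of the graph (an assumption):
    number of reachable nodes divided by the sum of shortest-path distances."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    dist = {node: 0}
    queue = deque([node])
    while queue:                              # breadth-first search from `node`
        current = queue.popleft()
        for nxt in neighbours.get(current, ()):
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# Hypothetical toy network: B, C and D cite A; D also cites C.
toy_edges = [("B", "A"), ("C", "A"), ("D", "A"), ("D", "C")]
toy_in_degree = in_degree(toy_edges)
toy_closeness_a = closeness(toy_edges, "A")
toy_closeness_b = closeness(toy_edges, "B")
```

In this toy graph, article A receives all three direct citations and is one step from every other node, so it has both the highest in-degree and the highest closeness.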

The SIR-Based-Influence Spreader Model
One efficient way to quantify the influence of an article node in a network is to investigate its trajectory path. The Susceptible-Infected-Removed (SIR) model is a commonly used method that simulates spread through compartmental stages. The SIR model is adopted in "Epigraphiology" to assess the spreading capacity of a node. We derive the formula for the reproduction number (R0) for our model and track the secondary infections of an article. Each node in the network represents an article published in a given year and is labeled Susceptible (S). A citation to an article creates a transition from S to I (Infected). Articles often receive no citations even after a five-year citation window; these articles are moved from S to R (Removed). For further analysis, the articles are classified into three groups based on their position in our citation window from 2011-2015:

Highly Susceptible
Articles published in 2011-12, which are highly susceptible to the maximum number of infections (citations) owing to their maximum exposure in the domain of 4 or 5 years.

Moderately Susceptible
Articles published in 2013, which are aged 3-4 years. These articles are moderately exposed to infections and may be capable of becoming efficient spreaders in the future.

Nascent
Articles published towards the end of our citation window, which have existed for just 2 years and yet may have the potential to be super-spreaders.
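The state labels and the three exposure groups described above can be sketched as follows. This is only an illustrative sketch: the function names are hypothetical, treating 2014-2015 as the "nascent" years is an assumption, and so is the reading of the five-year window as a difference of at least five calendar years.

```python
def age_group(pub_year):
    """Map a publication year to one of the three exposure groups."""
    if pub_year <= 2012:
        return "highly susceptible"
    if pub_year == 2013:
        return "moderately susceptible"
    return "nascent"  # assumed here to cover 2014-2015

def sir_state(citation_count, pub_year, current_year, window=5):
    """Return 'I' once an article is cited, 'R' if it is still uncited
    after the citation window, and 'S' otherwise."""
    if citation_count > 0:
        return "I"
    if current_year - pub_year >= window:  # assumed reading of the window
        return "R"
    return "S"

# Hypothetical articles: (citations received, publication year)
examples = {"cited_2012": (3, 2012), "uncited_2010": (0, 2010), "uncited_2014": (0, 2014)}
states = {name: sir_state(c, y, current_year=2015) for name, (c, y) in examples.items()}
```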

Mathematical Modelling with Time Dynamics
In the citation network context, S(t) is the number of articles published in the domain of "security issues in cloud computing" at time t, the year of publication. I(t) is the number of articles that receive citations from other articles, representing the spread of infection. R(t) is the number of articles that do not receive citations. The SIR model has parameters, namely the infection rate (α), the transmissibility rate (β), and the recovery rate (γ), which are crucial for reflecting the dynamic nature of a complex network.
Here α is the rate at which a node gets infected in a population of susceptible nodes, β is the average rate of contact between susceptible and infected individuals, and γ is the rate of removal from the population. Table 3 lists the analogies between the SIR model and our model.
We represent the states as the proportions of the network as follows.
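A minimal sketch of such a normalization, assuming N denotes the total number of articles in the network and that every article is in exactly one state at time t (the lower-case symbols s, i, and r are introduced here for illustration):

```latex
s(t) = \frac{S(t)}{N}, \qquad
i(t) = \frac{I(t)}{N}, \qquad
r(t) = \frac{R(t)}{N}, \qquad
s(t) + i(t) + r(t) = 1
```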

Derivation of Reproduction Number
In the study of information diffusion models, epidemic models have been found to be highly effective in implicit networks. R0 is a crucial parameter in the study of infectious diseases and epidemiology [16]. It represents the average number of secondary infections that can be generated by a single infected individual in a population that is entirely susceptible to the disease. The most important use of R0 is in determining the contagiousness and transmissibility of an infectious disease. This property is used in our model to simulate and measure the spread of article influence. For influence to keep propagating, the basic reproduction number must be greater than 1 (R0 > 1); otherwise, the propagation dies off. Figure 1 shows the propagation of infection for R0 = 2. The node at layer 0 is capable of spreading the infection to at least two nodes, each of which can in turn infect two other individuals.
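As a worked illustration of this layered propagation, under the idealized assumption that every infected node infects exactly R0 new nodes and that no node is infected twice, the number of newly infected nodes at layer k and the cumulative count up to layer k are:

```latex
n_k = R_0^{\,k}, \qquad
\sum_{j=0}^{k} n_j = \frac{R_0^{\,k+1} - 1}{R_0 - 1} \quad (R_0 \neq 1)
```

For R0 = 2 this gives 1, 2, 4, and 8 nodes at layers 0 to 3, or 15 nodes in total, whereas for R0 < 1 the layer sizes shrink towards zero.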
To derive R0 for our model, we need to identify and analyze the various factors that affect the spread of influence. These factors play a crucial role in determining the rate at which influence is transmitted from one node to another in the network. By understanding and quantifying these factors, we can estimate the basic reproduction number (R0), which represents the average number of secondary influence transmissions caused by a single influential node.
The spread of influence of an article depends on the average rate of citations, the duration of the infection, and the transmissibility rate. We define R0 for our model as the product R0 = α × β × d, where α is the probability of infection given contact between a susceptible and an infected individual, β is the average rate of contact between susceptible and infected individuals, and d is the duration of infection. In this study, these parameters are recalculated to derive a reproduction number suited to our citation network. In the graph G(V, E), the infection rate α of a node n belonging to the vertex set is its in-degree divided by the number of nodes N. The transmissibility rate β of a node n is defined over the interval from 'a', the publication year of the article, to 'b', the end of the citation window; it can also be described as the average number of citations received by the article per year.
The term 'd' indicates the duration of infection; in our model, it is calculated as the number of years between the publication of the article and the end of the time frame, which is 2015.
The removal rate (γ) is kept at 1 because ultimately all articles in the infected or susceptible state tend to get removed, as they receive no further citations after a certain period.
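The calculation described in this section can be sketched as below. The sketch assumes R0 is the plain product of the three quantities, takes the infection rate as in-degree divided by N, and treats the transmissibility rate β as an input because its exact formula is not reproduced here. All function names and the example numbers other than N = 460 are hypothetical.

```python
def infection_rate(in_degree, n_nodes):
    """alpha: in-degree of the article relative to the number of nodes N."""
    return in_degree / n_nodes

def duration(pub_year, window_end=2015):
    """d: years between publication and the end of the citation window."""
    return window_end - pub_year

def reproduction_number(alpha, beta, d):
    """R0 as the product of infection rate, transmissibility rate and duration
    (assumed form; the removal rate gamma is fixed at 1 in the text)."""
    return alpha * beta * d

# Hypothetical article: 23 citations in a 460-node network, published in 2012,
# with an assumed transmissibility rate of 4.0.
alpha = infection_rate(in_degree=23, n_nodes=460)
d = duration(pub_year=2012)
r0 = reproduction_number(alpha, beta=4.0, d=d)
spreads = r0 > 1  # influence keeps propagating only when R0 exceeds 1
```

With these made-up inputs the result is roughly 0.6, which is below the threshold of 1, so this hypothetical article would not sustain influence propagation.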

Epigraphiology Algorithm
The proposed algorithm detects the most influential nodes in a graph based on locality measures and spreading capabilities. The overall approach is shown in Figure 2.
The algorithm is a three-phased approach. In the first phase, the parameters are extracted from the citation network, and the graphical and epidemiological features are calculated. The value of the reproduction number is estimated from the α and β values explained in the previous section. The next phase assigns weights to the graphical features, keeping R0 as the target function.
The extra tree classifier indicates the weight of each feature, and finally the influence score is calculated as the weighted score. The algorithm below gives the complete details of the steps performed for influence-score computation.
Input: Edges representing direct citations from a citing article to a cited article.
Output: Influence scores of each article in the citation network.
Article nodes [] = array to store the article nodes in the network.
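The final weighted-score step of the algorithm can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the feature weights are hard-coded hypothetical numbers standing in for the importances that the text obtains from the extra tree classifier, and the function names and attribute values are assumptions.

```python
def influence_score(attributes, weights):
    """Weighted sum of an article's attribute values (weight * value per feature)."""
    return sum(weights[name] * value for name, value in attributes.items())

def rank_articles(article_attributes, weights):
    """Score every article and return (article, score) pairs, highest first."""
    scores = {
        article: influence_score(attrs, weights)
        for article, attrs in article_attributes.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical feature weights and per-article centrality values.
weights = {"BC": 0.2, "CC": 0.3, "EC": 0.5}
articles = {
    "a1": {"BC": 0.0, "CC": 0.5, "EC": 1.0},
    "a2": {"BC": 1.0, "CC": 0.5, "EC": 0.0},
}
ranking = rank_articles(articles, weights)
```

Because eigen-centrality carries the largest hypothetical weight here, article a1 ranks above a2 even though a2 has the higher betweenness.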

Experimental Setup and Results
In this section, we discuss the steps involved in creating the data set and the network. The overall methodology is explained in the following subsections.

Data Set and Graph Construction
The data set is created by downloading all articles published in the domain of "Security issues in cloud computing" from the Web of Science. To construct the network for "Epigraphiology", all articles with an in-degree or out-degree of at least 1 are considered, which means an article must be cited by, or cite, at least one other article. The resulting network has 460 nodes and 739 edges.
Figure 3 is an overview of our network. Once the graph is ready, we calculate the various local parameters for each node using the Python package networkx. The in-degree, out-degree, Closeness (CC), Betweenness (BC), and Eigen-Centrality (EC) are computed as discussed in the "Graphical Methods" section and stored in a data set along with the article ID and authors list. The next step is to calculate R0 as discussed in the "Epigraphiology Algorithm" section. The parameters α and β along with the value of 'd' are listed below. The reproduction number is one of the key predictors because it is expressive enough to demonstrate the spread of influence. Table 4 presents the R0 values of 10 random articles. Interestingly, the highlighted node 379 has a good R0 value even though its α value is low, because of its very high transmissibility rate. This node is an excellent example of the nascent category: although it was published in 2014, it is a potential super-spreader.

K-shell Decomposition: Comparative study with R0
The k-shell method is a popular existing method for identifying influential nodes in a network. However, it uses only global information about a node's position in the network and allocates the same core number to many nodes. The k-core method is used for static networks, which have a fixed structure. The k-core of a graph 'g1' is defined as the maximal sub-graph of 'g1' in which every node has a degree of at least k. The k-shell method was applied to this dataset as well, and a comparative analysis with R0 was performed. Table 5 shows that a large number of articles are assigned the same shell number, which makes ranking the articles difficult. For example, articles 17 and 126 both have core number 4, but their R0 values differ. Also, since only global information is used for assigning shell numbers, many nodes that are nascent and influential are not captured by k-shell decomposition.
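To make the tie problem concrete, the following is a minimal sketch of k-shell decomposition by iterative peeling on an undirected graph. The function name and the toy graph are assumptions made for illustration; this is not the implementation used in the study.

```python
def core_numbers(edges):
    """Return the k-shell (core) number of every node of an undirected graph."""
    neighbours = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    degree = {node: len(adj) for node, adj in neighbours.items()}
    remaining = set(neighbours)
    core = {}
    k = 0
    while remaining:
        # Peel every remaining node whose current degree is at most k.
        peel = [node for node in remaining if degree[node] <= k]
        if not peel:
            k += 1  # nothing left to peel at this level, move to the next shell
            continue
        for node in peel:
            core[node] = k
            remaining.discard(node)
            for nb in neighbours[node]:
                if nb in remaining:
                    degree[nb] -= 1
    return core

# Hypothetical toy graph: a triangle A-B-C with a pendant node D attached to A.
toy_core = core_numbers([("A", "B"), ("B", "C"), ("C", "A"), ("D", "A")])
```

In this toy graph all three triangle nodes share core number 2 even though A has more links than B or C, which mirrors the difficulty of ranking articles that fall in the same shell.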

Article Influence Diffusion Score Computation
Once all the technical parameters are collected using the methods described above, the influence score is computed. We use the R0 values as the target function and assign weights to all the parameters using ExtraTreeClassifier. This is a classification method, but we use it to obtain the importance of attributes. The article diffusion score is then calculated by taking the product of attribute weight and attribute value. Table 6 shows the influence scores of the top 10 highly influential articles along with their in-degrees. The results are in line with our hypothesis that the citations received cannot truly reflect the influence of an article. Article ID 64 (highlighted) supports our point: even with an in-degree of 17, its influence score is on par with those of its established peer articles.
Figure 4 demonstrates the spread of node 64 as an influential article in the network. The peak of infection is shown alongside.

Influence-Diffusion Cross-validation
Multivariate linear regression was performed on the influence score, using the centrality measure variables from the citation network as the independent variables. Mathematically, the model is given by eq. 7, where Y = influence diffusion score, X1 = closeness centrality, X2 = betweenness centrality, and X3 = eigen centrality. The model had an adjusted R² value of 0.733, and the F statistic was found to be significant. The coefficient for eigen centrality was found to be significant. Table 7 shows the results of the regression.
It can be seen from Table 7 that the eigen-centrality value has a large impact on the influence diffusion score. The coefficients for closeness centrality and betweenness centrality are insignificant, indicating that they do not have a significant impact on the influence-diffusion score, and these variables were dropped from the final model. The next goal was to detect multicollinearity among the other independent variables, for which the Variance Inflation Factor (VIF) method was used. The variables in-degree and infection probability had high VIF values and hence were removed from the model as well. A linear regression was then run using the citation network variables. Mathematically, the model can be represented by eq. 7, where Y = influence diffusion score, X1 = eigen centrality, X2 = beta, and X3 = gamma. All coefficients were found to be significant at the 95% level. Table 8 shows the results of the regression. All variables have a significant impact in determining the influence-diffusion score; however, the beta value (transmissibility rate) seems to have the maximum impact. Thus, it can be verified from the above equations that the citation network variables have a significant impact on the influence-diffusion score, while the centrality measures do not have a large impact.
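For reference, a regression with three independent variables and the VIF used in the multicollinearity check take the following standard forms. The coefficient symbols b_0 to b_3 and the error term ε are notation assumed here, chosen to avoid a clash with the transmissibility rate β; R_j² denotes the coefficient of determination obtained by regressing X_j on the remaining independent variables.

```latex
Y = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + \varepsilon,
\qquad
\mathrm{VIF}_j = \frac{1}{1 - R_j^{2}}
```

A VIF of 1 means X_j is uncorrelated with the other predictors, and the value grows without bound as R_j² approaches 1, which is why variables with high VIF are candidates for removal.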

Comparison of Epigraphiology with Alternative Methods
We compare the proposed Influence Diffusion model with the existing Article Influence Score (AIS) metric, which evaluates the relative importance of a scholarly journal's articles within the citation network. It is calculated by dividing a journal's Eigenfactor Score by the number of articles published by the journal, normalized for differences in citation frequency across disciplines. We also compare our model with citation count on various parameters, as shown in Table 9.

DISCUSSION
Epigraphiology is a unique method with a graph-driven approach. We have performed various validations at each step to test Epigraphiology as an algorithm. 1) The use of R0 as one of the key predictors, chosen because it is expressive enough to demonstrate the spread of influence, is compared with the existing K-core decomposition in section 5.1.1. The results are discussed in detail to indicate its effectiveness. 2) We compared Epigraphiology with existing metrics such as citation count and Article Influence Score in section 6.1. Our model captures the spread of influence and the temporal dynamics of influence diffusion more directly, potentially accounting for both short-term and long-term effects as the influence spreads through the network. This is a unique quality that makes our model work on all real-life datasets. 3) Sensitivity analysis was performed by

Table 9: Comparison of AIS, citation count, and Epigraphiology.

Scope
AIS: AIS evaluates the influence of entire journals based on the citations received by their articles over a five-year period.
Citation count: Citation count measures the number of times an article has been cited by other scholarly works.
Epigraphiology: The proposed model focuses on evaluating the influence diffusion of individual articles within a scholarly network.

Granularity
AIS: AIS provides a journal-level metric, aggregating the influence of all articles published in the journal.
Citation count: It provides a quantitative measure of the impact of an article within the academic community but does not differentiate between direct and indirect influence or the spread of influence beyond citations.
Epigraphiology: The proposed model focuses on evaluating the influence diffusion of individual articles within a scholarly network.

Temporal Considerations
AIS: AIS is calculated based on citations accumulated over a five-year period, providing a snapshot of influence during that timeframe.
Citation count: Citation counts typically reflect the cumulative number of citations received by an article over time. They do not necessarily differentiate between recent citations and those accumulated over a longer period.
Epigraphiology: The proposed model may capture the temporal dynamics of influence diffusion more directly, potentially accounting for both short-term and long-term effects as the influence spreads through the network.

Methodology
AIS: AIS is based on the Eigenfactor Score, which considers the network structure of citations and accounts for differences in citation patterns across disciplines.
Citation count: Citation counts are straightforward to calculate and widely used as a proxy for research impact. However, they may be influenced by factors such as self-citations or citation practices within specific fields.
Epigraphiology: The proposed model utilizes graph network analysis to quantify the diffusion of influence from individual articles, considering factors such as direct and indirect connections between articles.

Application
AIS: AIS is commonly used to evaluate the impact and prestige of scholarly journals, aiding researchers, librarians, and funding agencies in decision-making.
Citation count: Citation counts are commonly used by researchers, institutions, and funding agencies to assess the impact of scholarly publications. They play a crucial role in tenure and promotion decisions, funding allocations, and ranking of researchers and institutions.
Epigraphiology: The proposed model offers a systematic approach for evaluating individual scholarly contributions, aiding in the identification of influential articles and researchers within specific domains.

Algorithm steps (influence-score computation):
for each node x of node 'a' do
    if x has cited a then
        append x to in-degree nodes []
    end if
end for
for each node x in in-degree nodes [] do
    compute BC, CC, EC as ATTR VALUE
end for
for each node x in article nodes [] do
    R = ((in-degree)/N) * Avg edges * D
    Influence score = ATTR VALUE * W
end for

Table 1 lists the search string, and Table 2 is a snapshot of the dataset.

Table 7 shows that the eigen-centrality value has a large impact on the influence diffusion score. The coefficients for closeness centrality and betweenness centrality are insignificant, indicating that they do not have a significant impact on the influence-diffusion score, and these variables are dropped.

Table 8 shows the results of the regression. All variables have a significant impact in determining the influence-diffusion score.

Table 6: Influence scores of the top 10 articles.
Figure 4: Final state of node 17, which has the highest influence score, with infected nodes shown in red.