An Analytical Study of Alternative Method for Solving Lotka’s Law with Simpson’s 1/3 Rule

This paper presents a new method to solve Lotka's Law using a higher-degree Newton-Cotes quadrature rule, as an alternative to the existing solution for determining the constant part of the power law. So far, M.L. Pao has given a solution with an equation that determines the area under the curve using the numerical integration rule of degree 1, also known as the Trapezoidal Rule. Here, the next higher degree (degree 2), popularly known as Simpson's 1/3 rule on a closed interval, has been used to establish a deterministic equation for solving authors' productivity as realized through Lotka's Law. Re-estimating the value of C with a higher-degree quadrature rule is crucial, since the inclusion of more area under the curve, and the exclusion of unnecessary area, becomes more precise. Another area of investigation is the determination of the P value (Pao fixed P = 20): can P = 20 be altered, and can the equation derived through Simpson's 1/3 rule give a minimal residual error beyond P = 20? This paper builds up a mathematical equation to solve the constant value (C) of Lotka's law equation and addresses all these investigating points.


INTRODUCTION

Fundamentals of Lotka's Law
In the last century, research in the field of Scientometrics was dominated by three empirical power laws: Bradford's Law, Lotka's Law and Zipf's Law. They ultimately became classical laws of the subject and have driven a whirlpool of research over the last seven decades. All of them essentially behave like power laws, and their forms can be interchanged when they are put into representative forms under a generalized framework. This article discusses the essential existing solutions of Lotka's Law [1] when applied to datasets and highlights other possible solutions.
Lotka's law, also known as the law of authors' productivity, states that the number of authors making x contributions in a certain time interval is about 1/x^n of the number of authors making a single contribution.
Here, x denotes the number of papers, φ(x) denotes the number of authors, C is a constant value and n is the exponent of the variable x:

φ(x) = C / x^n     (1)

According to Lotka, n is perceived to be around 2; if n = 2, the equation becomes a perfect inverse-square law. In other words, the function φ(x) is a power law representing the concentration of sources (here, authors) over items (papers), with x ∈ N, x ≥ 1. The Riemann zeta function has the form

ζ(n) = Σ_{x=1}^{∞} 1/x^n     (2)

Expanding the summated form over the integers,

ζ(n) = 1/1^n + 1/2^n + 1/3^n + ...

Thus, when n = 2, Eqn. 2 becomes the special case called the inverse-square law, into which Lotka's Law fits; Lotka's Law is a special case of the Riemann zeta function. Eqn. 2 has the well-known closed-form values used here at n = 2 and n = 4:

ζ(2) = π²/6,  ζ(4) = π⁴/90
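The convergence of these zeta series to their closed-form values can be verified numerically; a minimal sketch (illustrative only, not part of the original derivation):

```python
import math

def zeta_partial(n, terms):
    """Partial sum of the Riemann zeta series: sum of 1/x**n for x = 1..terms."""
    return sum(1.0 / x**n for x in range(1, terms + 1))

# For n = 2 the series converges to pi^2/6; for n = 4, to pi^4/90.
print(zeta_partial(2, 1_000_000), math.pi**2 / 6)
print(zeta_partial(4, 1_000_000), math.pi**4 / 90)
```

Note how slowly the n = 2 series converges (the tail after P terms is roughly 1/P), which is exactly why a quadrature-based tail correction is needed instead of brute-force summation.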

Lee Pao's Method
The main procedure of solving Lotka's function is deriving the value of the constant (C) as well as the exponent (n). The exponent is calculated using the Maximum Likelihood Estimator (MLE) method, and there is no room for doubting the efficacy of MLE in calculating the value of the exponent.

Calculation of the exponent 'n'
The value of the exponent reflects many factors; conversely, we can hypothesize that several factors govern the value of the exponent. The exponent is calculated using the simple linear least-squares estimator. The value of n can be found with

n = − (N ΣXY − ΣX ΣY) / (N ΣX² − (ΣX)²)

(the negative of the fitted slope, since log y decreases with log x). Here, N is the number of data points as pairs of x and y values, X is the logarithmic transformation of x, and Y is the logarithmic transformation of y.
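The least-squares estimate above can be sketched in code; the data below are hypothetical values generated from an exact inverse-square law, purely for illustration:

```python
import math

def estimate_exponent(pairs):
    """Least-squares estimate of the Lotka exponent n from (x, y) pairs,
    where x = number of papers and y = number of authors. Fits
    log y = log C - n log x, so n is the negative of the fitted slope."""
    X = [math.log10(x) for x, _ in pairs]
    Y = [math.log10(y) for _, y in pairs]
    N = len(pairs)
    sx, sy = sum(X), sum(Y)
    sxy = sum(a * b for a, b in zip(X, Y))
    sxx = sum(a * a for a in X)
    slope = (N * sxy - sx * sy) / (N * sxx - sx * sx)
    return -slope

# Hypothetical data following an exact inverse-square law: y = 600 / x**2.
data = [(x, 600 / x**2) for x in range(1, 11)]
print(estimate_exponent(data))  # close to 2.0
```

On real bibliographic data the points do not sit exactly on a line in log-log space, so the recovered n will deviate from 2 in the way the paper discusses.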

Calculation of constant 'C'
The crux of the problem of solving Lotka's law is the accuracy of fitting the observed data into the equation with the calculated numerical values. As stated above, if n is 2 or 4, solving Eqn. 1 is pretty straightforward. But in the real world, the distribution of data rarely shows such sharp exponent values, and there is no closed form for an arbitrary exponent. As stated by Pao, [2] dividing both sides of Eqn. 1 by the total number of authors, Σφ(x), and letting f(x) = φ(x)/Σφ(x) be the fraction of authors with x papers, the normalization Σf(x) = 1 gives the new constant

C = 1 / Σ_{x=1}^{∞} 1/x^n

This is another identical form of the original Lotka's equation as written in Eqn. 1.

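Pao's approximation of 1/C sums the first P−1 terms of the series exactly and estimates the tail from P onward with a trapezoid-based correction; a sketch follows (the correction terms follow Pao's formula as commonly reported, so treat the exact form as an assumption to be checked against [2]):

```python
import math

def pao_inverse_c(n, P=20):
    """Pao-style approximation of 1/C: the first P-1 terms of the series
    summed exactly, plus a trapezoid-based estimate of the tail from P on."""
    head = sum(1.0 / x**n for x in range(1, P))
    tail = (1.0 / ((n - 1) * P**(n - 1))      # integral of x**-n from P to infinity
            + 1.0 / (2 * P**n)                # trapezoidal endpoint correction
            + n / (24.0 * (P - 1)**(n + 1)))  # higher-order correction term
    return head + tail

# For n = 2 the exact value of 1/C is pi^2/6; the error is below 1/110,000.
print(pao_inverse_c(2.0))                      # close to pi^2/6
print(6 / math.pi**2, 1 / pao_inverse_c(2.0))  # exact vs estimated C
```

With n = 2 and P = 20 this reproduces the error bound quoted later in the paper (less than 1/110,000 against π²/6).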

LITERATURE REVIEW
The research done so far on Lotka's Law can be segregated into two broad sub-divisions: (a) development of methods to solve it and to build mathematical relations among Bradford's, Lotka's and Zipf's laws; and (b) application of the law in different subjects to understand authors' productivity. As Lotka's law shows a power-law form, there have been continuous efforts to fit different standard distributions to authors' productivity in different disciplines. Murphy [3] applied Lotka's law in the Humanities using the inverse-square law, and the law did fit the dataset. Hersh [4] analyzed authors' productivity in research on Drosophila assuming a simple power-law function y = bx^k, and stated that predicting the behaviour of the exponential curve is risky due to uncertainty in reaching the point of inflection at any moment. Sen [5] used a simple method to calculate the constant C by putting in the first value of the number of papers published (here x = 1); he also calculated C with the Lee Pao method. [2] His method gave C = 1.485 while Pao's method gave C = 1.682, and he showed that Pao's method by and large follows his method of deriving C. An empirical analysis of the available standard datasets was made using the inverse power model of Lotka's Law (Y_x = kx^{-b}); the constant k was estimated using Pao's method and the exponent with the Maximum Likelihood Estimator (MLE), and the K-S test was subsequently applied to check goodness of fit. [6] Bookstein [7] made an important analysis of the intrinsic behaviour of the classic bibliometric distributions shown in Bradford's Law, Lotka's Law and Zipf's law. Bookstein concluded that "Lotka's law is not sensitive to how we count articles, so that two people testing the law for a single population, but different count methods, will very likely come up with the same law". [7]
Krisciunas [8] made a short communication to the editor explaining his study on a randomly chosen sample; he observed that Lotka's law is biased towards the distribution of relatively more prolific authors, and concluded that the dataset must be prepared from authors' publications over a long time period for Lotka's law to fit best. Krisciunas fitted an exponential distribution to his collected data, but it did not decrease as fast as Lotka's distribution. [8] In another study, [9] Brookes proved that all the empirical laws reduce to a simple form of hyperbolic law whose probability density function revolves around k/x² and operates upon certain intervals of the X-axis; Bradford's law shows behaviour identical to other similar empirical laws (Lotka, Price and Zipf). [9] In another paper, [2] Pao used the general inverse power law x^n y = C and estimated the exponent value using the linear least-squares method. The constant C was approximated by the numerical integration method using the trapezoidal rule. Compared with the actual value of π²/6, Pao's method produced an error of less than 1/110,000; for π⁴/90, the error is less than 1/25,000,000. In another research paper, [6] Nicholls validated Lotka's law on 15 classical datasets after the ground-breaking work of Pao [2] and proposed two modifications to the Pao method. The first was that, while calculating the exponent value, the data need to be truncated and the Maximum Likelihood Estimator should be solved by the numerical iterative method provided by Johnson and Kotz. [10] The second suggestion was to also include multi-authorship fractional credit counts. [6]
Nicholls [11] investigated the validity of Price's Law against the backdrop of Lotka's law and tried to prove consistency between the two with respect to their theoretical and empirical behaviour. But the value of the constant k always depends on the exponent value b of the distribution; empirically, X_max (the number of papers produced by the most prolific author) is not infinite and does not follow a limiting value. In some datasets the number of authors with a single publication varies considerably across different subjects, and this disobeys Price's conjecture due to the problem of indicating the cut-off point between prolific and non-prolific authors. His observation was: "the validity of the Price law need not depend on that of Lotka's law, the Price law is seen to be inconsistent with the generalized Lotka model. The Price law does not agree with empirical data very well; empirical results do not support the Price hypothesis. Since the empirical validity of Lotka's law has recently been more firmly established, it is not surprising that the empirical and theoretical findings are consistent". [11] In another study by Nicholls, [12] the Pao method was applied on 70 datasets, and his observation was that 90% of the cases followed the generalized Lotka method. In a ground-breaking study, Bailón-Moreno et al. [13] deduced a Unified Scientometric Model by unifying all three classical scientometric laws and their variant forms through the concept of fractal theory and accumulated-advantage models. Through the Index of Fractality, they also showed how, with differences in the Fractality Index, different forms of distributions (Zipf-Mandelbrot, Lotka, the Leimkuhler form of Bradford's Law, the Booth-Federowicz-Zipf distribution, the Condon-Zipf distribution, Brookes's law for the aging of science, Price's law of exponential growth, the generalized model of aging-viability, etc.) are created, and how some of them changed their equation forms as well. [13]
Egghe [16][17][18][19][20][21] propounded the two-dimensional Information Production Process (IPP) with size-frequency and rank-frequency functions, and this novel approach created a new domain of the subject known as "Lotkaian Informetrics". [17,19] A general mathematical framework was developed for a continuous description of the classical bibliometric laws. [22] Egghe [15] applied Pratt's measure to bibliometric distributions and proved that the 80/20 rule and Price's square-root law can basically be derived from Lotka's law. In another study, [23] Egghe explained different consequences of Lotka's Law, for instance the negative correlation between the average number of papers and Lotka's parameters, and the change of Lotka's parameters under high concentration of authors. He also explained Herdan's law in linguistics and Heaps' law in information retrieval on the basis of Lotkaian informetrics. [20] Egghe proved the fitness of Naranan's theorem in the generalized IPP framework of the power law and further interpreted Lotka's and Zipf's laws. [21] In the same line of research, Egghe derived the mathematical relation between the fraction of sources and the items produced [15] and also studied inequality aspects of Zipfian and Lotkaian functions. [18]

Research gap and purpose of the study
From the literature review it is realized that no research has yet been done to find new alternative methods apart from the trapezoidal rule, probably because its error approximation is already small. Most of the research has focussed on applications of Lotka's law in different subjects, investigating whether the authors' productivity distribution follows this law or not. This study derives the formula for Lotka's constant (Equation 32) on the basis of Simpson's 1/3 rule.

METHODOLOGY
The central part of the calculation is to divide the whole curve into two regions: the first part calculates the area under the curve up to a certain point P, and the second from P to infinity. Under this method, the curve representing the function f(x) is approximated to compute the given integral form. Any approximating method always contains some error, and the error must be evaluated along with the integration. The method used so far to evaluate the constant value is the Newton-Cotes quadrature formulas.
The integral form is

∫_a^b f(x) dx = I_n + EI_n

Here, I_n is essentially a Lagrangian polynomial approximation and EI_n is the error part. Under the Newton-Cotes formulas, Pao approximated the value of the constant C on the principle of calculating the area under the curve using the trapezoidal rule of numerical integration, where n = 1, meaning only two points (x₁ and x₂) are connected. Our approach is to refine the approximation of the area with higher-degree numerical integration methods. In the same line as Pao, we apply Simpson's 1/3 rule to calculate the area under the curve, where n = 2: the function is estimated by fitting a parabola passing through (x₀, f(x₀)), (x₁, f(x₁)) and (x₂, f(x₂)).
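The difference between the two rules can be seen on a single interval of the Lotka integrand; a minimal sketch, using f(x) = 1/x² over [1, 2] as an illustrative case:

```python
def trapezoid(f, a, b):
    """Degree-1 Newton-Cotes rule: area of the trapezoid joining
    (a, f(a)) and (b, f(b))."""
    return (b - a) * (f(a) + f(b)) / 2.0

def simpson(f, a, b):
    """Degree-2 Newton-Cotes rule (Simpson's 1/3): fits a parabola through
    the two endpoints and the midpoint of [a, b]."""
    m = (a + b) / 2.0
    return (b - a) / 6.0 * (f(a) + 4.0 * f(m) + f(b))

f = lambda x: 1.0 / x**2              # the Lotka integrand with n = 2
exact = 0.5                           # integral of 1/x^2 from 1 to 2
print(abs(trapezoid(f, 1, 2) - exact))  # ~0.125
print(abs(simpson(f, 1, 2) - exact))    # ~0.0046
```

On this convex, decreasing integrand the trapezoid overshoots the area noticeably, while the parabola hugs the curve; this is the intuition behind replacing Pao's degree-1 rule with the degree-2 rule.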

RESULTS AND ANALYSIS

Data
As this paper is solely focussed on developing a new method to derive the value of the constant, no new dataset has been collected. In his celebrated paper, Lotka counted 6,891 unique contributors/authors whose names begin with the letters A and B, drawn from the 1907-1916 Decennial Index of Chemical Abstracts, volumes 1-10. The Auerbach data is a collection of 1,325 prominent physicists from Geschichtstafeln der Physik up to 1900; only those physicists were covered who made significant contributions to Physics. [24] For the sake of proof of concept, the old datasets [Chemical Abstracts and Auerbach data, originally compiled by Lotka himself in 1926] used by Pao [2] have been used here in order to continue the legacy.

Data Analysis
After deriving the mathematical form, it is necessary to test whether the threshold point P should be fixed at 20 or elsewhere.
After setting the exponent value n = 2, we put in several values of P to check where the error is minimised. Table 1 shows that, using Equation 10, the error is minimised at P = 21; beyond P = 21 the difference goes negative, which means that, using Simpson's 1/3 rule, the estimate of the inverse-square infinite sum no longer fits. Using the trapezoidal rule, the error (difference) is minimised at P = 20; with our method the threshold can be fixed at P = 21, and this is a significant development in the estimation process, since the inclusion of more data points and better error minimisation are more acceptable. As mentioned by Pao, Lotka decided to use the simplest possible solution to calculate the constant because of its mathematical elegance, and subsequently concluded that "the proportion of all contributors that make a single contribution is about 60%". It is needless to say that more approximation is needed even though an inverse power relation exists between authors and their contributions. Testing with higher-degree numerical integration results in the inclusion of more data points, as expressed through the P value, whereas Coile's testing on Murphy's data and Vlachy's effort were based on the inverse-square function rather than re-calculation of the values of n and C. Table 1 shows the successive calculation of Σ 1/x² for contiguous P values. The values of the constant C at different exponent values are presented in Table 2. The results of the Kolmogorov-Smirnov test (KS test) of the observed and expected distributions of the Chemical Abstracts data are presented in Table 3. The results of the KS test of the observed and expected values of the authors' productivity distribution of the Auerbach data are presented in Table 4. The comparative study between the values of C (the constant) and n (the exponent) for both the Chemical Abstracts and Auerbach data is presented in Table 5, along with the methods applied and the results of hypothesis testing.
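The successive-partial-sum computation behind Table 1 can be sketched as follows (illustrative only; this reproduces the running sum Σ 1/x² and its residual against π²/6, not the paper's Equation 10):

```python
import math

# Successive partial sums of 1/x**2 for contiguous P values, together with
# the residual against the exact value pi^2/6.
exact = math.pi**2 / 6
partial = 0.0
rows = []
for P in range(1, 26):
    partial += 1.0 / P**2
    rows.append((P, partial, exact - partial))

# Show the region around P = 20, where the threshold discussion takes place.
for P, s, resid in rows[17:23]:
    print(f"P={P:2d}  sum={s:.6f}  residual={resid:.6f}")
```

The raw residual is always positive and shrinks slowly; it is only after adding a quadrature-based tail estimate (trapezoidal or Simpson) that the difference can cross zero, which is what the sign change reported beyond P = 21 refers to.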
The value of the exponent n depends on multiple factors: the size of the bibliography, authors' collaboration behaviour, the nature of the subject (Physical Sciences, Social Sciences, Biological Sciences, Humanities and Arts), etc. Small changes in its decimal places can change the value of C considerably.

Calculation of Constant C (for Chemical Abstracts Data)
So, Lotka's equation becomes φ(x) = 0.566847 / x^{1.8878}.

Statistical Test
The maximum difference lies beyond the critical value at the 0.01 significance level of the Kolmogorov-Smirnov test. Therefore the null hypothesis is rejected, which means the given distribution is different from the theoretical distribution φ(x) = 0.566847 / x^{1.8878}.
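The KS statistic used in Tables 3 and 4 is the maximum absolute gap between the observed and expected cumulative proportions; a minimal sketch on hypothetical counts (the data and fitted values below are invented for illustration, and 1.63/√N is the commonly used asymptotic critical coefficient at the 0.01 level):

```python
def ks_dmax(observed_counts, expected_probs):
    """Maximum absolute difference between the observed and expected
    cumulative distribution functions (the KS statistic D_max)."""
    total = sum(observed_counts)
    d_max, cum_obs, cum_exp = 0.0, 0.0, 0.0
    for obs, p in zip(observed_counts, expected_probs):
        cum_obs += obs / total
        cum_exp += p
        d_max = max(d_max, abs(cum_obs - cum_exp))
    return d_max

# Hypothetical author counts for x = 1..5 papers, against a fitted Lotka
# model phi(x) = C / x**n with C = 0.6 and n = 2 (illustrative values only).
counts = [610, 150, 65, 38, 25]
model = [0.6 / x**2 for x in range(1, 6)]
d = ks_dmax(counts, model)
crit = 1.63 / sum(counts) ** 0.5   # asymptotic KS critical value, 0.01 level
print(d, crit, d > crit)
```

The decision rule is the one used in the paper: reject the fit when D_max exceeds the critical value, accept it otherwise.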
Our method thus reproduces, for the Chemical Abstracts data, exactly the phenomenon explained by Pao. [2]

For Auerbach Data
So, Lotka's equation becomes φ(x) = 0.6151553 / x^{2.021}. Here we can see that the maximum difference is less than the critical value at the 0.01 significance level, and thus the null hypothesis is accepted. That means φ(x) = 0.6151553 / x^{2.021} fits the observed values of the Auerbach data fairly well.

DISCUSSION AND CONCLUSION
The null hypothesis is rejected for the Chemical Abstracts data under both the trapezoidal rule and Simpson's 1/3 rule, whereas it is accepted for the Auerbach data under both rules. This paper has presented the possible scope of remodelling and restructuring the Pao method in order to fit Lotka's law to any authors' productivity dataset in any subject domain. The Pao method is rigorous and precise and has been used to validate Lotka's Law in different domains for more than two decades. Yet no method can be taken as 100% accurate; every method carries some error term up to some finite level.
The main challenge is how to minimise the error and how much accuracy can be achieved in the area-under-the-curve calculation. Simpson's 1/3 rule additionally uses the mid-point between the two end points, and by this rule more area under the curve can be covered, resulting in better precision. Whatever the Pao method [2] explained about the Chemical Abstracts and Auerbach data, the same phenomena can be explained by this method. This research may trigger experiments with higher degrees of numerical integration methods; a similar effort has recently been seen by Basu and Dutta [25] in implementing Simpson's 3/8 rule on the same datasets as used by Pao.
According to Chen and Leimkuhler, [26] a common functional relationship can be established between Bradford's, Zipf's and Lotka's laws. Basically, Lotka's Law takes a frequency-size approach, Bradford's Law a cumulative-frequency log-rank approach, and Zipf's Law a frequency-rank approach.
Though functional relations can be established mathematically, the very nature of the data and its distribution are different. An attempt was made by the authors of this paper to apply Simpson's rule in order to develop new methods of calculation for Bradford's and Zipf's laws, but it failed. There are other numerical methods for calculating the area under a curve, e.g. Lagrange interpolation functions, the bisection method, cubic spline methods, etc., which are untouched areas to date. Future research may explore these domains for Lotka's Law and other scientometric laws as well.
Basu and Dutta: Lotka's Law with Simpson's 1/3 Rule

Calculation of the constant is always very problematic. Essentially, the constant C can be derived from the following equation form:

C = 1 / Σ_{x=1}^{∞} 1/x^n

i.e., C is the inverse of the summation of the series. The value of C depends on the particular domain/subject, the age of the subject, the collaborating behaviour among researchers, the research aptitude of individual scientists and many other factors.
The integral form of Simpson's 1/3 rule is

∫_a^b f(x) dx ≈ ((b − a)/6) [f(a) + 4 f((a + b)/2) + f(b)]

and the error estimate in the case of Simpson's 1/3 rule is

|E| ≤ ((b − a)^5 / 2880) max_{ξ ∈ [a,b]} |f⁗(ξ)|

When we define the limits [a, b] as the unit interval [x, x+1] and put the value of M (the bound on |f⁗|) into Eqn. 18, then set x = P, P+1, P+2, P+3, ... ∞ and sum the inequalities, expanding the resulting expression yields Eqns. 19-24. In the same line, we need to estimate the midpoint term, which leads to Eqns. 25-29. Re-arranging Eqn. 25 with the values of Eqns. 28 and 29, and putting in the sum in Eqn. 31, the constant C of Lotka's law is transformed into the form of Eqn. 32.
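The building block of this derivation, Simpson's 1/3 rule applied on each unit interval [x, x+1], can be checked numerically; a sketch follows (this verifies the per-interval approximation of the tail integral only, not the paper's Eqn. 32):

```python
def simpson_unit(f, x):
    """Simpson's 1/3 estimate of the integral of f over the unit
    interval [x, x+1], using the endpoints and the midpoint x + 1/2."""
    return (f(x) + 4.0 * f(x + 0.5) + f(x + 1)) / 6.0

f = lambda t: 1.0 / t**2   # Lotka integrand with n = 2
P = 20

# Summing the per-interval Simpson estimates from P up to a large cutoff
# approximates the exact tail integral of 1/x^2 (which equals 1/P - 1/cutoff).
approx = sum(simpson_unit(f, x) for x in range(P, 100_000))
print(approx, 1.0 / P - 1.0 / 100_000)
```

Because the error of Simpson's rule on a unit interval is bounded by max|f⁗|/2880, and f⁗(x) = 120/x⁶ decays rapidly, the summed estimate is extremely close to the exact tail integral, which is what makes the tail correction in the derivation so tight.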

Table 3: Kolmogorov-Smirnov test of observed and expected distributions of the Chemical Abstracts data.

Table 4: KS test of the observed and expected values of the authors' productivity distribution of the Auerbach data.
The Pao method is based on the trapezoidal rule, where two successive points are connected by a straight line fitted between them. In Simpson's 1/3 rule, a quadratic is fitted within the interval [a, b] with n = 2. As a parabola is fitted to the curve over [a, b], the chance of excluding area under the curve is minimised; in effect, the constant value C gets more optimized, and that gives a better approximation, with a very minute difference between the critical value and the maximum difference. The pattern of the results with Simpson's 1/3 rule and with the Pao method is the same: using the Simpson's 1/3 method, it was found that Lotka's law is not obeyed in the case of the Chemical Abstracts data, but Lotka's Law does fit the Auerbach data. The basic difference between the Pao method and our method lies in the use of a higher-degree Newton-Cotes formula. Fundamentally, the trapezoidal rule deploys the linking of two intermediate points and then includes/excludes areas as the curve moves up to its terminal distribution point, whereas in our method Simpson's 1/3 rule connects three points: the two end points and the mid-point between them.
Table notes: critical values at the 0.01 significance level; maximum differences (D_max) 0.020719 and 0.023584.