This is Part II of our Literature Review series. Part I: How to Judge Impact Factors lays the foundation for the focus of this post. If you’re not already familiar with accessing impact factors for journals and researchers, give it a quick watch (it’s only 5 minutes) or read the edited transcript. This video and post will focus on:

How to make sure I found the most appropriate papers (by topic), and enough of them (how much is enough?).

Thanks again to AstraGlacialia on Reddit for raising the question. If you would rather read, I’ve provided an edited transcript below.


First, a brief disclaimer: the analysis I’m going to do here is not perfectly quantitative. These are rough order-of-magnitude estimates, and the point is the trends rather than the exact numbers. Because of that, you can apply what I’m talking about to any field (as far as I know).

Applying the Pareto Distribution

I took the four researchers that we were talking about in the previous video (David Cahill, my former advisor Alan, Jeremy England, and myself) and plotted our citation counts, which I extracted from Google Scholar. Then I applied a distribution known as the Pareto distribution, which comes from Pareto’s principle. Pareto’s principle (also known as the 80/20 rule) states that, in this context, a very small number of researchers have a lot of citations and a very large number of researchers have very few citations [1]. It’s a lopsided distribution.

Literature Review: Pareto distribution of citations

I plotted it here for varying values of the parameter N, which defines the distribution. It’s typically between 1 and 3 depending on the field you’re looking at, the subset of journals that researchers published in, etc. The Pareto principle applies to citation numbers for a given paper, not just for a researcher.

This shows that for every one David Cahill (and “David Cahill” could stand for any top researcher in any field), there are roughly between 10 and 100 researchers like my former advisor, Alan. In Jeremy England’s range there are maybe 100, 1,000, or even 10,000 researchers. The point is that for every one “David Cahill” there are many of “us”: the people below with fewer citations.
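To make that scaling concrete, here is a minimal sketch. It assumes that N is the exponent of a Pareto tail, so that the number of researchers with at least c citations falls off as c^(-N); the function name and the citation gaps of 10x and 100x are hypothetical choices for illustration, not values read off the plot.

```python
def researchers_per_top_researcher(citation_gap: float, n: float) -> float:
    """Rough count of researchers at a lower citation level per one top researcher.

    Assumes a Pareto tail: the number of researchers with at least c citations
    is proportional to c ** (-n). If the top researcher has `citation_gap`
    times more citations than the lower level, the ratio of head counts is
    citation_gap ** n.
    """
    if citation_gap < 1 or n <= 0:
        raise ValueError("citation_gap must be >= 1 and n must be > 0")
    return citation_gap ** n


# Hypothetical gaps: the top researcher has 10x or 100x the citations.
for n in (1.0, 2.0, 3.0):
    for gap in (10, 100):
        ratio = researchers_per_top_researcher(gap, n)
        print(f"N={n:.0f}, {gap}x fewer citations: ~{ratio:,.0f} researchers per top researcher")
```

With a 10x citation gap, N between 1 and 2 gives the "between 10 and 100" range quoted above. Treat this as an order-of-magnitude illustration only.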

Literature Review: Pareto distribution of citations overlay

Also important to note: on the y-axis lies a given paper and the number of citations it would receive. That distribution is also more or less Pareto distributed, meaning a very small number of papers have a very large number of citations while most papers [2] have very few [3]. Further, the number of citations is correlated with impact factor. For example, David Cahill has lots of citations and also happens to publish in high impact journals. These attributes are all related.

Literature Review: Impact factor citations by funding group size

If we take all of these correlated things and draw them out in a top-down fashion, we find that the highest impact institutions (in STEM fields: MIT, CalTech, Stanford) have the highest impact researchers. They publish lots of papers with lots of citations in high impact journals.

David Cahill is at the top of this hierarchy. My former advisor might be in the next layer. There’s me, and Jeremy England.


The takeaways from the previous model and diagram are:

  • The highest impact researchers are at the highest impact universities. These are correlated. In other words, MIT produces a lot of Science and Nature articles. MIT could represent any top university in any field and Science and Nature could represent any top journal in any field. These things will typically come together.
  • The highest impact researchers have the most total citations and publish papers with a large number of citations. This is interesting because “there are a whole bunch at the top,” meaning:
  • You can get significant literature coverage (say >50%) by focusing on the top universities, top researchers, and top papers.

To reiterate, these are all just rough order of magnitude estimates. The models are just meant to prove a point about how these things scale and trend.
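One way to sanity-check the ">50%" figure is the standard formula for how much of the total a Pareto-distributed quantity concentrates in its top slice. This is a sketch under two assumptions of mine: that "coverage" can be approximated by share of total citations, and that N is the Pareto tail exponent with N > 1 (the formula breaks down at N ≤ 1, where the mean is infinite). The function names are hypothetical.

```python
import math


def top_share(top_fraction: float, n: float) -> float:
    """Share of all citations held by the top `top_fraction` of papers.

    For a Pareto distribution with tail exponent n > 1, the top fraction p
    holds a share p ** (1 - 1/n) of the total.
    """
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    if n <= 1:
        raise ValueError("n must be > 1 for the total to be finite")
    return top_fraction ** (1 - 1 / n)


def fraction_needed(target_share: float, n: float) -> float:
    """Smallest top fraction of papers that holds `target_share` of citations."""
    if not 0 < target_share <= 1:
        raise ValueError("target_share must be in (0, 1]")
    if n <= 1:
        raise ValueError("n must be > 1 for the total to be finite")
    return target_share ** (n / (n - 1))


# n = log(5)/log(4) is the exponent that reproduces the classic 80/20 rule.
for n in (math.log(5) / math.log(4), 1.5, 2.0, 3.0):
    print(
        f"N={n:.2f}: top 20% holds {top_share(0.2, n):.0%} of citations; "
        f"50% of citations sits in the top {fraction_needed(0.5, n):.1%}"
    )
```

Under these assumptions, half of all citations sits in well under half of the papers for any N in the 1 to 3 range, which is the trend the takeaways describe.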

The next question is, where is the other 50 percent of the literature coverage? Most of that is going to come from groups underneath this hierarchy: all the researchers and all the papers that those groups produce.

There is a sort of family of researchers, institutions, and journals within a given topic or subtopic that you want to be aware of, because chances are they’re relevant. You can get most of your literature review coverage from those two segments of your given field, topic, and subtopic: the top researchers and schools, and the groups that form underneath them. There will be other odds and ends in terms of literature review, and all of this depends on how broad the scope of your project is, how many interdisciplinary topics it touches, and other factors.

There are more techniques for doing a literature review, but I think this covers a couple of nice points. It helps to paint the landscape a little bit better.

Please post any questions or comments in the comment section below and I’ll address them in the next video. See ya!

Jason Larkin
Cofounder & Principal PhD Mentor

This post is Part II in our series on Literature Review: