Automating Document Retrieval for a Literature Review

Someone I know is an academic working on a literature review. She needed several thousand articles from a couple of different publishers. I was not willing to let her attempt to download the files with a web browser, at least not without some kind of automation. When I taught people about computers, one of the things I tried to teach them was that computers are good at one thing: doing the same thing over and over again. If you find yourself using a computer and doing the same thing over and over again, you’re doing something wrong (or perhaps the system you’re using is very, very bad).

We both had legitimate access to a university library’s resources, and she had met with a librarian about the project, its scope, and how to move forward. This is not a story about how to illegally retrieve copyrighted material. What follows is some information about how I did it that might be helpful to you, if you’re an academic who wants to download more than a couple hundred articles. Or maybe, it’ll make you want to find a way to get a little pile of money so I can do something like this for you.

Tell me what you want, what you really, really want

The first step in retrieving the articles was getting a list of the articles we needed to retrieve. We used a patchwork of methods to do this. One was Constellate. It lets you do full-text searches of the journals it knows about and gives you a CSV that you can then Do Stuff With. We selected the references that Constellate thought were articles and then grabbed a list of DOIs. It’s possible to get the full text of the articles from Constellate, but the researchers wanted the full original PDFs, so I didn’t try using their text.

I even had a video conference or two with a very helpful person at Constellate, which made it somewhat easier to figure things out. They also explained that, for “historical reasons” (I think Constellate pre-dates DOIs), sometimes the DOIs provided by Constellate are sort-of bogus DOIs that point to a copy on JSTOR rather than to the article’s actual DOI. I wrote another script that reads the file generated by Constellate and attempts to find the actual DOI whenever Constellate had provided a bogus one. This one I did in Python, which has some decent tools for dealing with CSVs.

Part of that script’s magic relies on tools at crossref.org: it does a lookup by calling a URL like https://doi.crossref.org/openurl?pid={USER}&title={JNL_encoded}&volume={VOL}&issue={ISSUE}&spage={PAGE}&redirect=false and also generates download URLs to get those files (in a web browser) through the university proxy server. This mostly got me a list of actual DOIs for the stuff we wanted.
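
For the curious, here is a minimal sketch of that kind of lookup in Python, assuming you have the journal title, volume, issue, and starting page from the Constellate CSV. The CROSSREF_PID placeholder stands in for the registered email address that Crossref’s OpenURL resolver expects, and the exact shape of the XML response may differ a bit from what this scans for.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Crossref's OpenURL resolver wants a registered email address as the "pid";
# this one is a placeholder.
CROSSREF_PID = "you@university.edu"

def lookup_doi(journal, volume, issue, spage):
    """Ask Crossref's OpenURL resolver for the real DOI of an article,
    given the citation details pulled from the Constellate CSV."""
    params = urllib.parse.urlencode({
        "pid": CROSSREF_PID,
        "title": journal,
        "volume": volume,
        "issue": issue,
        "spage": spage,
        "redirect": "false",
    })
    url = "https://doi.crossref.org/openurl?" + params
    with urllib.request.urlopen(url) as resp:
        root = ET.fromstring(resp.read())
    # The response is XML; if a match was found, there is a <doi> element in it.
    for el in root.iter():
        if el.tag.endswith("doi") and el.text:
            return el.text.strip()
    return None
```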

Getting Wiley

I first worked on articles from Wiley because they have an API for Text and Data Mining. That made it possible to write a script that takes a DOI as an argument, retrieves the file, checks that it’s really a PDF (a rate-limited response isn’t one), and names the file after the DOI. Since DOIs start with 10.xxx/, it creates a directory named for the 10.xxx part of the DOI and saves the PDF in it, with the remainder of the DOI as its filename. This is a satisfying way to organize PDFs for a computer scientist, but probably not for a Normal Person; more on that later.

With this script in place, it was possible to take the list of DOIs, call the script for each one, and have it pull them all down. In spite of adding some delays to slow things down, I still got rate limited sometimes, so I had the script notice when that happened and try again after waiting a while.

The script checks to see if the file has been downloaded before it tries to retrieve it, so it’s safe to run it multiple times to make sure that some didn’t get lost due to rate limiting.
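A minimal sketch of that kind of downloader follows; it is not my actual script. The API endpoint and the Wiley-TDM-Client-Token header are written from memory of Wiley’s TDM documentation, so check the current docs, and the token is read from an environment variable rather than hard-coded.

```python
import os
import sys
import time
import urllib.error
import urllib.parse
import urllib.request

# Token issued by Wiley for TDM use; kept in the environment rather than the script.
TOKEN = os.environ["WILEY_TDM_TOKEN"]
# Endpoint as I recall it from Wiley's TDM docs; verify against current documentation.
API = "https://api.wiley.com/onlinelibrary/tdm/v1/articles/"

def fetch(doi):
    prefix, _, suffix = doi.partition("/")
    outfile = os.path.join(prefix, suffix.replace("/", "-") + ".pdf")
    if os.path.exists(outfile):            # safe to re-run: skip files we already have
        return
    os.makedirs(prefix, exist_ok=True)
    req = urllib.request.Request(
        API + urllib.parse.quote(doi, safe=""),
        headers={"Wiley-TDM-Client-Token": TOKEN},
    )
    for attempt in range(5):
        try:
            with urllib.request.urlopen(req) as resp:
                data = resp.read()
            if data.startswith(b"%PDF"):   # a rate-limited response isn't a PDF
                with open(outfile, "wb") as f:
                    f.write(data)
                return
        except urllib.error.HTTPError:
            pass
        time.sleep(60 * (attempt + 1))     # back off, then try again
    print(f"giving up on {doi}", file=sys.stderr)

if __name__ == "__main__":
    for doi in sys.argv[1:]:
        fetch(doi)
        time.sleep(2)                      # a polite pause between requests
```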

Taylor and Francis: It was the worst of times; it was the best of times

Next I went about retrieving files from Taylor and Francis. Their Text and Data Mining page said “If you are planning to carry out TDM activity, we recommend that you contact us to ensure we can provide any access and support you may require,” and went on to say to email them. That seemed hard.

The good news is that T&F download URLs can be constructed from the DOI (unlike lots of other publishers, which generate a unique download URL only after you click around in the web browser). I was able to write another script that builds the download URL and opens it in the browser when I hit return, for every. Single. Article. This wasn’t ideal, but it worked. Sort of. After some hundreds of articles, I got an ominous message that made it look like the entire university had been blocked from using T&F. Thankfully, it went away after a day or two, so I tried again. This time I was scolded and told to contact someone, so I did, which, after a bit of faffing about, resulted in an ideal solution. I should have done it sooner.
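
Before things went sideways, that script was about as simple as a script can be. Here is a sketch; the /doi/pdf/<DOI> URL pattern and the EZproxy-style prefix are assumptions, and you would substitute whatever your own library’s proxy actually looks like.

```python
import webbrowser

# Assumed URL shapes: T&F serves PDFs at /doi/pdf/<DOI>, and the (made-up)
# EZproxy prefix below routes the request through the campus proxy server.
PROXY_PREFIX = "https://login.proxy.example.edu/login?url="
PDF_URL = "https://www.tandfonline.com/doi/pdf/{doi}"

with open("tandf_dois.txt") as f:          # one DOI per line; the filename is hypothetical
    for doi in (line.strip() for line in f):
        if not doi:
            continue
        input(f"Press return to open {doi} ...")   # one keypress per article
        webbrowser.open(PROXY_PREFIX + PDF_URL.format(doi=doi))
```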

Librarians for the Win!

I was put in touch with another librarian, who was very accommodating; they had never worked with a faculty member who needed to use the T&F data mining services, so it took us a while to get it all worked out. It was made more complicated by my using Linux and thinking that I should be able to use a YubiKey, rather than The Sanctioned App, as my second factor. Once that was worked out, though, it was perfect, and even easier than my beloved Wiley.

What’s required is for the librarian to register a particular IP address with T&F, which is then given time-limited permission to freely download from the T&F site. The way the university chose to do that was to set up a virtual Windows machine at Amazon, which I could reach (once I figured out how to make the VPN work under Linux) via some remote-control software (also made more difficult because I used Linux and needed incantations that I don’t think would be required under Mac or Windows). Once I’d jumped through those hoops, I could simply curl the very same URLs I had been passing to the web browser, now pointed directly at the T&F site rather than traversing the campus proxy server.

I briefly considered writing a script to run on the Windows machine, but instead just used an Emacs keyboard macro that converted the DOIs into the required curl commands (curl is a command-line tool that retrieves URLs, available for just about every operating system). Once that was done, I was able to retrieve ALL OF THE PDFs in one fell swoop: 2220 files downloaded in just over half an hour with a single command. I did try getting a generative AI to write a script to download the DOIs, but I envisioned spending an hour or more getting the file onto the machine, finding or installing an editor to debug with, and a bunch of other things that made me tired, before deciding that my Emacs keyboard macro was much more expeditious.
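
For reference, the keyboard macro was effectively producing what this little Python loop does: turn each DOI into a curl command that fetches the PDF straight from T&F (no proxy needed, since the machine’s IP is the registered one). The URL pattern and the DOI-list filename are the same assumptions as before.

```python
import subprocess

# Turn each DOI into a curl invocation that pulls the PDF directly from T&F.
# The /doi/pdf/<DOI> URL pattern is assumed, as above.
with open("tandf_dois.txt") as f:
    for doi in (line.strip() for line in f):
        if not doi:
            continue
        outfile = doi.replace("/", "-") + ".pdf"
        subprocess.run(
            ["curl", "-L", "-o", outfile, f"https://www.tandfonline.com/doi/pdf/{doi}"],
            check=False,
        )
```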

Once I downloaded the files, I opened massive.io on the remote Windows machine and dragged and dropped the folder with all 2220 files to the site and a few minutes later was able to retrieve the files to the physical laptop computer on my own network.

Helping out with the sorting

The project at hand was looking at several words having to do with methodology. For most things we were able to do keyword searches before downloading the files, but not every article containing those words is actually doing the thing the word implies. So you’d then need to open each PDF, find the methodology section, find the word, and see whether the authors are really doing The Thing, or explaining why they didn’t do The Thing, or just using the word for some other reason. Opening 2220 files is something of a bother.

Zotero to the Rescue!

My solution was to use Zotero to tag the files that were actually doing The Thing. Zotero is an open source reference manager. I hadn’t used it in a very long time, and I’m pleased to report that if I were still doing academic writing, there’s a good chance I’d use it (though I’d have to figure out how to get it to sync with BibTeX, which it might do quite well). Zotero was chosen because I knew I could install it both on my Linux box and her Mac.

First, we created three separate groups by having Zotero search the files for The Words. That was a good start, but opening all of those files was still daunting. It turns out Zotero has some scripting abilities, and I wrote a script that looks at the full text of each article and pulls out the few lines surrounding The Words. In almost every case, that was enough text to discern whether the article needed to be included in the data set. It was then just a few keystrokes to add a tag indicating that the article had been checked and matched the criteria. From there, we were able to perform three separate exports (one for each keyword) that were then imported into ATLAS.ti in three batches for the actual analysis.
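
The script itself ran inside Zotero (whose scripting is JavaScript), but the idea is simple enough to sketch as a standalone Python version: pull the plain text out of a PDF with pdftotext and keep a couple of lines of context around each occurrence of a keyword. The file path and keyword here are just whatever you pass on the command line.

```python
import subprocess
import sys

CONTEXT = 2   # lines of context to keep on each side of the keyword

def keyword_context(pdf_path, keyword):
    """Return short snippets surrounding each occurrence of keyword, enough
    to judge whether an article is really Doing The Thing."""
    text = subprocess.run(
        ["pdftotext", pdf_path, "-"],       # "-" sends the extracted text to stdout
        capture_output=True, text=True, check=True,
    ).stdout
    lines = text.splitlines()
    snippets = []
    for i, line in enumerate(lines):
        if keyword.lower() in line.lower():
            snippets.append("\n".join(lines[max(0, i - CONTEXT): i + CONTEXT + 1]))
    return snippets

if __name__ == "__main__":
    for snippet in keyword_context(sys.argv[1], sys.argv[2]):
        print(snippet)
        print("-" * 40)
```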

Other little tools

Normalize-filename

Remember how I’d put all of the PDFs in 10.123 directories with names like ‘the-rest-of-the-doi.pdf’? Or, with the T&F files, some version of the article’s title? That’s no good. Most people would be satisfied with the names that Zotero uses, but instead I wrote another little script that calls pdftotext on the first page of the PDF, looks for “DOI:” to get the article’s DOI, and then uses the Crossref API to get the author, year, title, and journal (which I convert to a shortened version) to construct a filename something like "$year-$author-$short_j-$FILE_DOI.pdf". I wrote another version that renames a file to the DOI with the slash replaced by a dash, avoiding the extra-directory issue.
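
A sketch of that renaming logic, assuming pdftotext is installed and using the public Crossref REST API: the journal-shortening map is a placeholder, and the DOI is found with a loose regular expression rather than anything clever.

```python
import json
import re
import subprocess
import sys
import urllib.parse
import urllib.request

# Short forms for journal titles in filenames; these entries are placeholders.
SHORT_JOURNALS = {"Journal of Example Studies": "JExSt"}

def normalized_name(pdf_path):
    # Extract the text of page 1 only and look for something DOI-shaped on it.
    first_page = subprocess.run(
        ["pdftotext", "-f", "1", "-l", "1", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    ).stdout
    m = re.search(r"\b10\.\d{4,9}/\S+", first_page)
    if not m:
        return None
    doi = m.group(0).rstrip(".,;")
    # Ask the Crossref REST API for the article's metadata.
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    with urllib.request.urlopen(url) as resp:
        msg = json.load(resp)["message"]
    year = msg["issued"]["date-parts"][0][0]
    author = msg["author"][0]["family"]
    journal = msg["container-title"][0]
    short_j = SHORT_JOURNALS.get(journal, journal.replace(" ", ""))
    return f"{year}-{author}-{short_j}-{doi.replace('/', '-')}.pdf"

if __name__ == "__main__":
    print(normalized_name(sys.argv[1]))
```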

Get all articles from a journal

I also wrote one that, given a single DOI from a journal, will get a list of all of that journal’s articles. It would be better if it did one year at a time, since getting ALL of the articles is a bit much, but if you don’t have a reliable full-text search, getting all of the articles is one way to solve that problem.
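
Here is a sketch of how that can work against the Crossref REST API: look up the seed DOI to get the journal’s ISSN, then page through the journal’s works with a cursor. (A from-pub-date/until-pub-date filter would be the way to do one year at a time.)

```python
import json
import sys
import urllib.parse
import urllib.request

def journal_dois(seed_doi):
    """Given one DOI from a journal, find the journal's ISSN via Crossref and
    then page through every work Crossref lists for that ISSN."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(seed_doi)
    with urllib.request.urlopen(url) as resp:
        issn = json.load(resp)["message"]["ISSN"][0]
    cursor = "*"
    while True:
        page_url = (f"https://api.crossref.org/journals/{issn}/works"
                    f"?rows=1000&cursor={urllib.parse.quote(cursor)}")
        with urllib.request.urlopen(page_url) as resp:
            msg = json.load(resp)["message"]
        for item in msg["items"]:
            yield item["DOI"]
        cursor = msg.get("next-cursor")
        if not cursor or not msg["items"]:
            break

if __name__ == "__main__":
    for doi in journal_dois(sys.argv[1]):
        print(doi)
```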

Get DOI from an Endnote file

Sometimes you can get the library’s or publisher’s tools to do a full-text search and give you the resulting list in an EndNote file. I wrote a little script to pull the DOIs out of the EndNote file for subsequent retrieval.
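
A sketch of that extraction, assuming the tags I remember: EndNote’s tagged export marks the DOI with %R, and RIS exports use DO; failing either, it falls back to grabbing anything that looks like a DOI.

```python
import re
import sys

# "%R" is the DOI tag in EndNote's tagged export; "DO  -" is the RIS tag.
DOI_LINE = re.compile(r"^(?:%R|DO\s+-)\s*(\S+)", re.MULTILINE)
BARE_DOI = re.compile(r"\b10\.\d{4,9}/\S+")

def dois_from_endnote(path):
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    found = [m.group(1) for m in DOI_LINE.finditer(text)] or BARE_DOI.findall(text)
    # Strip any URL prefix and trailing punctuation, keeping just the bare DOI.
    return [d.replace("https://doi.org/", "").rstrip(".,;") for d in found]

if __name__ == "__main__":
    for doi in dois_from_endnote(sys.argv[1]):
        print(doi)
```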

Need help with your own Literature Review?

In my dreams, someone would stumble on my literature-review-scripts repo and find the scripts useful. In practice, they are likely to be useful only to someone who already understands them. If you are doing a big lit review, perhaps they can serve as inspiration for developing your own scripts, or, better for me, you could scrape some money from your grant or startup funds and have me whip up scripts that will actually be useful to you, and make sure you know how to make them work.
