Mining the World's Research Literature in India: An American Technologist's Bold Plan, Featured in Nature

Selected from Nature

Author: Priyanka Pulla. Compiled by Heart of Machine.

The latest issue of Nature describes an ambitious project launched in India by an American technologist. He has built a database of text and images extracted from 73 million articles published from 1847 to the present, and plans to open it to data mining, including texts used without authorization. The approach could help many disciplines, but its legality is not yet clear.


Carl Malamud stands in front of the servers; his team is preparing to mine 73 million papers.

Carl Malamud is trying to liberate the information locked behind paywalled papers, and his effort has won considerable support.

Malamud has spent decades publishing copyright-protected legal documents, from building codes to court records, insisting that they represent public-domain law and should be open to all citizens. Now the 60-year-old American technologist is turning to a new goal: liberating paywalled scientific literature, which he believes can be done legally.

Over the past year, Malamud has worked with Indian researchers to build a huge store of text and images extracted from 73 million articles dating from 1847 to the present.

The project's cache, housed at Jawaharlal Nehru University (JNU) in New Delhi, is still under construction and has a storage capacity of 576 TB.

Malamud and his JNU partners have named the project the JNU data depot. "The JNU data depot does not hold every article from every journal in history, but it holds a lot of them," he said. Its size is comparable to the core collection of the Web of Science database.

The JNU database does not allow anyone to read or download documents from it, because that would infringe publishers' rights. Instead, Malamud envisages that researchers will use computer software to mine the text and data, scanning the world's scientific literature and extracting key information without actually reading the text.

This unprecedented project quickly attracted wide interest because, for the first time, it could open paywalled literature to large-scale computational analysis. Dozens of research teams already mine papers to build databases of genes and chemicals, and to map associations between diseases and proteins in order to generate useful scientific hypotheses.

But publishers' control often limits the pace and reach of such projects, because they frequently allow access only to abstracts rather than full text. Researchers in India, the United States and the United Kingdom already plan to use the JNU database, and many professors are interested in the project.

However, the legal status of such a repository is still unclear. Malamud consulted several intellectual property lawyers before creating the project, hoping to avoid litigation. "We think what we do is legal," he said. For now, he is moving carefully: the JNU database is air-gapped, meaning that no one can access it over the Internet. Users must access it physically, on site. At present, only researchers who are not mining data for profit can obtain access. Malamud said his team plans to open up remote access later and to move forward step by step.

The Power of Data Mining

Max H, Bioinformatics researcher, University of California, Santa Cruz

Chris Hartgerink, a part-time statistician at the QUEST Center for Transforming Biomedical Research in Berlin, Germany, said that he mines only articles from open-access publishers because "such an operation on closed publishers' articles would cause a lot of trouble". A few years ago, while Hartgerink was still a PhD student in the Netherlands, three publishers blocked his access to their journals after he tried to download articles for text mining.

Some countries have amended their laws to allow researchers on non-commercial projects to mine legally acquired articles without the permission of copyright owners. Britain passed such a law in 2014, and the European Union voted to pass a similar law this year.

Even so, university scholars are often still limited to mining abstracts from databases, and abstracts provide far less information than full articles.


Carl Malamud and Andrew Lynn examine the Jawaharlal Nehru University (JNU) project, which aims to extract text and images from 73 million papers.

Scientists who want to mine research articles must also overcome technical barriers. Publishers use a variety of formats, so extracting text is not easy, and this is the problem the JNU team is currently working on. PDF-to-text conversion tools often fail to mark the boundaries between paragraphs, footnotes and images. Once the JNU team has solved these problems, however, others will be spared the time and effort. Malamud said the team was about to complete the first round of extraction from the 73 million papers (although the output still needed to be checked for errors), and he expected the database to be ready by the end of this year.

Benefits across multiple fields

Early enthusiasts are ready to use the JNU database. They include Gitanjali Yadav, a computational biologist at the National Institute of Plant Genome Research (NIPGR) in Delhi, India, and a lecturer at Cambridge University in the United Kingdom. In 2006, Yadav established EssOilDB, a database of chemicals secreted by plants, at NIPGR. Today, drug research and development groups and perfume manufacturers turn to EssOilDB for leads. Yadav believes that the "compendium provided by Carl" can help her database.

Building such databases has never been easy. To create EssOilDB, Yadav's team had to search PubMed and Google Scholar for relevant papers, extract data from the full texts they could find, and copy out tables from rare journals in person. Yadav said the JNU database could speed up this collection process, and her team is currently writing queries to extract the data.

Srinivasan Ramachandran, a bioinformatics researcher at the Institute of Genomics and Integrative Biology (IGIB) in Delhi, India, is also enthusiastic about Malamud's project. His team maintains a database of genes linked to type 2 diabetes, which it has built by crawling abstracts from PubMed. Now he hopes the JNU database can extend the scope of their data mining.

A team at MIT's Knowledge Futures Group says it hopes to mine the JNU database to gain insight into how academic publishing has evolved. James Weis, a PhD student at the MIT Media Lab and a member of the team, said the team hopes to use the database to predict emerging areas of research and to identify alternatives to the current conventional measures of academic impact.

Is it legal?

Malamud says it doesn't matter where the articles he uses come from. Data mining is non-consumptive, he argues: that is to say, researchers who mine the data do not read or display most of the articles they analyze. "You can't enter a DOI to get that article," he said. Malamud also believes that text mining of copyrighted content is legal in countries such as the United States. Google Books did something similar to the JNU project, scanning thousands of copyrighted books without buying them and displaying snippets of them in its search service, although users were not allowed to download or read the books in full. US courts ruled that Google's book scanning did not constitute infringement.

Joseph Gratz, a lawyer for Google, said the Google Books case was a test of whether non-consumptive data mining is legal. Although Google displayed snippets of the books, the court held that the limited length of the displayed text was not enough to constitute infringement. Moreover, the books Google scanned had been lawfully obtained (in many cases from libraries), although the authors' permission was not sought. Gratz said copyright owners might argue that if the JNU database contains content from Sci-Hub or other unauthorized sources, its situation differs from Google's. However, no case involving unauthorized sources has yet been argued in US courts, making it difficult to predict how a court would rule. "There are good reasons to argue that the source of the material is irrelevant, but some people believe that the source is important."

Of course, whether this is legal in the United States may not matter that much, because the project is based in India and what counts is Indian law, said a professor at American University.

India's copyright law may favour Malamud's approach, which is another reason why he built his project in New Delhi. Arul George Scaria, an assistant professor at National Law University, Delhi, said that Google-style scanning would be considered fair use of copyrighted content under the research exemption in Section 52 of India's copyright law.

Not everyone agrees with this view. T. Prashant Reddy, a legal researcher at the Vidhi Centre for Legal Policy in New Delhi, said Section 52 allows researchers to copy journal articles for personal use, but does not necessarily allow wholesale reproduction of journals' full text, as in the JNU database. Reddy said that not sharing entire articles with users does help with the copyright problem, but bulk copying of text to create a database is still in a "gray area".

A risky plan

When Nature contacted 15 publishers about the JNU database, six of them said they had never heard of the project before and declined to comment on its legality until further information was available. But those six publishers, including Elsevier, Springer Nature, BMJ, the American Association for the Advancement of Science and the National Academy of Sciences, said researchers must first obtain authorization to mine their papers.

Malamud acknowledged the risks of the project, but he believes it is morally important, especially in India. Indian universities and government laboratories spend a great deal of money on journal subscriptions, yet they still cannot afford all the journals they need. Data released by Sci-Hub show that Indians are among the heaviest users of its website, suggesting that university licenses do not go far enough. The open-access movement in Europe and the United States is valuable, but India also needs to free up access to scientific knowledge. Malamud said, "I don't think we can wait for Europe and the United States to solve this problem, because time is pressing."

Link to the original text: https://www.nature.com/articles/d41586-019-02142-1

