Advances in natural language processing (NLP) and large language models mean that it's becoming possible for machines to understand natural language, including the prose in scientific papers. That could change how scientific literature is created and consumed over the next few years.
TL;DR: NLP now works. Anybody with a technical skillset can train a model to solve any number of NLP tasks. Tooling has become better and we're starting to see scientific natural language processing being scaled up into real products.
A prototypical work in this space is SPECTER, a system from the Allen Institute for AI for finding semantically related papers. It starts from SciBERT, a BERT-like model pretrained with a masked language modeling task on unlabeled scientific text: words are blanked out at random and the transformer is trained to predict the missing word. The model produces one vector per token, in this case a 768-dimensional one.
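To make the masking step concrete, here is a minimal sketch of how training examples for masked language modeling could be built. It is a simplification, not the actual SciBERT pipeline: the function name `mask_tokens`, the whitespace tokenization, and the 30% masking rate in the example are all assumptions for illustration, and real BERT-style training uses subword tokens and a slightly more elaborate masking scheme.

```python
import random

MASK = "[MASK]"  # placeholder token shown to the model in place of a hidden word


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Blank out tokens at random and return (masked_tokens, labels).

    labels[i] is the original token where position i was masked, else None.
    The model would be trained to predict the non-None labels.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Hypothetical example sentence, split on whitespace for simplicity.
tokens = "the protein binds to the receptor".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

The training objective is then to recover each hidden token from its surrounding context, which is what forces the network to learn useful per-token representations.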
The pre-trained network is then fine-tuned on a set of papers with a triplet loss: it pulls similar papers together and pushes dissimilar papers apart in embedding space. Citations are used as a proxy signal for similarity. The result is a model that generates a fixed-length dense semantic vector for each document, which can be used to retrieve similar documents.
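The triplet objective and the retrieval step can be sketched in a few lines. This is an illustrative toy, not SPECTER's code: the helper names (`l2_distance`, `triplet_loss`, `nearest`), the use of Euclidean distance, the margin of 1.0, and the 2-dimensional vectors are assumptions chosen to keep the arithmetic easy to follow. In practice the vectors would be the 768-dimensional document embeddings and the loss would be minimized by gradient descent.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(0.0, l2_distance(anchor, positive) - l2_distance(anchor, negative) + margin)


def nearest(query, corpus):
    """Return the key of the corpus vector closest to the query vector."""
    return min(corpus, key=lambda key: l2_distance(query, corpus[key]))


# Toy embeddings: the query paper, a paper it cites, and an unrelated paper.
query = [0.0, 0.0]
cited = [0.0, 1.0]
unrelated = [3.0, 4.0]

print(triplet_loss(query, cited, unrelated))  # cited paper is already much closer
print(triplet_loss(query, unrelated, cited))  # wrong ordering is penalized
print(nearest(query, {"cited": cited, "unrelated": unrelated}))
```

The loss only penalizes triplets where the cited paper is not yet closer than the unrelated one by the margin, so training effort concentrates on the pairs the model still gets wrong. Once trained, retrieval reduces to a nearest-neighbor search over the document vectors.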