There are cases where the inputs to your Transformer model are pairs of sentences, but your application requires you to process each sentence of the pair at a different time. Search applications are one example.
Search applications involve a large collection of documents that can be pre-processed and stored before any search takes place. A query, on the other hand, only arrives when a search is triggered, and can only be processed in real time. The goal of a search app is to return the most relevant documents for the query as quickly as possible. …
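One way to meet that constraint is to encode and store the document vectors offline, and only encode the query at search time. The sketch below illustrates the offline/online split with a toy hashing-based encoder standing in for a real Transformer; the encoder, documents, and query are all made up for illustration.

```python
import hashlib
import math

def toy_encode(text, dim=64):
    """Stand-in for a Transformer sentence encoder: hash each word into a
    bucket of a fixed-size vector, then L2-normalize. A real system would
    use a model's embeddings instead."""
    vec = [0.0] * dim
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Offline: pre-process the document collection once and store the vectors.
documents = {
    "doc1": "effects of temperature on coronavirus survival",
    "doc2": "economic impact of lockdown policies",
}
doc_vectors = {doc_id: toy_encode(text) for doc_id, text in documents.items()}

# Online: encode only the query at search time and rank the stored vectors
# by their dot product with the query vector.
query_vector = toy_encode("coronavirus temperature sensitivity")
ranked = sorted(
    doc_vectors,
    key=lambda d: sum(q * v for q, v in zip(query_vector, doc_vectors[d])),
    reverse=True,
)
print(ranked)
```

Only the query-side encoding happens in real time; the expensive document-side work is done ahead of the search action.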
This post describes a simple way to get started with fine-tuning transformer models. It will cover the basics and introduce you to the amazing Trainer class from the transformers library. You can run the code from Google Colab, but do not forget to enable GPU support.
!pip install pandas transformers
To fine-tune the BERT models for the cord19 application, we need to generate a set of query-document features and labels that indicate which documents are relevant for the specific queries. …
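As a sketch of what such a training set looks like, the table below pairs each query with candidate documents and a binary relevance label. The column names and rows are made up for the example; the real labels come from the TREC-COVID relevance judgments.

```python
import pandas as pd

# Illustrative query-document pairs with binary relevance labels.
training_data = pd.DataFrame({
    "query": [
        "coronavirus origin",
        "coronavirus origin",
        "coronavirus immunity",
    ],
    "doc_text": [
        "the origin of sars-cov-2 remains under investigation",
        "airline travel trends during 2020",
        "antibody responses after sars-cov-2 infection",
    ],
    "label": [1, 0, 1],  # 1 = relevant to the query, 0 = not relevant
})

# BERT-style models consume the (query, doc_text) pair as two segments
# and are trained to predict the label.
print(training_data.shape)  # (3, 3)
```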
This is the second in a series of blog posts that will show you how to improve a text search application, from downloading data to fine-tuning BERT models.
The previous post showed how to download and parse TREC-COVID data. This one will focus on evaluating two query models available in the cord19 search application. Those models will serve as baselines for future improvements.
You can also run the steps contained here from Google Colab.
We can start by downloading the data that we have processed before.
import requests, json
from pandas import read_csv

# The file locations were elided here; `topics_url` and `relevance_data_url`
# are placeholders for the processed files generated in the previous post.
topics = json.loads(requests.get(topics_url).text)
relevance_data = read_csv(relevance_data_url)
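With topics and relevance judgments in hand, a baseline evaluation can be as simple as checking how many of the judged-relevant documents a query model returns. The sketch below computes recall for one topic against a hypothetical list of retrieved ids; the judgments are made up, but follow the TREC-COVID shape of one (topic_id, cord_uid, relevancy) row per judged pair.

```python
from pandas import DataFrame

# Made-up relevance judgments in the TREC-COVID shape.
judgments = DataFrame({
    "topic_id": [1, 1, 1, 2],
    "cord_uid": ["a1", "b2", "c3", "d4"],
    "relevancy": [1, 1, 0, 1],
})

def recall(topic_id, retrieved_ids, judgments):
    """Fraction of the judged-relevant documents that were retrieved."""
    relevant = set(
        judgments[(judgments.topic_id == topic_id) & (judgments.relevancy > 0)]
        .cord_uid
    )
    return len(relevant & set(retrieved_ids)) / len(relevant)

# A hypothetical query model returned these ids for topic 1:
print(recall(1, ["a1", "c3"], judgments))  # 0.5 (found a1, missed b2)
```

Running the same computation for every topic, for each of the two query models, gives the baseline numbers that later improvements must beat.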
Vespa is, in my opinion, the fastest, most scalable, and most advanced search engine currently available. It has a native tensor evaluation framework, can perform approximate nearest neighbor search, and can deploy the latest developments in NLP modeling, such as BERT models.
This post will give you an overview of the Vespa python API available through the pyvespa library. The main goal of the library is to allow for faster prototyping and to facilitate Machine Learning experiments for Vespa applications.
We are going to connect to the CORD-19 search app and use it as an example here. You can later use your own application to replicate the following steps. …
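A minimal interaction might look like the following sketch, which assumes pyvespa's `Vespa(url=...)` constructor and `app.query(body=...)` method; the query terms are made up, and the live call is shown in comments since it requires pyvespa and network access.

```python
# A minimal query body for pyvespa's app.query(body=...) call: select which
# fields to return, match the user's terms, and cap the number of hits.
body = {
    "yql": "select title, abstract from sources * where userQuery();",
    "query": "coronavirus temperature sensitivity",
    "hits": 5,
}

# Connecting and sending the query (requires pyvespa and network access):
# from vespa.application import Vespa
# app = Vespa(url="https://api.cord19.vespa.ai")
# result = app.query(body=body)
# for hit in result.hits:
#     print(hit["fields"].get("title"))
```

Because the query is just a Python dict, it is easy to vary the YQL, the number of hits, or other parameters programmatically when running experiments.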
This is the first in a series of blog posts that will show you how to improve a text search application, from downloading data to fine-tuning BERT models.
You can also run the steps contained here from Google Colab.
The team behind vespa.ai has built and open-sourced a CORD-19 search engine. Thanks to advanced Vespa features such as Approximate Nearest Neighbor Search and Transformers support via ONNX, it comes with the most advanced NLP methodology currently applied to search.
Our first step is to download relevance judgments to be able to evaluate current query models deployed in the application and to train better ones to replace those already there. …
The Vespa team has been working non-stop to put together the cord19.vespa.ai search app based on the COVID-19 Open Research Dataset (CORD-19) released by the Allen Institute for AI. Both the frontend and the backend are 100% open-sourced. The backend is based on vespa.ai, a powerful and open-sourced computation engine. Since everything is open-sourced, you can contribute to the project in multiple ways.
As a user, you can either search for articles through the frontend or perform advanced searches using the public search API. As a developer, you can contribute by improving the existing application through pull requests to the backend and frontend, or you can fork and create your own application, either locally or through Vespa Cloud, to experiment with different ways to match and rank the CORD-19 articles. My goal with this piece is to give you an overview of what can be accomplished with Vespa by using the cord19 search app's public API. …
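An advanced search against the public API can be expressed directly in YQL. The sketch below restricts matching to the abstract field; the endpoint path is an assumption based on Vespa's standard query API, and the live request is shown in comments since it needs network access.

```python
import requests

# Public search endpoint of the cord19 app (path assumed from Vespa's
# standard query API).
endpoint = "https://api.cord19.vespa.ai/search/"

# An advanced query expressed directly in YQL: restrict matching to the
# abstract field instead of matching the terms anywhere in the document.
params = {
    "yql": 'select title, abstract from sources * where abstract contains "coronavirus";',
    "hits": 5,
}

# Sending the request (requires network access):
# hits = requests.get(endpoint, params=params).json()["root"]["children"]
# titles = [hit["fields"]["title"] for hit in hits]
```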
The COVID-19 Open Research Dataset can help researchers and the health community in the fight against a global pandemic. The Vespa team is contributing by releasing a search app based on the dataset. Since the data comes with no reliable labels to judge a good search result from a bad one, we would like to propose an objective criterion for evaluating search results that does not rely on human-annotated labels. We use this criterion to run experiments and evaluate the value delivered by term-matching and semantic signals. …
If we want to investigate the power and limitations of semantic vectors (pre-trained or not), we should ideally prioritize datasets that are less biased towards term-matching signals. This piece shows that the MS MARCO dataset is more biased towards those signals than we expected and that the same issues are likely present in many other datasets due to similar data collection designs.
MS MARCO is a collection of large-scale datasets released by Microsoft with the intent of advancing deep learning research related to search. It was our first choice when we decided to create a tutorial showing how to set up a text search application with Vespa. The dataset was getting a lot of attention from the community, in great part due to the intense competition around its leaderboards. …
Vespa.ai has just published two tutorials to help people get started with text search applications by building scalable solutions with Vespa. The tutorials are based on the full document ranking task released by Microsoft's MS MARCO dataset team.
The first tutorial helps you create and deploy a basic text search application with Vespa, as well as download, parse, and feed the dataset to a running Vespa instance. It also shows how easy it is to experiment with ranking functions based on the built-in ranking features available in Vespa.
The second tutorial shows how to create a training dataset containing Vespa ranking features that allow you to start training ML models to improve the app’s ranking function. It also illustrates the importance of going beyond pointwise loss functions when training models in a learning to rank context. …
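To make the point about pointwise losses concrete, here is a toy comparison with made-up scores and labels: a pointwise loss scores each document independently, while a pairwise loss directly penalizes ranking an irrelevant document above a relevant one, which is what actually matters for ranking quality.

```python
import math

# Model scores for three documents retrieved for one query (made up),
# with binary relevance labels.
scores = [2.0, 1.0, 0.5]
labels = [1, 0, 1]

# Pointwise: logistic loss on each (score, label) pair independently.
pointwise = sum(
    math.log(1 + math.exp(-s if y == 1 else s))
    for s, y in zip(scores, labels)
) / len(scores)

# Pairwise: logistic loss on score differences over every
# (relevant, irrelevant) pair -- it only cares about relative order.
pairs = [(s_pos, s_neg)
         for s_pos, y_pos in zip(scores, labels) if y_pos == 1
         for s_neg, y_neg in zip(scores, labels) if y_neg == 0]
pairwise = sum(math.log(1 + math.exp(-(p - n))) for p, n in pairs) / len(pairs)

print(round(pointwise, 3), round(pairwise, 3))
```

Note that the pairwise loss is driven entirely by the relevant document scored below the irrelevant one (0.5 vs. 1.0), while the pointwise loss also spends effort pushing absolute scores toward 0 or 1.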