How to ensure training and serving encoding compatibility

There are cases where the inputs to your Transformer model are pairs of sentences, but you want to process each sentence of the pair at different times due to your application’s nature. Search applications are one example.


The search use case

Search applications involve a large collection of documents that can be pre-processed and stored before any search takes place. A query, on the other hand, triggers a search action and can only be processed in real time. The goal of a search app is to return the documents most relevant to the query as quickly as possible. …
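As a rough illustration of the idea, the sketch below joins token ids that were produced at different times (documents offline, the query at request time) into a single BERT-style input. It is a minimal sketch, not the post's actual code: the function name join_encodings and the example token ids are hypothetical, and the special-token ids are assumed to be those of bert-base-uncased.

```python
# Assumed special-token ids (those of bert-base-uncased); check your own tokenizer.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def join_encodings(query_ids, doc_ids, max_length):
    """Combine separately tokenized query and document ids into one pair input."""
    # Room left for the document after [CLS], the query, and two [SEP] tokens.
    doc_budget = max_length - len(query_ids) - 3
    if doc_budget < 0:
        raise ValueError("query is too long for max_length")
    doc_ids = doc_ids[:doc_budget]

    input_ids = [CLS_ID] + query_ids + [SEP_ID] + doc_ids + [SEP_ID]
    # Segment 0 covers [CLS] + query + [SEP]; segment 1 covers document + [SEP].
    token_type_ids = [0] * (len(query_ids) + 2) + [1] * (len(doc_ids) + 1)
    attention_mask = [1] * len(input_ids)

    padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * padding,
        "token_type_ids": token_type_ids + [0] * padding,
        "attention_mask": attention_mask + [0] * padding,
    }


# Hypothetical token ids: the document was tokenized and stored ahead of time,
# the query is tokenized when the search request arrives.
stored_doc_ids = [2023, 2003, 1037, 6254]
query_ids = [7592, 2088]
encoded = join_encodings(query_ids, stored_doc_ids, max_length=10)
```

If the joined output matches what the tokenizer produces when it encodes the same query-document pair in a single call during training, the training and serving encodings are compatible.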


Set up a custom Dataset, fine-tune BERT with Transformers Trainer, and export the model via ONNX

This post describes a simple way to get started with fine-tuning transformer models. It will cover the basics and introduce you to the amazing Trainer class from the transformers library. You can run the code from Google Colab but do not forget to enable GPU support.


We use a dataset built from the COVID-19 Open Research Dataset Challenge. This work is one small piece of a larger project whose goal is to build the cord19 search app.

Install required libraries

!pip install pandas transformers

Load the dataset

To fine-tune the BERT models for the cord19 application, we need to generate a set of query-document features and labels that indicate which documents are relevant for the specific queries. …
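To make the shape of that training data concrete, here is a minimal sketch that turns relevance judgments into labeled query-document rows. The function name build_training_pairs, the dictionary layout, and the toy data are all hypothetical; the real dataset in the post is loaded from files and may be organized differently.

```python
def build_training_pairs(queries, documents, judgments):
    """Return one {"query", "document", "label"} row per judged pair."""
    rows = []
    # judgments maps (query_id, doc_id) -> 1 if relevant, 0 otherwise.
    for (query_id, doc_id), label in sorted(judgments.items()):
        rows.append(
            {
                "query": queries[query_id],
                "document": documents[doc_id],
                "label": label,
            }
        )
    return rows


# Hypothetical toy data standing in for the cord19 queries and documents.
queries = {1: "coronavirus origin"}
documents = {
    "d1": "Bats as a reservoir of coronaviruses",
    "d2": "Hospital staffing models",
}
judgments = {(1, "d1"): 1, (1, "d2"): 0}

pairs = build_training_pairs(queries, documents, judgments)
```

Each row can then be tokenized as a text pair and wrapped in a custom Dataset for fine-tuning.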


Using pyvespa to evaluate cord19 search application ranking functions currently in production.

This is the second in a series of blog posts that will show you how to improve a text search application, from downloading data to fine-tuning BERT models.

The previous post showed how to download and parse TREC-COVID data. This one will focus on evaluating two query models available in the cord19 search application. Those models will serve as baselines for future improvements.
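For readers new to search evaluation, the sketch below shows two metrics commonly used to compare query models: reciprocal rank and recall. This is an illustration under assumptions, not the evaluation code used in the post; the function names and the example document ids are hypothetical.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1 / position of the first relevant document, or 0.0 if none is found."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0


def recall(ranked_ids, relevant_ids):
    """Return the fraction of relevant documents that appear in the ranked list."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids) & set(relevant_ids)) / len(relevant_ids)


# Hypothetical result list for one query and its judged-relevant documents.
ranked = ["d3", "d1", "d7"]
relevant = {"d1", "d9"}

rr = reciprocal_rank(ranked, relevant)
rec = recall(ranked, relevant)
```

Averaging such per-query values over all topics gives one number per query model, which makes baselines easy to compare.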

You can also run the steps contained here from Google Colab.


Download processed data

We can start by downloading the data that we have processed before.

import requests, json
from pandas import read_csv

topics = json.loads(requests.get(
"https://thigm85.github.io/data/cord19/topics.json").text
)
relevance_data = read_csv(
"https://thigm85.github.io/data/cord19/relevance_data.csv" …


A pyvespa library overview: Connect, query, collect data and evaluate query models.

Vespa is the fastest, most scalable and most advanced search engine currently available, imho. It has a native tensor evaluation framework, can perform approximate nearest neighbor search and can deploy the latest developments in NLP modeling, such as BERT models.

This post will give you an overview of the Vespa python API available through the pyvespa library. The main goal of the library is to allow for faster prototyping and to facilitate Machine Learning experiments for Vespa applications.


We are going to connect to the CORD-19 search app and use it as an example here. You can later use your own application to replicate the following steps. …


Your first step to improve the cord19 search application.

This is the first in a series of blog posts that will show you how to improve a text search application, from downloading data to fine-tuning BERT models.

You can also run the steps contained here from Google Colab.

The team behind vespa.ai has built and open-sourced a CORD-19 search engine. Thanks to advanced Vespa features such as Approximate Nearest Neighbors Search and Transformers support via ONNX, it comes with the most advanced NLP methodology applied to search that is currently available.

Our first step is to download relevance judgments to be able to evaluate current query models deployed in the application and to train better ones to replace those already there. …


A taste of what you can do with Vespa

The Vespa team has been working non-stop to put together the cord19.vespa.ai search app based on the COVID-19 Open Research Dataset (CORD-19) released by the Allen Institute for AI. Both the frontend and the backend are 100% open-sourced. The backend is based on vespa.ai, a powerful and open-sourced computation engine. Since everything is open-sourced, you can contribute to the project in multiple ways.


As a user, you can either search for articles through the frontend or run advanced searches through the public search API. As a developer, you can contribute by improving the existing application through pull requests to the backend and frontend, or you can fork it and create your own application, either locally or through Vespa Cloud, to experiment with different ways to match and rank the CORD-19 articles. My goal with this piece is to give you an overview of what can be accomplished with Vespa by using the cord19 search app public API. …


Objective criteria for text search results and some surprising results

The COVID-19 Open Research Dataset can help researchers and the health community in the fight against a global pandemic. The Vespa team is contributing by releasing a search app based on the dataset. Since the data comes with no reliable labels to tell a good search result from a bad one, we would like to propose objective criteria to evaluate search results that do not rely on human-annotated labels. We use these criteria to run experiments and evaluate the value delivered by term-matching and semantic signals. …


And likely not many other widely used datasets either

If we want to investigate the power and limitations of semantic vectors (pre-trained or not), we should ideally prioritize datasets that are less biased towards term-matching signals. This piece shows that the MS MARCO dataset is more biased towards those signals than we expected and that the same issues are likely present in many other datasets due to similar data collection designs.


MS MARCO is a collection of large-scale datasets released by Microsoft with the intent of helping advance deep learning research related to search. It was our first choice when we decided to create a tutorial showing how to set up a text search application with Vespa. It was getting a lot of attention from the community, in great part due to the intense competition around leaderboards. …


Getting started with Text Search

Vespa.ai has just published two tutorials to help people get started with text search applications by building scalable solutions with Vespa. The tutorials are based on the full document ranking task released by Microsoft’s MS MARCO team.

The first tutorial helps you create and deploy a basic text search application with Vespa, as well as download, parse and feed the dataset to a running Vespa instance. It also shows how easy it is to experiment with ranking functions based on built-in ranking features available in Vespa.

The second tutorial shows how to create a training dataset containing Vespa ranking features that allow you to start training ML models to improve the app’s ranking function. It also illustrates the importance of going beyond pointwise loss functions when training models in a learning to rank context. …
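To illustrate the difference between loss types, the sketch below contrasts a pointwise loss with a pairwise one. It is a generic illustration, not the tutorial's code: the function names are hypothetical, and binary cross-entropy and a logistic loss on the score difference are used here as common examples of each family.

```python
import math


def pointwise_loss(score, label):
    """Binary cross-entropy on one document, scored in isolation."""
    probability = 1.0 / (1.0 + math.exp(-score))
    return -(label * math.log(probability) + (1 - label) * math.log(1.0 - probability))


def pairwise_loss(relevant_score, irrelevant_score):
    """Logistic loss on the score difference: only the relative order matters."""
    return math.log(1.0 + math.exp(-(relevant_score - irrelevant_score)))


# Shifting both scores by the same amount leaves the ranking unchanged.
original_pair = pairwise_loss(2.0, 1.0)
shifted_pair = pairwise_loss(-1.0, -2.0)

# The pointwise loss on the relevant document, however, changes with the shift.
original_point = pointwise_loss(2.0, 1)
shifted_point = pointwise_loss(-1.0, 1)
```

The pairwise loss is identical before and after the shift because the order of the two documents is the same, while the pointwise loss penalizes the shifted scores more even though the ranking did not change. In ranking, the relative order is what the user sees.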

About

Thiago G. Martins

Working on Vespa.ai. Follow me on Twitter @Thiagogm
