Contextual Word Representations Applied to Insurance

Published on October 21, 2020

Mohamed Hanini, CEO & Founder at Koïos Intelligence

Montreal, October 21, 2020 – Koïos Intelligence has been awarded a project with a total cost of $615,000, funded through the PROMPT organization and the Ministry of Economy and Innovation of Quebec, to work on contextual word representations. We are introducing a language model based on BERT deep architectures to tackle Natural Language Processing tasks for the insurance industry.

Being at the forefront of state-of-the-art Natural Language Processing (NLP) is a necessity for Koïos Intelligence. Our virtual supply chain is mainly composed of tasks related to Intent Classification and Named Entity Recognition.
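As an illustration of these two tasks, the sketch below prototypes intent classification and Named Entity Recognition with off-the-shelf Hugging Face pipelines; the message, the candidate intent labels and the default pipeline models are illustrative assumptions, not the components used in our production system.

```python
# Minimal sketch of the two tasks using Hugging Face pipelines.
# The message, intent labels and default models are illustrative assumptions.
from transformers import pipeline

message = ("My name is Marie Tremblay and I would like to add my "
           "2020 Honda Civic to my auto insurance policy.")

# Zero-shot intent classification over a small, hypothetical label set.
intent_clf = pipeline("zero-shot-classification")
intents = ["file a claim", "update a policy", "get a quote"]
print(intent_clf(message, candidate_labels=intents)["labels"][0])

# Named Entity Recognition on the same message.
ner = pipeline("ner", aggregation_strategy="simple")
for entity in ner(message):
    print(entity["entity_group"], entity["word"])
```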

Since Turing proposed to consider the question “Can machines think?” [Computing Machinery and Intelligence], the Question-Answering task has attracted a lot of attention due to the complexity of the underlying process: first, understanding the need (intent), that is, the information expressed in natural language by a question; then retrieving the resources relevant to composing an answer. The main goal of the Question-Answering task is to effectively encode conversation messages (questions) into a latent semantic vector space, so as to produce answers that are as accurate as possible. Better intent classification, along with adequate response prediction, will significantly increase the performance of a dialogue system.
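A minimal sketch of this encode-and-retrieve idea, assuming a mean-pooled BERT encoder and cosine similarity for answer selection (the model name and the small answer pool are placeholders):

```python
# Sketch: encode questions/answers into a latent vector space with BERT
# and retrieve the closest answer by cosine similarity.
# The model name and the small answer pool are illustrative placeholders.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Mean-pool the last hidden states into one vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state       # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)          # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)           # (B, H)

answers = [
    "Collision coverage pays for damage to your own vehicle.",
    "A deductible is the amount you pay before the insurer pays.",
]
question = "What does collision insurance cover?"

scores = torch.nn.functional.cosine_similarity(embed([question]), embed(answers))
print(answers[scores.argmax()])
```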

One of the biggest challenges in Natural Language Processing is the scarcity of labeled training data, even though a large amount of data is publicly available in various areas of interest. Conventional Natural Language Processing models require task-specific labeled training data, which involves substantial human effort given that deep networks need large amounts of data. Unlike classic word-embedding architectures such as GloVe [Global Vectors for Word Representation] and word2vec continuous vector representations [Efficient Estimation of Word Representations in Vector Space], which associate a static vector with each word, Universal Language Model Fine-Tuning (ULMFiT), Embeddings from Language Models (ELMo) and Bidirectional Encoder Representations from Transformers (BERT) generalize the word-embedding mechanism by constructing word representations that take both syntax and semantic context into account.
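To make the distinction concrete, the sketch below compares BERT's contextual vectors for the same surface word in two different contexts; with a static embedding such as GloVe or word2vec, both occurrences would share a single vector. The word and model choices are illustrative.

```python
# Sketch: the same word receives different BERT vectors in different contexts,
# whereas a static embedding (GloVe/word2vec) would assign it a single vector.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    """Return the contextual vector of the first occurrence of `word`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]        # (T, H)
    word_id = tokenizer.convert_tokens_to_ids(word)
    position = (enc["input_ids"][0] == word_id).nonzero()[0, 0]
    return hidden[position]

v1 = word_vector("The claim was filed after the accident.", "claim")
v2 = word_vector("I claim that the premium is too high.", "claim")
print(torch.nn.functional.cosine_similarity(v1, v2, dim=0).item())  # < 1.0
```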

Transfer Learning typically re-uses the information encoded in the lower layers of a deep network. These lower layers capture generic information; in an image-classification context, for instance, one can think of indicators of text areas or of the contrasts between homogeneous regions.
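One common way to exploit this, sketched below under the assumption of a Hugging Face BERT checkpoint, is to freeze the generic lower layers and fine-tune only the upper layers together with the task-specific head.

```python
# Sketch: reuse the generic lower layers of BERT by freezing them,
# and fine-tune only the upper layers plus the task-specific head.
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Freeze the embeddings and the first 8 of the 12 encoder layers.
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```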

Most BERT architectures available on the market are trained on a generic corpus such as Wikipedia, which makes current models inefficient for Question-Answering tasks in a specific area of interest. A common approach is to fine-tune a sentence-pair classification model with pre-trained BERT parameters, which requires loading the pre-trained BERT architecture and adding a classification layer. Indeed, most commercial architectures use Large-Scale Language Models (LSLMs) such as BERT and its descendants, such as GPT-2 and XLNet, which require significant computation during inference due to their 12/24 layers of stacked multi-head self-attention and fully connected layers. BERT’s performance on downstream tasks improves when extra training examples from supervised tasks are added.
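A minimal sketch of that fine-tuning setup, assuming the Hugging Face transformers API (the sentence pair, label count and optimizer settings are placeholders):

```python
# Sketch: load pre-trained BERT, add a sentence-pair classification head,
# and run one fine-tuning step. Texts, labels and optimizer are placeholders.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2   # e.g. relevant / not relevant
)

question = "Does my policy cover water damage?"
candidate = "Water damage caused by a burst pipe is covered."
batch = tokenizer(question, candidate, return_tensors="pt")
labels = torch.tensor([1])              # 1 = the candidate answers the question

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)  # loss computed from the added head
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```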

Accelerating the training of expensive architectures such as BERT is a topic of debate. Typically, deep neural networks require optimizing a gigantic number of parameters; the original BERT model holds 110 million of them. Optimization methods based on Stochastic Gradient Descent (SGD) are often used during the training phase of DNNs in order to find network weights that minimize a loss function. This iterative approach is sensitive to the size of the gradient step, often referred to as the learning rate.
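As a reminder of the mechanics, the sketch below runs plain SGD on a toy regression loss; the step size lr is exactly the learning rate discussed here, and the toy model stands in for a real network.

```python
# Sketch: plain SGD on a toy loss to illustrate the learning-rate step.
# The toy model and data are placeholders, not a BERT training setup.
import torch

torch.manual_seed(0)
x = torch.randn(256, 10)
y = x @ torch.randn(10, 1) + 0.1 * torch.randn(256, 1)

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # the learning rate
loss_fn = torch.nn.MSELoss()

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)          # loss to be minimized
    loss.backward()                      # gradients w.r.t. the weights
    optimizer.step()                     # w <- w - lr * grad
print(float(loss))
```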

Naively, we are tempted to set a fairly large learning rate in order to converge more quickly. In reality, the convergence of the estimators, the weights in our case, is very sensitive to the learning rate, hence the importance of training the model with small batches, which require a small learning rate (Wilson and Martinez 2003) to maintain the stability of the model.

“The learning rate is perhaps the most important hyperparameter. If you have time to tune only one hyperparameter, tune the learning rate” [Deep Learning, 2016].

Optimizing hyperparameters is therefore a research subject in itself. A combination of grid search and manual search over a set of trials is the most common strategy for tuning the learning rate.
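A minimal grid-search sketch over candidate learning rates; the toy training run below is a placeholder for a short fine-tuning run followed by validation.

```python
# Sketch: grid search over candidate learning rates on a toy regression task.
# The toy data and model stand in for a real fine-tuning run with validation.
import torch

def train_and_validate(lr, steps=50):
    torch.manual_seed(0)                 # same toy data for every candidate
    x = torch.randn(128, 5)
    y = x @ torch.randn(5, 1)
    model = torch.nn.Linear(5, 1)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return float(loss)                   # stand-in for a validation loss

candidate_lrs = [1e-3, 1e-2, 1e-1]
results = {lr: train_and_validate(lr) for lr in candidate_lrs}
best_lr = min(results, key=results.get)
print(f"best learning rate: {best_lr}, loss: {results[best_lr]:.4f}")
```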

The optimization of a model’s hyperparameters remains insufficiently studied, and the choice of strategy in current architectures is often arbitrary.