In January, at our AI for Business Meetup in Berlin, our colleague Dr. Tae-Gil Noh from the OMQ development team took the opportunity to give a presentation about transformer models in customer service.
In this article, we summarize this lecture and give insight into the kind of technology we use and what we have experienced in the time that we have worked with transformer models.
The core task of NLU is finding out whether a given request text matches a certain knowledge base case or not. It is a decision made through a text-to-text comparison. For example, in an e-commerce context, “I want my money back” should match “refund”, which the machine has to know and recognize. However, finding out that “X” does not match “Y” is equally important in this process. This textual comparison forms the core of our engine.
The three different representations that OMQ currently keeps are the following:
The classical search model is not trained. It works with a bag of lemmatized tokens and makes decisions with a classical IR rating score.
Here is an example to explain this: The input sentence “Yesterday, we saw two dogs walking down the park in parallel” is tokenized in the following way: [Yesterday, we, saw, two, dogs, walking, down, the, park, in, parallel]. After lemmatization and the removal of stop words, it becomes: [yesterday, see, two, dog, walk, down, the, park, parallel]. For words such as “see”, “walk” and “dog”, the root forms have been recovered for the search.
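The pipeline above can be sketched in a few lines. This is a toy illustration only: the tiny lemma dictionary and stop-word list are made-up stand-ins for a real lemmatizer, and the overlap count stands in for a proper IR rating score such as TF-IDF or BM25.

```python
# Toy sketch of the untrained search model: lemmatize tokens, drop
# stop words, then rate a request against a knowledge base entry by
# counting shared lemmas. LEMMAS and STOP_WORDS are illustrative.

LEMMAS = {"saw": "see", "dogs": "dog", "walking": "walk"}
STOP_WORDS = {"we", "in", "a", "an", "of"}

def lemmatized_bag(text):
    """Return the bag of lemmatized, stop-word-filtered tokens."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return [LEMMAS.get(w, w) for w in words if w and w not in STOP_WORDS]

def overlap_score(request, case):
    """Crude stand-in for a classical IR rating: shared lemma count."""
    return len(set(lemmatized_bag(request)) & set(lemmatized_bag(case)))

print(lemmatized_bag("Yesterday, we saw two dogs walking down the park in parallel"))
# → ['yesterday', 'see', 'two', 'dog', 'walk', 'down', 'the', 'park', 'parallel']
```

Because no model is trained here, the search only finds what it can reach through shared word roots; this is exactly the limitation the learned representations below address.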
The engine converts text into a representation space, compares the two representations, and eventually decides whether the meanings are semantically equivalent. Here we work with a shallow neural network: a joint model that combines a word-vector-based model with a composition model. It starts from pre-trained sub-word vectors, and the joint model is fine-tuned on domain data, both self-supervised and supervised.
The representation maps a text to a single vector in a concept space. If a request conceptually talks about the same topic as a knowledge base case, their vectors land close together, and a similarity metric is used to decide whether this is the case or not.
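The concept-space idea can be sketched minimally: compose word vectors into one text vector (here by plain averaging) and compare texts with cosine similarity. The 3-dimensional vectors below are invented for illustration; the real model uses learned sub-word vectors and a trained composition model, not a lookup table.

```python
# Minimal concept-space sketch: one vector per text, cosine similarity
# as the decision metric. WORD_VECS values are made up for this demo.

import math

WORD_VECS = {
    "refund":   [0.9, 0.1, 0.0],
    "money":    [0.8, 0.2, 0.1],
    "back":     [0.7, 0.1, 0.2],
    "shipping": [0.0, 0.9, 0.3],
}

def text_vector(words):
    """Compose word vectors into a single text vector by averaging."""
    vecs = [WORD_VECS[w] for w in words if w in WORD_VECS]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

sim = cosine(text_vector(["money", "back"]), text_vector(["refund"]))
print(round(sim, 3))  # high similarity: both texts land near the same concept
```

Note that “money back” and “refund” share no tokens, so the classical search above would miss this match, while the vector comparison catches it.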
This model is a slightly deeper and bigger neural network, in which transformers are used. Pre-trained transformers are fine-tuned on the domain data, both self-supervised and supervised. The output is not a single vector but a sequence of vectors that holds the full syntactic and semantic information of the text. A task layer then makes the decision on top of this sequence.
The year 2019 was the turning point for neural network models. Previously, the models were too weak and could not absorb all the data they were given. Given enough data, today's models will actually follow and reproduce the data's meaning.
Modern Natural Language Processing engines are transformer models. Google's BERT and Facebook's XLM, among others, are based on the same neural network mechanism, known as self-attention.
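The core of self-attention can be shown in a bare-bones sketch: every token vector attends to every other token, weighted by a softmax over scaled dot products. Real models like BERT and XLM add learned query/key/value projections, multiple heads, and many stacked layers; all of that is stripped away here to expose the central computation.

```python
# Bare-bones self-attention: each output vector is a softmax-weighted
# mix of all input vectors, with dot products scaled by sqrt(dim).
# No learned projections or multiple heads -- illustration only.

import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Return one attention-mixed output vector per input vector."""
    d = len(vectors[0])
    out = []
    for q in vectors:  # each token acts as a query over all tokens
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, vectors))
                    for i in range(d)])
    return out

# Three toy 2-d token vectors; the output keeps the same shape.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(self_attention(tokens))
```

Because the attention weights sum to one, each output stays a convex combination of the inputs; the learned projections in a real transformer are what make these mixtures express syntax and semantics.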
OMQ is progressively rolling out transformer-based models in its products. The following five aspects are lessons that we have learned while working with these models.
One common bias against transformer models is that they are too slow; indeed, they are notorious for their heavy computation. However, training any model with that many parameters on that much data is bound to take time, and in practice nothing is trained from scratch. Starting from a pre-trained model still requires some fine-tuning, but this does not necessarily have to be costly.
Inference time, on the other hand, is more critical, because it determines the CPU load and you have to be able to scale it. By managing scaling carefully, near-real-time inference can be achieved at a reasonable computational cost.
One way to visualize the effect of fine-tuning a pre-trained model is as a “force multiplier”: you push something with a force of one, but something helps you by amplifying that force, so it works as if you were pushing with a force of ten. Because the pre-trained model already knows topical relatedness, paraphrasing and simple reasoning, you might give it just one example, and it still generalizes quickly.
However, not every kind of generalization works that way. Transformer models are mostly based on language models and generally do not yet handle negation and presupposition. Generalization is therefore slower in these cases: if the task requires phenomena that pre-training does not cover, a good amount of data is still needed.
Standard steps of using pre-trained Transformers are the following:
- Step 1: Pick a decent enough pre-trained transformer
- Step 2: Further train on your domain text with self-supervised task
- Step 3: Fine-tune on your task data
The self-supervised pre-training task is easily generated from an unlabeled, raw corpus: for example, predicting a masked word or predicting the next sentence. Domain data, however, has even more to give. Web-scale pre-training is great, but some things can only be picked up from the domain text itself.
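Generating such a masked-word task from raw text can be sketched as follows. For reproducibility, this toy version masks every third token deterministically; real pre-training masks a random fraction (BERT masks about 15% of tokens) and works over sub-word units rather than whole words.

```python
# Sketch of self-supervised example generation from raw, unlabeled
# domain text: hide a token and let the model predict it. Masking is
# deterministic (every 3rd token) here purely for demonstration.

def masked_examples(text, every=3):
    """Yield (masked_sentence, target_word) pairs from raw text."""
    words = text.split()
    examples = []
    for i in range(0, len(words), every):
        masked = words.copy()
        target = masked[i]
        masked[i] = "[MASK]"
        examples.append((" ".join(masked), target))
    return examples

corpus = "the parcel arrived damaged and I would like a replacement"
for sentence, target in masked_examples(corpus):
    print(target, "→", sentence)
```

Because the labels come for free from the text itself, any amount of unlabeled domain text (support tickets, chat logs) can be turned into training signal this way before the supervised fine-tuning step.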
Multilingual transformers are transformer models pre-trained on multiple languages, which gives them the ability to build multilingual representations. When data is unevenly distributed across languages, multilingual transformers perform really well.
However, not all language pairs are equal. While English, French and German work pretty well together, languages like Korean or Chinese pick up almost nothing from them. Vocabulary size also matters: for some languages the vocabulary coverage is limited, and the model does not work as well for them.
Transformers have the ability to generalize over syntax and semantics, but given the chance, they will cheat and settle on the simplest explanation, for example answering a question about a person's name with whatever second name appears in the last sentence.
Unintended bias can also enter the system, especially if the data set is small. Transformers are relatively robust against issues such as overfitting and forgetting, but because they know more, they can also be more creative at spotting such holes in the data. Overcoming these issues requires well-designed training data preparation.
The new model handles sentence structure better. When a customer enters the request “I want to add a 2nd photo book to my shopping cart.”, for instance, the model figures out that “add to shopping cart” means “to order”. Another example is the question “Can I have my last photo book printed again?”, which the machine rightfully matches with “How can I make a repeat-order?”.
The new representation is also better at handling nuances and cross-lingual matching, so the system can report cross-lingual matches for requests. If there is an English request such as “How to unlock previously purchased downloads?” and, for instance, 14 Dutch requests score as matches, the system will report them.
Pre-trained transformer models can handle almost every linguistic phenomenon, provided there is enough data. Pre-training works like a force multiplier that lets a small amount of data go a long way, although this has to be treated with caution. The transformer model also nicely captures the variety of natural language expressions.
OMQ is currently rolling out these representations for selected customers and services. We aim to deliver the latest progress in NLP research to the market and want to share this process.
Click here for the slides of the presentation: PDF - Transformers in Action