Technology Toolkit 2021 is a technical white paper describing core technologies being researched and developed by Samsung SDS R&D Center. In this paper, we introduce a total of seven technologies concerning AI, Blockchain, Cloud, and Security, with details on their technical definitions, key features, differentiating points, and use cases to give our readers insight into our work.
① Limitations of Statistical Language Models in Linguistic Representation
Language is a means of communicating intentions and meanings. Sentences carry meaning, and humans communicate by understanding and interpreting the words within them. How, then, can this process be reflected in deep learning models? Deep learning models operate on real-valued computation, so words must first be converted into real-valued representations. Earlier deep learning models simply mapped each word to a pre-defined real-valued representation. The downside of this approach is that such a model cannot properly handle linguistic ambiguities such as homonyms. Take the phrases “bank account” and “river bank” for example. Although the two phrases contain the same word, their meanings are very different. Previously available deep learning models could not capture this difference because they used the same real-valued representation for the word “bank,” as shown in [Figure 1]. A new kind of model was therefore needed, one that can capture contextual information.
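The sketch below illustrates the point with the public bert-base-uncased checkpoint from the Hugging Face transformers library (a stand-in for illustration, not the models described later): a static embedding table would assign “bank” one fixed vector, whereas a contextual model produces clearly different vectors for “bank account” and “river bank.”

```python
# Sketch: contextual vs. static word representations (assumes the Hugging Face
# `transformers` library and the public bert-base-uncased checkpoint).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return the contextual hidden state of the token 'bank' in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    idx = inputs["input_ids"][0].tolist().index(
        tokenizer.convert_tokens_to_ids("bank"))
    return hidden[idx]

v1 = bank_vector("I opened a bank account.")
v2 = bank_vector("We walked along the river bank.")

# A static embedding table would give identical vectors for both occurrences;
# the contextual vectors differ, reflecting the two senses of "bank".
print(torch.cosine_similarity(v1, v2, dim=0).item())  # noticeably below 1.0
```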
② Limitations in building datasets for DL language model training
Training a deep learning language model requires a massive amount of data, although the amount varies with the size of the model. Building a dataset for supervised learning in particular requires answers (labels), which often entails significant time and cost. Creating datasets for large models, with tens of millions of sentences and their associated labels, is therefore an extremely challenging task in itself.
③ Introduction to pre-trained models
First introduced by Google in 2018, BERT (Bidirectional Encoder Representations from Transformers) is a deep learning language model pre-trained on a large amount of English text, developed to tackle the limitations of statistical language models mentioned above. BERT achieved state-of-the-art (SOTA) results on a wide variety of NLP benchmarks, such as General Language Understanding Evaluation (GLUE) and machine reading comprehension (SQuAD v1.1, SQuAD v2.0), outperforming the statistical language models available at the time. Various similar language models were presented afterwards. Since most of them work in a similar way to BERT, let’s take a closer look at BERT and find out how the drawbacks mentioned above were addressed.
④ Contextual representations
BERT uses two training strategies to learn the context of a word. Learning contextual relations between words or
sentences allows the model to provide more accurate representations of a language.
• Masked Language Model (MLM)
Masked Language Modeling (MLM) is one way to train a language model. Certain words in a sentence are masked out at random, and the model learns the language by guessing the masked words. Because the model must predict the correct answers from the context surrounding each masked word, it naturally learns the different meanings a word can take depending on its relationship with the surrounding words.
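As a quick illustration with a public English checkpoint (not the Korean models discussed later), the fill-mask pipeline shows the model recovering a masked word purely from its context:

```python
# Sketch: masked language modeling with a public English checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model must rely on the surrounding words to recover the masked token.
for pred in fill_mask("I deposited the check at the [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
# Likely completions include "bank", showing context-driven prediction.
```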
• Next Sentence Prediction (NSP)
In Next Sentence Prediction (NSP), the model is given a pair of sentences as input and learns to predict whether the second sentence follows the first. This way, the model learns relationships that span across sentences.
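The toy example below probes BERT’s NSP head with a public English checkpoint: a plausible continuation scores high, an unrelated sentence scores low (the checkpoint and sentences are illustrative):

```python
# Sketch: next sentence prediction with BERT's pre-training head
# (public bert-base-uncased checkpoint; illustration only).
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "He went to the store."
sentence_b = "He bought a gallon of milk."       # plausible continuation
sentence_c = "Penguins live in the Antarctic."   # unrelated sentence

for cand in (sentence_b, sentence_c):
    enc = tokenizer(sentence_a, cand, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    # index 0 = "is the next sentence", index 1 = "is not"
    prob_next = torch.softmax(logits, dim=-1)[0, 0].item()
    print(cand, "->", round(prob_next, 3))
```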
⑤ Overcoming drawbacks through self-supervised learning
There is a wide variety of text analysis tasks (or features) based on deep learning language models. These tasks can be used to predict the sentiment of a sentence, answer a question, or find similar sentences. Acquiring data and training a model for each of these tasks separately would require a tremendous amount of data. Yet although the tasks differ, they all use the same language and therefore require the same underlying language understanding. BERT models are pre-trained on tasks like MLM and NSP and thereby acquire these basic language skills. For the MLM task, dataset texts are taken from Wikipedia, with a certain percentage of the words masked out. For the NSP task, training pairs are generated by mixing pairs of consecutive sentences with pairs in which the second sentence comes from a different document. In this way, the model can learn a language representation by itself without humans providing labeled data. A learning method in which a model learns autonomously from unlabeled data is called “self-supervised learning,” and a language model built this way is referred to as a “pre-trained language model.”
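A minimal sketch of how such self-supervised examples can be generated from raw, unlabeled text; the 15% masking ratio follows the original BERT recipe, and everything else here is illustrative:

```python
# Sketch: building self-supervised MLM examples from raw, unlabeled text.
import random

def make_mlm_example(tokens, mask_token="[MASK]", mask_ratio=0.15):
    """Randomly mask ~15% of tokens; the originals become the labels."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_ratio:
            inputs.append(mask_token)
            labels.append(tok)          # label only for masked positions
        else:
            inputs.append(tok)
            labels.append(None)         # ignored in the loss
    return inputs, labels

tokens = "the model learns language by predicting missing words".split()
print(make_mlm_example(tokens))
```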
⑥ Fine-tuning based on transfer learning
Transfer learning is a machine learning technique in which a model pre-trained on one task is re-trained for a new, purpose-specific task. Fine-tuning refers to this stage of the learning process, in which a pre-trained model is refined for a specific task. Fine-tuning is necessary because a pre-trained BERT model by itself is not yet trained to perform any specific task. Let’s find out what kinds of tasks can be performed with a fine-tuned model in the use cases below.
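A compressed sketch of what fine-tuning looks like in practice, here for binary sentence classification on top of a public English BERT checkpoint; the dataset, label scheme, and hyperparameters are placeholders for illustration only:

```python
# Sketch: fine-tuning a pre-trained encoder for a downstream classification task.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)            # new task-specific head

# Tiny placeholder dataset: (text, label) pairs with 1 = positive, 0 = negative.
train_data = [("great product, works perfectly", 1),
              ("arrived broken and support never replied", 0)]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):                            # a few epochs usually suffice
    for text, label in train_data:
        batch = tokenizer(text, return_tensors="pt")
        loss = model(**batch, labels=torch.tensor([label])).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```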
There are many publicly available pre-trained BERT models for English, but linguistic differences make them difficult to use for Korean. Samsung SDS R&D Center gathered a massive volume of Korean text data and created pre-trained models built specifically for the Korean language.
① KoreALBERT
The first model is KoreALBERT (Korea + ALBERT) [Figure 6], which learns Korean using the ALBERT (A Lite BERT) architecture, a lightweight version of BERT. To increase the usability of the model in various business settings, the datasets were built from formal Korean texts taken from a wide variety of sources, including Wikipedia, news articles, and book outlines. In addition, we applied a new lightweight architecture and a patented training methodology to obtain a competitive edge over previously available Korean language models (3. Differentiating Points).
② KoELECTRA
Introduced after BERT, ELECTRA is a language model created to address an inefficiency of BERT, which derives its training signal from only the 15% of tokens that are masked. In contrast, ELECTRA utilizes the remaining 85% of the input tokens as well, improving training speed and performance compared to BERT. KoELECTRA is a model trained on Korean texts using the ELECTRA architecture. Although heavier, KoELECTRA achieves better performance than KoreALBERT.
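To make the contrast concrete, the sketch below runs ELECTRA’s replaced-token detection with a public English discriminator checkpoint (not KoELECTRA itself); note that the discriminator receives a learning signal at every token position, not only at masked ones:

```python
# Sketch: ELECTRA's replaced-token detection, which produces a training signal
# for every input token (public English discriminator; not KoELECTRA).
import torch
from transformers import ElectraTokenizer, ElectraForPreTraining

tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-discriminator")
model = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")

# "ate" has been replaced with the implausible "flew".
sentence = "the chef flew the delicious meal"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0]

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, score in zip(tokens, torch.sigmoid(logits)):
    # Higher score = the discriminator thinks this token was replaced.
    print(f"{tok:>12s}  {score.item():.2f}")
```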
A pre-trained model cannot perform a specific task by itself, but it can be used for a wide range of tasks after fine-tuning. NLP (Natural Language Processing) can be broadly divided into two areas: NLU (Natural Language Understanding) and NLG (Natural Language Generation). NLU covers the comprehension of text inputs and their context, while NLG is responsible for generating sentences based on that understanding. So, to build a chatbot, you need both NLU to understand the user’s intent and NLG to generate responses.
① NLU
Pre-trained models are used for almost all NLU tasks. Various NLU tasks, including machine reading comprehension, text classification, and textual similarity analysis, can be performed after fine-tuning a pre-trained model.
② NLG
An NLG model is equipped with a decoder module for generating sentences, which doubles the pre-training cost compared to NLU models. However, recent studies have suggested ways to reduce NLG pre-training costs by reusing NLU models.
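One published direction of this kind is to warm-start a sequence-to-sequence (NLG) model from an existing encoder-only (NLU) checkpoint. The sketch below shows this pattern with the Hugging Face EncoderDecoderModel and public English BERT weights; it illustrates the idea only, is not our in-house NLG engine, and would still need fine-tuning on paired data before its output becomes meaningful:

```python
# Sketch: building an NLG (sequence-to-sequence) model by reusing a pre-trained
# NLU encoder as both encoder and decoder (one way to cut pre-training cost).
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased")     # decoder gains cross-attention

# Required generation settings for the warm-started decoder.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("a long input document to be summarized ...",
                   return_tensors="pt")
# Before fine-tuning on paired (document, summary) data, the generated text
# will not be meaningful; this only shows the mechanics.
summary_ids = model.generate(inputs.input_ids, max_length=32)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```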
Samsung SDS adopted new training techniques to reduce model size while improving the accuracy of the pre-trained models that serve as the core engine for text analysis. A new step was added to the training process [Figure 9] for enhanced efficiency. In this stage, the model learns to predict the original order of randomly shuffled words, which not only significantly reduces the parameter size (to about one-tenth of BERT’s) but also increases model accuracy. The Samsung SDS team patented this training technique and presented the study at ICPR 2020, one of the top-tier global conferences. The upgraded KoreALBERT model is used in an ERP-related VoC analysis system running on CPU-based servers (4. Use Cases).
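The exact objective used in our additional pre-training step is described in the KoreALBERT paper; the sketch below only illustrates the general idea of generating word-order prediction examples automatically, with the shuffling ratio and label convention chosen purely for illustration:

```python
# Sketch: generating word-order-prediction examples by shuffling a few tokens
# and keeping their original positions as labels (illustrative only).
import random

def make_word_order_example(tokens, shuffle_ratio=0.15):
    """Shuffle ~15% of positions; labels map each shuffled slot to its source index."""
    n = len(tokens)
    k = max(2, int(n * shuffle_ratio))
    picked = sorted(random.sample(range(n), k))
    permuted = picked[:]
    random.shuffle(permuted)
    shuffled = tokens[:]
    labels = [-100] * n                      # -100 = position ignored in the loss
    for dst, src in zip(picked, permuted):
        shuffled[dst] = tokens[src]
        labels[dst] = src                    # model must recover the original order
    return shuffled, labels

print(make_word_order_example("the model learns to restore word order".split()))
```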
In addition, Samsung SDS developed a multilingual BERT model that can be used across various languages by employing transfer learning and a CNN (convolutional neural network) architecture. The model proved its performance by ranking first on the official leaderboard of KorQuAD v1.0 (Korean Question Answering Dataset), a Korean MRC benchmark, in August 2020.
[Figure 10] below displays the size and performance of different pre-trained models applicable to Korean language comprehension tasks. With the improved pre-training method, KoreALBERT achieved higher average scores on various Korean language understanding tasks* with a much smaller parameter count, only 7-20% of that of other models. (*Evaluated tasks: machine reading comprehension, semantic textual similarity, named entity recognition, intent classification, question classification, sentiment analysis)
The Machine Reading Comprehension (MRC) task is used to understand text passages and answer questions about them. As shown in [Figure 11], a pre-trained model was used as the baseline, and an enhanced CNN was added to boost the MRC model’s performance. Our MRC model showed strong performance on KorQuAD 1.0.
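For illustration, the sketch below runs extractive question answering with a public English SQuAD-fine-tuned checkpoint (our Korean MRC model and its enhanced CNN layer are not publicly available):

```python
# Sketch: machine reading comprehension as extractive question answering
# (public English checkpoint standing in for the in-house KorQuAD model).
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

context = ("Samsung SDS R&D Center built Korean pre-trained language models "
           "such as KoreALBERT, which was presented at ICPR 2020.")
print(qa(question="Where was KoreALBERT presented?", context=context))
# -> an answer span extracted from the passage, e.g. {'answer': 'ICPR 2020', ...}
```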
The text classification task is used to categorize texts into organized groups. One application is sentiment classification, which determines whether the sentiment of a given text input is positive or negative; this is often used to analyze user comments or product feedback. Text classification can also be used for intent classification, as shown in [Figure 12]. Because any kind of text can be efficiently organized and categorized into groups, it is one of the most frequently used tasks in business settings. The performance of a text classification model may vary depending on the underlying pre-trained model (KoreALBERT in our case).
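A minimal illustration of the task, using a public English sentiment checkpoint as a stand-in for a KoreALBERT-based classifier:

```python
# Sketch: sentiment classification with a fine-tuned classification model
# (public English checkpoint; illustration only).
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

for comment in ["The delivery was fast and the product works great.",
                "The screen cracked after one day and support was unhelpful."]:
    print(comment, "->", classifier(comment)[0])
# -> POSITIVE / NEGATIVE labels with confidence scores
```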
Semantic Textual Similarity (STS) determines how similar two sentences are in meaning or context. STS is useful for implementing search features that find stored sentences similar to a query sentence. Samsung SDS’s STS model takes semantics and context into consideration, and therefore delivers stronger performance and quality than previously available keyword-based search features [Figure 13 – Upper].
As the number of query sentences increases, so does the processing time. To address this issue, Samsung SDS proposed a new architecture that utilizes a Siamese neural network together with a convolutional neural network (CNN) [Figure 13 – Lower]. The results of this study were patented and presented at ICPR 2020, a renowned international conference on pattern recognition and artificial intelligence.
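The patented Siamese + CNN architecture itself is not reproduced here, but the sketch below (with the public bert-base-uncased checkpoint and simple mean pooling as stand-ins) illustrates the key idea: each sentence is encoded independently into a fixed vector, so stored sentences can be embedded once in advance and compared to a query by cosine similarity.

```python
# Sketch: Siamese-style semantic similarity via mean-pooled sentence embeddings
# (a generic stand-in for the patented Siamese + CNN architecture).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence):
    """Encode a sentence once and mean-pool token states into one vector."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state       # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

query = "How do I reset my ERP password?"
candidates = ["Steps to change a forgotten ERP password",
              "Monthly sales report template"]

q_vec = embed(query)
for cand in candidates:
    sim = torch.cosine_similarity(q_vec, embed(cand), dim=0).item()
    print(f"{sim:.3f}  {cand}")
# Because each sentence is encoded independently, candidate vectors can be
# pre-computed, keeping search time flat as the number of stored sentences grows.
```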
As shown in [Figure 14], text summarization models are divided into two types: extractive summarization models, which generate a summary by extracting key sentences from the document, and abstractive summarization models, which comprehend the details of a document and generate new sentences that capture its context. Extractive summarization models can be built on an NLU engine, and abstractive summarization models on an NLG engine. A text summarization feature can save a great deal of time and cost for analytics experts who need to extract insights from a massive volume of documents.
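As a rough illustration of the extractive (NLU-based) approach, the sketch below ranks sentences by how close their embeddings are to an embedding of the whole document; the checkpoint, pooling, and scoring rule are simplifying assumptions, not the production summarizer.

```python
# Sketch: NLU-based extractive summarization by ranking sentences against the
# whole document (illustrative scoring only).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state.mean(dim=1).squeeze(0)

document = ["The quarterly VoC volume grew by 40 percent.",
            "Most inquiries concerned ERP login failures.",
            "The cafeteria menu was updated last week."]

doc_vec = embed(" ".join(document))
scored = sorted(document,
                key=lambda s: torch.cosine_similarity(doc_vec, embed(s), dim=0),
                reverse=True)
print("Extractive summary:", scored[0])   # sentence closest to the document gist
```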
Below is a use case of language models applied to VoC (Voice of Customer) management.
Company A’s ERP-related VoC management process:
(1) A system user submits an ERP system inquiry.
(2) The service desk assigns the case to the person in charge.
(3) The person in charge verifies the inquiry type and re-assigns the case if necessary.
In the previous VoC management system, the re-classification rate was as high as 30% due to misclassification, which also delayed overall lead time. Samsung SDS R&D Center addressed this issue by applying two deep-learning-based models to the system: an automatic relevant-manager classification model and an automatic similar-VoC search model. Adopting the two models increased classification accuracy to 86%, reduced the re-classification rate to 11%, and brought lead time down to 23%.
The number of business cases for pre-trained language models is growing as the models are applied to ever wider areas. Samsung SDS will continue to introduce more powerful and effective pre-trained models that can be applied to various tasks (4. Use Cases), while creating value for customers through insights gained from textual information and analysis.
# References
[1] KoreALBERT: Pretraining a Lite BERT for Korean Language Understanding
[2] https://korquad.github.io/category/1.0_KOR.html
[3] Evaluation of BERT and ALBERT Sentence Embedding Performance on Downstream NLP Tasks
[4] Analyzing Zero-shot Cross-lingual Transfer in Supervised NLP Tasks
Judong Kim, AI Core Lab at Samsung SDS R&D Center
Judong Kim is a researcher at Samsung SDS AI Core Lab, responsible for deep learning language models and model performance improvement.
Hyunjae Lee, AI Core Lab at Samsung SDS R&D Center
Hyunjae Lee is a researcher at Samsung SDS AI Core Lab. His main research area is NLP technology with a particular focus on Korean language models.
Hyunjin Choi, AI Core Lab at Samsung SDS R&D Center
Hyunjin Choi joined Samsung SDS as a software engineer after studying business administration in college. Having worked on web/mobile and Windows programming, she is currently working on computational linguistics at AI Core Lab.
If you have any inquiries, comments, or ideas for improvement concerning technologies introduced in Technology Toolkit 2021, please contact us at techtoolkit@samsung.com.