Technology Toolkit 2021 is a technical white paper describing core technologies being researched and developed by the Samsung SDS R&D Center. In this paper, we introduce a total of seven technologies concerning AI, Blockchain, Cloud, and Security, with details on their technical definitions, key features, differentiating points, and use cases to give our readers some insight into our work.
In a world that revolves around data, protecting one's privacy, and by extension the data itself, is becoming not an option but a necessity. In response to this trend, laws and regulations around the globe, such as the Personal Information Protection Act (Korea), GDPR (EU), and CCPA (California, USA), have been revised to strengthen privacy protection, and a new market is emerging that supports data utilization in privacy-protected environments.
Privacy protection technology, commonly referred to as de-identification or anonymization technology, has traditionally been implemented using relatively intuitive methods such as identifier removal and generalization. However, the recent tightening of laws and regulations and the growing requirements for data utilization have driven the development of more theoretical, quantifiable technologies built on cryptographic primitives, and a number of international standards have been proposed on the matter.[1]
In this paper, we give an overview of encryption-based privacy enhancing technology, which is being actively adopted by global companies such as Google, Microsoft, IBM, Intel, and Ant Financial, as well as numerous startups, in sectors like finance, healthcare, and marketing. We hope to use our technology to help companies and individuals alike protect their private information.
PET is an encryption-based privacy enhancing technology that overcomes the limitations of existing techniques, such as re-identification risk and degradation of analytical value. Traditionally, cryptographic technologies have been applied to areas such as data encryption, digital signatures, and secure cryptographic protocols to protect corporate security rather than individual privacy. However, with the changes brought on by the fourth industrial revolution, 5G, and data-driven technology, encryption-based anonymization technology, theoretically researched for more than ten years, has come under the spotlight for its business potential.
Representative technologies include 1) homomorphic encryption, which enables computation on data in its encrypted state, 2) differential privacy, which can quantify the level of privacy protection, and 3) synthetic data generation, which can create fake data with statistical and probabilistic characteristics similar to those of the original. These technologies take time to assimilate, but for the purposes of this paper we will take a brief look at each of the three in terms of its definition and main features.
① HE: Homomorphic Encryption
Homomorphic encryption is an encryption technology that supports data analysis in an encrypted state. For example, as shown in [Figure 3], the ciphertexts of homomorphically encrypted 2 and 5 each look like random numbers, but if you add or multiply these ciphertexts and decrypt the result, you get 7 (= 2 + 5) or 10 (= 2 × 5). With a general encryption technology such as AES, decrypting the result of a computation on ciphertexts would yield a completely different value, whereas with homomorphic encryption, you get a computation result exactly the same as that obtained from the original data.
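As an illustration, the sketch below reproduces the [Figure 3] example using the open-source TenSEAL library (an assumption chosen for illustration; it is not the library underlying our PET product). The CKKS scheme it uses computes approximately, so the decrypted values are close to, rather than bit-exact copies of, 7 and 10.

```python
import tenseal as ts

# Set up a CKKS context (parameters are typical example values)
context = ts.context(ts.SCHEME_TYPE.CKKS, poly_modulus_degree=8192,
                     coeff_mod_bit_sizes=[60, 40, 40, 60])
context.global_scale = 2 ** 40

# Encrypt 2 and 5; each ciphertext looks like random data
a = ts.ckks_vector(context, [2.0])
b = ts.ckks_vector(context, [5.0])

# Add and multiply the ciphertexts directly, then decrypt
print((a + b).decrypt())  # ~[7.0]
print((a * b).decrypt())  # ~[10.0]
```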
Some attributes of homomorphic encryption can already be found in cryptographic technologies long in use. Partially homomorphic schemes supporting either multiplication or addition have existed since the 1970s: the RSA cipher, widely used in areas such as public-key certificates and HTTPS, is multiplicatively homomorphic, and the internationally standardized Paillier cipher is additively homomorphic.
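A minimal, self-contained sketch of RSA's multiplicative homomorphism is shown below; the toy parameters are insecure and chosen only to make the arithmetic visible.

```python
# Textbook RSA with toy parameters (insecure; for illustration only)
p, q = 61, 53
n = p * q                    # public modulus
phi = (p - 1) * (q - 1)
e = 17                       # public exponent
d = pow(e, -1, phi)          # private exponent (Python 3.8+)

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

# Multiplying ciphertexts corresponds to multiplying plaintexts:
# E(2) * E(5) mod n decrypts to 2 * 5 = 10
product = (encrypt(2) * encrypt(5)) % n
assert decrypt(product) == 10
```

Paillier is the additive analogue: multiplying two Paillier ciphertexts decrypts to the sum of the underlying plaintexts.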
Fully homomorphic encryption, which preserves both addition and multiplication, was first proposed by IBM researcher Gentry in 2009, and active research has continued ever since. Because it supports both operations, it can evaluate essentially any function applied in data analysis, machine learning, and AI.[2]
② DP: Differential Privacy
Let's say you are querying a database. If there is a significant difference between the response when the database contains person A's private information and the response when it does not, there is a risk of exposing A's privacy. For example, if you query the average annual salary of a database containing salary information, and no privacy protection measures are taken, you can work out person A's exact salary from the difference between the result with A's record included and the result with it excluded. This is where differential privacy demonstrates its merit: it mixes an appropriately distributed noise into the response, preventing you from computing person A's salary.
[Figure 4] Differential privacy: privacy risk arises from the difference in analysis results (average, maximum value, machine learning result, etc.) between two subtly different databases
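As a sketch of the idea, the snippet below answers an average-salary query with the classic Laplace mechanism; the clipping bounds and epsilon are illustrative assumptions, not values from our PET product.

```python
import numpy as np

def dp_average(values, lower, upper, epsilon):
    """Answer an average query with the Laplace mechanism.

    After clipping each value to [lower, upper], removing or changing
    one record shifts the mean by at most (upper - lower) / n, so noise
    scaled to that sensitivity hides any single person's contribution.
    """
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

salaries = np.array([52_000, 48_000, 95_000, 61_000, 57_000])
print(dp_average(salaries, lower=0, upper=200_000, epsilon=1.0))
```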
This technology was first given concrete form by Microsoft researcher Dwork in 2006. It offers a great technical advantage in that it can numerically measure both how much the data processing results are distorted (the error radius) and how much privacy is protected. Since Google and Apple announced that they use differential privacy in the Chrome browser and in iPhone/iCloud respectively, with companies such as Uber following suit, the technology has attracted a great deal of attention.
③ SD: Synthetic Data
Synthetic data is a de-identification method that creates fake (virtual) data so that data can be safely analyzed and utilized without compromising the personal information contained in the original. It is similar to deepfakes, where realistic fake videos are created using artificial intelligence to compose and manipulate videos and photos.
Research on synthetic data first took off in 1981, when Professor Rubin of the Statistics Department at Harvard University adopted synthetic data as a means of substituting missing values, such as in surveys. The research has since expanded to include fully synthetic data (all published data is fake), partially synthetic data (only some information in the published data is fake), and complex synthetic data (data newly created from partially synthetic data).
Methods for creating synthetic data include traditional statistical methods, machine learning models (e.g., GANs: Generative Adversarial Networks), and the application of differential privacy to assure the privacy of the fake data, as combined in the sketch below.
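The following sketch illustrates the simplest statistical approach under stated assumptions: fit a Gaussian to one numeric column and sample fake records from it, adding Laplace noise to the fitted statistics so that the generator itself respects a differential privacy budget. Real systems use far richer models (e.g., GANs); the bounds and epsilon here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def synthesize_column(original, lower, upper, epsilon, n_fake):
    """Sample fake values from a Gaussian fitted to the original column.

    Laplace noise is added to the fitted mean and standard deviation,
    splitting the epsilon budget between the two statistics (sequential
    composition), so no single record dominates the generator.
    """
    clipped = np.clip(original, lower, upper)
    n = len(clipped)
    sens_mean = (upper - lower) / n              # sensitivity of the mean
    sens_std = (upper - lower) / np.sqrt(n)      # rough bound, illustration only
    dp_mean = clipped.mean() + rng.laplace(0.0, sens_mean / (epsilon / 2))
    dp_std = abs(clipped.std() + rng.laplace(0.0, sens_std / (epsilon / 2)))
    return rng.normal(dp_mean, dp_std, size=n_fake)

ages = np.array([23, 35, 41, 29, 52, 38, 47, 31])
fake_ages = synthesize_column(ages, lower=0, upper=100, epsilon=1.0, n_fake=1000)
```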
[Figure 5] Synthetic data: security can be enhanced with differential privacy and homomorphic encryption technology
PET provides multiple interlinked functions to ensure privacy protection, using the world's best homomorphic encryption, differential privacy, and synthetic data technologies.
Existing de-identification technologies modify the original data to avoid exposing personal information, but they lose meaningful information in the process. In addition, because the data are transformed to fit a specific analysis objective, it is difficult to reuse them for other analytical purposes. Homomorphic encryption technology, by contrast, encrypts and utilizes the original data without any loss, which allows more accurate analysis and prevents the original data from being leaked at its source.
Homomorphic encryption technology supports basic statistical functions as well as various machine learning training and inference functions and deep learning inference functions. In particular, it can provide an optimal solution suited to the needs of our customers by combining various PET technologies depending on the application environment.
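To make the inference scenario concrete, here is a hypothetical sketch of evaluating a plaintext linear model on encrypted features, again using the open-source TenSEAL library rather than our product's actual API; the weights and inputs are invented.

```python
import tenseal as ts

context = ts.context(ts.SCHEME_TYPE.CKKS, poly_modulus_degree=8192,
                     coeff_mod_bit_sizes=[60, 40, 40, 60])
context.global_scale = 2 ** 40
context.generate_galois_keys()   # rotations are needed inside dot()

weights = [0.8, -0.4, 0.3]       # the service provider's plaintext model
bias = 0.1

enc_x = ts.ckks_vector(context, [1.0, 2.0, 3.0])  # client-encrypted features
enc_score = enc_x.dot(weights) + bias             # evaluated without decryption
print(enc_score.decrypt())       # ~[1.0]; only the key holder can read it
```

Note that the service provider never sees the plaintext features and the client never learns the model weights, which is exactly the mutual-protection scenario described later in this paper.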
Synthetic data is imitation data created from the data holder's original data, and it can be used freely without the restrictions of privacy protection laws and regulations. In particular, for researchers and developers who have had difficulty collecting data due to security measures imposed by data holders, synthetic data provides a means to explore and learn from data and to develop the analysis functions they need. The technology gives developers an opportunity to implement analytics solutions in real situations in sectors that handle sensitive data, such as finance, healthcare, and security, and to acquire know-how over time.
[Figure 6] Analytics service flow: homomorphically encrypted data is sent to the analytics service, and the result is returned in encrypted form
Our homomorphic encryption technology is the world's best. We secured and refined original PET technology recognized at top-tier conferences such as Eurocrypt '18 and Asiacrypt '19, and we used this enhanced technology to win iDASH, an international competition on encrypted analysis, in 2020.[3] Our technology has been refined further, as demonstrated at AAAI '19, where our analysis achieved ten times the speed of cutting-edge machine learning training at the same accuracy. Our PET technology offers the world's best efficiency, with optimized approximate computation and concurrent processing of tens of thousands of data points, and it yields over 99.99% consistency in analysis results compared to analysis on the original data.
When an institution (the data holder) wants to use a high-performance analysis model owned by another institution (the service provider), the service provider may not be very forthcoming about sharing the model, given that it is an important asset. At the same time, the data holder is prevented from transferring its data outside its premises for privacy reasons. This is where our homomorphic encryption technology proves useful: it can provide a predictive service that keeps the analysis model from leaking while also protecting customer data.
When using data containing sensitive personal information such as financial transactions and medical records, pseudonymization is required to prevent leakage. Here, deep learning-based synthetic data technology can be used to obtain meaningful analysis results from pseudonymized information generated by training on the specific features of the original data. In addition, differential privacy technology can be used to control the level of protection when generating the synthetic data.
Now let’s take a look at some of the cases where we adopted PET in actual business.[4]
A technology verification was conducted to test our PET technology's ability to predict credit ratings using actual customer data held by domestic financial companies. The data was provided to the service provider in encrypted form, so no customer personal information was exposed at any point during the credit score prediction and analysis process. In addition, the credit rating results were produced in encrypted form and could only be read by users holding the decryption keys.
The results of the verification were as follows. We confirmed that the accuracy of our homomorphic encryption-based analysis was exactly the same as that of the analysis on the original data, and that our technology delivered excellent analysis speed, processing millions of records within a 12-hour time frame, proving once again its applicability to real environments.
The second case comes from the medical sector. In collaboration with a domestic hospital, we developed homomorphic encryption technology suited to a deep learning-based model adept at predicting the recurrence of chronic disease. Our verification showed that the encrypted analysis delivered almost the same accuracy as the non-encrypted analysis, taking an average of 30 seconds per medical case. It also showed that by parallelizing the homomorphic encryption, we could reduce the processing time to less than one second per case, which is good enough for on-site application.
These use cases confirmed how important it is to protect our customers' sensitive data and their assets (high-performance analysis models), and how worried customers are about the institutional and legal risks that may accompany new services built on such sensitive data and assets. They also showed how important it is to these concerned customers that we provide a PET-based analysis solution that supports high-quality services while preventing data and asset leakage at the source.
If we can provide a PET-based analysis service that is both user-friendly and quick to respond to the needs of various industries, the countless data sets currently sitting idle could be put to use in new businesses far more aggressively.
# References
[1] ISO/IEC 20889:2018, Privacy enhancing data de-identification terminology and classification of techniques
[2] https://www.samsungsds.com/kr/insights/GDPR_data_analytics.html
[3] http://www.humangenomeprivacy.org/2020/
[4] https://www.mk.co.kr/today-paper/view/2020/4679442/
▶ The content is protected by law and the copyright belongs to the author.
▶ It is prohibited to copy or quote the content without the author's permission.
Security Algorithm Team at Samsung SDS R&D Center
With his experience and expertise in encryption technology, he is involved in research & development of new encryption technology and privacy protection technology.
If you have any inquiries, comments, or ideas for improvement concerning technologies introduced in Technology Toolkit 2021, please contact us at techtoolkit@samsung.com.