Abstract
With concerns about data protection and privacy growing over the past several years, and with the concomitant introduction of regulations restricting access to personal information (PI), archivists in many jurisdictions must now undertake ‘sensitivity reviews’ of archival documents to determine whether those documents can be made accessible to researchers. Such reviews are onerous given the increasing volume of records, and complex because it can be difficult for archivists to determine whether records contain PI under the provisions of various laws. Despite research into tools and techniques for automating sensitivity reviews, effective solutions remain elusive. One approach not yet explored for enabling access to archival holdings subject to privacy restrictions is the application of privacy-enhancing technologies (PETs), a class of emerging technologies that rest on the assumption that a body of documents is confidential or private and must remain so. Although applying PETs to make archives more accessible may seem counterintuitive, we argue that PETs could provide an opportunity to protect PI in archival holdings whilst still enabling research on those holdings. In this article, to lay a foundation for archival experimentation with PETs, we contribute an overview of these technologies based on a scoping review and discuss possible use cases and future research directions.
Index Terms
- Protecting Privacy in Digital Records: The Potential of Privacy-Enhancing Technologies