Insights and Survey about the Capability, Efficiency and Security

Yuanchun Li¹†, Hao Wen¹‡, Weijun Wang¹‡, Xiangyu Li¹‡, Yizhen Yuan¹‡, Guohong Liu¹‡,
Jiacheng Liu¹, Wenxing Xu¹, Xiang Wang¹, Yi Sun¹, Rui Kong¹, Yile Wang¹, Hanfei Geng¹,
Jian Luan², Xuefeng Jin³, Zilong Ye⁴, Guanjing Xiong⁵, Fan Zhang⁶, Xiang Li⁷,
Mengwei Xu⁸, Zhijun Li⁹, Peng Li¹, Yang Liu¹, Ya-Qin Zhang¹, Yunxin Liu¹

¹ Institute for AI Industry Research (AIR), Tsinghua University
² Xiaomi AI Lab ³ Huawei Technologies Co., Ltd. ⁴ Shenzhen Heytap Technology Co., Ltd.
⁵ vivo AI Lab ⁶ Viomi Technology Co., Ltd. ⁷ Li Auto Inc.
⁸ Beijing University of Posts and Telecommunications ⁹ Soochow University

† Project Lead ‡ Section Lead
Contact: liyuanchun@air.tsinghua.edu.cn
Website: https://github.com/MobileLLM/Personal_LLM_Agents_Survey

Abstract

Since the advent of personal computing devices, intelligent personal assistants (IPAs) have been one of the key technologies that researchers and engineers have focused on, aiming to help users efficiently obtain information and execute tasks, and provide users with more intelligent, convenient, and rich interaction experiences. With the development of smartphones and the Internet of Things, computing and sensing devices have become ubiquitous, greatly expanding the functional boundaries of intelligent personal assistants. However, due to the lack of capabilities such as user intent understanding, task planning, tool use, and personal data management, existing intelligent personal assistants still have limited practicality and scalability.

In recent years, the emergence of foundation models, represented by large language models (LLMs), has brought new opportunities for the development of intelligent personal assistants. With their powerful semantic understanding and reasoning capabilities, LLMs can enable intelligent agents to solve complex problems autonomously. In this paper, we focus on Personal LLM Agents, which are LLM-based agents that are deeply integrated with personal data and personal devices and used for personal assistance. We envision that Personal LLM Agents will become a major software paradigm for end-users in the upcoming era. To realize this vision, we take the first step to discuss several important questions about Personal LLM Agents, including their architecture, capability, efficiency and security. We start by summarizing the key components and design choices in the architecture of Personal LLM Agents, followed by an in-depth analysis of the opinions collected from domain experts. Next, we discuss several key challenges to achieve intelligent, efficient and secure Personal LLM Agents, followed by a comprehensive survey of representative solutions to address these challenges.

Keywords: Intelligent personal assistant · Large language model · LLM agent · Mobile devices · Intelligence levels · Task automation · Sensing · Memory · Efficiency · Security and privacy

Contents
  1 Introduction
  2 A Brief History of Intelligent Personal Assistants
    2.1 Timeline View of the Intelligent Personal Assistants History
    2.2 Technical View of the Intelligent Personal Assistants History
      2.2.1 Template-based Programming
      2.2.2 Supervised Learning Methods
      2.2.3 Reinforcement Learning Methods
      2.2.4 Early Adoption of Foundation Models
  3 Personal LLM Agents: Definition & Insights
    3.1 Key Components
    3.2 Intelligence Levels of Personal LLM Agents
    3.3 Opinions on Common Problems
  4 Fundamental Capabilities
    4.1 Task Execution
      4.1.1 Task Automation Methods
      4.1.2 Autonomous Agent Frameworks
      4.1.3 Evaluation
    4.2 Context Sensing
      4.2.1 Sensing Sources
      4.2.2 Sensing Targets
    4.3 Memorizing
      4.3.1 Obtaining Memory
      4.3.2 Managing and Utilizing Memory
  5 Efficiency
    5.1 Efficient Inference
      5.1.1 Model Compression
      5.1.2 Inference Acceleration
      5.1.3 Memory Reduction
      5.1.4 Energy Optimization
    5.2 Efficient Customization
      5.2.1 Context Loading Efficiency
      5.2.2 Fine-tuning Efficiency
    5.3 Efficient Memory Manipulation
      5.3.1 Searching Efficiency
      5.3.2 Index Optimization
  6 Security and Privacy
    6.1 Confidentiality
      6.1.1 Local Processing
      6.1.2 Secure Remote Processing
      6.1.3 Data Masking
      6.1.4 Information Flow Control
    6.2 Integrity
      6.2.1 Adversarial Attacks
      6.2.2 Backdoor Attacks
      6.2.3 Prompt Injection Attacks
    6.3 Reliability
      6.3.1 Problems
      6.3.2 Improvement
      6.3.3 Inspection
  7 Conclusion and Outlook

1 Introduction

Science fiction has portrayed numerous striking characters of Intelligent Personal Assistants (IPAs), which are software agents that can augment individuals’ abilities, complete complicated tasks, and even satisfy emotional needs. These intelligent agents represent most people’s fantasies regarding artificial intelligence (AI). With the widespread adoption of personal devices (e.g., smartphones, smart home equipment, electric vehicles, etc.) and the advancement of machine learning technology, this fantasy is gradually becoming a reality. Today, many mobile devices embed IPA software, such as Siri [1], Google Assistant [2], Alexa [3], etc. These intelligent agents are deeply entwined with users, capable of accessing user data and sensors, controlling various personal devices, and accessing personalized services associated with private accounts.

However, today’s intelligent personal assistants still suffer from limited flexibility and scalability. Their level of intelligence is far from adequate, which is particularly evident in their understanding of user intent, reasoning, and task execution. Most of today’s intelligent personal assistants are limited to performing tasks within a restricted domain (e.g., simple functions in built-in apps). Once a user requests tasks beyond these boundaries, the agent fails to comprehend and execute the actions accurately. Altering this circumstance necessitates a significant expansion of the agent’s capability to support a broader and more flexible scope of tasks. However, it is difficult for current IPA products to support tasks at scale. Most of today’s IPAs must follow specific predefined rules to complete tasks, such as developer-defined or user-demonstrated steps. Therefore, developers or users must explicitly specify which functions they wish to support, in addition to defining the triggers and steps for task execution. This approach inherently restricts the scalability to a wider range of tasks, since supporting more tasks demands extensive time and labor cost. Some approaches have attempted to automatically learn to support tasks through supervised learning or reinforcement learning [4, 5, 6]. However, these methods also rely on a substantial amount of manual demonstrations and/or the definition of reward functions.

The emergence of Large Language Models (LLMs) [7] in recent years has brought brand new opportunities for the development of IPAs, demonstrating the potential to address the scalability issues of intelligent personal assistants. In comparison to traditional methods, large language models such as ChatGPT, Claude, and others have exhibited unique capabilities such as instruction following, commonsense reasoning, and zero-shot generalization. These abilities have been achieved through unsupervised learning on massive corpora (exceeding 1.4 trillion words) followed by fine-tuning with human feedback. Leveraging these capabilities, researchers have successfully adopted large language models to empower autonomous agents (a.k.a. LLM agents), which aim to solve complex problems by automatically making plans and using tools such as search engines, code interpreters, and third-party APIs.

As a unique type of intelligent agent, IPAs also have the potential to be revolutionized by LLMs with significantly enhanced scalability, capability, and usefulness. We call such LLM-powered intelligent personal assistants Personal LLM Agents. Compared with normal LLM agents, Personal LLM Agents are more deeply engaged with personal data and mobile devices, and are more explicitly designed for assisting people rather than replacing people. Specifically, the primary way to assist users is by reducing repetitive, tedious, and low-value labor in their daily routine, letting the users focus on more interesting and valuable things, thereby enhancing the efficiency and quality of their work and life. Personal LLM Agents can be built upon existing software stacks (e.g., mobile apps, websites, etc.), while bringing a refreshing user experience with ubiquitous intelligent automation abilities. Therefore, we expect Personal LLM Agents to become a major software paradigm for personal computing devices in the AI era, as shown in Figure 1.

Figure 1: We envision Personal LLM Agents to become the dominating software paradigm for individual users in the upcoming era.

Despite the promising future of Personal LLM Agents, related research is still in its nascent stage, presenting numerous intricacies and challenges. This paper takes the first step to discuss the roadmap, design choices, main challenges and possible solutions in implementing Personal LLM Agents. Specifically, we focus primarily on the aspects related to the “_personal_” parts of Personal LLM Agents, encompassing the analysis and utilization of users’ personal data, the use of personal resources, deployment on personal devices, and the provision of personalized services. The straightforward integration of the general language capabilities of LLMs into IPAs is not within the scope of this paper.

We started by surveying domain experts on Personal LLM Agents. We invited 25 chief architects, managing directors, and/or senior engineers/researchers from leading companies who are working on IPAs and/or LLMs on personal devices. We asked for the experts’ opinions about the opportunities and challenges of integrating LLMs in their consumer-facing products. Based on our understanding and analyses of the experts’ insights, we summarized a simple and generic architecture of Personal LLM Agents, in which the intelligent management and utilization of personal data (user context, environment status, activity history, personalities, etc.) and personal resources (mobile apps, sensors, smart-home devices, etc.) play the most vital role. The ability to manage and utilize these personal objects differentiates the intelligence of Personal LLM Agents. Inspired by the L1-L5 intelligence levels of autonomous driving, we also give a taxonomy of five intelligence levels of Personal LLM Agents.

Our findings also highlight several major technical challenges in implementing such Personal LLM Agents, which can be categorized into three aspects: fundamental capabilities, efficiency, and security & privacy. We dive deeper into these aspects with detailed explanations of the challenges and a comprehensive survey of possible solutions. Specifically, for each technical aspect, we briefly explain its relevance and importance to personal LLM agents, then break it down into several main research problems. For example, the fundamental capabilities for personal LLM agents include task execution, context sensing, and memorization. The efficiency of agents is primarily determined by the LLM inference efficiency, customization efficiency, and memory retrieval efficiency. The security and privacy concerns of personal LLM agents can be categorized as data confidentiality, decision reliability, and system integrity. For each research problem, we summarize the main techniques involved with the problem, followed by a brief introduction of the related work. Due to the wide scope of the techniques in personal LLM agents, we only include the most relevant or recent works, rather than attempting to cover all related approaches.

The main content and contributions of this paper can be summarized as follows:

    1. We summarize the status quo of existing intelligent personal assistants in both industry and academia, while analyzing their primary limitations and future trends in the LLM era.
    2. We collect insights from senior domain experts in the area of LLM and personal agents, proposing a generic system architecture and a definition of intelligence levels for personal LLM agents.
    3. We review the literature on three important technical aspects of personal LLM agents, including fundamental capabilities, efficiency, and security & privacy.

2 A Brief History of Intelligent Personal Assistants

Figure 2: Major milestones in the history of intelligent personal assistants (IPAs). We mark different development stages with different colors, and some significant or ground-breaking events are highlighted with bold text.

2.1 Timeline View of the Intelligent Personal Assistants History

Intelligent Personal Assistants (IPAs) have a long history of development. We depict the rough timeline of the IPA history in Figure 2. The development progress can be divided into four stages, each marked with a unique color in the figure.

The 1st stage spans from the 1950s to the late 1980s and is mainly about the development of speech recognition techniques. Early speech recognition started from basic digits and words. Bell Laboratories developed “Audrey”, which could recognize the digits 0-9 with about 90% accuracy. In 1962, the “Shoebox” [8] system came out of the Advanced Systems Development Division Laboratory at IBM, which was capable of recognizing up to 16 words. From 1971 to 1976, the Speech Understanding Research (SUR) project, funded by the US Department of Defense, significantly advanced speech recognition technology. The Harpy system [9] was particularly representative, as it could understand sentences built from a vocabulary of 1,011 words, equivalent to the proficiency of a three-year-old child. In 1986, IBM developed the Tangora speech recognition typing system [10], capable of recognizing 20,000 words and offering predictive and error-correction capabilities. The Tangora system utilized Hidden Markov Models [11], requiring individual speaker training for voice recognition, with pauses between each word.

The 2nd stage covers the period from the 1990s to the late 2000s, when speech recognition started to be integrated into software for certain advanced functions. In 1990, the “Dragon Dictate” software [12] was released, which was the first speech recognition product for consumers. It was originally designed to work on Microsoft Windows, supporting discrete speech recognition. “Speakable items” [13] was introduced by Apple in 1993, enabling users to control their computer with natural speech. In 1996, IBM launched “MedSpeak” [14] for radiologists, which was also the first commercial product supporting continuous speech recognition. Microsoft integrated speech recognition into Office applications in 2002 [15], and Google added voice search to Google Mobile App on iPhone in 2008 [16].

The 3rd stage begins in the early 2010s. In this period, always-on virtual assistant services began to appear on mobile devices such as smartphones and personal computers. Siri [1], widely considered the first intelligent personal assistant installed on modern smartphones, was integrated into Apple’s iPhone 4S in 2011. Since its launch, Siri has remained a key piece of built-in software for Apple devices, including iPhone, iPad, Apple Watch, HomePod and Mac, continuously undergoing updates and iterations to incorporate new features. Similar to Siri, many other virtual intelligent assistants started to appear in this period. In 2014, Microsoft released Cortana [17], and gradually integrated it into desktop computers and other platforms. Amazon released Alexa [3] in the same year, which could complete tasks such as voice interaction, music playing, setting alarms, etc. Beyond voice search, Google Assistant [2] was unveiled in 2016, allowing users to interact through both speech and keyboard input.

The 4th stage started recently, when LLMs began to draw attention from all over the world. Based on LLMs, many intelligent chatbots have emerged (e.g., ChatGPT [18]), as well as some LLM-powered IPA software installed on personal devices (e.g., Copilot [19]). The details of this stage will be covered in Section 2.2.4.

2.2 Technical View of the Intelligent Personal Assistants History

Since there are many aspects that can reflect the intelligence of personal assistants, we focus mainly on one of the most important abilities of Intelligent Personal Assistants, namely the task automation ability (following instructions and completing tasks). In the following subsections, we will introduce four main types of techniques to enable intelligent task automation in IPAs. Note that these types of solutions have been developing concurrently, and there is no strict chronological order between them.

2.2.1 Template-based Programming

Most commercial IPA products support task automation through template-based approaches. In these approaches, the functions that can be automated are predefined as templates, each of which usually contains the task description, related actions, example queries to match, supported parameters to fulfill, etc. Given a user command, the agent first maps the command to the most relevant template, then follows the predefined steps to complete the task. The workflow is illustrated in Figure 3.
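The match-then-execute workflow described above can be sketched in a few lines. This is only an illustrative sketch, not the mechanism of any particular product: the `TaskTemplate` class, the word-overlap (Jaccard) matcher, and the step names are all hypothetical, and real assistants rely on their own intent-matching pipelines.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TaskTemplate:
    """Hypothetical template: one predefined, automatable function."""
    name: str
    example_queries: List[str]            # queries the template should match
    steps: List[str]                      # deterministic actions, run in order
    parameters: List[str] = field(default_factory=list)


def tokenize(text: str) -> set:
    return set(text.lower().split())


def match_template(command: str, templates: List[TaskTemplate]) -> Optional[TaskTemplate]:
    """Return the template whose example queries overlap most with the command.

    Word-level Jaccard similarity is an assumption made for illustration.
    """
    words = tokenize(command)
    best, best_score = None, 0.0
    for template in templates:
        for query in template.example_queries:
            query_words = tokenize(query)
            union = words | query_words
            if not union:
                continue
            score = len(words & query_words) / len(union)
            if score > best_score:
                best, best_score = template, score
    return best


def plan(command: str, templates: List[TaskTemplate]) -> List[str]:
    """Map a command to the predefined steps, or to nothing if no template fits."""
    template = match_template(command, templates)
    return list(template.steps) if template is not None else []
```

A command with no overlapping template yields an empty plan, which mirrors the scalability limit of the approach: every supported function has to be written down in advance.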

When using this method to automate tasks, app developers are required to follow the documentation of certain APIs (e.g., the Google Assistant API [2], SiriKit [20], etc.) to create the template for each function they want to automate. Besides, some approaches have been proposed to enable end-users to create their own task templates, such as the “Shortcuts” [21] feature on iPhone devices, which enables the automation of repetitive operation sequences. Similar functions are also implemented in many products and academic research systems for Android, such as Tasker [22], Anywhere [23], Epidosite [24], and Microsoft’s uLink [25] system.

The advantages of such template-based task automation methods lie in their reliability and accuracy, since the steps in the template are deterministic and carefully programmed. However, their scalability is quite limited, because the mechanism for supporting new tasks is relatively complex. As a result, most apps, including the popular apps from large companies, do not support any automated tasks or only support some elementary ones, leading to a very inflexible user experience. End-users can easily give up on using IPAs after several unsuccessful attempts [26, 27, 28, 29]. This limitation poses a major obstacle to the further development of template-based intelligent personal assistants.

Figure 3: The workflow of template-based task automation.

2.2.2 Supervised Learning Methods

To address the constraints of template-based IPA methods, researchers are actively investigating automated approaches for enhanced UI understanding and automation. Supervised learning offers a direct method for task automation by training models that predict subsequent actions and states based on task inputs and current states. The main research questions include how to learn a representation of software GUIs and how to train the interaction model.

The idea of learning an interaction model from human interaction traces is introduced in Humanoid [30], which aims to generate human-like test inputs based on the GUI layout information. Seq2act [4] first focused on the mobile UI task automation domain, where natural language instructions need to be mapped to a sequence of actions that can be directly executed. The framework decomposed the problem into an action phrase-extraction part and a grounding part, both using the Transformer [31] network. Inspired by the success of pretraining in NLP, ActionBert [32] uses self-supervised pretraining to enhance the model’s understanding of UIs. Specifically, to capture the semantic information of the UI switching actions, the model is designed to take a pair of UIs as input, and output embeddings of both UIs and individual components. Aimed at better compatibility with the restricted resources on mobile devices, Versatile UI Transformer (VUT) [33] was proposed to learn different UI grounding tasks within a single small model. It handles image, structure, and text-based types of data, using 3 task heads to support performing 5 distinct tasks simultaneously, including UI object detection, natural language command grounding, widget captioning, screen summarization and UI tappability prediction. Based on the self-aligned characteristics between components of different modalities, UIBert [34] presented a well-designed joint image-text model to utilize the correspondence, learning contextual UI embeddings from unlabeled data. To address the problem of lacking UI metadata, such as the DOM tree and view hierarchy, SpotLight [35] introduced a vision-only approach for mobile UI understanding by taking screenshots and a region of interest (the “focus”) as input. Composed of a vision encoder and a language decoder, it can complete tasks according to the provided screenshot and prompt. Besides, Lexi [36] was proposed to leverage text-based instruction manuals and user guides to curate a multimodal dataset. By fusing text and visual features as input to the co-attention transformer layers, the model is pre-trained to form connections between text-based instructions and UI screenshots. UINav [37] utilized a referee model to evaluate the performance of the agent and immediately inform users of the feedback. It also adopted demonstration augmentation to increase the data diversity.

As compared with template-based methods, supervised learning approaches have the potential to generalize to unseen tasks after sufficient training. However, training the model typically requires a lot of high-quality human-annotated data. Given the diversity of tasks and apps in the real world, obtaining the training data that covers diverse use cases is challenging.

2.2.3 Reinforcement Learning Methods

Unlike supervised learning-based task automation approaches that require a large number of training samples, reinforcement learning (RL)-based approaches allow the agent to acquire the capability of task automation by continuously interacting with the target interfaces. During the interaction, the agent receives rewards that indicate the progress of task completion, and it gradually learns how to automate the tasks by maximizing the cumulative reward.

To train RL-based task automation agents, a reward function that indicates the progress towards task completion is required. World of Bits (WoB) [38] was proposed as a general platform for agents to complete tasks on the Web using keyboard and mouse. The platform came with a benchmark called “MiniWoB”, containing tasks on a set of self-created toy websites with predefined rewards. Glider [5] defines the reward function for real-world websites based on the semantic similarity between the task description and the UI action sequence, as well as the locality and directionality of the action sequence.

Another challenge of RL-based task automation is the huge action space and the sparse reward. A typical GUI-grounded task usually involves 5-10 steps, each of which contains 10-100 candidate actions, leading to a search space size of $10^{5}$ to $100^{10}$. The task is completed only if the correct sequence of actions is taken. To tackle this challenge, many frameworks have been proposed. Liu et al. [6] introduced a method that uses high-level “workflows” to constrain the allowable actions at each time step. The workflows can prune out bad exploration directions, accelerating the agent’s ability to discover rewards. Gur et al. [39] decomposed a complicated instruction into multiple smaller ones and scheduled a curriculum for the agents to gradually learn to follow an increasing number of sub-instructions. Besides, a meta-learning framework is also proposed to generate instruction-following tasks. Jia et al. [40] framed the actions of an agent on the web into three distinct categories, namely, DOM selection, token selection, and mode selection. Moreover, a factorized Q-value function is designed, assuming the independence of DOM selection and token selection. Glider [5] achieves its goal of reducing the action space with a hierarchical policy, which contains a master policy to handle the overall navigation and sub-policies to deal with specific widgets. Humphreys et al. [41] proposed a framework that directly uses the mouse and keyboard to complete tasks instead of depending on specialized action spaces, which simplifies the use of behavioural priors informed by actual human-computer interactions.
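The search-space estimate in the paragraph above is simple exponentiation: a task of n steps with k candidate actions per step admits k^n distinct action sequences. The sketch below reproduces the two bounds and shows, under a purely hypothetical assumption that a workflow constraint leaves 5 allowable actions per step, how strongly pruning shrinks the space. The function name and the pruned figure are illustrative and not taken from the cited works.

```python
def search_space_size(num_steps: int, actions_per_step: int) -> int:
    """Number of distinct fixed-length action sequences (k ** n)."""
    return actions_per_step ** num_steps


# Bounds quoted in the text: 5-10 steps, 10-100 candidate actions per step.
lower = search_space_size(num_steps=5, actions_per_step=10)     # 10**5
upper = search_space_size(num_steps=10, actions_per_step=100)   # 100**10 == 10**20

# Hypothetical pruning: suppose a workflow constraint leaves only 5 of the
# 100 candidate actions allowable at each of the 10 steps.
pruned = search_space_size(num_steps=10, actions_per_step=5)    # 5**10

# If exactly one sequence completes the task, uniform random exploration
# succeeds with probability 1 / size, which is why the reward is so sparse.
p_unpruned = 1 / upper
p_pruned = 1 / pruned
```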

Similar to supervised learning methods, RL-based methods also suffer from poor generalization ability. To achieve flexible and robust task automation, the RL agent needs to train on a large number of tasks, each of which requires a well-designed reward function. Defining reward functions for a massive number of diverse tasks can be difficult.

2.2.4 Early Adoption of Foundation Models

In recent years, pretrained large foundation models, represented by large language models (LLMs), have seen rapid development and brought new opportunities for personal assistants.

The scaling law [42] for language models reveals the importance of increasing model parameters for improving model performance, and it was followed by a wave of models with billions of parameters. LLMs are typically trained with large-scale open-domain text data in an unsupervised manner, followed by instruction fine-tuning [43] and reinforcement learning with human feedback (RLHF) [44, 43] to improve performance and alignment. ChatGPT [18], unveiled by OpenAI at the end of 2022, is a milestone LLM that demonstrated astounding question-answering capabilities. By feeding simple task descriptions into the LLM as input prompts, the tasks and responses of LLMs can be easily customized. Besides, these models have also demonstrated robust generalization abilities across various language understanding and reasoning tasks. ChatGPT itself can be viewed as an intelligent personal assistant that assists users by returning information in text responses.

Inspired by the capabilities of LLMs, researchers have attempted to let LLMs use tools [45] autonomously to accomplish complex tasks. Examples include controlling browsers [46, 47] for information retrieval and summarization, invoking robot programming interfaces for robot behavior control [48, 49, 50], and calling code interpreters for complex data processing [51, 52, 53, 54], among others. It is a natural idea to integrate these capabilities into intelligent personal assistants, enabling more intelligent ways to manipulate personal data, personal devices and personalized services.

There are already some commercial products that have attempted to integrate LLMs with IPAs. For instance, Microsoft’s Copilot system [19] has integrated the capabilities of GPT-4 [55], assisting Windows users in automatically drafting documents, creating presentations, and summarizing emails, thereby enhancing their work efficiency. New Bing [56] also improves the experience of surfing the internet, providing a powerful and efficient search engine that better understands what users want. Similarly, Google has integrated LLMs (Bard [57], Gemini [58]) into its search engine to enable a more convenient web search experience. Smartphone companies including Huawei, Xiaomi, Oppo, and Vivo have also integrated large models (PanGu [59], MiLM [60], etc.) into their on-device IPA products. It is worth noting that some of them adopt solutions based on locally-deployed lightweight LLMs. So far, most of these commercial products are just simple integrations of LLM chat interfaces into personal assistants. Research about deeper functional integration will be discussed in Section 4.1.

Despite exhibiting vast potential, this research direction is currently in an early exploration stage. It is still far from the ultimate goal of truly understanding and assisting users with intelligent agents. Moreover, many issues related to efficiency, security and privacy have not been adequately addressed yet. The subsequent parts of this paper will systematically summarize and discuss the key issues in this direction.

3 Personal LLM Agents: Definition & Insights

Witnessing the great potential of LLM-based intelligent personal assistants and the wide interest from both academia and industry, we take the first step to systematically discuss the opportunities, challenges and techniques related to this direction.

We define Personal LLM Agents as a special type of LLM-based agent that is deeply integrated with personal data, personal devices, and personal services. The main purpose of personal LLM agents is to assist end-users, helping them to reduce repetitive and cumbersome work and focus more on interesting and important affairs. Following this definition, the generic automation methods (prompting, planning, self-reflection, etc.) are similar to normal LLM-based agents. We focus on the aspects that are related to the “personal” parts, such as the management of personal data, the use of smartphone apps, deployment to resource-constrained personal devices, etc.

We envision that Personal LLM Agents will become a major software paradigm for personal devices in the LLM era. However, the software stack and ecosystem of Personal LLM Agents are still at a very early stage, and many important questions related to system design and implementation remain unclear.

Therefore, we attempted to address some of the questions based on insights collected from domain experts. Specifically, we invited 25 experts who are chief architects, managing directors, or senior engineers/researchers from 8 leading companies that are working on IPA-related products, including smartphone personal assistants, smart-home solutions, and intelligent cockpit systems. We talked with them casually on the topics of Personal LLM Agents and asked them several common questions, ranging from the application scenarios to the deployment challenges. Based on our discussion and collected answers, we summarize the insights into three subsections, including the key components of Personal LLM Agents, a taxonomy of intelligence levels, and expert opinions about common problems.

3.1 Key Components

Based on our discussions about the desired features of Personal LLM Agents, we first summarize the main components to support such features, as shown in Figure 4.

Figure 4: Main components of Personal LLM Agents.

Undoubtedly, the core of Personal LLM Agents is a foundation model (large language model or other variants, we call it LLM for simplicity), which connects all other components. Firstly, the LLM is the basis to support different skills for serving the users, including responsive skills that directly execute tasks as users requested (such as question answering, weather checking, event scheduling, etc.) and proactive skills that offer services without explicit user commands (such as life logging, managing user attention, activity recommendation, etc.).

Secondly, to support these skills, the LLM manages various local resources, including mobile applications, sensors, and IoT devices. For example, the agent may complete weather checking by interacting with a smartphone weather app. Meanwhile, many people have mentioned the importance of Personal LLM Agents to provide personalized and context-aware services. Therefore, the LLM should maintain the information about the user, including the current user context (status, activity, location, etc.) and historic user memory (profile, logs, personality, etc.). To manipulate these resources, contexts and memories, it is also desired to use dedicated management systems like vector databases in combination with the LLM.

The combination of these key components is analogous to an operating system [61], wherein:

    1. The foundation model is like the kernel in traditional operating systems. It is employed for systematic management and scheduling of various resources, thereby facilitating the functions of the agents.
    2. The local resource layer is similar to the driver programs in traditional operating systems. In a traditional OS, each driver manages a specialized set of hardware, while in Personal LLM Agents, each local resource component manages a type of tool and provides APIs for the LLM to use.
    3. User context and user memory correspond to the program contexts and system logs maintained during system operations. These components form the basis for the agent to support personalized services.
    4. The skills at the top layer are analogous to the software applications in a traditional OS. Similar to the installation and removal of applications, the skills of agents should also be allowed to be flexibly enabled or disabled.
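
To make the analogy concrete, the following is a minimal sketch of how the four layers might be wired together. It is an illustration only: the class `PersonalAgent`, its fields, and the `weather_app` resource are hypothetical names introduced here, and the LLM "kernel" is replaced by a plain dispatch function.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class PersonalAgent:
    """Toy container mirroring the OS analogy; all names are hypothetical."""

    # "Drivers": each local resource exposes a callable API for the LLM to use.
    resources: Dict[str, Callable[[str], str]] = field(default_factory=dict)
    # "Program context": the current user context (status, location, ...).
    context: Dict[str, str] = field(default_factory=dict)
    # "System logs": a record of past interactions (user memory).
    memory: List[Tuple[str, str, str]] = field(default_factory=list)
    # "Applications": skill name -> enabled flag.
    skills: Dict[str, bool] = field(default_factory=dict)

    def enable_skill(self, name: str) -> None:
        self.skills[name] = True

    def disable_skill(self, name: str) -> None:
        self.skills[name] = False

    def run(self, skill: str, resource: str, arg: str) -> str:
        """Stand-in for the "kernel": check the skill, then call a resource."""
        if not self.skills.get(skill, False):
            raise PermissionError(f"skill '{skill}' is disabled")
        result = self.resources[resource](arg)
        self.memory.append((skill, arg, result))  # keep a log for personalization
        return result


# The weather app is simulated by a lambda (an assumption for illustration).
agent = PersonalAgent(resources={"weather_app": lambda city: f"sunny in {city}"})
agent.enable_skill("weather_checking")
reply = agent.run("weather_checking", "weather_app", "Beijing")
```

In a real system the dispatch decision inside `run` would be made by the foundation model rather than passed in by the caller; the sketch only shows how skills can be toggled independently of the resources they use.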

3.2 Intelligence Levels of Personal LLM Agents

The desired features of Personal LLM Agents require different kinds of capabilities. Inspired by the six levels of autonomous driving, we categorize the intelligence levels of Personal LLM Agents into five levels, denoted as L1 to L5, as shown in Figure 5. The key characteristics and representative use cases of each level are listed in Table 1.

Figure 5: The duties of Personal LLM Agents at different intelligence levels.

Table 1: Different levels of intelligence for Personal LLM Agents.

| Level | Key Characteristics | Representative Use Cases |
| --- | --- | --- |
| L1 - Simple Step Following | Agent completes tasks by following exact steps predefined by the users or the developers. | (1) User: “Open Messenger”; Agent opens the app named Messenger. (2) User: “Open the first unread email in my mailbox and read its content”; Agent follows the command step by step. (3) User: “Call Alice”; Agent matches a developer-defined template, finds Alice’s phone number in the address book, and calls the number. |
| L2 - Deterministic Task Automation | Based on the user’s description of a deterministic task, agent auto-completes the necessary steps in a predefined action space. | (1) User: “Check the weather in Beijing today”; Agent automatically calls the weather API with parameter “Beijing” and parses info. from the response. (2) User: “Make a video call to Alice”; Agent automatically opens the address book, finds Alice’s contact, and clicks on “video chat”. (3) User: “Tell the robot vacuum to clean the room tonight”; Agent opens the robot vacuum app, clicks ‘schedule’, and sets the time to tonight. |
| L3 - Strategic Task Automation | Based on user-specified tasks, agent autonomously plans the execution steps using various resources and tools, and iterates the plan based on intermediate feedback until completion. | (1) User: “Tell Alice about my schedule for tomorrow”; Agent gathers tomorrow’s schedule information from the user’s calendar and chat history, then summarizes and sends it to Alice via Messenger. (2) User: “Find out which city is suitable for travel recently”; Agent lists several cities suitable for travel, checks the weather in each city, summarizes the information, and returns recommendations. (3) User: “Record my sleep quality tonight”; Agent checks every 10 minutes during sleep time if the user is using the phone, moving, or snoring (based on smartphone sensors and microphone), summarizes the information, and generates a report. |
| L4 - Memory and Context Awareness | Agent senses user context, understands user memory, and proactively provides personalized services at appropriate times. | (1) Agent recommends suitable financial products automatically based on User’s recent income and expenses, considering User’s personality and risk preference. (2) Agent estimates User’s recent anxiety level based on the conversations and behaviors, recommends movies/music to help relax, and notifies user’s friends or doctors depending on the severity. (3) When a user falls in the bathroom, the Agent detects the event and decides whether to ask the user, notify the user’s family members, or call for help based on the user’s age and physical conditions. |
| L5 - Digital Persona | Agent fully represents the user in completing complex affairs, can interact on behalf of user with other users or agents, ensuring safety and reliability. | (1) Agent automatically reads emails and messages on behalf of User, replies to questions without user intervention, and summarizes them into an abstract. (2) Agent attends the work discussion meeting on behalf of the user, expresses opinions based on user’s work log, listens to suggestions, and writes the minutes. (3) Agent records User’s daily diet and activities, privately researches or asks experts about any anomalies, and makes health improvement suggestions. |

At each level, the user and agent are responsible for different duties. At Level 1 (Simple Step Following), agents only take charge of step execution, and the user is in charge of all other duties. For example, when users give a command, agents follow explicit steps defined by the developer or given by the user to complete the task. L1 agents have no sensing or planning ability. Most template-based IPA products belong to this category.

As the intelligence level increases, the agents gradually take on more duties. At level 2, the supported tasks are still deterministic (i.e., involving a fixed sequence of actions to complete), but the detailed steps to execute each task are no longer given explicitly. The agents have to auto-complete the necessary steps based on the user’s task description. For instance, given a user query “How is the weather of Beijing today”, the agent calls the weather API with “Beijing” as a parameter and retrieves weather information from the response. Unlike the deterministic tasks at level 2, agents at level 3 can complete more complicated tasks that require strategic planning and self-reflection. For instance, the command “Tell Alice about my schedule for tomorrow” requires the agent to determine how to gather the schedule information (e.g., using the user’s calendar and chat history) and how to inform Alice about the information (e.g., summarizing the calendar events and sending via the messenger app). In these tasks, agents autonomously and iteratively generate and perform the execution plan based on intermediate feedback until completing the tasks.

The agents in L1-L3 work passively driven by the users’ commands, while agents at level 4 can understand users’ historical data, sense the current situation, and proactively offer personalized services at appropriate times.

With ultra intelligence at level 5, agents play the role of a Digital Persona that can fully represent the user in completing complex affairs, thus users only need to focus on creativity and emotion. Agents not only sense the current status, but also predict the users’ future activities and take actions to facilitate them. Beyond directly serving users, Digital Persona can also collaborate with other agents to alleviate the burden of their users’ communication. Moreover, the level-5 agents should be able to continuously improve themselves through self-evolution.
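
The level taxonomy can be captured compactly in code. The sketch below is an illustration under stated assumptions: the enum and helper names are hypothetical, and treating L5 as proactive (like L4) and L3 and above as planning-capable is our reading of the descriptions above rather than a formal definition.

```python
from enum import IntEnum


class IntelligenceLevel(IntEnum):
    """The five levels L1 to L5 (member names are hypothetical)."""

    SIMPLE_STEP_FOLLOWING = 1
    DETERMINISTIC_TASK_AUTOMATION = 2
    STRATEGIC_TASK_AUTOMATION = 3
    MEMORY_AND_CONTEXT_AWARENESS = 4
    DIGITAL_PERSONA = 5


def plans_autonomously(level: IntelligenceLevel) -> bool:
    # From L3 on, the agent plans and iterates on execution steps itself.
    return level >= IntelligenceLevel.STRATEGIC_TASK_AUTOMATION


def is_proactive(level: IntelligenceLevel) -> bool:
    # L1-L3 are driven passively by user commands; L4 and above
    # (assumed here to include L5) offer services proactively.
    return level >= IntelligenceLevel.MEMORY_AND_CONTEXT_AWARENESS


passive_levels = [lv.name for lv in IntelligenceLevel if not is_proactive(lv)]
```

Because `IntEnum` members compare as integers, capability checks reduce to simple threshold comparisons, which matches the cumulative nature of the levels.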

3.3 Opinions on Common Problems

Next, we report the aggregated results of the experts’ opinions on several common questions. The questions cover the design choices and the potential challenges of deploying Personal LLM Agents, as summarized in Table 2.

We analyze the answers to the questions and summarize the following main takeaways.

Table 2: The common questions that we asked the domain experts. In Questions 1 to 6, we gave several common options for the experts to select/prioritize, while the experts were also allowed to give free-form answers. In Questions 7 and 8, the experts were asked to answer with text.

| ID | Question |
| --- | --- |
| 1 | If the LLM is applied to personal intelligent agents, do you think it should be deployed locally or remotely? |
| 2 | How do you think customized models tailored for different users or organizations should be implemented? |
| 3 | For the LLM deployed on personal devices, which modality(ies) do you think need to be supported? |
| 4 | What do you think is the most important capability of LLMs for personal LLM agents? |
| 5 | Considering the industry you are in, which ways of interaction do you think are the most promising for personal LLM agents? |
| 6 | In the future development of personal LLM agents, which aspect is the most crucial? |
| 7 | What features do you hope a future personal LLM agent can provide for you or your customers? |
| 8 | When integrating LLM with personal devices, what challenges do you think will be faced? What are the most urgent technical issues that need to be addressed? |

Opinion 1 (where to deploy the LLM): _Edge-cloud (local-remote) collaborative deployment of the LLM is preferred, while the existing cloud-only (remote-only) deployment (e.g., ChatGPT) is not a widely acceptable solution._ As shown in Figure 6, 88% of participants prefer an edge-cloud collaborative architecture, 58.33% of them support local deployment, and 81.82% of them are not satisfied with the existing cloud-only solutions. Their main concerns are 1) the high latency of remote LLM services, 2) the privacy issue of transmitting personal data to the cloud, and 3) the huge cost of cloud-based LLM services.

Figure 6: The vote distribution of different LLM deployment strategies in Personal LLM Agents.

Figure 7: The vote distribution of different model customization methods for Personal LLM Agents.

Opinion 2 (how to customize the agents): _Combining fine-tuning and in-context learning is the most acceptable way to achieve customization._ In Personal LLM Agents, customizing the agent for different users and scenarios is considered necessary. Figure 7 shows that 66.67% of participants support combining the advantages of both fine-tuning and in-context learning to reach personalization (L4 intelligence). 43.75% of them do not believe L4 can be achieved by in-context learning; one possible reason is that our participants are from industry and are therefore more focused on LLMs for specific vertical domains, where in-context learning hasn’t received much attention.

In questions 3-5, we ask participants to rank the options, and the following tables (Tables 3-5) summarize their rankings. Rank 1st-4th denotes the rank assigned to each option by the participants; for example, 72% in Table 3 means that 72% of participants rank Text as their most preferred modality. The “score” in each table is calculated based on the Borda Count [62], where each candidate receives points equal to the average of the number of candidates they outrank in each ballot, with the lowest-ranked getting $2$ and the highest $n+1$ points, where $n$ is the total number of candidates. For instance, $4.56$ in Table 3 equals $5\times 72\%+4\times 20\%+3\times 0+2\times 8\%$.
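
The score computation can be reproduced with a few lines of code. The sketch below is an illustration, not the authors' script: the function name `borda_score` is hypothetical, and it assumes the input is the fraction of ballots placing an option at each rank, ordered from 1st to last.

```python
def borda_score(rank_shares):
    """Average Borda points for one option.

    rank_shares[i] is the fraction of ballots ranking the option at
    position i + 1. With n options, 1st place earns n + 1 points and
    last place earns 2 points.
    """
    n = len(rank_shares)
    return sum((n + 1 - i) * share for i, share in enumerate(rank_shares))


# "Text" row of Table 3: 72% 1st, 20% 2nd, 0% 3rd, 8% 4th.
text_score = borda_score([0.72, 0.20, 0.0, 0.08])
print(round(text_score, 2))  # 4.56
```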

Opinion 3 (what modalities to use): _The multi-modal LLM, especially Textual and Visual modalities, is desired for Personal LLM Agents._ In our statistical result, Text is the most preferred modality, in line with the most popular LLMs in use (e.g., the GPT series and the LLaMA series). The second-ranked Image option and the Video modality, which is specifically mentioned by 20% of the participants, show that the visual modality plays a promising role in the future of personal LLM agents.

Table 3: The favored modalities to be used in Personal LLM Agents.

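| Options | Scores | Rank 1st | Rank 2nd | Rank 3rd | Rank 4th |
| --- | --- | --- | --- | --- | --- |
| Text | 4.56 | 72% | 20% | 0% | 8% |
| Image | 3.64 | 4% | 64% | 24% | 4% |
| Voice | 3.18 | 16% | 4% | 60% | 20% |
| Sensors | 2.18 | 9.52% | 14.29% | 9.52% | 66.67% |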

Opinion 4 (which LLM ability is the most crucial for IPA products): _Language understanding is considered the most important capability of LLMs, whereas the ability to handle long contexts is regarded as the least important one._ In academia, on the contrary, the capability to handle long context is regarded as very important and is extensively studied. This difference arises because our participants have specific vertical-domain LLMs in mind, whereas academic researchers target general-purpose LLMs. In vertical-domain LLMs, the queries and tasks from users are not very diverse, hence long-context capacity is not that critical.

Table 4: The importance ranking of LLM abilities for IPA products.

| Options | Scores | Rank 1st | Rank 2nd | Rank 3rd | Rank 4th |
| --- | --- | --- | --- | --- | --- |
| Language understanding | 4.52 | 83.33% | 8.33% | 4.17% | 4.17% |
| In-context learning | 3.16 | 4.55% | 50% | 45.45% | 0% |
| Common sense reasoning | 3 | 8.33% | 33.33% | 29.17% | 20.83% |
| Long context | 1.8 | 5.56% | 11.11% | 16.67% | 61.11% |

Opinion 5 (how to interact with the agents): _Voice-based interaction is the most popular way._ Unsurprisingly, voice interaction, which mimics human communication as in the existing virtual assistant Siri, is the most common and efficient choice. Text-based chatbots and GUI rank second and third since most of the participating experts focus on mobile devices, e.g., smartphones. Virtual reality only obtains a score of 1.52, which is the lowest across all questions; this may stem from the high price of VR devices and the unsatisfactory user experience of current VR techniques.

Table 5: The favored interaction method of Personal LLM Agents.

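| Options | Scores | Rank 1st | Rank 2nd | Rank 3rd | Rank 4th |
| --- | --- | --- | --- | --- | --- |
| Voice interaction | 4.04 | 60.87% | 17.39% | 21.74% | 0% |
| Text chatbox | 3.32 | 22.73% | 45.45% | 18.18% | 13.64% |
| GUI | 3.24 | 23.81% | 38.1% | 38.1% | 0% |
| Virtual reality | 1.52 | 0% | 6.25% | 25% | 68.75% |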

Opinion 6 (which agent ability is needed to develop): In the future development of Personal LLM Agents, “more intelligent and autonomous decision-making capability” is considered the most critical feature among our participants; almost half of the participants (47.83%) rank it at first place. The options “Continuous improvement of user experience and interaction methods” and “Secure handling of personal data” also received much attention, with 36.36% and 33.33% respectively, tying for the second place. Although "Integration with IoT devices" ranks last, 47.63% of participants still believe it is important as an infrastructure for Personal LLM Agents.

Opinion 7 (what features are desired for an ideal IPA): Based on the responses from the participants, we summarize the following six key features of an ideal agent:

Opinion 8 (what are the most urgent technical challenges): According to the responses from the participants, the most urgent challenges and technical issues are categorized as follows:

Motivated by the valuable opinions of domain experts, the following sections will discuss the desired capabilities and potential challenges in more detail.

4 Fundamental Capabilities

We first discuss the capabilities required by Personal LLM Agents to support diverse features. Excluding the general capabilities of normal LLM agents, we focus on three fundamental capabilities for personal assistants: task execution, context sensing, and memorization. Task execution (§4.1) translates the users’ commands or the proactively perceived tasks into actions on personal resources. Context sensing (§4.2) perceives the current state of the user and the environment, providing comprehensive information for task execution. Memorization (§4.3) records the user data, enabling the agent to recall past events, summarize knowledge, and self-evolve. While context sensing and memorization are abilities associated with querying information from users, task execution refers to the ability to provide services to users. Figure 8 depicts the relation of these fundamental capabilities. The following sections discuss these capabilities in detail.

Figure 8: The fundamental capabilities of Personal LLM Agents.

4.1 Task Execution

Task execution is a fundamental capability of a Personal LLM Agent, enabling it to respond to user requests and carry out specified tasks. In our scenario, the agent is designed to interact with and control various personal devices such as smartphones, computers and IoT devices to automatically execute users’ commands.

A fundamental requirement for task execution is the agent’s ability to accurately interpret tasks as communicated by users. Typically, tasks may originate from users’ verbal or written instructions, from which the intelligent agent discerns the user’s intent. With the maturation of voice recognition technology, converting voice information into text has become highly convenient [63, 64].

Personal LLM Agents should make plans and take actions automatically after converting the users’ commands into text. While planning poses a challenge for traditional DNNs, LLM-based agents exhibit greater proficiency in this regard. The planning and reasoning abilities of LLM agents have been discussed in the former surveys [65, 66, 67]. Our paper primarily focuses on the manipulation of personal data and interaction with personal devices. A significant consideration is that Personal LLM Agents might need to interact with applications or systems that may lack comprehensive API support. Consequently, we also explore the user interface (UI) as an important tool for personal agents, enabling effective interaction in scenarios where API limitations exist.

4.1.1 Task Automation Methods

Based on the interaction mode, the methods of task execution can be categorized into code-based and UI-based approaches. In the code-based scenario, agents primarily complete tasks by automatically generating code to call APIs. In the UI-based scenario, agents interact with personal devices by automatically simulating human interactions with the UI.

Code-based Task Automation often involves generating appropriate code to interact with APIs, databases, and DNN models. Traditional code-based personal assistants are often based on slot-filling task-oriented dialogue (TOD) frameworks. In the era of LLMs, more researchers are attempting to use LLMs to directly generate code that calls APIs in order to accomplish more complex tasks.
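
As an illustration of the traditional slot-filling style mentioned above, the sketch below maps an utterance to an API call through a hand-written template. Everything in it is hypothetical: `get_weather` stands in for a real weather API, and the single regular expression plays the role of a developer-defined intent template. An LLM-based approach would replace the template matching with generated code or a generated API call.

```python
import re
from typing import Optional


def get_weather(city: str) -> str:
    """Hypothetical stand-in for a real weather API."""
    return f"weather report for {city}"


# Intent name -> (pattern with named slots, API function to call).
INTENTS = {
    "check_weather": (
        re.compile(r"weather in (?P<city>[A-Za-z ]+?)(?: today)?$", re.IGNORECASE),
        get_weather,
    ),
}


def handle(utterance: str) -> Optional[str]:
    """Fill the slots of the first matching intent and call its API."""
    for pattern, api in INTENTS.values():
        match = pattern.search(utterance.strip())
        if match:
            return api(**match.groupdict())
    return None  # no template matched; a real assistant would ask for clarification


result = handle("Check the weather in Beijing today")
```

The rigidity is visible immediately: any phrasing outside the template returns `None`, which is the limitation that motivates generating the API-calling code with an LLM instead.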

Code-based methods can complete thousands of tasks, from web searching to image generation. However, not all the needed APIs are available to agent developers in real-life apps, owing to security concerns or business interests. Besides, there are tasks that human users can execute easily but that are difficult to accomplish by calling system APIs [72]. Depending solely on publicly available APIs may not fully meet the highly diverse requirements for mobile task automation.

UI-based Task Automation. Autonomous UI agents attempt to translate users’ tasks into UI actions on smartphones or other personal devices, automating these tasks through direct UI interaction. Compared to code-based task execution, autonomous UI agents do not rely on publicly available APIs, potentially allowing for more versatile automation capabilities. However, executing users’ tasks by UI actions is not easy for traditional DNN models because of the implicit relations between tasks and UI elements. Recently, researchers utilize the comprehension and reasoning abilities of LLMs to improve the performance of autonomous UI agents.

The input of the UI agent is a task described in natural language, and a representation of the current UI, and the output is the UI action to be executed on the UI. Depending on how they represent the UI, we can categorize the autonomous UI agents into text-based GUI representation and multimodal GUI representation.
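
The input/output contract of a text-based UI agent can be sketched as follows. This is a simplified illustration, not the format of any particular system: the `UIElement` and `UIAction` types, the prompt layout, and the `click(...)`/`input(...)` reply syntax are all assumptions made here, and the LLM call itself is omitted.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class UIElement:
    index: int
    kind: str   # e.g. "button", "textbox"
    text: str


@dataclass
class UIAction:
    action: str      # "click" or "input"
    target: int      # index of the UI element to act on
    value: str = ""  # text to type, only used by "input"


def render_prompt(task: str, elements: List[UIElement]) -> str:
    """Serialize the task and the current UI into a text prompt."""
    lines = [f"Task: {task}", "Current UI:"]
    lines += [f"[{e.index}] <{e.kind}> {e.text}" for e in elements]
    lines.append("Answer as: click(<index>) or input(<index>, <text>)")
    return "\n".join(lines)


def parse_action(reply: str) -> UIAction:
    """Parse the model's reply into a structured UI action."""
    m = re.fullmatch(r"(click|input)\((\d+)(?:,\s*(.*))?\)", reply.strip())
    if m is None:
        raise ValueError(f"unrecognized action: {reply!r}")
    return UIAction(m.group(1), int(m.group(2)), (m.group(3) or "").strip())


screen = [UIElement(0, "textbox", "Search contacts"), UIElement(1, "button", "Video chat")]
prompt = render_prompt("Make a video call to Alice", screen)
action = parse_action("click(1)")  # a reply the LLM might plausibly return
```

A multimodal agent would replace `render_prompt` with a screenshot (possibly plus text), while the action side of the contract can stay the same.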

While UI-based task automation has the potential to achieve a more flexible personal agent framework compared to API-based automation, its research is still in the early stages. It remains challenging to accomplish more complex user commands. Besides, the privacy and security issues have not been fully addressed [92, 96]. The choice of UI representation also remains controversial. While multimodal representation can handle elements that cannot be parsed through accessibility services, it is plagued by the heavy demands of screen recording and the limited reasoning abilities of current vision language models [101].

4.1.2 Autonomous Agent Frameworks

An LLM-powered autonomous agent is composed of an LLM brain to make plans and self-reflection, a memory to store past information and knowledge, and a tool usage module to interact with tools (e.g. APIs, UIs, programming languages) [102, 66]. There are a lot of popular projects that provide frameworks for users to create LLM-powered agents [103, 104, 105, 106, 107, 108, 109, 110, 111]. They attempt to enhance the ability of LLMs by interacting with other external tools and retrieving long/short-term memory. Auto-GPT [103] is one of the most famous frameworks, which can execute users’ commands by generating prompts for GPT and using external tools. LangChain [104] is another popular framework that helps developers to create more sophisticated and context-aware applications using LLMs. Due to the ability to understand and produce natural language, LLM-powered agents can also engage with one another effortlessly, fostering an environment where collaboration and competition among multiple agents can thrive [112, 113, 108, 114]. These autonomous agent frameworks make significant engineering contributions, providing a more user-friendly framework for the LLM-powered applications.

For mobile devices, AutoDroid [92] provides an effective framework for developing mobile agents. Developers can easily create an automator for mobile tasks by either exploring apps using a test input generator or through manual demonstration. AutoDroid then automatically analyzes these records and utilizes them to improve the LLMs for more efficient task automation. Huang et al. [115] develop a new method to effectively extract macros (basic units of user activity in apps such as “login”, or “call a contact”) from user-smartphone interaction traces. These macros can help agents to automatically complete tasks.

4.1.3 Evaluation

Evaluating the performance of task execution is a challenging issue. For API-based task execution, former surveys have provided a comprehensive summary on how to evaluate them [65, 67]. Our paper mainly focuses on the evaluation of UI-based task automation.

Metrics: The metrics of UI-based task execution are completion rate [4, 95, 92] and manually designed reward [116, 117]. The completion rate is the probability that all actions predicted by the model are entirely consistent with the ground truth. However, since there may be different methods to complete a task, and the ground truth typically represents only one of these methods, the accuracy evaluated by this approach is not entirely correct [92]. Manually designing rewards based on the crucial steps can be more precise [117], but they are less scalable because of the complex annotating process.
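
The completion-rate metric, and its weakness, can be shown in a few lines. The sketch below is an illustration with made-up action strings; the function name and the data are hypothetical, and a task counts as completed only if the predicted action sequence matches the single ground-truth sequence exactly.

```python
from typing import List, Sequence


def completion_rate(predicted: Sequence[List[str]], ground_truth: Sequence[List[str]]) -> float:
    """Fraction of tasks whose predicted actions exactly match the ground truth."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must cover the same tasks")
    if not ground_truth:
        return 0.0
    matches = sum(p == g for p, g in zip(predicted, ground_truth))
    return matches / len(ground_truth)


ground_truth = [
    ["click settings", "click font size"],
    ["click calendar", "select all", "click delete"],
    ["open messenger"],
]
predicted = [
    ["click settings", "click font size"],
    # A different route that might also clear the calendar still counts as a miss.
    ["click calendar", "click clear all"],
    ["open messenger"],
]
rate = completion_rate(predicted, ground_truth)  # two of three tasks match
```

The second task shows why exact matching can underestimate an agent: an alternative but valid action sequence is scored as a failure because the ground truth records only one way to complete the task.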

Table 6: UI task automation benchmarks. The structured UI formats are view hierarchy (VH) and document object model (DOM) for Android and web, respectively. For Windows, the metadata stems from the textual metadata within the operating system.

| Type | Benchmark Name | Platforms | Human annotations | UI format | High-level tasks | Exploration memory |
| --- | --- | --- | --- | --- | --- | --- |
| Datasets | PhraseNode [118] | Web | 51,663 | DOM, Screen | | |
| | UIBert [34] | Web | 16,660 | DOM, Screen | | |
| | RicoSCA [4] | Android | N/A | VH, Screen | | |
| | PixelHelp [4] | Android | 187 | VH, Screen | | |
| | MoTiF [119] | Android | 6,100 | VH, Screen | | |
| | META-GUI [95] | Android | 4,684 | VH, Screen | | |
| | UGIF [120] | Android | 523 | VH, Screen | | |
| | Mind2Web [91] | Web | 2,350 | DOM, Screen | | |
| | AITW [121] | Android+Web | 715,142 | Screen | | |
| | DroidTask [92] | Android | 158 | VH, Screen | | |
| Platforms | MiniWoB++ [38, 6] | Web | 17,971 | DOM, Screen | | |
| | WebShop [122] | Web | 12,087 | DOM, Screen | | |
| | WebArena [123] | Web | 812 | DOM, Screen | | |
| | AndroidEnv [116] | Android | N/A | Screen | | |
| | MobileEnv [117] | Android | N/A | VH, Screen | | |
| | AssistGUI [124] | Windows | 100 | Metadata, Screen | | |

Benchmarks: Table 6 lists the benchmarks of UI-based task automation. One group of benchmarks is static datasets, which often include a set of human-annotated tasks, structured UI data (and screenshots), and actions to complete the tasks. Some of the tasks are synthetically generated [4, 116, 117]. The early works mainly focus on low-level tasks with clear instructions [118, 34], for example, click the ‘settings’ button, and then click ‘Font size’. Later works introduce high-level tasks that could be completed in multiple steps [4, 119, 95, 120, 91, 121], for example, delete all the events in my calendar. Another group of benchmarks consists of platforms that the agent can interact with. MiniWoB++ [38, 6], WebShop [122], and WebArena [123] provide web environments where agents can navigate and operate on the web by clicking, typing, closing pages, and so on. AndroidEnv [116] and MobileEnv [117] provide a dynamic environment where agents can engage with any Android-based application and the core operating system. This framework allows for a wide scope of interaction and task-solving capabilities within the diverse Android platform.

Remark. Existing approaches have demonstrated the remarkable ability of LLM agents in task reasoning and planning. However, there are several important problems to solve to realize practical Personal LLM Agents. 1. How to accurately and efficiently assess the performance of agents in real-world scenarios. Because there are usually various ways to accomplish the same task, it is inaccurate to use a static dataset to measure the accuracy of task execution. Meanwhile, dynamically testing the tasks in a simulated environment may be inefficient and hard to reproduce. 2. How to robustly determine if a task has been completed. LLMs often experience hallucinations during task execution, making it difficult to determine whether the current task has been completed. 3. Regarding UI agents, what is the best way to represent the software UI? The vision-based representation (e.g. screenshot) is generally available, while the text-based representation is usually more lightweight and friendly for LLM agents to operate.

4.2 Context Sensing

Context Sensing refers to the process by which the agent senses the status of the user or the environment in order to provide more customized services. In this work, we adopt a broad definition of context sensing by considering any generic information-gathering process as a form of sensing. Hardware-based sensing aligns with the conventional notion of sensing, primarily involving data acquisition through various sensors, wearable devices, edge devices, and other data sources. On the other hand, software-based sensing emphasizes diverse means of data acquisition. For example, analyzing user typing habits and common phrases constitutes a form of software-based sensing.

In Personal LLM Agents, context sensing capability serves various purposes:

    1. Enabling Sensing Tasks: Some tasks inherently require the agent to do sensing. For instance, when a user requires the agent to detect snoring during sleep, the agent must possess the ability to actively acquire, process, and analyze audio data.
    2. Supplementing Contextual Information: The sensed information can facilitate the execution of ambiguous or complex tasks. For example, when the user wants to listen to some music, it helps to know the user’s current activity in order to recommend appropriate music.
    3. Triggering Context-aware Services: The sensing capability is also the basis for providing proactive services. For example, the agent may remind the user to stay focused upon detecting dangerous driving behaviors.
    4. Augmenting Agent Memory: Some information perceived through sensing can become a part of the agent memory, which can be used by the agent for further customization and self-evolution.

We introduce the techniques of context sensing from two perspectives, including sensing sources and sensing targets.

4.2.1 Sensing Sources

Hardware Sensor. Modern personal devices are equipped with a wide range of built-in hardware sensors, including accelerometers, gyroscopes, magnetic field sensors, light sensors, thermometers [125], microphones [126], GPS modules, cameras [127], etc. Some other modules such as bluetooth and Wi-Fi [128] can also be used for sensing purposes. With the growing prevalence of wearable and IoT devices such as smart watches, bluetooth headphones [129], and smart home devices [130], the sensing scope and sensing modalities are greatly expanded.

Software Sensor. Unlike hardware sensing, which obtains data from real sensor devices, software sensing focuses on obtaining information from existing data, such as app usage [131], call records [132], and typing habits [133]. In reality, the realm of software sensing is incredibly broad. For instance, in the field of natural language processing or audio, there exists a plethora of sensing research based on text or speech. Furthermore, in recommendation systems such as e-commerce or short-video platforms, the process typically involves first sensing certain user information and subsequently recommending specific products or content. These sensors let agents better understand the users, enabling them to provide more intelligent and personalized services.

Combination of Multiple Sensors. Multi-sensor collaborative sensing stands out as an effective method for enhancing perceptual capabilities. Previous endeavors have demonstrated the assessment of user emotions, stress levels, and emotional states based on touchscreen and inertial sensors [134], identification of time spent through screen capture and sensor data [135], breath detection through headphone microphones [136], and nuanced motion detection through sensors and audio [137].

The significance of multi-sensor collaboration extends to the proliferation of intelligent wearables and smart homes. Examples include automatic recognition of when a user is working or resting using data collected from personal devices (smartwatches, laptops, and smartphones) [138], and action detection through the combination of headphones and smartphone microphones [129]. Other techniques fuse household appliances and infrastructure, such as user action perception based on existing wired devices [139], motion recognition in smart home environments [130], Wi-Fi-based motion detection [140], multi-person detection [128], and sleep monitoring [141].

Multi-sensor and multi-device scenarios necessitate intricate considerations in data source selection, data fusion, and data analysis methods. Existing methodologies include LLM-driven strategies for generating multi-sensor policies in human behavior understanding [142], emotion-agnostic multi-sensor data multitask learning frameworks [143], cross-modal fusion of sensing data [144], wearable device motion recognition with a focus on multi-sensor fusion [145], and anxiety prediction from sensor data when part of the data is missing [146]. Furthermore, there are studies that analyze the importance of data features in fall detection [147].

With the evolution of sensing technologies, multi-sensor and multi-device collaborative sensing has become a staple approach for perceiving complex scenarios. Effectively integrating diverse data sources to maximize accuracy and determining methods to eliminate less crucial data from a multitude of sources to conserve resources are vital research areas.

4.2.2 Sensing Targets

The objectives of context sensing can be categorized into environment sensing and user sensing. Environment sensing encompasses factors such as location, occasion, religious and cultural backgrounds, national and societal contexts, and more. Meanwhile, user sensing incorporates elements such as user activities, states, personal information, personality traits, emotions, goals, physical conditions, and other related aspects.

Sensing the Environment. We further categorize environment sensing into two dimensions: scene sensing and occasion sensing. Scene sensing predominantly involves more tangible environmental factors, such as locations and places. Occasion sensing delves into deeper environmental information, including religious and cultural backgrounds, national differences, and social relationships.

With the recent advancements in LLMs, there have also emerged some environment sensing methods. For example, forecasting pedestrian flow through the analysis of public events [156] and LLM-based environmental understanding using multiple sensors [157].

Environment sensing provides crucial context information for a personal agent. Different environments lead to distinct behaviors and focal points, extending beyond mere locations to encompass social occasions, cultural backgrounds, and deeper conceptual elements, such as the individuals in the environment, their relationships and interactions, and the anticipated impacts on both the environment and the user. These considerations directly influence the level of intelligence exhibited by the personal agent.

Sensing the User. User awareness is one of the primary features of Personal LLM Agents. A deeper understanding of the user can better reflect the value and significance of the Personal LLM Agents. We categorize user sensing into two temporal dimensions: short-term and long-term. Short-term sensing exhibits higher temporal variability and increased randomness. On the other hand, long-term sensing necessitates extended maintenance and correction, making it relatively more stable and reliable.

In the realm of user sensing, there are also several LLM-based initiatives, such as employing LLM for recommendation tasks [172, 173], sentiment analysis with LLM [174], and the development of a personal doctor equipped with inquiry and perception capabilities [175].

Remark. Existing methods often confine themselves to specific sensors, individual apps, or particular domains. In Personal LLM Agents, a possible opportunity is to unify the sensing results about the environment and the user that originate from diverse sources. However, achieving this goal involves several important research challenges. 1. What is a unified format or ontology of the sensed information? The agents should be able to convert diverse sensing data into this format and conveniently use the data for various downstream tasks. 2. Given the broad scope of sensing, how can the agents decide when and what to sense, in order to provide context-aware services with minimal overhead?

4.3 Memorizing

Memorizing denotes the capability to record, manage and utilize historical data in Personal LLM Agents. This capability enables the agents to keep track of the user, learn from past experiences, extract useful knowledge, and apply this acquired knowledge to further enhance the service quality. The related work is mainly aimed to answer two questions, including how to obtain the memory and how to utilize the memory.

4.3.1 Obtaining Memory

The agent memory can be in various formats. For example, the basic user profiles (e.g., birthdate, addresses, personalities, preferences) are often stored in key-value pairs, allowing for easy key-based retrieval. Historical records are usually represented as sequences indexed by timestamps, which archive user service access, activities, system events and so on over time. The user’s documents, photos, videos, etc. are stored as files, which are often produced by other applications. There are mainly two ways to obtain the memory: directly logging the raw data or indirectly inferring knowledge from raw data.
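The three formats above can be sketched with plain data structures. The following is a minimal illustration only; the names `AgentMemory`, `log_event`, and `events_between` are hypothetical and do not come from any cited system.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class AgentMemory:
    """Hypothetical container for the three memory formats described above."""
    profile: dict = field(default_factory=dict)      # key-value user profile
    timestamps: list = field(default_factory=list)   # sorted event times
    events: list = field(default_factory=list)       # events[i] happened at timestamps[i]
    files: list = field(default_factory=list)        # references to user files

    def log_event(self, timestamp: float, description: str) -> None:
        # Insert while keeping the history ordered by timestamp.
        pos = bisect_right(self.timestamps, timestamp)
        self.timestamps.insert(pos, timestamp)
        self.events.insert(pos, description)

    def events_between(self, start: float, end: float) -> list:
        # Time-range retrieval over the timestamp-indexed history.
        lo = bisect_left(self.timestamps, start)
        hi = bisect_right(self.timestamps, end)
        return self.events[lo:hi]


memory = AgentMemory()
memory.profile["birthdate"] = "1990-01-01"
memory.log_event(30.0, "opened music app")
memory.log_event(10.0, "alarm dismissed")
memory.log_event(20.0, "started commute")
memory.files.append(Path("photos/trip.jpg"))
recent = memory.events_between(15.0, 30.0)
```

Keeping the history sorted makes time-range queries a pair of binary searches, while the profile remains a direct key lookup.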

Logging. The most straightforward way to obtain memory is through logging, such as recording user input, system events, and sensed contexts. Logging data is often relatively simple. Life logging is a commonly-discussed topic that focuses on tracking and recording user data created through the activities and behaviors of users, contributing to a comprehensive understanding of individuals’ lifestyles and preferences [176, 177]. Data recorded at specific moments using video cameras provide a deeper overview of daily activities [178]. Moreover, recording data over long periods of time can provide valuable insights into behavior patterns, which will support the personalization of intelligent agents [179].

Inferring. Another way for Personal LLM Agents to obtain memory is to extract knowledge from the raw data. With the advancements in machine learning and data analytics, it has become possible to infer user behavior, patterns, and interactions to gain insights into their psychology, preferences, and other high-level information. For example, user personality can be extracted from texts [180, 181], emotions can be read from image and text data [182, 183], preferences can be modeled from historical interaction information [184], and knowledge graphs can be extracted from smartphone push notifications [185]. This extracted high-level information will also be stored as memories of the agent and utilized in services.

4.3.2 Managing and Utilizing Memory

After obtaining the memory, the next question is how to manage and utilize the memory to provide better services in Personal LLM Agents. Based on the purposes of utilizing memory, we divide the relevant techniques into the following three parts: raw data management, memory-augmented LLM inference, and agent self-evolution.

Raw Data Management and Processing. A basic ability of Personal LLM Agents is to access and process the raw memory data (e.g., selecting, filtering, transforming to other formats, etc.), in order to facilitate other advanced functions. This line of work primarily focuses on enabling more natural and human-comprehensible access, manipulation, and modification of data. Since the input-output and reasoning processes of LLMs are based on natural language, such interfaces are more easily integrated with other capabilities of large models. In this research area, numerous endeavors have explored the use of machine learning models or template-based methods to map user data requests to database SQL statements [186, 187]. There are also framework-level works examining how to unify and simplify data interfaces. For instance, PrivacyStreams [188] unifies all personal data access and processing interfaces into a stream-based framework, which is more conducive for large language models to comprehend and manage.
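As a rough illustration of the template-based direction mentioned above, the sketch below maps a natural-language request to a parameterized SQL statement. The templates, the `calls` table, and the function `request_to_sql` are all hypothetical; they are not taken from [186, 187].

```python
import re
import sqlite3

# Hypothetical templates: each maps a request pattern to a parameterized query.
TEMPLATES = [
    (re.compile(r"calls from (?P<name>\w+)", re.I),
     "SELECT timestamp, duration FROM calls WHERE contact = ? ORDER BY timestamp"),
    (re.compile(r"how many calls", re.I),
     "SELECT COUNT(*) FROM calls"),
]


def request_to_sql(request):
    """Return (sql, params) for the first matching template, or None."""
    for pattern, sql in TEMPLATES:
        match = pattern.search(request)
        if match:
            return sql, tuple(match.groupdict().values())
    return None


# Toy in-memory call log standing in for the agent's raw memory data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (contact TEXT, timestamp INTEGER, duration INTEGER)")
conn.executemany("INSERT INTO calls VALUES (?, ?, ?)",
                 [("Alice", 1, 60), ("Bob", 2, 30), ("Alice", 3, 45)])

sql, params = request_to_sql("Show me calls from Alice")
rows = conn.execute(sql, params).fetchall()
```

Passing captured values as query parameters rather than splicing them into the SQL string keeps user text from altering the query structure.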

Memory-augmented LLM Inference. To enable the Personal LLM Agents to provide customized services based on the user-related memory, it is usually desired to make use of the memory data in the LLM inference process. Recent research in LLM agents has explored leveraging memory to enhance decision-making and reasoning [83, 189, 190, 191, 192], which provides inspiration for a solution where Personal LLM Agents can offer personalized services to users through memories. The techniques can be different based on the types of the memory.

Agent Self-evolution. To better accommodate users, Personal LLM Agents may also need to dynamically update themselves based on the memory data. We refer to this as “self-evolution”. The foundational functionality of intelligent agents is predominantly reliant on LLMs. Therefore, the key to the self-evolution of intelligent agents lies in how to leverage LLMs for the discovery and exploration of new skills, as well as in the continuous update of the LLM itself.

Remark. The ability to generate and leverage the memory about the user is the basis of personalization in Personal LLM Agents. We highlight the following three open problems surrounding the memory mechanism of Personal LLM Agents. 1. The agent memory can potentially be huge, heterogeneous and dynamic. What is the most effective and efficient way for the agents to organize and retrieve the memory? 2. Humans have the ability to forget. Since inappropriate data in the memory can be harmful for the agents’ service quality and efficiency, how can the agents determine what information to memorize? 3. What is the best way for the agents to self-evolve with the memory? Specifically, what data to use, when to evolve, and how (fine-tuning or otherwise)? How can the personalized models accept updates of the base foundation model?

5 Efficiency

Figure 9: The mapping relations between the low-level processes and high-level capabilities of Personal LLM Agents.

Due to the limited hardware resources and power supply on many personal devices, it is important to improve the efficiency of Personal LLM Agents in the deployment stage. We’ve discussed in Section 4 the fundamental capabilities of Personal LLM Agents, including task execution, context sensing, and memorizing. These capabilities, as shown in Figure 9, are backed by more elementary processes, mainly including the inference, customization and memory retrieval of the LLM agent. Each of these processes requires careful optimization of efficiency, as described below.

Inference of LLMs is the basis of an agent’s various capabilities. For example, the agent may first decompose a complex task into several steps with the help of the LLM, then solve each step through either LLM inference or invoking personal tools (e.g., schedule a meeting). Sensing the context or generating the memory may also rely on the reasoning abilities of LLMs. While the cost of using the tools or sensors is usually hard to estimate due to the diversity, LLM inference is a common procedure that demands a lot of both computation and memory resources. Therefore, the LLM inference becomes the performance bottleneck for the Personal LLM Agents, requiring careful optimizations on its efficiency.

Customization is another important process of Personal LLM Agents for accommodating different user requirements. Customization is needed when the agents are deployed for different users or used in different scenarios. The self-evolution of Personal LLM Agents is also a process of customization. To offer customized services, an agent can either feed the LLM with different context tokens or tune the LLM with domain-specific data. Due to the frequent need for customization, these processes may impose considerable pressure on the system’s computational and storage resources.

Memory manipulation is another costly process. To provide better services, the agents may require access to longer contexts or external memories, such as environment perceptions, user profiles, interaction histories, data files, etc. Consequently, this gives rise to two considerations. The first is that LLMs need to handle longer inputs. The second centers around the management and acquisition of information from an external memory bank.

Efficiency
  Efficient Inference (§5.1)
    Model Compression (§5.1.1)
      Quantization
        Weight-only-Quant: GPTQ [216], AWQ [217], LLM-QAT [218], etc.
        Co-Quant: ZeroQuant [219], SmoothQuant [220], etc.
      Pruning
        LLM-Pruner [221], SparseGPT [222], Wanda [223], etc.
      Knowledge Distillation
        White-box: BabyLlama [224], MiniLLM [225], etc.
        Black-box: Hsieh et al. [226], SCoTD [227], etc.
      Low-rank Factorization
        ZeroQuant-V2 [228], LoSparse [229], etc.
    Inference Acceleration (§5.1.2)
      Context Compression
        Quantization: ZeroQuant [219], SmoothQuant [220], etc.
        Pruning: Li et al. [230], Jiang et al. [231], Chevalier et al. [232], Anagnostidis et al. [233], Zhang et al. [234], etc.
      Kernel Optimization
        FlashAttention [235, 236], FlashDecoding++ [237], etc.
      Speculative Decoding
        Chen et al. [238], Leviathan et al. [239], etc.
    Memory Reduction (§5.1.3)
      KV Quantization
        ZeroQuant [219], SmoothQuant [220], etc.
      KV Pruning
        Anagnostidis et al. [233], Zhang et al. [234], etc.
      Offloading
        FlexGen [240], etc.
    Energy Optimization (§5.1.4)
      Software Approaches
        Same as above
      Hardware Approaches
        NPU [241], TPU [242], FPGA [243], etc.
  Efficient Customization (§5.2)
    Fine-tuning Efficiency (§5.2.2)
      Parameter-efficient Fine-tuning
        Houlsby et al. [244], LLM-Adapters [245], LoRA [246], etc.
      Efficient Optimizer Design
        LOMO [247], Sophia [248], etc.
      Training Data Curation
        phi-1 [249], phi-1.5 [250], phi-2 [251], etc.
    Context Loading Efficiency (§5.2.1)
      Loading Acceleration
        CacheGen [252], etc.
  Efficient Memory Manipulation (§5.3)
    Searching Efficiency (§5.3.1)
      Search Mechanism Design
        Search Plan: rule-based [253, 254], cost-based [255, 256], etc.
        Metadata Filtering: AnalyticDB-V [255], HQANN [257], etc.
      Search Process Execution
        OPENMP: Faiss [258], Milvus [256], etc.
        SIMD: Faiss [258], Milvus [256], Quicker ADC [259], etc.
        GPU: Faiss [258], Milvus [256], etc.
        Distributed: Vald [260], Qdrant [253], etc.
    Index Optimization (§5.3.2)
      Indexing Algorithms
        Randomization Partition: E2LSH [261], RPTree [262, 263], etc.
        Learned Partition: SPANN [264], etc.
        Navigable Partition: NSW [265], HNSW [266], etc.
      Hardware-aware Indexing
        DRAM-SSD: DiskANN [267], DiskANN++ [268], etc.
        Co-design: CXL-ANNS [269], FANNS [270], etc.

Figure 10: Overview of techniques to improve the efficiency of LLM agents. The leaf nodes are part of representative works we have cited.

We’ll dive into the efficiency of each component in the following subsections, as is shown in Figure 10.

5.1 Efficient Inference

Since the runtime cost of Personal LLM Agents is dominated by LLM inference, it is important to improve the inference efficiency to enhance the overall efficiency of the agent. Although the total inference cost can be significantly influenced by the design of agents, including how the agents send requests to LLMs, what prompts to use, etc., we focus on model-level and system-level approaches only. The reason is that the designs of agents may vary based on the actual applications and don’t directly contribute to the efficiency of LLM inference itself.

Many model and system-level approaches have been proposed to improve the efficiency of LLM inference. While some of them are generic for the overall performance and efficiency (e.g., model compression), there are also techniques targeting the efficiency of specific perspectives, such as model size, inference latency, memory consumption, energy consumption, etc. We will discuss these aspects separately in the following parts of this subsection.

5.1.1 Model Compression

Model compression techniques, which directly reduce the model size and computations, are generic optimizations to enhance the inference efficiency of LLMs, including computation, memory, energy, etc. The model compression techniques are further categorized into various approaches, including quantization, pruning (sparsity), distillation and low-rank factorization.

Quantization is one of the most important compression approaches for LLMs. It reduces the model size by using fewer bits to represent the model parameters, and also reduces computations with system-level support for quantized kernels. Quantization methods can be further divided into post-training quantization (PTQ) and quantization-aware training (QAT), based on whether additional training is required after quantization. Unlike QAT (e.g., LLM-QAT [218]) which requires non-negligible additional training effort, PTQ is more accessible and flexible for on-device deployment under different hardware constraints.

Recent works have revealed that the difficulty of LLM quantization mainly lies in activations, where the outliers are hard to quantize [271, 272]. Existing works have proposed various approaches to tackle this challenge. A typical line of work adopts the weight-only quantization (WOQ) paradigm, which conducts integer quantization (e.g., INT4 and INT8) on weights only, while preserving activations in float formats (e.g., FP16 and FP32). WOQ achieves a trade-off between the compression ratio and model perplexity. A straightforward way of WOQ is the group-wise uniform quantization implemented in current mobile deployment frameworks (e.g., llama.cpp [273] and MLC-LLM [274]). Recent works also proposed different quantization algorithms to enhance model capability, such as GPTQ [216] and AWQ [217].
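To make group-wise uniform quantization concrete, the following minimal sketch quantizes a flat list of weights in fixed-size groups, each with its own scale and offset. It is an illustrative asymmetric scheme under assumed parameters (`group_size=4`, `bits=4`), not the exact format used by llama.cpp or MLC-LLM.

```python
def quantize_group(weights, bits=4):
    """Asymmetric uniform quantization of one weight group to unsigned integer codes."""
    levels = (1 << bits) - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0  # fall back to 1.0 for constant groups
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, w_min):
    return [c * scale + w_min for c in codes]


def quantize_weights(weights, group_size=4, bits=4):
    """Quantize group by group; every group keeps its own (scale, w_min)."""
    return [quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]


# The second group contains large-magnitude values; the first does not.
weights = [0.10, -0.20, 0.05, 0.30, 8.00, -7.50, 0.40, 0.00]
groups = quantize_weights(weights, group_size=4, bits=4)
restored = [w for g in groups for w in dequantize_group(*g)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Because each group has its own scale, the large values in the second group do not coarsen the quantization grid of the first group, which is the main motivation for grouping.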

Besides the WOQ techniques, another line of work quantizes both weights and activations. For example, ZeroQuant [219] performs INT8 quantization for both weights and activations, using group-wise quantization for model weights and token-wise quantization for activations. However, the activations, including key-value (KV) pairs, are usually more difficult to quantize compared to model weights because of outliers. There have been extensive works to tackle this challenge. SmoothQuant [220] migrates the quantization difficulty of activations to weights through additional scaling operations that “smooth” the outliers in activations, thereby achieving negligible accuracy degradation in W8A8 quantization. Subsequent works further attempt to lower the usable quantization bitwidth down to 4-bit through various techniques including channel re-ordering (RPTQ [275]), channel-wise shifting and scaling (Outlier Suppression+ [276]), and adaptive channel reassembling (QLLM [277]). Notably, RPTQ addresses the KV storage issue by developing a new quantization scheme that focuses solely on KV cache when quantizing activations, which is the major memory consumer in long-context inference.
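The idea of migrating quantization difficulty from activations to weights can be illustrated with a per-channel rescaling that leaves the layer output unchanged. The sketch below is a simplified reading of the smoothing step, assuming a per-channel factor of the form max|X_j|^α / max|W_j|^(1-α) with α = 0.5 and nonzero channel maxima; the helper names are hypothetical and the real method involves calibration and fused kernels not shown here.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smooth(x, w, alpha=0.5):
    """Divide activation channel j by s_j and multiply weight row j by s_j."""
    n_channels = len(w)
    act_max = [max(abs(row[j]) for row in x) for j in range(n_channels)]
    w_max = [max(abs(v) for v in w[j]) for j in range(n_channels)]
    s = [act_max[j] ** alpha / w_max[j] ** (1 - alpha) for j in range(n_channels)]
    x_s = [[row[j] / s[j] for j in range(n_channels)] for row in x]
    w_s = [[v * s[j] for v in w[j]] for j in range(n_channels)]
    return x_s, w_s


# Activations (tokens x channels): channel 1 carries outliers.
x = [[0.5, 40.0], [-0.3, -35.0]]
# Weights (channels x outputs).
w = [[0.2, -0.1], [0.05, 0.1]]

x_s, w_s = smooth(x, w)
y_ref, y_smooth = matmul(x, w), matmul(x_s, w_s)  # mathematically identical outputs
```

In this toy case the outlier channel shrinks from a maximum magnitude of 40 to 2, while the matching weight row grows from 0.1 to 2, so both tensors end up with a comparable dynamic range before quantization.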

Pruning reduces the model size and computations by removing less important connections in the network. Pruning is categorized into structured pruning and unstructured pruning. Structured pruning usually removes weights in regular patterns, such as a rectangular block in the matrix or an entire channel, while unstructured pruning doesn’t impose such constraints. Consequently, structured pruning (e.g., LLM-Pruner [221]) is more hardware-friendly, but it makes model accuracy harder to maintain. While traditional pruning approaches require a costly retraining process to preserve model capability, recent works like SparseGPT [222] and Wanda [223] have explored performing unstructured pruning in one shot.
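The difference between the two pruning styles can be seen in a small sketch. It uses plain magnitude-based criteria, which is a deliberately simple stand-in and not the importance metric of LLM-Pruner, SparseGPT, or Wanda; the function names are hypothetical.

```python
def prune_unstructured(matrix, sparsity):
    """Zero the smallest-magnitude weights anywhere in the matrix (shape is kept)."""
    flat = sorted(abs(v) for row in matrix for v in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in matrix]
    threshold = flat[k - 1]  # ties at the threshold are pruned as well
    return [[0.0 if abs(v) <= threshold else v for v in row] for row in matrix]


def prune_structured(matrix, n_rows):
    """Remove the n_rows rows (e.g., output channels) with the smallest L1 norm."""
    order = sorted(range(len(matrix)), key=lambda i: sum(abs(v) for v in matrix[i]))
    drop = set(order[:n_rows])
    return [row[:] for i, row in enumerate(matrix) if i not in drop]


w = [[0.9, -0.05, 0.4],
     [0.02, 0.03, -0.01],
     [-0.7, 0.6, 0.08]]
sparse = prune_unstructured(w, sparsity=0.5)  # same shape, scattered zeros
smaller = prune_structured(w, n_rows=1)       # a genuinely smaller dense matrix
```

The unstructured result needs sparse kernels to turn zeros into speedups, whereas the structured result is a smaller dense matrix that runs on ordinary hardware, which matches the trade-off described above.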

Knowledge Distillation (KD) involves using a well-performing teacher model (usually with a large number of parameters and high precision) to guide the training of a lightweight student model (usually with fewer parameters and lower precision). Through distillation, the student model is well aligned to the teacher model with a relatively small training dataset, and has the chance to perform even better on downstream tasks [226]. Based on whether the teacher model’s parameters are required in the training process, distillation methods can be further categorized into white-box (e.g., BabyLlama [224] and MiniLLM [225]) and black-box ones (e.g., Distilling Step-by-Step [226] and SCoTD [227]). Since the student models are often lightweight quantized or pruned models, KD is also adopted in QAT and pruning techniques to enhance the training performance. For example, LLM-QAT [218] proposes a data-free distillation method to preserve the original output distribution in the quantized model.
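A common way to let the teacher guide the student is to penalize the divergence between their output distributions. The sketch below shows a generic temperature-softened KL objective, assuming access to the teacher’s logits; it is not the specific loss of any cited method, and the logits are made-up numbers.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]   # ranks the classes like the teacher
far_student = [0.5, 1.0, 4.0]     # prefers a different class

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
```

The loss is zero only when the two softened distributions match, so minimizing it pulls the student toward the teacher’s full output distribution rather than just its top prediction.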

Low-rank Factorization refers to approximating the original weight matrix by the product of two low-rank matrices, thereby reducing the model’s parameter size and computational load. Specifically, a weight matrix $W$ of shape $m\times n$ is factorized into the product of $U^{m\times r}$ and $V^{n\times r}$, such that $W\approx UV^{T}$ and $r\ll m,n$. Low-rank Factorization can be combined with quantization (e.g., ZeroQuant-V2 [228]) and pruning (e.g., LoSparse [229]) methods to enhance the compression ratio. Besides, low-rank adapters effectively reduce the customization overhead of LLMs, which we leave to Section 5.2.
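The saving from the factorization follows directly from counting parameters. The worked example below uses illustrative sizes ($m = n = 4096$, $r = 64$) that are assumptions chosen for the arithmetic, not values from the cited works.

```latex
\begin{aligned}
\text{parameters of } W &: \quad mn \\
\text{parameters of } U \text{ and } V &: \quad mr + nr = r(m+n) \\
\text{compression ratio} &: \quad \frac{r(m+n)}{mn} \\[4pt]
m = n = 4096,\ r = 64 &: \quad \frac{64 \cdot 8192}{4096 \cdot 4096}
  = \frac{524{,}288}{16{,}777{,}216} = \frac{1}{32} \approx 3.1\%
\end{aligned}
```

The same ratio applies to the multiply-accumulate count of a matrix-vector product, since multiplying by $V^{T}$ and then by $U$ costs $r(m+n)$ operations instead of $mn$.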

5.1.2 Inference Acceleration

Besides making the models more compact as discussed in Section 5.1.1, there are various other techniques to accelerate the LLM inference process.

A major characteristic that sets the LLM apart from the traditional non-Transformer models is the attention mechanism [31]. Since the computational cost of attention increases nearly quadratically with the context length, it is particularly important to enhance the computational efficiency of long-context inference. Existing works have explored reducing the context length and optimizing attention kernels to better support long-context inference. We’ll dive into these techniques separately.

KV Cache is a widely adopted technique in both mobile (e.g., llama.cpp [273] and MLC-LLM [274]) and cloud LLM serving frameworks (e.g., DeepSpeed [278] and vLLM [279]), to avoid redundant computation in LLM inference. Specifically, KV Cache involves storing (i.e., “caching”) and incrementally updating the Key-Value (KV) pairs, which are intermediate results in the attention calculation, at each token generation step. Therefore, the repeated part in the KV computation is avoided to reduce the computational cost. However, in long-context inference, the computational cost of attention is still a system bottleneck despite the skipped KV calculations, making it crucial to compress the context length in such scenarios.
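The redundancy that the KV cache removes can be shown with a toy single-head decoder. In this sketch, `project_kv` is a hypothetical stand-in for the key and value projections (real models use learned matrices), and the token vector doubles as the query for brevity.

```python
import math

calls = {"n": 0}  # counts how many key/value projections are computed


def project_kv(x):
    calls["n"] += 1
    key = [2.0 * v for v in x]      # stand-in for x @ W_k
    value = [v + 1.0 for v in x]    # stand-in for x @ W_v
    return key, value


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


def decode_without_cache(tokens):
    outputs = []
    for t in range(1, len(tokens) + 1):
        kv = [project_kv(x) for x in tokens[:t]]  # recomputed at every step
        keys, values = [k for k, _ in kv], [v for _, v in kv]
        outputs.append(attend(tokens[t - 1], keys, values))
    return outputs


def decode_with_cache(tokens):
    keys, values, outputs = [], [], []
    for x in tokens:
        key, value = project_kv(x)  # only the new token is projected
        keys.append(key)
        values.append(value)
        outputs.append(attend(x, keys, values))
    return outputs


tokens = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.5, 0.5]]
calls["n"] = 0
slow = decode_without_cache(tokens)
slow_calls = calls["n"]   # 1 + 2 + 3 + 4 projections
calls["n"] = 0
fast = decode_with_cache(tokens)
fast_calls = calls["n"]   # one projection per token
```

Both variants produce the same outputs, but the cached version still runs `attend` over all stored keys at every step, which is why attention remains a bottleneck for long contexts.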

Context Compression methods enhance the inference efficiency by reducing the length of the context, especially the KV cache. Co-quantization of weights and activations, including KV cache, is an intuitive approach to compress the KV cache, which has been discussed in Section 5.1.1. Besides quantization, context pruning removes less important tokens in the context to reduce the computational cost. The effectiveness of this method is based on the observation that tokens have different impacts on the final output, and removing less important tokens won’t cause significant degradation of the model’s capability [233, 280, 234]. A typical line of work is to compress the context at the prefill stage based on the different importance of tokens [230, 231, 232]. However, these methods are one-shot and cannot prune the KV cache when the context length continuously grows during token generation. To address this, Dynamic Context Pruning [233] uses a learnable mechanism to continuously determine and drop uninformative tokens. While the learnable mechanism introduces a fine-tuning overhead, Zhang et al. [234] propose a token eviction strategy that can be applied without fine-tuning.
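A training-free eviction policy can be sketched as a simple selection rule over the cached tokens. The heuristic below, which keeps the most recent tokens plus the older tokens with the highest accumulated attention, is a simplified illustration and not the exact algorithm of [233] or [234]; the function name and scores are hypothetical.

```python
def evict(cache_scores, budget, n_recent):
    """Return the sorted indices of cached tokens to keep.

    cache_scores[i] is the attention mass that token i has accumulated so far.
    The last n_recent tokens are always kept (assumes n_recent <= budget);
    the remaining slots go to the highest-scoring older tokens.
    """
    n = len(cache_scores)
    if n <= budget:
        return list(range(n))
    recent = list(range(n - n_recent, n))
    older = sorted(range(n - n_recent), key=lambda i: cache_scores[i], reverse=True)
    return sorted(older[:budget - n_recent] + recent)


scores = [0.90, 0.05, 0.40, 0.02, 0.10, 0.30]
kept = evict(scores, budget=4, n_recent=2)
```

Applying such a rule after every generated token bounds the cache at `budget` entries, in contrast to one-shot pruning at the prefill stage.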

Inspired by the same observation that tokens are not equally important, other works have also explored reducing the computation spent on less important tokens instead of directly removing them. CoLT5 [281] employs a conditional computation mechanism, which devotes more resources to important tokens in both FFN and attention. SkipDecode [282] designs a token-level early exit method that works seamlessly with batched inference and KV cache, to skip some operators in the computational graph when a token is less important.

Kernel Optimization is another approach towards LLM inference acceleration. Optimization for small-batch or single-batch inference is especially important for edge scenarios including the locally-deployed Personal LLM Agents. Existing works have revealed that the attention calculation becomes a bottleneck when the sequence length is long, since the complexity of attention scales quadratically with the sequence length, while that of the FFN scales linearly. Therefore, efficient attention kernels including FlashAttention [235, 236] and FlashDecoding++ [237] have been proposed to improve the speed of long-text inference. Besides, some works reduce the computational complexity of attention from the algorithm aspect. For example, Linformer [283] achieves linear complexity for self-attention in the prefill phase.
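The scaling argument above can be made concrete with a rough per-layer operation count. The estimate assumes a sequence of $n$ tokens processed at once, model width $d$, an FFN with hidden size $4d$ and two weight matrices, and one multiply-add counted as two operations; these are common simplifying assumptions rather than figures from the cited works.

```latex
\begin{aligned}
\text{attention scores } QK^{T} \text{ and weighted sum } AV &: \quad 2n^{2}d + 2n^{2}d = 4n^{2}d \\
\text{Q, K, V and output projections} &: \quad 4 \cdot 2nd^{2} = 8nd^{2} \\
\text{FFN (hidden size } 4d\text{)} &: \quad 2 \cdot (2 \cdot n \cdot d \cdot 4d) = 16nd^{2} \\[4pt]
4n^{2}d > 8nd^{2} + 16nd^{2} &\iff n > 6d
\end{aligned}
```

Under these assumptions, the quadratic attention term overtakes the linear terms once the sequence length exceeds roughly six times the model width, which is where optimized attention kernels matter most.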

Speculative Decoding [239, 238] is an effective approach to improve latency in small-batch inference. The batch size of LLM inference at the edge is smaller than on the cloud, and is usually 1 (i.e., single query), which makes the inference workload extremely memory-bound. Speculative decoding mitigates this challenge by “guessing” several subsequent tokens through a lightweight “draft model”, and then validating the draft tokens in batches using the large “oracle model”. Miao et al. [284] and Spector and Re [285] further enhance speculative decoding with tree-based verification instead of sequential verification, to reuse intermediate results shared across these sequences. While these methods ensure zero bias in the generated results, BiLD [286] proposes to only fall back or roll back to the oracle model occasionally, when the draft model is not capable of generating high-quality content.
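The draft-then-verify loop can be sketched in its greedy form with toy next-token functions. This is a minimal illustration under stated assumptions: `draft_next` and `oracle_next` are hypothetical deterministic stand-ins for real models, verification is emulated position by position instead of in one batched forward pass, and the sampling-based acceptance rule of [239, 238] is not shown.

```python
def speculative_decode(draft_next, oracle_next, prompt, n_new, k=4):
    """Greedy speculative decoding; output equals greedy decoding with the oracle."""
    tokens = list(prompt)
    oracle_passes = 0
    target_len = len(prompt) + n_new
    while len(tokens) < target_len:
        # 1) Draft k tokens with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) Verify all drafted positions; a real system does this in a single
        #    batched oracle pass, which is what oracle_passes counts.
        oracle_passes += 1
        for i in range(k):
            expected = oracle_next(tokens + draft[:i])
            if draft[i] != expected:
                # Keep the matching prefix and the oracle's correction.
                tokens.extend(draft[:i] + [expected])
                break
        else:
            tokens.extend(draft)  # every drafted token was accepted
    return tokens[:target_len], oracle_passes


def oracle_next(tokens):
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    # Agrees with the oracle except after token 4, where it guesses wrong.
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


tokens_out, passes = speculative_decode(draft_next, oracle_next, prompt=[0], n_new=8)
```

In this toy run, eight new tokens are produced with three oracle passes instead of eight sequential oracle steps, and the one wrong guess is replaced by the oracle’s token, so the output is unchanged.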

5.1.3 Memory Reduction

LLM inference is not only computationally-intensive, but also memory-consuming, which causes challenges in the deployment of Personal LLM Agents. Therefore, it is necessary to perform optimizations on the memory efficiency of LLM inference. KV cache and model weights are two major causes of this memory overhead. In a short-context scenario where the KV storage requires much less memory than the model weights, the model compression techniques in Section 5.1.1 are very effective to reduce the memory requirement to store the weights. However, in the long-context scenario, the KV cache, whose size grows linearly with the context length, will dominate the total memory consumption.
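The linear growth of the KV cache can be estimated with simple arithmetic. The configuration below (32 layers, 32 KV heads, head dimension 128, FP16 values) is a hypothetical 7B-class setting chosen for illustration, and the formula ignores implementation overheads such as padding or paging.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2, batch=1):
    # The factor 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch


cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128)  # assumed configuration

per_token = kv_cache_bytes(seq_len=1, **cfg)     # 524288 bytes = 0.5 MiB per token
at_4k = kv_cache_bytes(seq_len=4096, **cfg)      # 2 GiB
at_32k = kv_cache_bytes(seq_len=32768, **cfg)    # 16 GiB

# For comparison, 7 billion FP16 parameters occupy about 14 GB (7e9 * 2 bytes).
weight_bytes = 7_000_000_000 * 2
```

Under these assumptions the cache is a fraction of the weight size at 4K tokens but exceeds it at 32K tokens, consistent with the short-context and long-context regimes described above.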

An effective approach to address this issue is to compress the KV cache using quantization and pruning techniques mentioned in Section 5.1.1 and Section 5.1.2. While the quantization methods are generic to reduce the memory footprint of KV cache, not all the pruning-based methods directly contribute to the memory efficiency. Only those methods that prune the corresponding rows/columns in the KV cache when continuously removing input tokens in the context can prevent the KV cache size from exceeding the memory limit. For example, Anagnostidis et al. [233] and Zhang et al. [234] proposed to identify and evict uninformative tokens during generation. However, the one-shot approaches that only prune the context at the prefill stage are less effective in generative scenarios.

Although the compression-based methods are demonstrated to be able to effectively reduce the memory requirement of LLM inference, the accuracy degradation caused by compression is not negligible in some cases. To address this, FlexGen [240] designs an offloading strategy to fully utilize GPU, CPU and disk, together with a zig-zag scheduling scheme to support high-throughput inference under constrained GPU memory. This approach is orthogonal to compression-based methods, and thus can be jointly used to further reduce GPU memory footprints.

5.1.4 Energy Optimization

Energy consumption is a critical factor that affects the real-world deployment of an intelligent personal agent. An energy-consuming agent not only increases the deployment cost and carbon footprint, but also hurts the quality of experience (QoE) due to increased temperature and potential thermal throttling. Since the inference of the LLM involves costly computations and memory accesses, it is important to optimize the energy efficiency of LLM inference.

Since computation and memory access (mainly weights loading) are two major causes of the large energy consumption, there have been extensive works to optimize these two aspects, from both software and hardware perspectives. We have introduced various types of software optimizations in previous sections. For example, model compression methods save energy by reducing the model size and computations; KV cache saves energy by avoiding redundant computations; efficient attention kernels also improve energy efficiency through memory reuse and locality optimizations.

Besides software optimizations, utilizing energy-efficient hardware provides new opportunities to improve the agent system’s efficiency. While CPUs and GPUs remain mainstream options to run LLM inference on edge devices, they are designed to support general-purpose tasks and don’t have dedicated optimization for transformer-based models, especially the generative LLMs. Researchers have explored utilizing efficient processors that are better suited to LLM inference workloads, including NPUs [241] and TPUs [242]. However, the limited operator and model support remains a challenge in real-world deployment. Besides, existing works also designed FPGA-based solutions to boost LLM inference with higher memory bandwidth and energy efficiency ratio (EER) [243, 287].

Remark. How to improve the efficiency of LLM inference has been extensively studied recently. Despite the remarkable progress, there is still a large gap towards the ubiquitous and affordable deployment of Personal LLM Agents. The open problems are: 1. Is it possible to further compress or design highly compact models without accuracy degradation, surpassing the scaling law of language models? 2. If the scaling law is unbreakable, how can we achieve optimal tradeoffs between efficiency and quality via dynamic inference (e.g., dynamic collaboration of big model and small model)? 3. How would the hardware and operating systems evolve to accommodate the efficient deployment of LLMs and Personal LLM Agents?

5.2 Efficient Customization

Personal LLM Agents may need to serve different users, tasks, and scenarios with the same base LLM, which requires efficient customization for each situation. There are mainly two ways to customize the behaviors of LLMs: one is feeding the LLM with different contextual prompts for in-context learning, and the other is tuning the LLM with domain-specific data. Therefore, the efficiency of customization is primarily determined by the context loading efficiency and the LLM fine-tuning efficiency.

5.2.1 Context Loading Efficiency

Frequent context loading is inevitable during the multi-task serving of Personal LLM Agents, where each task or scenario may require a new context for LLM inference. However, the stringent resource constraints inherent to personal devices make it challenging for Personal LLM Agents to process cumbersome context information quickly and efficiently. There are various ways to make the context loading process more efficient. A straightforward way is to prune redundant tokens or shorten the context length, which has been discussed in Section 5.1.

Another way to speed up context loading is to reduce the bandwidth consumed when transmitting context data. In some cases, pruning or discarding tokens inevitably hurts the LLM’s performance, while loading the full KV cache incurs a high bandwidth cost. CacheGen [252] addresses this challenge with a novel KV encoder design that leverages the distinct characteristics of KV features across both tokens and layers. The encoder compresses the KV cache into a compact bitstream, curtailing bandwidth demands while also reducing processing latency.
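
To make the idea of turning a KV cache into a compact bitstream concrete, the following is a minimal sketch and not CacheGen's actual encoder. It assumes that adjacent tokens have similar feature values, stores quantized differences between consecutive tokens, and uses `zlib` as a stand-in for a real entropy coder. The names `encode_kv` and `decode_kv` and the quantization step `step` are hypothetical.

```python
import zlib


def encode_kv(rows, step=0.05):
    """Compress per-token feature vectors (a list of lists of floats).

    Assumption: values change slowly from one token to the next, so the
    quantized deltas are small integers that compress well.
    """
    prev = [0.0] * len(rows[0])
    symbols = []
    for row in rows:
        for j, value in enumerate(row):
            q = round((value - prev[j]) / step)   # quantized delta
            q = max(-127, min(127, q))            # clip to one byte
            symbols.append(q + 128)               # shift into 1..255
            prev[j] = prev[j] + q * step          # mirror the decoder state
    return zlib.compress(bytes(symbols), 9)


def decode_kv(blob, n_features, step=0.05):
    """Invert encode_kv up to the quantization error."""
    symbols = zlib.decompress(blob)
    prev = [0.0] * n_features
    rows = []
    for start in range(0, len(symbols), n_features):
        row = []
        for j in range(n_features):
            prev[j] = prev[j] + (symbols[start + j] - 128) * step
            row.append(prev[j])
        rows.append(row)
    return rows
```

Because the encoder tracks the decoder's reconstruction rather than the original values, the quantization error does not accumulate across tokens as long as no delta is clipped.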

5.2.2 Fine-tuning Efficiency

It is also desirable to fine-tune a base LLM to better support domain-specific tasks, which poses a significant challenge in terms of computational resources and memory footprint owing to the vast number of parameters in LLMs. There have been various efforts to tackle these problems, which can be roughly categorized into parameter-efficient fine-tuning techniques, efficient optimizer design, and training data curation, as elaborated below.

Parameter-efficient fine-tuning (PEFT). The huge number of parameters in LLMs makes full-parameter fine-tuning costly. Many parameter-efficient fine-tuning methods have emerged to reduce LLMs’ training overhead. The fundamental concept of PEFT is to freeze the majority of parameters and train only a limited subset of them or a newly introduced adapter with significantly fewer parameters. A common practice is to introduce adapters, i.e., small neural network modules, into the existing network structure, including tuning hidden states [244, 245, 288], adding full layers [244], and prepending prefix vectors to the transformer architecture [289, 290, 291]. Liu et al. [292] also incorporate trainable vectors at the input layer, the performance of which highly depends on the capabilities of the underlying models. Some of these works cannot avoid the extra adapter computation and thus add inference latency. LoRA [246] freezes all the model weights and augments each transformer layer with additional rank decomposition matrices, greatly reducing the memory and storage usage during fine-tuning without any additional inference latency. Another advantage of LoRA is that users can easily switch between different downstream tasks by simply adding or subtracting adapter matrices.
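
The low-rank idea behind LoRA can be illustrated with a small sketch, assuming a single weight matrix W of shape d_out x d_in and an adapter given by B (d_out x r) and A (r x d_in), so that the adapted weight is W + BA. The helper names (`lora_param_counts`, `merge_adapter`, `unmerge_adapter`) and the optional `scale` factor are hypothetical and are not taken from any LoRA library.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full fine-tuning vs. a rank-r adapter."""
    full = d_out * d_in
    adapter = rank * (d_out + d_in)   # B has d_out*r entries, A has r*d_in
    return full, adapter


def merge_adapter(w, b, a, scale=1.0):
    """Fold the adapter into the frozen weight: W' = W + scale * (B @ A)."""
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


def unmerge_adapter(w_merged, b, a, scale=1.0):
    """Subtract the same update to recover the base weight."""
    return merge_adapter(w_merged, b, a, scale=-scale)
```

Under these assumptions, a 4096 x 4096 weight matrix has about 16.8M entries, while a rank-8 adapter for it has 65,536. Merging the adapter before inference avoids an extra matrix product at run time, and subtracting it again restores the base weights so that another task's adapter can be merged instead.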

Efficient Optimizer Design. Efficient optimizer design is another group of training/fine-tuning strategies that aim to accelerate training or reduce its memory overhead. Sophia [248], a lightweight second-order optimizer, addresses the high cost and time required for LLM pre-training by providing a more efficient optimization process than commonly used methods such as Adam and its variants. On the other hand, the huge number of parameters necessitates storing more activations and optimizer states, especially at larger batch sizes, which places substantial demands on memory. LOMO [247] targets this memory overhead; its authors present a detailed analysis of the memory profile, throughput, and downstream performance of the proposed optimizer compared to other methods, demonstrating significant reductions in memory usage while maintaining training efficiency.
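
A rough back-of-the-envelope calculation shows why optimizer states matter for memory. The sketch below assumes that weights, gradients, and optimizer states are all stored in 32-bit floats and ignores activations entirely; the function name `training_state_bytes` and the parameter `optimizer_slots` (the number of extra values kept per parameter, e.g. 2 for Adam's first and second moments) are hypothetical.

```python
def training_state_bytes(n_params, bytes_per_value=4, optimizer_slots=2):
    """Estimate memory for weights, gradients, and optimizer states.

    Assumptions: one dtype for everything (fp32 by default) and no
    activation memory, so real training needs more than this estimate.
    """
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value
    optimizer = n_params * bytes_per_value * optimizer_slots
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }
```

For a 7-billion-parameter model, this estimate gives 112 GB with two optimizer slots and 56 GB with none, which is already far beyond the memory of typical personal devices.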

Training Data Curation. The aforementioned approaches primarily focus on the process of training LLMs, while other studies aim to enhance the LLMs’ training performance from a distinct perspective, i.e., the amount and quality of training data. phi-1 [249] demonstrates that training LLMs with a small amount of high-quality data can significantly reduce the training cost while achieving capabilities comparable to those obtained with large-scale datasets and models. This challenges the traditional scaling laws in deep learning that emphasize larger datasets and models. Furthermore, phi-1.5 [250] and phi-2 [251] extend the focus to many other kinds of tasks, such as common sense reasoning and language understanding, achieving comparable performance to models 5x and 25x larger, respectively.

Notably, these methods often assume that the LLM can fit entirely within the device memory, which is not a practical assumption for Personal LLM Agents deployed on personal devices, which usually have limited computing power and memory capacity. Fine-tuning LLMs on these devices often requires leveraging hierarchical storage such as CPU memory or even disk storage. Therefore, when fine-tuning LLMs on personal devices, it is important to carefully consider the resource limitations of the current system.

Remark. While efficient model fine-tuning and in-context learning techniques have been extensively studied, it is yet unclear what is the ideal mechanism for customizing Personal LLM Agents under different situations. Here we highlight two open problems that may be specifically important in the system for Personal LLM Agents. 1. Similar to the operating system that manages the RAM for the applications, how should the agent system efficiently manage the contexts for different (and potentially parallel) agents, tasks, and users? 2. Similar to mobile apps that can be efficiently installed, uninstalled and moved between devices, how can a customized (fine-tuned) agent efficiently roll back to the previous versions or transfer to other base models?

5.3 Efficient Memory Manipulation

Personal LLM Agents need to frequently retrieve internal or external memory to make more informed decisions. The internal memory is represented as context tokens and stored as the KV cache during LLM inference. Its retrieval is handled implicitly by the self-attention module in the Transformer architecture. Improving the efficiency of internal memory therefore requires the LLM to compute more efficiently over long contexts and to minimize the memory footprint during inference. These issues are similar to the inference efficiency of LLMs discussed in Section 5.1. Therefore, in this subsection, we mainly focus on the efficiency of manipulating external memory, which can be dynamically retrieved and added into the context.

Considering the diverse forms of external memory data, such as user profiles, interaction history, and local raw files (images, videos, etc.), the common practice is to use embedding models [293, 294] to represent memory data in a uniform, high-dimensional vector format. The distance between vectors reflects the semantic similarity between the corresponding data. For each given query, the agent needs to find the most relevant contents in external memory storage. This procedure, together with the maintenance of vectors, can be covered by vector libraries (like Faiss [295, 296, 297] and ScaNN [199]), vector databases [298, 299, 300], or some customized memory structures [301, 302]. Regardless of the functional differences between these systems, their efficiency optimizations basically target two key aspects: searching and indexing.
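
The retrieval step described above can be sketched as an exhaustive scan over stored embeddings. This is a minimal illustration, not the API of any vector library: the names `cosine_similarity` and `top_k` are hypothetical, the memory is a plain dictionary from item identifiers to vectors, and the embeddings are assumed to have been produced elsewhere by an embedding model.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query, memory, k=2):
    """Return the ids of the k stored vectors most similar to the query.

    Every stored vector is compared with the query, so the cost grows
    with both the number of items and the vector dimension.
    """
    scored = ((cosine_similarity(query, vec), item_id)
              for item_id, vec in memory.items())
    return [item_id for _, item_id in heapq.nlargest(k, scored)]
```

With a toy memory such as `{"calendar": [1, 0, 0], "photo": [0, 1, 0], "email": [0.9, 0.1, 0]}` and the query `[1, 0, 0]`, the two closest items are the calendar and email entries.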

5.3.1 Searching Efficiency

The efficiency analysis and optimization can be viewed from different aspects of vector searching. Some aspects are related to search mechanism designs, such as similarity measurement, searching scope, as well as query types, selection, and optimizations. Some aspects, on the other hand, focus on the execution of the search process.

Search Mechanism Design. Multiple similarity criteria can be employed to evaluate the relevance of vector embeddings to a search query, including Hamming Distance, Cosine Distance, and Aggregate Scores [256]. However, the selection of scoring mechanisms lacks stringent principles and often relies on empirical rules [299]. Regarding the types of searches, both approximate and exact $k$ ($\geq 1$) nearest neighbors [303] search, as well as distance range search, can be utilized to retrieve corresponding vectors. Typically, there exists a trade-off between search accuracy and speed, wherein the use of approximate similarity search or larger $k$ values speeds up the search process but may result in undesirable outcomes. To optimize search latency, rule-based [253, 254] or estimated-cost-based methods [255, 256] are often employed to determine the optimal search plan. These rules and cost models are typically configured offline to avoid unnecessary or time-consuming search actions. To further optimize the search process, hybrid operations that combine vector search with metadata filters are gaining popularity. This involves techniques such as pre-filtering [255, 256, 304], post-filtering, and single-stage filtering [257] to narrow the scope of vector searching. Here, too, the trade-off between search accuracy and speed still exists. For instance, in pre-filtering, metadata analysis is performed before searching, which may lead to overlooking some vectors that do not match the metadata criteria but could be potentially crucial for Personal LLM Agents. Moreover, the utilization of metadata filtering can impede the execution of the search, as it introduces additional computational costs.
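
The difference between pre-filtering and post-filtering can be seen in a small sketch. It assumes that each memory item is a dictionary with hypothetical keys `id`, `vec`, and `meta`, that similarity is a plain dot product, and that the metadata filter is an arbitrary predicate; none of these names come from a particular vector database.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def pre_filter_search(query, items, predicate, k):
    """Apply the metadata filter first, then rank only the survivors."""
    candidates = [item for item in items if predicate(item["meta"])]
    candidates.sort(key=lambda item: dot(query, item["vec"]), reverse=True)
    return [item["id"] for item in candidates[:k]]


def post_filter_search(query, items, predicate, k):
    """Rank everything, keep the top k, then drop non-matching items."""
    ranked = sorted(items, key=lambda item: dot(query, item["vec"]),
                    reverse=True)
    return [item["id"] for item in ranked[:k] if predicate(item["meta"])]


# Toy data: two photos and two notes with hand-made 2-D vectors.
ITEMS = [
    {"id": "a", "vec": [1.0, 0.0], "meta": {"type": "photo"}},
    {"id": "b", "vec": [0.9, 0.1], "meta": {"type": "note"}},
    {"id": "c", "vec": [0.2, 0.8], "meta": {"type": "note"}},
    {"id": "d", "vec": [0.8, 0.2], "meta": {"type": "photo"}},
]


def is_note(meta):
    return meta["type"] == "note"
```

For the query `[1.0, 0.0]` with `k = 2`, pre-filtering returns both notes, whereas post-filtering returns only one result because one of the two nearest vectors is a photo and is discarded afterwards. Pre-filtering, in turn, never considers the photos at all, even though they are the closest vectors overall.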

Search Process Execution. Several hardware acceleration methods can be used to improve the efficiency of search execution. For example, to enable parallel query processing, Faiss [258] uses OpenMP multi-threading, while Milvus [256] further reduces CPU cache misses and uses a novel fine-grained mechanism to best leverage multi-core parallelism. Furthermore, Faiss and Quicker ADC [259] also support SIMD shuffle instructions to parallelize table look-ups within a single SIMD processor. GPUs are also used for fast query processing [305, 306, 307], for example in Faiss and Milvus. Many vector database management systems, such as Vald [260] and Qdrant [253], also support distributed clusters to scale to larger datasets or heavier workloads.

5.3.2 Index Optimization

Indexing is also very important for optimizing the efficiency of external memory management and retrieval. When comparing the similarity between a query vector $q$ and the vectors in external memory, a brute-force approach results in a computational complexity of $O(DN)$. However, this approach becomes impractical for scenarios with large vector dimensions ($D$) and dataset sizes ($N$). To address this problem, vector indexing is commonly employed to expedite query searching by reducing the number of required comparisons.

Typical Indexing Algorithms. This is achieved through partitioning schemes [299] that divide the dataset $S$ into smaller subsets, facilitating selective comparisons and faster search query processing. These partitions are then organized into data structures such as tables, trees, and graphs to enable efficient traversal. Commonly used partitioning methods include randomization (such as RPTree [262, 263] and E2LSH [261]), learned partitioning (such as SPANN [264]), and navigable partitioning (such as NSW [265] and HNSW [266]). These partitioning methods can be utilized in combination with different data structures. For example, Vamana [304] is a monotonic search network that takes the form of a graph index and uses random initialization.
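
As a simple instance of randomized partitioning organized as a table, the sketch below uses random-hyperplane hashing: each vector is assigned a bucket according to which side of several random hyperplanes it falls on, and a query is compared only with the vectors in its own bucket. This is a generic illustration and not the E2LSH or RPTree algorithm; the class name `HyperplaneLSH` and its parameters are hypothetical.

```python
import random


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


class HyperplaneLSH:
    """Partition vectors into buckets by the signs of random projections."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(n_bits)]
        self.buckets = {}

    def _key(self, vec):
        # One bit per hyperplane: which side of the plane the vector is on.
        return tuple(dot(plane, vec) >= 0.0 for plane in self.planes)

    def add(self, item_id, vec):
        self.buckets.setdefault(self._key(vec), []).append((item_id, vec))

    def query(self, vec, k=1):
        # Only the query's own bucket is scanned, so results are approximate:
        # a true neighbor in a different bucket will be missed.
        candidates = self.buckets.get(self._key(vec), [])
        ranked = sorted(candidates, key=lambda it: dot(it[1], vec),
                        reverse=True)
        return [item_id for item_id, _ in ranked[:k]]
```

More hash bits give smaller buckets and fewer comparisons per query, at the price of a higher chance of missing relevant vectors, which is one concrete form of the accuracy and speed trade-off in approximate search.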

Hardware-aware Index Optimization. Since improving the scalability and efficiency of indexing has become a critical concern, research efforts have focused on hardware-aware approaches to extend external memory capacity while maintaining low latency and high throughput. This is achieved through the utilization of disk-based indexes or the co-design of hardware and algorithms [303]. For example, DiskANN [267] addresses cost-effectiveness by employing a hybrid DRAM-SSD approach. It incorporates Vamana graph indexing on SSDs and employs compressed point representation in DRAM. This configuration enables accurate query responses with less than 10ms latency, even when dealing with a billion-point database. DiskANN++ [268] further improves efficiency by introducing dynamic entry vertex selection and optimizing SSD layout. This enhancement results in a 1.5x to 2.2x increase in Queries Per Second (QPS) while maintaining accuracy on real-world datasets. Moreover, CXL-ANNS [269] introduces a collaborative software-hardware approach for scalable approximate nearest neighbor search (ANNS). By utilizing Compute Express Link (CXL), CXL-ANNS disentangles DRAM from the host and consolidates essential datasets into its memory pool. FANNS [270] is a vector search framework on FPGAs, featuring automatic co-design of hardware and algorithms based on user-defined recall requirements and hardware constraints. It supports scale-out with a hardware TCP/IP stack and exhibits notable speedups compared to FPGA and CPU baselines.

Remark. Managing memory data with external vector storage is not a new requirement for LLM agents. While many basic technical challenges have been adequately addressed, we point out two problems that demand specific consideration for Personal LLM Agents. 1. Personal LLM Agents may frequently update the memory. Thus, the external memory is expected to facilitate fast updates, maintenance, and re-indexing. 2. The memory of Personal LLM Agents may be stored on personal devices with limited storage space, while the memory of the personal agents will accumulate over time. Therefore, it is necessary to effectively compress the memory to avoid fast-growing space and computational cost.

6 Security and Privacy

Figure 11: The summary of techniques to address security and privacy issues of Personal LLM Agents.

The extensive integration of sensitive personal data and safety-critical personal tools sets Personal LLM Agents apart from regular LLM agents. As a result, ensuring the protection of user data privacy and service security in Personal LLM Agents becomes a crucial problem. In the context of Personal LLM Agents, we focus on three security principles including confidentiality, integrity, and reliability, as shown in Figure 11. Confidentiality represents the protection of user data privacy, ensuring that unnecessary and unauthorized disclosure of sensitive information does not occur during user interactions with the agents. Integrity represents the resilience of the agents’ decisions, ensuring that the behaviors performed by the agent align with the intended behaviors and have not been deliberately modified or influenced by malicious parties. Reliability focuses on making the agents’ behaviors more dependable and truthful. Unlike integrity, where incorrect answers are a result of intentional external manipulation, reliability addresses the agents’ internal mistakes.

6.1 Confidentiality

In this subsection, we discuss possible methods for protecting user privacy in Personal LLM Agents. As mentioned earlier, ensuring user privacy is of utmost importance for the personal agents that have access to a significant amount of user-sensitive data. Unlike traditional LLM-based chatbots where the users explicitly input text, Personal LLM Agents have the potential to spontaneously initiate queries in places without user awareness, which may contain sensitive information about the user. Meanwhile, the agents may also expose the user information to other agents or services. Consequently, the protection of user privacy becomes even more critical. There are various methods to enhance the confidentiality, including local data processing, homomorphic encryption, data masking, permission access control, etc.

6.1.1 Local Processing

A simple and effective approach to protect user privacy is to perform the computations locally on the users’ personal devices. While LLM service providers are currently working towards improving security and building user trust, it is important to acknowledge that transmitting private data to the cloud inherently introduces additional potential risks. Therefore, processing all data locally is considered a more secure method of interacting with LLMs than transmitting data to the cloud. However, deploying LLMs locally poses challenges in efficiently processing user requests due to resource constraints on personal devices. This can lead to slow inference speed or even the inability to perform inference due to the limitations of available memory. Since the data in Personal LLM Agents is mainly processed by the LLM, the key to achieving local computation is to run the LLM on users’ own devices. There are various existing lightweight models [308, 250] and deployment frameworks [309, 274] available for deploying models on edge devices. Furthermore, various model compression techniques [310, 220, 216] have been proposed to reduce the model size and further enable local deployment.

Nevertheless, despite the various efforts of researchers, using a locally-deployed model inevitably faces the challenge of limited model accuracy [42]. Most domain experts also suggest adopting a cloud-edge collaborative deployment approach to achieve better performance tradeoffs. Meanwhile, like other software applications, many Personal LLM Agents would also need to communicate with the cloud to provide online services. It is usually difficult or even impossible to keep the private data completely on local devices.

6.1.2 Secure Remote Processing

To invoke cloud-based model inference services while preserving privacy, an ideal solution is homomorphic encryption (HE) [311, 312]. In this method, the client employs encryption to encode the user’s plaintext request, and the server conducts model inference on the resulting ciphertext. Subsequently, the client receives the inference results in the encrypted format and gets plaintext results after decryption. There have been several studies [313] that have demonstrated the feasibility of applying HE to Deep Neural Networks, showcasing the potential for integrating HE into models.
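
The encrypt, compute, and decrypt workflow can be illustrated with a toy example. The sketch below uses the Paillier cryptosystem, which is only additively homomorphic (it supports adding ciphertexts and multiplying a ciphertext by a plaintext integer), so it is a swapped-in simplification rather than the HE schemes used for neural network inference. The tiny hard-coded primes are insecure and chosen purely for readability, and all function names are hypothetical.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy Paillier key pair; the primes are far too small for real use."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    mu = pow(lam, -1, n)    # valid because the generator is g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Client side: encrypt a non-negative integer m < n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (1 + m * n) % n2 * pow(r, n, n2) % n2


def decrypt(n, private_key, c):
    """Client side: recover the plaintext from a ciphertext."""
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return (x - 1) // n * mu % n


def encrypted_dot(n, ciphertexts, weights):
    """Server side: weighted sum of encrypted inputs with plaintext,
    non-negative integer weights, computed without decrypting anything."""
    n2 = n * n
    acc = 1                                  # a valid encryption of 0
    for c, w in zip(ciphertexts, weights):
        acc = acc * pow(c, w, n2) % n2       # adds w * plaintext(c)
    return acc
```

A client could encrypt the inputs `[3, 5, 7]`, send the ciphertexts to a server holding the weights `[2, 4, 6]`, and decrypt the returned ciphertext to obtain 68. The server sees only ciphertexts throughout.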

When employing HE in Personal LLM Agents, two challenges arise. The first challenge pertains to the limitation that not all operations within LLMs can be executed using HE. At most, HE supports an unlimited number of additions (equivalent to XOR in a boolean circuit) and multiplications (equivalent to AND in a boolean circuit). However, certain operations in LLMs, such as max, min, and softmax, cannot be accurately performed using HE. The second challenge involves the slow inference speed associated with HE, given the large computational complexity of LLMs.

There are several solutions to address these two problems. THE-X [314] presents a workflow for replacing the original non-linear layers with layers that can be computed using HE. In cases where HE cannot perform certain operations, such as the max operation, the ciphertext is sent back to the local device. The local device then performs the operation and sends the re-encrypted text back to the cloud. Cheetah [315] encompasses a collection of algorithmic and hardware optimizations designed for HE inference on server-side systems. The primary objective of Cheetah is to enhance the computational efficiency of HE, thereby accelerating HE operations.

However, despite the numerous efforts on accelerating HE-based DNN inference, the current state of homomorphic encryption still falls significantly short of meeting the latency demands of agents [316].

Another way to achieve confidential remote data processing is using the trusted execution environments (TEE) [317] for model inference. However, TEE may be subject to various attacks [318] and may also lead to limited performance.

6.1.3 Data Masking

An alternative approach is to use data masking to preprocess the information before sending it to the cloud. The basic idea is to transform the original inputs into a form that is not privacy-sensitive while preserving the information that has a crucial impact on the inference results.

One direct approach of data masking is to transform the plaintext inputs by hiding or replacing sensitive content such as account numbers, addresses, and personal names. These types of information are commonly referred to as Personally Identifiable Information (PII). However, accurately defining PII can be challenging due to its obscure boundaries and diverse forms, making it difficult to consistently identify and remove it from the original content. The National Institute of Standards and Technology (NIST) has provided a guide [319] that offers recommendations for safeguarding the confidentiality of PII, which could help manage PII more securely.
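
A minimal sketch of plaintext masking is given below. It assumes that PII can be caught by two illustrative regular expressions (email addresses and long digit sequences such as phone or account numbers), which is far from comprehensive given the fuzzy boundaries of PII noted above. The names `mask_pii` and `unmask` and the placeholder format are hypothetical.

```python
import re

# Illustrative patterns only; real PII detection needs much broader coverage.
PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("NUMBER", re.compile(r"\b\d[\d -]{6,}\d\b")),
]


def mask_pii(text):
    """Replace matches with placeholders; return masked text and mapping."""
    mapping = {}
    for label, pattern in PATTERNS:
        def substitute(match, label=label):
            placeholder = f"[{label}_{len(mapping) + 1}]"
            mapping[placeholder] = match.group(0)
            return placeholder
        text = pattern.sub(substitute, text)
    return text, mapping


def unmask(text, mapping):
    """Restore the original values locally, e.g. in the model's reply."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text
```

The mapping stays on the user's device, so only the placeholder text is sent to the cloud, and any placeholders that appear in the response can be replaced with the original values locally.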

On the other hand, researchers have proposed embedding-based data anonymization approaches where the client encodes the original user request into hidden vectors and sends these vectors to the cloud-based model for subsequent inference. The challenge is to protect privacy without degrading inference accuracy or slowing down inference too much. There are several solutions. Coavoux et al. [320] propose a metric to assess the extent of privacy leakage in neural representations and develop a defense method by altering training objectives to achieve a tradeoff between privacy and accuracy. Zhou et al. [321] protect user privacy by adding dynamic fusion to the intermediate representation. TextObfuscator [322] protects user privacy through text obfuscation techniques. During the encoding process, “adversarial representation learning” can be employed by introducing additional constraints to minimize the inclusion of privacy-sensitive information in the encoded vectors [323]. Although this method outperforms homomorphic encryption in terms of inference performance, it usually does not rigorously protect data privacy, as the encoded vectors themselves still carry a risk of leaking sensitive information. Additionally, such methods require an explicit definition of privacy features for the encoder to learn how to remove privacy information during adversarial representation learning.

6.1.4 Information Flow Control

The aforementioned techniques primarily pertain to the privacy of model input data, while there may also be risks of privacy leakage in the model output. This is because the output of the model may not only be returned directly to the user but also be sent to other third-party applications, models, users, or intelligent agents. For instance, when an intelligent agent assists a user in making restaurant reservations, it may take the user’s basic profile and schedule information and feed them into the restaurant reservation software. Similarly, when businesses aim to recommend products to users, they may rely on user preference information retrieved from the output of certain personal agents. This method of obtaining privacy information from the output of LLMs is similar to personal data access interfaces in traditional operating systems, where it is crucial to ensure the control and transparency of privacy data access with permission management systems [324]. Transparency necessitates informing users about access information regarding privacy data, including the accessing entity (who), content (what), time (when), intent (why), access method (how), etc.

One can also directly ask the LLMs to keep private information confidential. However, since LLMs work statistically rather than based on explicit rules, their security cannot be rigorously proven. Therefore, we should not consider LLMs as a part of the Trusted Computing Base (TCB) when dealing with data confidentiality. Instead, we may need rule-based permission control to constrain what LLMs can do and what LLMs can access. Permission mechanisms allow users to configure whether different entities are permitted to access different types of information. In Personal LLM Agents, one of the challenges in designing permission mechanisms lies in delineating the types of privacy data, as the content obtained by third-party applications is generated by the model. In traditional systems, researchers have proposed numerous methods for fine-grained privacy content subdivision and permission control, as well as privacy data traceability techniques based on information flow propagation [325]. However, establishing privacy data traceability for the output generated by LLM agents remains an open issue.
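
A rule-based permission check that sits outside the LLM can be sketched as follows. The design is an assumption for illustration: rules map an (entity, data type) pair to an allow flag, access is denied by default, and every request is recorded with the who, what, when, why, and how fields mentioned in the transparency discussion. The class name `PermissionManager` and its methods are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class PermissionManager:
    """Deny-by-default access rules with an audit trail."""
    rules: dict = field(default_factory=dict)      # (entity, data_type) -> bool
    audit_log: list = field(default_factory=list)

    def grant(self, entity, data_type):
        self.rules[(entity, data_type)] = True

    def revoke(self, entity, data_type):
        self.rules[(entity, data_type)] = False

    def request(self, entity, data_type, intent, method):
        """Check a rule and log the access attempt, whether allowed or not."""
        allowed = self.rules.get((entity, data_type), False)
        self.audit_log.append({
            "who": entity,
            "what": data_type,
            "when": datetime.now(timezone.utc).isoformat(),
            "why": intent,
            "how": method,
            "allowed": allowed,
        })
        return allowed
```

In this sketch the agent would call `request` before passing any user data to a third party, so the decision is made by explicit rules rather than by the model, and the log gives the user a record of every attempted access. It does not address the harder problem of classifying what kind of private data a model-generated output actually contains.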

Remark. Ensuring the confidentiality of user data is crucial for Personal LLM Agents to build user trust. However, existing privacy protection techniques are still not sufficient to support agents with higher levels of intelligence. The following open problems remain: 1. Existing approaches face a common challenge in balancing efficiency and effectiveness. For example, how can we enable powerful and efficient local LLMs, how can we scale homomorphic encryption (HE) or trusted execution environments (TEE) to large models, and how can data masking/obfuscation techniques achieve rigorous confidentiality? 2. As Personal LLM Agents are a new software paradigm, it is still unclear what a systematic privacy protection mechanism for them should look like. Do we still need symbolic rules or permissions for access control? How can they seamlessly integrate with the uninterpretable nature of LLMs?

6.2 Integrity

Integrity refers to the capability of Personal LLM Agents to ensure that they can output the intended content correctly, even when faced with various types of attacks. As Personal LLM Agents necessitate interactions with diverse data, applications, and other agents, hostile third parties may seek to steal user data and assets or disrupt the system’s normal function through unconventional means. Therefore, the system must be able to resist various types of attacks. Traditional attack methods, such as modification of model parameters and theft of or tampering with local data, can be defended against using encryption, permissions, hardware isolation, and other measures. However, in addition to defending against traditional attack methods, attention should also be paid to new types of attacks that LLM agents may encounter: adversarial attacks, backdoor attacks, and prompt injection attacks.

6.2.1 Adversarial Attacks

Malicious attacks primarily achieve their objectives through the specialized customization of the model’s inputs or malicious tampering with the model. A significant category of attacks, known as “adversarial attacks”, causes model inference errors by customizing or tampering with the model’s input data, and was initially discovered in image classification models [326]. This type of attack can induce serious classification errors by adding imperceptible noise to images. Subsequently, researchers have extended this attack method to text data, graph data, and beyond [327]. Such attacks also persist in large language models [328], which may also accept input of images [329], text [330], and other modalities of data [331] from third parties. For example, when an agent assists users in automating tasks, attackers may misguide it into deleting calendar events and leaking private conversation data [332], because LLMs often need to take the application’s internal content as input to generate the next interaction decision. In such cases, if the third-party application feeds the LLM with maliciously customized content, it could drive the intelligent agent to engage in unsafe interactions. Traditional defense methods against such attacks in deep learning models usually encompass adversarial defense, abnormal input detection, input preprocessing, output security verification, and more [327]. While these methods theoretically remain applicable to LLMs and LLM agents, the large scale of parameters and the characteristics of autoregressive generation may render some computationally expensive methods (such as formalized output security validation and detection of anomalous data based on intermediate layer activations) challenging to implement. Furthermore, some defense methods may require adjustments in the context of LLMs. For instance, training the LLM may incur substantial costs, making it impractical to enhance security through adversarial training. Therefore, exploring how to achieve effective adversarial defense through parameter-efficient fine-tuning is worth investigating.

6.2.2 Backdoor Attacks

Another common form of attack is the backdoor attack. Traditional model backdoor attacks are often achieved through data poisoning [333], i.e., inserting maliciously modified samples into the model’s training data so that the model learns deliberately hidden decision logic, such as “when seeing an apple pattern, the model outputs an incorrect classification”. For LLMs, data poisoning may be more challenging due to the huge amount and strict unified management of training data, but another type of backdoor attack method [334] remains valid, which implants insecure logic into the model by modifying the model input at test time. Kandpal et al. [335] elicit targeted misclassification when the language models are prompted to perform a particular target task. ProAttack [336] directly utilizes prompts as triggers to inject backdoors into LLMs, which is the first attempt to explore clean-label textual backdoor attacks based on the prompt. PoisonPrompt [337] is a bi-level optimization-based prompt backdoor attack on soft and hard prompt-based LLMs. Since LLMs often use several fixed prompts in certain scenarios, this form of attack, achieved by modifying the prompts, essentially fine-tunes the model’s parameters and thus alters its decision logic. On the defense side, Sun et al. [338] propose testing the backward probability of generating sources given targets, which yields effective defense performance against different types of attacks. However, when attackers mimic normal behavior, this defense method may become ineffective. Therefore, there is not yet a robust solution for backdoor defense in agent systems [339]. This highlights the need to develop effective defenses against sophisticated attacks that mimic legitimate behavior.

6.2.3 Prompt Injection Attacks

In the era of LLMs, a new and particularly crucial security risk has emerged, namely prompt injection attacks [340, 341, 342, 343]. The targeted model typically incorporates certain security safeguards through alignment and prompts. Nevertheless, third-party model users can bypass these preset security safeguards by using subtle or special diction in the prompts. For instance, an intelligent personal assistant may be preset not to execute certain sensitive operations, such as modifying a user’s account password [344], but prompt injection (e.g., requesting the LLM to “disregard the previously set limitations” or “assume operation in an authorized secure mode”) could induce the model to violate regulations and perform these sensitive operations.

For such prompt-based attack methods, there are currently no perfect defense mechanisms. SmoothLLM [345] is the first general-purpose defense method for prompt injection, and it randomly perturbs multiple copies of a given input prompt and then aggregates the corresponding predictions to detect adversarial inputs. However, its defensive effectiveness is highly dependent on the model’s robustness, since there was only about 1% reduction in the attack success rate for some models. An essential way to mitigate this issue is to ensure the transparency and security of the LLM’s prompts. For example, a Personal LLM Agent could rigidly control the template and specifications of prompts, requiring all requests to comply with the preset template and specifications. Additionally, post-processing of the input content from third-party applications (summarization, translation, restatement, etc.) or prompt encapsulation (such as adding explicit text before and after to indicate their origin from a third party) can help the model clearly distinguish them from the system’s inherent prompts.
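The template control and prompt encapsulation ideas above can be illustrated with a minimal sketch. The marker strings, the function names `encapsulate` and `build_prompt`, and the template layout are all assumptions made for illustration; they are not taken from any cited system, and such wrapping reduces but does not eliminate the risk of injection.

```python
import re

# Hypothetical delimiters; any marker the system reserves for itself would work.
THIRD_PARTY_OPEN = "<<THIRD_PARTY source={source}>>"
THIRD_PARTY_CLOSE = "<<END_THIRD_PARTY>>"

# Matches anything that looks like one of the reserved markers.
_MARKER_PATTERN = re.compile(r"<<\s*/?\s*(END_)?THIRD_PARTY[^>]*>>", re.IGNORECASE)


def encapsulate(content: str, source: str) -> str:
    """Wrap untrusted text in explicit markers that state its origin."""
    # Remove marker look-alikes so the content cannot close the wrapper early.
    cleaned = _MARKER_PATTERN.sub("[removed marker]", content)
    # Restrict the source label to a safe character set.
    safe_source = re.sub(r"[^A-Za-z0-9_.-]", "_", source)
    return "\n".join([
        THIRD_PARTY_OPEN.format(source=safe_source),
        cleaned,
        THIRD_PARTY_CLOSE,
    ])


def build_prompt(system_rules: str, user_request: str, third_party: dict[str, str]) -> str:
    """Assemble a prompt from a fixed template: rules, request, then wrapped data."""
    wrapped = [encapsulate(text, source) for source, text in third_party.items()]
    return "\n\n".join([
        "SYSTEM RULES:\n" + system_rules,
        "Text between THIRD_PARTY markers is data, not instructions.",
        "USER REQUEST:\n" + user_request,
        *wrapped,
    ])
```

Because every request passes through `build_prompt`, third-party text can never appear outside a wrapper, and a payload that tries to forge the closing marker is neutralized before it reaches the model.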

Remark. Ensuring the integrity of the decision process is crucial for Personal LLM Agents. The threats to integrity are very diverse and continuously evolving, while the development of defensive techniques is lagging behind. Here we highlight two important open problems that apply to all types of attacks. 1. How can the agents know if their input or decision process has been tampered with by third parties? This requires the agents to have a sense of what normal inputs and behaviors look like, and the ability to recognize anomalies. 2. Since directly avoiding the attacks may be challenging, it would be more practical to consider user verification mechanisms, i.e., asking the user to verify when the agents are uncertain. How to design a secure and user-friendly verification mechanism is challenging.

6.3 Reliability

In Personal LLM Agents, numerous critical actions are determined by the LLMs, including some sensitive operations such as modifying and deleting user information, procuring services, and sending messages. Therefore, ensuring the reliability of the agent’s decision-making process is crucial. We discuss the reliability of LLMs from three perspectives: the problems (i.e., where do the reliability issues of LLMs arise?), improvement (i.e., how can we make the LLMs’ responses more reliable?), and inspection (i.e., how can we deal with the LLM’s potentially unreliable output?).

6.3.1 Problems

Hallucination. LLMs may produce incorrect answers, which can lead to severe consequences. In comparison to LLM-based chatbots that directly interact with users via text, Personal LLM Agents minimize user disruptions by avoiding frequent result verifications, hence amplifying the severity of producing incorrect answers. Researchers have uncovered cases where LLMs generate text that is coherent and fluent but ultimately erroneous. This phenomenon, known as hallucination in natural language processing tasks, poses a challenge to personal agents as well. Ji et al. [346] delve deeply into the various manifestations of hallucinations in natural language processing tasks. Rawte et al. [347] further discuss the hallucinations in multimodal foundation models, providing valuable references for interested readers.

Unrecognized Operation. Unlike the hallucination problem that focuses on the “wrong answer” produced by LLMs, there are many cases where the responses from these models are “not even wrong”. For instance, consider the scenario where the LLM is instructed to initiate a phone call by using the format “CALL XXXXXX”. In response, the LLM may generate a reply “I will make a call to XXXX”, which accurately conveys the intended meaning but deviates from the specified format, rendering it unexecutable. The essence of LLMs is language modeling, and the outputs of language models are typically in the form of language. Compared to other LLMs that interact directly with humans, Personal LLM Agents are required to execute actions. As a result, they have significantly higher requirements for the format and executability of their outputs [348].
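The gap between a well-meant reply and an executable one can be made concrete with a strict parser. The sketch below assumes a hypothetical action grammar in which a call command is the literal word `CALL` followed by 3 to 15 digits; the pattern and the name `parse_call_action` are illustrative only.

```python
import re
from typing import Optional

# Hypothetical action grammar: the literal word CALL followed by 3-15 digits.
CALL_PATTERN = re.compile(r"CALL (\d{3,15})")


def parse_call_action(llm_output: str) -> Optional[str]:
    """Return the phone number if the output is exactly a CALL command, else None."""
    # fullmatch rejects any extra words before or after the command.
    match = CALL_PATTERN.fullmatch(llm_output.strip())
    return match.group(1) if match else None
```

A reply such as “I will make a call to 5550100” yields `None`, so the agent can re-prompt or reject the step instead of executing an ambiguous instruction.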

Sequential Reliability. LLMs are initially pre-trained on sequential data (i.e., corpus) with sequential training objectives (i.e., left-to-right language modeling task). However, problems in the real world may not be fully addressed sequentially. Achieving sequential reliability poses several challenges, including context preservation, coherence maintenance, etc. To better maintain a coherent and meaningful conversation between users and Personal LLM Agents, we need to elicit the LLMs’ ability to think from a global perspective, not solely relying on the previously generated tokens or contexts. To enhance the thinking and reasoning abilities of LLMs, Yao et al. [83] propose Tree-of-Thought to generate and conclude over multiple different reasoning paths, and Zhang et al. [349] propose Cumulative Reasoning, which solves complex tasks in a cumulative and iterative manner. There is also potential for designing the overall plan for solving the task [87] or drawing insights from previous work [350, 351].

6.3.2 Improvement

The improvement approaches aim to improve the quality of LLM output, thereby enhancing the reliability of LLM-based agents.

Alignment. As LLMs grow in size and complexity, concerns have arisen regarding their potential to generate biased, harmful, or inappropriate content. Alignment methods seek to mitigate these risks and ensure that the behavior of LLMs aligns with ethical and societal norms. One common alignment method is the use of pre-training and fine-tuning [352, 353, 354]. LLMs are pre-trained on vast amounts of text data to learn language patterns and representations. During the fine-tuning phase, the models are further trained on more specific and carefully curated datasets, including human-generated examples and demonstrations. This process helps align the models with desired behaviors by incorporating human values and intentions into their training. Another alignment method is reward modeling, which involves defining and optimizing a reward function that reflects desired outcomes or behaviors. By providing explicit rewards or penalties for specific actions, LLMs can be trained to generate outputs that align with those predefined objectives. Reinforcement learning techniques (e.g., RLHF [43], RLAIF [355], C-RLFT [356]) can be employed to optimize the model’s behavior based on these reward signals. Human oversight and intervention are also critical alignment methods. Human reviewers or moderators play a crucial role in reviewing and filtering the outputs of LLMs for potential biases, harmful content, or inappropriate behavior. Their feedback and interventions are used to iteratively improve the model’s performance and align it with desired standards.

Self-Reflection. It has been shown that language models can estimate the probability that their answers are correct [357]. Inspired by the autonomous operation of LLMs, researchers have suggested leveraging the model’s self-reflection to mitigate the problem of incorrect content generation. Huang et al. [211] and Madaan et al. [358] show that LLMs are capable of self-improving with unlabeled data, and Shinn et al. [359] propose Reflexion to let LLMs update through their own linguistic feedback. Chen et al. [360] propose Self-Debug to iteratively improve the responses on several code generation tasks. SelfCheckGPT [361] allows large models to provide answers to the same input question multiple times and checks the consistency between these responses. If there are contradictions among the answers, there is a higher probability that the model has generated unreliable content. Du et al. [362] attempt to enhance the reliability of model outputs by enabling multiple large model agents to engage in mutual discussion and verification. There are various ways to combine models, similar to the diverse collaboration methods in the human world. However, just as more employees require increased expenses, having more models entails greater computational power requirements. The above works demonstrate a trend in which LLMs are evolving from mere textual generators to intelligent agents, transitioning from primitive comprehension-based reasoning to reflective reasoning with iterative updates.
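The sampling-and-consistency idea can be sketched as follows. This is not the scoring used by SelfCheckGPT: as a deliberately crude stand-in it measures word-set (Jaccard) overlap between sampled answers, and the function names and the 0.5 threshold are assumptions chosen for illustration.

```python
import re
from itertools import combinations


def _tokens(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for a semantic comparison."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two answers, in the range [0, 1]."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def consistency_score(samples: list[str]) -> float:
    """Mean pairwise similarity of several answers to the same question."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("need at least two samples")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def looks_unreliable(samples: list[str], threshold: float = 0.5) -> bool:
    """Flag the question when the sampled answers disagree too much."""
    return consistency_score(samples) < threshold
```

In practice the samples would come from querying the same model several times with a non-zero temperature, and a semantic similarity measure would replace the word overlap used here.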

Retrieval Augmentation. LLMs show strong performance across various tasks. However, the parametric knowledge stored in the models could still be incomplete and difficult to update efficiently. Alternatively, retrieval-augmented methods [199, 200, 363] provide a semi-parametric way to offer complementary non-parametric information, allowing the LLMs to draw on real-world knowledge retrieved from sources such as Wikipedia, documents, or knowledge graphs [364] when generating content. This approach offers the advantage of not requiring model modification, facilitates real-time information updates, and allows the traceability of generated results to the original data, thereby enhancing the interpretability of the generated information. Retrieval augmentation has been shown to be effective for traditional pre-trained models such as BERT [365]. However, for LLMs that already have strong reasoning ability, augmenting the context could also have a negative impact due to irrelevant or noisy information [366]. To tackle these issues, Guo et al. [192] propose a prompt-guided retrieval method for non-knowledge-intensive tasks, enhancing the relevance of retrieved passages for more general queries. Yu et al. [367] propose Chain-of-Note to improve the robustness when dealing with noisy and irrelevant documents. Asai et al. [368] propose Self-RAG to enhance factuality through self-reflection. Wang et al. [369] propose SKR, a self-knowledge-guided retrieval method to balance external knowledge with internal knowledge. Wang et al. [370] propose FILCO to filter the context in advance and improve the fine-grained relevance of retrieved segments. The CRITIC [371] framework utilizes LLMs to verify and iteratively self-correct their output through interaction with external tools, such as a calculator, Python interpreter, and Wikipedia. However, this approach offers limited help for user requests for which no matching content can be found in external knowledge bases.
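A toy sketch may help show how retrieved passages enter the prompt and why filtering matters. It does not reproduce any of the cited methods: the word-overlap scorer, the `min_score` threshold, and the prompt layout are assumptions, and a real system would use a learned retriever over a much larger corpus.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercase word set used by the toy relevance scorer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance(query: str, passage: str) -> float:
    """Fraction of query words that also appear in the passage (toy scorer)."""
    q = _tokens(query)
    return len(q & _tokens(passage)) / len(q) if q else 0.0


def retrieve(query: str, corpus: list[str], k: int = 2, min_score: float = 0.5) -> list[str]:
    """Return up to k passages, dropping those below the relevance threshold."""
    scored = sorted(
        ((relevance(query, p), p) for p in corpus),
        key=lambda sp: sp[0],
        reverse=True,
    )
    return [p for score, p in scored[:k] if score >= min_score]


def build_prompt(query: str, passages: list[str]) -> str:
    """Prepend retrieved passages, or fall back to the bare question."""
    if not passages:
        return f"Question: {query}"
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}"
```

The threshold plays the role of the context filtering discussed above: passages that share nothing with the query are dropped rather than passed to the model as noise, and the model is left to answer from its parametric knowledge when nothing relevant is found.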

6.3.3 Inspection

The inspection-based approaches, on the other hand, do not interfere with the LLM generation process. Instead, they focus on how to enhance or understand the reliability of agents based on the already generated results.

Verification. Given that the issue of unreliable content generation by LLMs cannot be entirely avoided when deploying such systems for actual use, it remains necessary to establish rule-based security verification mechanisms. Regarding the aforementioned unrecognized operation, “Constrained Generation” refers to the process of generating formatted and constrained output, which can be employed to tackle this issue. Kumar et al. [372] employ Langevin Dynamics simulation for non-autoregressive text generation as a solution to this problem. On the other hand, Miao et al. [373] introduce a method that suggests a candidate modification at each iteration and verifies whether the modified sentence satisfies the given constraints to generate constrained sentences. Li et al. [374] and Weng et al. [375] propose self-verification to help the reasoning process of large language models. Responsible Task Automation [96] is a system that can predict the feasibility of commands, confirm the completeness of executors, and enhance the security of large language models. However, further research is needed to improve the precision and recall in identifying sensitive operations and to mitigate the decision burden on users.
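A rule-based check placed between the LLM and the executor might look like the following sketch. The action names, the two rule sets, and the `confirm` callback are hypothetical and only illustrate the pattern of rejecting unknown actions and deferring sensitive ones to the user.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical rule set: which action names are allowed, and which need confirmation.
ALLOWED_ACTIONS = {"send_message", "delete_contact", "set_alarm"}
SENSITIVE_ACTIONS = {"send_message", "delete_contact"}


@dataclass(frozen=True)
class Action:
    name: str
    argument: str


def verify(action: Action, confirm: Callable[[Action], bool]) -> bool:
    """Return True only if the action passes the rules and any required confirmation."""
    if action.name not in ALLOWED_ACTIONS:
        return False  # unknown actions are rejected outright
    if action.name in SENSITIVE_ACTIONS:
        return confirm(action)  # defer to the user for sensitive operations
    return True
```

The size of `SENSITIVE_ACTIONS` controls the trade-off mentioned above: a broader set catches more risky operations but asks the user to confirm more often.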

Explanation. While it was mentioned earlier that intelligent personal assistants should minimize user interruptions, incorporating user opinions or human assistance can be valuable, particularly when making significant decisions. In case an intelligent personal assistant makes a mistake, having interpretable logic can also be helpful in the subsequent debugging process. Several surveys [376, 377, 378] discuss explainable language models. Traditionally, rationale-based methods [379, 380] can be used to explain the model output by explicitly training on human-annotated data. As for LLMs, chain-of-thought reasoning [82] approaches can also help the model generate textual explanations. To make the reasoning process more robust and reliable, recent studies further enhance chain-of-thought reasoning with majority voting [381] and iterative bootstrapping [382] mechanisms. Researchers place significant emphasis on interpretability, as it not only contributes to reliability but also represents an intriguing research direction.
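Majority voting over sampled reasoning chains reduces to counting final answers. The sketch below assumes the final answer has already been extracted from each chain as a string; the function name and the normalization step are illustrative choices.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most common final answer and the fraction of samples that agree."""
    if not answers:
        raise ValueError("no answers to vote on")
    # Light normalization so that "18" and "18 " count as the same answer.
    normalized = [a.strip().lower() for a in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)
```

The agreement fraction can double as a rough confidence signal: a low value suggests the reasoning paths diverge and the result may deserve user verification. On ties, `Counter.most_common` returns the answer that was seen first.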

Intermediate Feature Analysis. Beyond the last-layer representation, some work involves analyzing the intermediate states in the model’s inference process to judge the generation of false information. Halawi et al. [383] discover that the behavior of a model may significantly diverge at certain layers, highlighting the importance of analyzing the intermediate computations of the model. Li et al. [384] find that the activations of intermediate layers can reveal directions of “truthfulness”, showing that LLMs may already capture knowledge that they do not generate. They further propose shifting the model activations during inference to improve the responses of LLMs. van der Poel et al. [385] propose a method to leverage mutual information and alleviate hallucination by assessing the confidence level of the next token, where the underlying reason is that the neural activation pattern in LLMs during the generation of hallucinatory content differs from normal outputs. These studies highlight the drawbacks of solely depending on the final-layer representation for language modeling, revealing the potential benefits of harnessing hierarchical information across different layers of the model.

Remark. The reliability of LLM generation has received a considerable amount of attention, especially around the hallucination problem. However, avoiding unreliable behaviors is still difficult, if not impossible. The open problems include: 1. How can we evaluate the reliability of LLMs and LLM agents? Existing methods rely on either black-box LLMs such as GPT-4 or costly human annotations. Authoritative benchmarks and methods are desired for evaluating and improving the reliability. 2. Similar to the confidentiality problem, incorporating rigorous symbolic rules in the decision process of Personal LLM Agents would be a practical solution for reliability. However, complying with the rules while retaining the powerful capabilities of LLM agents is challenging. 3. The lack of transparency and interpretability of DNNs has been a long-standing problem, which is even more critical for all security & privacy aspects of Personal LLM Agents. How to interpret and explain the internal mechanisms of LLMs is a direction worth continuous investigation.

7 Conclusion and Outlook

The emergence of large language models presents new opportunities for the development of intelligent personal assistants, offering the potential to revolutionize human-computer interaction. In this paper, we focus on Personal LLM Agents, systematically discussing several key opportunities and challenges based on domain expert feedback and an extensive literature review.

Currently, research on Personal LLM Agents is in the early stages. Task execution capabilities are still relatively inadequate, and the range of supported functionalities is rather narrow, leaving significant room for improvement. Moreover, ensuring the efficiency, reliability and usability of such personal agents requires addressing numerous critical performance and security issues. There is an inherent tension between the need for large-scale parameters in LLMs to achieve better service quality and the constraints of resources, privacy and security in personal agents.

Going forward, in addition to addressing the respective challenges in each specific direction, a joint effort is needed to establish the whole software/hardware stack and ecosystem for Personal LLM Agents. Researchers and engineers also need to carefully consider the responsibility of such technology to guarantee the benign and assistive nature of Personal LLM Agents.

Acknowledgment

This work is supported by the National Natural Science Foundation of China (NSFC, Grant No. 62272261) and collaborative research projects with AsiaInfo Technologies (China) Inc. and Xiaomi Inc. We sincerely thank the many domain experts who provided valuable feedback, including Xiaobo Peng (Autohome), Ligeng Chen (Honor Device), Miao Wei, Pengpeng He (Huawei), Hansheng Hong, Wenjun Chen, Zhiyao Yang (Oppo), Xuesheng Qi (vivo), Liang Tao, Lishun Sun, Shuang Dong (Xiaomi), and others who remain anonymous. Among the co-authors, Jiacheng Liu, Wenxing Xu, and Rui Kong were interns at the Institute for AI Industry Research (AIR), Tsinghua University when writing this paper.

References