PROJECTS
Written by: Mathias Thoresen Paasche (Development Engineer)
In the current tech landscape, there is significant talk about Artificial Intelligence (AI), and notably Large Language Models (LLMs). These technologies are gaining widespread attention and reshaping the digital ecosystem. This article presents a brief introduction to AI and LLMs, with a specific focus on LLMs and how they may impact developers and businesses.
Artificial Intelligence can be described as the simulation of the human mind. It attempts to replicate the capabilities that separate us humans from other animals, such as communication, advanced decision-making and creativity. In recent years there have been large leaps in the field of AI research, from image classification and recognition to image and text generation. The breakthrough for generative AI (AI capable of creating new information) came with the introduction of the transformer deep-learning architecture, which made it possible to generate new data from large datasets. One of the AI technologies built on transformers is the Large Language Model.
LLMs are a subcategory of generative AI models, specialized in generating and understanding text. They do this by using statistics to evaluate which word is most likely to be the first word of the answer, then the second word, and so on, based on the input they were given. To estimate the probabilities needed to decide which word should come next, the model must be trained on an enormous dataset containing text similar to the expected input and output. Creating such datasets and training the models comes at a huge economic cost and may take several weeks, if not months. Making an excellent dataset requires a supervised selection process in which human experts carefully label and annotate data; this work is essential for effective model training, demands skilled personnel and therefore contributes significantly to staffing expenses. Additionally, the training requires remarkably high computational power over an extended period, consuming substantial amounts of electricity and very expensive hardware.
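To make the next-word mechanism concrete, the toy sketch below (written for this article, not taken from the project) replaces a trained model with a hard-coded probability table and repeatedly picks the most likely next word:

# A toy illustration (not a real LLM) of picking the most probable next
# word given a context. A real model learns these probabilities from
# enormous training datasets instead of using a small lookup table.

toy_model = {
    ("the", "capital", "of", "norway", "is"): {"oslo": 0.92, "bergen": 0.05, "a": 0.03},
    ("the", "capital", "of", "norway", "is", "oslo"): {".": 0.97, ",": 0.03},
}

def next_word(context, model):
    """Return the most probable next word for the given context, or None."""
    distribution = model.get(tuple(context), {})
    if not distribution:
        return None
    return max(distribution, key=distribution.get)

prompt = ["the", "capital", "of", "norway", "is"]
while (word := next_word(prompt, toy_model)) is not None:
    prompt.append(word)

print(" ".join(prompt))  # -> "the capital of norway is oslo ."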
Running an LLM yourself does not require nearly as much hardware, since you do not need to train it. You do, however, need enough memory to load the whole model into your system. GPUs are the optimal hardware for running an LLM, with video RAM (VRAM) as the limiting resource. Larger models need more VRAM but also produce better answers than smaller models. This creates a relationship between quality, performance and cost, where quality represents the usefulness of the generated answers and performance is the time it takes to generate them. Given this relationship, a lot of research has been conducted on significantly reducing the size of models without noticeably reducing their quality.
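As a rough rule of thumb (an approximation made for this article, not a figure from the project), the VRAM needed just to load a model scales with the number of parameters times the number of bytes used per parameter, which is why quantization research matters so much:

def estimate_vram_gb(params_billion: float, bytes_per_param: float, overhead: float = 1.2) -> float:
    """Rough estimate of VRAM needed to run a model: parameters times bytes
    per parameter, plus ~20% headroom for activations and the KV cache
    (the overhead factor is an assumption, not a measurement)."""
    return params_billion * bytes_per_param * overhead

print(estimate_vram_gb(7, 2.0))   # 7B-parameter model at 16-bit precision -> ~16.8 GB
print(estimate_vram_gb(7, 0.5))   # same model, 4-bit quantized            -> ~4.2 GB
print(estimate_vram_gb(70, 2.0))  # 70B-parameter model at 16-bit precision -> ~168 GB

Roughly speaking, this is why a quantized 7B model can fit on a single consumer GPU, while larger models quickly require server-class hardware.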
How does this new technology impact developers? First and foremost, developers can use AI to gain a better understanding of the code or system they are going to work on through customized explanations, which reduces the onboarding period for new developers. Secondly, an LLM can help developers write better quality code by correcting syntax errors and recognizing opportunities for optimization. Lastly, documentation and unit tests can be produced more efficiently, allowing developers to spend more time solving the core problems. Furthermore, it is also possible to generate code from scratch with an LLM.
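As a hypothetical illustration of the last point (the function and the tests below are invented for this article, not actual LLM output), a developer might hand an LLM a small function and ask it to "write unit tests for parse_version", receiving something along these lines:

import unittest

def parse_version(text: str) -> tuple:
    """Parse a version string such as '22.04.1' into a tuple of integers."""
    major, minor, patch = text.strip().split(".")
    return int(major), int(minor), int(patch)

# Tests of the kind an LLM could plausibly generate from the function above.
class TestParseVersion(unittest.TestCase):
    def test_valid_version(self):
        self.assertEqual(parse_version("22.04.1"), (22, 4, 1))

    def test_surrounding_whitespace_is_ignored(self):
        self.assertEqual(parse_version(" 1.2.3 \n"), (1, 2, 3))

    def test_invalid_version_raises(self):
        with self.assertRaises(ValueError):
            parse_version("not-a-version")

if __name__ == "__main__":
    unittest.main()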
However, there are some problems that must be considered with AI and LLMs. One of them is sending confidential information to third parties such as OpenAI or GitHub through their respective products, ChatGPT and GitHub Copilot. Broadly speaking, one can assume that everything sent to an online AI service is stored and most likely used to train the next generation of the model. When using OpenAI's ChatGPT to draft a response to a client, one can therefore presume that both the input to the LLM and the response are available for OpenAI to use in the future. For some companies this may not be a problem, as the data is used exclusively for training and is not accessible to the public. Although companies such as OpenAI and GitHub state that all personal and identifying data is anonymized, errors can happen. This was most recently demonstrated by a group of researchers in November 2023, who were able to make ChatGPT respond with names, phone and fax numbers, email addresses and physical addresses (Nasr et al., 2023). Their research shows how targeted attacks can extract private information from the training set. This poses a significant concern for Data Respons R&D Services, given their responsibility for managing and developing the intellectual property of other companies, including source code. Furthermore, sending data to third parties would breach contractual obligations on non-disclosure, security and IPR, among others. A solution may therefore be to inform the customer about AI usage and incorporate it into the contract, with guidelines for how and when LLMs can be used.
Another problem is that generated code may be based upon, or be very similar to, software under a license, without the developer or the project leader knowing about it. The probability of generating licensed code is low, but since the potential repercussions are severe, some form of checklist or control must be carried out before the code is delivered to the customer. Furthermore, given the nature of the services provided by Data Respons, where specialist knowledge and expertise is one of the main selling points, there is a challenge regarding knowledge of how the code works. Code generation increases the amount of code that can be produced, but the lack of direct developer ownership and deeper understanding of the code is likely to cause its own set of problems in, for example, system integration and debugging.
This project started out as a research project with the goal of gaining new knowledge for Data Respons R&D Services within the field of AI and LLMs, and of studying how a local system running an open-source LLM compares to the state-of-the-art services on the market.
With such a wide research field, many different methods and systems were investigated: standard text-chat usage of an LLM, such as OpenAI's ChatGPT; autonomous agents, where an AI model automatically analyzes a problem, proposes a solution, evaluates the solution for improvements and implements those improvements without human interference; training our own LLM on our own Confluence database to turn it into a better and more advanced search function; and code generation in the code editor, similar to GitHub Copilot.
From all these topics, it was decided to first try running an LLM locally on an office computer. The goal was to make a Linux Command Line Helper (LCLH), a tool integrated into the Linux terminal and able to answer short questions about different terminal commands. The response from the LLM should be a command that accomplishes the user's request. For example:
In a Linux terminal:
User: howto find my ubuntu version?
AI: lsb_release -a
Here howto is used as a keyword to activate the LCLH. The question, also called the prompt, is sent to an LLM running locally on the computer. After some time, the LLM has generated an answer that is returned to the user. The locally run LCLH accomplished two things: 1) it worked as intended and was able to return useful answers most of the time, and 2) it removed all online third-party sources, and therefore did not breach any confidentiality regulations. A step in the right direction, although not without a new problem: subpar performance. Some prompts took up to 50 seconds to answer, which is not acceptable when the correct answer can be found on the internet in less than 15 seconds. The performance problem came from insufficient hardware.
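A minimal sketch of how such a helper could be wired up is shown below. It assumes an LLM served locally behind a simple HTTP completion endpoint; the URL, JSON fields and prompt wording are assumptions made for illustration, not the actual LCLH implementation.

# lclh.py - minimal sketch of a Linux Command Line Helper.
# Assumes a local inference server exposing an HTTP completion endpoint;
# the URL and JSON fields below are illustrative assumptions.
import sys
import requests

LLM_URL = "http://localhost:8080/completion"
SYSTEM_PROMPT = (
    "You are a Linux command line helper. Reply with a single shell command "
    "that accomplishes the user's request, and nothing else."
)

def howto(question: str) -> str:
    """Send the user's question to the local LLM and return its answer."""
    response = requests.post(
        LLM_URL,
        json={"prompt": f"{SYSTEM_PROMPT}\nUser: {question}\nAI:", "max_tokens": 64},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["text"].strip()

if __name__ == "__main__":
    # A shell alias such as `alias howto='python3 lclh.py'` makes this
    # callable directly from the terminal, as in the example above.
    print(howto(" ".join(sys.argv[1:])))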
Hardware problems can mainly be solved in two ways: buy new and more powerful hardware, or rent it. For a product that was supposed to be available to all Data Respons software developers, the best solution was to rent. Amazon Web Services (AWS) was chosen as the provider, and a server with a virtual machine running the LLM was set up. This reduced the response time to around 5 seconds, and at a later stage to only 3 seconds. Renting a server also opened the possibility of using a larger and better LLM, enabling more and different use cases.
Among the new use cases was a chat with an LLM specialized in code generation. Combining this capability with an open-source Visual Studio Code extension called Continue produced a mix between OpenAI's ChatGPT and GitHub Copilot. To connect Continue to AWS, a custom-made VS Code extension was developed. The extension runs a Python program on startup, which is responsible for the HTTP connection between Continue and AWS and encrypts/decrypts all messages. The overall architecture can be viewed in the figure below.
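The sketch below illustrates the role of that Python bridge, assuming symmetric encryption with a pre-shared key and a generic completion endpoint on the AWS server. The endpoint URL, port, message format and use of Flask and Fernet are assumptions made for illustration, not the actual extension code.

# Local bridge between the Continue extension and the AWS-hosted LLM (sketch).
from cryptography.fernet import Fernet
from flask import Flask, jsonify, request
import requests

AWS_URL = "https://llm.example.com/generate"   # placeholder for the AWS endpoint
SHARED_KEY = Fernet.generate_key()             # in practice: a pre-shared key
cipher = Fernet(SHARED_KEY)                    # that the server also holds

app = Flask(__name__)

@app.route("/v1/completions", methods=["POST"])
def proxy():
    """Receive a request from Continue, encrypt it, forward it to AWS,
    then decrypt the reply and hand it back to the editor."""
    prompt = request.get_json()["prompt"]
    encrypted = cipher.encrypt(prompt.encode("utf-8"))
    reply = requests.post(AWS_URL, data=encrypted, timeout=120)
    reply.raise_for_status()
    return jsonify({"text": cipher.decrypt(reply.content).decode("utf-8")})

if __name__ == "__main__":
    # Continue is configured to send its requests to this local port.
    app.run(host="127.0.0.1", port=8000)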
The combined system, named DR-Copilot, can analyze and explain code (see right figure above), generate new code (see left figure above), edit existing code, generate unit tests, produce comments/documentation and more. Furthermore, since the system has a secure connection between the user and the server running the LLM, and access to the server is restricted to Data Respons employees only, the confidentiality problem is also solved. For now, the system is only offered to developers working on internal projects – for usage in customer projects, this must be agreed with the customers.
Although DR-Copilot can on paper aid a developer in many ways, in practice the result has been suboptimal compared to, for example, GitHub Copilot. This is directly connected to the small open-source LLM that has been used. Even with a larger LLM and better hardware, it is difficult to achieve responsiveness comparable to GitHub Copilot at a reasonable cost. As an alternative to DR-Copilot, it is possible to use the business version of GitHub Copilot. With the business version, GitHub promises confidentiality of all user inputs, and that no inputs or responses will be used in future training. While the proposition appears promising, validation is challenging due to the closed-source nature of the product. The result is that each company must conduct its own analysis to determine whether to accept the terms or not. Besides the terms, cost must also be factored into the analysis.
In conclusion, the research and development of LLMs has been a practical journey, marked by challenges and inventive solutions. From creating the test system, LCLH, to developing DR-Copilot, a complete system for code generation and analysis, the project showcases different approaches to how developers may utilize AI and the benefits that follow. Although the system works in its current state, the LLM size and hardware are significant limiting factors, restricting it from advancing from a good product to a great product. While leveraging LLM capabilities offers multiple advantages, they need to be used with care in a business context, to ensure all commitments are upheld, both in terms of IPR and workmanship.