I was on a flight to Austin, Texas in May. My goal for the next two hours – do some paper writing and code development. Sorry, I am not the “So, why are you headed to Austin?” flight guy. You know the dilemma I faced – dole out the $16 for Wi-Fi or not? Hey, look, I might need Google for my inevitable programming question. At that moment a thought occurred to me: “Heck no, I’ve got Ollama!” Air-gapped AI to the rescue! Throw in a Retrieval Augmented Generation (RAG) pipeline and I am a bona fide knowledge engineering pioneer.

Here is what you need to know. Air-gapped AI [1] is coming into vogue, and it will become the norm. In the context of the Department of Defense (DoD), the term means AI that is meant to work in an isolated, non-connected environment. You can do the exact same thing on your laptop (provided you have more than 16 GB of RAM – this post explains how). Now, this is true democratization of AI. It is all at your fingertips. You are free to do with it what you please in your own private environment. Simply read on.

What’s Ollama?

It’s an open source project that lets you run large language models (LLMs) locally [2,3]. You download the Ollama executable and, with it, download and experiment with different LLMs on your own machine. It includes a command line client that you can chat with. Figure 1 provides an example: I run Ollama from the command line and ask it a question. It fields the question and then patiently waits for my next one. Good chat bot! Note that you need at least 16 GB of RAM to get a decent LLM up and running.

Figure 1: Asking Ollama with the Mistral LLM a Question
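Under the hood, that command line client talks to a local Ollama server (listening on port 11434 by default). Here is a minimal sketch of asking the same kind of question from Python over that local HTTP endpoint – assuming the mistral model has already been pulled; the question text is just an example.

```python
import json
import urllib.request

# Ask the locally running Ollama server (default port 11434) a question.
# Assumes `ollama pull mistral` has already been run; no Internet required.
payload = {
    "model": "mistral",
    "prompt": "What is an air gapped system?",
    "stream": False,  # return one complete response instead of a token stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    answer = json.loads(resp.read())
print(answer["response"])
```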

For software developers, Ollama could not be easier to access from code. There is an Ollama Python API [4] and an Ollama Docker container [5].
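For example, chatting with the local Mistral model through the Python API [4] looks roughly like this – a sketch assuming the package is installed (`pip install ollama`) and the model has been pulled:

```python
import ollama

# Send a single chat message to the locally running Mistral model.
response = ollama.chat(
    model="mistral",
    messages=[{"role": "user", "content": "Summarize what a RAG pipeline does."}],
)
print(response["message"]["content"])
```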

What we have here, folks, is the following:

  • An LLM engine that can run any number of LLMs,
  • These models are essentially on par with the capabilities offered by commercial LLMs (like OpenAI’s),
  • You can extend the models with RAG (more on that in a second) or train your own,
  • You can embed and access the LLMs natively from your Python code relying on no external services, and
  • You can do it all via Docker if you want – making it easy to hand your system to anyone.

Sounds sort of like the computational deal of the century – because it is. Except for the fact that you need to learn to program in Python – but I am learning to deal with that [8].

What’s RAG?

What can be done with a local LLM? A lot! We at Canisius created a RAG pipeline for the Russia-Ukraine conflict [6]. A RAG pipeline takes a large corpus of documents and extends an existing LLM with them. It does not train on the documents; rather, it retrieves relevant information from them to provide context for generating responses. Our project was updated just last week to include a pipeline using Ollama and Docker – all run locally for free. Previously, the project used only OpenAI and its fee-based service over the Internet. You may find the project here on GitHub under an MIT license [6].
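In other words, the retrieved passages are simply stitched into the prompt that the LLM sees. A toy sketch of that idea (the function name and prompt wording are my own illustration, not the project’s code):

```python
def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Combine retrieved passages with the user's question into one prompt."""
    context = "\n\n".join(passages)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```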

The project has two phases. In the first phase, it scrapes the Internet for references to documents summarized in an event timeline for the Russia-Ukraine conflict. It uses Python code to create three versions of embeddings for the document corpus: one with OpenAI, one with MiniLM, and one with MPNet. The embeddings are stored as collections in a ChromaDB instance running locally in Docker.
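A condensed sketch of that first phase in Python – not the project’s exact code; the collection names and sample documents are placeholders. It assumes the `sentence-transformers` and `chromadb` packages, plus a ChromaDB container listening on port 8000:

```python
import chromadb
from sentence_transformers import SentenceTransformer

documents = ["Event summary one ...", "Event summary two ..."]  # scraped corpus (placeholder)
ids = [f"doc-{i}" for i in range(len(documents))]

# Connect to the ChromaDB instance running in Docker on its default port.
client = chromadb.HttpClient(host="localhost", port=8000)

# Build one collection per embedding model. (The OpenAI collection is analogous,
# but calls OpenAI's embedding endpoint over the Internet instead.)
for collection_name, model_name in [
    ("minilm_events", "all-MiniLM-L6-v2"),
    ("mpnet_events", "all-mpnet-base-v2"),
]:
    model = SentenceTransformer(model_name)
    embeddings = model.encode(documents).tolist()
    collection = client.get_or_create_collection(name=collection_name)
    collection.add(ids=ids, documents=documents, embeddings=embeddings)
```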

The second phase employs a chat bot backed by Ollama. Figure 2 below shows the data flow when a question is asked. The three embedding stores are consulted locally to find semantically relevant content. The OpenAI embedding results are sent over the Internet to OpenAI (costing me a fraction of a cent per answer). The MiniLM and MPNet embedding results are sent to a local instance of Ollama running the Mistral LLM (costing me nothing). The user then has three answers to review for a single question. Those interested may refer to this project for examples of how to get up and running with RAG, Ollama, and Docker containers.

Figure 2: Data Flow from Question to Reply
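As a rough sketch of the local branch of that flow (again, placeholder names rather than the project’s exact code): embed the question, pull the nearest passages out of ChromaDB, and hand them to the Mistral model running under Ollama.

```python
import chromadb
import ollama
from sentence_transformers import SentenceTransformer

question = "What happened in the conflict in March 2022?"  # example question

# 1. Embed the question with the same model used to build the collection.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
query_embedding = encoder.encode([question]).tolist()

# 2. Retrieve the most semantically similar passages from ChromaDB.
client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection(name="minilm_events")
results = collection.query(query_embeddings=query_embedding, n_results=3)
passages = results["documents"][0]

# 3. Ask the local Mistral model, with the retrieved passages as context.
prompt = (
    "Use the following context to answer the question.\n\n"
    + "\n\n".join(passages)
    + f"\n\nQuestion: {question}"
)
reply = ollama.chat(model="mistral", messages=[{"role": "user", "content": prompt}])
print(reply["message"]["content"])

# The MPNet and OpenAI branches follow the same pattern with their own
# collections; the OpenAI branch sends the prompt over the Internet instead.
```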

Conclusion

This is just one example of how to use a RAG pipeline. I could easily Dockerize my complete solution, host it on DockerHub, and share it with the world. Anyone would be free to download my chat agent and converse with it, all done locally. Pretty amazing, considering the technology came on the scene just in November of 2022. Thank you to all the developers who have implemented open source projects like Ollama and Mistral to help with the whole democratization of AI.

But what about that local Ollama and Mistral instance? There is a lot to be said for its potential. We are fast reaching a point where operating systems will embed LLMs in their distributions to act as a pseudo-intelligence, if you will. They are simply too attractive a solution for automating and streamlining how computers behave and respond in any number of anticipated and unanticipated situations. We are already seeing this with Apple [7]. The other operating systems will not be far behind.

So, get ready: when your operating system embraces and embeds an LLM, the way we use computers will change forever – for both good and bad. Expect more blog posts and research projects on that last point in the future (they are already in the works!).

Footnotes:

[1] https://www.theverge.com/2024/5/8/24152424/microsoft-top-secret-ai-server-gpt-4-azure

[2] https://ollama.com/blog/run-llama2-uncensored-locally

[3] https://github.com/ollama/ollama

[4] https://github.com/ollama/ollama-python

[5] https://hub.docker.com/r/ollama/ollama

[6] https://github.com/Canisius-Open-Source-Initiative/RussianUkraineConflictKnowledgeStore

[7] https://www.tomshardware.com/tech-industry/artificial-intelligence/apple-intelligence-siri-gets-an-llm-brain-transplant-chatgpt-integration-and-genmojis

[8] Really – **kwargs are Python’s answer to overloaded functions? No wonder Python documentation is confusing at (all) times. It’s a shame that a computer science student’s first introduction to programming is typically Python.