Follow this URL to find the Canisius Open Source Initiative (COSI) and its open source, custom chatbot for the Russia-Ukraine conflict:

https://github.com/Canisius-Open-Source-Initiative/RussianUkraineConflictKnowledgeStore

This project is useful to the following three audiences:

  • Those with an interest in the Russia-Ukraine conflict,
  • Those who want a template that explains how to create a custom knowledge store for a heterogeneous set of documents,
  • Those who want to know “How good are large language models (LLMs) at auto-generating code?”

It uses a 240-page chronology of the major events of the conflict as the core of the chatbot’s knowledge store. This open source document [1] was developed by researchers at the National Security Archive. In my view, it is the best available chronology of the conflict’s events through May 2023. Critically, it contains links to the original articles from which the event summaries were distilled. Here is a link to the document itself:

Cyber Vault Ukraine Timeline

The GitHub project provides Python code that ingests this file and all files it points to via hyperlinks. This leads to a knowledge store with close to 700 documents in total. The major modules of the project are depicted in the image below. Note that roughly 85% of the code in the project was auto-generated by LLMs [2] [3] [4] [5]. Thanks, LLMs, for helping to turn an idea into reality! Thanks also to our Cybersecurity graduate student Ellie Furmanek, who helped develop some of the code and is included as a collaborator on the GitHub site. Descriptions of each module follow the image.

1. Transform

This step takes the PDF and transforms it into a CSV file. It required identifying the dates and the narratives, and collecting the URLs for the linked documents from the PDF. None of this was easy. This project would probably not exist without an LLM guiding the work and generating code. That is the beauty of LLMs: they can take over the tedious tasks that might otherwise prevent an idea from being implemented.
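As a rough illustration of the Transform idea, here is a minimal sketch that groups already-extracted text lines into dated entries and writes them as CSV. It is not the project's code: the date format, the regular expressions, the column names, and the function names `parse_entries` and `to_csv` are all assumptions made for this example, and the PDF text extraction itself is assumed to have happened earlier.

```python
import csv
import io
import re

# Assumed date format ("February 24, 2022"); the real timeline may differ.
DATE_RE = re.compile(
    r"^(January|February|March|April|May|June|July|August|September"
    r"|October|November|December) \d{1,2}, \d{4}$"
)
# Simple URL pattern that stops at whitespace and closing brackets.
URL_RE = re.compile(r"https?://[^\s)\]>]+")


def parse_entries(lines):
    """Group extracted text lines into {date, narrative, urls} records."""
    entries = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if DATE_RE.match(line):
            # A line that is only a date starts a new timeline entry.
            current = {"date": line, "narrative": [], "urls": []}
            entries.append(current)
        elif current is not None:
            # Everything else belongs to the most recent date.
            current["urls"].extend(URL_RE.findall(line))
            text = URL_RE.sub("", line).strip()
            if text:
                current["narrative"].append(text)
    return entries


def to_csv(entries):
    """Serialize the records as CSV text with one row per entry."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["date", "narrative", "urls"])
    for entry in entries:
        writer.writerow(
            [entry["date"], " ".join(entry["narrative"]), " ".join(entry["urls"])]
        )
    return buffer.getvalue()
```

Text that appears before the first date line is ignored in this sketch. A real PDF would also need handling for entries that wrap across pages, which is likely where much of the difficulty mentioned above comes from.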

2. Collect

This step downloads the supporting documents that the URLs point to. PDFs are stored in their raw form. HTML pages are stored as extracted text wrapped in JSON. The project uses Python threads to download these resources efficiently.
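The Collect step can be sketched with a thread pool from the standard library. This is an assumed shape, not the project's implementation: the names `fetch`, `to_record`, and `collect`, the record layout, the User-Agent string, and the worker count are all hypothetical. The sketch also skips the HTML text extraction and wraps the decoded page body as-is.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import Request, urlopen


def fetch(url, timeout=30):
    """Download one URL and return (content_type, raw_bytes)."""
    request = Request(url, headers={"User-Agent": "timeline-collector-sketch"})
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get_content_type(), response.read()


def to_record(url, content_type, body):
    """Keep PDFs as raw bytes; decode anything else and wrap it in JSON."""
    if content_type == "application/pdf":
        return {"url": url, "kind": "pdf", "data": body}
    text = body.decode("utf-8", errors="replace")
    return {"url": url, "kind": "json", "data": json.dumps({"url": url, "text": text})}


def collect(urls, fetcher=fetch, max_workers=8):
    """Fetch URLs concurrently and return (records, failures)."""
    records, failures = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetcher, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                content_type, body = future.result()
                records.append(to_record(url, content_type, body))
            except Exception as exc:
                # One dead link should not stop the whole run.
                failures[url] = str(exc)
    return records, failures
```

Threads suit this job because the work is dominated by waiting on the network rather than by computation. Passing the fetcher in as a parameter makes it possible to try the pipeline with a fake downloader before pointing it at several hundred real URLs.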

3. Create

The final step is to create an LLM-backed knowledge store. The project simply embeds the document corpus using OpenAI’s text-embedding-ada-002 model. The vast majority of the text in the collected documents is well-written English narrative. The text-embedding-ada-002 model does quite well at embedding the documents for question answering. The project provides three examples of how to use the embedded documents. It demystifies what is meant by “context” and what a custom knowledge store is really all about.
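To make “context” concrete, here is a minimal sketch of retrieval over embedded documents. It deliberately swaps the real embedding model for a toy bag-of-words vector so that it runs without an API key. The names `toy_embed`, `build_store`, `retrieve`, and `build_prompt`, and the prompt wording, are assumptions for illustration rather than the project's code.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_store(docs, embed=toy_embed):
    """docs maps a source name to its text; each is stored with its vector."""
    return [(source, text, embed(text)) for source, text in docs.items()]


def retrieve(store, question, k=2, embed=toy_embed):
    """Return the k stored documents most similar to the question."""
    query = embed(question)
    ranked = sorted(store, key=lambda item: cosine(query, item[2]), reverse=True)
    return ranked[:k]


def build_prompt(store, question, k=2):
    """The 'context' is the retrieved text pasted into the prompt with its source."""
    hits = retrieve(store, question, k)
    context = "\n\n".join(f"[{source}] {text}" for source, text, _ in hits)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
```

In a real store, `toy_embed` would be replaced by calls to an embedding model and the prompt would be sent to an LLM. Keeping the source name next to each retrieved passage is what allows an answer to show where it came from.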

Why this project?

Two reasons. First, it’s an interesting topic. What new insights might a custom knowledge store about the Russia-Ukraine conflict offer? Take a look at the sample below. I know that Starlink [6] has played a role in the conflict. The knowledge base was able to give a great synopsis of the main points, some important fine-grained details, and finally provenance. The bottom of the image shows where the answer came from. These sources would be vital reading for those interested in Starlink and its use in the conflict:

Second, I wanted this project to be an example for the Canisius students in our Computer Science and Cybersecurity programs. I have explained to students who have ideas for entrepreneurship or research projects: Anyone can have an idea. A person can have a PowerPoint, a white paper, a compelling pitch. In tandem, you need something tangible. You need a prototype. You need software. You need a manifestation of the idea in some way, shape, or form. I plan to use this project to explain to students in my CSC 213 class how an idea migrates into code and then into a project you can share with others. It’s a long and winding road where the key is attention to detail.

Expect more projects from COSI in the future!

Footnotes

[1] Conflict Timeline: https://nsarchive.gwu.edu/document/29562-cyber-vault-ukraine-timeline

[2] OpenAI LLM: https://openai.com

[3] Claude LLM: https://claude.ai/chats

[4] LangChain LLM: https://chat.langchain.com

[5] Ollama LLM: https://ollama.ai

[6] Starlink: https://www.starlink.com/