
FidoNet and Generative AI: A New Approach to Museumification of Historical Content Resources

Presented at the IEEE HISTELCON 2023 conference.

Dmitri Soshnikov

September 08, 2023



Transcript

  1. FIDONET: Cybernetic Immortality. FidoNet and Generative AI: A New Approach to Museumification of Historical Content Resources. Vasiliy Burov, Dmitry Soshnikov
  2. FidoNet, BBS, etc. • Founded in 1984 by Tom Jennings • Very popular in the 1990s • A specific culture and communities • A distinctive writing style and quoting conventions
  3. MUSEUMIFICATION of traditional content • Museums have learned to present traditional content as physical artifacts • and to make replicas of ancient books so that visitors can leaf through them • But this is only static content…
  4. MUSEUMIFICATION of digital content • Sites / documents / programs • User-generated content • FidoNet, Usenet, IRC are not just frozen content, but also a style and a dynamic… How can we recreate them?
  5. Goal: FidoNet cybernetic immortality • Train a language model capable of generating an infinite stream of plausible FidoNet messages • What does this show? • The spirit of FidoNet is still alive • The idea of cybernetic immortality, and that some form of it is already possible • Problems • Obtaining the dataset • Choosing and training the base language model • Building the exhibit
  6. DATASET

     Source                              Years      Size (original)     Size (cleaned)
     Fido7 Usenet Archives               2013-2015  16 GB (compressed)  -
     Private Archives (JAM)              2001-2004  100 MB              88 MB
     English Usenet fido group archives  1997-2002  1.7 GB              0.8 GB
     ExecPC BBS Archives (en)            1997-1999  500 MB              500 MB

     • Datasets are very difficult to find, because the content lived on different media at different times • Google's Usenet archives cannot be scraped • There is no single point of aggregation (separate echo conferences on different BBS systems / backbones)
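As a rough illustration of the cleaning step, the sketch below normalizes a directory of exported message dumps into a single plain-text training corpus. The directory layout, the `=== message ===` separator, and the size threshold are assumptions for illustration, not the authors' actual pipeline; only the kludge-line names (SEEN-BY, PATH) come from the FidoNet standards.

```python
from pathlib import Path

# Hypothetical layout: one exported text dump per echo conference.
SRC = Path("archives/raw")
DST = Path("archives/fido_corpus.txt")

def clean_message(text: str) -> str:
    """Drop control/kludge lines but keep 'XX>' quoting intact,
    since the quoting style is part of what the model should learn."""
    kept = []
    for line in text.splitlines():
        # FTS-0001 kludge lines start with ^A; SEEN-BY/PATH are routing noise.
        if line.startswith(("\x01", "SEEN-BY:", "PATH:")):
            continue
        kept.append(line.rstrip())
    return "\n".join(kept).strip()

with DST.open("w", encoding="utf-8") as out:
    for dump in sorted(SRC.glob("*.txt")):
        raw = dump.read_text(encoding="cp866", errors="replace")  # common Russian Fido encoding
        for msg in raw.split("=== message ==="):  # hypothetical dump separator
            msg = clean_message(msg)
            if len(msg) > 50:  # drop near-empty fragments
                out.write(msg + "\n<|endoftext|>\n")  # GPT-style document boundary
```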
  7. Base language model selection

     Model      Size                Comment
     LSTM       100K - 1M           Training from scratch
     GPT-2      124M / 774M / 1.5B
     ruGPT-3    117M / 760M / 1.3B
     ruGPT-3.5  13B
     LLaMA      7B+

     • ~12 hours of training per epoch on an Nvidia A100 80 GB GPU • A compromise between training time and the dataset size required to avoid overfitting
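For illustration, a minimal fine-tuning loop over the cleaned corpus, assuming the small ruGPT-3 checkpoint (`sberbank-ai/rugpt3small_based_on_gpt2`) as the base; the slide lists several candidates, and the hyperparameters below are placeholders, not the values used for the exhibit.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "sberbank-ai/rugpt3small_based_on_gpt2"  # assumed base model
tok = AutoTokenizer.from_pretrained(BASE)
tok.pad_token = tok.eos_token  # GPT-style tokenizers have no pad token by default
model = AutoModelForCausalLM.from_pretrained(BASE)

ds = load_dataset("text", data_files="archives/fido_corpus.txt")["train"]
ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=512),
            batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="fido-gpt",
    num_train_epochs=1,            # the slide reports ~12 h per epoch on an A100 80 GB
    per_device_train_batch_size=4,
    fp16=True,
)
Trainer(
    model=model,
    args=args,
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # causal LM objective
).train()
```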
  8. Results

     Topic: UFO
     From: RON TAYLOR
     To: JACK SARGEANT
     Subj: UFOs

     JS> RT> You are certainly entitled to your opinion as an engineer or scientist, to
     JS> RT> the extent that you're entitled to an opinion with which you are
     JS> RT> absolutely convinced. However, if you are a skeptic, you shouldnt
     JS> RT> make your point clearly and firmly, not to "sell your ideas to me".
     JS>If you think of a UFO as merely a "flying disk" without an ET engineer behind the
     JS>scenes, you are free to believe that. I believe that the UFO is
     JS>real, but I am not content to just speculate about its nature.

     Because I'm not a skeptic and there are other people in this conference that are
     skeptic's for the most part...

     -Ron

     * QMPro 1.02 42-7029 * Why are there SO many atheists? Because God lets them.
     --- WILDMAIL!/WC v4.12
     * Origin: CrimeBytes:Take A MegaByte Out Of Crime! (305)592-9831 (1:135/5.0)

     https://huggingface.com/estonto/fido-gpt

     • The generated text is not present in the training dataset • The quoting style is reproduced correctly (including name abbreviations) • Names, however, often do come from the training dataset => overfitting on names due to the small dataset size/variability
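The published checkpoint can be sampled directly with the standard transformers text-generation pipeline; the seed string and sampling parameters below are illustrative, not those used to produce the example above.

```python
from transformers import pipeline

gen = pipeline("text-generation", model="estonto/fido-gpt")  # checkpoint linked on the slide

out = gen(
    "Topic: UFO\nFrom: ",  # seed in the message-header style the model was trained on
    max_new_tokens=300,
    do_sample=True,
    top_p=0.95,
    temperature=0.9,
)
print(out[0]["generated_text"])
```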
  9. Implementation

     [Architecture diagram: Client Web App and Messenger App clients connect to a Cloud Server (GPU); messages are pre-generated]
     http://soshnikov.com/art/fidoci
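Pre-generation decouples the GPU server from the clients: the model fills a queue offline, and the exhibit front ends only pop messages from it. Below is a minimal sketch of the serving side; the FastAPI app, the `/message` endpoint, and the file-backed queue are hypothetical, not the exhibit's actual code.

```python
import json
from collections import deque
from pathlib import Path

from fastapi import FastAPI, HTTPException

app = FastAPI()

# Messages pre-generated offline by the GPU server, one JSON object per line.
QUEUE = deque(json.loads(line)
              for line in Path("pregen.jsonl").read_text(encoding="utf-8").splitlines())

@app.get("/message")
def next_message():
    """Return the next pre-generated FidoNet-style message."""
    if not QUEUE:
        raise HTTPException(status_code=503, detail="queue empty, regeneration pending")
    return QUEUE.popleft()
```

A client (the web app or a messenger bot) would then poll the endpoint, so spikes in visitor traffic never touch the GPU.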
  10. Further work • An alternative approach: generating conversations as a dialogue between conversational models with different personalities • Training models for other languages • Implementing user interaction through chatbots
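One way to realize the dialogue idea would be to alternate two persona contexts over the same model; a rough sketch, with the personas and sampling settings invented for illustration.

```python
from transformers import pipeline

gen = pipeline("text-generation", model="estonto/fido-gpt")

# Hypothetical personas expressed as FidoNet-style From:/To: headers.
personas = [("RON TAYLOR", "JACK SARGEANT"), ("JACK SARGEANT", "RON TAYLOR")]
history = "Topic: UFO\n"

for turn in range(4):
    frm, to = personas[turn % 2]
    prompt = history + f"From: {frm}\nTo: {to}\nSubj: UFOs\n"
    out = gen(prompt, max_new_tokens=120, do_sample=True, top_p=0.95)
    reply = out[0]["generated_text"][len(prompt):]  # keep only the newly generated part
    history = prompt + reply + "\n"
    print(f"--- {frm} ---\n{reply}\n")
```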