
FidoNet and Generative AI: A New Approach to Museumification of Historical Content Resources


This is a presentation at IEEE HISTELCON 2023 Conference

Dmitri Soshnikov

September 08, 2023

Transcript

  1. FIDONET Cybernetic immortality
    FidoNet and Generative AI: A New Approach to Museumification of Historical Content Resources
    Vasiliy Burov, Dmitry Soshnikov


  2. Reality
    history museum
    visitors


  3. FidoNet, BBS, etc.
    • Founded in 1984 by Tom Jennings
    • Very popular in the 1990s
    • Specific culture & communities
    • Writing style, quoting


  4. FidoNet popularity
    • HUMOR
    • HUMOR.FILTERED
    • STARWARS
    • ZX.SPECTRUM
    • FIDONET.HISTORY


  5. Museification of traditional content
    • Museums have learned to present traditional content as physical artifacts
    • and to make replicas of ancient books so that visitors can leaf through them
    • But this is only static content…


  6. Museification of digital content
    Sites / Documents / Programs | User-generated content
    • FidoNet, Usenet, IRC: not only the frozen content itself, but also the style and dynamics…
    How can we recreate it?


  7. Goal: FidoNet cybernetic immortality
    • Train a language model capable of generating an infinite stream of potential FidoNet messages
    • What does this show?
      • The spirit of FidoNet is still alive
      • The idea of cybernetic immortality, and that some form of it is already possible
    • Problems
      • Obtaining the dataset
      • Choosing and training the base language model
      • Making an exhibit


  8. DATASET
    Source                              Years      Size (original)     Size (cleaned)
    Fido7 Usenet Archives               2013-2015  16 GB (compressed)  -
    Private Archives (JAM)              2001-2004  100 MB              88 MB
    English Usenet fido group archives  1997-2002  1.7 GB              0.8 GB
    ExecPC BBS Archives (en)            1997-1999  500 MB              500 MB
    • Datasets are very difficult to find because of the variety of storage media in use at the time
    • Google's Usenet archives cannot be scraped
    • There is no single point of aggregation (separate echoes on different BBS systems / backbones)
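
The slide does not say how the "cleaned" sizes were obtained. As a rough illustration only, one plausible cleaning pass drops FidoNet control lines and exact duplicates before training. The function names and the particular rules (skip kludge lines that start with the SOH character and `SEEN-BY:` routing lines, then hash the remaining body) are assumptions made for this sketch, not the authors' pipeline.

```python
import hashlib


def clean_message(raw: str) -> str:
    """Strip control lines from one raw message (assumed rules, see lead-in)."""
    kept = []
    # FidoNet message text commonly uses a bare carriage return as line ending.
    for line in raw.replace("\r\n", "\n").replace("\r", "\n").split("\n"):
        # Kludge lines start with SOH (0x01); SEEN-BY lines are routing data.
        if line.startswith("\x01") or line.startswith("SEEN-BY:"):
            continue
        kept.append(line.rstrip())
    return "\n".join(kept).strip()


def deduplicate(messages):
    """Keep the first copy of each message body (exact match after cleaning)."""
    seen = set()
    unique = []
    for msg in messages:
        body = clean_message(msg)
        digest = hashlib.sha1(body.encode("utf-8")).hexdigest()
        if body and digest not in seen:
            seen.add(digest)
            unique.append(body)
    return unique
```

The tear line and origin line are deliberately kept here, since the generated sample in the Results slide reproduces them.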


  9. Base language model selection
    Model      Size              Comment
    LSTM       100K - 1M         Training from scratch
    GPT-2      124M, 774M, 1.5B
    ruGPT 3    117M, 760M, 1.3B
    ruGPT 3.5  13B
    LLaMA      7B+
    Training: 12 hours, 1 epoch, Nvidia A100 80 GB GPU
    The choice is a compromise between training time and the dataset size required to avoid overfitting


  10. Results
    Topic: UFO
    From: RON TAYLOR
    To: JACK SARGEANT
    Subj: UFOs
    JS> RT> You are certainly entitled to your opinion as an engineer or scientist, to
    JS> RT> the extent that you're entitled to an opinion with which you are
    JS> RT> absolutely convinced. However, if you are a skeptic, you shouldnt
    JS> RT> make your point clearly and firmly, not to "sell your ideas to me".
    JS>If you think of a UFO as merely a "flying disk" without an ET engineer behind the
    JS>scenes, you are free to believe that. I believe that the UFO is
    JS>real, but I am not content to just speculate about its nature.
    Because I'm not a skeptic and there are other people in this
    conference that are skeptic's for the most part...
    -Ron
    * QMPro 1.02 42-7029 * Why are there SO many atheists? Because God lets them.
    --- WILDMAIL!/WC v4.12
    * Origin: CrimeBytes:Take A MegaByte Out Of Crime! (305)592-9831 1:135/5.0)
    https://huggingface.com/estonto/fido-gpt
    • Generated text is not present in the training dataset
    • Quoting style is correctly reproduced (including name abbreviations)
    • Names are often present in the training dataset => overfitting on names due to the small size and low variability of the dataset
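
The first two observations above can be checked mechanically. The following is a minimal sketch, not the authors' evaluation code: it extracts the quote prefixes (such as `JS> RT>`) from a message body, compares them with the initials of the participants named in the header, and measures word n-gram overlap between a generated text and a reference corpus. All function names and the choice of 5-grams are assumptions for illustration.

```python
import re

# One or more "XX>" tags at the start of a line, e.g. " JS> RT> text".
QUOTE_PREFIX = re.compile(r"^\s*((?:[A-Za-z]{1,3}>\s*)+)")


def initials(name: str) -> str:
    """'JACK SARGEANT' -> 'JS'."""
    return "".join(word[0] for word in name.split()).upper()


def quote_initials(line: str):
    """Return the chain of quoted-author initials at the start of a line."""
    match = QUOTE_PREFIX.match(line)
    if not match:
        return []
    return re.findall(r"([A-Za-z]{1,3})>", match.group(1))


def quoting_is_consistent(body_lines, participants):
    """True if quotes exist and every quote tag matches a participant's initials."""
    allowed = {initials(name) for name in participants}
    used = {tag.upper() for line in body_lines for tag in quote_initials(line)}
    return bool(used) and used <= allowed


def ngram_overlap(generated: str, corpus: str, n: int = 5) -> float:
    """Fraction of word n-grams of `generated` that also occur in `corpus`."""
    def ngrams(text):
        words = text.split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    gen = ngrams(generated)
    return len(gen & ngrams(corpus)) / len(gen) if gen else 0.0
```

For the sample message on this slide, the tags `JS` and `RT` match the initials of JACK SARGEANT and RON TAYLOR from the header, so `quoting_is_consistent` would return `True`. An overlap close to zero against the training corpus would support the claim that the text is not copied from it.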


  11. Implementation
    [Architecture diagram: Client Web App, Messenger App, Cloud Server (GPU), Pre-generation]
    http://soshnikov.com/art/fidoci
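
The diagram labels suggest that messages can be generated ahead of time on the GPU server and then served to the web and messenger clients without waiting for the model. A minimal sketch of such a pre-generation buffer follows, under that assumption. `MessagePool` and `fake_generate` are hypothetical names, and the stub generator stands in for the fine-tuned model.

```python
import itertools
from collections import deque


class MessagePool:
    """Bounded buffer of pre-generated messages served to clients."""

    def __init__(self, generate, capacity=100):
        self._generate = generate  # callable returning one message (the model call)
        self._pool = deque()
        self._capacity = capacity

    def refill(self):
        """GPU side: top the pool up to capacity. Returns the number added."""
        added = 0
        while len(self._pool) < self._capacity:
            self._pool.append(self._generate())
            added += 1
        return added

    def next_message(self):
        """Serving side: pop a ready message, generating only if the pool is empty."""
        if not self._pool:
            return self._generate()
        return self._pool.popleft()


# Stub standing in for the fine-tuned model (assumption for this sketch).
_counter = itertools.count(1)


def fake_generate():
    return f"Generated message #{next(_counter)}"
```

In this arrangement `refill` could run as a periodic batch job while the GPU is available, and clients would only ever call `next_message`.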


  12. Further work
    • Alternative approach: generating a conversation as a dialogue between several conversational models with different personalities
    • Training models for other languages
    • Implementing user interaction through chatbots


  13. Questions?
    • Vasily Burov
    • Dmitry Soshnikov (http://soshnikov.com)
