20211021-datadrink-textmining-ce

etalab-ia
October 21, 2021

Transcript

  1. Text mining as a support for public consultation: multilingual clustering

    Datadrink, 21/10/21
    Nicolas Stefanovitch, Guillaume Jacquet (JRC.I.3 Text and Data Mining)
  2. Context: Conference on the Future of Europe

    • EU-wide multilingual public consultation:
      - 24+ languages
      - 3.5 million unique visitors
      - 140 000 participants
      - 26 000 contributions
    • Aim:
      - Make sense of a large number of multilingual contributions
      - Identify clusters of linked ideas
      - Find related ideas
  3. Methodology and technology

    • How it works:
      - Aligned multilingual sentence embeddings (~100 languages)
      - Ad hoc search, clustering and visualisation algorithms
    • References:
      - "Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond"
      - "A Survey of Cross-lingual Word Embedding Models"
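
Because aligned embeddings place sentences from different languages in one shared vector space, finding related ideas reduces to a nearest-neighbour lookup. The following is a minimal sketch of that idea only, not the authors' implementation: the 3-dimensional vectors are made-up toy values standing in for real encoder output, and `cosine` and `find_related` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def find_related(query_vec, corpus, top_k=2):
    """Return the top_k texts whose vectors are closest to query_vec."""
    scored = [(cosine(query_vec, vec), text) for text, vec in corpus.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [text for _, text in scored[:top_k]]

# Toy "embeddings" (assumption): a real system would obtain these from a
# multilingual sentence encoder, with far more dimensions.
corpus = {
    "More trains between EU capitals": [0.9, 0.1, 0.0],
    "Mehr Nachtzüge in Europa": [0.8, 0.2, 0.1],
    "Réduire les déchets plastiques": [0.1, 0.9, 0.2],
}

# Hypothetical query vector for a contribution about rail transport.
query = [1.0, 0.0, 0.0]
related = find_related(query, corpus, top_k=2)
```

In this toy setup the English and German rail contributions come back together even though they share no vocabulary, which is the property the alignment is meant to provide.
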
  4. Cluster quality evaluation

    • For a given set of parameters:
      - 8000+ sentences
      - 100+ clusters
    • Measure:
      - Unsupervised: silhouette score (range: [-1, 1])
    • Summary clusters: silhouette 0.14
    • Whole clusters: silhouette -0.001
    • “Topic” clusters: silhouette -0.023
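
For readers unfamiliar with the measure, here is a minimal sketch of the standard silhouette score. The slides do not state which distance was used, so Euclidean distance is an assumption here, and the four 2-D points are toy data rather than anything from the consultation.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def silhouette_score(points, labels):
    """Mean silhouette over all points; the result lies in [-1, 1]."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster
        # (the distance to itself is 0, hence the len(own) - 1 divisor).
        a = sum(euclidean(point, other) for other in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(euclidean(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)

# Toy data (assumption): two tight, well-separated groups.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good_score = silhouette_score(points, [0, 0, 1, 1])  # labels match the groups
bad_score = silhouette_score(points, [0, 1, 0, 1])   # labels cut across them
```

Labels that match the geometry score close to 1, and labels that cut across it go negative. Scores near 0, like those reported on the slide, indicate clusters that overlap or sit very close to one another.
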
  5. Key takeaways

    • Conference on the Future of Europe: https://futureu.europa.eu/
    • Information access and summarization in a highly multilingual environment
    • Supporting “Data for Policy” with text mining tools
    • NLP as a key support in “Data for Policy”