Slide 1

Slide 1 text

GATE: General Architecture for Text Engineering
Presented by Ahmed Magdy Ezzeldin

Slide 2

Slide 2 text

What is Text Engineering?
● Text or Language Engineering means applying scientific principles to the design, construction and maintenance of tools that help deal with information expressed in natural languages (the languages that people use to communicate with one another).

Slide 3

Slide 3 text

Applications
● Automatic summarization
● Co-reference resolution
● Discourse analysis (elaboration, explanation, contrast, question, statement, assertion)
● Machine translation
● Morphological segmentation
● Named entity recognition
● Natural language generation
● Natural language understanding
● Optical character recognition (OCR)
● Part-of-speech tagging
● Parsing
● Question answering
● Relationship extraction
● Sentiment analysis (polarity)
● Speech recognition (speech segmentation)
● Sentence breaking, word segmentation, topic segmentation
● Word sense disambiguation

Slide 4

Slide 4 text

What is GATE?
● General Architecture for Text Engineering
● Java suite of NLP tools
● Developed at the University of Sheffield
● Initial release: 1995 (17 years ago)
● Last stable release: 6.1 (May 6, 2011)
● Languages: English, Spanish, Chinese, Arabic, Bulgarian, French, German, Hindi, Italian, Cebuano, Romanian, Russian
● Accepted input formats: TXT, HTML, XML, Doc, PDF
● Datastores: Java Serial, PostgreSQL, Lucene, Oracle databases
● GATE Developer is the GATE graphical user interface. Like Eclipse for Java programmers, it provides a graphical environment for research and development of language processing software.

Slide 5

Slide 5 text

GATE Components and APIs

Slide 6

Slide 6 text

ANNIE GATE Application
● A Nearly-New Information Extraction System
● Example application for English language engineering
● A set of modules:
● Tokenizer
● Gazetteer
● Sentence splitter
● Part-of-speech tagger
● Named entities transducer
● Co-reference tagger
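
The idea of a module pipeline, where each module adds annotations over character offsets of the same text, can be sketched in a few lines. This is a standalone Python illustration, not GATE or ANNIE code: the `Annotation` class, the `tokenize`, `split_sentences` and `run_pipeline` functions, and the regular expressions are all assumptions made for the example, and only two of the six modules are imitated.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    type: str   # annotation type, e.g. "Token" or "Sentence"
    start: int  # character offset where the span begins
    end: int    # character offset just past the span


def tokenize(text):
    """Emit one Token annotation per word or punctuation mark."""
    return [Annotation("Token", m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def split_sentences(text):
    """Emit one Sentence annotation per run of text ending in '.', '!' or '?'."""
    return [Annotation("Sentence", m.start(), m.end())
            for m in re.finditer(r"[^.!?\s][^.!?]*[.!?]?", text)]


def run_pipeline(text, stages):
    """Run each stage in order over the same text and collect all annotations."""
    annotations = []
    for stage in stages:
        annotations.extend(stage(text))
    return annotations


text = "GATE is free. It runs on Java."
annotations = run_pipeline(text, [tokenize, split_sentences])
sentences = [text[a.start:a.end] for a in annotations if a.type == "Sentence"]
tokens = [text[a.start:a.end] for a in annotations if a.type == "Token"]
```

Storing offsets instead of copied strings keeps the original text untouched, so later stages can layer further annotation types over the same document.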

Slide 7

Slide 7 text

ANNIE Architecture

Slide 8

Slide 8 text

Demos
● ANNIE Gazetteer: a list lookup component. The list files are located in $GATE_HOME/plugins/ANNIE/resources/gazetteer
● JAPE Transducer: JAPE is the Java Annotation Patterns Engine. It provides finite state transduction over annotations based on regular expressions. Example files are located in $GATE_HOME/plugins/ANNIE/resources/NE
● ANNIE NE Transducer (ANNIE named entity grammar): a semantic tagger based on the JAPE language.
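
A rough sketch of how list lookup and a pattern rule over its results fit together is given here. It is plain Python, not JAPE syntax and not the ANNIE implementation: the `GAZETTEER` contents, the `lookup` and `person_rule` functions, and the "title followed by a capitalised token" pattern are hypothetical, chosen only to show a rule that matches on annotations rather than on raw characters.

```python
import re

# Hypothetical in-memory lists standing in for the gazetteer list files.
GAZETTEER = {
    "title": {"Dr", "Mr", "Prof"},
    "location": {"Sheffield", "Cairo"},
}


def lookup(tokens):
    """Return (token index, list name) for every token found in a list."""
    hits = []
    for i, tok in enumerate(tokens):
        for list_name, entries in GAZETTEER.items():
            if tok in entries:
                hits.append((i, list_name))
    return hits


def person_rule(tokens, hits):
    """Pattern: a 'title' lookup followed by a capitalised token gives a Person."""
    title_positions = sorted(i for i, list_name in hits if list_name == "title")
    people = []
    for i in title_positions:
        if i + 1 < len(tokens) and tokens[i + 1][:1].isupper():
            people.append(" ".join(tokens[i:i + 2]))
    return people


tokens = re.findall(r"\w+", "Dr Smith works in Sheffield")
hits = lookup(tokens)
people = person_rule(tokens, hits)
```

The rule never inspects the list contents itself; it only consumes the lookup results, which is the same division of labour as between the gazetteer and the transducer.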

Slide 9

Slide 9 text

Mimir
● Provides indexing and searching of the linguistic and semantic information generated by GATE
● Demo
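
To make "indexing and searching annotations" concrete, here is a minimal sketch of an inverted index keyed on annotation type and covered text. It does not reflect how Mimir stores its indexes or what its query language looks like; `build_index`, `search` and the sample documents are assumptions for illustration only.

```python
from collections import defaultdict


def build_index(annotated_docs):
    """Map each (annotation type, covered text) pair to the documents containing it."""
    index = defaultdict(set)
    for doc_id, annotations in annotated_docs.items():
        for ann_type, covered_text in annotations:
            index[(ann_type, covered_text)].add(doc_id)
            # A None text acts as a wildcard: any annotation of this type.
            index[(ann_type, None)].add(doc_id)
    return index


def search(index, ann_type, covered_text=None):
    """Return the sorted ids of documents with a matching annotation."""
    return sorted(index.get((ann_type, covered_text), set()))


# Hypothetical annotation output for two documents.
docs = {
    "doc1": [("Person", "Dr Smith"), ("Location", "Sheffield")],
    "doc2": [("Location", "Cairo")],
}
index = build_index(docs)
all_locations = search(index, "Location")
smith_docs = search(index, "Person", "Dr Smith")
```

Searching by annotation type instead of by keyword is what allows a question such as "which documents mention any location" to be answered without listing every place name.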

Slide 10

Slide 10 text

Installing Mimir

Slide 11

Slide 11 text

● Open GATE and load the ANNIE system with defaults
● Then click Manage CREOLE Plugins

Slide 12

Slide 12 text

Add Mimir Client Path
● Add Mimir as a plugin and set the mimir-client directory

Slide 13

Slide 13 text

● Make sure the Mimir plugin is set to load now and every time you open GATE

Slide 14

Slide 14 text

● Add the Mimir Indexing PR to the Processing Resources

Slide 15

Slide 15 text

● Create a New Corpus from Language Resource

Slide 16

Slide 16 text

● Right-click the corpus and populate it with documents

Slide 17

Slide 17 text

Edit the Default Index Template
● Open http://localhost:8080/mimir-demo in your browser and go to the configuration page
● Then go to the Index Templates section and manage them
● Then click on the default Index Template to edit it

Slide 18

Slide 18 text

Add some annotations to the Default Index Template

Slide 19

Slide 19 text

Add a new Index

Slide 20

Slide 20 text

Edit the Index you created and set the Scorer Algorithm (1) (2) (3)

Slide 21

Slide 21 text

Copy the Index URL

Slide 22

Slide 22 text

Paste Index URL in Mimir and Run ANNIE on the Corpus

Slide 23

Slide 23 text

Double-click any document and check the annotations yourself

Slide 24

Slide 24 text

Close and Search the Index (1) (2)

Slide 25

Slide 25 text

Example Query

Slide 26

Slide 26 text

Thank you

Slide 27

Slide 27 text

References
● http://gate.ac.uk : GATE Website (it is huge)
● http://www.wikipedia.com : Mother of all Knowledge