Molly - First Committee Meeting

rdrake
October 24, 2011

These are the slides from my first ever thesis committee meeting (Oct. 24, 2011).

Transcript

  1. Why Keywords?
     • Matching keywords is fast
     • Full-text search engines are insanely fast
     • Usually produces acceptable results
     • How often are you dissatisfied with Google?
     Monday, 24 October, 11
  2. Why Not?
     • “Good enough” is not always good enough
     • Fuzzy searching is possible, but this only fixes spelling errors
     • Can also deal with varying suffixes (stemming), synonyms, etc.
     • Fast, but not necessarily “accurate”
     • Great for individual documents, not so much for linked documents
  3. What If...
     • A system could infer the desired knowledge from keywords
     • Users familiar with interface
     • Users don’t have to remember exactly what they want
     • Users think “my class at 5 PM on Thursday,” not “Databases”
     • This system could scale to support thousands of concurrent users
  4. The Goal
     A system that infers the desired document(s) given a user’s keyword query
  5. Language of Choice: Clojure
     • Functional, Lisp dialect
     • (println "Hey there!")
     • Powerful macro system
     • Built on top of the JVM
     • Plethora of Java libraries at our disposal
     • Pretty fast
     • Ubiquitous
  6. Operating System: Linux & OS X
     • Given the JVM’s ubiquitous nature, OS choice is a personal preference more than anything
     • I find a Unix-like platform preferable to develop on
     • Nothing stopping us from using Windows
  7. Tools: BASH & Git
     • BASH is powerful and available everywhere
     • Git is also powerful and distributed
     • Can collaborate with others easily
     • GitHub is a wonderful place to store code
  8. Lots of Libraries
     • Lucene
       • Fast, full-text search engine
       • Pure Java
     • SQLite
       • No daemon required
       • Slow, but great for development
  9. Lots of Libraries
     • Ring
       • Glue between HTTP and higher-level Clojure web frameworks
       • Rack for Clojure
     • Compojure
       • Web framework
  10. Current List of Libraries
     • Clojure
     • Clojure Contrib (JDBC, JSON, CLI)
     • SQLite JDBC Driver
     • ClojureQL
     • Ring
     • Compojure
  11. The Corpus: Mycampus Database
     • Built from multiple sources (uoit.ca/{directory,mycampus}, course calendars)
     • Contains the following relationships:
       • Courses (code, title, description)
       • Instructors (id, name)
       • Schedules (date_start, date_end, day, schedtype, hour_start, hour_end, min_start, min_end, classtype, location, section_id)
       • Sections (actual, campus, capacity, credits, levels, reg_start, reg_end, semester, sec_code, sec_number, year, course)
       • Teaches (id, schedule_id, instructor_id, position)
  12. The Indexer
     • Takes in a configuration file
     • Builds three indices:
       1. Values - Set of distinct strings in each table (n-gram’d)
       2. Entities - Each row in a table becomes an entity
       3. Groups - Linked entities combined into groups
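
The "Values" index described above can be sketched roughly as follows. This is only an illustration under assumptions: the slides do not say which n-gram size or analyzer is used, and the function names `ngrams` and `values-index` are hypothetical. Character trigrams and a plain in-memory map stand in for whatever the real indexer stores in Lucene.

```clojure
(require '[clojure.string :as str])

(defn ngrams
  "Returns the character n-grams of length n in s, lower-cased.
  Strings shorter than n produce no n-grams."
  [n s]
  (->> (str/lower-case s)
       (partition n 1)
       (map #(apply str %))))

(defn values-index
  "Maps each n-gram to the set of distinct values that contain it.
  `values` is assumed to be a collection of distinct strings."
  [n values]
  (reduce (fn [idx v]
            (reduce (fn [m g] (update-in m [g] (fnil conj #{}) v))
                    idx
                    (ngrams n v)))
          {}
          values))

;; Assumed sample values; the trigram size of 3 is an arbitrary choice.
(def sample-index
  (values-index 3 #{"csci 5010g" "Survey of Computer Science"}))
```

Looking up an n-gram such as `(get sample-index "501")` then returns the set of stored values containing it, which is what lets a partial keyword find a full value.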
  13. Configuration File
     • Written entirely in Clojure
     • Specifies entities and values to index, group hierarchy

     {:course {:name :course
               :id :id
               :sql (-> (cql/table :courses)
                        (cql/project [[:code :as :id] :title :description]))
               :values [:code :title :description]}
  14. Rows to Entities
     • How do we go from here

         csci 5010g | Survey of Computer Science | This course is a survey of some of the main...

     • To here?

         Type        Course
         ID          csci 5010g
         Attributes  code         csci 5010g
                     title        Survey of Computer Science
                     description  This course is a survey of some of the main...
  15. Rows to Entities
     • Configuration file tells us type, ID column
     • Rest are considered attributes
     • Make use of Clojure data structures (specifically maps)

     {:__type__ "Course"
      :__id__ "csci 5010g"
      :__attrib__ {:code "csci 5010g"
                   :title "Survey of Computer Science"
                   :description "This course is a survey of some of the main..."}}
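
A minimal sketch of how a row might be wrapped into the entity map shown above. The function name `row->entity` and its parameters `type-name` and `id-col` are hypothetical stand-ins for what the configuration file supplies; the slides do not show the actual conversion code. Following the slide's example, the ID column's value is also kept among the attributes.

```clojure
(defn row->entity
  "Wraps a row map in the entity shape from the slide.
  type-name and id-col are assumed to come from the configuration file."
  [type-name id-col row]
  {:__type__   type-name
   :__id__     (get row id-col)
   :__attrib__ row})

;; The row from the slide, represented as a Clojure map.
(def example-row
  {:code        "csci 5010g"
   :title       "Survey of Computer Science"
   :description "This course is a survey of some of the main..."})

(def example-entity
  (row->entity "Course" :code example-row))
```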
  16. Entities to Documents
     • Simply flatten the entity
     • Insert the resulting map into the document as key-value pairs

     (into {:__type__ (entity :__type__)
            :__id__ (entity :__id__)}
           (entity :__attrib__))
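
To make the flattening step concrete, here is the same `into` expression wrapped in a function and applied to the course entity from the earlier slide. The name `entity->flat-map` is hypothetical, and the step that copies the pairs into an actual Lucene document is left out.

```clojure
(defn entity->flat-map
  "Merges the entity's attributes into a single map alongside its type and ID."
  [entity]
  (into {:__type__ (entity :__type__)
         :__id__   (entity :__id__)}
        (entity :__attrib__)))

(def course-entity
  {:__type__   "Course"
   :__id__     "csci 5010g"
   :__attrib__ {:code        "csci 5010g"
                :title       "Survey of Computer Science"
                :description "This course is a survey of some of the main..."}})

(def flat-course (entity->flat-map course-entity))
```

Because `into` adds the attributes last, an attribute whose key happened to be `:__type__` or `:__id__` would overwrite the entity's own value; the double-underscore names make such a clash unlikely.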
  17. How are Groups Stored?
     • From config file, we know the hierarchy
     • Use this instead of foreign key relationships from the data store
     • Store links to many entities in a single document
     • Hypergraph > graph
     • Matching one entity in a document pulls out all related entities
     • E.g., finding a course would also pull out section entities, instructor entities, etc.
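
One way to picture a group document is as a single record holding links to several entities, so that matching any one member yields all the others. The sketch below assumes a representation the slides do not specify: each group is a map with a `:members` set of `[type id]` pairs. The section and instructor IDs are placeholders, not values from the real database.

```clojure
;; Hypothetical group documents; "sec-1", "sec-2" and "inst-1" are placeholder IDs.
(def groups
  [{:__group__ "course"
    :members   #{["Course" "csci 5010g"]
                 ["Section" "sec-1"]
                 ["Instructor" "inst-1"]}}
   {:__group__ "course"
    :members   #{["Course" "placeholder-course"]
                 ["Section" "sec-2"]
                 ["Instructor" "inst-1"]}}])

(defn related
  "Returns every entity reference that shares a group with ref, excluding ref."
  [groups ref]
  (->> groups
       (filter #(contains? (:members %) ref))
       (mapcat :members)
       (remove #{ref})
       set))
```

With this data, `(related groups ["Course" "csci 5010g"])` returns the section and instructor references stored in the same group, without consulting any foreign keys.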
  18. At Our Disposal:
     • Every value in the database
       • Not repeated, n-gram analyzed
     • Every entity stored as a separate document
       • Can search for any text and return an entity that contains it
     • All entity groups
       • Can find related entities based on unique identifier
       • Traversing across groups requires recursive searches
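
The last point, traversing across groups, can be sketched as a repeated lookup that keeps following group links until nothing new turns up. This is an assumption-laden illustration: groups are modelled as plain sets of `[type id]` pairs, the IDs other than the course code are placeholders, and `reachable` is a hypothetical name. The real system would issue a search against the index at each step.

```clojure
;; Hypothetical groups, each a set of [type id] entity references.
(def entity-groups
  [#{["Course" "csci 5010g"] ["Section" "sec-1"] ["Instructor" "inst-1"]}
   #{["Course" "placeholder-course"] ["Section" "sec-2"] ["Instructor" "inst-1"]}])

(defn neighbours
  "Entity references that share at least one group with ref."
  [groups ref]
  (->> groups
       (filter #(contains? % ref))
       (apply concat)
       (remove #{ref})
       set))

(defn reachable
  "Follows group links from start until no new references appear.
  Returns every reference reached, excluding start itself."
  [groups start]
  (loop [seen #{start}
         frontier [start]]
    (if (empty? frontier)
      (disj seen start)
      (let [found (->> frontier
                       (mapcat #(neighbours groups %))
                       (remove seen)
                       set)]
        (recur (into seen found) (vec found))))))
```

Starting from the course, the first pass finds its own section and instructor; the second pass follows the shared instructor into the other group. Each pass corresponds to one more round of searching, which is where the cost of crossing groups comes from.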
  19. Future Directions†
     • The system should “learn” what users want
       • Users help train the system based on their input and reaction
     • The system must scale up
       • Far slower than traditional keyword-based search
     • User interface is a challenge
       • How do we present the results to users in a useful manner?
     †subject to change