Molly - First Committee Meeting

rdrake
October 24, 2011

These are the slides from my first ever thesis committee meeting (Oct. 24, 2011).

Transcript

  1. Why Keywords?
    • Matching keywords is fast
    • Full-text search engines are insanely fast
    • Usually produces acceptable results
    • How often are you dissatisfied with Google?
    Monday, 24 October, 11
  2. Why Not?
    • “Good enough” is not always good enough
    • Fuzzy searching is possible, but this only fixes spelling errors
    • Can also deal with varying suffixes (stemming), synonyms, etc.
    • Fast, but not necessarily “accurate”
    • Great for individual documents, not so much for linked documents
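To make the fuzzy-searching point concrete, here is a minimal sketch of Levenshtein edit distance, one common way to score how far apart two strings are. The slides do not say which fuzzy measure is meant, so `edit-distance` is an illustrative assumption, not the system's code.

```clojure
(defn edit-distance
  "Levenshtein distance between strings a and b, built one row at a time.
   Illustrative only; the slides do not name a specific fuzzy measure."
  [a b]
  (let [first-row (vec (range (inc (count b))))]
    (peek
     (reduce
      (fn [prev [i ca]]
        (reduce
         (fn [row [j cb]]
           (conj row
                 (min (inc (peek row))            ; insertion
                      (inc (nth prev (inc j)))    ; deletion
                      (+ (nth prev j)             ; substitution (free on a match)
                         (if (= ca cb) 0 1)))))
         [(inc i)]
         (map-indexed vector b)))
      first-row
      (map-indexed vector a)))))

;; A misspelling sits one edit away from the intended keyword.
(edit-distance "databses" "databases")
```

A small distance flags a likely typo, but a query such as "my class at 5 PM on Thursday" shares almost no characters with "Databases", so edit distance alone cannot bridge that gap.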
  3. What If...
    • A system could infer the desired knowledge from keywords
    • Users familiar with interface
    • Users don’t have to remember exactly what they want
    • Users think “my class at 5 PM on Thursday,” not “Databases”
    • This system could scale to support thousands of concurrent users
  4. The Goal
    A system that infers the desired document(s) given a user’s keyword query
  5. Language of Choice: Clojure
    • Functional, Lisp dialect
    • (println "Hey there!")
    • Powerful macro system
    • Built on top of the JVM
    • Plethora of Java libraries at our disposal
    • Pretty fast
    • Ubiquitous
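As a small illustration of two of the points above (JVM interop and macros), here is a sketch using only the Clojure core and the Java standard library. The names `hypotenuse`, `shout`, and `unless` are made up for this example.

```clojure
;; Java interop: call a static method and an instance method directly.
(defn hypotenuse [a b]
  (Math/sqrt (+ (* a a) (* b b))))

(defn shout [^String s]
  (.toUpperCase s))

;; A small macro: evaluates body only when condition is falsey.
;; Because it is a macro, body is not evaluated at all otherwise.
(defmacro unless [condition & body]
  `(if ~condition nil (do ~@body)))

(unless (empty? "keywords")
  (println (shout "Hey there!")))
```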
  6. Operating System: Linux & OS X
    • Given the JVM’s ubiquitous nature, OS choice is a personal preference more than anything
    • I find a Unix-like platform preferable to develop on
    • Nothing stopping us from using Windows
  7. Tools: BASH & Git
    • BASH is powerful, everywhere
    • Git is also powerful and distributed
    • Can collaborate with others easily
    • GitHub is a wonderful place to store code
  8. Lots of Libraries
    • Lucene
      • Fast, full-text search engine
      • Pure Java
    • SQLite
      • No daemon required
      • Slow, but great for development
  9. Lots of Libraries
    • Ring
      • Glue between HTTP and higher-level Clojure web frameworks
      • Rack for Clojure
    • Compojure
      • Web framework
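To show what "glue between HTTP and web frameworks" looks like in practice: in Ring, a handler is a plain function from a request map to a response map, and middleware is a function that wraps a handler. The sketch below uses no Ring code at all, only maps shaped like Ring's; `search-handler`, `wrap-served-by`, and the header name are hypothetical.

```clojure
(defn search-handler
  "Ring-style handler: takes a request map, returns a response map."
  [request]
  {:status  200
   :headers {"Content-Type" "text/plain"}
   :body    (str "Query string: " (:query-string request))})

(defn wrap-served-by
  "Hypothetical middleware: adds a header to whatever the handler returns."
  [handler label]
  (fn [request]
    (assoc-in (handler request) [:headers "X-Served-By"] label)))

(def app (wrap-served-by search-handler "molly"))

;; Calling the handler directly with a hand-built request map.
(app {:request-method :get
      :uri            "/search"
      :query-string   "q=databases"})
```

Because handlers are ordinary functions, they can be exercised like this without starting a server.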
  10. Current List of Libraries
    • Clojure
    • Clojure Contrib (JDBC, JSON, CLI)
    • SQLite JDBC Driver
    • ClojureQL
    • Ring
    • Compojure
  11. The Corpus: Mycampus Database
    • Built from multiple sources (uoit.ca/{directory,mycampus}, course calendars)
    • Contains the following relationships:
      • Courses (code, title, description)
      • Instructors (id, name)
      • Schedules (date_start, date_end, day, schedtype, hour_start, hour_end, min_start, min_end, classtype, location, section_id)
      • Sections (actual, campus, capacity, credits, levels, reg_start, reg_end, semester, sec_code, sec_number, year, course)
      • Teaches (id, schedule_id, instructor_id, position)
  12. The Indexer
    • Takes in a configuration file
    • Builds three indices:
      1. Values - Set of distinct strings in each table (n-gram’d)
      2. Entities - Each row in a table becomes an entity
      3. Groups - Linked entities combined into groups
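The Values index can be pictured with a short sketch. It assumes character n-grams (the slides only say "n-gram'd") and an in-memory map in place of a real search index; `char-ngrams` and `value-index` are hypothetical names.

```clojure
(require '[clojure.string :as str])

(defn char-ngrams
  "Distinct character n-grams of s, lower-cased.
   Assumption: the slides do not say whether n-grams are over characters or words."
  [n s]
  (->> (str/lower-case s)
       (partition n 1)
       (map #(apply str %))
       distinct))

(defn value-index
  "Maps each n-gram to the set of distinct values that contain it."
  [n values]
  (reduce (fn [idx v]
            (reduce (fn [m gram] (update-in m [gram] (fnil conj #{}) v))
                    idx
                    (char-ngrams n v)))
          {}
          (distinct values)))

;; Look up which stored values contain the trigram "501".
(get (value-index 3 ["csci 5010g" "Survey of Computer Science"]) "501")
```

Indexing n-grams rather than whole strings lets a partial keyword such as "501" still reach the stored value "csci 5010g".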
  13. Configuration File
    • Written entirely in Clojure
    • Specifies entities and values to index, group hierarchy

        {:course {:name   :course
                  :id     :id
                  :sql    (-> (cql/table :courses)
                              (cql/project [[:code :as :id] :title :description]))
                  :values [:code :title :description]}}
  14. Rows to Entities
    • How do we go from here

        csci 5010g | Survey of Computer Science | This course is a survey of some of the main...

    • To here?

        Type:        Course
        ID:          csci 5010g
        Attributes:
          code:        csci 5010g
          title:       Survey of Computer Science
          description: This course is a survey of some of the main...
  15. Rows to Entities
    • Configuration file tells us type, ID column
    • Rest are considered attributes
    • Make use of Clojure data structures (specifically maps)

        {:__type__   "Course"
         :__id__     "csci 5010g"
         :__attrib__ {:code        "csci 5010g"
                      :title       "Survey of Computer Science"
                      :description "This course is a survey of some of the main..."}}
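The row-to-entity step can be sketched as a single function. The config shape used here (`:type` and `:id-column`) is a simplified assumption and not the configuration file format from the slides. The example entity keeps the ID column among its attributes, so this sketch does too.

```clojure
(defn row->entity
  "Builds an entity map from a database row.
   entity-config is a hypothetical map with :type and :id-column keys."
  [{:keys [type id-column]} row]
  {:__type__   type
   :__id__     (get row id-column)
   :__attrib__ row})

(def course-row
  {:code        "csci 5010g"
   :title       "Survey of Computer Science"
   :description "This course is a survey of some of the main..."})

(row->entity {:type "Course" :id-column :code} course-row)
```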
  16. Entities to Documents
    • Simply flatten the entity
    • Insert the resulting map into the document as key-value pairs

        (into {:__type__ (entity :__type__)
               :__id__   (entity :__id__)}
              (entity :__attrib__))
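Wrapping that expression in a function gives a runnable sketch of the flattening step. `entity->document` and `document-fields` are hypothetical names, and the string pairs only stand in for whatever field objects the search library expects.

```clojure
(defn entity->document
  "Flattens an entity into a single-level map, as in the expression above."
  [entity]
  (into {:__type__ (entity :__type__)
         :__id__   (entity :__id__)}
        (entity :__attrib__)))

(defn document-fields
  "Turns the flat map into [field-name field-value] string pairs."
  [doc]
  (for [[k v] doc]
    [(name k) (str v)]))

(def course-entity
  {:__type__   "Course"
   :__id__     "csci 5010g"
   :__attrib__ {:code  "csci 5010g"
                :title "Survey of Computer Science"}})

(document-fields (entity->document course-entity))
```

One consequence of flattening with `into`: an attribute named `:__type__` or `:__id__` would overwrite the entity's own key, which is presumably why the reserved keys use double underscores.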
  17. How are Groups Stored?
    • From config file, we know the hierarchy
    • Use this instead of foreign key relationships from the data store
    • Store links to many entities in a single document
    • Hypergraph > graph
    • Matching one entity in a document pulls out all related entities
    • E.g., finding a course would also pull out section entities, instructor entities, etc.
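A group document of this kind might look like the sketch below: one map holding links to every entity in the group, so that matching any member returns all of them. The keys `:__root__` and `:__members__`, the function names, and the section and instructor IDs are hypothetical.

```clojure
(defn entity-ref
  "A link to an entity: its type and ID."
  [entity]
  [(:__type__ entity) (:__id__ entity)])

(defn group-document
  "One document linking a root entity to all of its related entities."
  [root related]
  {:__root__    (entity-ref root)
   :__members__ (set (map entity-ref (cons root related)))})

(defn groups-containing
  "All group documents that link to the given entity reference."
  [groups ref]
  (filter #(contains? (:__members__ %) ref) groups))

(def course     {:__type__ "Course"     :__id__ "csci 5010g"})
(def section    {:__type__ "Section"    :__id__ "sec-1"})    ; made-up ID
(def instructor {:__type__ "Instructor" :__id__ "instr-1"})  ; made-up ID

(def groups [(group-document course [section instructor])])

;; Matching the instructor pulls out the course and section as well.
(map :__members__ (groups-containing groups ["Instructor" "instr-1"]))
```

Each group is a single hyperedge over many entities, which is the sense in which the slide prefers a hypergraph to pairwise graph edges.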
  18. At Our Disposal:
    • Every value in the database
      • Not repeated, n-gram analyzed
    • Every entity stored as a separate document
      • Can search for any text and return an entity that contains it
    • All entity groups
      • Can find related entities based on unique identifier
      • Traversing across groups requires recursive searches
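Traversing across groups can be sketched as a repeated lookup: find the groups containing the entities found so far, collect their members, and stop when nothing new appears. Here each group is modelled as a plain set of entity references held in memory, an assumption made for illustration; the loop stands in for the repeated index searches the slide mentions.

```clojure
(require '[clojure.set :as set])

(defn related-entities
  "Every entity reference reachable from start by repeatedly looking up
   the groups that contain an already-found reference."
  [groups start]
  (loop [seen     #{start}
         frontier #{start}]
    (if (empty? frontier)
      seen
      (let [found (->> groups
                       (filter #(some frontier %))   ; groups touching the frontier
                       (apply set/union))
            fresh (set/difference found seen)]       ; only references not seen yet
        (recur (set/union seen fresh) fresh)))))

(def example-groups
  [#{["Course" "csci 5010g"] ["Section" "sec-1"]}      ; made-up IDs
   #{["Section" "sec-1"] ["Instructor" "instr-1"]}
   #{["Course" "other"]}])

;; Reaches the instructor through the shared section.
(related-entities example-groups ["Course" "csci 5010g"])
```

Each pass costs one search per frontier, so a deep group hierarchy means many round trips to the index.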
  19. Future Directions†
    • The system should “learn” what users want
      • Users help train the system based on their input and reaction
    • The system must scale up
      • Far slower than traditional keyword-based search
    • User interface is a challenge
      • How do we present the results to users in a useful manner?
    †subject to change