Federated Search Research

From openmichigan

OER Search Meetings - Dec 08 - May 09


Architecture for Federated Search

Discussion by Kathleen, Mike, Ali, Ted, 19 December 2008

Discussion by Nathan Y (CC), Kathleen, Ali, Piet, Garin, Albert (U-M Library), Amatava, Greg, 24 March 2009

Discussion by Kathleen, Piet, Mike, Amatava, Ted, Ali 18 May 2009

Last Updated 18 May 2009

Abstract

To address the difficulty of searching health OER materials, we are investigating a federated search framework to search across health OER repositories. This portal would serve as a central location for dynamically searching a variety of health OER repositories, from universities to professional associations, with materials ranging from courses to image libraries. This is an ambitious project; the short-term goals are to create user scenarios, identify the metadata necessary for this functionality, and develop a preliminary technical architecture for the portal. This research is complementary to Creative Commons' work on DiscoverEd (http://discovered.creativecommons.org/), an aggregated search across large cross-subject OER repositories such as OER Commons. While Creative Commons is focusing on aggregated search and ease of searching among the current metadata in large OER repositories, our paper will focus on federated search in the health OER field and recommend improved metadata standards.

Notes on CC White Paper

  • Do not expect that there will be universal OER metadata standard - too difficult to organize and agree upon
  • Search should build upon existing metadata formats - Dublin Core and RDFa. RDFa does not need to be aware of the underlying vocabularies beforehand.
  • Metadata should be independent and extensible (pg 5)
  • Simple is better - their prototype crawls 4 sites and uses 5 pieces of metadata (license, source, edu level, language, keywords)
  • They designed an aggregated search that crawls RDFa and OAI-PMH
  • Users should be able to remix (e.g. copy-paste) and retain associated metadata
  • We should allow others to customize our search results through interfaces such as www.opensearch.org/
  • Could include user-submitted content later (e.g. reviews), but as it's difficult to ensure quality or accuracy of user-entered data, the search should allow people to filter by curator/source. There should be no anonymous submissions.
  • Doesn't save user prefs or shared expert searches (e.g. favorites list) but future versions might
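The five-field prototype record and the curator/source filtering discussed above can be sketched as follows; the class and field names are illustrative, not CC's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class OERRecord:
    # The five metadata fields the DiscoverEd prototype uses
    license: str
    source: str          # curator/source, used for filtering
    edu_level: str
    language: str
    keywords: list = field(default_factory=list)

def filter_by_source(records, trusted_sources):
    """Keep only records from known curators, since user-entered
    data is hard to vet and anonymous submissions are disallowed."""
    return [r for r in records if r.source in trusted_sources]

records = [
    OERRecord("CC BY", "U-M", "graduate", "en", ["dyspnea"]),
    OERRecord("CC BY-SA", "unknown-blog", "any", "en", ["anatomy"]),
]
trusted = filter_by_source(records, {"U-M", "Tufts"})
```

Filtering by a whitelist of curators, rather than scoring trust, matches the "no anonymous submissions" constraint above.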

Search what?

  • First priority is Health OER content of any type
  • Also want to make sure the system could scale to any content

Search where?

  • Within an institution and its content hierarchies
  • Across institutions with high bandwidth connections
  • Across institutions with poor connections
  • Repositories
  • Which institutions would we be searching across – health OER (U-M, Tufts, KNUST, U-G, UCT, UWC) or all of OCWC?
  • University of Michigan

Search on?

  • Metadata
      • Automatically generated
          • File name
          • Creation date
          • File type
      • Manually added (to be added by the departments/institutions before we get the data)
          • Learning objective
          • License (e.g. CC)
  • Full text
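The automatically generated fields above can be derived directly from the file system. A minimal sketch, using modification time as a portable stand-in for creation date:

```python
import mimetypes
import os
from datetime import datetime, timezone

def auto_metadata(path):
    """Derive the automatically generated fields (file name,
    creation date, file type); manual fields such as learning
    objective and license would come from the contributing
    institution before we receive the data."""
    stat = os.stat(path)
    mime, _ = mimetypes.guess_type(path)
    return {
        "file_name": os.path.basename(path),
        # st_mtime (modification time) as a cross-platform proxy
        # for creation date
        "creation_date": datetime.fromtimestamp(
            stat.st_mtime, tz=timezone.utc).isoformat(),
        "file_type": mime or "application/octet-stream",
    }
```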

What’s the role of RDFa?

  • Do we need to define an OER minimum?
  • Need to include CC metadata:
      • Author
      • License Name
      • License URL
  • What does the Yahoo SearchMonkey RDFa look like?
      • From CC Tech Summit
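As a rough illustration of harvesting the license fields above, the sketch below pulls rel="license" links (the ccREL pattern CC recommends for HTML/RDFa) out of a page with the standard-library parser; a real crawler would use an RDFa-aware library rather than this minimal extractor.

```python
from html.parser import HTMLParser

class LicenseExtractor(HTMLParser):
    """Collect rel="license" links: license URL from href,
    license name from the link text."""
    def __init__(self):
        super().__init__()
        self.licenses = []
        self._in_license_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("rel") == "license":
            self._in_license_link = True
            self.licenses.append({"url": attrs.get("href"), "name": ""})

    def handle_data(self, data):
        if self._in_license_link:
            self.licenses[-1]["name"] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_license_link = False

page = ('<a rel="license" '
        'href="http://creativecommons.org/licenses/by/3.0/">'
        'CC BY 3.0</a>')
parser = LicenseExtractor()
parser.feed(page)
```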

Metadata Standards

  • IEEE metadata standards in this area?
  • Medbiquitous’s role in defining medical education competencies

Other repositories' metadata

Multimedia Educational Resources for Learning and Online Teaching (MERLOT)

  • Title
  • Author
  • Keywords
  • Description
  • Email
  • Organization
  • Selected from list
      • Community
      • Type
      • Subject
      • Audience
      • Technical Format
      • Language
      • Copyright (Y/N)
      • Source Code Available (Y/N)
      • Section 508 compliant (Y/N)
      • Cost Involved (Y/N)
      • Creative Commons (Y/N)
  • Automatically generated
      • Submitter
      • Date Added
      • Date Modified
      • has peer reviews
      • MERLOT award winner
      • has member comments
      • has personal collections
      • has assignments
      • has an Author Snapshot
      • has SERC Activity Sheets

MedEdPortal

  • Resource Title
  • Keyword
  • Institution
  • Description
  • Author
  • Selected from List
      • Resource Type
      • AAMC Hot Topics
      • Specialty/Discipline
      • Collection
      • Intended Learner Audience
      • Intended Faculty Audience
      • ACGME Outcomes Addressed
  • Automatically generated
      • Publication Date

Health Education Assets Library (HEAL)

  • User-entered
      • Title
      • Description
      • Contributors/Institution
      • Copyright Holders
      • MeSH Term
      • Tags
  • Selected from list
      • Learning Resource Type
      • Medical Image Type
      • Source Collection
      • Primary Audience
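The three field lists above only partially overlap, so a federated search needs a crosswalk into a common schema before results can be merged. A minimal sketch; the mapping choices here are our assumptions, not official equivalences published by the repositories:

```python
# Illustrative crosswalk from each repository's native field
# names to a minimal Dublin Core-like common schema.
CROSSWALK = {
    "MERLOT":      {"Title": "title", "Author": "creator",
                    "Keywords": "subject", "Description": "description",
                    "Language": "language"},
    "MedEdPortal": {"Resource Title": "title", "Author": "creator",
                    "Keyword": "subject", "Description": "description"},
    "HEAL":        {"Title": "title",
                    "Contributors/Institution": "creator",
                    "MeSH Term": "subject", "Description": "description"},
}

def to_common(repo, native_record):
    """Rename a native record's fields to the common schema,
    dropping fields the crosswalk does not cover."""
    mapping = CROSSWALK[repo]
    return {common: value
            for native, value in native_record.items()
            if (common := mapping.get(native))}

common = to_common("HEAL", {"Title": "Chest X-ray",
                            "MeSH Term": "Dyspnea",
                            "Tags": "xray"})
```

Fields without a common equivalent (HEAL's Tags here) are dropped rather than guessed, which keeps the merged schema small, in the spirit of the "simple is better" note above.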

Use cases

Providers of content

  • Where is RDFa captured (today and in our desired future)?
      • Note: This is one of our requirements in the OER Platform discussion.
      • eduCommons? Does RDF but not RDFa
      • OERca?
      • Connexions? Does RDF but not RDFa; exports as plain XML
      • BlueStream? Does not generate RDF/RDFa
  • If providers want to be included in the results, do they need to register?

Consumers of content

  • Search for the term “dyspnea” anywhere in the text and return anything that matches (1) the term and (2) the concept
  • Map learning experiences to learning assessments to learning objectives to learning outcomes through some sort of navigation interface
  • Return content links that may be behind authentication
      • Question: For CTools, does that mean getting to its indices or extracting the index?
      • I have forgotten some of this context; I remember the discussion about authentication. Were we considering searching CTools because we wouldn't be restricted to open content?
  • How is it presented?
      • Consolidated list (the Google way)
          • Advantages: Simple UI
          • Disadvantages: Relevance ranking across repositories is hard
      • Separate silos per repository
          • Advantages: No need to merge results or rank relevance across institutions
          • Disadvantages: More work for the user to filter through silos
      • Faceted search
          • Advantages: User friendly
          • Disadvantages: Might be more difficult to implement, depending on the platform
  • Can we survey users to find out what they want?
  • We have SurveyMonkey as a service now. Who would we survey?
  • I forget the context in which we mentioned Google Alerts. Did we want our federated search to include a feature like Google Alerts where users can receive notifications when new resources that match their keywords are added? Or did we somehow want to integrate with Google Alerts?
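The consolidated list's relevance-ranking problem can be sidestepped with a round-robin merge of per-repository result lists: alternate hits instead of comparing scores across silos. A sketch of that middle ground between one ranked list and separate silos:

```python
from itertools import chain, zip_longest

def interleave(result_lists):
    """Round-robin merge of per-repository result lists, in each
    repository's own ranked order; no cross-repository scoring."""
    _gap = object()  # fills short lists, then gets filtered out
    merged = chain.from_iterable(
        zip_longest(*result_lists, fillvalue=_gap))
    return [r for r in merged if r is not _gap]

# Hypothetical per-repository hits, each already ranked locally
umich = ["um-1", "um-2", "um-3"]
merlot = ["ml-1"]
heal = ["heal-1", "heal-2"]
merged = interleave([umich, merlot, heal])
```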

Features

  • Meta-thesaurus
      • Question: Does UMLS cover the necessary health OER domains?
      • Language (e.g., across English, French, German)
      • Note the challenges, pros, and cons of incorporating this feature (based on previous applications)
  • Timeliness (of syncing)
      • Daily?
      • Weekly?
  • "identification and association with people of related interests (social networking)"
      • An example might be that a user logs in and the system tracks her search queries or facets and supplies them to other subscribers or members of the system; this creates opportunities for research groups, study groups, etc.
  • "peer-based sharing of information"
  • "access by a variety of devices across different network connections (e.g., mobile devices and locally replicated caches where wide area connectivity is constrained)"
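The meta-thesaurus feature amounts to expanding a query term with its concept-level synonyms before searching, so that "dyspnea" matches both the term and the concept. A toy sketch with a hand-written synonym map; a real system might draw these relations from UMLS, per the open question above:

```python
# Hypothetical synonym map standing in for a meta-thesaurus.
SYNONYMS = {
    "dyspnea": {"shortness of breath", "breathlessness"},
}

def expand_query(term):
    """Return the literal term plus its concept-level synonyms;
    unknown terms pass through unexpanded."""
    return {term} | SYNONYMS.get(term.lower(), set())

expanded = expand_query("dyspnea")
```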

Technical issues

  • Syncing strategies for performance
  • Search APIs for the different repositories
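One concrete syncing strategy is incremental harvesting over OAI-PMH, which the CC prototype already crawls (noted above); the protocol's `from` argument limits a ListRecords request to records changed since the last sync, keeping daily or weekly re-harvests cheap. A sketch of building such a request URL (the base URL is hypothetical):

```python
from urllib.parse import urlencode

def harvest_url(base_url, last_sync_date):
    """Build an incremental OAI-PMH ListRecords request asking
    only for records changed since last_sync_date."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": "oai_dc",   # unqualified Dublin Core
        "from": last_sync_date,        # YYYY-MM-DD per OAI-PMH
    }
    return base_url + "?" + urlencode(params)

url = harvest_url("http://example.org/oai", "2009-05-18")
```

Responses are paged via resumption tokens, so a full sync loop would also track the `resumptionToken` element; that is omitted here.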

Approaches/Products

  • Just to confirm: We mentioned Google SEO because we think users would be coming to our content from Google searches in addition to whatever federated search platform we choose?

Literature Review

  • Thi Truong Avrahami, Lawrence Yau, Luo Si, and Jamie Callan. "The FedLemur project: Federated search in the real world." Journal of the American Society for Information Science and Technology 57(3), 2006, 347-358. http://dx.doi.org/10.1002/asi.20283

Additional Resources

Apelon, Inc.: http://www.apelon.com/products

Columbia University Biomedical Informatics: http://www.dbmi.columbia.edu/

HL7: http://www.hl7.org

The Lexical Grid: http://informatics.mayo.edu/LexGrid/index.php?page

Medical Entities Dictionary: http://med.dmi.columbia.edu/

MedLEE - A Medical Language Extraction and Encoding System: http://lucid.cpmc.columbia.edu/medlee/

OpenGALEN Foundation: http://www.opengalen.org/index.html

The Protégé Ontology Editor and Knowledge Acquisition System: http://protege.stanford.edu/

SNOMED International: http://www.snomed.org

Unified Medical Language System Knowledge Source Server: http://umlsks.nlm.nih.gov/kss/servlet/Turbine/template/admin,user,KSS_login.vm

OpenSearch: http://www.opensearch.org/

General questions

  • Where do we stop with our issues before diving into general IR problems?
  • What more is there to semantic search than thesaurus and metadata?
Albert's response: "Quite a bit, especially if you have full text. Some of my favorite projects: http://infomap-nlp.sourceforge.net/ (no longer actively developed, because the author moved on to work on a different algorithm, the next one in fact...) and http://code.google.com/p/semanticvectors/. If you have a Lucene index, Semantic Vectors can use that to create its semantic index. Some patents on real latent semantic analysis might have expired, so maybe that's an option now too, though I couldn't say for sure. Either way, one possible use for SV/Infomap is to derive your own thesaurus based on the corpus you have indexed. What you have full text for may not necessarily be what you've got indexed for your particular search interests. For example, if you build a semantic model of domain-specific Wikipedia pages, you could base your thesaurus on that, provided there's enough data."
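To make the "derive your own thesaurus from the corpus" idea concrete, here is a toy co-occurrence model. This is not the random-projection algorithm Semantic Vectors actually uses over a Lucene index; it is only an illustration that terms sharing contexts end up with similar vectors:

```python
from collections import Counter
from math import sqrt

def cooccurrence_vectors(docs, window=2):
    """Toy distributional model: each term's vector is a count of
    its neighbors within a small window."""
    vectors = {}
    for doc in docs:
        tokens = doc.lower().split()
        for i, tok in enumerate(tokens):
            ctx = (tokens[max(0, i - window):i]
                   + tokens[i + 1:i + 1 + window])
            vectors.setdefault(tok, Counter()).update(ctx)
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(x * x for x in a.values()))
    norm_b = sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = ["dyspnea causes labored breathing",
        "breathlessness causes labored breathing"]
vectors = cooccurrence_vectors(docs)
# "dyspnea" and "breathlessness" share identical contexts here,
# so their vectors come out (nearly) identical
sim = cosine(vectors["dyspnea"], vectors["breathlessness"])
```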

Follow-up
