Federated Search Research

From openmichigan

OER Search Meetings - Dec 08 - May 09


Architecture for Federated Search

Discussion by Kathleen, Mike, Ali, Ted, 19 December 2008

Discussion by Nathan Y (CC), Kathleen, Ali, Piet, Garin, Albert (U-M Library), Amatava, Greg, 24 March 2009

Discussion by Kathleen, Piet, Mike, Amatava, Ted, Ali 18 May 2009

Last Updated 18 May 2009

Abstract

To address the difficulty of searching health OER materials, we are investigating a federated search framework to search across health OER repositories. This portal would serve as a central location for dynamically searching a variety of health OER repositories, from universities to professional associations, with materials ranging from courses to image libraries. This is an ambitious project; the short-term goals are to create user scenarios, identify the metadata necessary for this functionality, and develop a preliminary technical architecture for the portal. This research is complementary to Creative Commons' work on DiscoverEd (http://discovered.creativecommons.org/), an aggregated search across large cross-subject OER repositories such as OER Commons. While Creative Commons is focusing on aggregated search and ease of searching among the current metadata in large OER repositories, our paper will focus on federated search in the health OER field and recommend improved metadata standards.

Notes on CC White Paper

  • Do not expect that there will be universal OER metadata standard - too difficult to organize and agree upon
  • Search should build upon existing metadata formats - Dublin Core and RDFa. RDFa does not need to be aware of the underlying vocabularies beforehand.
  • Metadata should be independent and extensible (pg 5)
  • Simple is better - their prototype crawls 4 sites and uses 5 pieces of metadata (license, source, edu level, language, keywords)
  • They designed an aggregated search that crawls RDFa and OAI-PMH
  • Users should be able to remix (e.g. copy-paste) and retain associated metadata
  • We should allow others to customize our search results through interfaces such as www.opensearch.org/
  • Could include user-submitted content later (e.g. reviews), but as it's difficult to ensure quality or accuracy of user-entered data, the search should allow people to filter by curator/source. There should be no anonymous submissions.
  • Doesn't save user prefs or shared expert searches (e.g. favorites list) but future versions might
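The five-field prototype record and the curator/source filtering discussed above can be sketched as follows; the class and field names are illustrative, not CC's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class OERRecord:
    # The five metadata fields the DiscoverEd prototype uses
    license: str
    source: str          # curator/source, used for filtering
    edu_level: str
    language: str
    keywords: list = field(default_factory=list)

def filter_by_source(records, trusted_sources):
    """Keep only records from known curators, since user-entered
    data is hard to vet and anonymous submissions are disallowed."""
    return [r for r in records if r.source in trusted_sources]

records = [
    OERRecord("CC BY", "U-M", "graduate", "en", ["dyspnea"]),
    OERRecord("CC BY-SA", "unknown-blog", "any", "en", ["anatomy"]),
]
trusted = filter_by_source(records, {"U-M", "Tufts"})
```

Filtering by a whitelist of curators, rather than scoring trust, matches the "no anonymous submissions" constraint above.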

Search what?

  • First priority is Health OER content of any type
  • Also want to make sure the system could scale to any content

Search where?

  • Within an institution and its content hierarchies
  • Across institutions with high bandwidth connections
  • Across institutions with poor connections
  • Repositories
  • Which institutions would we be searching across – health OER (U-M, Tufts, KNUST, U-G, UCT, UWC) or all of OCWC?
  • University of Michigan

Search on?

  • Metadata
      • Automatically generated
          • File name
          • Creation date
          • File type
      • Manually added (to be added by the departments/institutions before we get the data)
          • Learning objective
          • License (e.g. CC)
  • Full text
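The automatically generated fields above can be derived directly from the file system. A minimal sketch, using modification time as a portable stand-in for creation date:

```python
import mimetypes
import os
from datetime import datetime, timezone

def auto_metadata(path):
    """Derive the automatically generated fields (file name,
    creation date, file type); manual fields such as learning
    objective and license would come from the contributing
    institution before we receive the data."""
    stat = os.stat(path)
    mime, _ = mimetypes.guess_type(path)
    return {
        "file_name": os.path.basename(path),
        # st_mtime (modification time) as a cross-platform proxy
        # for creation date
        "creation_date": datetime.fromtimestamp(
            stat.st_mtime, tz=timezone.utc).isoformat(),
        "file_type": mime or "application/octet-stream",
    }
```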

What’s the role of RDFa?

  • Do we need to define an OER minimum?
  • Need to include CC metadata:
      • Author
      • License Name
      • License URL
  • What does the Yahoo SearchMonkey RDFa look like?
      • From CC Tech Summit
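As a rough illustration of harvesting the license fields above, the sketch below pulls rel="license" links (the ccREL pattern CC recommends for HTML/RDFa) out of a page with the standard-library parser; a real crawler would use an RDFa-aware library rather than this minimal extractor.

```python
from html.parser import HTMLParser

class LicenseExtractor(HTMLParser):
    """Collect rel="license" links: license URL from href,
    license name from the link text."""
    def __init__(self):
        super().__init__()
        self.licenses = []
        self._in_license_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("rel") == "license":
            self._in_license_link = True
            self.licenses.append({"url": attrs.get("href"), "name": ""})

    def handle_data(self, data):
        if self._in_license_link:
            self.licenses[-1]["name"] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_license_link = False

page = ('<a rel="license" '
        'href="http://creativecommons.org/licenses/by/3.0/">'
        'CC BY 3.0</a>')
parser = LicenseExtractor()
parser.feed(page)
```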

Metadata Standards

  • IEEE metadata standards in this area?
  • Medbiquitous’s role in defining medical education competencies

Other repositories' metadata

Multimedia Educational Resources for Learning and Online Teaching (MERLOT)

  • Title
  • Author
  • Keywords
  • Description
  • Email
  • Organization
  • Selected from list
      • Community
      • Type
      • Subject
      • Audience
      • Technical Format
      • Language
      • Copyright (Y/N)
      • Source Code Available (Y/N)
      • Section 508 compliant (Y/N)
      • Cost Involved (Y/N)
      • Creative Commons (Y/N)
  • Automatically generated
      • Submitter
      • Date Added
      • Date Modified
      • has peer reviews
      • MERLOT award winner
      • has member comments
      • has personal collections
      • has assignments
      • has an Author Snapshot
      • has SERC Activity Sheets

MedEdPortal

  • Resource Title
  • Keyword
  • Institution
  • Description
  • Author
  • Selected from List
      • Resource Type
      • AAMC Hot Topics
      • Specialty/Discipline
      • Collection
      • Intended Learner Audience
      • Intended Faculty Audience
      • ACGME Outcomes Addressed
  • Automatically generated
      • Publication Date

Health Education Assets Library (HEAL)

  • User-entered
      • Title
      • Description
      • Contributors/Institution
      • Copyright Holders
      • MeSH Term
      • Tags
  • Selected from list
      • Learning Resource Type
      • Medical Image Type
      • Source Collection
      • Primary Audience
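The three field lists above only partially overlap, so a federated search needs a crosswalk into a common schema before results can be merged. A minimal sketch; the mapping choices here are our assumptions, not official equivalences published by the repositories:

```python
# Illustrative crosswalk from each repository's native field
# names to a minimal Dublin Core-like common schema.
CROSSWALK = {
    "MERLOT":      {"Title": "title", "Author": "creator",
                    "Keywords": "subject", "Description": "description",
                    "Language": "language"},
    "MedEdPortal": {"Resource Title": "title", "Author": "creator",
                    "Keyword": "subject", "Description": "description"},
    "HEAL":        {"Title": "title",
                    "Contributors/Institution": "creator",
                    "MeSH Term": "subject", "Description": "description"},
}

def to_common(repo, native_record):
    """Rename a native record's fields to the common schema,
    dropping fields the crosswalk does not cover."""
    mapping = CROSSWALK[repo]
    return {common: value
            for native, value in native_record.items()
            if (common := mapping.get(native))}

common = to_common("HEAL", {"Title": "Chest X-ray",
                            "MeSH Term": "Dyspnea",
                            "Tags": "xray"})
```

Fields without a common equivalent (HEAL's Tags here) are dropped rather than guessed, which keeps the merged schema small, in the spirit of the "simple is better" note above.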

Use cases

Providers of content

  • Where is RDFa captured (today and in our desired future)?
      • Note: This is one of our requirements in the OER Platform discussion.
      • eduCommons? Does RDF but not RDFa
      • OERca?
      • Connexions? Does RDF but not RDFa; exports as plain XML
      • BlueStream? Does not generate RDF/RDFa
  • If providers want to be included in the results, do they need to register?

Consumers of content

  • Search for the term “dyspnea” anywhere in the text and return anything that matches (1) the term and (2) the concept
  • Map learning experiences to learning assessments to learning objectives to learning outcomes through some sort of navigation interface
  • Return content links that may be behind authentication
      • Question: For CTools, does that mean getting to its indices or extracting the index?
      • I have forgotten some of this context; I remember the discussion about authentication. Were we considering searching CTools because we wouldn't be restricted to open content?
  • How is it presented?
      • Consolidated list (the Google way)
          • Advantages: Simple UI
          • Disadvantages: Relevance ranking across repositories is hard
      • Separate silos per repository
          • Advantages: No need to merge results or rank relevance across institutions
          • Disadvantages: More work for the user to filter through silos
      • Faceted search
          • Advantages: User friendly
          • Disadvantages: Might be more difficult to implement, depending on the platform
  • Can we survey users to find out what they want?
  • We have SurveyMonkey as a service now. Who would we survey?
  • I forget the context in which we mentioned Google Alerts. Did we want our federated search to include a feature like Google Alerts where users can receive notifications when new resources that match their keywords are added? Or did we somehow want to integrate with Google Alerts?
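The consolidated list's relevance-ranking problem can be sidestepped with a round-robin merge of per-repository result lists: alternate hits instead of comparing scores across silos. A sketch of that middle ground between one ranked list and separate silos:

```python
from itertools import chain, zip_longest

def interleave(result_lists):
    """Round-robin merge of per-repository result lists, in each
    repository's own ranked order; no cross-repository scoring."""
    _gap = object()  # fills short lists, then gets filtered out
    merged = chain.from_iterable(
        zip_longest(*result_lists, fillvalue=_gap))
    return [r for r in merged if r is not _gap]

# Hypothetical per-repository hits, each already ranked locally
umich = ["um-1", "um-2", "um-3"]
merlot = ["ml-1"]
heal = ["heal-1", "heal-2"]
merged = interleave([umich, merlot, heal])
```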

Features

  • Meta-thesaurus
      • Question: Does UMLS cover the necessary health OER domains?
      • Language (e.g., across English, French, German)
      • Note the challenges, pros, and cons of incorporating this feature (based on previous applications)
  • Timeliness (of syncing)
      • Daily?
      • Weekly?
  • "identification and association with people of related interests (social networking)"
      • An example might be that a user logs in and the system tracks her search queries or facets and supplies them to other subscribers or members of the system; this creates opportunities for research groups, study groups, etc.
  • "peer-based sharing of information"
  • "access by a variety of devices across different network connections (e.g., mobile devices and locally replicated caches where wide area connectivity is constrained)"
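The meta-thesaurus feature amounts to expanding a query term with its concept-level synonyms before searching, so that "dyspnea" matches both the term and the concept. A toy sketch with a hand-written synonym map; a real system might draw these relations from UMLS, per the open question above:

```python
# Hypothetical synonym map standing in for a meta-thesaurus.
SYNONYMS = {
    "dyspnea": {"shortness of breath", "breathlessness"},
}

def expand_query(term):
    """Return the literal term plus its concept-level synonyms;
    unknown terms pass through unexpanded."""
    return {term} | SYNONYMS.get(term.lower(), set())

expanded = expand_query("dyspnea")
```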

Technical issues

  • Syncing strategies for performance
  • Search APIs for the different repositories
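One concrete syncing strategy is incremental harvesting over OAI-PMH, which the CC prototype already crawls (noted above); the protocol's `from` argument limits a ListRecords request to records changed since the last sync, keeping daily or weekly re-harvests cheap. A sketch of building such a request URL (the base URL is hypothetical):

```python
from urllib.parse import urlencode

def harvest_url(base_url, last_sync_date):
    """Build an incremental OAI-PMH ListRecords request asking
    only for records changed since last_sync_date."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": "oai_dc",   # unqualified Dublin Core
        "from": last_sync_date,        # YYYY-MM-DD per OAI-PMH
    }
    return base_url + "?" + urlencode(params)

url = harvest_url("http://example.org/oai", "2009-05-18")
```

Responses are paged via resumption tokens, so a full sync loop would also track the `resumptionToken` element; that is omitted here.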

Approaches/Products

  • Just to confirm: We mentioned Google SEO because we think users would be coming to our content from Google searches in addition to whatever federated search platform we choose?

Literature Review

  • Thi Truong Avrahami, Lawrence Yau, Luo Si, and Jamie Callan. "The FedLemur project: Federated search in the real world." Journal of the American Society for Information Science and Technology 57(3), 2006, 347-358. http://dx.doi.org/10.1002/asi.20283

Additional Resources

Apelon, Inc.: http://www.apelon.com/products

Columbia University Biomedical Informatics: http://www.dbmi.columbia.edu/

HL7: http://www.hl7.org

The Lexical Grid: http://informatics.mayo.edu/LexGrid/index.php?page

Medical Entities Dictionary: http://med.dmi.columbia.edu/

MedLEE - A Medical Language Extraction and Encoding System: http://lucid.cpmc.columbia.edu/medlee/

OpenGALEN Foundation: http://www.opengalen.org/index.html

The Protégé Ontology Editor and Knowledge Acquisition System: http://protege.stanford.edu/

SNOMED International: http://www.snomed.org

Unified Medical Language System Knowledge Source Server: http://umlsks.nlm.nih.gov/kss/servlet/Turbine/template/admin,user,KSS_login.vm

OpenSearch: http://www.opensearch.org/

General questions

  • Where do we stop with our issues before diving into general IR problems?
  • What more is there to semantic search than thesaurus and metadata?
Albert's response: "Quite a bit, especially if you have full text. Some of my favorite projects: http://infomap-nlp.sourceforge.net/ (no longer actively developed, because the author moved on to work on a different algorithm, the next one in fact...) and http://code.google.com/p/semanticvectors/. If you have a Lucene index, Semantic Vectors can use that to create its semantic index. Some patents on real latent semantic analysis might have expired, so maybe that's an option now too, though I couldn't say for sure. Either way, one possible use for SV/Infomap is to derive your own thesaurus based on the corpus you have indexed. What you have full text for may not necessarily be what you've got indexed for your particular search interests. For example, if you build a semantic model of domain-specific Wikipedia pages, you could base your thesaurus on that, provided there's enough data."
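To make the "derive your own thesaurus from the corpus" idea concrete, here is a toy co-occurrence model. This is not the random-projection algorithm Semantic Vectors actually uses over a Lucene index; it is only an illustration that terms sharing contexts end up with similar vectors:

```python
from collections import Counter
from math import sqrt

def cooccurrence_vectors(docs, window=2):
    """Toy distributional model: each term's vector is a count of
    its neighbors within a small window."""
    vectors = {}
    for doc in docs:
        tokens = doc.lower().split()
        for i, tok in enumerate(tokens):
            ctx = (tokens[max(0, i - window):i]
                   + tokens[i + 1:i + 1 + window])
            vectors.setdefault(tok, Counter()).update(ctx)
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(x * x for x in a.values()))
    norm_b = sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = ["dyspnea causes labored breathing",
        "breathlessness causes labored breathing"]
vectors = cooccurrence_vectors(docs)
# "dyspnea" and "breathlessness" share identical contexts here,
# so their vectors come out (nearly) identical
sim = cosine(vectors["dyspnea"], vectors["breathlessness"])
```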

Follow-up
