Calbc – A huge challenge

Calbc was a scientific project to develop a large-scale annotated corpus for biomedical research. This page presents information about the project for scientists and the general public. For questions you can contact us.

The problem

  • Reliable IE solutions can only be developed with the help of annotated corpora. Developing these corpora has always been very time-consuming and expensive.
  • The scientific community is in need of large-scale corpora. Scientists consider the existing corpora too limited for their needs.
  • For correct results it is necessary to cover all semantics for the whole community. Scientists face problems because the complexity of such a task is too high.

The problem of reliable IE solutions is at the center of today’s research, as outlined in the paper from the Information Sciences Institute, and is important for all disciplines.

The solution

  • The project is based on a large-scale corpus, selected from 100,000 MEDLINE abstracts.
  • Different IE solutions have been used to annotate the corpus.
    • For Calbc, all semantic types are used.
    • The annotations include boundaries and allow different semantic types to be used at once.
  • The corpus is then formalized in three steps:
    1. The annotated corpora will be aligned one by one.
    2. The divergences in the corpora will then be reconciled.
    3. As the last step, a harmonized corpus will be created.
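
The three steps above can be illustrated with a small sketch. The page does not say how divergences were reconciled, so the exact-match majority vote used here is purely an assumption, as are the `Annotation` type, the `harmonize` function, the `min_votes` threshold, and the sample data.

```python
from collections import Counter
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int          # character offset where the mention begins
    end: int            # character offset just past the mention
    semantic_type: str  # illustrative label, e.g. "disease"


def harmonize(annotation_sets: list[set[Annotation]], min_votes: int = 2) -> list[Annotation]:
    """Keep annotations that at least `min_votes` systems agree on exactly.

    Assumption: agreement means identical boundaries and identical type.
    """
    votes = Counter()
    for annotations in annotation_sets:
        # Each system contributes a set, so it votes at most once per annotation.
        votes.update(annotations)
    kept = [ann for ann, count in votes.items() if count >= min_votes]
    return sorted(kept)  # order by start offset, then end, then type


# Hypothetical output of three annotation systems for the same abstract.
system_a = {Annotation(0, 8, "disease"), Annotation(20, 27, "protein")}
system_b = {Annotation(0, 8, "disease"), Annotation(20, 25, "protein")}
system_c = {
    Annotation(0, 8, "disease"),
    Annotation(20, 27, "protein"),
    Annotation(30, 36, "species"),
}

harmonized = harmonize([system_a, system_b, system_c])
print(harmonized)
```

With these sample inputs, the annotation with diverging boundaries (20 to 25) and the one proposed by a single system are dropped. A stricter or looser threshold, or partial boundary matching, would be equally plausible design choices.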

The process

  • For the first corpus, five partners did the annotation work.
  • The result was then reconciled to generate the pilot corpus.
  • The corpus was then made available by reproduction of the annotations.
  • The challenge was then closed and the annotations were harmonized again.
  • This is the basis for the next corpus, after which the challenge is opened again.

The challenge

  • This was the first time such a task had ever been conducted, which led to uncertainty.
  • It was considered a particularly big task.
  • Success is possible, but not guaranteed.
  • In case of success, the benefits are considered significant.

The steps of the process are displayed in this graphic:

[Figure: Process of developing the corpus. The process of developing the new corpus consists of several steps.]

The European Bioinformatics Institute, the initiator of Calbc, is located in the UK and is part of the European Molecular Biology Laboratory. It is situated near Cambridge on the same campus as the Wellcome Trust Sanger Institute.
