Topic: Biological research is a science that derives its findings from the proper analysis of experiments. What has changed dramatically over the last three decades is the throughput of those experiments (from single observations to gigabytes of sequences in a single day) and the breadth of the questions studied (from single molecules to entire genomes, transcriptomes, proteomes, etc.). Today, a large variety of experiments are carried out in hundreds of labs around the world, and their results are reported in a myriad of databases, websites, and publications that use different formats, conventions, and schemas. The integration of these diverse and distributed databases has been a topic of bioinformatics research for more than 20 years.
Recent years have seen a revitalization of data integration (DI) research in the life sciences. But the perception of the problem has changed: while early approaches concentrated on handling schema-dependent queries over heterogeneous and distributed databases, current research emphasizes instances rather than schemas, tries to place the human back into the loop, and intertwines data integration with data analysis. Transparency, one of the main goals of federated databases, is no longer a target; instead, users want to know exactly which data from which source was used in which way in a study (provenance). The old model of “first integrate, then analyze” is replaced by a new, process-oriented paradigm: “integration is analysis – and analysis is integration”. These new views on DI, the lessons learnt from the past, and the challenges ahead are the subject of this course.
Notice: No prior knowledge of biology is required to follow this course.
Organisation:
- The lecture starts with a lab session in which students will search for biological data in the major molecular biology sources. Students will thus experience the problems of data integration first-hand (highly heterogeneous data, varying levels of quality...). We will use real queries that several of our biologist collaborators run daily.
- The first and second parts of the lecture will wrap up what was found during the lab session and, more generally, present the major current challenges in integrating biological data.