IADB meeting, December 12th, 2017

=== Participants


Elena Cabrio, Olivier Corby, David Darmon, Catherine Faron Zucker, Edson Florez, Raphael Gazzotti, Amina Ghrissi, Johan Montagnat (remote), Céline Poudat (remote), Frédéric Precioso, Michel Riveill, Pascal Staccini (remote), Serena Villata


=== Agenda


  - Introduction, project progress

  - Introduction of new PhD students: T. Mayer and A. Ghrissi 

  - Feedback on the data anonymization meeting (Pascal) and first work on anonymization (Serena / Elena) 

  - Progress on deep patient workflow (Michel)

  - Progress on PRIMEGE data (Edson)

  - Ongoing work on PRIMEGE data with Wimmics and Synchronext (Catherine)

  - HELP low power platform for deep learning (Michel) 

  - Planning next steps and next meeting 


=== Introduction


See introduction slides attached for general objectives and organization of the project.

  - Project started in June: now at T0+6 months. 

  - 2 PhDs hired: Amina Ghrissi (image data) and Tobias Mayer (text data). Amina just arrived in December. Tobias started in October and is working on an anonymization procedure for textual clinical reports.

  - First contacts with the Medical Data Centre (now renamed Medical Data Institute) of UCA.

  - Franck Michel (I3S) started working on a PMSI data integration procedure using the MongoDB flexible-format database. He wrote scripts to export the Access PMSI data (as CSV) and import it into MongoDB, proposed some example MongoDB queries and deployed a server on a virtual machine inside the I3S private network. Tests were done on the PMSI MCO data.
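The CSV-to-document conversion step mentioned above can be illustrated with a minimal sketch. It is not the script written for the project: the semicolon delimiter and the column names (`stay_id`, `diagnosis`, `age`) are assumptions, and the sample rows are invented. Empty fields are dropped so that each document carries only the fields it has, which a flexible-format store allows.

```python
import csv
import io


def csv_to_documents(csv_text, delimiter=";"):
    """Turn CSV text into a list of dicts, skipping empty fields."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    documents = []
    for row in reader:
        doc = {}
        for key, value in row.items():
            # Ignore overflow columns (key is None) and empty or missing values.
            if key is None or not isinstance(value, str):
                continue
            if value.strip():
                doc[key] = value.strip()
        documents.append(doc)
    return documents


# Hypothetical sample, not real PMSI data.
SAMPLE = "stay_id;diagnosis;age\n1;I10;54\n2;;61\n"
DOCS = csv_to_documents(SAMPLE)
```

The resulting list of dictionaries could then be handed to a MongoDB driver for insertion; that part is left out here because it depends on the server deployment.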



=== Feedback on the data anonymization meeting (Pascal) and first work on anonymization (Serena / Elena)


Data anonymization is needed to experiment with medical data outside of the CHU network / without direct supervision from the DIIM. However, anonymization degrades data and destroys links between data items: a trade-off needs to be found between the acceptable level of anonymization and the kind of inferences that can be made on the data. In the exported PMSI data, for instance, the link between several hospital stays of the same patient is lost. To fully exploit PMSI data, non-anonymized data (or at least loosely anonymized data with preserved data links) will be needed in the end. It remains to be seen whether VPN access to the CHUN network from an outside institution is an acceptable solution to work with the raw data, or whether all work on raw data needs to be carried out inside the CHU network.
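To illustrate what "loosely anonymized data with preserved data links" could look like, here is a minimal sketch based on keyed hashing (HMAC-SHA256). This is an assumption for illustration only, not a procedure decided in the meeting: the key, the identifiers and the stay dates are hypothetical. Such pseudonymization is weaker than full anonymization, since anyone holding the key can recompute the mapping and the remaining fields may still be identifying.

```python
import hashlib
import hmac


def pseudonymize(patient_id, secret_key):
    """Replace a patient identifier by a truncated keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Hypothetical key and stays; the key would have to stay inside the hospital.
KEY = b"hypothetical-secret-kept-inside-the-hospital"
stays = [
    ("patient-A", "2017-01-03"),
    ("patient-B", "2017-02-10"),
    ("patient-A", "2017-06-21"),
]

# The same patient always gets the same pseudonym, so stays remain linkable.
pseudo_stays = [(pseudonymize(pid, KEY), date) for pid, date in stays]
```

With this scheme the first and third stays share a pseudonym while the original identifier no longer appears in the exported records.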


There are 3 levels to deal with in the anonymization process:

  1. Which data is to be anonymized / exported (some data sets are more easily re-identifiable than others, such as rare diseases). Different kinds of medical reports exist (surgery, hospital stay).
  2. How anonymization is implemented: level of anonymization and trustworthiness of the anonymization process.
  3. Legal aspects. The legal context is evolving in France. We could meet with the CIL (Correspondant Informatique et Libertés) of the CHUN.


Exported PMSI data is strongly anonymized, but data links are preserved in the raw database. There are well-established anonymization tools for textual medical data (Tobias is visiting the LMSI laboratory in Paris, which develops the "Medina" tool). A reference terminology for medical documents is available (CDA) in addition to all the clinical term vocabularies. For other kinds of data, e.g. biology, there are neither automatic extraction nor automatic anonymization tools.


Michel is trying to get access to the SNIIRAM data (national data from the health insurance), which contains the medical prescriptions and medical acts that have been paid for by the health insurance (only paid-for acts are known; prescriptions are not always applied). Some data sets exist internationally, in particular in the US, which has more open legislation than France (CDC Atlanta). Producing a linked research medical data set would be a strong added value for the project. This work needs to be synchronised with the medical data centre and may involve other medical institutions (contacts in Rennes and Grenoble in particular).


The PRIMEGE data set contains community medicine prescriptions from 13 general practitioners from 2012. The data contains both structured text with codes and free-text notes. Complete reports, which would need to be anonymized, are excluded. The data set is declared to the CNIL and ready to be exploited. The data is formatted in H' (an HL7-compatible format). Linking this data with biology data would be of high interest, but there is currently no tool for this work. Some statistical tools search for links between PRIMEGE and SNIIRAM data (with 93% accuracy).



=== Progress on deep patient workflow (Michel reports on John's PhD work)


3 target questions:

Other questions of interest are identified (cyclical phenomena…)


3-step deep-patient workflow:

  - data encoding: how to represent data (e.g. age can be a number but also an age range…)

  - deep learning: using (unsupervised) autoencoders

  - random forest decision layer
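The data encoding question (age as a number or as an age range) can be made concrete with a small sketch. The bin width of 10 years and the cap at 100 are arbitrary assumptions chosen for illustration, not the encoding used in the PhD work.

```python
def encode_age(age, bin_width=10, max_age=100):
    """Return (scaled number, one-hot age range) for a non-negative age."""
    n_bins = max_age // bin_width + 1  # the last bin collects ages >= max_age
    index = min(age // bin_width, n_bins - 1)
    one_hot = [0] * n_bins
    one_hot[index] = 1
    return age / max_age, one_hot


# Age 54 as a scaled number and as the "50-59" range.
scaled, age_range = encode_age(54)
```

The numeric form keeps the ordering between ages, whereas the one-hot range yields a sparser vector in which each range is treated as an independent category.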


Data is large: 12% of year 2008 data = 3.2 GB

The model size is proportional to the dataset size -> the correct trade-off between model complexity and computational cost has to be found.

Auto-encoder: noise is added to the data, which is then reconstructed, to reduce the sparsity of the input vector without any supervision.
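As an illustration of the noising step, the sketch below applies masking noise, one common corruption scheme for denoising auto-encoders. Whether this particular scheme is the one used in the workflow is an assumption; the function name and the drop probability are hypothetical.

```python
import random


def add_masking_noise(vector, drop_probability, rng):
    """Set each component to 0 with the given probability (masking noise)."""
    return [0 if rng.random() < drop_probability else x for x in vector]


rng = random.Random(0)  # fixed seed so that the corruption is reproducible
clean = [1, 0, 0, 1, 1, 0, 1, 0]
noisy = add_masking_noise(clean, 0.3, rng)
```

The auto-encoder would then be trained to rebuild `clean` from `noisy`, which forces it to learn a compact representation instead of copying its input.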


First results:

  - Re-education time does not improve classification

  - The auto-encoding stage does not modify the workflow accuracy

  - The auto-encoding stage reduces the classification time



=== Progress on PRIMEGE data (Edson)


This work on the PRIMEGE database aims at detecting Adverse Drug Reactions (ADR) by detecting correlations in medical notes that contain information on medication, diseases and observed disorders. Example: "the patient has internal bleeding secondary to warfarin" establishes a correlation between the disorder (internal bleeding) and the medication (warfarin).


The workflow is composed of 3 steps:

  1. Preprocessing to deal with the low quality of reports (use of abbreviations, spelling errors…)
  2. Named entity recognition: identify what is a medication, a disorder, etc. The ECMT tool from CISMeF (CHU Rouen) is used. It extracts concepts from multiple terminologies (55 terminologies included in the HeTOP repository). 1.4 million tokens were extracted from 46 thousand PRIMEGE notes and classified in 7 main categories.
  3. Adverse drug reaction detection: identification of drug-disorder relations and identification of ADR. A Long Short-Term Memory (LSTM) recurrent deep neural network is used to identify dependencies between terms.
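To make the drug-disorder relation step more tangible, here is a deliberately naive baseline that lists drug and disorder mentions occurring in the same sentence. It is not the LSTM approach described above: the two lexicons are toy assumptions standing in for the ECMT output, and co-occurrence alone does not establish an adverse reaction.

```python
import re

# Hypothetical toy lexicons; the actual work relies on ECMT and HeTOP.
DRUGS = {"warfarin", "aspirin"}
DISORDERS = {"internal bleeding", "nausea"}


def candidate_pairs(note):
    """Return (drug, disorder) pairs mentioned in the same sentence."""
    pairs = []
    for sentence in re.split(r"[.!?]+", note.lower()):
        drugs = [d for d in sorted(DRUGS) if d in sentence]
        disorders = [d for d in sorted(DISORDERS) if d in sentence]
        pairs.extend((drug, disorder) for drug in drugs for disorder in disorders)
    return pairs


NOTE = "The patient has internal bleeding secondary to warfarin. No nausea reported."
PAIRS = candidate_pairs(NOTE)
```

Such candidate pairs would only be the input of the relation classifier, whose role is to decide whether the context ("secondary to") actually expresses an adverse reaction.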


Current status: a first version of the tool extracts clinical entities from the notes and builds a time series for each patient with notes, medication, diagnoses and symptoms. The LSTM deep network is being built and trained for ADR detection.



=== Ongoing work on PRIMEGE data with Wimmics and Synchronext (Raphaël)


A follow-up meeting is to be organised on this subject.



=== HELP low power platform for deep learning (Michel) 


A spin-off project named HELP was accepted and funded by Academy 1 to finance a low power computing platform using NVIDIA GPUs for deep learning.



=== Planning next steps and next meeting 


The milestones planned in the project proposal were:

  - Case study specification at M3. There was significant progress on the textual data access and exploitation plan (PMSI, PRIMEGE), but everything remains to be done on the imaging data (a meeting is planned with the CHUN cardiology department in January).

  - State of the art in LSTM at M6. Edson is working on an LSTM implementation and Frédéric plans further work on LSTM this year. Although no written report is required, we need to assess the progress on the LSTM network bibliography and exploitation.


The next planned milestone is the data access infrastructure setup with the MSI at M12 (June 2018). Work on data access is progressing, but much needs to be done in coordination with the Medical Data Institute of the MSI.


The following indicators have been given to the IDEX project reporting and follow-up officer. Partners are welcome to give more details. In particular please let me know:

  1. What international collaborations do you have in connection with the IADB project?
  2. What publications linked with the project have been produced?
  3. Whether you have other collaborations inside UCA on the IADB project topics.



project aiming to apply to the ERC

Serena Villata submitted an ERC application at the end of 2017


international recruitment: student, PhD student or researcher

Recruitment of a German student and a Tunisian student + 2 Colombians


setting up international cooperation (e.g. hosting a visiting researcher, jointly supervised (co-tutelle) thesis, hosting an international student…)

Collaboration project with the University and the hospital of Da Nang (Vietnam)


setting up a long-term international structure




High-performance computing school on medical data processing (Da Nang)

List of international collaborations?


co-publications across scientific fields

Reminder: acknowledgement in publications, and list to be drawn up

Identify the co-publications


thesis co-supervision

The 3 subjects are planned as co-supervised theses




Target university

collaboration between Idex members

 I3S, Inria, BCL, CHUN, MSI, MDC… Others?


co-publication between Idex members

 See the publications item above


contribution to the EUR

One EUR accepted! IADB is within its thematic scope.




Economic impact

start-up creation



patent filing



commercial exploitation (licences, transfer…)





Leverage effect

co-funding / public funds

Theses, postdocs and engineer funded from other sources (ENS, Labex, ANR)


co-funding / private funds



co-funding / European funds




Important reminder: all publications related to the project should be acknowledged with the following sentence:


This work is partly funded by the French government labelled PIA program under its IDEX UCAJEDI project (ANR-15-IDEX-0001).



IADB includes funding for master's traineeships and for travel to present project papers: do not hesitate to let me know if you have specific funding needs in these areas.


The next meeting will be held on February 19, from 10am to 1pm. Subsequent plenary meetings will be organised in May, September and November 2018.