Unstructured Medical Data Transformed Into System For Planning Scenarios of Clinical Trials
Let’s say a company wants to introduce a new drug to the market. How do they plan the experiments that will determine if the drug is effective and safe?
How big does the sample need to be? What do they need to look for? How do they define the inclusion and exclusion criteria? These are not easy questions, but they need to be answered if the company is focused on achieving the best result.
The client came to us with a very ambitious goal in mind. The goal was to create a recommendation system for planning scenarios of clinical trials. They had a vision of using data mining to make the whole process a lot faster, more reliable and more affordable.
The client’s team of programmers and designers had already created a big part of the system before they reached out to Stermedia. What they lacked was machine learning experience and familiarity with natural language processing. A human can not easily make use of hundreds of thousands of existing protocols documenting past clinical trials; this amount of knowledge first needs to be structured and systematized before it can be useful. The Stermedia team was responsible for testing, recommending and implementing state-of-the-art solutions that helped with this enormous task.
Our team worked on this project for almost 9 months. Structuring a big database of medical documents is quite a vague task, so during this time, we focused on several smaller goals that brought the project closer to realization.
- Named Entity Recognition
We started with information extracting – this means searching the document for the information that is needed. For example, we might be interested in finding a specific concept in the text, like a drug name. This problem is known as named entity recognition (NER).
The solutions our team chose are based on the modern “transformer” architecture called BERT. What BERT gives us, is the ability to represent words as numbers in a smart way. Words with a similar meaning are assigned similar vectors. The transformer also looks at the surrounding context, which prevents it, for example, from confusing different meanings of one word.
We used a large database of clinical trial protocols to tune BERT to our use case. Then, we trained several neural networks for recognizing concepts in raw text. But training the models manually is very time consuming, so Stermedia also prepared a mechanism to make this process easier for the client. Moreover, we created a method that helps in the automation of creating datasets necessary for teaching the model to recognize new concepts.
The quality of Stermedia’s work is not their only asset: they also have excellent communication standards and are always on time. It never seemed like there were 8 time zones between us! They provided a tremendous amount of value and were real team members of our company.
- Hand-made rules
Sometimes the information extraction process doesn’t need to involve heavy mathematical models. In some cases, we can achieve a lot by automating our own thought process. For example, some professional medical words almost always appear in certain types of texts. Another reason is that many types of phrases have a rigid structure, thus making them easy to describe and find using regular expressions combined with part of speech detection.
For instance, we can imagine many variants of the phrase ‘number of people with’. Different words can be used in place of ‘number’ and ‘people’, but if we think about it beforehand, we can create a more or less complete list of those that actually appear in paragraphs we have to handle. Once we detect this generalized phrase, we can reasonably expect the noun phrase following it to be the thing these ‘people’ are dealing with in our context.
This approach required us to spend some time reading and understanding the text we were processing. The knowledge we gathered this way gave us a lot of ideas for hand-made rules describing what we had learned. Our team not only created and tested a lot of these rules, but also prepared a flexible Python framework that allows the client to easily add as many as they like, using their own specialistic insight.
Developing tools and preparing models for the project was a challenge our team really enjoyed. Ultimately, not only did we help the client accomplish their goal but also learned many valuable lessons about state of the art in NLP.
- Text classification
Another capability that we included in our system, is assigning a class to a passage of text. This can be used in multiple scenarios, for example deciding which study objective -among a known list of choices- is relevant for a protocol.
Both the choice of words and the entire context of the paragraph are important to properly determine which class it should be assigned to. Therefore, in a similar fashion to the NER solution, we decided to use a BERT based model using context-sensitive representation to perform classification.
We used a variety of hand-made rules based on regular expressions and other parsers to create an initial, labelled dataset that would serve as a basis for further improvements by domain experts. Our model, derived from the pretrained neural network, was fine-tuned on this data. As before, the whole process was incorporated into our framework – simplifying the process of preparing and training this architecture for other classification tasks.
The project was oriented towards delivering tools that would make the future analytical process easier for the client. To summarize the previous sections, our main contributions were:
- testing and recommending solutions to the client’s NLP tasks;
- preparing several microservices for easy training and making predictions with named entity recognition and text classification models;
- creating a mechanism for transforming raw text into a training data set for the named entity recognition task;
- developing an easily expandable framework for using hand-made rules to extract information from text.
In addition, we experimented with many other techniques, like text-generating models, coreference resolution, and relationship extraction. In the process of creating the aforementioned tools and verifying our approaches, we also trained a number of models on experimental data sets. These can be further improved upon by obtaining data of higher quality, but the client can start using them right off the bat as well.
Python, jupyter, pytorch, transformers, flair, spacy, nltk, celery + redis, docker, docker-compose, luigi
Our solution is dedicated to the biggest pharma and biotech companies around the world. During a period of cooperation lasting almost a year, data scientists from Stermedia prepared an easy to use the framework that simplifies the deployment of new analytical tasks and participated in the design and implementation of analytical solutions. They prepared a set of heuristics for identifying concepts in medical text. Their NLP (natural language processing) expertise let them recommend, test, and successfully implement several state of the art models for solving NLP problems. Moreover, the data science team suggested suitable NLP architectures taking into account feasibility, ease of use, performance, and applicability to the tasks to be solved. Our colleagues from Stermedia showed versatility with Python programming, machine learning, and infrastructure development skills used throughout the project – also by creating several microservices for training and serving NLP models. It wasn’t only data science that was a great help to us though – the Stermedia scrum team supported us in software development of web application: frontend in Angular, backend in Python – rewriting a part of the code as well as building new functionalities and microservices deployed on docker. The quality of Stermedia’s work is not their only asset: they also have excellent communication standards and are always on time. It never seemed like there were 8 time zones between us! They provided a tremendous amount of value and were real team members of our company.