By applying natural-language artificial intelligence techniques to analyze text fields in health records, researchers have developed an automated approach for classifying the severity of COVID-19 illness among pregnant people. The automated approach could accelerate the processing of surveillance records for pregnant patients, who are at higher risk for severe COVID-19 illness than non-pregnant people infected by the SARS-CoV-2 virus.
Produced in a collaboration between the Georgia Tech Research Institute (GTRI) and the Centers for Disease Control and Prevention (CDC), this technical solution helps address a challenge faced by the CDC, which must rapidly classify illness based on data from electronic forms with free-text information entered by clinical or health department personnel. Because of its variability, the free-text data from each electronic form currently must be reviewed by clinicians.
Text Field Data Useful, but Challenging to Analyze
“Not all information helpful to know about a COVID-19 illness can be captured in the boiled-down coded data that gets entered into forms,” said Charity Hilton, a GTRI research scientist who led the GTRI component of the project. “There can be much more information in the text fields — which may be copied directly from patient charts — that can help understand the broader scope of what is going on. This project will help improve the speed and accuracy of disease classification.”
Providing clarifying information that goes beyond the standardized codes is the purpose of the text fields, but their variability and lack of consistent structure can make them challenging to process and interpret. Natural language processing (NLP), an automated approach using artificial intelligence, can help provide the kind of understanding that would otherwise require human review, extracting the meaning of text to go beyond the simple matching of words, Hilton explained.
Beyond providing additional information to assist with the classification, the NLP solution can validate information provided elsewhere on the forms to catch coding errors or other discrepancies.
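As a rough illustration of how a free-text check can go beyond simple word matching and cross-check a coded field, the sketch below flags records where the note affirms an ICU admission that the coded field does not report. It is not the project's code: the cue phrases, the sentence-level negation rule, and the function names `text_affirms_icu` and `icu_discrepancy` are all assumptions made for this example, and a production system would rely on a clinical NLP model rather than regular expressions.

```python
import re

# Hypothetical cue phrases chosen for illustration only.
ICU_PATTERN = re.compile(r"\b(?:icu|intensive care)\b", re.IGNORECASE)
NEGATION_PATTERN = re.compile(r"\b(?:no|not|denies|without)\b", re.IGNORECASE)


def text_affirms_icu(note: str) -> bool:
    """Return True if the note contains an ICU mention that is not negated."""
    for match in ICU_PATTERN.finditer(note):
        # Find where the sentence containing this mention begins.
        sentence_start = max(
            note.rfind(".", 0, match.start()),
            note.rfind(";", 0, match.start()),
        ) + 1
        prefix = note[sentence_start:match.start()]
        # A negation cue earlier in the same sentence cancels the mention.
        if not NEGATION_PATTERN.search(prefix):
            return True
    return False


def icu_discrepancy(coded_icu: bool, note: str) -> bool:
    """Flag records whose text affirms ICU care but whose coded field does not."""
    return text_affirms_icu(note) and not coded_icu
```

Plain keyword matching would treat "Not admitted to intensive care" as an ICU admission; the negation check is a small example of the extra interpretation the article describes.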
State, Local, and Territorial Health Departments Provide Data
Health departments report information on COVID-19 cases to CDC, including pregnancy status. State and local health departments have the option of providing additional data on pregnant people with COVID-19 and their developing babies. These data are collected as part of CDC’s Surveillance for Emerging Threats to Mothers and Babies Network (SET-NET).
Thirty-two jurisdictions have reported data on the health of individuals with SARS-CoV-2 infection during pregnancy. So far, data from over 71,000 pregnant people with SARS-CoV-2 infection have been reported to SET-NET. COVID-19 severity classification is based on a hierarchy of factors such as intensive care unit (ICU) admission, invasive ventilation, COVID-19 therapies required, and complications. That information is used to classify illness as asymptomatic, mild, moderate-to-severe, or critical.
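A hierarchy of this kind can be sketched as an ordered series of checks in which the most serious indicator present decides the category. The example below is only an illustration: the `CaseRecord` fields and the mapping of each factor to a category are assumptions made for this sketch and are not CDC's actual classification criteria.

```python
from dataclasses import dataclass


@dataclass
class CaseRecord:
    """Hypothetical record; field names are assumptions for illustration."""
    icu_admission: bool = False
    invasive_ventilation: bool = False
    covid_therapy: bool = False
    complications: bool = False
    symptomatic: bool = False


def classify_severity(case: CaseRecord) -> str:
    """Return the first (most severe) category whose indicators are present."""
    # Assumed ordering: checks run from most to least severe.
    if case.icu_admission or case.invasive_ventilation:
        return "critical"
    if case.covid_therapy or case.complications:
        return "moderate-to-severe"
    if case.symptomatic:
        return "mild"
    return "asymptomatic"
```

For example, `classify_severity(CaseRecord(symptomatic=True, icu_admission=True))` returns `"critical"` under these assumed rules, because the ICU check runs before the symptom check.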
Evaluating the Effectiveness of Natural Language Processing
To evaluate the effectiveness of the NLP approach, CDC and GTRI researchers compared the severity classifications it produced against those made through standard human review. They found that the NLP-based classifications agreed with the clinicians’ judgment in 99.4% of the 4,378 COVID-19 cases studied.
“Concordance between approaches was high, validating that automated approaches could reduce the need for clinical review to classify COVID-19 severity,” the researchers wrote in an abstract of a presentation on the project planned for an upcoming conference.
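Concordance of this kind is commonly reported as simple percent agreement: the share of cases in which two methods assign the same label. The sketch below shows that calculation on a few made-up labels; the function name `percent_agreement` and the sample data are assumptions for illustration, and the source does not say how the researchers computed their figure.

```python
from typing import Sequence


def percent_agreement(nlp_labels: Sequence[str],
                      clinician_labels: Sequence[str]) -> float:
    """Percentage of cases where both methods assign the same label."""
    if len(nlp_labels) != len(clinician_labels):
        raise ValueError("Both label lists must cover the same cases.")
    if not nlp_labels:
        raise ValueError("At least one case is required.")
    matches = sum(a == b for a, b in zip(nlp_labels, clinician_labels))
    return 100.0 * matches / len(nlp_labels)


# Made-up labels for four cases; three of the four agree.
nlp = ["mild", "critical", "mild", "asymptomatic"]
clinician = ["mild", "critical", "moderate-to-severe", "asymptomatic"]
agreement = percent_agreement(nlp, clinician)  # 75.0
```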
Analysis Helps CDC Understand Risks to Pregnant People
Information provided by SET-NET helps the CDC formulate recommendations for pregnant people, and the new system will help evaluate data coming into the agency.
“Automated approaches, such as natural language processing, have helped CDC investigators ‘sift’ through thousands of records to determine the level of COVID-19 severity among pregnant people more efficiently,” said Van T. Tong, MPH, who leads the Emerging Threats Team in CDC’s National Center on Birth Defects and Developmental Disabilities. “This work to better understand the increased risks of COVID-19 infection, along with the growing body of evidence supporting the safety and effectiveness of COVID-19 vaccination during pregnancy, was used to support CDC’s message that the benefits of COVID-19 vaccination outweigh any potential risks of COVID-19 vaccination during pregnancy.”
Next Steps in Implementing the Project
The project is largely complete and operating in CDC’s information technology environment. After a few more adjustments, it could soon help CDC analyze data about the effects of the COVID-19 pandemic on pregnant people. The team is working to share the code and a mock dataset on the CDC GitHub. Details of the project are scheduled to be presented at the 11th International Conference on Emerging Infectious Diseases later this year.
Natural Language Processing Has Broad Application
Using information from free-text fields is one of the challenges facing database systems used in health care and other applications, and it’s an area where proven NLP techniques can be especially useful.
“Especially in the clinical case, text data can be a rich source of information,” Hilton said. “Providers, clinicians, and nurses have to put information into the coded sections of forms, but the text fields allow them to provide more detail about a patient and what they are experiencing. They want to provide this information because the coded boxes can’t tell the whole story.”
Examples of information useful to clinicians and policy planners might include context on the patient’s family history, earlier illness, or social dimensions relevant to the treatment and disease outcome.
Project Results from Long-Term Collaboration with CDC
GTRI researchers have collaborated with the Atlanta-based CDC through a long-term initiative designed to support the agency’s overarching Data Modernization Initiative (DMI). Launched in 2020, DMI is a multiyear, billion-plus dollar effort to modernize core data and surveillance infrastructure across the federal and state public health landscape. Now in its third year, the CDC-GTRI collaboration has moved modernization forward by focusing on high-performance computing, health care interoperability, data analytics, machine learning techniques, synthetic data generation, predictive model development, and visualization to identify trends in the vast data sets the agency receives and analyzes.
In addition to Hilton, GTRI researchers Richard Boyd and Jordan Chandler supported the project. At CDC, in addition to Van Tong, the researchers included Suzy Newton, Kate Woodworth, and Lucas Gosdin.
DMI is not just about technology, but about putting the right people, processes, and policies in place to deliver real-time, high-quality information on both infectious and non-infectious threats. The CDC-GTRI partnership is a key piece of CDC’s overall DMI strategy.
See CDC’s Data Modernization Initiative for more information.
Writer: John Toon (john.toon@gtri.gatech.edu)
GTRI Communications
Georgia Tech Research Institute
Atlanta, Georgia USA
The Georgia Tech Research Institute (GTRI) is the nonprofit, applied research division of the Georgia Institute of Technology (Georgia Tech). Founded in 1934 as the Engineering Experiment Station, GTRI has grown to more than 2,800 employees supporting eight laboratories in over 20 locations around the country and performing more than $700 million of problem-solving research annually for government and industry. GTRI's renowned researchers combine science, engineering, economics, policy, and technical expertise to solve complex problems for the U.S. federal government, state governments, and industry.