Meet the authors: Rita González-Márquez, Philipp Berens, and Dmitry Kobak

2024 · 0 citations · Patterns · Open Access

Abstract

In their recent publication in Patterns (González-Márquez R., Schmidt L., Schmidt B.M., Berens P., Kobak D. The landscape of biomedical research. Patterns. 2024; 100968), the authors present a 2D atlas of the entire English biomedical literature.

Rita: I have always been a curious person and I think curious people often end up in research since it is a career that allows you to learn new things and explore in depth what interests you. I am very lucky to have found supervisors who have been incredible guides. I have learned a lot from Dmitry and Philipp over the last few years.

Philipp: The intellectual challenge to find out how the brain works has been my main motivation to become a researcher. I had excellent mentors, Andreas Tolias and Matthias Bethge, and equally important, excellent colleagues, like Alexander Ecker or Tom Baden. Peer mentorship is something not to be underestimated.

Dmitry: As far back as I can remember, I always wanted to be a scientist, and was mainly driven by romantic ideals about science, knowledge, and human understanding. What took me a long while was to find an area of research where I could contribute: I switched from computer science to theoretical physics, then to neuroscience, and then almost back to computer science again. But I never doubted that I wanted to do research.

Rita: One important thing is to be critical and always question your results, which is arguably a good attribute for a scientist in general, not only a data scientist. Even when the results look the way you expected, sanity checking is always good. I am surprised how many times I found that something was off, or discovered something I did not expect, when double-checking results that were apparently normal.

Philipp: Of course, it is important to have strong technical skills, but these alone are not sufficient. You need to have experience with data and good intuition about what to look at and what hypothesis to form. Rather than fitting the most complicated model, it is often more important to visualize the data well.

Dmitry: I would add that one also needs to have a good grasp of the specific application field, which is often challenging in itself. A data scientist working in neuroscience needs to understand neuroscience, and similarly with other fields. A data scientist doing actual scientific research is not a jack of all trades.

Rita: I met Dmitry and Philipp when they were teaching an introductory machine learning lecture I attended. I really enjoyed the lecture and wanted to learn more about the topic, so I started a research project with them. I found the research done in the group exciting and I liked the team so much that I ended up staying in the lab throughout my master’s thesis and now my PhD. We are a very diverse group, with many women and people from all over the world, which contributes to creating a great atmosphere.

For the biomedical landscape project, we will continue to update the data in the interactive visualization yearly using the PubMed annual releases. As for my next project, I would like to keep studying representation spaces of textual data. In this project, we did not develop any novel method to visualize collections of text but used existing ones. In my next project, I would like to work on ways of improving those existing models to produce optimal representations for visualization. In particular, I would like to fine-tune large language models to produce representations targeted at exploratory data tasks, such as visualization, but also retrieval or clustering.

Rita: One of the difficulties that we faced was related to the data. Data are imperfect and irregular by nature.
Therefore, it is challenging to foresee all possible cases and exceptions and account for them when you design an algorithm. In our case, this occurred, for instance, during the data parsing process. It required a lot of manual exploration and repeated rounds of re-parsing the data when we discovered that we were not accounting for one specific case that should have also been included. Another challenge we faced was related to the interpretability of the results. In this project, we worked with a transformer model (PubMedBERT), which is a very large and complex deep neural network. Models of this kind are not completely understood and therefore it is often hard to explain why results look like they do. In our case, for example, it was sometimes unclear why a specific set of papers was grouped together in one cluster and not, for instance, in two or three. We used methods and metrics to try to gain insight into some of the aspects driving these behaviors, but our capacity to understand them is limited due to the inherent “black-box” nature of these models.

Rita: One of the things that surprised me the most was how heterogeneous the gender distribution was across disciplines and how fine-grained this duality was inside single disciplines. I had similar expectations for the distribution of affiliation countries and was again surprised by its heterogeneity. Besides, I was not familiar with research fraud and its extent before this project. I was shocked to learn how big a problem this poses for science nowadays and what a negative effect it has on particular disciplines, where reproducing results is extremely costly, yet researchers can no longer trust existing research due to massive amounts of fabricated studies stemming from paper mills.

Rita: I found it very interesting how fine-grained the gender distribution inside disciplines was. We showed a couple of examples in the paper, but there were many more that we did not report.
For instance, there was a reverse example of the female island inside the surgery discipline discussed in the paper. In nutrition, a female-dominated discipline, there was a predominantly male island that focused on nutritional supplements for maximizing muscle gain, endurance, and training performance. I find these sorts of examples fascinating since they are very concrete cases that clearly reflect how ingrained gender biases are in science and our society in general.

Rita: Data science has been an integral part of natural language processing, gaining significant relevance in recent years. With the development of large language models that can be trained on very large amounts of text, it has become increasingly important to find ways of collecting and analyzing these massive sets of training data. The quality of the training data significantly influences the model’s performance, contributing to undesirable behaviors like memorization or learned biases (“garbage in, garbage out”). That is where data science comes into play. If we want to develop excellent models, we need to make sure that the data we are using are high-quality data, and that can only be achieved by analyzing and curating them.

Philipp: I am trying to form an interdisciplinary and diverse team, where people with different backgrounds can bring in their skills and are motivated to learn and improve. I like to see PhD students and postdocs mature not only as scientists but also in their personalities.

Philipp: We hold career development meetings once a year to reflect on the past year and figure out what the personal goals of each lab member are and how best to achieve them. Usually, we go for a walk and touch upon a lot of different topics. It helps me to adjust my mentoring style in the right direction, and it helps students as well.

Philipp: Not only is everyone different, but people also develop, so that one must adjust the mentoring style over time.
At the beginning of a PhD thesis, I usually try to have a clear project for each student with close mentoring. Later, PhD students bring in more of their own ideas and skills. More than the differences between people, it is this development of maturity that I am most amazed by.

Philipp: I would worry less about the latest hyped technique that one supposedly has to learn, and rather focus on essentials and on questions you actually care about.

Dmitry: I am subscribed to daily emails from Google Scholar with “recommended articles”: these are papers that Google Scholar considers sufficiently similar to the papers in my own profile. Apart from that, I find Twitter very helpful: I use it only for academic communication and often learn about new and exciting research there. Twitter is more fun, but Google Scholar gives me a less biased sample of new relevant papers.

Dmitry: I may be biased because I am working on data visualization myself, but I do think that data visualization of large and high-dimensional datasets is one of the pressing topics that has become increasingly relevant in recent years. More and more disciplines have to deal with large data collections: collections of texts (like in our work here), collections of images, collections of audio recordings, etc. One of the jobs of the data science community is to develop tools to handle, explore, and curate such datasets.

Dmitry: It actually began as a side project. At the time, we were mostly working on visualizing biological single-cell data, and I thought that visualizing a large document collection would be an interesting challenge. So, I suggested Rita try to visualize the contents of the PubMed library, as it was the largest dataset we could find. With over 20 million English abstracts, it is an order of magnitude larger than any available single-cell dataset. Initially, I thought it would be a fun project to play around with for a couple of months. But here we are, two years later, still working on it!
This project turned out more interesting and more important than I initially expected, and we simply could not stop.

Dmitry: Of course, it was Rita, who is the first and the main author. But I would like to take this opportunity to highlight the other two co-authors: Luca Schmidt did the first experiments with large language models during a rotation project in our lab. And Ben Schmidt, a scientist from Nomic AI, developed an amazing interactive visualization allowing us to explore our entire collection of 20 million papers. This interactive website was very helpful for us when working on the paper, and I am sure it will be very helpful in the future for many others. That Luca and Ben share the same last name is, by the way, a complete coincidence!

Dmitry: We do not have any specific recommendations, but we do hope that publishers and policymakers will find our visualizations, or similar ones, useful for identifying paper mill products and for combating scientific fraud and misconduct.

Rita: Our map can be used to explore any research question one is interested in. We showcased its usefulness in different applications, ranging from exploring a specific field, e.g., the COVID-19 literature, to general aspects, e.g., the gender distribution in the biomedical literature. With the functionalities of the interactive web version, such as searching by title, journal, author names, or PubMed ID, one can leverage the map for exploring any specific question.

Dmitry: Indeed, we believe that we have only scratched the surface of the metascientific questions that our 2D map allows us to approach. We already have multiple ideas for follow-up studies, but we would be most happy if other researchers interested in metascience found our map useful. To make it easier for others to build on our work, we have shared all our analysis code and also all raw, intermediate, and final data, and we invite everybody to use it.

Benjamin M. Schmidt is Vice President of Information at Nomic AI.
About the authors

Rita González-Márquez has a background in physics and computational neuroscience. She is currently a doctoral candidate studying machine learning and computational neuroscience. She is interested in representation learning with a focus on dimensionality reduction and visualization techniques. In particular, in recent years she has been working on representations of textual data for data exploration applications.

Philipp Berens studied bioinformatics and philosophy before doing a PhD in computational neuroscience. He is interested in working with real-world data and developing new algorithms for clinical workflows. He is currently a professor of data science at the University of Tübingen and is the director of the Hertie Institute for AI in Brain Health in Germany.

Dmitry Kobak studied computer science and physics, then did a PhD in computational neuroscience. His current work focuses on statistical data analysis and machine learning. He is currently a research group leader at the University of Tübingen, in Germany.

The landscape of biomedical research. González-Márquez et al., Patterns, April 9, 2024. In brief: This study presents a 2D map based on the abstracts of biomedical research articles from PubMed. Containing 21 million English articles, this map highlights several publishing issues, including gender bias and fraudulent research.

Topics: Artificial Intelligence in Healthcare and Education; Biomedical and Engineering Education
