Editorial Assessment
The authors present a seven-part taxonomy aimed at identifying how AI can be integrated into workflows in data repositories. The taxonomy focuses on functions such as ‘acquire’, ‘validate’, ‘enhance’, and ‘organize’. The authors suggest that the piece is meant for the community of librarians and Open Science practitioners working with data repositories. The piece was reviewed by two metaresearchers. Both emphasized the potential value of the proposed taxonomy, but also noted a number of areas in which the paper needs substantial improvement. The reviewers reported struggling with what they saw as a lack of clarity and practical applicability of the taxonomy. They emphasized the need for more detailed explanations of the various functions, supported by case studies or examples of how AI is currently used in repositories. A second main point raised by both reviewers was the lack of a section explaining how exactly the taxonomy was developed. This makes it hard to see how and why the authors arrived at the seven functions and what knowledge or insights they are based on: previous literature, or the authors' practical work experience? This also ties in with a third point raised by the reviewers, namely the lack of references. Finally, reviewer 2 in particular felt that the concluding section on balancing human and AI expertise is interesting, but noted that it currently feels somewhat disconnected from the main part of the text.
The reviewers and I agree that the paper can potentially become a valuable contribution to the metaresearch literature related to data repositories, provided that the authors address the various criticisms raised by the reviewers.
Competing interests: none
Editors
Handling Editor
Wolfgang Kaltenbrunner
Senior Editor
Kathryn Zeiler
Peer Review 1
General review
The article describes a taxonomy intended to facilitate the use of AI in data repositories. The taxonomy is similar to other descriptions of the data or research life cycle and is not particularly AI-specific, although brief descriptions of how AI could be used at each stage are included. The article is not currently written in the style of a journal article and does not engage with the current literature or research within the field, which would help to explain the importance or value of this specific taxonomy.
A literature review should be added, and references should be cited throughout using an inline referencing style rather than hyperlinks. A reference list should be included at the end of the article.
It is not clear how the 7 areas for AI in data repositories were decided. Is this opinion, based on current literature, based on user experience…? Without knowing this, it is hard for the reader to understand how justified the categories are. Towards the end of the paper, 5 categories of AI involvement are discussed, but these do not appear to be part of the framework, although they seem to be the most AI-specific part. I’d suggest that the taxonomy could benefit from including more AI-specific terms.
The article could be improved by more detail about how this framework could be used in practice (or is currently being used). It is unclear whether this is just a suggestion of shared language or something more formal like CRediT or MeSH.
The Balancing AI and Human Expertise in Data Repositories section gives a nice overview of some of the concerns; however, this section needs to be referenced, as there have been many discussions in the literature about the importance of ‘humans in the loop’.
The ‘three suggestions to promote trust and transparency’ at the end are interesting but come rather out of nowhere. These should be tied into the body of the paper. How are they related to the taxonomy?
Specific suggestions
Engagement with the literature would help to show why this taxonomy is important or relevant. Some areas which the literature review could cover are:
How is AI being used with repositories at the moment? (or not)
What is the importance of standard taxonomies?
Are there other examples of AI taxonomies?
Are there other ways of describing AI roles which are currently being used and why are they not sufficient?
Each of the taxonomy descriptions includes examples of how AI could benefit that role, but they would be greatly enhanced by giving more detail about projects where this is already happening or explaining where (and why) it isn’t happening. A specific example of this is within the Organize section, where a useful project to discuss might be the Library of Congress metadata labelling project.
A more comprehensive introduction to the project and what you are trying to do would be helpful. Readers may not be specialists in either AI or digital repositories.
The introduction should also explain the current situation and the problem that this taxonomy is trying to solve. Possible future benefits are described, but the article does not discuss what else is currently happening in this space.
“Just as AI can revolutionize other forms of scholarly communications like peer-reviewed publications” – the reference justifying this is an editorial; a research paper would be more authoritative. There have been many articles arguing the positives and negatives of AI within the field of scholarly communication, and this should be discussed more thoroughly.
“it can bring significant improvements to data repositories” What are these improvements? Has anyone done this yet or is it only theoretical?
“As AI becomes more integrated into data repository workflows” Is it becoming more integrated into these workflows? Or do you mean “if”?
Peer Review 2
This commentary (or perhaps it is more similar to a blogpost) proposes a taxonomy of tasks for which AIs could be used within workflows in data repositories. Although it is not the stated aim, it also reflects on the relationship between AIs and humans in these workflows and on how to develop trust and transparency in a repository while using AIs.
The idea of having a taxonomy which can be used to spur discussions or to classify the roles of AI in repositories is useful, and the taxonomy roles themselves are sensible, although I am not sure how feasible they actually are at the moment. This is perhaps related to a lack of clarity about whether the capabilities for AIs to be used in these ways already exist, or whether this is an imagined future. It is also not clear which AIs the authors are referring to: is this targeted at LLMs specifically? At other AIs? I imagine this is important in identifying the potential tasks which AI could perform.
Although the authors provide examples of possible tasks for each part of the taxonomy, I still find it a bit abstract and vague. Maybe it would help to include a case study or example of a repository carried through the steps to demonstrate how this looks/could look in practice? Some of the statements introducing each section of the taxonomy could also be a bit clearer, especially in terms of the target audience. Overall, it is unclear to me who the exact target audience is: is this aimed at all repositories or just generalist repositories, given that it was developed within GREI? The latter part of the article refers to generalist repositories, but the earlier part (the taxonomy) does not. This raises the question of whether these steps would apply to all types and sizes of data repositories, or just to larger “generalist” ones. Some of the names for the taxonomy tasks may also be confusing, e.g. “share”, which usually refers to the act of researchers sharing/depositing data in repositories rather than to repositories making data available and facilitating reuse.
One of the biggest points for improvement is that it is not very clear how the taxonomy was developed. The authors mention that it is based on other taxonomies (but they do not provide links or references) and on their “coopetition efforts” within the GREI consortium. What did these activities entail exactly? There is a link in the acknowledgements section to the authors of a very similar taxonomy (actually a very similar piece overall) that was developed for publishing workflows. Why is this not referenced earlier in the article, or the relationship between the two pieces described? Overall, the referencing should be improved and standardized (although again, how this is done depends a bit on how the authors envision the future of this work); there is currently no reference list at all.
It also seems as if the taxonomy part of the article is disconnected from the latter sections (aside from the conclusion). I actually find the section on balancing human and AI expertise to be quite valuable as a contribution in its own right, as it makes clear that using AIs is not an all-or-nothing proposition in repositories’ workflows. I think it would help if the authors could at least let readers know in the introduction that this section, and the one on trust, are part of the article. Or perhaps there is room for restructuring the argument here and somehow foregrounding the human/AI section.
Other reflections
It is a bit difficult to review this piece, not knowing whether it is intended to be a commentary, blogpost, or other type of article.