Article

Abstract:

Taxonomic names associated with digitized biocollections labels have flooded into repositories such as GBIF, iDigBio and VertNet. The names on these labels are often misspelled, out of date, or present other problems, as they were often captured only once during accessioning of specimens, or have a history of label changes without clear provenance. Before records are reliably usable in research, it is critical that these issues be addressed. However, still missing is an assessment of the scope of the problem, the effort needed to solve it, and a way to improve effectiveness of tools developed to aid the process. We present a carefully human-vetted analysis of 1000 verbatim scientific names taken at random from those published via the data aggregator VertNet, providing the first rigorously reviewed, reference validation data set. In addition to characterizing formatting problems, human vetting focused on detecting misspelling, synonymy, and the incorrect use of Darwin Core. Our results reveal a sobering view of the challenge ahead, as less than 47% of name strings were found to be currently valid. More optimistically, nearly 97% of name combinations could be resolved to a currently valid name, suggesting that computer-aided approaches may provide feasible means to improve digitized content. Finally, we associated names back to biocollections records and fit logistic models to test potential drivers of issues. A set of candidate variables (geographic region, year collected, higher-level clade, and the institutional digitally accessible data volume) and their 2-way interactions all predict the probability of records having taxon name issues, based on model selection approaches. We strongly encourage further experiments to use this reference data set as a means to compare automated or computer-aided taxon name tools for their ability to resolve and improve the existing wealth of legacy data. © 2016 Zermoglio et al. 
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
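
The abstract distinguishes name strings that are already valid, names that are outdated synonyms, and names that are misspelled but still resolvable. The following is a minimal sketch of how such a triage might look; it is not the authors' method. The reference list `VALID_NAMES`, the table `SYNONYMS`, the function `resolve_name`, and the similarity cutoff are all hypothetical, and the fuzzy step uses the similarity ratio from Python's standard `difflib` module rather than any algorithm named in the paper.

```python
from difflib import get_close_matches

# Hypothetical reference list of currently valid names (illustrative only).
VALID_NAMES = ["Puma concolor", "Turdus migratorius", "Rana temporaria"]

# Hypothetical synonym table mapping outdated names to valid ones.
SYNONYMS = {"Felis concolor": "Puma concolor"}


def resolve_name(verbatim, cutoff=0.85):
    """Return (status, resolved_name) for one verbatim name string."""
    # Collapse stray whitespace, a common formatting problem on labels.
    name = " ".join(verbatim.split())
    if name in VALID_NAMES:
        return ("valid", name)
    if name in SYNONYMS:
        return ("synonym", SYNONYMS[name])
    # Near match: treat the string as a possible misspelling of a known name.
    known = VALID_NAMES + list(SYNONYMS)
    candidates = get_close_matches(name, known, n=1, cutoff=cutoff)
    if candidates:
        best = candidates[0]
        # If the closest known name is itself a synonym, follow it through.
        return ("misspelling", SYNONYMS.get(best, best))
    return ("unresolved", None)


if __name__ == "__main__":
    for raw in ["Puma  concolor", "Felis concolor", "Turdus migratorus", "Xyz abc"]:
        print(raw, "->", resolve_name(raw))
```

In a real workflow, any automated match would still need human vetting of the kind the abstract describes, since a high string similarity does not guarantee that two names refer to the same taxon.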

Record:

Document: Article
Title: A standardized reference data set for vertebrate taxon name resolution
Author: Zermoglio, P.F.; Guralnick, R.P.; Wieczorek, J.R.
Affiliation: Departamento de Ecología, Genética y Evolución, Instituto IEGEBA (CONICET-UBA), Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina
Institut de Recherche sur la Biologie de l'Insecte, UMR 7261 CNRS, Université François Rabelais, Tours, France
University of Florida Museum of Natural History, University of Florida at Gainesville, Gainesville, FL, United States
Museum of Vertebrate Zoology, University of California, Berkeley, CA, United States
Keywords: cladistics; controlled study; driver; experimental model; human; information processing; nomenclature; probability; statistical model; taxon; validation process; vertebrate; algorithm; animal; biodiversity; classification; factual database; geography; nomenclature; procedures; reference value; reproducibility; vertebrate; Algorithms; Animals; Biodiversity; Classification; Databases, Factual; Datasets as Topic; Geography; Probability; Reference Values; Reproducibility of Results; Terminology as Topic; Vertebrates
Year: 2016
Volume: 11
Issue: 1
DOI: http://dx.doi.org/10.1371/journal.pone.0146894
Journal title: PLoS ONE
Abbreviated journal title: PLoS ONE
ISSN: 19326203
CODEN: POLNC
Record: https://bibliotecadigital.exactas.uba.ar/collection/paper/document/paper_19326203_v11_n1_p_Zermoglio

Citations:

---------- APA ----------
Zermoglio, P.F., Guralnick, R.P. & Wieczorek, J.R. (2016). A standardized reference data set for vertebrate taxon name resolution. PLoS ONE, 11(1).
http://dx.doi.org/10.1371/journal.pone.0146894
---------- CHICAGO ----------
Zermoglio, P.F., Guralnick, R.P., Wieczorek, J.R. "A standardized reference data set for vertebrate taxon name resolution." PLoS ONE 11, no. 1 (2016).
http://dx.doi.org/10.1371/journal.pone.0146894
---------- MLA ----------
Zermoglio, P.F., Guralnick, R.P., Wieczorek, J.R. "A standardized reference data set for vertebrate taxon name resolution." PLoS ONE, vol. 11, no. 1, 2016.
http://dx.doi.org/10.1371/journal.pone.0146894
---------- VANCOUVER ----------
Zermoglio, P.F., Guralnick, R.P., Wieczorek, J.R. A standardized reference data set for vertebrate taxon name resolution. PLoS ONE. 2016;11(1).
http://dx.doi.org/10.1371/journal.pone.0146894