News
 

C·I·B researchers develop new software package for improving data quality

Three C·I·B researchers, Mark Robertson, Cang Hui and Vernon Visser, have developed a new R package for assessing and improving the quality of datasets consisting of occurrence records.

Use package to identify likely alternative positions for points

The package can be used to identify likely alternative positions for points that represent obvious errors in a dataset.

Museum and herbarium collections provide records of where species occurred, which are often used for mapping biodiversity patterns. These collection datasets are freely available and are becoming easily accessible through portals such as the Global Biodiversity Information Facility (http://gbif.org/). Unfortunately, these datasets contain many errors and suffer from several data quality issues. Despite the large number of users of these datasets, only a few software tools are dedicated to detecting and correcting errors in them.

The package, called biogeo, includes error-detection features such as identifying mismatches between the recorded country and the country where the record is plotted, flagging records of terrestrial species that fall into the sea, and detecting outliers. A key feature of the package is the ability to identify likely alternative positions for points that represent obvious errors in the dataset. It also provides functions to explore records in geographical and environmental space in order to identify possible errors. Functions are also available for converting coordinates that are in various text formats into degrees, minutes and seconds and then into decimal degrees.
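To illustrate the coordinate conversion mentioned above, here is a minimal sketch in Python. It is not the code used in biogeo, which is written in R; the function names parse_dms and dms_to_decimal and the single text format handled here are assumptions made purely for illustration. It shows the two steps described: splitting a text coordinate into degrees, minutes and seconds, and then converting those to decimal degrees.

```python
import re

# Hypothetical pattern for one text format only, e.g. 25°45'15"S.
# Real collection data come in many more formats than this.
DMS_PATTERN = re.compile(
    r"""^\s*(\d+)\s*°\s*(\d+)\s*['′]\s*(\d+(?:\.\d+)?)\s*["″]\s*([NSEWnsew])\s*$"""
)


def parse_dms(text):
    """Split a coordinate string into (degrees, minutes, seconds, hemisphere)."""
    match = DMS_PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognised coordinate format: {text!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    return int(degrees), int(minutes), float(seconds), hemisphere.upper()


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees, minutes and seconds to signed decimal degrees."""
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError("minutes and seconds must be in the range [0, 60)")
    hemisphere = hemisphere.upper()
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError("hemisphere must be one of N, S, E, W")
    value = abs(degrees) + minutes / 60 + seconds / 3600
    # Latitudes cannot exceed 90 degrees, longitudes cannot exceed 180.
    limit = 90 if hemisphere in ("N", "S") else 180
    if value > limit:
        raise ValueError("coordinate is outside the valid range")
    # South and west are negative by convention.
    return -value if hemisphere in ("S", "W") else value


# Example: a made-up point, not taken from any dataset.
latitude = dms_to_decimal(*parse_dms("25°45'15\"S"))
longitude = dms_to_decimal(*parse_dms("28°13'30\"E"))
print(round(latitude, 6), round(longitude, 6))
```

Rejecting out-of-range values at this step, rather than converting them silently, is one simple way that obvious coordinate errors can be caught early.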

The package was developed for the R environment, so some experience with R is useful but not essential. The package comes with a tutorial aimed at first-time users, with examples of how to use the various functions to detect and correct errors in collection datasets.

The package is available from the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/

A paper describing common data quality issues and highlighting the features of the package was published in the journal Ecography.




Read the paper:

Robertson, M. P., Visser, V. and Hui, C. 2016. Biogeo: an R package for assessing and improving data quality of occurrence record datasets. Ecography 39: DOI: 10.1111/ecog.02118.

For more information, contact Mark Robertson at mrobertson@zoology.up.ac.za