Research software is like the tower-building game Jenga: tools built on tools built on tools. When developers change their individual pieces, they can change the behaviour of the software that depends on them, potentially altering results or causing the software to fail.
Version 3.6.0 of the R programming language, for example, introduced a replacement algorithm for generating random numbers. “This and the old algorithm both work, but not the same way. If you ran the same code with an older version of R and a newer version of R and used whatever function it needed to generate a random number, you’d end up getting different results,” says Tiffany Timbers, a data scientist at the University of British Columbia in Vancouver, Canada.
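The effect Timbers describes is not unique to R. As a rough Python analogy (it is not the R 3.6.0 change itself), the standard library's random.seed() accepts a version argument that selects between a legacy and a current seeding algorithm for string seeds. The helper name draw and the seed value below are hypothetical, chosen only for illustration.

```python
import random

def draw(seed, version, n=5):
    """Return n pseudo-random floats using the given seeding algorithm version."""
    rng = random.Random()
    rng.seed(seed, version=version)
    return [rng.random() for _ in range(n)]

SEED = "experiment-1"  # hypothetical seed, used only for illustration

old_style = draw(SEED, version=1)  # legacy seeding algorithm for str seeds
new_style = draw(SEED, version=2)  # current default seeding algorithm

# Re-running with the same algorithm reproduces the same numbers.
print(old_style == draw(SEED, version=1))
# Switching algorithms changes the numbers, even though the seed is identical.
print(old_style == new_style)
```

Both algorithms are valid; the results differ only because the software underneath the analysis changed, which is the situation an environment manager is meant to prevent.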
Among other things, this type of variability can complicate collaboration (see Environmental Testing). In 2020, Mine Çetinkaya-Rundel was working with another author on a statistics textbook, using R and a formatting language called R Markdown to calculate numbers, create figures and format the final document. “We wanted to make sure that we were using the same versions,” says Çetinkaya-Rundel, a statistician at Duke University in Durham, North Carolina, “and also that when we re-render the book, we’re rendering it with a certain version of the packages.” Otherwise, the two authors might have generated slightly different manuscripts.
To address this challenge, they turned to the R package renv, one of a small group of tools that help developers and researchers to manage their computational environments; other options include venv and virtualenv for Python, and conda, a language-independent tool. Most are command-line utilities, although renv is tightly integrated with the RStudio Desktop graphical programming environment. All can help researchers to ensure that their code is reproducible, reusable, documented and shareable.
C. Titus Brown, a bioinformatician at the University of California, Davis, has 187 conda environments on his laptop. Most are one-offs, used to test new tools or to illustrate a point during class. His daily work takes place mainly in a development environment that includes a specific version of Python and other programming tools.
Some tasks, however, require a computational change of scenery. For example, Brown writes blog posts in Markdown, which he transforms into HTML, the standard markup language for web pages. But the code that performs that step doesn’t play well with newer versions of a crucial software library, and older versions conflict with his development tools. To isolate the problem, Brown created a separate environment. “I just pinned the version to something really old that still works and I’m running [the rendering software] here,” he says.
“A conda environment is a computational sleight of hand,” says Johannes Köster, a computer scientist at the University of Duisburg-Essen in Germany, who founded a bioinformatics-focused software repository called Bioconda. “Basically, it’s just changing your system’s path, which is where your system looks for the executable [applications].” You may have multiple versions of a tool installed, but when conda activates a particular environment, your computer can see only the one you want.
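The path trick can be sketched in a few lines of Python. This is a simplified illustration for POSIX-like systems, not conda's actual activation code: the directory names env_old and env_new, the tool name mytool and the helper functions are all hypothetical. Two directories each hold a tool with the same name, and whichever directory comes first on the search path decides which copy is found.

```python
import os
import shutil
import stat
import tempfile

def make_fake_tool(directory, name="mytool"):
    """Create a tiny executable file so that a path lookup can find it."""
    path = os.path.join(directory, name)
    with open(path, "w") as handle:
        handle.write("#!/bin/sh\n")
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path

def resolve(tool, active_env_bin, base_path):
    """Look up `tool` as if `active_env_bin` had been prepended to the path."""
    search_path = os.pathsep.join([active_env_bin, base_path])
    return shutil.which(tool, path=search_path)

with tempfile.TemporaryDirectory() as root:
    # Two hypothetical environments, each with its own copy of the tool.
    old_bin = os.path.join(root, "env_old", "bin")
    new_bin = os.path.join(root, "env_new", "bin")
    for directory in (old_bin, new_bin):
        os.makedirs(directory)
        make_fake_tool(directory)

    base_path = os.pathsep.join([old_bin, new_bin])  # both are installed
    found_old = resolve("mytool", old_bin, base_path)  # "activate" env_old
    found_new = resolve("mytool", new_bin, base_path)  # "activate" env_new

print(found_old)
print(found_new)
```

Nothing is installed or removed between the two lookups; only the order of the search path changes, which is why switching environments is cheap.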
Computational environments offer several benefits, Timbers says. One is reproducibility, the ability to analyze the same data with the same software on the same IT infrastructure to get the same results.
“It can be very frustrating to track the differences between the outputs on different computers,” says Ben Marwick, an archaeologist at the University of Washington in Seattle. Some research projects take years to complete, he notes. And although Marwick prefers newer libraries, his colleagues don’t always update at the same pace. Using renv ensures that he and his collaborators always execute their project code in the same way. The environment description file that renv produces can be version controlled and shared on GitHub. Collaborators can then recreate the environment using the command renv::restore().
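The idea behind an environment description file can be sketched in Python using only the standard library. This is an illustration of the concept, not how renv works internally: the function name snapshot is hypothetical, and the name==version output format follows the convention used by Python's pip rather than renv's own file format.

```python
from importlib import metadata

def snapshot():
    """Return sorted 'name==version' lines for every installed distribution."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.version
        if name and version:  # skip entries with missing metadata
            pins.add(f"{name}=={version}")
    return sorted(pins, key=str.lower)

# Record the exact versions present in the current environment.
lock_lines = snapshot()
print("\n".join(lock_lines))
```

Saving this output alongside the project code, and committing it to version control, gives collaborators an exact list of versions to reinstall.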
Conda is a command-line tool that creates environments and installs software into them. To create a new environment called my_env linked to a specific version of Python, for example, use the command: conda create --name my_env python=3.9
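For Python-only projects, the standard-library venv module offers a comparable first step. The sketch below creates a bare environment in a temporary directory and reads back its configuration file; the helper name create_env is hypothetical, and my_env simply mirrors the conda example.

```python
import os
import tempfile
import venv

def create_env(env_dir):
    """Create a bare virtual environment and return the path to its config file."""
    # with_pip=False keeps the sketch fast and avoids bootstrapping pip.
    venv.create(env_dir, with_pip=False)
    return os.path.join(env_dir, "pyvenv.cfg")

with tempfile.TemporaryDirectory() as scratch:
    cfg_path = create_env(os.path.join(scratch, "my_env"))
    cfg_exists = os.path.isfile(cfg_path)
    with open(cfg_path) as handle:
        cfg_text = handle.read()

print(cfg_exists)
print(cfg_text)
```

Unlike conda, venv cannot install a different version of Python: the new environment uses the interpreter that created it.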
Both renv and conda allow users to install their own tools rather than having to ask system administrators to do it for them. “You don’t need root privileges,” says Rob Patro, a computational biologist at the University of Maryland in College Park. This is a useful feature when working on shared computing resources.
Environment managers also simplify software installation. Scientific software is often released as source code, which may need to be compiled, configured and installed in a specific location. It may have a web of dependencies, written in multiple programming languages, that must be installed in a particular order. Sometimes, says bioinformatician Fredrik Boulund of the Karolinska Institute in Stockholm, the process can be beyond the capabilities of users. “That completely changed when solutions like conda came on the scene,” he says. “Installing a complex set of dependencies is simply asking conda to create an environment based on an environment spec file.”
For the Galaxy project, an open-source framework for reproducible data analysis, these capabilities were a key reason for choosing conda as the project’s software installation manager. Bioinformatician Björn Grüning, who runs the European Galaxy server at the University of Freiburg in Germany, says the Galaxy community began looking for a coherent tool installation strategy around 2015 because its existing manual approach was unsustainable. “Conda ticked all of our requirement boxes,” says Grüning. It doesn’t need root privileges, is independent of the programming language, and uses human-readable package recipes that are easy to understand and maintain. Today there are more than 9,000 bioinformatics tools available to Galaxy users through the Bioconda channel.
Perhaps the greatest benefit of environments, however, is isolation: environments allow researchers to explore new or updated tools while knowing that their existing code will still work.
Elana Fertig, a statistician at Johns Hopkins University in Baltimore, Maryland, describes herself as lax when it comes to environments: “For me, everything goes into one environment.” But larger environments are more difficult to use, because the environment manager must resolve a larger network of dependencies to install new tools. (Conda is notorious for poor performance with large environments, but a drop-in solver called mamba speeds up the process.) Instead, Fertig suggests that her students use one environment per project.
In fact, most of the researchers contacted for this article recommend creating environments to host specific workflows or projects, and doing so from the outset. “Start your project with a package management solution in mind,” says Joshua Shapiro, senior data scientist at the Childhood Cancer Data Lab at Alex’s Lemonade Stand Foundation, based in Wynnewood, Pennsylvania. “It’s got the potential to save you a lot of headaches down the line.”
Tommy Tang, director of computational biology at Immunitas Therapeutics, a biotechnology company in Waltham, Massachusetts, uses dedicated environments for various computational tasks, such as processing data from RNA sequencing or working in Google Cloud.
Users of the Snakemake and Nextflow computational workflow managers can even direct these tools to perform each step in a separate conda environment, says Köster, who leads Snakemake development. “Make them as fine-grained and disposable as possible,” he advises. In addition to being easier to maintain, he explains, small environments are also more transparent. “People who want to understand what the analysis actually did immediately see which software stack was used for which step.”
However, environments can’t do everything. Tools written in languages such as C, Perl and Fortran can be difficult to encapsulate in environments, and dependency differences can make it difficult to port environments between operating systems. In such cases, users can try software containers, such as those from Docker and Singularity.
Containers, which essentially package a tool with its underlying operating system, are bigger and more complicated than environments, but they’re more portable. They’re also easier to share, because while an environment may contain thousands of files, a container has only one. On high-performance systems where jobs may run in parallel on hundreds of processing cores, transferring many small files can impact performance.
Computational environments, Timbers says, are “the forgotten child of reproducibility”. Journals increasingly require code and data to accompany manuscripts, but full reproducibility also requires knowledge of the environment in which the code was run. “It’s the elephant in the room,” she says.