This course will address technical and soft skills in managing computer-based research. It is intended for both technical and newly technical student researchers from a broad array of disciplines across campus. Much of the curriculum is drawn from the collaborative, open-source Software Carpentry and Data Carpentry programs and has been taught thousands of times to an international audience.
This course will cover data management; process management and task automation; and research project management. Topics will include data organization in spreadsheets, document revision control, data visualization, robust testing strategies, and reproducibility. Best practices and “good enough” practices (1, 2, 3, 4, 5) will be employed and emphasized throughout.
We want you to be able to do two things:
The course will be available on the campus Moodle page.
This course is suitable for students from across campus who intend to work with data computationally. For the Sp21 pilot, we expect students to have some prior programming experience, but no particular language or platform is assumed.
Computing and data-intensive analysis have become ubiquitous across domains, and the impact of instrumentation and computation is only likely to increase over the coming decades. At the same time, a growing focus on reproducibility in research calls for more automated processes and better methods of incorporating data and computing environments. Research should be repeatable and reproducible, and the use of computing should lend confidence to scientific results. Yet courses and projects requiring these skills often presume that students already have them, or that students will pick them up informally on their own. Few make a point of explicitly teaching these important skills, and effectively none focus on them strategically enough to prepare students adequately for their work.
Students may be told to “work smarter, not harder,” but it is often unclear what this should mean in a particular domain context. This course proposes to teach students fundamental skills and applications for using computing to make their research process organized, automated, and reproducible. It is intended for students at the commencement of their research program, and will serve as a complement to their advisor’s guidance on how to conduct a research program appropriately within their domain specialty.
Researchers should consider themselves not only as manipulators of scientific data and producers of insights, but also as makers of tools; their products will be useful both directly and for training future generations of researchers. Sensible, scalable workflows and tools are indispensable to modern research. For instance, the growth of data sets in size and complexity motivates script-based, rather than spreadsheet-based, workflows. Many aspects of computer-based research management have been worked out but remain relatively cloistered from standard undergraduate and graduate practice. This course aims to communicate these skills across domains and across campus.
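As a small illustration of what a script-based workflow looks like, the sketch below (using made-up measurements, not course data) computes a summary statistic that would otherwise live in a hidden spreadsheet formula; the script version can be read, reviewed, version-controlled, and rerun:

```python
import csv
import io

# Hypothetical measurements that might otherwise sit in a spreadsheet.
raw = """sample,mass_g
A,1.20
B,1.35
C,1.10
"""

# Parsing and summarizing in code keeps every step of the analysis
# visible and repeatable, unlike an opaque cell formula.
rows = list(csv.DictReader(io.StringIO(raw)))
mean_mass = sum(float(r["mass_g"]) for r in rows) / len(rows)
print(f"mean mass: {mean_mass:.3f} g")  # prints "mean mass: 1.217 g"
```

The same pattern scales from three rows to three million without manual copy-and-paste, which is the core advantage over spreadsheet-based workflows.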
As code penetrates research, research itself becomes liable to the flaws of code. Consider, for instance, the Willoughby–Hoye Python scripts used to process NMR chemical shifts. It was recently reported that the scripts could produce different results on different platforms because they relied on behavior that is formally undefined: the order in which the operating system lists files in a directory. Hundreds of papers have been affected and perhaps invalidated. There is also an ongoing replication crisis in many disciplines, with calls for researchers to transparently provide data and code, indicating a broader unease with the non-reproducible, non-repeatable research practices that have persisted. Teaching researchers from day zero to prioritize reproducibility, rigorously test code, and manage data will enable them to demonstrate confidence in their results and streamline their analyses.
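The Willoughby–Hoye flaw is widely reported to stem from Python's `glob.glob()`, which returns files in whatever order the operating system and filesystem provide, with no ordering guarantee from Python itself. A minimal sketch (with hypothetical file names) of the problem and the one-line fix:

```python
import glob
import os
import tempfile

# Create a few stand-in data files in a scratch directory.
tmp = tempfile.mkdtemp()
for name in ("b.out", "a.out", "c.out"):
    open(os.path.join(tmp, name), "w").close()

# glob.glob() returns matches in a platform-dependent order; on one
# machine this list may differ from the same call on another machine.
unsorted_files = glob.glob(os.path.join(tmp, "*.out"))

# The fix: sort explicitly, so every platform processes the files in
# the same deterministic order.
sorted_files = sorted(glob.glob(os.path.join(tmp, "*.out")))
print([os.path.basename(f) for f in sorted_files])  # ['a.out', 'b.out', 'c.out']
```

A result that silently depends on file-listing order is exactly the kind of hidden assumption that the testing and reproducibility practices in this course are meant to surface.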