When Kimmo Vehkalahti introduces himself as a statistician at his lectures, he can sense the horror in the audience. For many students, the word brings back an aversion to some compulsory statistics course from their past. This is why Vehkalahti prefers to introduce himself as a data scientist, which makes the audience consider him "cool".
At the University of Helsinki, Vehkalahti has coordinated several popular MOOCs, or massive open online courses, one of which, Open Data Science, is revolutionising the way in which research in the humanities and social sciences is conducted.
Vehkalahti’s office on Unioninkatu stands out thanks to its colourful stacks of books and posters on art, architecture, psychology, biology and other sciences.
“As a method scientist, I get to be in with all the sciences, serving as one of the infamous ‘all kinds of docents’,” this docent of applied statistics chuckles.
Two of the books stand out: Factfulness by the late Hans Rosling and Multivariate Analysis for the Behavioral Sciences by Vehkalahti and Brian S. Everitt.
The former is a bestseller by Vehkalahti’s Swedish hero, which IT billionaire Bill Gates and Barack Obama, former president of the United States, named as one of their favourites. The latter is fresh off the press, a book Vehkalahti considers the most important of his career.
“Brian S. Everitt is a legend in the field of statistics, and it’s totally unbelievable that I got to write this book with him,” Vehkalahti enthuses.
New methods take research to the next level
Open data, open science and data science: Vehkalahti is toying with the name of a course he is organising on the hottest trends in science. The course trains social scientists and humanists in coding, modelling and programming, as well as in adopting the tools of open science and its way of thinking.
“In recent years, the nature of knowledge has been shaken to its core, and this new situation requires researchers to update their skills. For example, huge datasets and interesting data can be gained from social media, but this data cannot be analysed with traditional statistical methods.”
Data collected from Twitter discussions before an election is a productive source of information for political scientists, while those conducting consumer research may be interested, say, in Sipsikaljavegaanit (‘Crisp-beer-vegans’), a Facebook group for discussing food. If such data – thousands or millions of messages – is examined one by one, failure is certain.
Retrieving such data requires an application programming interface (API), which in turn requires some understanding of programming. After the data has been downloaded, it must be converted into numerical or another usable form, which calls for software such as R.
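As a rough illustration of what converting messages into numerical form can mean, the sketch below counts word frequencies in a handful of made-up messages. It is written in Python purely for illustration (the tool discussed in this article is R), and the sample messages as well as the `tokenize` and `word_counts` helpers are hypothetical; they are not taken from the course material or from any real API.

```python
import re
from collections import Counter

# Hypothetical messages standing in for data downloaded through an API.
messages = [
    "Vegan crisps and beer tonight!",
    "Which beer goes with vegan pizza?",
    "Crisps, crisps, crisps.",
]


def tokenize(text):
    """Lowercase a message and split it into alphabetic words."""
    return re.findall(r"[a-zåäö]+", text.lower())


def word_counts(texts):
    """Count how often each word occurs across all messages."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


counts = word_counts(messages)

# The free-text messages are now a table of numbers that can be analysed.
for word, n in counts.most_common(3):
    print(word, n)
```

With thousands or millions of messages the principle stays the same: the text is reduced to counts or other numbers that statistical software can work with.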
R will most likely have the answer to your question
Among the things that get Vehkalahti excited is the programming language R. R is essentially based on the same mindset that the Open Data Science course examines: open science and open data.
R is a programming language and software whose source code is freely available. At its core resides statistical software that can be expanded with various packages. In the natural sciences, R has been in use for quite some time, but in the humanities and social sciences, the R revolution is only just beginning.
With R, researchers can, for example, merge their data with the maps available through Google Maps and thus visualise their findings, such as the distribution of poverty or crime in a certain geographical area.
“If you’re wondering whether R could have a link to a certain subject, there will almost definitely already be a software package available for that specific purpose,” Vehkalahti says.
No more closed research environments
In addition to R, students taking the Open Data Science course will familiarise themselves with other tools needed in the conduct of open science, such as GitHub, which makes it easy to share data and other research materials.
Vehkalahti also mentions LaTeX, a typesetting system that automates compliance with academic layout requirements for text and references, leaving researchers free to focus fully on writing.
He notes that the time of closed environments and software is drawing to a close.
“They raise questions and suspicions, slow down the progress of science and impede its self-correcting nature.”
The future of science is in openness, such as open data, a topic that has, in recent years, started to get a lot of attention both in discourse and written works. Open data is important, but what Vehkalahti considers at least as important is open source code, which is produced, for example, by using R.
“In short, it’s about what researchers specifically do to gain their results. It’s about each and every choice researchers make in the software they use.”
Work done in R can be stored as code that records every choice made, enabling the researchers themselves or anyone else to repeat the analysis later.
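The idea of an analysis stored as code can be sketched as follows. This is a minimal illustration in Python rather than R, with made-up data; the names `SEED`, `SAMPLE_SIZE`, `TRIM_BELOW` and `run_analysis` are hypothetical. The point is only that every choice (the cleaning rule, the sample size, the random seed) is written down, so rerunning the script gives the same result.

```python
import random
import statistics

# Every analytical choice is written down as a named value.
SEED = 2018        # hypothetical random seed, fixed so sampling is repeatable
SAMPLE_SIZE = 5    # hypothetical number of observations to draw
TRIM_BELOW = 0     # hypothetical cleaning rule: drop readings below this value

# Made-up measurements, used only for illustration.
raw_scores = [12, -3, 7, 15, 9, -1, 11, 8, 14, 10]


def run_analysis(data, seed=SEED, sample_size=SAMPLE_SIZE, trim_below=TRIM_BELOW):
    """Clean the data, draw a seeded random sample and summarise it."""
    cleaned = [x for x in data if x >= trim_below]
    rng = random.Random(seed)  # local generator, unaffected by other code
    sample = rng.sample(cleaned, sample_size)
    return {
        "n_cleaned": len(cleaned),
        "sample": sample,
        "mean": statistics.mean(sample),
    }


first = run_analysis(raw_scores)
second = run_analysis(raw_scores)

# Same code and same data give the same result on every run.
print(first == second)
```

Anyone who has the script and the data can rerun it and check each step, which is what sharing the data alone does not allow.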
“This is absolutely crucial to the conduct of science. Data in itself is worthless if you have no clue as to what has been done with it. In recent years, a crisis of replication has been prevalent in science and particularly in psychology; the subsequent replication of many research findings has proven to be impossible. Open source code would solve part of the problem,” Vehkalahti emphasises.
Anyone can learn to use new tools
The time we live in has been called the post-truth era, where emotions trump knowledge or facts. Science has also been on the receiving end of this phenomenon.
“The reliability of science increases when practically anyone has the opportunity to access research materials and even replicate experiments without being hindered by considerations of money or accessibility,” Vehkalahti says.
The possibilities offered by extensive datasets are enormous. Vehkalahti goes back to his idol Hans Rosling and his work on making socially significant data freely available.
“Thanks to Rosling and his Gapminder foundation, we have the vast datasets of the UN, WHO and World Bank, among others, at our disposal. Those in the possession of suitable tools and skills are able to find answers to big questions from these and other datasets.”
Vehkalahti believes that anyone can introduce a new kind of approach to their research. You can get started by visiting a website maintained by the Helsinki University Library which contains information and resources related to the practices and tools of open science.
“Learning to base your thinking on data and algorithms, to perceive opportunities where you didn’t see any before, is key. If teachers and researchers try to hold you back, you have to keep in mind that it’s the students’ responsibility to challenge them,” says Vehkalahti.