AI has been successful in automating scientific reasoning processes in, for example, the life sciences (with the Robot Scientists). The question that I want to ask is whether it is also possible to automate the processes involved in data science. I want to answer that question in the course of our ERC AdG project SYNTH on “Synthesising Inductive Data Models”.
To start the discussion on this topic, it is useful to look at the famous knowledge discovery cycle, where one typically starts from raw data, selects and pre-processes the data, identifies the data mining task, applies the right data mining algorithms, and then interprets the results and possibly iterates. It turns out that most existing approaches to automating this process, such as the automated statistician, meta-learning, and algorithm portfolio and configuration approaches, assume that the learning task is known and that we only need to identify the right algorithm and parameters to solve that task as well as possible. It is well known in the data mining community that this step typically takes only about 20% of the time, while pre-processing and task identification take the other 80%.
The question that I am interested in is what we can do to automate the pre-processing and task identification aspects, particularly for non-experts in data science.