Data Science Challenges
Data science challenges are competitive events where individuals or teams work on real-world problems using data analysis and machine learning techniques. Below are challenges I participated in together with a team:
BIRDCLEF+ 2025
by Cornell Lab of Ornithology, USA
May 2025
Learn more about the challenge
Overview
The task was to develop a sound classification model that determines which of
206 different
animal species are present in a sound clip recorded in a wildlife reserve.
Such models are used to quantify the success of environmental restoration
programs. The idea is that high biodiversity and the presence of certain
key species indicate a healthy environment and therefore the success of the
restoration program.
Approach
I used the PANN audio model, which was pretrained on 5000 hours of sound and
predicts 527 sound classes. With this model, I created embeddings of the
training audio from the BIRDCLEF challenge. The idea is that the pretrained
model is already capable of finding useful features in sound data and can
therefore serve as a feature extractor. To achieve this, I removed the output
layer of the PANN and used the output of the preceding layer as the
embeddings. Then I trained a second neural network, with far fewer parameters
than PANN, to map from those embeddings to the probabilities of the 206
animal classes of the BIRDCLEF challenge.
Unfortunately, the challenge hosts imposed a runtime limit in their cloud
environment, which my approach could not meet, so I was unable to submit and
achieve a ranking.
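The feature-extractor idea described above can be illustrated with a toy sketch in plain Python. Everything here is an assumption for illustration: the tiny random network is only a stand-in for PANN (it is not pretrained), the sizes (8 inputs, 6 embedding dimensions, 3 classes) are made up, and the head is a single logistic layer with one output per class rather than the actual second network.

```python
import math
import random

random.seed(0)

def dense(x, weights, bias):
    """Fully connected layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def sigmoid(z):
    # Numerically stable form: never exponentiates a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def random_layer(n_in, n_out):
    weights = [[random.gauss(0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

# Hypothetical stand-in for the pretrained model: 8 inputs -> 6 hidden -> 4 classes.
hidden_layer = random_layer(8, 6)
output_layer = random_layer(6, 4)  # the layer that is removed; unused below

def embed(x):
    """Embedding = output of the layer before the removed output layer (frozen)."""
    return relu(dense(x, *hidden_layer))

def train_head(embeddings, labels, n_classes, lr=0.1, epochs=200):
    """Train one logistic unit per class on fixed embeddings with plain SGD."""
    dim = len(embeddings[0])
    weights = [[0.0] * dim for _ in range(n_classes)]
    bias = [0.0] * n_classes
    for _ in range(epochs):
        for e, y in zip(embeddings, labels):
            probs = [sigmoid(z) for z in dense(e, weights, bias)]
            for c in range(n_classes):
                grad = probs[c] - y[c]  # d(binary cross-entropy)/d(logit)
                bias[c] -= lr * grad
                for j in range(dim):
                    weights[c][j] -= lr * grad * e[j]
    return weights, bias

def predict(x, head):
    """Per-class presence probabilities for one input."""
    return [sigmoid(z) for z in dense(embed(x), *head)]

# Made-up "clips" and multi-label targets, purely to exercise the code.
clips = [[random.gauss(0, 1) for _ in range(8)] for _ in range(20)]
labels = [[1.0 if clip[c] > 0 else 0.0 for c in range(3)] for clip in clips]

embeddings = [embed(clip) for clip in clips]  # computed once, extractor stays frozen
head = train_head(embeddings, labels, n_classes=3)
print(predict(clips[0], head))
```

Because the embeddings are computed once and only the small head is trained, training is cheap compared with fine-tuning the full model.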
Richter's Predictor: Modeling Earthquake Damage
by Driven Data, USA
March – May 2025
1007th Place / 2487 Submissions
Learn more about the challenge
Overview
The task was to develop a model that predicts the level of earthquake damage
(low, medium, almost complete destruction) from features such as the
geographic location and structural characteristics, including the height of
the building and the type of superstructure material. The data consisted of
about 260,000 houses damaged in the Gorkha earthquake that hit Nepal in 2015.
Note that this was a practice challenge with no prize pool.
Approach
We used one-hot encoding and target encoding for the categorical
features, and predicted the damage with an XGBoost model by treating this
ordinal problem as a classification task. As the objective function, we used
multi-class classification with soft probability output. The final score was
reported as the micro-averaged F1 score. Finally, we used SHAP to extract
feature importances and gain insights into how to build houses so that they
suffer less earthquake damage.
Our SHAP feature importance analysis showed that building age is the most
important predictor of earthquake damage (besides proximity to the earthquake
epicenter): newer buildings tend to suffer less, while older ones are more
likely to be heavily damaged. Structural materials play a major role too.
Buildings made of mud mortar stone, mud mortar brick, or adobe mud are
associated with higher damage, which matches engineering knowledge: these
materials are brittle, lack reinforcement, and perform poorly under lateral
seismic forces. In contrast, cement mortar brick offers better protection due
to its higher strength and cohesion.
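As a small illustration of the evaluation metric mentioned above, the following sketch computes the micro-averaged F1 score in plain Python. The damage grades 1, 2, 3 and the example predictions are made up for illustration and are not taken from the challenge data.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool true/false positives and false negatives over all classes."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1  # predicted this class, but the truth is another class
            elif truth == label:
                fn += 1  # missed an instance of this class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical damage grades: 1 = low, 2 = medium, 3 = almost complete destruction.
y_true = [1, 2, 3, 3]
y_pred = [1, 3, 3, 3]
score = micro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(score, accuracy)
```

When every building gets exactly one predicted grade, each misclassification counts once as a false positive and once as a false negative, so the micro-averaged F1 score coincides with plain accuracy.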
Watt's up? Synthetic Data for Buildings
by TU Vienna, Austria
22 – 23 February 2025
3rd Place / 8 Teams – 80€ Prize
Learn more about the challenge
Overview
The task was to develop a model that takes time-series data of electricity
consumption as input and outputs a synthetic version of this dataset that
preserves the privacy of the individuals. This is important because electricity
consumption data are sensitive and cannot be easily shared. However, widespread
availability of such datasets could be very beneficial for model development,
allowing insights that help save energy.
Approach
We chose an approach that samples pieces from the actual data, reassembles them,
and adds Gaussian noise on top. This produced instances that closely resembled
the real data in terms of statistical properties but were different enough to
ensure they cannot be linked to individual examples of the real data.
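A minimal sketch of this kind of generator is shown below. It is not our actual challenge code: the piece length, the noise level, the clipping at zero, and the example household data are all assumptions chosen for illustration.

```python
import random

def synthesize(series_list, piece_len, noise_std, rng):
    """Build one synthetic series from randomly sampled pieces of real series.

    Assumes all series have the same length and are at least piece_len long.
    """
    target_len = len(series_list[0])
    synthetic = []
    while len(synthetic) < target_len:
        source = rng.choice(series_list)                       # pick a real series
        start = rng.randrange(len(source) - piece_len + 1)     # pick a piece in it
        synthetic.extend(source[start:start + piece_len])
    synthetic = synthetic[:target_len]
    # Add Gaussian noise; clipping at zero is an assumption (no negative consumption).
    return [max(0.0, value + rng.gauss(0.0, noise_std)) for value in synthetic]

# Hypothetical consumption curves (kWh per time step) of three households.
real = [
    [0.2, 0.3, 0.9, 1.4, 1.1, 0.6, 0.4, 0.3],
    [0.1, 0.1, 0.5, 0.8, 1.6, 1.2, 0.7, 0.2],
    [0.4, 0.4, 0.6, 0.7, 0.9, 1.0, 0.8, 0.5],
]

rng = random.Random(42)
fake = synthesize(real, piece_len=3, noise_std=0.05, rng=rng)
print(fake)
```

Shorter pieces and more noise make it harder to link a synthetic series to a real one, at the cost of statistical fidelity, so both parameters trade privacy against usefulness.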
Autoimmune Disease Machine Learning Challenge
by Eric and Wendy Schmidt Center at the Broad Institute, USA
Part 1: October 2024 – February 2025
Part 2: November 2024 – March 2025
Part 3: December 2024 – April 2025
Part 1: 2nd Place / 75 submissions / 7800 participants – 2400$ Prize
Part 2: 6th Place / 11 submissions / 8200 participants – 700$ Prize
Part 3: still ongoing
Learn more about the challenge
Overview
Autoimmune diseases occur when the immune system mistakenly attacks healthy
cells, disrupting the body's natural defense mechanisms. These diseases affect
50 million people in the U.S., with cases rising globally.
One of the most prevalent autoimmune diseases is inflammatory bowel disease
(IBD). To better understand and treat IBD, researchers sequence the RNA
(transcriptome) of individual affected cells. However, this process is highly
complex and expensive, making a predictive model for transcriptomics a valuable
tool in advancing research.
This challenge consisted of three independent sub-challenges, each addressing a
distinct aspect of this broader research problem:
- The task of Part 1 was to create a model that derives this expensive and hard-to-obtain information from readily available pathology images.
- The task of Part 2 was to predict the expression of all genes of a cell from a (predicted) subset of genes.
- The task of Part 3 was to rank all genes by their ability to distinguish between normal cells and cells showing dysplasia.
Approach
For task 1, we used the ResNet50 model implemented in PyTorch to map directly
from the input images to the 460 genes (end-to-end deep learning approach).
For task 2, we fed the predictions of our trained ResNet50 model into a standard neural network that mapped from the 460 predicted genes to all genes of the cell.
For task 3, we submitted many different approaches. Notable examples include (i) computing the AUC value for each feature and sorting the features accordingly, and (ii) predicting which cells show dysplasia with a neural network model, from which we extracted feature importances using Integrated Gradients, a feature importance algorithm for neural networks, as implemented in the Captum library.
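The AUC-based ranking from (i) can be sketched in plain Python as follows. The gene names and expression values are invented, and ranking by the distance of the AUC from 0.5 (so that genes that are lower in dysplastic cells also rank high) is an assumption of this sketch, not necessarily what we submitted.

```python
def auc(scores, labels):
    """Probability that a random positive scores higher than a random negative.

    Ties count as half a win. labels: 1 = dysplasia, 0 = normal.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rank_genes(expression, labels):
    """Rank genes by how well their expression alone separates the two cell groups."""
    scores = {gene: auc(values, labels) for gene, values in expression.items()}
    # Assumption: an AUC near 0 is as informative as an AUC near 1.
    return sorted(scores, key=lambda gene: abs(scores[gene] - 0.5), reverse=True)

# Hypothetical expression values for six cells (three normal, three dysplastic).
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "GENE_C": [1, 5, 2, 4, 3, 6],
    "GENE_A": [1, 2, 3, 4, 5, 6],
    "GENE_B": [6, 5, 3, 4, 2, 1],
}

ranking = rank_genes(expression, labels)
print(ranking)
```

This univariate ranking is fast and easy to interpret, but it ignores interactions between genes, which is one reason to complement it with a model-based importance method.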
LEC Data Challenge 2023
by LEC GmbH, Austria
July – August 2023
1st Place / 46 Teams – 4,000€ Prize
Learn more about the challenge | See the winners on LinkedIn
Overview
The task was to develop a machine learning solution to identify
cylinder-specific events in time-series data from large combustion engines.
Approach
We trained an XGBoost algorithm to produce a model that predicted, for each
time point, which type of issue was present, if any. As input, we used slices
of the time course (nearest values in time). Because issues always covered
larger continuous time windows in the training data, we afterward
algorithmically identified regions in the XGBoost predictions where many
issues were detected and unified them into a continuous stretch with one
issue type.
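The post-processing step can be sketched as follows. This is a simplified reconstruction under assumptions, not our original code: label 0 stands for "no issue", and the parameters max_gap and min_len as well as the majority vote are choices made for this illustration.

```python
from collections import Counter

def unify_regions(pred, max_gap=2, min_len=3):
    """Turn noisy per-time-point predictions into continuous single-issue stretches."""
    n = len(pred)
    flagged = [p != 0 for p in pred]

    # Step 1: bridge short "no issue" gaps between two detected issue points.
    last = None
    for i in range(n):
        if pred[i] != 0:
            if last is not None and 0 < i - last - 1 <= max_gap:
                for j in range(last + 1, i):
                    flagged[j] = True
            last = i

    # Step 2: give each long enough flagged run its most frequent issue type;
    # runs shorter than min_len are treated as noise and dropped.
    result = [0] * n
    i = 0
    while i < n:
        if not flagged[i]:
            i += 1
            continue
        j = i
        while j < n and flagged[j]:
            j += 1
        if j - i >= min_len:
            votes = Counter(p for p in pred[i:j] if p != 0)
            result[i:j] = [votes.most_common(1)[0][0]] * (j - i)
        i = j
    return result

# Hypothetical raw predictions: 0 = no issue, 1 and 2 = two issue types.
raw = [0, 0, 1, 1, 0, 1, 2, 1, 0, 0, 0, 0, 2, 0, 0]
clean = unify_regions(raw)
print(clean)
```

In this example, the scattered detections between positions 2 and 7 become one stretch of issue type 1, while the isolated detection at position 12 is discarded.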
Trigger Detection Challenge 2023
by Bauhaus-University Weimar, Germany
April – May 2023
Learn more about the challenge
Overview
A multi-label classification challenge to assign trigger warning
labels (such as violence and blood) to fanfiction documents based on their
content.
As I was still a beginner, I was honestly a bit overwhelmed and left the
challenge before it ended. However, I learned a lot about the workflow for
tackling such tasks.