My Favourite EMNLP 2024 Papers

31st Dec 2024

I attended my first major NLP conference this year: EMNLP in Miami. I learned a lot, met a lot of cool people and Miami was lush! ☀️🏝️🪩

I thought I would share some of my favourite papers I came across during the conference: some from the main conference and others from workshops and co-located things. Note: these are not necessarily the 'best' papers from the conference, simply ones I found fun and/or interesting by some arbitrary measure. I am easily swayed by pretty posters.

I've given a brief description of what each one was about but these were very much hasty notes I scribbled down.

I'll start with an honourable mention:


FoodieQA - Had snacks at their poster stand. 10/10 behaviour.

What's it about?

The authors evaluate LLMs and VLMs on a multimodal dataset for fine-grained understanding of Chinese food culture.

FoodieQA comprises three multiple choice question-answering tasks where models need to answer questions based on multiple images, a single image, and text-only descriptions.

On to the main list!

FoodieQA Dataset

5. LLM Tropes (My favourite poster)


LLM Tropes Poster

What's it about?

  • Take 62 statements from the Political Compass Test (PCT)
  • Test these using 420 prompt variations (demographic or style changes) - a rough sketch of what that could look like follows this list
  • Test on 6 LLMs
  • Demographic features significantly impact outcomes
  • Compare open-ended and closed-form responses
  • Reveal ‘tropes’ in responses that are consistent across demographics, e.g.
    • ‘Love is love’
    • ‘National pride connects us’
    • ‘Work towards building a better world’
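
To make that concrete, here's a rough sketch of what building those prompt variations could look like. This is my own illustration rather than the authors' code - the personas and instructions below are made up, and the actual paper covers far more demographic and style combinations than this.

```python
from itertools import product

# One real PCT statement for illustration; the paper uses all 62.
statements = [
    "If economic globalisation is inevitable, it should primarily serve "
    "humanity rather than the interests of trans-national corporations.",
    # ...plus the other 61 statements
]

# Hypothetical demographic personas (the empty string means "no persona").
personas = [
    "",
    "You are a 30-year-old woman living in rural Texas. ",
    "You are a 65-year-old man living in Berlin. ",
]

# Closed-form vs open-ended response instructions.
formats = {
    "closed": "Respond with exactly one of: Strongly disagree, Disagree, Agree, Strongly agree.",
    "open": "Explain your view on this in a few sentences.",
}

# Cross every statement with every persona and response format.
prompts = [
    f'{persona}Consider the statement: "{statement}" {instruction}'
    for statement, persona, (_, instruction) in product(statements, personas, formats.items())
]

for p in prompts[:2]:
    print(p, end="\n\n")
```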

4. What is the social benefit of hate speech detection research? A Systematic Review


Not only was the author a really lovely human, but they also did research that we don't see enough of in NLP: figuring out whether any of the research we do actually makes a difference in the world.

What's it about?

  • Reviews 48 hate speech detection systems from 37 publications
  • Tests these against 8 principles, including:
    • Human-centred, transparency, well-being, privacy, reliability, interrogation and accountability

Hate Speech Detection

3. BabyLM - GPT or BERT: why not both?


BabyLM

This was a popular one with my whole research group, as was the BabyLM challenge in general. In fact, we've been thinking about how to do some sort of Welsh BabyLM project - stay tuned! 👀

Did I understand it all? Definitely not. Did I nod along at their poster and hope no-one asked me any questions? Absolutely!

What's it about?

  • Part of the BabyLM challenge
  • Merge GPT and BERT into a single transformer architecture
  • Improved performance over each one separately
  • Achieves the improvements without requiring additional parameters or extra training time compared to training either GPT or BERT alone
  • There's a possibility this approach doesn't scale well

My dumb person notes:

  • GPT - causal language modelling (CLM), i.e. guessing the next word
  • BERT - masked language modelling (MLM), i.e. guessing a masked word
  • Instead of predicting masked tokens at their original positions, we shift the predictions one position to the right, aligning them with the CLM’s next-token prediction pattern.
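
To get my head around that last point, here's a minimal sketch of the idea, assuming a standard PyTorch setup - it's my own toy illustration, not the authors' implementation. The point is that the output at position t always predicts the token at position t+1, whether that token is simply the next one (causal LM) or one that's been masked out (masked LM), so a single transformer and output head can train on both.

```python
import torch

IGNORE = -100   # positions with this label are skipped by cross-entropy
MASK_ID = 4     # hypothetical [MASK] token id

def clm_labels(ids: torch.Tensor) -> torch.Tensor:
    """Causal LM: the output at position t is scored against the token at t+1."""
    return ids[:, 1:].clone()

def shifted_mlm(ids: torch.Tensor, mask_prob: float = 0.15):
    """Masked LM with predictions shifted one position to the right:
    the output at position t is scored against the *masked* token at t+1."""
    masked = ids.clone()
    is_masked = torch.rand(ids.shape) < mask_prob
    is_masked[:, 0] = False                   # never mask the first token
    masked[is_masked] = MASK_ID

    labels = torch.full_like(ids[:, 1:], IGNORE)
    sel = is_masked[:, 1:]                    # which *next* tokens were masked
    labels[sel] = ids[:, 1:][sel]
    return masked, labels                     # labels align with outputs[:, :-1]

# Both objectives now compare outputs[:, :-1] against a (batch, len-1) label
# tensor, so the same transformer + output head can train on a mix of the two;
# only the inputs and the attention mask (causal vs bidirectional) differ.
```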

2. ValueScope: Unveiling Implicit Norms and Values via Return Potential Model of Social Interactions


These guys also get top marks for pretty poster and a really interesting research topic in general.

I would strongly recommend taking a look at the paper.

What's it about?

  • Look at 13 different Reddit communities
    • Across gender, politics, science and finances
  • Traces the evolution of different communities and how their norms change over time
    • E.g. how norms shift when an election or some other major event happens

So the little graph shows how supportiveness and sense of humour trended downward after the election, with the exception of supportiveness in the Libertarian community, which increased post-election.

ValueScope

1. Connecting the Dots


Math Bias

As someone already obsessed with NYT Games, my eyes lit up upon seeing this poster. I believe it spun out of a Master's project, so it's extra impressive that it's at EMNLP Main.

If you're not familiar with the game, the aim is to sort an unlabelled grid of 16 words into 4 groups of 4 based on shared categories.

What's it about?

  • Can LLMs solve the New York Times Connections game?
  • The authors use this as a benchmark to test abstract reasoning in LLMs
  • Claude 3.5 Sonnet performs best 
    • (better than GPT-4o, Gemini, Llama 3.1 and Mistral 2)
  • Examine performance through a taxonomy to figure out which types of word associations LLMs are good/bad at
    • (The taxonomy itself is really interesting)
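
If, like me, you find the task easier to picture in code, here's a toy sketch of how a Connections puzzle and a model's guess could be represented and scored. The puzzle below is one I made up, and this is just my illustration of the task format, not the paper's evaluation code.

```python
from typing import Dict, List, Set

Puzzle = Dict[str, Set[str]]   # category name -> its four words

# A made-up puzzle in the spirit of the game (not from the paper's benchmark).
example: Puzzle = {
    "___ bear": {"polar", "teddy", "grizzly", "panda"},
    "shades of blue": {"navy", "teal", "sky", "royal"},
    "keyboard keys": {"shift", "tab", "escape", "space"},
    "types of bread": {"rye", "pita", "naan", "sourdough"},
}

def score_guess(puzzle: Puzzle, guess: List[Set[str]]) -> int:
    """Count how many guessed groups of four exactly match a gold category."""
    gold = list(puzzle.values())
    return sum(1 for group in guess if group in gold)

# A solver (e.g. an LLM prompted with the 16 shuffled words) returns four
# groups; a perfect solve scores 4.
perfect = list(example.values())
print(score_guess(example, perfect))   # -> 4
```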

So that's my round-up! Hope it was interesting for you too :)

Bonus pic of me with some raccoons:

Zara with raccoons