My Favourite EMNLP 2024 Papers

31st Dec 2024

I attended my first major NLP conference this year: EMNLP in Miami. I learned a lot, met a lot of cool people and Miami was lush! ☀️🏝️🪩

I thought I would share some of my favourite papers I came across during the conference: some from the main conference and others from workshops and co-located things. Note: these are not necessarily the 'best' papers from the conference, simply ones I found fun and/or interesting by some arbitrary measure. I am easily swayed by pretty posters.

I've given a brief description of what each one was about but these were very much hasty notes I scribbled down.

I'll start with an honourable mention:


FoodieQA - Had snacks at their poster stand. 10/10 behaviour.

What's it about?

The authors evaluate LLMs and VLMs on a multimodal dataset for fine-grained understanding of Chinese food culture.

FoodieQA comprises three multiple choice question-answering tasks where models need to answer questions based on multiple images, a single image, and text-only descriptions.

On to the main list!

FoodieQA Dataset

5. LLM Tropes (My favourite poster)


LLM Tropes Poster

What's it about?

  • Take 62 statements from the Political Compass Test (PCT)
  • Test these using 420 prompt variations (demographic or style changes) - a rough sketch of what that could look like follows this list
  • Test on 6 LLMs
  • Demographic features significantly impact outcomes
  • Compare open-ended and closed-form responses
  • Reveal ‘tropes’ in responses that are consistent across demographics, e.g.
    • ‘Love is love’
    • ‘National pride connects us’
    • ‘Work towards building a better world’
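
To make that concrete, here's a rough sketch of what building those prompt variations could look like. This is my own illustration rather than the authors' code - the personas and instructions below are made up, and the actual paper covers far more demographic and style combinations than this.

```python
from itertools import product

# One real PCT statement for illustration; the paper uses all 62.
statements = [
    "If economic globalisation is inevitable, it should primarily serve "
    "humanity rather than the interests of trans-national corporations.",
    # ...plus the other 61 statements
]

# Hypothetical demographic personas (the empty string means "no persona").
personas = [
    "",
    "You are a 30-year-old woman living in rural Texas. ",
    "You are a 65-year-old man living in Berlin. ",
]

# Closed-form vs open-ended response instructions.
formats = {
    "closed": "Respond with exactly one of: Strongly disagree, Disagree, Agree, Strongly agree.",
    "open": "Explain your view on this in a few sentences.",
}

# Cross every statement with every persona and response format.
prompts = [
    f'{persona}Consider the statement: "{statement}" {instruction}'
    for statement, persona, (_, instruction) in product(statements, personas, formats.items())
]

for p in prompts[:2]:
    print(p, end="\n\n")
```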

4. What is the social benefit of hate speech detection research? A Systematic Review


Not only was the author a really lovely human, but they also did research that we don't see enough of in NLP: figuring out whether any of the research we do actually makes a difference in the world.

What's it about?

  • Reviews 48 hate speech detection systems from 37 publications
  • Tests these against 8 principles, including:
    • Human-centred, transparency, well-being, privacy, reliability, interrogation and accountability

Hate Speech Detection

3. BabyLM - GPT or BERT: why not both?


BabyLM

This was a popular one with my whole research group, as was the BabyLM challenge in general. In fact, we've been thinking about how to do some sort of Welsh BabyLM project - stay tuned! 👀

Did I understand it all? Definitely not. Did I nod along at their poster and hope no-one asked me any questions? Absolutely!

What's it about?

  • Part of the BabyLM challenge
  • Merge GPT and BERT into a single transformer architecture
  • Improved performance over each one separately
  • Achieves the improvements without requiring additional parameters or extra training time compared to training either GPT or BERT alone
  • There's a possibility this approach doesn't scale well

My dumb person notes:

  • GPT - causal language modelling (CLM), i.e. guessing the next word
  • BERT - masked language modelling (MLM), i.e. guessing a masked word
  • Instead of predicting masked tokens at their original positions, we shift the predictions one position to the right, aligning them with the CLM’s next-token prediction pattern.
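
To get my head around that last point, here's a minimal sketch of the idea, assuming a standard PyTorch setup - it's my own toy illustration, not the authors' implementation. The point is that the output at position t always predicts the token at position t+1, whether that token is simply the next one (causal LM) or one that's been masked out (masked LM), so a single transformer and output head can train on both.

```python
import torch

IGNORE = -100   # positions with this label are skipped by cross-entropy
MASK_ID = 4     # hypothetical [MASK] token id

def clm_labels(ids: torch.Tensor) -> torch.Tensor:
    """Causal LM: the output at position t is scored against the token at t+1."""
    return ids[:, 1:].clone()

def shifted_mlm(ids: torch.Tensor, mask_prob: float = 0.15):
    """Masked LM with predictions shifted one position to the right:
    the output at position t is scored against the *masked* token at t+1."""
    masked = ids.clone()
    is_masked = torch.rand(ids.shape) < mask_prob
    is_masked[:, 0] = False                   # never mask the first token
    masked[is_masked] = MASK_ID

    labels = torch.full_like(ids[:, 1:], IGNORE)
    sel = is_masked[:, 1:]                    # which *next* tokens were masked
    labels[sel] = ids[:, 1:][sel]
    return masked, labels                     # labels align with outputs[:, :-1]

# Both objectives now compare outputs[:, :-1] against a (batch, len-1) label
# tensor, so the same transformer + output head can train on a mix of the two;
# only the inputs and the attention mask (causal vs bidirectional) differ.
```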

2. ValueScope: Unveiling Implicit Norms and Values via Return Potential Model of Social Interactions


These guys also get top marks for pretty poster and a really interesting research topic in general.

I would strongly recommend taking a look at the paper.

What's it about?

  • Look at 13 different Reddit communities
    • Across gender, politics, science and finances
  • Traces the evolution of different communities and how their norms change over time
    • E.g. how norms shift when an election or some other major event happens

So the little graph shows how supportiveness and sense of humour trended downward after the election, with the exception of supportiveness in the Libertarian community, which increased post-election.

ValueScope

1. Connecting the Dots


Math Bias

As someone already obsessed with NYT Games, my eyes lit up upon seeing this poster. I believe it spun out of a Master's project, so it's extra impressive that it's at EMNLP Main.

If you're not familiar with the game, the aim is to sort an unlabelled grid of 16 words into 4 groups of 4 based on shared categories.

What's it about?

  • Can LLMs solve the New York Times Connections game?
  • The authors use this as a benchmark to test abstract reasoning in LLMs
  • Claude 3.5 Sonnet performs best 
    • (better than GPT-4o, Gemini, Llama 3.1 and Mistral 2)
  • Examine performance through a taxonomy to figure out which types of word associations LLMs are good/bad at
    • (The taxonomy itself is really interesting)
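
If, like me, you find the task easier to picture in code, here's a toy sketch of how a Connections puzzle and a model's guess could be represented and scored. The puzzle below is one I made up, and this is just my illustration of the task format, not the paper's evaluation code.

```python
from typing import Dict, List, Set

Puzzle = Dict[str, Set[str]]   # category name -> its four words

# A made-up puzzle in the spirit of the game (not from the paper's benchmark).
example: Puzzle = {
    "___ bear": {"polar", "teddy", "grizzly", "panda"},
    "shades of blue": {"navy", "teal", "sky", "royal"},
    "keyboard keys": {"shift", "tab", "escape", "space"},
    "types of bread": {"rye", "pita", "naan", "sourdough"},
}

def score_guess(puzzle: Puzzle, guess: List[Set[str]]) -> int:
    """Count how many guessed groups of four exactly match a gold category."""
    gold = list(puzzle.values())
    return sum(1 for group in guess if group in gold)

# A solver (e.g. an LLM prompted with the 16 shuffled words) returns four
# groups; a perfect solve scores 4.
perfect = list(example.values())
print(score_guess(example, perfect))   # -> 4
```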

So that's my round-up! Hope it was interesting for you too :)

Bonus pic of me with some raccoons:

Zara with raccoons