Published September 17, 2025 | Version 1.0
Dataset Open

FoodSafeSum

  • 1. Stockholm University
  • 2. Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo"
  • 3. Department of Computer and Systems Sciences, Stockholm University
  • 4. SGS Digicomply
  • 5. Athens University of Economics and Business
  • 1. Stockholm University
  • 2. Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo"
  • 3. University of Pisa
  • 4. Stockholm University, Department of Computer and Systems Sciences
  • 5. SGS Digicomply

Description

FoodSafeSum is a machine-actionable dataset for NLP in food safety. It contains human-written and LLM-generated summaries and titles of 2,091 food-safety documents, plus manually curated topics, document types, and automatically extracted hazard annotations. Documents were gathered by SGS Digicomply from news, regulatory/legal sources, guidance portals, and scientific outlets (years 2002–2023; ~58% originally in English; the rest translated and curated). The dataset enables research on classification, retrieval, RAG-style QA, and event clustering in food-safety monitoring and policy. See Section 3 and Table 7 of the paper for the schema and fields, and Figures 1–3 of the paper for source/type statistics.

 

What’s included?

For each source item:

    • Manual summary and manual title (by domain experts)

    • LLM summary and LLM title (generated with meta.llama3-70b-instruct via Bedrock)

    • Document type (News, Regulation, Guidance, Scientific) and topic labels (12 high-level categories, e.g., Policies & Laws; Contaminants, residues & contact materials)

    • Hazard annotations auto-extracted from a controlled vocabulary derived from prior work

    • Source name and original title


  • Note: The full original documents are not included in the public release; they were used internally for analysis only.

 

Format & schema

  • The primary release is in CSV/JSON with the following columns (see Table 7 of the paper): manual_summary, manual_title, llama70b_summary, llama70b_title, source_name, doc_type, topics, and hazards (list), plus any auxiliary metadata used for experiments.

  • Multilingual inputs were translated (Google Translate/DeepL) and curated; see the paper for details.
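
As a rough illustration of how the columns above might be read, the following Python sketch loads rows with the standard csv module and turns the list-valued columns into Python lists. The column names come from the list above; everything else is an assumption. In particular, how topics and hazards are serialized inside a CSV cell is not stated here, so the hypothetical helper parse_list_field tries JSON, then a Python list literal, then a ';'-separated string. Check the actual file and adjust.

```python
import ast
import csv
import json
from collections import Counter

# Columns named in the dataset description that are expected to hold lists.
LIST_COLUMNS = ["topics", "hazards"]


def parse_list_field(raw):
    """Best-effort parse of a list-valued cell.

    ASSUMPTION: the on-disk encoding of list columns is not documented on
    this page, so this tries JSON, then a Python literal, and finally falls
    back to splitting on ';'.
    """
    raw = (raw or "").strip()
    if not raw:
        return []
    for parser in (json.loads, ast.literal_eval):
        try:
            value = parser(raw)
        except (ValueError, SyntaxError, TypeError):
            continue
        if isinstance(value, (list, tuple)):
            return [str(v).strip() for v in value]
        return [str(value).strip()]
    return [part.strip() for part in raw.split(";") if part.strip()]


def load_records(stream):
    """Read rows from an open text stream and parse the list columns."""
    records = []
    for row in csv.DictReader(stream):
        for col in LIST_COLUMNS:
            if col in row:
                row[col] = parse_list_field(row[col])
        records.append(row)
    return records


def count_by_doc_type(records):
    """Count rows per doc_type value (e.g. News, Regulation)."""
    return Counter(r.get("doc_type", "") for r in records)
```

To try it on the released file, open it with open("fss.csv", newline="", encoding="utf-8") and pass the handle to load_records; the UTF-8 encoding is also an assumption.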

Files

fss.csv (2.9 MB)
md5:6e6061b643e2628c674adb9e4845f364
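
The MD5 checksum listed for fss.csv can be used to confirm that a download is intact. A minimal sketch using Python's hashlib follows; the function names and the local path are hypothetical, and only the checksum value is taken from this page.

```python
import hashlib

# Checksum as listed for fss.csv on this page.
EXPECTED_MD5 = "6e6061b643e2628c674adb9e4845f364"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """True if the file at `path` matches the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

Calling verify_download("fss.csv") on a local copy should return True if the file matches the published checksum.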

Additional details

Funding

European Commission
EFRA - Extreme Food Risk Analytics 101093026

Dates

Collected
2025-09-17