Schema2QA 2.0

A large-scale question answering dataset on real-world Schema.org data

View the Project on GitHub stanford-oval/schema2qa

Stanford Schema2QA Dataset

Overview

Schema2QA is the first large question answering dataset over real-world Schema.org data. It covers 6 common domains: restaurants, hotels, people, movies, books, and music, based on Schema.org metadata crawled from 6 different websites (Yelp, Hyatt, LinkedIn, IMDb, Goodreads, and last.fm). In total, there are over 2,000,000 training examples, consisting of both augmented human paraphrase data and high-quality synthetic data generated by Genie. All questions are annotated with ThingTalk, an executable virtual assistant programming language.

Schema2QA includes challenging evaluation questions collected from crowd workers. Workers are prompted with only what the domain is and what properties are supported. Thus, the sentences are natural and diverse. They also contain entities unseen during training. The collected sentences are manually annotated with ThingTalk by the authors. In total there are over 5,000 examples for dev and test.

An example of an evaluation question and its ThingTalk annotation is shown below:

“What are the highest ranked burger joints in the 40 mile area around Asheville NC?”

sort(aggregateRating.ratingValue desc of @org.schema.Restaurant.Restaurant() 
  filter distance(geo, new Location("asheville nc" )) <= 40 mi && 
         servesCuisine =~ "burger")[1] ;
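
For readers unfamiliar with ThingTalk, the query above can be read roughly as: keep the restaurants within 40 miles of the given location whose cuisine matches "burger", sort them by rating in descending order, and take the first result. The Python sketch below mimics that reading on made-up data. It is only an illustration: the `Restaurant` class, the sample rows, the coordinates, and the use of a case-insensitive substring test for `=~` are all assumptions, not part of ThingTalk or Genie.

```python
import math
from dataclasses import dataclass


@dataclass
class Restaurant:
    # Hypothetical record; field names loosely mirror the Schema.org properties.
    name: str
    lat: float
    lon: float
    serves_cuisine: str
    rating_value: float


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles, using the haversine formula."""
    earth_radius_mi = 3958.8
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_mi * math.asin(math.sqrt(a))


def top_burger_place(restaurants, center, radius_mi=40.0):
    """Filter by distance and cuisine, sort by rating (descending), take the first."""
    lat0, lon0 = center
    matches = [
        r for r in restaurants
        if miles_between(r.lat, r.lon, lat0, lon0) <= radius_mi
        and "burger" in r.serves_cuisine.lower()  # assumed reading of `=~`
    ]
    matches.sort(key=lambda r: r.rating_value, reverse=True)
    return matches[0] if matches else None


# Made-up sample data; coordinates are approximate and for illustration only.
ASHEVILLE = (35.5951, -82.5515)
SAMPLE = [
    Restaurant("Nearby Burger A", 35.60, -82.55, "burger", 4.2),
    Restaurant("Nearby Burger B", 35.70, -82.60, "Burgers, American", 4.7),
    Restaurant("Faraway Burger", 35.2271, -80.8431, "burger", 4.9),  # well over 40 miles away
    Restaurant("Nearby Sushi", 35.59, -82.56, "sushi", 5.0),
]

if __name__ == "__main__":
    print(top_burger_place(SAMPLE, ASHEVILLE))
```

The faraway and non-burger entries are excluded by the filter before sorting, so a higher rating alone does not make a restaurant the answer.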

What’s new in 2.0

The main difference is that all the examples in the dataset have been reannotated with ThingTalk 2.0. This is a major redesign of the language to make it more accessible, less verbose, and more compatible with pre-trained neural networks. More details about the changes can be found in the release history. The synthetic data has been regenerated with the latest Genie v0.8.0, which improves both quality and efficiency. There are also minor annotation fixes, and duplicated examples have been removed from the evaluation set. As a result, the evaluation set is slightly smaller for some domains, but the diversity remains the same.

You can still find information about Schema2QA 1.0 here. However, we no longer recommend using Schema2QA 1.0, as it contains outdated ThingTalk annotations.

Leaderboard

All numbers are evaluated on the Schema2QA test set, which is not included in this repository. Please contact us at mobisocial@lists.stanford.edu to evaluate your model(s) on the test data. Accuracy on the dev set can be found here. Note that these accuracies differ from those reported in our papers because the dataset has changed.

Schema2QA

Trained with the full Schema2QA training data, which includes synthetic data generated from manual natural language annotations of the Schema.org properties, as well as human paraphrase data. Both are augmented with crawled real property values.

Rank  Model                 Restaurants  People  Movies  Books  Music  Hotels  Average
1     BART (Stanford)       73.3%        80.0%   81.7%   72.5%  70.3%  69.5%   74.5%
2     BERT-LSTM (Stanford)  64.3%        73.8%   66.8%   46.7%  58.0%  55.9%   60.9%

AutoQA

Trained with a dataset fully synthesized with AutoQA, using automatically generated natural language annotations and a neural paraphraser.

Rank  Model                 Restaurants  People  Movies  Books  Music  Hotels  Average
1     BART (Stanford)       77.3%        76.2%   83.4%   65.1%  62.9%  72.2%   72.9%
2     BERT-LSTM (Stanford)  62.6%        58.4%   60.4%   44.0%  50.3%  60.4%   56.0%
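
The Average column in the two leaderboard tables above appears to be the unweighted mean of the six per-domain accuracies. This is an inference from the numbers, not something the page states. The short Python check below recomputes it from the values copied from the tables; the helper name `macro_average` and the 0.1-point tolerance are choices made for this illustration.

```python
# Per-domain order: Restaurants, People, Movies, Books, Music, Hotels.
# Each entry: (per-domain accuracies in %, reported Average in %), copied from the tables.
LEADERBOARD = {
    ("Schema2QA", "BART"): ([73.3, 80.0, 81.7, 72.5, 70.3, 69.5], 74.5),
    ("Schema2QA", "BERT-LSTM"): ([64.3, 73.8, 66.8, 46.7, 58.0, 55.9], 60.9),
    ("AutoQA", "BART"): ([77.3, 76.2, 83.4, 65.1, 62.9, 72.2], 72.9),
    ("AutoQA", "BERT-LSTM"): ([62.6, 58.4, 60.4, 44.0, 50.3, 60.4], 56.0),
}


def macro_average(scores):
    """Unweighted mean over domains (each domain counts equally)."""
    return sum(scores) / len(scores)


def averages_match(tolerance=0.1):
    """True if every reported Average is within `tolerance` points of the recomputed mean."""
    return all(
        abs(macro_average(scores) - reported) < tolerance
        for scores, reported in LEADERBOARD.values()
    )


if __name__ == "__main__":
    for (setting, model), (scores, reported) in LEADERBOARD.items():
        print(f"{setting:10s} {model:10s} recomputed={macro_average(scores):.2f} reported={reported}")
    print("all within tolerance:", averages_match())
```

Because each domain counts equally, a domain with a small evaluation set influences the average as much as a large one.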

Validation data can be found in each domain's directory in this git repository. The training sets can be downloaded from the following links:

Detailed statistics of the dataset can be found in the stats page.

Getting started

This repository also contains a Makefile to run the full data synthesis, training, and evaluation pipeline for the Schema2QA dataset. Detailed instructions can be found in installation and run instructions.

License

The dataset is released under CC BY 4.0. Please cite the following papers if you use this dataset in your work:

% Schema2QA & BERT-LSTM model
@inproceedings{xu2020schema2qa,
  title={Schema2QA: High-Quality and Low-Cost Q\&A Agents for the Structured Web},
  author={Xu, Silei and Campagna, Giovanni and Li, Jian and Lam, Monica S},
  booktitle={Proceedings of the 29th ACM International Conference on Information \& Knowledge Management},
  pages={1685--1694},
  year={2020}
}

% AutoQA 
@inproceedings{xu2020autoqa,
  title={AutoQA: From Databases to Q\&A Semantic Parsers with Only Synthetic Training Data},
  author={Xu, Silei and Semnani, Sina and Campagna, Giovanni and Lam, Monica},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  pages={422--434},
  year={2020}
}

% BART parser
@inproceedings{campagna-etal-2022-shot,
    title={A Few-Shot Semantic Parser for {W}izard-of-{O}z Dialogues with the Precise {T}hing{T}alk Representation},
    author={Campagna, Giovanni  and Semnani, Sina  and Kearns, Ryan  and Koba Sato, Lucas Jun  and Xu, Silei  and Lam, Monica},
    booktitle={Findings of the Association for Computational Linguistics: ACL 2022},
    pages={4021--4034},
    month={may},
    year={2022},
    address={Dublin, Ireland},
    publisher={Association for Computational Linguistics},
    url={https://aclanthology.org/2022.findings-acl.317},
    doi={10.18653/v1/2022.findings-acl.317},
}