A large-scale question answering dataset on real-world Schema.org data
Schema2QA is the first large question answering dataset over real-world Schema.org data. It covers 6 common domains: restaurants, hotels, people, movies, books, and music, based on Schema.org metadata crawled from 6 different websites (Yelp, Hyatt, LinkedIn, IMDb, Goodreads, and last.fm). In total, there are over 2,000,000 training examples, consisting of both augmented human paraphrase data and high-quality synthetic data generated by Genie. All questions are annotated with ThingTalk, an executable virtual assistant programming language.
Schema2QA includes challenging evaluation questions collected from crowd workers. Workers are prompted with only what the domain is and what properties are supported. Thus, the sentences are natural and diverse. They also contain entities unseen during training. The collected sentences are manually annotated with ThingTalk by the authors. In total there are over 5,000 examples for dev and test.
An example of an evaluation question and its ThingTalk annotation is shown below:
“What are the highest ranked burger joints in the 40 mile area around Asheville NC?”
sort(aggregateRating.ratingValue desc of @org.schema.Restaurant.Restaurant()
filter distance(geo, new Location("asheville nc" )) <= 40 mi &&
servesCuisine =~ "burger")[1] ;
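To make the semantics of this annotation concrete, here is a rough Python sketch of what the query asks for: keep restaurants within 40 miles that match the cuisine, sort by rating in descending order, and take the first result. It is not how ThingTalk is executed. The `Restaurant` fields, the sample records, the coordinates used for Asheville, and the reading of `=~` as a case-insensitive substring match are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass
class Restaurant:
    # Hypothetical flat record; the real Schema.org data is nested.
    name: str
    lat: float
    lon: float
    cuisine: str
    rating: float


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (haversine formula)."""
    earth_radius_mi = 3958.8
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_mi * asin(sqrt(a))


def top_rated_nearby(restaurants, center, cuisine, radius_mi):
    """Filter by distance and cuisine, sort by rating (desc), return the first hit."""
    matches = [
        r for r in restaurants
        if miles_between(r.lat, r.lon, *center) <= radius_mi
        and cuisine in r.cuisine.lower()  # assumed approximation of `=~`
    ]
    ranked = sorted(matches, key=lambda r: r.rating, reverse=True)
    return ranked[0] if ranked else None  # corresponds to `[1]` in the annotation


# Made-up sample data; coordinates are approximate.
ASHEVILLE = (35.5951, -82.5515)
SAMPLE = [
    Restaurant("A", 35.60, -82.55, "Burgers", 4.2),
    Restaurant("B", 35.58, -82.56, "Burger, American", 4.7),
    Restaurant("C", 35.2271, -80.8431, "Burgers", 4.9),  # well outside 40 miles
    Restaurant("D", 35.59, -82.55, "Pizza", 5.0),        # wrong cuisine
]

best = top_rated_nearby(SAMPLE, ASHEVILLE, "burger", 40)
print(best.name if best else "no match")
```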
The main difference from Schema2QA 1.0 is that all the examples in the dataset have been reannotated with ThingTalk 2.0. This is a major redesign of the language to make it more accessible, less verbose, and more compatible with pre-trained neural networks. More details about the changes can be found in the release history. The synthetic data has been regenerated with the latest Genie v0.8.0, which improves both quality and efficiency. There are also minor annotation fixes, and duplicated examples have been removed from the evaluation set. As a result, the evaluation set is slightly smaller for some domains, but its diversity remains the same.
You can still find information about Schema2QA 1.0 here. However, we no longer recommend using Schema2QA 1.0, as it contains outdated ThingTalk annotations.
All numbers are evaluated on the Schema2QA test set, which is not included in this repository. Please contact us at mobisocial@lists.stanford.edu to evaluate your model(s) on the test data. Accuracy on the dev set can be found here. Note that the accuracy now differs from what we reported in our papers because the dataset has changed.
Trained with the full Schema2QA training data, including synthetic data using manual natural language annotations of the Schema.org properties, and human paraphrase data. Both are augmented with crawled real property values.
Rank | Model | Restaurants | People | Movies | Books | Music | Hotels | Average |
---|---|---|---|---|---|---|---|---|
1 | BART Stanford | 73.3% | 80.0% | 81.7% | 72.5% | 70.3% | 69.5% | 74.5% |
2 | BERT-LSTM Stanford | 64.3% | 73.8% | 66.8% | 46.7% | 58.0% | 55.9% | 60.9% |
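The Average column appears to be the unweighted mean of the six per-domain accuracies. The short check below assumes that reading, with the numbers copied from the table above; `macro_average` is a name chosen here for illustration. Small differences from the reported values are expected because the per-domain figures are already rounded.

```python
from statistics import mean

DOMAINS = ["Restaurants", "People", "Movies", "Books", "Music", "Hotels"]

# (per-domain accuracies in DOMAINS order, reported average), copied from the table.
REPORTED = {
    "BART": ([73.3, 80.0, 81.7, 72.5, 70.3, 69.5], 74.5),
    "BERT-LSTM": ([64.3, 73.8, 66.8, 46.7, 58.0, 55.9], 60.9),
}


def macro_average(per_domain):
    """Unweighted mean over domains (assumed definition of the Average column)."""
    return mean(per_domain)


for model, (scores, reported_avg) in REPORTED.items():
    print(f"{model}: computed {macro_average(scores):.2f}, reported {reported_avg}")
```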
Trained with dataset fully synthesized with AutoQA, using automatically generated natural language annotations and a neural paraphraser.
Rank | Model | Restaurants | People | Movies | Books | Music | Hotels | Average |
---|---|---|---|---|---|---|---|---|
1 | BART Stanford | 77.3% | 76.2% | 83.4% | 65.1% | 62.9% | 72.2% | 72.9% |
2 | BERT-LSTM Stanford | 62.6% | 58.4% | 60.4% | 44.0% | 50.3% | 60.4% | 56.0% |
Validation data can be found under directories of each domain in this git repository. The training sets can be downloaded from the following links:
Detailed statistics of the dataset can be found in the stats page.
This repository also contains the Makefile to run the full data synthesis, training, and evaluation of the Schema2QA dataset. Detailed instructions can be found in the installation and run instructions.
The dataset is released under CC BY 4.0. Please cite the following papers if you use this dataset in your work:
% Schema2QA & BERT-LSTM model
@inproceedings{xu2020schema2qa,
title={Schema2QA: High-Quality and Low-Cost Q\&A Agents for the Structured Web},
author={Xu, Silei and Campagna, Giovanni and Li, Jian and Lam, Monica S},
booktitle={Proceedings of the 29th ACM International Conference on Information \& Knowledge Management},
pages={1685--1694},
year={2020}
}
% AutoQA
@inproceedings{xu2020autoqa,
title={AutoQA: From Databases to Q\&A Semantic Parsers with Only Synthetic Training Data},
author={Xu, Silei and Semnani, Sina and Campagna, Giovanni and Lam, Monica},
booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
pages={422--434},
year={2020}
}
% BART parser
@inproceedings{campagna-etal-2022-shot,
title={A Few-Shot Semantic Parser for {W}izard-of-{O}z Dialogues with the Precise {T}hing{T}alk Representation},
author={Campagna, Giovanni and Semnani, Sina and Kearns, Ryan and Koba Sato, Lucas Jun and Xu, Silei and Lam, Monica},
booktitle={Findings of the Association for Computational Linguistics: ACL 2022},
pages={4021--4034},
month={may},
year={2022},
address={Dublin, Ireland},
publisher={Association for Computational Linguistics},
url={https://aclanthology.org/2022.findings-acl.317},
doi={10.18653/v1/2022.findings-acl.317},
}