Schema2QA 2.0

A large-scale question answering dataset on real-world data

The project is hosted on GitHub at stanford-oval/schema2qa.

Run Schema2QA/AutoQA Data Synthesis, Train, and Evaluate a Parser


Edit Makefile, set geniedir to where the Genie toolkit is installed, and set developer_key to your Thingpedia developer key. A Thingpedia developer account is required to obtain the developer key: register an Almond account and sign up as a developer, then retrieve the key from your user profile page.

You can also create a file called ‘’ with your settings if you don’t want to edit the Makefile directly.
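In either case, both settings are plain Makefile variable assignments. For example (the path and key below are placeholders; substitute your own):

geniedir = /path/to/genie-toolkit
developer_key = <your Thingpedia developer key>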

Data Synthesis


To synthesize a Schema2QA dataset, i.e. a dataset that includes synthesized data and human paraphrases, simply run the following:

make datadir

By default, this generates a dataset for the restaurants domain, with manual annotations by the authors and human paraphrases collected from crowd workers.

To synthesize data for a different domain, set the experiment option as follows:

make datadir experiment=people # change this to the domain you want
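For reference, the default behavior should be equivalent to passing the restaurants domain explicitly:

make datadir experiment=restaurants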

AutoQA (w/o automatic paraphrasing)

To synthesize an AutoQA dataset, i.e. a dataset with neither manual annotation nor human paraphrases, run the command with additional options to enable AutoQA and exclude human paraphrases as follows:

make datadir experiment=people annotation=auto human_paraphrase=false

If the command fails because of misconfiguration or a missing library, run make clean before rerunning make datadir.
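For example, to rebuild the AutoQA people dataset from a clean state:

make clean
make datadir experiment=people annotation=auto human_paraphrase=false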

AutoQA (w/ automatic paraphrasing)

You can automatically paraphrase any dataset using a neural paraphrasing model. If you paraphrase the dataset from the previous step, you will obtain the fully automatic dataset described in the AutoQA paper.

Running the following command will paraphrase the dataset in datadir and write two resulting datasets into datadir_paraphrased and datadir_filtered. The latter is the former after filtering is applied.

make datadir_filtered
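The experiment, annotation, and human_paraphrase options described above should be passed here as well so that the paraphraser runs on the intended dataset (this assumes the options behave the same for these targets). For example, a plausible end-to-end sequence for the fully automatic people dataset is:

make datadir experiment=people annotation=auto human_paraphrase=false
make datadir_filtered experiment=people annotation=auto human_paraphrase=false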

The datadir_paraphrased folder will contain an additional file, almond_paraphrase.results.json, which includes the average self-BLEU score of the paraphrased examples (the lower the self-BLEU, the more the paraphrases differ from the original dataset). Another file, almond_paraphrase.tsv, contains the raw output of the automatic paraphraser; its two columns are the example id and the paraphrased sentence.

The datadir_filtered folder will contain pass-rate.json, which contains the percentage of paraphrases that passed through the filter.
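A quick way to sanity-check these outputs from the shell (directory names as described above):

# first few paraphrases: column 1 is the example id, column 2 the paraphrased sentence
head -n 5 datadir_paraphrased/almond_paraphrase.tsv
# self-BLEU summary and filter pass rate
cat datadir_paraphrased/almond_paraphrase.results.json
cat datadir_filtered/pass-rate.json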


Training

To train a parser for Schema2QA, simply run

make train datadir=datadir

For training a parser on the AutoQA dataset, you can run make train datadir=datadir_filtered

Similar to make datadir, one can append the experiment option to the training command to choose a domain other than restaurants.
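For example, to train on the fully automatic people dataset synthesized earlier:

make train experiment=people datadir=datadir_filtered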

This automatically trains the BART model proposed in SKIM, which is currently the state-of-the-art model on the Schema2QA benchmark. The default settings are under train_nlu_flags in the Makefile. You can either tweak the hyperparameters directly, or append additional flags using custom_train_nlu_flags.
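As a sketch of the latter, assuming custom_train_nlu_flags is appended verbatim to the underlying training command and that the hyperparameters are command-line flags of the trainer (the value here is purely illustrative):

make train datadir=datadir custom_train_nlu_flags="--train_iterations 100000"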


Evaluation

Once a model is trained, run the following command to evaluate it on the dev set.

make evaluation

Again, set experiment to the domain you would like to evaluate on.
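For example, for the people domain:

make evaluation experiment=people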

The results will be saved in a file called ${experiment}/eval/${model-name}.results. This is a CSV file; the exact match accuracy is the column reported in the paper and on the leaderboard.
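For instance, after evaluating the people domain, the results file can be inspected with (the exact file name depends on the model name):

cat people/eval/*.results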

Compute Requirements

Data synthesis for Schema2QA can be done on a CPU machine, and usually takes less than 1 hour.

However, synthesizing an AutoQA dataset (w/ or w/o automatic paraphrasing) requires a GPU, as it involves running a neural paraphraser as well as training and running a parser for filtering. Depending on the GPU, this can take a couple of hours.

Training a parser with default hyperparameters takes around 5 hours on our GPUs.

We have tested these commands on an AWS p3.2xlarge machine, which has one 16GB NVIDIA V100 GPU, 8 vCPUs, and 61 GiB of memory. These commands should run out of the box on machines with fewer CPUs and less RAM, but if you are using a GPU with less memory, you might need to decrease train_batch_tokens and val_batch_tokens accordingly and increase train_iterations and train_filter_iterations instead.
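As a rough sketch, assuming these hyperparameters are trainer flags that can be overridden through custom_train_nlu_flags (the values are placeholders to be tuned for your GPU):

make train datadir=datadir custom_train_nlu_flags="--train_batch_tokens 2000 --val_batch_tokens 1000 --train_iterations 150000"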