Dataset: Format and Usage
We currently have TextDataset for text classification, RelationDataset for text relation classification, and NumericDataset for tabular data classification. Note that any new type of dataset can be converted into a NumericDataset with numeric features/pre-trained embeddings.
In general, data are stored in json files. We highly recommend storing train/valid/test data in three separate json files. Each json file contains a dictionary that looks like:
{
    "0": {  # data id
        "label": 0,  # (int) ground-truth label, starting from 0
        "weak_labels": [0, -1, 0, 1],  # (List[int]) weak supervision labels; -1 is ABSTAIN, 0...k are labels
        "data": {  # dataset-specific raw data dictionary
            ...
        },
    },
    ...
}
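If you are preparing your own dataset, a dictionary in this format can be written out with the standard json module; all field values below are hypothetical placeholders:
import json

# A minimal sketch: write one split (e.g. train.json) in the format above.
# The record contents here are hypothetical placeholders.
records = {
    "0": {
        "label": 0,                    # ground-truth label, starting from 0
        "weak_labels": [0, -1, 0, 1],  # -1 is ABSTAIN
        "data": {"text": "this is an example"},
    },
}
with open('datasets/my_dataset/train.json', 'w') as f:
    json.dump(records, f)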
In addition, a label.json file is required, which contains a dictionary mapping each label id (starting from 0) to its label surface name.
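For example, a label.json for a binary sentiment task might look like this (the surface names are illustrative):
{
    "0": "negative",
    "1": "positive"
}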
The file structure should look like:
datasets
|-- data name (for example, yelp)
    |-- label.json
    |-- train.json
    |-- valid.json
    |-- test.json
Then the dataset can be loaded by:
dataset_path = 'datasets/yelp'
train_data = TextDataset(path=dataset_path, split='train')
valid_data = TextDataset(path=dataset_path, split='valid')
test_data = TextDataset(path=dataset_path, split='test')
For classification, we also support multiple feature extractors. We take the BERT embedding extractor as an example:
extractor_fn = train_data.extract_feature(extract_fn='bert', model_name='bert-base-uncased', return_extractor=True)
valid_data.extract_feature(extract_fn=extractor_fn, return_extractor=False)
test_data.extract_feature(extract_fn=extractor_fn, return_extractor=False)
After calling extract_feature, the extracted features are stored in, for example, train_data.features.
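The extracted features can then be fed to any downstream classifier; below is a minimal sketch, assuming the dataset also exposes its ground-truth labels through a labels attribute (scikit-learn is used purely for illustration):
import numpy as np
from sklearn.linear_model import LogisticRegression

# train_data.features / train_data.labels: assumed accessors, see note above
X_train = np.asarray(train_data.features)
y_train = np.asarray(train_data.labels)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print('valid accuracy:', clf.score(np.asarray(valid_data.features),
                                   np.asarray(valid_data.labels)))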
For NumericDataset, each train/valid/test json file contains a dictionary:
{
    "0": {
        "label": 0,
        "weak_labels": [0, -1, 0, 1],
        "data": {
            "feature": [0, 1, 0.1]
        },
    },
    "1": {
        "label": 1,
        "weak_labels": [-1, 1, -1, 1],
        "data": {
            "feature": [1, 0, 0.2]
        },
    },
    ...
}
The NumericDataset only has a default feature extractor, which directly copies the feature stored in the json files to the dataset's features attribute:
data.extract_feature(extract_fn=None, return_extractor=False)
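Putting it together, a NumericDataset is loaded and featurized the same way as above (the dataset path is hypothetical):
dataset_path = 'datasets/my_numeric_dataset'  # hypothetical path
train_data = NumericDataset(path=dataset_path, split='train')
train_data.extract_feature(extract_fn=None, return_extractor=False)
# train_data.features now mirrors the "feature" vectors stored in the json file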
For TextDataset, each train/valid/test json file contains a dictionary:
{
    "0": {
        "label": 0,
        "weak_labels": [0, -1, 0, 1],
        "data": {
            "text": "this is an example"
        },
    },
    "1": {
        "label": 1,
        "weak_labels": [-1, 1, -1, 1],
        "data": {
            "text": "this is another example"
        },
    },
    ...
}
The TextDataset has the following feature extractors (extract_fn argument):
'bow': bag-of-words feature extractor, based on sklearn.feature_extraction.text.CountVectorizer.
'tfidf': TF-IDF feature extractor, based on sklearn.feature_extraction.text.TfidfVectorizer.
'sentence_transformer': sentence transformer feature extractor, based on SentenceTransformers.
'bert': BERT-based feature extractor, based on HuggingFace.
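For instance, the TF-IDF extractor is fit on the training split and then reused for the other splits, exactly like the BERT example above:
extractor_fn = train_data.extract_feature(extract_fn='tfidf', return_extractor=True)
valid_data.extract_feature(extract_fn=extractor_fn, return_extractor=False)
test_data.extract_feature(extract_fn=extractor_fn, return_extractor=False)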
For RelationDataset, each train/valid/test json file contains a dictionary:
{
    "0": {
        "label": 1,
        "weak_labels": [1, -1, 1, 1],
        "data": {
            "text": "AA is a BB.",
            "entity1": "AA",
            "entity2": "BB",
            "span1": [0, 2],  # character-level span
            "span2": [8, 10]  # character-level span
        },
    },
    ...
}
The RelationDataset has the following feature extractor (extract_fn argument):
'bert': BERT-based feature extractor, based on HuggingFace and R-BERT.
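Loading and featurizing a RelationDataset then follows the same pattern (a minimal sketch; the dataset path is hypothetical):
dataset_path = 'datasets/my_relation_dataset'  # hypothetical path
train_data = RelationDataset(path=dataset_path, split='train')
extractor_fn = train_data.extract_feature(extract_fn='bert', model_name='bert-base-uncased', return_extractor=True)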
For sequence tagging (SeqDataset), each train/valid/test json file contains a dictionary:
{
    "0": {
        "label": ["B-PER", "O", "O", "O", "O", "O", "O", "O", "B-LOC", "I-LOC", "I-LOC", "O", "B-ORG", "I-ORG", "O"],
        "weak_labels": [
            ["B-PER", "B-PER", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
            ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
            ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
            ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
            ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
            ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
            ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
            ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
            ["B-LOC", "B-LOC", "O", "O", "O", "B-LOC", "B-LOC", "B-MISC", "O", "O", "B-LOC", "B-LOC", "B-LOC", "B-LOC", "B-LOC", "B-LOC"],
            ["I-LOC", "I-LOC", "O", "O", "O", "I-LOC", "I-LOC", "I-MISC", "O", "O", "I-LOC", "I-LOC", "I-LOC", "I-LOC", "I-LOC", "I-LOC"],
            ["O", "I-LOC", "O", "O", "O", "I-LOC", "I-LOC", "I-MISC", "O", "O", "I-LOC", "I-LOC", "I-LOC", "I-LOC", "I-LOC", "I-LOC"],
            ["O", "I-LOC", "O", "O", "O", "O", "O", "I-MISC", "O", "O", "O", "O", "O", "O", "O", "O"],
            ["B-ORG", "B-ORG", "B-ORG", "B-ORG", "O", "O", "O", "I-MISC", "B-ORG", "B-ORG", "O", "O", "B-LOC", "B-LOC", "B-LOC", "B-LOC"],
            ["I-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O", "O", "I-MISC", "I-ORG", "I-ORG", "O", "O", "I-LOC", "I-LOC", "I-LOC", "I-LOC"],
            ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]
        ],
        "data": {
            "text": ["Gradsky", "has", "also", "performed", "as", "a", "tenor", "at", "New", "York", "City", "'s", "Carnegie", "Hall", "."],
            "len": 15
        },
    },
    ...
}
Note that, different from the classification datasets, each sentence in a SeqDataset consists of a sequence of tokens. Hence, the label here is a list whose length equals the sequence length, weak_labels is a list of lists in which the i-th inner list holds the weak labels for the i-th token, and len is the length of the sequence.
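These shape invariants can be checked directly when preparing a sequence dataset; a minimal sketch using only the standard library (the file path is hypothetical):
import json

with open('datasets/conll/train.json') as f:
    data = json.load(f)
for item in data.values():
    n = item['data']['len']
    assert len(item['label']) == n        # one tag per token
    assert len(item['weak_labels']) == n  # one list of weak labels per token
    # each token should receive one vote per labeling function
    assert len({len(w) for w in item['weak_labels']}) == 1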
In addition, a meta.json file is required, which contains metadata (e.g. weak label sources, entity types, maximum number of tokens). An example meta.json (for the CoNLL03 dataset) is shown below.
{
"train_size": 14041,
"valid_size": 3250,
"test_size": 3453,
"num_labels": 9,
"max_length": 124,
"lf": [
"BTC",
"core_web_md",
"crunchbase_cased",
"crunchbase_uncased",
"full_name_detector",
"geo_cased",
"geo_uncased",
"misc_detector",
"multitoken_crunchbase_cased",
"multitoken_crunchbase_uncased",
"multitoken_geo_cased",
"multitoken_geo_uncased",
"multitoken_wiki_cased",
"multitoken_wiki_uncased",
"wiki_cased",
"wiki_uncased"
],
"num_lf": 16,
"entity_types": [
"PER",
"LOC",
"ORG",
"MISC"
],
"lf_rec": [
"BTC",
"core_web_md",
"crunchbase_cased",
"crunchbase_uncased",
"full_name_detector",
"geo_cased",
"geo_uncased",
"misc_detector",
"multitoken_crunchbase_cased",
"multitoken_crunchbase_uncased",
"multitoken_geo_cased",
"multitoken_geo_uncased",
"multitoken_wiki_cased",
"multitoken_wiki_uncased",
"wiki_cased",
"wiki_uncased"
],
"num_lf_rec": 16
}
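Note that num_labels is 9 here because each of the 4 entity types has a B- and an I- tag, plus the O tag. A quick consistency check of meta.json can be done with the standard json module (the path is hypothetical):
import json

with open('datasets/conll/meta.json') as f:
    meta = json.load(f)
assert meta['num_lf'] == len(meta['lf'])          # 16 labeling functions
assert meta['num_lf_rec'] == len(meta['lf_rec'])
assert meta['num_labels'] == 2 * len(meta['entity_types']) + 1  # B-/I- per type, plus O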