Metadata
Controlled Vocabularies & Classifications
When the metadata section of a Data Management Plan template asks which ontology (if any) is used, the question usually means: is a specific vocabulary or classification system used? Controlled vocabularies are created by domain experts to help translate ontological concepts and to organise knowledge for subsequent (information) retrieval. Controlled vocabularies (CESSDA: “structured controlled vocabularies”) are intended to ensure consistency and to reduce the ambiguity inherent in natural language, where the same concept can be given different names. They are used in subject indexing schemes, subject headings, thesauri, taxonomies and other knowledge organization systems. Some vocabularies are widely accepted internationally and standardized, and may even become an ISO standard or a regional standard/classification. Controlled vocabularies can be broad in scope or limited to a very specific field.
Examples are:
- CDWA (Categories for the Description of Works of Art)
- Getty Thesaurus of Geographic Names (TGN)
- NUTS (Eurostat)
- Medical Subject Headings (MeSH)
Many examples of vocabularies and classification systems can be found on the FAIRsharing.org website, which maintains a large list covering multiple disciplines. If you are working on new concepts or ideas and are using or creating your own ontology or terminology, be sure to include it in the metadata documentation of your dataset (for example as part of your codebook).
Controlled vocabularies help make searching for and re-using information or data much easier when they are part of a machine-readable metadata scheme or system.
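As a minimal sketch of how a controlled vocabulary reduces ambiguity in machine-readable metadata, the Python example below maps free-text keywords to preferred terms. The vocabulary, its synonyms and the function names are hypothetical and chosen only for illustration; a real project would load the terms from the published vocabulary it has chosen.

```python
# Hypothetical, tiny controlled vocabulary: preferred term -> accepted synonyms.
# The entries are illustrative and not copied from any published thesaurus.
VOCABULARY = {
    "Neoplasms": {"cancer", "tumour", "tumor"},
    "Myocardial Infarction": {"heart attack"},
}


def build_lookup(vocabulary):
    """Map every preferred term and synonym (lower-cased) to its preferred term."""
    lookup = {}
    for preferred, synonyms in vocabulary.items():
        lookup[preferred.lower()] = preferred
        for synonym in synonyms:
            lookup[synonym.lower()] = preferred
    return lookup


def normalise_keywords(keywords, vocabulary=VOCABULARY):
    """Return (preferred terms, unrecognised keywords) for free-text keywords."""
    lookup = build_lookup(vocabulary)
    matched, unknown = [], []
    for keyword in keywords:
        preferred = lookup.get(keyword.strip().lower())
        if preferred is None:
            unknown.append(keyword)      # not in the vocabulary: flag for review
        elif preferred not in matched:
            matched.append(preferred)    # different names collapse to one concept
    return matched, unknown


if __name__ == "__main__":
    print(normalise_keywords(["Tumor", "cancer", "heart attack", "stroke"]))
```

In this sketch "Tumor" and "cancer" end up under the same preferred term, while "stroke" is reported as unrecognised so that it can be checked by hand rather than silently accepted.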
Metadata & Datasets
Metadata is descriptive information about data. Metadata allows humans and programs to understand and interpret data more easily.
CESSDA has created helpful guidance on creating metadata.
There are three main levels of metadata: Data assets, Dataset documentation and Dataset registration.
Data assets
On a micro-level, there are four functional categories of metadata standards for the datasets themselves. They describe elements such as structure, content, values (definitions, see also the code book) and data formats (CSV, XML, etc.). Research groups often also use their discipline’s standards to describe data objects through naming conventions. There are, however, other guidelines for naming conventions and document versioning that are useful for all documents, whether or not they are research data. The table below gives an example of this.
Data stage | Dataset description | Type of data | Versioning (example file name) |
---|---|---|---|
Raw data | Consumer spending data | Text files | 2017-02-23_ConsumerSpending_1.2.txt |
Processed data | Anonymized transcription of patient interviews | Word files, Excel | 2014-11-17_RawTranscription_Checked1.docx |
Analysed data | Photo images with descriptions | TIFF files | C:\Images\Raw\2016-07-01_Subject1-V2.tiff; C:\Images\Clean\2016-07-01_Subject1-H1c.tiff |
Analysed data | Photo images with descriptions | Word files | C:\Images\Clean\Descript\2016-07-01_Subject1-H1c.Docx |
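A naming convention like the one in the table can be applied and checked automatically. The Python sketch below assumes a three-part pattern, `YYYY-MM-DD_Description_Version.ext`, modelled on the first two rows of the table. The pattern and the function names are assumptions made for illustration, not a prescribed standard.

```python
import re
from datetime import date

# Assumed pattern: YYYY-MM-DD_Description_Version.ext
NAME_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})"
    r"_(?P<description>[A-Za-z0-9-]+)"
    r"_(?P<version>[A-Za-z0-9.]+)"
    r"\.(?P<ext>[A-Za-z0-9]+)$"
)


def build_name(day, description, version, ext):
    """Compose a file name from an ISO date, a description and a version label."""
    return f"{day.isoformat()}_{description}_{version}.{ext}"


def parse_name(filename):
    """Return the parts of a conforming file name, or None if it does not conform."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        return None
    parts = match.groupdict()
    try:
        date.fromisoformat(parts["date"])  # rejects impossible dates such as 2017-13-40
    except ValueError:
        return None
    return parts


if __name__ == "__main__":
    print(build_name(date(2017, 2, 23), "ConsumerSpending", "1.2", "txt"))
    print(parse_name("2014-11-17_RawTranscription_Checked1.docx"))
    print(parse_name("2016-07-01_Subject1-V2.tiff"))
```

Under this assumed pattern the TIFF file names from the table are reported as non-conforming, because they join the subject and the version with a hyphen instead of an underscore. A check like this makes such differences within a project visible early.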
Dataset documentation
At this general, descriptive level the metadata concerns the packaging and documentation of the dataset. It can include items like:
- Readme files that list the period of research, the collaborators, a short description of the research and the elements within the dataset
- Code book: descriptions, explanations or definitions of the variables in a dataset
- Policy documents describing the context of the research and referring to the standard operating procedures used
- Re-use guidelines (or licences) describing any re-use restrictions or limitations, including contact details.
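To illustrate the code book item, the Python sketch below stores a small code book as structured data, exports it as CSV and lists dataset columns that have no code book entry. The variables, value labels and function names are hypothetical examples.

```python
import csv
import io

# Hypothetical code book: one entry per variable in the dataset.
CODEBOOK = [
    {
        "variable": "respondent_id",
        "description": "Anonymised respondent identifier",
        "type": "integer",
        "values": "",
    },
    {
        "variable": "age_group",
        "description": "Age group at time of interview",
        "type": "categorical",
        "values": "1=18-34; 2=35-64; 3=65+",
    },
]

FIELDS = ["variable", "description", "type", "values"]


def codebook_to_csv(entries):
    """Serialise the code book as CSV text so it can be shipped with the dataset."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


def undocumented_columns(dataset_columns, entries):
    """Return the dataset columns that have no entry in the code book."""
    documented = {entry["variable"] for entry in entries}
    return [column for column in dataset_columns if column not in documented]


if __name__ == "__main__":
    print(codebook_to_csv(CODEBOOK))
    print(undocumented_columns(["respondent_id", "age_group", "income"], CODEBOOK))
```

Keeping the code book in a machine-readable form like this is one possible design. It makes it straightforward to verify, before depositing the dataset, that every variable is documented.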
Dataset registration
To make sure that your dataset is findable, it is recommended to describe it according to a metadata standard, which makes the metadata easier to exchange and easier for search engines to harvest. Many certified archives use a metadata standard for their descriptions. If you choose a data repository or registry, find out which metadata standard it uses. At the VU the following standards are used:
- DataverseNL and DANS use the Dublin Core metadata standard
- The VU Research Portal PURE uses the CERIF metadata standard
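As a rough illustration of what a description in Dublin Core can look like, the Python sketch below builds a small XML record from the fifteen elements of the Dublin Core Metadata Element Set. The `metadata` wrapper element, the function name and the example values are assumptions made for illustration. This is not the import or export format of DataverseNL, DANS or any other repository, which each define their own.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dublin_core_record(fields):
    """Build an XML string from a mapping of element name -> list of values.

    The <metadata> wrapper is a hypothetical container chosen for this sketch.
    """
    root = ET.Element("metadata")
    for name, values in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        for value in values:  # every element may be repeated
            element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            element.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    record = dublin_core_record({
        "title": ["Consumer spending survey"],   # hypothetical dataset
        "creator": ["A. Researcher"],            # hypothetical author
        "date": ["2017-02-23"],
        "format": ["text/plain"],
    })
    print(record)
```

Because the element names are fixed by the standard, a harvester that understands Dublin Core can read the title, creator and date of this record without knowing anything else about the dataset.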
Many archives implement or make use of specific metadata standards. The UK Digital Curation Centre (DCC) provides an overview of metadata standards for different disciplines, which is a useful resource when establishing and carrying out your research methodology. More tips are available at Dataset & Publication.
Archiving & FAIR Principles
The Dutch Techcentre for Life Sciences has developed open source software that helps you make your dataset’s metadata FAIR. The software is developed on GitHub, where full details of the FAIR Data Point software are available. The Dutch eScience Center has also developed FAIR Data Point software, full details of which are likewise available on GitHub.