Data Management Plan – DYLEN: Diachronic Dynamics of Lexical Networks

Treatment of research data during and after the project term: During the project, the research data will be stored on the servers of the ACDH and backed up regularly. In the last project month (M24), the research data will be prepared for archiving in the ARCHE repository.
Data collected, processed or developed: The project starts from the already available text corpora ParlAT and AMC. The data will be preprocessed, in particular with part-of-speech (POS) tagging and named entity recognition (NER). The enriched data will also be archived as XML. In the next step, the lexical networks will be constructed. During the project, the network data will be stored in a graph database; at the end of the project, it will be archived in ARCHE. Co-occurrence matrices will also be archived as CSV. Furthermore, we plan to archive the network visualisations (as SVG) in ARCHE. We will create a collection for the DYLEN project in ARCHE, which uses handles as persistent identifiers. The code developed within the project will be stored in a project-specific GitHub repository during the project. At the end of the project, the code will be archived in Zenodo, where it will receive a DOI as a persistent identifier and will therefore be citable and sustainable. We will use Zenodo for the code because ARCHE is more suitable for data and because the Zenodo and GitHub integration makes it very easy to archive code in Zenodo. The DYLEN project website will be maintained for at least 5 years after the end of the project.
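As a rough illustration of the network construction step, the following minimal sketch counts sentence-level co-occurrences in already tokenised text and keeps word pairs above a frequency threshold as weighted edges. The function name, the sentence-level window, the threshold and the sample tokens are assumptions made for this example; they do not describe the project's actual pipeline.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(sentences, min_count=1):
    """Return {(word_a, word_b): count} for words sharing a sentence.

    Hypothetical helper: sentence-level co-occurrence and the
    min_count threshold are illustrative assumptions.
    """
    counts = Counter()
    for tokens in sentences:
        # Each unordered pair is counted once per sentence;
        # sorting makes the pair order deterministic.
        for a, b in combinations(sorted(set(tokens)), 2):
            counts[(a, b)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Invented sample data, standing in for preprocessed corpus sentences.
sentences = [
    ["parlament", "gesetz", "debatte"],
    ["gesetz", "debatte"],
    ["parlament", "wahl"],
]
edges = cooccurrence_edges(sentences, min_count=2)
print(edges)
```

Each key of the resulting dictionary is one undirected edge of a lexical network and each value is its weight.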
Methods and standards applied: In order to generate, analyse and visualise different types of networks, methods from natural language processing, network analysis, data mining and statistics are applied. The corpus data (AMC and ParlAT) are available in XML, and the enriched versions with (enhanced) named entity (NE) annotation will also be made available in XML. The network data will be stored in a graph database. The archived network data will use the RDF standard, since the graph database Neo4j allows serialisation in RDF. The co-occurrence matrices will use the CSV format. The visualisations of the networks will be made available in SVG format. Since the data will be archived in ARCHE and described with metadata, and ARCHE supports CMDI, the resources will also be harvested by the CLARIN Virtual Language Observatory. They will then be searchable through a faceted browser, and hence easier to find and more accessible to a wider community, as well as citable through the handles assigned in ARCHE.
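To make the two archive formats for matrices and networks more concrete, the sketch below writes a small weighted edge list once as a symmetric co-occurrence matrix in CSV and once as RDF in N-Triples syntax. It is hand-written with the Python standard library and is not the Neo4j export mentioned above. The namespace `https://example.org/dylen/`, the property names `connects` and `weight`, and the sample edges are placeholders invented for this example.

```python
import csv
import io
from urllib.parse import quote

BASE = "https://example.org/dylen/"  # placeholder namespace, not a project URI
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"


def matrix_to_csv(edges):
    """Render undirected weighted edges as a symmetric CSV matrix."""
    words = sorted({w for pair in edges for w in pair})
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([""] + words)  # header row, empty corner cell
    for row_word in words:
        row = [
            edges.get((row_word, col_word), edges.get((col_word, row_word), 0))
            for col_word in words
        ]
        writer.writerow([row_word] + row)
    return buffer.getvalue()


def edges_to_ntriples(edges):
    """Render each edge as a node with two 'connects' links and a weight."""
    lines = []
    for (a, b), weight in sorted(edges.items()):
        edge = f"<{BASE}edge/{quote(a)}-{quote(b)}>"
        lines.append(f"{edge} <{BASE}connects> <{BASE}word/{quote(a)}> .")
        lines.append(f"{edge} <{BASE}connects> <{BASE}word/{quote(b)}> .")
        lines.append(f'{edge} <{BASE}weight> "{weight}"^^<{XSD_INT}> .')
    return "".join(line + "\n" for line in lines)


# Invented sample edges: (word, word) -> co-occurrence count.
edges = {("debatte", "gesetz"): 2, ("parlament", "wahl"): 1}
csv_text = matrix_to_csv(edges)
nt = edges_to_ntriples(edges)
print(csv_text)
print(nt)
```

Modelling each edge as its own resource is one of several ways to attach a weight to a relation in RDF; the vocabulary actually used for the archived data may differ.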
Volume and type of data: As stated above, we will have annotated text data (existing corpus data enriched in particular with NE annotation), co-occurrence matrices, network data and network visualisations. Furthermore, we will have code from the development of the interactive web application. The size of the networks ranges from a few KB to 100 MB, depending on whether a network has a few hundred or a few thousand nodes. We estimate the total volume of data at up to 5 TB.
Data security measures (during and after the project): During the project, we will use a version control system for data processing so that we can revert to specific versions of the project data. The data will be stored on the servers of the ACDH, where regular backups are made to prevent data loss.
Names of the repositories for archiving of data and code: We will use ARCHE to archive our research data. For the developed code, we will set up a project GitHub repository. The final version of the code will be archived in Zenodo. The publications (or pre-prints) resulting from this project will be archived in the relevant institutional publication repositories and also in Zenodo, since ARCHE is best suited for research data but not for research publications (cf. ARCHE Collection Policy).