5. Planning and implementation

This step can be implemented as follows:

The person promoting data sharing in the organisation and the data administrator:

  • assess the possibilities of sharing data, for example with regard to access rights,
  • assess the benefits, risks, and costs of data sharing,
  • assess the need for data anonymisation and aggregation,
  • assess the quality of the data and metadata to be shared,
  • select the licence under which the data will be shared, and
  • decide on opening the data.

The data protection specialist:

  • is consulted on the identification of information security and data protection risks, and
  • is consulted on the need for anonymisation and aggregation.

The organisation’s IT experts and the person responsible for the data to be shared:

  • assess the technical data sharing implementation, for example whether the data is shared as a file or through an API.

Defining data to be opened

Figure: Steps for defining the data to be opened

This section describes various matters that should be assessed and defined by the organisation as it begins planning how it will open its data in practice. At the same time, it is also advisable to consider if it would be possible to also open the data production process (calculation rules, algorithms, etc.) when opening the dataset.

In organisations that have already opened data, the process for defining the data that is to be opened is usually carried out in accordance with the steps below. For more information on the topic, see the data.europa.eu service’s guidebook for publishing open data (pdf).

Familiarisation with the organisation’s information management model

At the start of the process for defining the data to be opened, begin by identifying:

  • who manages and is responsible for the data to be opened, and which information system is used, and
  • what are the most likely use cases for the opened data.

For example, you can use your own organisation’s information management model, which describes the organisation’s information resources, or rely on the help of the person responsible for opening the data, if the organisation has allocated resources for this task. For additional information on the information management model, see step 4 of the operating model.

Consideration of factors affecting data sharing

At the practical level, the organisation must identify how the data it collects and manages has been created and what kinds of agreements and legislation apply to the data's different components. When defining the data to be opened, take the following into account:

  • possible legislation applicable to the data,
  • ownership of the data, 
  • copyrights, 
  • data disclosures, 
  • data protection, 
  • information security, and
  • factors related to data descriptions or metadata.

For more information on the legislation applicable to data sharing, see step 2 of the operating model: “Legislation and obligations”. More information about ensuring data protection is available later in this step. The description of metadata is explained in step 6 of the operating model: “Publication”.

If the aforementioned factors do not restrict the sharing of the data, the organisation can consult the data's administrator about how the dataset could be technically formed and shared in practice. It is also a good idea to define the coverage and level of accuracy of the data to be opened so that the data's potential and usability are not compromised.

Making use of standards

When defining the data to be opened, it is advisable to determine whether there are national or international standards for opening the data in question (e.g. data modelling and formats), or whether another party has already opened any similar data, in which case the same data model can also be used in your own opening process. When using international standards, organisations should note the dissimilarities between the laws of different countries, especially regarding data protection.

For example, the national Open Data portal uses the DCAT-AP data model to describe the metadata of its data. This model is an application profile defined by the European Data Portal (EDP) for the use of the DCAT standard in Europe. A description of the DCAT-AP data model can also be found on the Interoperability Platform.
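
For illustration, the core properties of a DCAT-AP dataset description could be sketched as follows. This is a simplified Python representation with invented values; a real description would be published as RDF through the portal's metadata tools.

```python
# A simplified, illustrative view of core DCAT-AP dataset properties.
# The property names come from the DCAT vocabulary; all values are invented.
dataset_metadata = {
    "dct:title": "City bicycle counts",
    "dct:description": "Hourly bicycle counts from automatic counters.",
    "dct:publisher": "Example City",
    "dcat:keyword": ["cycling", "traffic", "open data"],
    "dcat:distribution": [
        {
            "dcat:accessURL": "https://data.example.org/bicycle-counts.csv",
            "dct:format": "CSV",
            "dct:license": "https://creativecommons.org/licenses/by/4.0/",
        }
    ],
}
```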

In addition, the six largest cities in Finland have collected a list of other international standards (Google Sheets) that have been used for data sharing in Finland.

Tips from organisations

Tips from Helsinki Region Infoshare

It is important to clearly designate a single responsible party (role) for opening the data, whom parties outside the organisation can contact about open data questions. It is also a good idea to organise developer meetings and similar events, where you can collect feedback even before opening the data.

Tips from the Finnish Meteorological Institute

The data controller should consider how metadata and a verbal description of the dataset will be produced, and how any need for user support will be met after publication.

Tips from the National Land Survey of Finland

The data controller must go through the copyrights of the data in detail, ensuring that its ownership is clear. If the data has previously been licensed against a fee, the transition period and customer communications should be planned carefully. In addition, sufficient communications network capacity must be ensured.

Benefits, risks and costs

This section describes the potential benefits, risks, and expenses of opening data that the organisation should assess before opening its data. Identifying and evaluating these areas is necessary for ensuring the success of the data opening process.

Studies have demonstrated several societally significant benefits arising from the opening and reuse of data. These benefits include:

  • improvements in public services and cost management,
  • increased competition and the creation of new services, and
  • transparency of public administration and support for research activities.

Figure: Benefits of opening data

The benefits of opening data can also be examined through the value created by the opened dataset. For more information about determining the value of datasets, see step 4 of the operating model. More information on the benefits of opening data can be found in the Economic Benefits of Open Data report (pdf) published by the EDP.

Risk identification is a necessary part of the data opening process so that any risks can be prepared for and managed effectively. Key risks related to the opening and sharing of data include the possible use of the shared data or data combinations (consisting of the shared data and other data) in a way that is harmful or damaging to society, citizens, or the authorities. Examples of such cases include: 

  • identity theft,
  • blackmail or scams, and 
  • physical damage to civic infrastructure, such as electricity, telecommunications, or transport networks and buildings.

The opening of data often results in costs, especially in connection with the organisation's first opened datasets. These costs may include:

  • acquiring and maintaining technology,
  • maintaining datasets,
  • extracting datasets from source systems, and
  • monitoring the use of datasets and supporting users.

Managing information security risks

Recommendations of the Information Management Board on information security

The Information Management Board has issued a set of recommendations for applying certain information security provisions (in Finnish) (Ministry of Finance publications 2021:65), according to which information risk management is a continuous activity, and the information management entity should describe the objectives, responsibilities, and key methods related to it. The management is responsible for the organisation of and allocation of resources to information risk management. In addition, the information management entity maintains datasets comprising the risk assessment results and risk management plans and regularly assesses if this data is partly or fully secret or classified.

The Information Management Board has also issued a recommendation on the criteria for assessing information security in public administration (in Finnish) (Julkri), which contains instructions for applying the criteria (Ministry of Finance publications 2022:43). The assessment criteria support the needs to develop and evaluate information security in all branches of public administration. They can be used to assess compliance with the Public Information Management Act, the Decree on Security Classification of Documents in Central Government and, in part, the information security requirements laid down in the General Data Protection Regulation.

Digital security

It is important to ensure digital security when planning and implementing any data opening activities. Digital security includes issues related to risk management, business continuity management and preparedness as well as cyber security, information security and data protection. Citizens, companies, and communities must be able to rely on ethically sustainable, open, and transparent public administration services when they are provided digitally as well.

The Government Resolution on Digital Security in the Public Sector (Ministry of Finance publications 2020:23) defines the principles of development and key services for promoting security in digital environments. To promote digitalisation and digital security in a balanced manner, the Ministry of Finance has appointed a strategic management group for digitalisation and digital security in public administration.

Read more about actions and documents related to the development of digital security (in Finnish).

The BRC method

You can use the BRC method – which is used to assess benefits, risks, and costs – to assess the areas presented in this section and generate an indicative summary of these factors.

Method for assessing the potential benefits, risks and costs of opening data

As part of the operating model for data sharing, an assessment method was developed to help public administration organisations to assess the potential benefits of opening and sharing their datasets and the risks and costs associated with data sharing. This tool is also known as the BRC method (benefits, risks and costs).

Download the BRC assessment tool (in Finnish, Excel file)

The BRC tool is an Excel sheet that provides an overview of assessment results based on the responses the user has filled in, helping to determine the potential benefits, risk profile and costs arising from opening the dataset. It should be noted that the summary that is received based on the responses is only an indicative overview of different observations and not a recommendation. Each organisation makes the decisions on opening its data independently, taking into account legislation (including access rights and data disclosure), official guidelines and its internal policies. For example, the summary can be used as background material to justify the potential benefits of data sharing to those who make the decision on opening data.

The BRC method is based on assessment methods used by public administration organisations in Finland and internationally. Such parties as the National Institute of Standards and Technology, the University of Washington, Harvard University and several other expert organisations have been involved in developing the first version of the method.

The assessment method can be used to:

  • prioritise the order in which datasets are opened when resources are limited
  • identify datasets whose opening involves different risks
  • reach an understanding of the costs incurred from opening data
  • identify datasets with the greatest potential benefits for external stakeholders (data users) 
  • analyse possible income obtained from sharing data
  • introduce a systematic approach to the opening of datasets and decision-making on data sharing

The assessment method is intended for those responsible for opening data in organisations. They may include those responsible for datasets, heads of information management, or data opening coordinators. In addition, it is advisable to involve specialists from each relevant area, from technical experts to data protection officers, in the different stages of the assessment.

Ensuring data protection

This section describes the assessment of the need to aggregate and anonymise the dataset to be opened, as well as organisation-specific practices for ensuring the data protection of the dataset and the implementation of any necessary aggregation, anonymisation, or pseudonymisation processes.

At this stage, it is a good idea to carefully assess together with the Data Protection Officer whether the dataset to be shared is public and whether it could contain any personal data or other data critical to the functioning of society. It is also worth assessing whether the data to be opened needs to be processed in such a way that, for example, the requirements of the EU’s General Data Protection Regulation are not violated.

Anonymisation, aggregation, and pseudonymisation

Anonymisation means processing data in such a way that it becomes impossible to identify any individual person directly or indirectly. Identification must be prevented irreversibly, ensuring that neither the controller nor any third party can convert the data back into an identifiable format using the data in their possession. For example, identifiers can be deleted or generalised (aggregated) to a level where individuals can no longer be identified. Identifiers include names, addresses, phone numbers and personal identity codes.

Aggregation refers to the re-grouping of data on the basis of one or more factors to a coarser level. The data can be combined at a general level or converted into a statistical format so that data concerning an individual person is no longer in an identifiable form.

Pseudonymisation means processing personal data in such a way that it can no longer be attributed to a specific person without additional information. Such additional information must be kept carefully separate from the personal data.
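
As a minimal sketch of these three techniques, the following Python example (using the pandas library; all column names and values are invented) pseudonymises a personal identity code with a salted hash, anonymises the table by removing direct identifiers and generalising age, and finally aggregates the result into a statistical format:

```python
import hashlib
import pandas as pd

# Hypothetical person-level records; the columns and values are invented.
df = pd.DataFrame({
    "personal_id": ["010180-123A", "020290-456B", "030370-789C"],
    "name": ["Anna", "Ben", "Carl"],
    "age": [43, 33, 53],
    "municipality": ["Helsinki", "Espoo", "Helsinki"],
})

# Pseudonymisation: replace the direct identifier with a salted hash.
# The salt is the "additional information" and must be stored separately.
SALT = b"keep-this-secret-and-separate"
df["pseudonym"] = df["personal_id"].map(
    lambda pid: hashlib.sha256(SALT + pid.encode()).hexdigest()[:16]
)

# Anonymisation: delete direct identifiers and generalise quasi-identifiers.
anonymised = df.drop(columns=["personal_id", "name", "pseudonym"])
anonymised["age_group"] = pd.cut(anonymised.pop("age"), bins=[0, 30, 50, 120],
                                 labels=["0-30", "31-50", "51+"])

# Aggregation: convert to a statistical format (counts per group).
aggregated = (anonymised.groupby(["municipality", "age_group"], observed=True)
              .size().reset_index(name="count"))
print(aggregated)
```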

Consideration of the principle of publicity when opening data

According to the principle of openness (Act on the Openness of Government Activities 621/1999), official documents are public unless otherwise specifically provided in that Act or another Act. It is important to note, however, that even a public document may contain personal data, and there must always be a legal basis for disclosing personal data. The authority must assess whether the personal data contained in a document can be disclosed: the fact that a document is public does not necessarily mean that all of its contents can be published. Secrecy, in turn, requires that the criteria laid down in the Act on the Openness of Government Activities are fulfilled; secrecy provisions are also included in special legislation.

Compliance with the General Data Protection Regulation

It should be noted that as long as a person can be directly or indirectly identified from the data, or the data can be reverted to an identifiable form, it remains personal data subject to the General Data Protection Regulation.

Under the General Data Protection Regulation, certain controllers and processors of personal data must appoint a data protection officer. This obligation applies to all authorities and public administration bodies. The data protection officer provides guidance on data protection to the controller and employees processing personal data. They monitor compliance with the GDPR and the information activities and training provision related to data protection in their organisation. The data protection officer provides advice related to impact assessments and serves as the contact point for the supervisory authority.

The Office of the Data Protection Ombudsman is the national supervisory authority that monitors compliance with data protection legislation. The Data Protection Ombudsman and the Deputy Data Protection Ombudsmen perform their duties independently and impartially. The Office of the Data Protection Ombudsman has an Expert Board (whose term of office runs from 1 October 2020 to 30 September 2023). The Expert Board's task is, at the Data Protection Ombudsman's request, to issue statements on significant issues related to the application of legislation on personal data processing. For more information, visit the website of the Office of the Data Protection Ombudsman.


Practices employed by organisations

Below, you will find examples of different privacy practices, such as aggregation and anonymisation, that are used by organisations that have opened their data.

State Treasury's practices

The State Treasury’s analysts carry out analyses on assignment. The analyses mainly use the central government's shared data platform, to which the material specified in the assignment is imported. Together with the customer, the analyst defines the data areas needed for the analysis using a data navigator. The navigator contains descriptions of the data residing in the systems of central government's joint service providers. In this description, the service provider and agencies have specified if a field may contain personal or secret information. A predefined set of data masking rules is generated for the described fields.

The analyst places an order with the data engineer for data matching the assignment. The data engineer retrieves the necessary columns from the service providers' systems through APIs, removes unnecessary columns to minimise the data, and masks the data in accordance with the masking rules:

  • Any text fields that may contain personal data are deleted
    • For example, the descriptive texts for monitoring targets 1 and 2 will be removed from the financial monitoring data
    • For example, fields containing personal names or emails will be deleted
  • In the dataset, personal identifiers are encrypted using an encryption algorithm, ensuring that the original value cannot be identified while preserving the uniqueness of the identifier
    • For example, the contents of fields containing personal identity codes are converted, using a cryptographic sealing (one-way) function, into a string from which the original value cannot be directly derived
  • The State Treasury may not have access rights to the data at a certain level of accuracy, but the data aggregates produced from it may be public. In these cases, the service provider aggregates the data to a public level defined together with the agencies on a case-by-case basis. In this context, aggregation refers to the re-grouping of data on the basis of one or more factors to a less accurate level.
    • For example, instead of the operating unit, the sum or average of the accounting unit level is presented
    • For example, trips to different continents are presented, instead of individual countries

The masked and minimised data is transferred to the platform for the analyst's use. The analysis is carried out on the masked data, which does not contain directly identifiable personal data. If the analyst nevertheless finds that the data may contain direct personal data, they inform the data engineer so that the masking rules for the fields in question can be adjusted, and refrain from processing the dataset until the correction has been made and the dataset is free of personal data. Once the analysis has been carried out, the analyst aggregates the results to a statistical level before presenting them to the customer. This means ensuring that every group shown contains data on at least five persons, so that no individual can be identified in the results.
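
A minimal sketch of this kind of masking pipeline, assuming a keyed one-way function (HMAC-SHA-256) for identifier masking and a five-person minimum for published groups; the field names, key handling and data below are illustrative only:

```python
import hmac
import hashlib
import pandas as pd

KEY = b"stored-separately-from-the-data"  # hypothetical masking key

def mask_id(personal_id: str) -> str:
    """One-way keyed hash: unique per input, original value not derivable."""
    return hmac.new(KEY, personal_id.encode(), hashlib.sha256).hexdigest()

# Invented microdata: one row per transaction, with a personal identifier.
data = pd.DataFrame({
    "personal_id": [f"id-{i}" for i in range(1, 8)],
    "accounting_unit": ["A"] * 5 + ["B"] * 2,
    "amount": [100.0, 250.0, 75.0, 120.0, 90.0, 60.0, 40.0],
})

# Minimise and mask: drop the direct identifier, keep a masked key.
data["person_key"] = data.pop("personal_id").map(mask_id)

# Aggregate results to a statistical level: suppress groups of fewer
# than five persons so that no individual can be identified.
summary = (data.groupby("accounting_unit")
           .agg(persons=("person_key", "nunique"), total=("amount", "sum"))
           .reset_index())
summary = summary[summary["persons"] >= 5]  # unit B (2 persons) is suppressed
print(summary)
```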

Statistics Finland's practices

To ensure the data protection of a dataset, it is important to examine whether the dataset contains target units whose identity or attributes could be disclosed directly or indirectly. Direct identification requires that the dataset includes a unique attribute of the target unit, such as a name, address or business ID. Indirect identification is possible when the target unit can be identified on the basis of several attributes, for example 'mayor' as the occupation combined with the municipality in which the person works. Some attributes of an individual target unit may also be revealed without identifying the unit, when a larger group to which the unit belongs shares the same property. For example, in a survey on well-being at work, all employees of a certain department may have responded and expressed dissatisfaction with the physical working environment, which reveals each employee's answer.

When assessing the risk of disclosure, there is a major difference between unit-level datasets and data that has been aggregated in some way. When processing unit-level data, in which the properties of an individual target unit are examined unit by unit, indirect disclosure may still be possible even if some attributes have been aggregated. Longitudinal datasets, which examine the situation of a target unit over the long term, are a good example: a person's mobility or work history can very quickly lead to a situation where indirect identification cannot be excluded, even if the data is aggregated to some extent. In the case of unit-level datasets, the risk of disclosure should consequently be examined broadly, taking several attributes into account simultaneously. In general, anonymising unit-level datasets through aggregation and data delimitation results in small datasets that are mainly useful as examples. Alternative data protection methods include data scrambling, (multiple) imputation and the production of synthetic datasets.

Statistics Finland has produced anonymous unit-level datasets intended for teaching purposes. The results obtained from these datasets may be indicative, but they are in no way suitable for producing statistical reports or scientific research. Read more about training datasets.

Aggregated data refers to data in which the attribute values of several target units have been compiled. This data can be divided into frequency tables describing the number of target units and quantity tables describing attribute values which, for example, describe totals or averages of an attribute. For frequency tables, the disclosure risk is determined by the cell value of each cell as a threshold value which the target units in the cell must reach at minimum. The threshold value depends on the attributes to be examined. In its official population statistics, Statistics Finland partly includes even individual persons in the statistics. Generally, however, at least three target units in a cell are required to ensure data protection. Applying this minimum value avoids situations where two target units sharing the same attributes could infer each other's values on the basis of the published data. Statistics Finland applies a higher threshold when examining data at a geographical level more accurate than a municipality (the threshold may be as high as fifty when looking at grid data) and, usually, a threshold of ten is applied to special category data referred to in the General Data Protection Regulation or crime data.

For quantity tables, merely looking at the threshold value is not enough to prevent the attribute values of another target unit from being inferred if the target units are in the same cell. In this case, Statistics Finland also applies the dominance rule to identify cells involving a disclosure risk. This means that cells in which a single target unit, or several target units together dominate (produce the majority of the value of the cell), are flagged as subject to protection. For example, if the cell examines company turnover by industry and region, it should not be possible to infer the value of an individual large company in a cell where the other companies have very low turnovers in proportion to the largest.
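
The sketch below illustrates both rules on hypothetical company turnover microdata: cells with fewer than three units are flagged under the threshold rule, and cells where the largest unit contributes more than a chosen share of the cell total are flagged under the dominance rule. The threshold values and data are illustrative, not Statistics Finland's actual parameters.

```python
import pandas as pd

# Hypothetical company turnover microdata (units = companies).
micro = pd.DataFrame({
    "industry": ["A", "A", "A", "B", "B", "C"],
    "turnover": [900.0, 30.0, 20.0, 400.0, 380.0, 150.0],
})

THRESHOLD = 3        # minimum number of units per published cell
DOMINANCE = 0.85     # a single unit may not exceed this share of the cell total

cells = micro.groupby("industry")["turnover"].agg(
    n="count", total="sum", largest="max").reset_index()

cells["risk_threshold"] = cells["n"] < THRESHOLD
cells["risk_dominance"] = cells["largest"] / cells["total"] > DOMINANCE

# Primary masking: suppress values in risky cells before publication.
publish = cells.copy()
risky = publish["risk_threshold"] | publish["risk_dominance"]
publish.loc[risky, ["total", "largest"]] = None
print(publish)
```

Cells flagged in this way are candidates for the primary masking discussed next.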

The threshold value and the dominance rule can be used to determine the cells presenting a primary disclosure risk. If only these cells are deleted (masked) in the published dataset, their values are easy to recalculate when the dataset also contains marginal sums, i.e. row and column totals. In this case, additional masking must be used to ensure data protection. Special software is available for determining the cells subject to this secondary masking, and its use ensures adequate protection. Such software includes Tau-Argus and the sdcTable R package; more information about the software is available on GitHub.

For more information on data protection, see Statistics Finland's resources for researchers.

HRI’s instructions for assessing the aggregation and anonymisation needs of survey data

In cooperation with the City of Helsinki's data protection officer, Helsinki Region Infoshare has created instructions for opening survey datasets (and other data containing personal data) (in Finnish).

Good practices, support materials and other publications of VAHTI working groups

VAHTI is a cooperative, preparatory, and coordinating body of organisations responsible for the development of digital security in public administration. Organisations can use best practices and VAHTI guidelines to develop different areas of security.

VAHTI activities were transferred to the Digital and Population Data Services Agency in early 2020.

Outdated recommendations can still be applied if legislative amendments are taken into account.

The Digital and Population Data Services Agency has produced several training packages on digital security that are available on eOppiva, for example “Risk management in the digital world” (in Finnish) and the “ABC of data protection” (in Finnish).

Selecting the form of data sharing and file format

This section describes the forms in which data can be shared and what should be taken into account when selecting a distribution method and file format.

Data can be shared as files, through APIs, or through download services. The technical implementation of data sharing largely depends on the types of sharing solutions developed for the information system. If data from the information system can be shared through an API, the national API principles should be used in its design. Data in file format can be exported from the system as a batch report and/or through an API. Older information systems rarely have APIs, and developing one for them is often not feasible, which is why batch files may be the only option for sharing their data.

It is a good idea to share datasets in several different formats if possible, i.e. by offering downloadable files in addition to an API. When publishing open data as a downloadable file, use open file formats whenever possible. For more information about the classification of file formats, see Tim Berners-Lee's 5-star model.

Figure: Tim Berners-Lee's 5-star model (adapted from: 5-star Open Data).

Which sharing method is suitable for each type of data?

The selected data sharing method should be compliant with the legislation on access rights, data disclosure, and providing data in machine-readable format, as well as the obligations imposed by these statutes, such as sections 22 and 24 of the Public Information Management Act. In addition, any modifications that may be required for the data, such as pseudonymisation or anonymisation (which were discussed in the previous section), must be taken into account.

Sharing data in an open file format

Files are suitable for small and especially static datasets whose contents do not change often. Open, high-quality data can be shared in an open file format, which usually allows the data to be reprocessed in a software-independent manner.

An open file format refers to a non-commercial file format that anyone can use free of charge. The use of open file formats is not restricted by copyrights, patents, trademarks or other restrictions. For example, Microsoft's .docx and .xlsx file formats are commercial rather than open, and using them with free software is difficult. According to Tim Berners-Lee's 5-star model, data published in an open file format receives at least 3/5 stars.

The list below contains tips for publishing different types of datasets:

  • Text data: TXT. The simplest and most reliable file format for publishing plain text is .txt.
  • Tabular data: CSV. The best and easiest file format for tables is .csv (comma-separated values). CSV files can be easily created using common spreadsheet programs, including Microsoft Excel, by selecting CSV as the file format when saving (see the sketch after this list).
  • Spatial data, small vector datasets: GeoJSON, KML, Esri shapefile (SHP) or GeoPackage. The first two use the global WGS84 coordinate system and can be easily processed with a variety of programs and tools. An SHP file, on the other hand, supports several coordinate systems, including those developed for Finnish conditions.
  • Spatial data, large raster datasets: GeoTIFF or NetCDF. For example, the GeoTIFF file format can be used to publish data in raster format.
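
As a small illustration of the tabular case, the sketch below writes a few invented records to a .csv file using only Python's standard library:

```python
import csv

# Invented example records; a real dataset would come from the source system.
rows = [
    {"municipality": "Helsinki", "year": 2021, "bicycle_count": 12345},
    {"municipality": "Espoo", "year": 2021, "bicycle_count": 6789},
]

with open("bicycle_counts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["municipality", "year", "bicycle_count"])
    writer.writeheader()  # one header row, then comma-separated value rows
    writer.writerows(rows)
```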

When data is shared in the PDF format, note which PDF version is used and make sure that the data is in a machine-readable format. Adobe developed and patented PDF in the 1990s as a commercial file format. In 2008, version 1.7 (ISO 32000-1) was standardised as an almost open format, while some features remained Adobe's property (including the Adobe XML Forms Architecture and Adobe JavaScript). In PDF 2.0 (ISO 32000-2), published in 2017, all features are open. Read more about open file formats.

Read Wikipedia's comprehensive list of open file formats.

Sharing data through an API

What is an API?

Application Programming Interfaces (APIs) are documented interfaces through which software, applications, or systems can exchange data or functionalities. The API provides data or a functionality in a machine-readable, documented format, making it possible for some other software, application or system to use it programmatically. For example, a route guide application can use a public transport API to receive information on when a specific bus will arrive at the stop and display this information to its user.

In this operating model, the terms API, application programming interface and technical interface referred to in the Public Information Management Act mean the same thing. It should be noted that rather than referring to an interface intended for end users, APIs are always used by some other software, application, application component or system.
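
To make this concrete, a client could fetch arrivals from a hypothetical public transport API over HTTP. The URL, parameters and response fields below are invented, and the third-party requests library is used for the HTTP call:

```python
import requests  # third-party HTTP client library

# Hypothetical endpoint; real APIs document their own URLs and parameters.
response = requests.get(
    "https://api.example.org/v1/stops/1234/arrivals",
    params={"limit": 1},
    timeout=10,
)
response.raise_for_status()

arrival = response.json()[0]  # machine-readable JSON payload
print(arrival["line"], arrival["minutes_until_arrival"])
```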

Why use APIs for data sharing?

Sharing data through APIs is in many ways advisable and useful, especially if the volume of the data is very large and the dataset is updated frequently or in real time, in other words if it comprises dynamic data such as train timetables or weather data. However, it is worth remembering that file sharing remains useful, especially for those who are unable to use APIs. File distribution may also require fewer resources from the data distributor than implementing and maintaining a new API, especially if the organisation does not otherwise make use of APIs.

APIs can be web-based, such as REST, SOAP or GraphQL, or they can be based on databases or other protocols. The essential point is that the API provides data in a machine-readable, documented format, making it possible for other software, applications or systems to use it programmatically. Providing the data through web-based interfaces is a good idea if this is possible and consistent with the purpose of use.

Web-based technologies can be used for both internal and external APIs, and a wide range of information security controls can be implemented in them. The file format to be shared depends on the communication protocol. For example, web-based APIs usually use an HTTP-based communication protocol or architecture, such as REST. APIs are also highly suitable for sharing statistical data residing in a database. For example, see the datasets in Statistics Finland's open databases (in Finnish).
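
Below is a minimal sketch of a web-based, REST-style API serving a small open dataset, written with the Flask library. The route, dataset and query parameter are invented for illustration:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical in-memory dataset; a real service would query a database.
OBSERVATIONS = [
    {"station": "Kaisaniemi", "time": "2024-01-01T00:00:00Z", "temp_c": -4.2},
    {"station": "Kumpula", "time": "2024-01-01T00:00:00Z", "temp_c": -4.8},
]

@app.route("/v1/observations")
def observations():
    """Return observations, optionally delimited by a station query parameter."""
    station = request.args.get("station")
    rows = [o for o in OBSERVATIONS if station is None or o["station"] == station]
    return jsonify(rows)  # machine-readable JSON response

if __name__ == "__main__":
    app.run()
```

Note how the query parameter lets the user delimit the data instead of downloading everything at once, which is one of the differences from file distribution highlighted in the comparison below.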

It is important to take APIs into account as part of the organisation's other information management and operating processes, as well as its goals for knowledge management. It is particularly essential for the organisation to determine which types of datasets are offered or used through APIs internally and externally, and which datasets should be accessible through APIs. Internal access and use can be implemented through internal APIs. To provide for external access and use, partner APIs or public APIs can be used, depending on the classification of the data. Please note that if the data cannot be provided as open data, sharing it may require using the Suomi.fi Data Exchange Layer.

Using the national API principles in the design and development of your APIs is advisable. The public administration API principles provide support and instructions for public administration actors in the development, management and file formats of APIs. Among other things, the API principles support the specification of APIs, assignment of responsibilities, promotion of interoperability, and the procurement, testing and implementation of APIs. When designing an API, it is important to specify how changes to its life cycle plan or service level are managed.

Additional information and support material is available on API development, management and file formats.

Comparing the forms in which data can be shared

Use the comparison below to evaluate which distribution format is best suited to your purposes, paying particular attention to the differences between files and APIs.

  • Ease of use for the data user
    • File: Usually the easiest to use. Small CSV files, for example, can be opened using standard office software.
    • API: In practice, APIs are often only used by people with programming skills. The design of the API affects its user-friendliness; the entire life cycle of the API should be taken into account in its design.
  • Technical competence required of the administrator
    • File: No specific technical competence is required.
    • API: Requires competence in both developing and maintaining the API.
  • Data volume
    • File: Suitable for small volumes.
    • API: Suitable for large volumes.
  • Data delimitation
    • File: The entire dataset is always downloaded at once.
    • API: The data can be delimited based on a query, or all data can be retrieved at once. The API can also offer files.
  • Rate of data updates
    • File: Primarily suitable for data that changes rarely. If the data changes, the updated version must be shared separately.
    • API: Often recommended for data that changes frequently.
  • Monitoring of use
    • File: Challenging, as a file is easy to copy.
    • API: Easy, as analytics can be collected on API requests (IP address, query, time, date, query response, etc.).
  • Practical examples
    • File: Postal codes, the State budget, most popular first names, small statistics.
    • API: Weather and timetable data, business data, mobility data.

Examples of the distribution methods used by organisations

Methods used by the Finnish Meteorological Institute to share data

The Finnish Meteorological Institute shares its data through its own API services and Amazon AWS.

Helsinki Region Infoshare’s tips for choosing the data sharing method

The Helsinki Region Infoshare service of the cities in the Helsinki Metropolitan Area has compiled tips for assessing technical feasibility. The questions below support the assessment.

In which format should data be opened?

As a file:

  • File in which the data is maintained (xlsx/csv/shp/ …) 
  • The data is exported manually from the system
  • The data is exported automatically from the system
  • Usually a quick and free-of-charge way to open data, but often requires manual updates that someone must remember to carry out

Through an API:

  • An API is created for data that is exported from the system automatically
  • An API is produced for the system/system copy
  • More work and resources are required at the beginning, but no separate updates are required

Questions that should be considered when selecting the data format:

  • How often is the data updated?
  • How large is the data volume?
  • Is it real-time or, for example, annually compiled data?
  • How much manual work does editing the data take?
  • What could the data be used for?
  • Are there any standards?
  • Has any other party already opened similar data? How was it done? Would it be possible to open the data in a similar format?

Helsinki Region Infoshare has introduced the Datasette tool, which enables the publication of data available through an API in a file format. Read more about Datasette on HRI’s website (in Finnish).

HRI’s instructions for selecting a file format (in Finnish).

Defining data quality

This section describes how the quality of a dataset to be opened can be evaluated, determined, and described.

The public administration's shared data quality criteria and indicators, developed to support improvements in the quality of public administration data, can be used to evaluate and describe data quality.

A dataset’s descriptive data, i.e. metadata, should describe the assessment of the dataset’s current quality along with its possible weaknesses. In the Open Data service, for example, you can type the data quality evaluation in the Description field of the dataset’s metadata, or add the description as a separate resource in PDF format. 

It is important to note that even if the quality of the dataset to be opened does not meet the expectations of the party administering the data or its stakeholders, this does not necessarily mean that the data should not be shared. The dataset can still be shared, provided that the shortcomings in its quality are pointed out in the metadata.

Data quality criteria

A public Data Quality Framework has been developed under the leadership of Statistics Finland and through broad-based cooperation within public administration. This work has been carried out as part of the Ministry of Finance's project on Opening up and using data. The data quality criteria and indicators were published in spring 2022.

The data quality criteria can be used to describe and assess the quality of data. They also help data users to assess if the data is of sufficiently good quality for the intended purpose. In the longer term, the quality criteria help improve the quality of data and information resources.

The quality criteria are intended as a flexible tool; not all criteria or, in particular, indicators are necessarily relevant to all situations or datasets. It should also be noted that the purpose of the data affects the level that should be aimed for under each quality criterion. For instance, some purposes require continuously updated data (pandemic monitoring), whereas for others, annual or even less frequent updates (the location of old buildings) are sufficient. While the quality criteria and their indicators form a hierarchical structure, they also affect and are linked to each other.

The quality criteria of the quality framework, and especially their indicators, focus on structured data. From the data user’s perspective, the quality criteria for datasets have been grouped under three questions.

How well does the data describe reality?

  • Timeliness: Timeliness describes the time dimension of datasets. The closer the reference date of the dataset is to the present, the more timely the data will be. The reference date is the date to which the data relates.
  • Coherence (regularity, logical integrity of data): Coherence indicates that the dataset is consistent and non-contradictory. Coherence can also be used to describe consistency between different datasets.
  • Completeness (extensiveness): Completeness describes the temporal and regional coverage of the dataset, as well as the coverage of the intended target units and attribute data. It also indicates the extent to which the dataset contains the desired data.
  • Correctness (validity): Correctness describes the extent to which the data in the dataset corresponds to reality. By examining the correctness of data, systematic distortions in the dataset may also be picked up.
  • Accuracy (unbiasedness): Accuracy describes how well the data in the dataset corresponds to what is aimed for and how precise the data is.

How has the data been described?

  • Traceability (non-repudiation): Traceability indicates that any changes made to the dataset and its data can be traced. The origin of the data is known.
  • Intelligibility (interpretability, comprehensibility): Intelligibility describes the extent to which the dataset has metadata that helps to understand the data when in use.
  • Compliance with recommendations (interoperability, semantic uniformity, consistency): Compliance indicates that the dataset and its attribute data comply with known standards, practices and statutes and that they have been reported in connection with the dataset.

How can the data be used?

  • Machine readability: Machine readability indicates if the data has been structured to enable computerised processing, and processing in different information systems.
  • Punctuality (timeliness): Punctuality means that the dataset is available on the given date and with sufficient frequency to reflect changes in the dataset.
  • Access rights: Access rights describe how the rights to use the dataset have been defined and what users may do with the data, in other words what purposes the data can be used for.


Defining access rights

This section describes how the dataset to be opened should be licensed, i.e. what terms of use should be set for the data.

There are no official recommendations for defining the access rights to datasets that are to be opened, but in practice, the opening of the data requires a licence. There are many ready-made licence options that you can use for your own dataset.

Data access rights are defined:

  1. by selecting a suitable licence for the data that informs its users on what terms the published data can be utilised
  2. by describing the licence in the metadata of the dataset to be published. 

What kind of licence should be selected for opened data?

In order to qualify as open data, the shared data must have an open licence that allows for the free sharing, modification, and use of the dataset for all purposes, including commercial ones. In addition, it is worth considering whether the origin of the data should be mentioned.

Datasets published as open data are licensed under the Creative Commons CC BY 4.0 or CC0 licence. While no national recommendations currently exist in Finland for licensing the public administration's open datasets, the earlier JHS 189 recommendation on open data access rights recommended the use of the CC BY 4.0 licence.

According to the EU Regulation on High-value Datasets (2023/138), high-value datasets must be licensed under the Creative Commons CC BY 4.0 or CC0 licence, or some less restrictive open licence.

Well-known Creative Commons licences should be used, as their popularity makes questions such as data access rights easier to deal with. It is important that the organisation does not draft licences of its own, as the case law related to them cannot be predicted.

The use of well-known licences also benefits data users:

  • Creative Commons licences are internationally recognised, which makes cross-border use possible.
  • Aggregating and reusing datasets is easier when they are subject to consistent and familiar terms and conditions.

Most common open data licences

Creative Commons CC0 1.0 Universal

The CC0 licence means that all copyrights to the data are waived. Data licensed under a CC0 licence has been fully released for free use, both for commercial and non-commercial purposes. The data user does not need to acknowledge the origin of the data or request permission for its use.

The metadata of datasets is often published under a CC0 licence. For example, the metadata on the hri.fi service is CC0-licensed, which makes it possible for the metadata to be automatically copied to the Open Data service.

Creative Commons Attribution 4.0 International (CC BY 4.0)

The CC BY 4.0 (Creative Commons Attribution 4.0 International) licence obliges the data user to acknowledge the origin of the data. The data user must credit the source, provide a link to the licence, and indicate any changes made to the data. Otherwise, CC BY 4.0 licensed data can be used freely.

Example of acknowledgement

Helsinki Region Infoshare service recommends the following acknowledgement of using data published on the service: "Source: Revenue and expenditure of the City of Helsinki. Data maintained by Helsinki City Executive Office. Dataset downloaded from Helsinki Region Infoshare service on 15 November 2021 under Creative Commons Attribution 4.0 licence".

Read more about Creative Commons

Creative Commons is an international, non-commercial organisation that promotes the sharing and use of creativity and information by means of free legal tools. The free and user-friendly copyright licences of Creative Commons provide an easy and standardised way to give the public the right to share and reuse creative works on specific terms and conditions. Rather than replacing copyright, the CC licences operate alongside it.

Read more about Creative Commons Finland’s work (in Finnish).

Read more about open data licensing on data.europa.eu.

Creative Commons offers assistance for selecting a suitable licence.


Limiting liability with a disclaimer

In addition to a licence, liability towards data users may sometimes need to be limited with a disclaimer.

Example of a disclaimer

“[Organisation name] is not liable for any loss, litigation, claim, prosecution, cost or damage of any nature, caused either directly or indirectly by association with the open data published by [organisation name] or the use of the open data published by [organisation name].”

Deciding to open data

This section describes how the organisation can proceed when it decides to open its data.

There are no official recommendations for deciding on the opening of data. Instead, each organisation can act in accordance with its own processes. For example, the person promoting the organisation’s data sharing and the organisation’s data administrator can make the final decision on the opening of the dataset to be shared.

Under Finnish legislation, the decision to open the datasets is made by each authority that has been given the statutory task of administering the data in question. For example, the Finnish Institute for Health and Welfare (THL) makes the decisions on providing its datasets as open data. There is no centralised body in Finland that would make decisions on the openness of data in the entire administration.

Data is opened so that it can be utilised. Organisations often know at least some of the customers who could use the data to be opened and see value in it. It is worth assessing the opportunities created by opening the data together with these potential data users, for example in a workshop. As input, a description of the data the organisation could open is needed. Organisations often have a great deal of data, and only a small part of it can be opened at once.

At the same time, a decision should also be made on managing any residual risks. A residual risk is a risk, or part of a risk, that remains after the necessary measures have been taken, or that cannot be addressed or is best left unaddressed. For additional information on residual risks, see the Risk Management Handbook for central government actors (in Finnish) (Publications of the Ministry of Finance 2023:54).

If the organisation intends to open several datasets, it may need to prioritise the order in which they are opened and the related development measures.

Practices employed by organisations

Practice of Helsinki Region Infoshare

In the City of Helsinki, the data owner specifies the data to be opened without a formal decision-making process. The number of datasets opened via HRI is small enough to avoid any need for prioritisation.

Practice of the Finnish Meteorological Institute

At the Finnish Meteorological Institute, decisions on opening data through an API and prioritisation are made by a steering group operating within the Institute.

Utilising the Open Data service

The Open Data service is a free publication platform for open data published in Finland. The platform operates on a self-service principle, which means that every authority and citizen can freely use it for opening and using data.

Support materials on the topic

This section contains support material related to the topics discussed in this step.

Training courses are available on the data.europa.eu website and, in Finnish, on the eOppiva website.