Francesco Rizzo, Laura Vignola, Dario Camol, Mauro Bianchi
National Institute of Statistics (ISTAT)
1. The reason why Istat is starting to use SDMX
2. SDMX Istat framework
2.1 Short-term time series database (ConIstat)
2.1.1 Check and Loader software modules
2.1.2 ConIstat Web Navigator software module
2.2. Changes and new modules
2.2.1 Database changes
2.2.2 Extensions to the Check and Loader modules
2.2.3 The web service module
2.2.4 SDMX Web Navigator
2.2.5 Manager and web navigator reference metadata module
4.1 A generalized SDMX Query parser module (beta version)
1. The reason why Istat is starting to use SDMX
Istat is following the evolution of the SDMX initiative with interest, taking the following two approaches into account:
tactically, through the participation in the EUROSTAT SODI initiative;
strategically, through the building of a working group whose main aim is to analyze and verify the use of SDMX in the internal architecture of the Istat Information System.
The current Istat Information System is supported by a distributed architecture. Several production Directorates operate through local sub-systems that independently cover the full life cycle of statistical data, from data collection to data dissemination. Data and metadata for dissemination are therefore currently located in different databases maintained by different production Directorates.
To access these sub-systems, users must use navigation interfaces that are completely different from each other.
The need to provide a uniform data search through a single interface has therefore become a major requirement.
To achieve this objective Istat is involved in developing an Integrated Output Management System. This is an Information System oriented towards the integration of part of the life cycle of statistical data of the Institute, with particular emphasis placed on data dissemination.
The experience we are acquiring by participating in the SODI project will allow us to support the strategic interest in SDMX within Istat.
In order to facilitate this objective, we are developing a framework consisting of various compatible SDMX software modules. The framework can be used in its entirety, from the reporting phase to the dissemination phase, or alternatively the modules can be used separately, integrating them into an existing Information System.
In the future it might be possible to distribute the framework under a Public Licence.
This document describes the architecture and the modules of the SDMX Istat Framework version 1.0.
2. SDMX Istat framework
Istat is taking part in the SODI project with an internal working group made up of one analyst and four programmers. The project is divided into the following phases:
studying of SDMX;
reporting analysis – to establish where data is stored, particularly STS and ESA data. We discovered that most of this data is stored in the ConIstat database (short-term time-series database, visible at the URL: http://con.istat.it);
analysis and design – to determine which existing modules must be extended and which new modules must be built. In particular, the system must perform the following functions:
extend data reporting in order to integrate the existing database with other data required by the SODI project;
map the already existing data structures to the DSDs defined by Eurostat;
provide a web service capable of accepting a SDMX Query and responding with a SDMX Compact;
provide a RSS file that announces when new data is loaded or updated and specifies the URL of the files containing the SDMX Query that locates the new or updated data;
provide a web client program that can connect to the web service. This client is able to query the database using the statistical concepts defined in the DSDs. As well as visualizing the results of the search, in XML and HTML format, it is able to visualize the SDMX Query that will then be sent to the web service.
The following diagram shows the existing system (ConIstat) and new modules added to satisfy the SODI project.
2.1 Short-term time series database (ConIstat)
The Istat Short-term time series database stores 16.000 time series. The following statistical subject-matter domain list shows how these time series are organized:
Retail trade sales
Employment, wages and other labour indicators
Large firms labour indicators
Large firms labour indicators (seasonally adjusted data)
Large firms labour indicators (data adjusted for calendar effects)
Quarterly indicators on employment, earnings and social security contribution
The ConIstat system consists of the time series database and some software tools that cover all the data flow from reporting to dissemination.
2.1.1 Check and Loader software modules
Production Directorates arrange the updated data in a fixed-format record file (.dat). Look at this example of a file for the Production Price:
ppa19grCB 200611 119,9
ppa19grD 200611 114,8
ppa19grDA 200611 111,8
ppa19grDG 200611 114,5
ppa19grE 200611 149,3
ppa19grCB14 200611 119,9
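The Check step on such a file can be sketched as follows, assuming the record layout inferred from the example above (a series code, a period in YYYYMM form, and a value with a comma as decimal separator). Function names and validation rules are illustrative, not Istat's actual ones.

```python
# Sketch of a check step for the ".dat" reporting files, assuming the
# layout seen in the example: series code, YYYYMM period, comma-decimal
# value. Names and rules are illustrative, not Istat's actual ones.

def check_line(line):
    """Split one record and validate its three fields.

    Returns (series_code, period, value) or raises ValueError."""
    parts = line.split()
    if len(parts) != 3:
        raise ValueError("expected 3 fields, got %d" % len(parts))
    code, period, raw_value = parts
    if not (len(period) == 6 and period.isdigit()):
        raise ValueError("bad period: %r" % period)
    if not 1 <= int(period[4:]) <= 12:
        raise ValueError("bad month in period: %r" % period)
    value = float(raw_value.replace(",", "."))  # comma decimal separator
    return code, period, value

def check_file(lines):
    """Return the 'checked' records, collecting errors per line."""
    records, errors = [], []
    for n, line in enumerate(lines, 1):
        if not line.strip():
            continue
        try:
            records.append(check_line(line))
        except ValueError as exc:
            errors.append((n, str(exc)))
    return records, errors
```

A file whose lines all pass yields an error-free "checked file"; otherwise the person responsible gets the per-line error list to correct manually prepared records.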
The person responsible for sending data for a particular production Directorate uses the Check module to prepare a “checked file” free from certain types of errors. Generally the “.dat” files are prepared automatically by the production Directorate sub-systems, but occasionally they may be prepared manually, so a check process is necessary.
This “checked file” is sent to a centralized structure which loads it, through the Loader module, into the “time series database”.
The Loader module, besides loading data into the database, also prepares a “Dissemination fixed-format record file” (one for each sub-domain involved in the update).
Look at this example of a dissemination fixed-format record file for the Production Price:
The “Dissemination fixed-format record files” were a specific requirement of some users who need to load their own databases with the data uploaded to the Istat time-series database.
2.1.2 ConIstat Web Navigator software module
This module is a web application that allows one to search for data using a complex query. A user can simultaneously manage time-series from different statistical domains with different frequencies.
Look at this example extract with a monthly and a quarterly time series coming from different statistical domains:
Industry; turnover; national turnover index; Mining and quarrying
Continuous Labour Force Survey; Labour Force; by Geographical Area and Sex; Italy Female
2.2. Changes and new modules
As decided in the design phase, some of the existing modules were extended, new ones were created, and a mapping operation between the already existing data and the DSDs defined by Eurostat was carried out.
2.2.1 Database changes
The need to describe the time series in the ConIstat database using the DSDs defined by Eurostat has involved some changes in the database schema.
The ConIstat database schema consists of the following main tables:
METADATA, stores a set of structural metadata;
DATA, stores the observations for each time series;
In the METADATA table each row depicts a time series and each column depicts a statistical concept (dimension or attribute).
In particular, the fields Domain and SubDomain play the same role as the second and third levels of the “statistical subject-matter domain” list described in the “SDMX content-oriented guide”. The fields Category, Type, Vs, ClassificationCode, Freq and Um refer to the statistical concepts (dimensions and attributes) that allow one to distinguish one time series from another.
Look at a simplified diagram of the ConIstat database schema:
The changes that have been performed in the ConIstat database schema consist of:
creating two tables: one, named STS_METADATA, used to describe STS indicators, and one, named ESA_METADATA, used to describe ESA1 and ESA2 indicators.
These tables inherit the base structure from METADATA; in each table a group of columns is then added, whose number depends on the statistical concepts used to define the DSDs.
In the future it will be necessary to add other tables to store time series from other domains;
creating some lookup tables: CONCEPTS, CODE_LIST and DATAFLOWS;
creating a table, named KEY_FAMILIES that stores the features of each DSD. From this table, it is possible to know which table stores the structural metadata for each domain.
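The changes above can be sketched with a minimal schema, here in SQLite through Python's sqlite3 module. The base METADATA columns come from the text; the DSD-specific columns added to STS_METADATA and ESA_METADATA (for example REF_AREA or STS_ACTIVITY) are hypothetical placeholders, and since SQLite has no table inheritance the base structure is simply repeated.

```python
import sqlite3

# Minimal sketch of the revised ConIstat schema described above.
# The base METADATA columns come from the text; the extra DSD-specific
# columns are illustrative placeholders.

BASE_COLUMNS = """
    series_id TEXT PRIMARY KEY,
    Domain TEXT, SubDomain TEXT, Category TEXT, Type TEXT,
    Vs TEXT, ClassificationCode TEXT, Freq TEXT, Um TEXT
"""

schema = [
    "CREATE TABLE METADATA (%s)" % BASE_COLUMNS,
    # STS_METADATA "inherits" the base structure and adds one column
    # per statistical concept used to define the STS DSDs.
    "CREATE TABLE STS_METADATA (%s, REF_AREA TEXT, STS_ACTIVITY TEXT)"
        % BASE_COLUMNS,
    "CREATE TABLE ESA_METADATA (%s, REF_AREA TEXT, ESA_PRICES TEXT)"
        % BASE_COLUMNS,
    "CREATE TABLE DATA (series_id TEXT, period TEXT, obs_value REAL)",
    # Lookup tables and the KEY_FAMILIES registry: for each DSD it
    # records which table stores the structural metadata of its domain.
    "CREATE TABLE CONCEPTS (id TEXT, name TEXT)",
    "CREATE TABLE CODE_LIST (concept_id TEXT, code TEXT, label TEXT)",
    "CREATE TABLE DATAFLOWS (id TEXT, key_family_id TEXT)",
    "CREATE TABLE KEY_FAMILIES (id TEXT, metadata_table TEXT)",
]

conn = sqlite3.connect(":memory:")
for stmt in schema:
    conn.execute(stmt)
```

Adding a new domain in the future would mean creating another *_METADATA table and registering it in KEY_FAMILIES.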
Look at a schematic diagram of the database after the changes:
Look at this example describing a record in the METADATA table and the corresponding mapping, using a DSD:
Domain: e – Industry
SubDomain: if – Turnover
Category: 13 – National Turnover Index
Type: g -Neither seasonally or working day adjusted
2.2.2 Extensions to the Check and Loader modules
Check and Loader modules were extended to include a new data reporting function and new dissemination functions.
The new data reporting function (in beta testing) allows one to collect data organized in GESMES format. This function will allow the collection of data not yet in the database but already available in the production Directorates.
The new dissemination functions have been developed to satisfy the SODI requirements:
to publish a RSS file that announces when new data is loaded or updated and specifies the URL of the files containing the SDMX Query that locates the new or updated data;
to publish one or more SDMX Query file(s);
to publish one or more SDMX Compact file(s) (optional).
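The RSS file mentioned in the first item could be produced as in the following sketch, built with Python's xml.etree; the channel title, item texts and URLs are hypothetical examples, not the actual published feed.

```python
import xml.etree.ElementTree as ET

# Sketch of the update-announcement RSS file: each entry says that a
# dataflow was loaded or updated and links to the file containing the
# SDMX Query that locates the new or updated data. All literal values
# here are hypothetical examples.

def build_rss(updates):
    """updates: iterable of (dataflow_id, pub_date, query_file_url)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "ConIstat data updates"
    for dataflow, date, query_url in updates:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = "%s updated" % dataflow
        ET.SubElement(item, "pubDate").text = date
        # the link points at the SDMX Query file for the updated data
        ET.SubElement(item, "link").text = query_url
    return ET.tostring(rss, encoding="unicode")
```

A subscriber polls this feed and, for each new item, downloads the linked SDMX Query file and submits it to the web service (the Pull method described below).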
The SDMX Compact files have been created in case the response times for extracting an entire dataflow online are too long. In this case, instead of querying the database, the web service directly takes the already prepared SDMX Compact file. Logically, in this case, it is not possible to filter a specific data extract; one must extract an entire dataset.
2.2.3 The web service module
The core of the SDMX Istat Framework is the “SDMX data web service” module. It allows the use of the Pull exchange method to request data.
This web service can be located at the following URL: http://sodi.istat.it/sodiWS/service1.asmx.
A client application can request data from the “SDMX data web service” by sending a SDMX Query. The client receives data in SDMX Compact format.
The “SDMX web service” implements the functions described in the following diagram:
reads the SDMX Query stream;
verifies the XML format against an XML schema;
parses the SDMX Query and decomposes it into elementary parts;
constructs a SQL query with the same meaning as the SDMX Query;
executes the SQL query;
reads the response from the database and creates a SDMX Compact stream;
sends the SDMX Compact stream to the client.
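Steps 3 and 4 above can be sketched as follows. The reduced query format, the concept-to-column mapping and the table names are illustrative assumptions (the real SDMX-ML Query schema is namespaced and richer), and a real implementation would use bound parameters instead of string interpolation.

```python
import xml.etree.ElementTree as ET

# Sketch of steps 3 and 4: decompose the DataWhere section of a
# SDMX Query into elementary parts and build a SQL query with the
# same meaning. Only a flat "And" of Dimension nodes is handled here;
# the XML shape, table and column names are illustrative.

def sdmx_query_to_sql(xml_text):
    root = ET.fromstring(xml_text)
    conditions = []
    for dim in root.iter("Dimension"):
        # each Dimension node carries a concept id and a coded value
        conditions.append("%s = '%s'" % (dim.get("id"), dim.text))
    where = " AND ".join(conditions) or "1 = 1"
    return ("SELECT m.series_id, d.period, d.obs_value "
            "FROM STS_METADATA m JOIN DATA d ON m.series_id = d.series_id "
            "WHERE " + where)

query = """<Query><DataWhere><And>
  <Dimension id="FREQ">M</Dimension>
  <Dimension id="STS_ACTIVITY">NS0040</Dimension>
</And></DataWhere></Query>"""
```

The resulting SQL rows would then be serialized as a SDMX Compact stream (steps 5 and 6) and returned to the client.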
At the moment the “SDMX data web service” accepts SDMX Queries which contain only the “DataWhere” section and allows queries regarding an entire Dataflow or its subsets.
Look at this example of a SDMX Query that specifies how to extract the entire Dataflow of “neither seasonally or working day adjusted – monthly - Industrial Production”:
Look at this example of a SDMX Query that specifies how to extract a subset (only NS0040, N100CA, N11100 activities) of “neither seasonally or working day adjusted, seasonally adjusted and working day adjusted – monthly - Industrial Production”:
We are working on a new version of the “query parser” sub-module. It will be able to interpret a SDMX Query, referring to different Dataflows or subsets of different Dataflows.
During design and prototyping phases we took into consideration the response-time of the “SDMX web service” when a client submitted a query that involved a large amount of data.
The query algorithm was optimized and we obtained response times under 20 seconds in the following test: we requested from the “SDMX data web service” an entire Dataflow containing 685 time series, each 120 months long, for a total of 82.200 observations.
2.2.4 SDMX Web Navigator
This module is a web application that acts as a client towards the “SDMX data web service”. It was created with the intention of testing the web service using a graphical interface. We then converted it into a real web navigator by adding data presentation functions.
In general, using this module, it is possible to carry out the following functions:
querying the database using the DSDs as analysis dimensions;
The following image shows the interface used to query the database:
From this graphical interface a user can build a SDMX Query by choosing the Dataflow that contains the data of interest and then setting up some filters through the Dimensions and the time period.
The user can then choose whether to visualize or save the SDMX Query, or to send the query to the web service and visualize or save the resulting SDMX Compact file.
The following image shows the “query tester”, a graphical user interface that allows one to write a SDMX Query directly and send it to the web service:
2.2.5 Manager and web navigator reference metadata module
The SODI project, as well as data and structural metadata exchange, deals with reference metadata exchange. For this purpose we are developing a software module (in beta version), as part of the SDMX Istat Framework, that automates the production and dissemination of this type of metadata.
The production Directorates send Eurostat and the IMF MS-Word files containing reference metadata in a self-governing way, through the use of templates.
The following diagrams show the process used to produce reference metadata before and after the introduction of this module:
During the analysis phase we compared the information requested by Eurostat and the IMF through their “Word” templates. We put together a list of the information required in both cases and designed a database schema able to store all of it.
Then we designed the following sub-modules:
reference metadata manager (beta testing)
reference metadata web navigator (beta testing)
word extractor (beta testing)
SDMX reference metadata web service (under construction)
The person responsible for filling in the templates can now use the “reference metadata manager” sub-module. This sub-module, through a web graphic interface, allows the storage of all the necessary information regarding reference metadata for a particular dataset.
The information flow in this sub-module is organized by means of documents.
A document contains all the reference metadata for a particular dataset in a particular time period and can be in the following states:
on working - the document is not complete;
posted - the document is complete and ready to be sent to Eurostat and/or IMF.
For a dataset, it is possible to have different documents made at different times (for example one for each year or one for each quarter).
This sub-module allows the following:
start a new session from scratch or recall an already “posted” document;
insert reference metadata information by filling in the text fields linked to each level;
interrupt the session and continue it at a later moment;
“post” (make definitive) the document.
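The document life cycle above can be sketched as a small state machine; class and method names are illustrative, not the module's actual interface.

```python
# Sketch of the document life cycle: a document holds the reference
# metadata for a dataset and period, starts in the "on working" state
# and becomes definitive once "posted". Names are illustrative.

class ReferenceMetadataDocument:
    def __init__(self, dataset, period):
        self.dataset = dataset
        self.period = period
        self.fields = {}           # text filled in for each level
        self.state = "on working"  # not yet complete

    def set_field(self, level, text):
        if self.state == "posted":
            raise RuntimeError("posted documents are definitive")
        self.fields[level] = text

    def post(self):
        # "post" makes the document complete and ready to be sent
        # to Eurostat and/or the IMF
        self.state = "posted"
```

A dataset can have several such documents made at different times, for example one per year or one per quarter.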
The “Word extractor” sub-module allows one to export the information stored in the database to Word files that comply with the templates. The production Directorates can therefore also use this module to produce the files that they must send to Eurostat and the IMF.
4.1 A generalized SDMX Query parser module (beta version)
As specified in paragraph 2.2.3, the “SDMX data web service” accepts SDMX Queries which contain only the DataWhere section and permits queries regarding an entire Dataflow or its subsets. This module continues to accept only queries with a DataWhere section, but it also accepts queries regarding different Dataflows and subsets of different Dataflows. It permits one to extract at the same time, for example, the following time series:
Industrial production - Seasonally adjusted and not working day adjusted – Consumer goods:
Turnover, domestic market (non-deflated) - Neither seasonally or working adjusted - Manufacture of rubber products:
DataSet ID="STSIND_TURN_M"
In the DataWhere section it is possible to find these types of nodes:
The nodes Dimension, Attribute and Dataflow are all “simple nodes”: these types of nodes contain only a value.
The nodes Time, And and Or are all “complex nodes”: these types of nodes can contain other complex or simple nodes. Particularly a Time node can contain only two child nodes: “StartTime” and “EndTime”, while And and Or nodes can contain all types of nodes.
The idea is to transform a complex SDMX Query into a simpler one, applying some rules of Boolean logic:
the following hierarchical expression:
Node1 And Node2 And (Node3 And Node4) And (Node5 And Node6)
can be transformed into:
Node1 And Node2 And Node3 And Node4 And Node5 And Node6
the following hierarchical expression:
Node1 Or Node2 Or (Node3 Or Node4) Or (Node5 Or Node6)
can be transformed into:
Node1 Or Node2 Or Node3 Or Node4 Or Node5 Or Node6
the following hierarchical expression:
Node1 And Node2 And [(Node3 Or Node4) And (Node5 Or Node6)]
can be transformed into:
(Node1 And Node2 And Node3 And Node5) Or (Node1 And Node2 And Node4 And Node5) Or (Node1 And Node2 And Node3 And Node6) Or (Node1 And Node2 And Node4 And Node6)
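The rules above amount to rewriting the query tree into disjunctive normal form. A minimal sketch, representing a tree as nested ("And", children) / ("Or", children) tuples with strings as simple nodes:

```python
from itertools import product

# Sketch of the Boolean rewriting rules above: to_dnf flattens a query
# tree into a list of AND-groups whose overall relation is Or, i.e. the
# "all columns And, all rows Or" form used by the parser.

def to_dnf(node):
    if isinstance(node, str):                    # simple node
        return [[node]]
    op, children = node
    child_dnf = [to_dnf(c) for c in children]
    if op == "Or":
        # an Or of Ors flattens into a single list of alternatives
        return [group for dnf in child_dnf for group in dnf]
    # And: distribute over each child's alternatives (third rule)
    return [[leaf for group in combo for leaf in group]
            for combo in product(*child_dnf)]
```

Applying it to the third rule's expression yields exactly the four Or-groups listed above.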
The SDMX Query is processed by the following steps:
rewrites the XML stream using the Boolean logic rules above, adding to each node an attribute that acts as a “unique key”. The purpose of the unique key is to establish the hierarchy between “parent nodes” and “child nodes”;
converts the XML stream into an in-memory tabular data structure whose columns represent all types of nodes (Time, Dimension, Attribute, Dataflow) except And and Or;
re-organizes the in-memory tabular data structure so that all columns have an And relation, and all rows have an Or relation;
converts the in-memory tabular structure into a SQL query.
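The last step can be sketched as follows: each row of the tabular structure is a mapping from a column (a node type plus concept) to a value; rows combine with Or and columns with And. Column names are illustrative, and a real implementation would use bound parameters rather than string interpolation.

```python
# Sketch of the final step: the in-memory tabular structure, where
# columns are in an And relation and rows in an Or relation, becomes
# a SQL WHERE clause. Each row is a dict from column name to value;
# column names are illustrative.

def rows_to_where(rows):
    or_parts = []
    for row in rows:
        and_parts = ["%s = '%s'" % (col, val)
                     for col, val in sorted(row.items())]
        or_parts.append("(" + " AND ".join(and_parts) + ")")
    return " OR ".join(or_parts)

# two rows = two Or-alternatives, e.g. series from different Dataflows
rows = [
    {"DATAFLOW": "STSIND_PROD_M", "STS_ACTIVITY": "NS0040"},
    {"DATAFLOW": "STSIND_TURN_M", "STS_ACTIVITY": "N22250"},
]
```

This is how the generalized parser can serve a single SDMX Query spanning different Dataflows with one SQL statement.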