Activities
1. Implement the service in a limited test location
Before rolling out a data solution or service at scale, it is essential to first test it in a limited, controlled location. This pilot allows teams to identify technical issues, user-experience challenges, and potential data quality concerns early. By starting small, teams can gather valuable feedback, refine and adjust the system, and make sure it aligns with all goals and operational requirements. A localized rollout also minimizes risk, reduces costs, and increases the likelihood of success when the solution is expanded across broader regions or departments.
2. Prepare the infrastructure for further rollout
To prepare the data infrastructure for a broader rollout of a data solution or service, several key aspects must be addressed:
- Scalability
- Data integration
- Security and compliance
- Monitoring and maintenance
- Backup and disaster recovery
- User access and permissions
- Documentation and training
By addressing these areas, organisations can ensure that the data infrastructure is resilient, efficient, and ready to support a full-scale rollout.
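The checklist above can be sketched as an automated pre-rollout readiness check. This is a minimal illustration only: the check names, thresholds, and headroom factor below are assumptions, not part of any D4A specification.

```python
# Illustrative pre-rollout readiness check (all names and thresholds are
# assumptions). Each check returns True/False; rollout proceeds only if
# every check passes.

def check_scalability(current_load: float, capacity: float) -> bool:
    # Assumption: we want at least 3x headroom over the pilot's peak load.
    return capacity >= 3 * current_load

def check_backups(hours_since_last_backup: float) -> bool:
    # Assumption: backups must be no older than 24 hours.
    return hours_since_last_backup <= 24

def readiness_report(checks: dict) -> dict:
    """Run every registered check and report which ones block the rollout."""
    return {name: fn() for name, fn in checks.items()}

checks = {
    "scalability": lambda: check_scalability(current_load=120.0, capacity=500.0),
    "backup_and_recovery": lambda: check_backups(hours_since_last_backup=6.0),
}
report = readiness_report(checks)
ready = all(report.values())
```

In practice each check would query real systems (load balancers, backup logs, IAM); the value of the sketch is that readiness becomes a single, repeatable report rather than a manual sign-off.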
3. Implement monitoring systems
Data monitoring ensures that any data submitted is reliable and meets predefined criteria. It continuously checks the data to verify that it conforms to the rules, even as those rules change over time.
Implementing robust monitoring is essential to keep the data solution running smoothly, catch issues early, and ensure performance aligns with expectations.
What to monitor:
- Data pipeline health: Track success/failure rates, processing times, data latency, and throughput.
- Data quality: Monitor for anomalies like missing values, schema changes, duplicates, or outdated data.
- Infrastructure metrics: CPU usage, memory consumption, network throughput.
- User behaviour and service logs: API usage patterns, user adoption rates, and error logs.
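The data-quality checks listed above can be sketched as a small validator over incoming records. The field names, record shape, and freshness threshold are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone

# Minimal data-quality monitor (sketch). It flags missing values,
# duplicate IDs, and records older than a freshness threshold.
def quality_issues(records, required_fields, max_age=timedelta(days=7)):
    issues = []
    seen_ids = set()
    now = datetime.now(timezone.utc)
    for rec in records:
        for field in required_fields:
            if rec.get(field) in (None, ""):
                issues.append(("missing_value", rec["id"], field))
        if rec["id"] in seen_ids:
            issues.append(("duplicate", rec["id"], "id"))
        seen_ids.add(rec["id"])
        if now - rec["updated_at"] > max_age:
            issues.append(("stale", rec["id"], "updated_at"))
    return issues

records = [
    {"id": 1, "name": "sensor-a", "updated_at": datetime.now(timezone.utc)},
    {"id": 1, "name": "", "updated_at": datetime.now(timezone.utc) - timedelta(days=30)},
]
# The second record triggers a missing value, a duplicate id, and staleness.
```

A validator like this would typically run on a schedule inside the pipeline, feeding its findings into the alerting described below.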
It is advised to set up automated alerts, for example when anomalies occur in KPIs or model outputs, or when performance degrades. A low-cost way to implement these tasks is to opt for an open-source tool. Here are three suggestions for open-source monitoring tools:
- Prometheus https://github.com/prometheus/prometheus
- Apache Superset https://github.com/apache/superset
- Apache Airflow https://github.com/apache/airflow
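As an illustration of such automated alerting, a Prometheus alerting rule on pipeline health might look like the fragment below. The metric names (`pipeline_failures_total`, `pipeline_runs_total`) and thresholds are assumptions about how your pipeline exposes metrics, not standard names:

```yaml
# Example Prometheus alerting rule (metric names and thresholds are assumptions).
groups:
  - name: data-pipeline-alerts
    rules:
      - alert: HighPipelineFailureRate
        # Fire when more than 5% of pipeline runs fail, sustained for 10 minutes.
        expr: rate(pipeline_failures_total[5m]) / rate(pipeline_runs_total[5m]) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Data pipeline failure rate above 5% for 10 minutes"
```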
4. Create participatory data
The first step is to classify your data. Ask yourself: Is the data safe to share with anyone (open data)? Does it require authentication, or should it be available only with limited access (restricted data)? Or is the data confidential or sensitive, so that it is not meant for public use at all? To answer these questions consistently, define a data classification policy that clarifies how each dataset is handled.
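Such a classification policy can be sketched in code. The three levels mirror the categories above; the dataset names are hypothetical examples:

```python
from enum import Enum

class Classification(Enum):
    OPEN = "open"                   # safe to share with anyone
    RESTRICTED = "restricted"       # requires authentication / limited access
    CONFIDENTIAL = "confidential"   # not meant for public use

# A minimal policy: map each dataset to its classification level
# (dataset names are hypothetical examples).
policy = {
    "air_quality_readings": Classification.OPEN,
    "household_energy_use": Classification.RESTRICTED,
    "social_services_cases": Classification.CONFIDENTIAL,
}

def may_publish(dataset: str) -> bool:
    """Only open datasets may be published without access controls.

    Unknown datasets default to CONFIDENTIAL (deny by default)."""
    return policy.get(dataset, Classification.CONFIDENTIAL) is Classification.OPEN
```

The deny-by-default choice matters: a dataset missing from the policy should never be published by accident.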
Next, apply data anonymization and aggregation. Remove or mask PII (e.g. names, addresses, birthdates, GPS-level locations). Where possible, aggregate the data instead, e.g. show neighbourhoods rather than individual households.
Use role-based access control (RBAC). If certain datasets need controlled access, grant role-specific permissions, for example to researchers or specific departments, while public users get limited views or summary statistics. Tools like CKAN, Socrata or Keycloak support such access controls.
Furthermore, secure the data infrastructure: use HTTPS for secure transmission, implement rate limiting to prevent abuse, use logging and monitoring to detect unusual access patterns, and regularly patch and update backend services.
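The masking and aggregation steps can be sketched as below. Note the hedges: replacing a name with a salted hash is pseudonymization rather than full anonymization, the truncated hash and hard-coded salt are illustrative only, and the field names are assumptions:

```python
import hashlib
from collections import Counter

# Sketch: pseudonymize direct identifiers and aggregate to neighbourhood level.
# A hard-coded salt is NOT production practice; it is shown for illustration.
SALT = "replace-with-a-secret-salt"

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a salted, truncated hash."""
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:12]

def aggregate_by_neighbourhood(records):
    """Publish counts per neighbourhood instead of individual households."""
    return Counter(rec["neighbourhood"] for rec in records)

records = [
    {"name": "Alice", "neighbourhood": "Nordstadt"},
    {"name": "Bob", "neighbourhood": "Nordstadt"},
    {"name": "Carol", "neighbourhood": "Altstadt"},
]
# Published records carry a pseudonym instead of the name.
safe_records = [
    {"person_id": pseudonymize(r["name"]), "neighbourhood": r["neighbourhood"]}
    for r in records
]
counts = aggregate_by_neighbourhood(records)  # {'Nordstadt': 2, 'Altstadt': 1}
```

For genuinely sensitive data, aggregation (the `counts` output) is the safer product to publish; pseudonyms can still be re-identified by linkage.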
5. Communicate and document data processing
Education is key to the safe usage of data. Data literacy programs should be run not only for citizens but for specialists as well. These should provide guides and guidelines on how to use data responsibly, showcasing examples along different dimensions: technical, legal, and ethical.
Another good way to surface loopholes or errors in your data system is to let citizens flag problematic or sensitive datasets. Involve communities in data governance and set up review boards.
Resources
1. D4A Best practices
2. D4A Lessons Learned & Pitfalls to be avoided
3. D4A Training Modules
4. "Go deeper" - Insights on a Method
5. Academic papers
Sources