Changelog

All notable changes to FlowKit will be documented in this file.

The format is based on Keep a Changelog.

Added

Changed

  • Mode is now available for use with categorical metrics when running joined spatial aggregates via api. #2021
  • Flowmachine now includes the version number in query ids which means cache entries are per-version. #4489
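To illustrate why including the version number in query ids makes cache entries per-version, here is a minimal sketch. It is not Flowmachine's actual hashing code: the function `sketch_query_id`, the parameter dictionary, and the version strings are all hypothetical, and the only point is that identical parameters hash to different ids once the version is part of the hashed payload.

```python
import hashlib
import json


def sketch_query_id(params: dict, version: str) -> str:
    """Hypothetical id: hash the query parameters together with a version string."""
    # Sorting keys makes the hash independent of dictionary ordering.
    payload = json.dumps({"params": params, "version": version}, sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# Illustrative parameters and versions only.
params = {"query_kind": "daily_location", "date": "2016-01-01"}
id_old = sketch_query_id(params, "1.0.0")
id_new = sketch_query_id(params, "1.1.0")

# Same parameters, different versions: the ids differ, so a cache keyed
# on the id will not reuse an entry written by another version.
print(id_old != id_new)
```

Under this scheme, upgrading the library means previously cached results are not looked up under the new ids.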

Fixed

  • Fixed dangling async tasks not being properly cancelled during server shutdown #6833

Removed

Changed

  • FlowMachine now requires Python >= 3.11

Fixed

  • Direction enum not being recognised #6787

Removed

  • Removed Oracle fdw

Added

  • New flowmachine query CalendarActivity, which retrieves subscribers' patterns of active days
  • New flowmachine queries PerValueAggregate and RedactedPerValueAggregate, which group by the value column of another query and apply an aggregate to subscribers with that grouping.
  • New flowapi queries and flowclient functions for calendar_activity and localised_calendar_activity, which return counts of subscribers per sequence of active days, and per sequence of active days additionally grouped by the subscriber's reference location
  • Added new StringStatistic enum, which enumerates valid statistics for use with postgres string types

Changed

  • HistogramAggregation has moved to flowmachine.features.nonspatial_aggregates
  • Statistic moved to flowmachine.core.statistic_types
  • TotalActivePeriodsSubscriber no longer returns an extra inactive_periods column

Fixed

  • Fixed 500 error when getting api spec from FlowAPI #6686

Added

  • Added support for Parquet foreign tables using parquet_fdw

Changed

  • FlowKit test and synthetic data now uses parquet foreign tables.

Warning

The location of the parquet files in the container is /parquet_data. If you are testing with larger amounts of data, you may wish to add an additional bind mount for this location.

Warning

This change is not backwards compatible with earlier releases of FlowDB, and you will need to repopulate your deployment. We recommend combining this change with the new parquet support.

  • FlowETL is now built on Airflow 2.9.2

Added

  • Added FlowDB table infrastructure.invalid_cell_info for recording cell information that could not be included in infrastructure.cell_info (including cells with null or duplicate cell IDs). #6626
  • The file name of the config file that FlowDB automatically generates at init can now be specified by setting the AUTO_CONFIG_FILE_NAME environment variable. By default this is postgresql.configurator.conf.

Changed

  • FlowDB now triggers an ANALYZE on newly created cache tables to generate statistics rather than waiting for autovacuum
  • FlowDB now produces JSON formatted logs by default. Set FLOWDB_LOG_DEST=csvlog for the old default behaviour.
  • The logging destination of FlowDB can now be configured at init by setting the FLOWDB_LOG_DEST environment variable. Valid options are stderr, csvlog, and jsonlog.
  • The location inside the container of FlowDB's automatically generated config file has changed to /flowdb_autoconf/$AUTO_CONFIG_FILE_NAME.

Changed

  • FlowDB now enables partitionwise aggregation planning by default
  • FlowDB now uses a default fillfactor of 100 for cache table indexes
  • EXCLUDE constraint on FlowDB infrastructure.cell_info table requires unique mno_cell_id across all simultaneously-valid cells per cells_table_version, regardless of to_include. #6626

Fixed

  • Queries that have multiple of the same subquery with different parameters no longer cause duplicate scopes in tokens. #6580
  • FlowETL QA checks count_imeis, count_imsis, max_msisdns_per_imei and max_msisdns_per_imsi now only count non-null IMEIs/IMSIs. #6619

Fixed

  • FlowETL get_qa_checks no longer attempts to create duplicate tasks for QA checks defined in the DAG folder. #6494

Removed

Added

  • Test and synthetic data generators now perform QA checks on the generated data. #6467
  • Added new /qa endpoint to FlowAPI and FlowClient, which supports getting the results of QA checks run by FlowETL #2704
  • Added new available_qa_checks property to flowmachine Connection objects #2704
  • Added new get_qa_checks method to flowmachine Connection objects #2704

Fixed

  • Test QA check IDs are now of the same format as those produced by FlowETL. #6472
  • FlowAuth now runs migrations correctly on startup. #6480

Changed

  • MostFrequentLocation now breaks ties based on the last used location, instead of by arbitrary Postgres sort order. #6268

  • Users no longer have write access to the public schema in FlowDB following a change introduced in PostgreSQL 15

  • FlowDB is now built on PostgreSQL 16, Debian bullseye

    Warning

    You may need to update your docker version to use newer releases of FlowDB. You will also need to create a fresh database and reimport data if you are upgrading from a previous FlowDB release.
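The MostFrequentLocation tie-breaking change listed above can be pictured with a small sketch. This is not the Flowmachine implementation (which runs as SQL in FlowDB); the function name and the event representation are assumptions made for illustration, and "last used" is read here as the location with the most recent event.

```python
from collections import Counter


def most_frequent_location(events):
    """events: iterable of (timestamp, location) pairs for one subscriber."""
    events = list(events)
    counts = Counter(location for _, location in events)
    # Record the most recent timestamp at which each location was used.
    last_used = {}
    for timestamp, location in events:
        if location not in last_used or timestamp > last_used[location]:
            last_used[location] = timestamp
    # Rank by visit count first, then by most recent use to break ties.
    return max(counts, key=lambda loc: (counts[loc], last_used[loc]))


events = [(1, "A"), (2, "B"), (3, "A"), (4, "B")]
print(most_frequent_location(events))  # B: tied with A on count, but used last
```

The deterministic second sort key is what removes the dependence on arbitrary row order.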

Added

  • FlowETL sensor NRowsPresentSensor which checks for a specified minimum number of rows.

Changed

  • ForeignStagingTableOperator will now error if the underlying file cannot be read or the command returns an error. #5763
  • Flowmachine now requires SQLAlchemy >= 2.0.0 #6066

Added

Changed

  • Upgraded Python dependencies

Fixed

Removed

Added

  • Added new FlowDB tables infrastructure.cell_info and infrastructure.cells_table_versions to keep track of changes to the cell info over time (note: the new tables have not yet replaced infrastructure.cells as the source of cell information for FlowKit queries). #6184

Changed

  • Updated flowpyter-task to 1.1.0

Removed

  • Removed AutoFlow. #6394

Added

  • Added flowpyter-task to FlowETL container

Added

  • FlowETL now updates a new table events.location_ids each time a new day of CDR data is ingested, to record the first and last date that each location ID appears in the data. #5376
  • New FlowETL QA check "count_locatable_events", which counts the number of added rows with location ID corresponding to a cell with a known location. #5289
  • flowkit_jwt_generator is now published as a wheel via pypi

Changed

  • docker-compose has been replaced with docker compose in the makefile; this might break builds on machines that haven't updated their docker in a while.

Fixed

  • SQLAlchemy version installed in the FlowMachine docker image is now compatible with the flowmachine library. #6052

Added

  • Quickstart script now supports arbitrary countries via EXAMPLE_COUNTRY env var. #5796
  • FlowDB's maximum locks per transaction setting can now be controlled using the MAX_LOCKS_PER_TRANSACTION env var. #5157

Changed

  • Increased FlowDB's default maximum locks per transaction to 365 * 5 * 4 * (1 + 4). #5157
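For reference, the product quoted above can be evaluated directly. The snippet below only does the arithmetic; the variable name is illustrative, and the changelog does not say what the individual factors stand for.

```python
# Evaluate the default quoted in the changelog entry.
default_max_locks = 365 * 5 * 4 * (1 + 4)
print(default_max_locks)  # 36500
```

Deployments that need a different value can set the MAX_LOCKS_PER_TRANSACTION environment variable instead.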

Fixed

  • Null values in first column of first row of ingested data no longer cause flowetl to skip ingestion #5090

Fixed

  • Fixed migrations being missing from the built FlowAuth docker images #5818

Added

  • Added Alembic support via flask-migrate to Flowauth #5799

Added

  • Added views etl.ingested_state, etl.available_dates and etl.deduped_post_etl_queries in FlowDB, for convenient extraction of relevant information from the ETL tables. #5641
  • Added MajorityLocationWithUnlocatable query class and majority_location function. #5720

Changed

  • Important: tokens issued by previous versions of Flowauth are not compatible with this version. Users will need to regenerate tokens using the updated Flowauth.
  • Move from groups to roles in flowauth; see here for full details. #5613
  • Changed AIRFLOW__CORE__SQL_ALCHEMY_CONN env var to AIRFLOW__DATABASE__SQL_ALCHEMY_CONN
  • RoleScopePicker component redesigned and reimplemented.
  • Docs now recommend creating a separate bind mount for airflow scheduler logs, and include this in the secrets quickstart. #3622
  • jwt tokens now use sub instead of identity for JWT_IDENTITY_CLAIM.
  • A majority_location query with include_unlocatable=True will now include rows for all subscribers in the subscriber_location_weights sub-query, including those for whom all weights are negative (previously subscribers with only negative weights were excluded).

Fixed

  • Fixed a potential deadlock when using a small connection pool and store-ing queries
  • AutoFlow can now be run in a docker container with non-default user. #5574
  • Passing an empty list of events tables when creating a query now raises ValueError: Empty tables list. instead of a MissingDateError. #436
  • Flowmachine now looks at only the most recent state (per CDR type per CDR date) in etl.etl_records to determine available dates. #5641
  • It is now possible to run API queries that include multiple different aggregation units (e.g. joined_spatial_aggregate with displacement metric). #4649
  • Demo roles can now be used in worked_examples. #5735

Removed

  • Removed the include_unlocatable parameter from MajorityLocation class (the majority_location function should be used instead if include_unlocatable is required). #5720

Added

  • Added get_aggregation_unit server action, for getting the aggregation unit associated with a query specification. #5141

Changed

  • nocturnal_events now expects a night_hours parameter with nested sub-fields start_hour and end_hour, instead of two parameters night_start_hour and night_end_hour.
  • Spatial units with a mapping table now only include cells that appear in the mapping table. #5360
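As a sketch of the nocturnal_events parameter change described above, the helper below rewrites an old-style specification into the nested form. The helper name and the surrounding keys in the example dictionary are hypothetical; only the field names night_start_hour, night_end_hour, night_hours, start_hour and end_hour come from the entry itself.

```python
def migrate_nocturnal_events_spec(old_spec: dict) -> dict:
    """Move the two flat night-hour parameters under a nested night_hours field."""
    flat_keys = ("night_start_hour", "night_end_hour")
    # Copy every other parameter through unchanged.
    new_spec = {key: value for key, value in old_spec.items() if key not in flat_keys}
    new_spec["night_hours"] = {
        "start_hour": old_spec["night_start_hour"],
        "end_hour": old_spec["night_end_hour"],
    }
    return new_spec


# Illustrative values only.
old_spec = {"query_kind": "nocturnal_events", "night_start_hour": 20, "night_end_hour": 4}
print(migrate_nocturnal_events_spec(old_spec))
```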

Fixed

  • Invalid sub-query specs nested within a modal_location spec now raise appropriate validation errors, instead of being masked by internal flowmachine server errors. #4816

Added

  • inflows and outflows exposed via API endpoint + added to flowclient #2029, #4866

Changed

  • Action needed: Airflow updated to version 2.3.3; back up flowetl_db before applying the update #4940
  • Tables created under the cache schema in FlowDB will automatically be set to be owned by the flowmachine user. #4714
  • Query.explain will now explain the query even where it is already stored. #1285
  • unstored_dependencies_graph no longer blocks until dependencies are in a determinate state. #4949
  • In and out flows no longer return location columns with to/from suffix.
  • FlowDB now always creates a role named flowmachine.
  • Flowmachine will set the state of a query being stored to cancelled if interrupted while the store is running.
  • Flowmachine now supports sqlalchemy >=1.4 #5140

Fixed

  • Flowmachine now makes the built in flowmachine role owner of cache tables as a post-action when a query is stored. #4714
  • TopupBalance now returns the weighted mode when requested instead of weighted median #1412
  • Fixed in and out flow geojson for multicolumn location types #5132
  • quick_start.sh should no longer raise a misleading error if ss is not installed. #3151

Removed

  • use_file_flux_sensor removed entirely. #2812
  • Model, ModelResult and Louvain have been removed. #5168

Added

  • Most frequent locations is now available via FlowAPI. #3165
  • Total active periods is now available via FlowAPI.
  • Made hour of day slicing available via FlowAPI. #3165
  • Added visited on most days reference location query. #4267
  • Added unique value from query list query. #4486
  • Added mixin for exposing start_date and end_date internally as datetime objects #4497
  • Added CombineFirst and CoalescedLocation queries. #4524
  • Added MajorityLocation query. #4522
  • Added join_type param to Flows class. #4539
  • Added PerSubscriberAggregate query. #4559
  • Added FlowETL QA checks 'count_imeis', 'count_imsis', 'count_locatable_location_ids', 'count_null_imeis', 'count_null_imsis', 'count_null_location_ids', 'max_msisdns_per_imei', 'max_msisdns_per_imsi', 'count_added_rows_outgoing', 'count_null_counterparts', 'count_null_durations', 'count_onnet_msisdns_incoming', 'count_onnet_msisdns_outgoing', 'count_onnet_msisdns', 'max_duration' and 'median_duration'. #4552
  • Added FilteredReferenceLocation query, which returns only rows where a subscriber visited a reference location the required number of times. #4584
  • Added LabelledSpatialAggregate query and redaction, which sub-aggregates by subscriber labels. #4668
  • Added MobilityClassification query, to classify subscribers by mobility type based on a sequence of locations. #4666
  • Exposed CoalescedLocation via FlowAPI, in the specific case where the fallback location is a FilteredReferenceLocation query. #4585
  • Added LabelledFlows query, which returns flows disaggregated by label #4679
  • Exposed LabelledSpatialAggregate and LabelledFlows via FlowAPI, with a MobilityClassification query accepted as the 'labels' parameter. #4669
  • Added RedactedLabelledAggregate and subclasses for redacting labelled data (see ADR 0011). #4671

Changed

  • Harmonised FlowAPI parameter names for start and end dates. They are now all start_date and end_date
  • Further improvements to token display in FlowAuth. #1124
  • Increased the FlowDB quickstart container's timeout to 15 minutes. #782
  • Union and Query.union now accept a variable number of queries to concatenate. #4565

Fixed

  • Autoflow's prefect version is now current. #2544
  • FlowMachine server will now successfully remove cache for queries defined in an interactive flowmachine session during cleanup. #4008

Added

  • FlowETL flux check can be turned off by setting use_flux_sensor=False in create_dag. #3603

Changed

  • The use_file_flux_sensor argument to create_dag is deprecated. To use the table-based flux check in a file-based DAG, set use_flux_sensor='table'.
  • Improvements to token display in FlowAuth. #2812

Added

  • A list of additional paths to FlowETL QA checks can now be supplied to create_dag and get_qa_checks. #3484
  • FlowETL docker container now includes the upgrade check script for Airflow 2.0.0.

Fixed

  • Additional FlowETL QA checks in the dags folder are now picked up. #3484
  • Quickstart will no longer raise a warning about unset Autoflow related environment variables. #2118

Fixed

  • FlowETL QA checks with template sections conditional on the cdr_type argument now render correctly. #3479

Fixed

  • Fixed FlowClient ignoring custom SSL certificates #3344

Fixed

  • Fixed FlowETL not using the randomly generated secret key to secure sessions with the web interface if one is not explicitly provided using AIRFLOW__WEBSERVER__SECRET_KEY. #3244

Fixed

  • Reinstated tabs navigation in the docs #3238
  • Removed $ from code snippets in developer docs #3224
  • FlowETL now randomly generates a secret key to secure sessions with the web interface if one is not explicitly provided using AIRFLOW__WEBSERVER__SECRET_KEY. #3244

Fixed

  • Docs displaying None where they shouldn't

Added

  • Previously run or currently running queries can now be referenced as a subscriber subset via FlowAPI. #1009
  • total_network_objects, location_introversion, and unique_subscriber_counts now also accept subscriber subsets.
  • The validity window for FlowAuth 2factor codes can now be configured using the TWO_FACTOR_VALID_WINDOW env variable. #3203

Changed

  • get_cached_query_objects_ordered_by_score is now a generator. #3116
  • Flowclient now uses httpx instead of requests, for improved async performance and http2 support. #1789

Fixed

  • FlowAPI now correctly logs all query run, poll, and retrieval requests for matching with FlowMachine. #3071
  • Links in the installation docs are now generated correctly. #3152

Changed

  • When creating a file-based DAG using create_dag, you can now use the slower, table based method of checking whether the file is being written. #2857

Added

  • The issuer name can now be set for FlowAuth's 2factor authentication using the FLOWAUTH_TWO_FACTOR_ISSUER environment variable.

  • FlowAPI's internal port can now be set using the FLOWAPI_PORT environment variable, but continues to default to 9090. #2723

    With thanks to JIPS for supporting this work.

  • FlowETL's default port can now be set using the FLOWETL_PORT environment variable, but continues to default to 8080. #2724

    With thanks to JIPS for supporting this work.

Changed

  • Test and synthetic DFS data now uses the same pool of subscribers as CDR data. #2713

    With thanks to JIPS for supporting this work.

Added

Added

  • Queries run through FlowAPI can now be run on only a subset of the available CDR types, by supplying an event_types parameter. #2631
  • FlowETL now includes QA checks for the earliest and latest timestamps in the ingested data. #2627

Fixed

  • The FlowETL 'count_duplicates' QA check now correctly counts the number of duplicate rows. #2651

Added

  • FlowDB's SQL synthetic data generator can now generate events for any country, not just Nepal.

    To generate synthetic data for a different country, supply the COUNTRY environment variable when starting the container, and a valid GADM GID code for the region to simulate a disaster.

Changed

  • FlowMachine's docker container now uses Python 3.8
  • FlowAPI's docker container now uses Python 3.8
  • FlowAuth's docker container now uses Python 3.8
  • AutoFlow's docker container now uses Python 3.8
  • FlowDB's SQL synthetic data generator now uses GADM 3.6 boundaries.
  • FlowAuth and FlowAPI now exchange tokens with compressed claims. #2625

Fixed

  • FlowAuth will no longer fail to start if there are directories with names the same as the SSL certificate secrets.

Changed

  • JoinToLocation is cacheable only if the joined query is also cacheable.

Changed

  • SubscriberLocations are no longer cacheable using FlowMachine.

Fixed

  • Fixed cache shrinking failing when large numbers of tables have been written. #2462
  • Fixed FlowAuth's MySQL support.

Fixed

  • Added missing bridge table arguments to several FlowClient methods.

Added

  • FlowAuth now supports MySQL as a database backend.
  • FlowKit now allows the use of bridge tables to manually specify linkages between cells and geometries.

Fixed

  • FlowAuth no longer errors after a period of inactivity due to timed out database connections. #2382

Added

  • Added new FlowAPI aggregates: unique_visitor_counts, active_at_reference_location_counts, unmoving_counts, unmoving_at_reference_location_counts, trips_od_matrix, and consecutive_trips_od_matrix
  • Added new flows-type query unique_locations to FlowAPI, which produces the paired regional connectivity COVID-19 indicator
  • Added FlowClient function unique_locations_spec, which can be used on either side of a flows query
  • Added FlowClient functions: unique_visitor_counts, active_at_reference_location_counts, unmoving_counts, unmoving_at_reference_location_counts, trips_od_matrix, and consecutive_trips_od_matrix. #2333
  • FlowClient now has an asyncio API. Use connect_async instead of connect to create an ASyncConnection, and await methods on APIQuery objects. #2199

Fixed

  • Fixed FlowMachine server becoming deadlocked under load. #2390

Added

Changed

  • FlowETL is now based on the official apache-airflow docker image. As a result, you should now bind mount your host dags directory to /opt/airflow/dags, and your logs directory to /opt/airflow/logs.

Fixed

  • FlowMachine server will now ignore values for the FLOWMACHINE_SERVER_THREADPOOL_SIZE environment variable which can't be cast to int. #2304
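The behaviour described above, ignoring a FLOWMACHINE_SERVER_THREADPOOL_SIZE value that cannot be cast to int, follows a common pattern that can be sketched as below. This is an illustration rather than the server's code: the function name and the `default` fallback are assumptions.

```python
import os


def threadpool_size_from_env(environ=os.environ, default=None):
    """Return the configured pool size, or `default` if unset or not an integer."""
    raw = environ.get("FLOWMACHINE_SERVER_THREADPOOL_SIZE")
    try:
        return int(raw)
    except (TypeError, ValueError):
        # TypeError: variable not set (raw is None); ValueError: not a valid integer.
        return default


print(threadpool_size_from_env({"FLOWMACHINE_SERVER_THREADPOOL_SIZE": "8"}))
print(threadpool_size_from_env({"FLOWMACHINE_SERVER_THREADPOOL_SIZE": "lots"}))
```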

Added

  • histogram_aggregate added to FlowAPI and FlowClient. Allows the user to obtain a histogram over a per-subscriber metric. #1076

Added

  • FlowClient now displays a progress bar when waiting for a query to become ready, indicating how many parts of that query still need to be run.

Added

  • Added a flowclient Query class to represent a FlowKit query #1980.
  • Added method flowclient.Connection.update_token, to replace the API token for an existing connection.

Changed

  • The flowclient functions for generating query specifications have been renamed to <previous_name>_spec (e.g. flowclient.modal_location is now flowclient.modal_location_spec).
  • flowclient.get_status now returns "not_running" (instead of raising FileNotFoundError) if a query is not running or completed.
  • Flowclient functions location_event_counts_spec, meaningful_locations_aggregate_spec, meaningful_locations_between_label_od_matrix_spec, meaningful_locations_between_dates_od_matrix_spec, flows_spec, unique_subscriber_counts_spec, location_introversion_spec, total_network_objects_spec, aggregate_network_objects_spec, spatial_aggregate_spec and joined_spatial_aggregate_spec have moved to the flowclient.aggregates submodule.

Added

  • FlowAPI can now return results in CSV and GeoJSON format, and FlowClient now supports getting GeoJSON formatted results. #2003

Added

  • FlowAPI now reports the proportion of subqueries cached for a query when polling. #1202
  • FlowClient now logs info messages with the proportion of subqueries cached for a query when polling. #1202

Fixed

  • Fixed the display of deeply nested permissions for flows in FlowAuth. #2110

Fixed

  • Fixed tokens which used the FlowAuth demo data not being accepted by FlowAPI. #2108

Changed

  • Flowmachine now uses an enum for interaction direction parameters (but will still accept them as strings). #357
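A minimal sketch of how an enum can replace string parameters while still accepting strings, as described above. The class below is a stand-in written for this example, not an import from Flowmachine, and the member names and values ("in", "out", "both") are assumptions.

```python
from enum import Enum


class Direction(str, Enum):
    """Hypothetical stand-in for an interaction direction enum."""

    IN = "in"
    OUT = "out"
    BOTH = "both"


def normalise_direction(direction):
    """Accept either an enum member or its string value; reject anything else."""
    # Enum lookup by value returns the member unchanged if one is passed in,
    # and raises ValueError for an unrecognised string.
    return Direction(direction)


print(normalise_direction("in") is Direction.IN)
print(normalise_direction(Direction.OUT) is Direction.OUT)
```

Normalising at the boundary keeps existing string-based callers working while the rest of the code can rely on enum members.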

Removed

  • Removed unused aggregates, results and features schemas from FlowDB. #587

Added

  • Improved UI for API permissions in FlowAuth.

Changed

  • The expected format of user claims has changed from a dictionary to a string-based format. FlowAPI now expects the claims key of any token to contain a list of scope strings.
  • Permissions for joined spatial aggregates can now be set at a finer level in FlowAuth, to allow administrators to grant access only to specific combinations of query types at different aggregation units.
  • FlowAuth no longer requires administrators to manually configure API routes, and will extract them from a FlowAPI server's open api specification.
  • FlowAuth now uses structlog for log messages.
  • FlowAPI no longer mandates a top level aggregation_unit field in query specifications.
  • FlowClient's flows and modal_location functions no longer require an aggregation unit.

Removed

  • The poll type permission has been removed, and is implicitly granted by both read and get_result rights.
  • FlowAuth no longer allows administrators to specify the name of a FlowAPI server, and will instead use the name specified in the server's open api specification.

Fixed

  • Queries which have been removed from Flowmachine's cache, or cancelled, can now be rerun. #1898

Added

  • FlowMachine can now use multiple FlowDB backends, redis instances or execution pools via the flowmachine.connections or flowmachine.core.context.context context managers. #391
  • flowmachine.core.connection.Connection now has a conn_id attribute, which is unique per database host. #391

Changed

  • flowmachine.connect no longer returns a Connection object. The connection should be accessed via flowmachine.core.context.get_db(). #391
  • connection, redis, and threadpool are no longer available as attributes of Query, and should be accessed via flowmachine.core.context.get_db(), flowmachine.core.context.get_redis() and flowmachine.core.context.get_executor(). #391

Removed

  • Removed Query.connection, Query.redis, and Query.threadpool. #391

Added

  • Added a worked example to demonstrate using joined spatial aggregate queries. #1938

Changed

  • Connection.available_dates is now a property and returns results based on the etl.etl_records table. #1873

Fixed

  • Fixed the run action blocking the FlowMachine server in some scenarios. #1256

Removed

  • Removed tables and columns methods from the Connection class in FlowMachine
  • Removed the inspector attribute from the Connection class in FlowMachine

Added

  • FlowMachine now periodically prunes the cache to below the permitted cache size. The frequency of this pruning is configurable using Flowmachine's FLOWMACHINE_CACHE_PRUNING_FREQUENCY environment variable, and queries are excluded from removal by the automatic shrinker based on the cache_protected_period config key within FlowDB. #1307
  • FlowDB now includes Paul Ramsey's OGR foreign data wrapper, for easy loading of GIS data. #1512
  • FlowETL now allows all configuration options to be set using docker secrets. #1515
  • Added a new component, AutoFlow, to automate running Jupyter notebooks when new data is added to FlowDB. #1570
  • FLOWETL_INTEGRATION_TESTS_SAVE_AIRFLOW_LOGS environment variable added to allow copying the Airflow logs in FlowETL integration tests into the /mounts/logs directory for debugging. #1019
  • Added new IterativeMedianFilter query to Flowmachine, which applies an iterative median filter to the output of another query. #1339
  • FlowDB now includes the TDS foreign data wrapper. #1729
  • Added contributing and support instructions. #1791
  • New FlowETL module installable via pip to aid in ETL dag creation.

Changed

  • FlowDB is now built on PostgreSQL 12 #1396 and PostGIS 3.
  • FlowETL is now built on Airflow 1.10.6.
  • FlowETL now defaults to disabling Airflow's REST API, and enables RBAC for the webui. #1516
  • FlowETL now requires that the FLOWETL_AIRFLOW_ADMIN_USERNAME and FLOWETL_AIRFLOW_ADMIN_PASSWORD environment variables be set, which specify the default web ui account. #1516
  • FlowAPI will no longer return a result for rows in spatial aggregate, joined spatial aggregate, flows, total events, meaningful locations aggregate, meaningful locations od, or unique subscriber count where the aggregate would contain fewer than 16 sims. #1026
  • FlowETL now requires that AIRFLOW__CORE__SQL_ALCHEMY_CONN be provided as an environment variable or secret. #1702, #1703
  • FlowAuth now records last used two-factor authentication codes in an expiring cache, which supports either a file-based, or redis backend. #1173
  • AutoFlow now uses Bundler to manage Ruby dependencies.
  • The end_date parameter of flowclient.modal_location_from_dates now refers to the day after the final date included in the range, so is now consistent with other queries that have start/end date parameters. #819
  • Date intervals in AutoFlow date stencils are now interpreted as half-open intervals (i.e. including start date, excluding end date), for consistency with date ranges elsewhere in FlowKit.
  • flowmachine user now has read access to ETL metadata tables in FlowDB
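The half-open date convention mentioned in the entries above (start date included, end date excluded) can be sketched as follows. The helper name is hypothetical and the dates are illustrative.

```python
from datetime import date, timedelta


def dates_in_half_open_interval(start_date: date, end_date: date):
    """List every date from start_date up to, but not including, end_date."""
    n_days = (end_date - start_date).days
    return [start_date + timedelta(days=i) for i in range(n_days)]


days = dates_in_half_open_interval(date(2016, 1, 1), date(2016, 1, 4))
print(days[0], days[-1], len(days))  # 2016-01-01 2016-01-03 3
```

To cover 1 to 3 January inclusive, the end date passed is therefore 4 January, the day after the final date in the range.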

Fixed

  • Quickstart should no longer fail on systems which do not include the netstat tool. #1472
  • Fixed an error that prevented FlowAuth admin users from resetting users' passwords using the FlowAuth UI. #1635
  • The 'Cancel' button on the FlowAuth 'New User' form no longer submits the form. #1636
  • FlowAuth backend now sends a meaningful 400 response when trying to create a user with an empty password. #1637
  • Usernames of deleted users can now be re-used as usernames for new users. #1638
  • RedactedJoinedSpatialAggregate now only redacts rows with too few subscribers. #1747
  • FlowDB now uses a more conservative default setting for tcp_keepalives_idle of 10 minutes, to avoid connections being killed after 15 minutes when running in a docker swarm. #1771
  • Aggregation units and api routes can now be added to servers. #1815
  • Fixed several issues with FlowETL. #1529 #1499 #1498 #1497

Removed

  • Removed pg_cron.

Added

  • Added new DistanceSeries query to Flowmachine, which produces per-subscriber time series of distance from a reference point. #1313
  • Added new ImputedDistanceSeries query to Flowmachine, which produces contiguous per-subscriber time series of distance from a reference point by filling in gaps using the rolling median. #1337

Changed

Fixed

  • The FlowETL config file is now always validated, avoiding runtime errors if a config setting is wrong or missing. #1375
  • FlowETL now only creates DAGs for CDR types which are present in the config, leading to a better user experience in the Airflow UI. #1376
  • The concurrency settings in the FlowETL config are no longer ignored. #1378
  • The FlowETL deployment example has been updated so that it no longer fails due to a missing foreign data wrapper for the available CDR dates. #1379
  • Fixed error when editing a user in FlowAuth who did not have two factor enabled. #1374
  • Fixed not being able to enable a newly added api route on existing servers in FlowAuth. #1373

Removed

  • The default_args section in the FlowETL config file has been removed. #1377

Added

  • FlowAuth now makes version information available at /version and displays it in the web ui. #835
  • FlowETL now comes with a deployment example (in flowetl/deployment_example/). #1126
  • FlowETL now allows supplementary post-ETL queries to be run. #989
  • Random sampling is now exposed via the API, for all non-aggregated query kinds. #1007
  • New aggregate added to FlowMachine - HistogramAggregation, which constructs histograms over the results of other queries. #1075
  • New IntereventInterval query class - returns stats over the gap between events as a time interval.
  • Added submodule flowmachine.core.dependency_graph, which contains functions related to creating or using query dependency graphs (previously these were in utils.py).
  • New config option sql_find_available_dates in FlowETL to provide SQL code to determine the available dates. #1295

Changed

  • FlowDB is now based on PostgreSQL 11.5 and PostGIS 2.5.3
  • When running queries through FlowAPI, the query's dependencies will also be cached by default. This behaviour can be switched off by setting FLOWMACHINE_SERVER_DISABLE_DEPENDENCY_CACHING=true. #1152
  • NewSubscribers now takes a pair of UniqueSubscribers queries instead of the arguments to them
  • Flowmachine's default random sampling method is now random_ids rather than the non-reproducible system_rows. #1263
  • IntereventPeriod now returns stats over the gap between events in fractional time units, instead of time intervals. #1265
  • Attempting to store a query that does not have a standard table name (e.g. EventTableSubset or unseeded random sample) will now raise an UnstorableQueryError instead of ValueError.
  • In the FlowETL deployment example, the external ingestion database is now set up separately from the FlowKit components and connected to FlowDB via a docker overlay network. #1276
  • The md5 attribute of the Query class has been renamed to query_id #1288.
  • DistanceMatrix no longer returns duplicate rows for the lon-lat spatial unit.
  • Previously, Displacement defaulted to returning NaN for subscribers who have a location in the reference location but were not seen in the time period for the displacement query. These subscribers are no longer returned unless the return_subscribers_not_seen argument is set to True.
  • PopulationWeightedOpportunities is now available under flowmachine.features.location, instead of flowmachine.models
  • PopulationWeightedOpportunities no longer supports erroring with incomplete per-location departure rate vectors and will instead omit any locations not included from the results
  • PopulationWeightedOpportunities no longer requires use of the run() method

Fixed

  • Quickstart will no longer fail if it has been run previously with a different FlowDB data size and not explicitly shut down. #900

Removed

  • Flowmachine's subscriber_locations_cluster function has been removed - use HartiganCluster or MeaningfulLocations directly.
  • FlowAPI no longer supports the non-reproducible random sampling method system_rows. #1263

Added

  • FlowAPI's 'joined_spatial_aggregate' endpoint now exposes event counts. #992
  • FlowAPI's 'joined_spatial_aggregate' endpoint now exposes top-up amount. #967
  • FlowAPI's 'joined_spatial_aggregate' endpoint now exposes nocturnal events. #1025
  • FlowAPI's 'joined_spatial_aggregate' endpoint now exposes top-up balance. #968
  • FlowAPI's 'joined_spatial_aggregate' endpoint now exposes displacement. #1010
  • FlowAPI's 'joined_spatial_aggregate' endpoint now exposes pareto interactions. #1012
  • FlowETL now supports ingesting from a postgres table in addition to CSV files. #1027
  • FLOWETL_RUNTIME_CONFIG environment variable added to control which DAG definitions the FlowETL integration tests should use (valid values: "testing", "production").
  • FLOWETL_INTEGRATION_TESTS_DISABLE_PULLING_DOCKER_IMAGES environment variable added to allow running the FlowETL integration tests against locally built docker images during development.
  • FlowAPI's 'joined_spatial_aggregate' endpoint now exposes handset. #1011 and #1029
  • JoinedSpatialAggregate now supports "distr" stats, which output the relative distribution of the passed metrics.
  • Added SubscriberHandsetCharacteristic to FlowMachine
  • FlowAuth now supports optional two-factor authentication #121
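
As a rough illustration of what a "relative distribution" of a categorical metric could look like, the sketch below computes the proportion of each distinct value in a list. This is one reading of the "distr" entry above, not FlowMachine's implementation; the function name relative_distribution and the sample handset values are made up for the example.

```python
from collections import Counter


def relative_distribution(values):
    """Return the proportion of each distinct value in `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    # An empty input yields an empty dict, so no division by zero occurs.
    return {value: count / total for value, count in counts.items()}


# Made-up handset types for subscribers in a single location.
handsets = ["Smart", "Feature", "Smart", "Smart"]
print(relative_distribution(handsets))  # {'Smart': 0.75, 'Feature': 0.25}
```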

Changed

  • The flowdb containers for test_data and synthetic_data were split into two separate containers and quick_start.sh downloads the docker-compose files to a new temporary directory on each run. #843
  • Flowmachine now returns more informative error messages when query parameter validation fails. #1055

Removed

  • TESTING environment variable was removed (previously used by the FlowETL integration tests).
  • Removed SubscriberPhoneType from FlowMachine to avoid redundancy.

Added

  • PRIVATE_JWT_SIGNING_KEY environment variable/secret added to FlowAuth, which should be a PEM encoded RSA private key, optionally base64 encoded if supplied as an environment variable.
  • PUBLIC_JWT_SIGNING_KEY environment variable/secret added to FlowAPI, which should be a PEM encoded RSA public key, optionally base64 encoded if supplied as an environment variable.
  • The dev provisioning Ansible playbook now automatically generates an SSH key pair for the flowkit user. #892
  • Added new classes to represent spatial units in FlowMachine.
  • Added a Geography query class, to get geography data for a spatial unit.
  • FlowAPI's 'joined_spatial_aggregate' endpoint now exposes unique location counts. #949
  • FlowAPI's 'joined_spatial_aggregate' endpoint now exposes subscriber degree. #969
  • Flowdb now contains an auxiliary table to record outcomes of queries that can be run as part of the regular ETL process #988
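
The PRIVATE_JWT_SIGNING_KEY and PUBLIC_JWT_SIGNING_KEY entries above describe a PEM encoded key that may optionally be base64 encoded when supplied as an environment variable. A minimal sketch of reading such a value follows. How FlowAuth and FlowAPI actually detect the encoding is not stated here, so the "-----BEGIN" check and the helper name load_pem_from_env are assumptions.

```python
import base64
import os


def load_pem_from_env(var_name: str) -> bytes:
    """Read a PEM key from an environment variable, decoding base64 if needed."""
    raw = os.environ[var_name].strip().encode()
    if raw.startswith(b"-----BEGIN"):
        return raw  # Already plain PEM text.
    # Assumption: anything else is a base64-wrapped PEM document.
    return base64.b64decode(raw)


# Placeholder text standing in for a real key.
pem = b"-----BEGIN PUBLIC KEY-----\n(placeholder, not a real key)\n-----END PUBLIC KEY-----"
os.environ["PUBLIC_JWT_SIGNING_KEY"] = base64.b64encode(pem).decode()
print(load_pem_from_env("PUBLIC_JWT_SIGNING_KEY") == pem)  # True
```

Base64 wrapping is convenient for environment variables because it avoids embedding the newlines that PEM files contain.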

Changed

  • The quick-start script now only pulls the docker images for the services that are actually started up. #898
  • FlowAuth and FlowAPI are now linked using an RSA keypair, instead of per-server shared secrets. #89
  • Location-related FlowMachine queries now take a spatial_unit parameter instead of level.
  • The quick-start script now uses the environment variable GIT_REVISION to control the version to be deployed.
  • Create token page permission and spatial aggregation checkboxes are now hidden by default. #834
  • The flowetl mounted directories archive, dump, ingest, quarantine were replaced with a single files directory and files are no longer moved. #946
  • FlowDB's postgresql has been updated to 11.4, which addresses several bugs and one major vulnerability.

Fixed

  • When creating a new token in FlowAuth, the expiry now always shows the year, seconds till expiry, and timezone. #260
  • Distances in Displacement are now calculated with longitude and latitude the correct way around. #913
  • The quick-start script now works correctly with branches. #902
  • Fixed location_event_counts failing to work when specifying a subset of event types #1015
  • FlowAPI will now show the correct version in the API spec, and flowmachine and flowclient will show the correct versions in the worked examples. #818

Removed

  • Removed cell_mappings.py, get_columns_for_level and BadLevelError.

  • JWT_SECRET_KEY has been removed in favour of RSA keys.

  • The FlowDB tables infrastructure.countries and infrastructure.operators have been removed. #958

Added

  • Buttons to copy token to clipboard and download token as file added to token list page. #704
  • Two new worked examples: "Cell Towers Per Region" and "Unique Subscriber Counts". #633, #634

Changed

  • The FLOWDB_DEBUG environment variable has been renamed to FLOWDB_ENABLE_POSTGRES_DEBUG_MODE.
  • FlowAuth will now automatically set up the database when started without needing to trigger via the cli.
  • FlowAuth now requires that at least one administrator account is created by providing env vars or secrets for:
    • FLOWAUTH_ADMIN_PASSWORD
    • FLOWAUTH_ADMIN_USERNAME

Fixed

  • The FLOWDB_DEBUG environment variable used to have no effect. This has been fixed. #811
  • Previously, queries could be stuck in an executing state if writing their cache metadata failed. They will now correctly show as having errored. #833
  • Fixed an issue where Table objects could be in an inconsistent cache state after resetting cache #832
  • FlowAuth's docker container can now be used with a Postgres backing database. #825
  • FlowAPI now starts up successfully when following the "Secrets Quickstart" instructions in the docs. #836
  • The command to generate an SSL certificate in the "Secrets Quickstart" section in the docs has been fixed and made more robust #837
  • FlowAuth will no longer try to initialise the database or create demo data multiple times when running under uwsgi with multiple workers #844
  • Fixed an issue where multiple tokens did not line up on the FlowAuth "Tokens" page #849

Removed

  • The FLOWDB_SERVICES environment variable has been removed from the toplevel Makefile, so that now DOCKER_SERVICES is the only environment variable that controls which services are spun up when running make up. #827

Added

  • FlowKit's worked examples are now Dockerized, and available as part of the quick setup script #614
  • Skeleton for Airflow based ETL system added with basic ETL DAG specification and tests.
  • The docs now contain information about required versions of installation prerequisites #703
  • FlowAPI now requires the FLOWAPI_IDENTIFIER environment variable to be set, which contains the name used to identify this FlowAPI server when generating tokens in FlowAuth #727
  • flowmachine.utils.calculate_dependency_graph now includes the Query objects in the query_object field of the graph's nodes dictionary #767
  • Architectural Decision Records (ADR) have been added and are included in the auto-generated docs #780
  • Added FlowDB environment variables SHARED_BUFFERS_SIZE and EFFECTIVE_CACHE_SIZE, to allow manually setting the Postgres configuration parameters shared_buffers and effective_cache_size.
  • The function print_dependency_tree() now takes an optional argument show_stored to display information whether dependent queries have been stored or not #804
  • A new function plot_dependency_graph() has been added, which conveniently plots a dependency graph for visualisation in Jupyter notebooks (this requires IPython and pygraphviz to be installed) #786

Changed

  • Parameter names in flowmachine.connect() have been renamed as follows to be consistent with the associated environment variables #728:
    • db_port -> flowdb_port
    • db_user -> flowdb_user
    • db_pass -> flowdb_password
    • db_host -> flowdb_host
    • db_connection_pool_size -> flowdb_connection_pool_size
    • db_connection_pool_overflow -> flowdb_connection_pool_overflow
  • FlowAPI and FlowAuth now expect an audience key to be present in tokens #727
  • Dependent queries are now only included once in the md5 calculation of a given query (in particular, it changes the query ids compared to previous FlowKit versions).
  • Error is displayed in the add user form of FlowAuth if the username already exists. #690
  • Error is displayed in the add group form of Flowauth if group name already exists. #709
  • FlowAuth's add new server page now shows helper text for bad inputs. #749
  • The class SubscriberSubsetterBase in FlowMachine no longer inherits from Query #740 (this changes the query ids compared to previous FlowKit versions).
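
The flowmachine.connect() parameter renames listed above can be captured as a lookup table when updating older scripts. The sketch below only translates keyword names and does not call flowmachine.connect() itself. The helper name translate_connect_kwargs and the sample values are hypothetical.

```python
# Old -> new parameter names, taken from the list of renames above.
RENAMED_CONNECT_ARGS = {
    "db_port": "flowdb_port",
    "db_user": "flowdb_user",
    "db_pass": "flowdb_password",
    "db_host": "flowdb_host",
    "db_connection_pool_size": "flowdb_connection_pool_size",
    "db_connection_pool_overflow": "flowdb_connection_pool_overflow",
}


def translate_connect_kwargs(old_kwargs: dict) -> dict:
    """Rename old-style keyword arguments; names not in the table pass through unchanged."""
    return {RENAMED_CONNECT_ARGS.get(name, name): value for name, value in old_kwargs.items()}


# Sample values are placeholders.
old = {"db_host": "localhost", "db_port": 9000, "db_pass": "example"}
print(translate_connect_kwargs(old))
# {'flowdb_host': 'localhost', 'flowdb_port': 9000, 'flowdb_password': 'example'}
```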

Fixed

  • FlowClient docs rendered to website now show the options available for arguments that require a string from some set of possibilities #695.
  • The Flowmachine loggers are now initialised only once when flowmachine is imported, with a call to connect() only changing the log level #691
  • The FERNET_KEY environment variable for FlowAuth is now named FLOWAUTH_FERNET_KEY
  • The quick-start script now correctly aborts if one of the FlowKit services doesn't fully start up #745
  • The maps in the worked examples docs pages now appear in any browser
  • Example invocations of generate-jwt are no longer uncopyable due to line wrapping #778
  • API parameter interval for location_event_counts queries is now correctly passed to the underlying FlowMachine query object #807.

Added

  • Added a new module, flowkit-jwt-generator, which generates test JWT tokens for use with FlowAPI #564
  • A new Ansible playbook was added in deployment/provision-dev.yml. In addition to the standard provisioning this installs pyenv, Python 3.7, pipenv and clones the FlowKit repository, which is useful for development purposes.
  • Added a 'quick start' setup script for trying out a complete FlowKit system #688.

Changed

  • FlowAPI's available_dates endpoint now always returns available dates for all event types and does not accept JSON
  • Hints are now displayed in the add user form of FlowAuth if the form is not completed #679
  • Error messages are now displayed when generating a new token in FlowAuth if the token's name is invalid #799
  • The Ansible playbooks in deployment/ now allow configuring the username and password for the FlowKit user account.
  • Default compose file no longer includes build blocks, these have been moved to docker-compose-build.yml.

Fixed

  • FlowDB synthetic data container no longer silently fails to generate data if data generator is not set #654

Fixed

  • Fixed TotalNetworkObjects raising an error when run with a lat-long level #108
  • Radius of gyration no longer incorrectly appears as a top level api query

Added

  • Added new flowclient API entrypoint, aggregate_network_objects, to access equivalent flowmachine query #601
  • FlowAPI now exposes the API spec at the spec/openapi.json endpoint, and an interactive version of the spec at the spec/redoc endpoint
  • Added Makefile target make up-no_build, to spin up all containers without building the images
  • Added resync_redis_with_cache function to cache utils, to allow administrators to align redis with FlowDB #636
  • Added new flowclient API entrypoint, radius_of_gyration, to access (with simplified parameters) equivalent flowmachine query RadiusOfGyration #602

Changed

  • The period argument to TotalNetworkObjects in FlowMachine has been renamed total_by
  • The period argument to total_network_objects in FlowClient has been renamed total_by
  • The by argument to AggregateNetworkObjects in FlowMachine has been renamed to aggregate_by
  • The stop_date argument to the modal_location_from_dates and meaningful_locations_* functions in FlowClient has been renamed end_date #470
  • get_result_by_query_id now accepts a poll_interval argument, which allows polling frequency to be changed
  • The start and stop arguments to EventTableSubset are now mandatory.
  • RadiusOfGyration now returns a value column instead of an rog column
  • TotalNetworkObjects and AggregateNetworkObjects now return a value column, rather than statistic_name
  • All environment variables are now in a single development_environment file in the project root, and development environment setup has been simplified
  • Default FlowDB users for FlowMachine and FlowAPI have changed from "analyst" and "reporter" to "flowmachine" and "flowapi", respectively
  • Docs and integration tests now use top level compose file
  • The following environment variables have been renamed:
    • FLOWMACHINE_SERVER (FlowAPI) -> FLOWMACHINE_HOST
    • FM_PASSWORD (FlowDB), FLOWDB_PASS (FlowMachine) -> FLOWMACHINE_FLOWDB_PASSWORD
    • API_PASSWORD (FlowDB), FLOWDB_PASS (FlowAPI) -> FLOWAPI_FLOWDB_PASSWORD
    • FM_USER (FlowDB), FLOWDB_USER (FlowMachine) -> FLOWMACHINE_FLOWDB_USER
    • API_USER (FlowDB), FLOWDB_USER (FlowAPI) -> FLOWAPI_FLOWDB_USER
    • LOG_LEVEL (FlowMachine) -> FLOWMACHINE_LOG_LEVEL
    • LOG_LEVEL (FlowAPI) -> FLOWAPI_LOG_LEVEL
    • DEBUG (FlowDB) -> FLOWDB_DEBUG
    • DEBUG (FlowMachine) -> FLOWMACHINE_SERVER_DEBUG_MODE
  • The following Docker secrets have been renamed:
    • FLOWAPI_DB_USER -> FLOWAPI_FLOWDB_USER
    • FLOWAPI_DB_PASS -> FLOWAPI_FLOWDB_PASSWORD
    • FLOWMACHINE_DB_USER -> FLOWMACHINE_FLOWDB_USER
    • FLOWMACHINE_DB_PASS -> FLOWMACHINE_FLOWDB_PASSWORD
    • POSTGRES_PASSWORD_FILE -> POSTGRES_PASSWORD
    • REDIS_PASSWORD_FILE -> REDIS_PASSWORD
  • status enum in FlowDB renamed to etl_status
  • reset_cache now requires a redis client argument
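
Several of the environment variable renames above depend on which service the variable was set for (for example, FLOWDB_PASS becomes a different name for FlowMachine than for FlowAPI). A small sketch of a service-aware migration table follows. The function name migrate_env and the lowercase service keys are hypothetical, and the Docker secret renames are left out for brevity.

```python
# (service, old name) -> new name, taken from the list of renamed environment variables above.
ENV_RENAMES = {
    ("flowapi", "FLOWMACHINE_SERVER"): "FLOWMACHINE_HOST",
    ("flowdb", "FM_PASSWORD"): "FLOWMACHINE_FLOWDB_PASSWORD",
    ("flowmachine", "FLOWDB_PASS"): "FLOWMACHINE_FLOWDB_PASSWORD",
    ("flowdb", "API_PASSWORD"): "FLOWAPI_FLOWDB_PASSWORD",
    ("flowapi", "FLOWDB_PASS"): "FLOWAPI_FLOWDB_PASSWORD",
    ("flowdb", "FM_USER"): "FLOWMACHINE_FLOWDB_USER",
    ("flowmachine", "FLOWDB_USER"): "FLOWMACHINE_FLOWDB_USER",
    ("flowdb", "API_USER"): "FLOWAPI_FLOWDB_USER",
    ("flowapi", "FLOWDB_USER"): "FLOWAPI_FLOWDB_USER",
    ("flowmachine", "LOG_LEVEL"): "FLOWMACHINE_LOG_LEVEL",
    ("flowapi", "LOG_LEVEL"): "FLOWAPI_LOG_LEVEL",
    ("flowdb", "DEBUG"): "FLOWDB_DEBUG",
    ("flowmachine", "DEBUG"): "FLOWMACHINE_SERVER_DEBUG_MODE",
}


def migrate_env(service: str, env: dict) -> dict:
    """Rename a service's old variables; unknown names are kept as they are."""
    return {ENV_RENAMES.get((service, name), name): value for name, value in env.items()}


print(migrate_env("flowapi", {"FLOWDB_PASS": "example", "LOG_LEVEL": "debug"}))
# {'FLOWAPI_FLOWDB_PASSWORD': 'example', 'FLOWAPI_LOG_LEVEL': 'debug'}
```

Keying the table on the service as well as the old name keeps the ambiguous variables (FLOWDB_PASS, FLOWDB_USER, LOG_LEVEL, DEBUG) from being mapped to the wrong target.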

Fixed

  • Fixed being unable to add new users or servers when running FlowAuth with a Postgres database #622
  • Resetting the cache using reset_cache will now reset the state of queries in redis as well #650
  • Fixed mode statistic for AggregateNetworkObjects #651

Removed

  • Removed docker-compose-dev.yml, and docker-compose files in docs/, flowdb/tests/ and integration_tests/.
  • Removed Dockerfile-dev Dockerfiles
  • Removed ENV defaults from the FlowMachine Dockerfile
  • Removed POSTGRES_DB environment variable from FlowDB Dockerfile; the database name is now hardcoded as flowdb

Added

  • Added new spatial_aggregate API endpoint and FlowClient function #599
  • Added new flowclient API entrypoint, total_network_objects(), to access (with simplified parameters) equivalent flowmachine query #581
  • Added new flowclient API entrypoint, location_introversion(), to access (with simplified parameters) equivalent flowmachine query #577
  • Added new flowclient API entrypoint, unique_subscriber_counts(), to access (with simplified parameters) equivalent flowmachine query #562
  • New schema aggregates and table aggregates.aggregates have been created for maintaining a record of the process and completion of scheduled aggregates.
  • New joined_spatial_aggregate API endpoint and FlowClient function #600

Changed

  • daily_location and modal_location query types are no longer accepted as top-level queries, and must be wrapped using spatial_aggregate
  • JoinedSpatialAggregate no longer accepts positional arguments
  • JoinedSpatialAggregate now supports "avg", "max", "min", "median", "mode", "stddev" and "variance" stats

Fixed

  • total_network_objects no longer returns results from AggregateNetworkObjects #603

Fixed

  • Fixed #514, which would cause the client to hang after submitting a query that couldn't be created
  • Fixed #575, so that events at midnight are now considered to be happening on the following day
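
One way to read the midnight fix above is as a half-open day interval: a day includes its starting midnight and excludes the next one. The sketch below shows that convention in plain Python. It is not the FlowMachine implementation, and the function name events_on_day is hypothetical.

```python
from datetime import date, datetime, time, timedelta


def events_on_day(timestamps, day: date):
    """Return the timestamps that fall in the half-open interval [day 00:00, next day 00:00)."""
    start = datetime.combine(day, time.min)
    end = start + timedelta(days=1)
    return [ts for ts in timestamps if start <= ts < end]


events = [datetime(2016, 1, 1, 23, 59, 59), datetime(2016, 1, 2, 0, 0, 0)]
print(events_on_day(events, date(2016, 1, 1)))  # only the 23:59:59 event
print(events_on_day(events, date(2016, 1, 2)))  # only the midnight event
```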

Added

  • Added HandsetStats to FlowMachine.
  • Added new ContactReferenceLocationStats query class to FlowMachine.
  • A new zmq message get_available_dates was added to the flowmachine server, along with the /available_dates endpoint in flowapi and the function get_available_dates() in flowclient. These make it possible to determine the dates that are available in the database for the supported event types.

Changed

  • FlowMachine's debugging logs are now from a single logger (flowmachine.debug) and include the submodule in the submodule field instead of using it as the logger name
  • FlowMachine's query run logger now uses the logger name flowmachine.query_run_log
  • FlowAPI's access, run and debug loggers are now named flowapi.access, flowapi.query and flowapi.debug
  • FlowAPI's access and run loggers, and FlowMachine's query run logger now log to stdout instead of stderr
  • Passwords for Redis and FlowDB must now be explicitly provided to flowmachine via argument to connect, env var, or secret
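
The logging changes above describe a single debug logger that carries the originating submodule as a field rather than as the logger name. The sketch below reproduces that pattern with the standard library logging module only. It is not FlowMachine's actual logging setup, and JsonFormatter, make_debug_logger and the example submodule value are hypothetical.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each record as a JSON object that includes a `submodule` field."""

    def format(self, record):
        return json.dumps(
            {
                "logger": record.name,
                "submodule": getattr(record, "submodule", None),
                "event": record.getMessage(),
            }
        )


def make_debug_logger(stream):
    """Return the single debug logger, writing JSON lines to `stream`."""
    logger = logging.getLogger("flowmachine.debug")
    logger.handlers.clear()  # Avoid duplicate handlers on repeated calls.
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    return logger


logger = make_debug_logger(sys.stdout)
# The submodule travels as a field instead of being used as the logger name.
logger.debug("Connected to database", extra={"submodule": "example.submodule"})
```

Keeping one logger name and moving the submodule into a field makes it simpler to route or filter all debug output in one place.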

Removed

  • FlowMachine and FlowAPI no longer support logging to a file

Added

  • The flowmachine python library is now pip installable (pip install flowmachine)
  • The flowmachine server now supports additional actions: get_available_queries, get_query_schemas, ping.
  • Flowdb now contains a new dfs schema and associated tables to process mobile money transactions. In addition, flowdb_testdata contains sample data for DFS transactions.
  • The docs now include three worked examples of CDR analysis using FlowKit.
  • Flowmachine now supports calculating the total amount of various DFS metrics (transaction amount, commission, fee, discount) per aggregation unit during a given date range. These metrics are also exposed in FlowAPI via the query kind dfs_metric_total_amount.

Changed

  • The JSON structure when setting queries running via flowapi or the flowmachine server has changed: query parameters are now "inlined" alongside the query_kind key, rather than nested using a separate params key. Example:
    • previously: {"query_kind": "daily_location", "params": {"date": "2016-01-01", "aggregation_unit": "admin3", "method": "last"}},
    • now: {"query_kind": "daily_location", "date": "2016-01-01", "aggregation_unit": "admin3", "method": "last"}
  • The JSON structure of zmq reply messages from the flowmachine server was changed. Replies now have the form: {"status": "[success|error]", "msg": "...", "payload": {...}}.
  • The flowmachine server action get_sql was renamed to get_sql_for_query_result.
  • The parameter daily_location_method was renamed to method.
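
The change to the query JSON structure described above (parameters inlined alongside query_kind instead of nested under params) can be expressed as a small conversion. The sketch below uses the example from the list; the function name inline_query_params is hypothetical and is not part of flowapi or flowmachine.

```python
def inline_query_params(old_spec: dict) -> dict:
    """Convert the old nested form into the new inlined form."""
    # Keep every top-level key except the nested "params" block...
    new_spec = {key: value for key, value in old_spec.items() if key != "params"}
    # ...then lift the nested parameters up to the top level.
    new_spec.update(old_spec.get("params", {}))
    return new_spec


old = {
    "query_kind": "daily_location",
    "params": {"date": "2016-01-01", "aggregation_unit": "admin3", "method": "last"},
}
print(inline_query_params(old))
# {'query_kind': 'daily_location', 'date': '2016-01-01', 'aggregation_unit': 'admin3', 'method': 'last'}
```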

Added

  • When running integration tests locally, normally pytest will automatically spin up servers for flowmachine and flowapi as part of the test setup. This can now be disabled by setting the environment variable FLOWKIT_INTEGRATION_TESTS_DISABLE_AUTOSTART_SERVERS=TRUE.
  • The integration tests now use the environment variables FLOWAPI_HOST, FLOWAPI_PORT to determine how to connect to the flowapi server.
  • A new data generator has been added to the synthetic data container which supports more data types, simple disaster simulation, and more plausible behaviours as well as increased performance

Changed

  • FlowAPI now reports queued/running status for queries instead of just accepted
  • The following environment variables have been renamed:
    • DB_USER -> FLOWDB_USER
    • DB_HOST -> FLOWDB_HOST
    • DB_PASS -> FLOWDB_PASS
    • DB_PW -> FLOWDB_PASS
    • API_DB_USER -> FLOWAPI_DB_USER
    • API_DB_PASS -> FLOWAPI_DB_PASS
    • FM_DB_USER -> FLOWMACHINE_DB_USER
    • FM_DB_PASS -> FLOWMACHINE_DB_PASS
  • Added numerator_direction to ProportionEventType to allow for proportion of directed events.

Fixed

  • Server no longer loses track of queries under heavy load
  • TopUpBalances no longer always uses entire topups table

Removed

  • The environment variable DB_NAME has been removed.

Changed

  • MDSVolume no longer allows specifying the table, and will always use the mds table.
  • All FlowMachine logs are now in structured json form
  • FlowAPI now uses structured logs for debugging messages

Added

  • Added TopUpAmount, TopUpBalance query classes to FlowMachine.
  • Added PerLocationEventStats, PerContactEventStats to FlowMachine

Removed

  • Removed TotalSubscriberEvents from FlowMachine as it is superseded by EventCount.

Added

  • Dockerised development setup, with support for live reload of flowmachine and flowapi after source code changes.
  • Pre-commit hook for Python formatting with black.
  • Added new IntereventPeriod, ContactReciprocal, ProportionContactReciprocal, ProportionEventReciprocal, ProportionEventType and MDSVolume query classes to FlowMachine.

Changed

  • CustomQuery now requires column names to be specified
  • Query classes are now required to declare the column names they return via the column_names property
  • FlowAPI now reports whether a query is queued or running when polling
  • FlowDB test data and synthetic data images are now available from their own Docker repos (Flowminder/flowdb-testdata, Flowminder/flowdb-synthetic-data)
  • Changed query class name from NocturnalCalls to NocturnalEvents.

Fixed

  • FlowAPI is now an installable python module

Removed

  • Query objects can no longer be recalculated to cache and must be explicitly removed first
  • Arbitrary Flow maths
  • EdgeList query type
  • Removed query class ProportionOutgoing, as it became redundant with the introduction of ProportionEventType.

Added

  • API route for retrieving geography data from FlowDB
  • Aggregated meaningful locations are now available via FlowAPI
  • Origin-destination matrices between meaningful locations are now available via FlowAPI
  • Added new MeaningfulLocations, MeaningfulLocationsAggregate and MeaningfulLocationsOD query classes to FlowMachine

Changed

  • Constructors for HartiganCluster, LabelEventScore, EventScore and CallDays now have different signatures
  • Restructured and extended documentation; added high-level overview and more targeted information for different types of users

Added

  • Support for running FlowDB as an arbitrary user via docker's --user flag

Removed

  • Support for setting the uid and gid of the postgres user when building FlowDB

Fixed

  • Fixed being unable to build if the port used by git:// is not open

Added

  • Added utilities for managing and inspecting the query cache

Changed

  • FlowDB now requires a password to be set for the flowdb superuser

Added

  • Support for password protected redis

Changed

  • Changed the default redis image to bitnami's redis (to enable password protection)

Added

  • Added structured logging of access attempts, query running, and data access
  • Added CHANGELOG.md
  • Added support for Postgres JIT in FlowDB
  • Added total location events metric to FlowAPI and FlowClient
  • Added ETL bookkeeping schema to FlowDB

Changed

  • Added changelog update to PR template
  • Increased default shared memory size for FlowDB containers

Fixed

  • Fixed being unable to delete groups in FlowAuth
  • Fixed make up not working with defaults

Added

  • Added Python 3.6 support for FlowClient