All notable changes to FlowKit will be documented in this file.
The format is based on Keep a Changelog.
- Mode is now available for use with categorical metrics when running joined spatial aggregates via api. #2021
- Flowmachine now includes the version number in query ids which means cache entries are per-version. #4489
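
The effect of including the version number in query ids can be sketched as follows. This is an illustration only: the `query_id` helper, its hashing scheme and the version strings are assumptions made for the example, not Flowmachine's actual implementation.

```python
import hashlib
import json


def query_id(params: dict, version: str) -> str:
    """Hypothetical helper: derive an id from query parameters plus a version string."""
    # Sort keys so that equal parameter sets always serialise identically.
    payload = json.dumps({"params": params, "version": version}, sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


params = {"query_kind": "daily_location", "date": "2016-01-01"}
id_old = query_id(params, "1.0.0")  # hypothetical version strings
id_new = query_id(params, "1.1.0")
# The same parameters give different ids under different versions,
# so a cache keyed on the id never mixes entries across versions.
```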
- Fixed dangling async tasks not being properly cancelled during server shutdown #6833
- FlowMachine now requires python >= 3.11
- Direction enum not being recognised #6787
- Removed Oracle fdw
- New flowmachine query `CalendarActivity`, which retrieves subscribers' pattern of active days
- New flowmachine queries `PerValueAggregate` and `RedactedPerValueAggregate`, which group by the value column of another query and apply an aggregate to subscribers with that grouping.
- New flowapi queries and flowclient functions for `calendar_activity` and `localised_calendar_activity`, which return counts of subscribers per sequence of active days, and per sequence of active days additionally grouped by the subscribers' reference location
- Added new `StringStatistic` enum, which enumerates valid statistics for use with postgres string types
- `HistogramAggregation` has moved to `flowmachine.features.nonspatial_aggregates`
- `Statistic` moved to `flowmachine.core.statistic_types`
- `TotalActivePeriodsSubscriber` no longer returns an extra `inactive_periods` column
- Fixed 500 error when getting api spec from FlowAPI #6686
- Added support for Parquet foreign tables using parquet_fdw
- FlowKit test and synthetic data now uses parquet foreign tables.
  Warning: The location of the parquet files in the container is `/parquet_data`. If you are testing with larger amounts of data you may wish to add an additional bind mount for this location.
- FlowDB now uses declarative partitioning
- FlowETL now attaches new data as partitions, rather than subtables
  Warning: This change is not backwards compatible with earlier releases of FlowDB, and you will need to repopulate your deployment. We recommend combining this change with the new parquet support.
- FlowETL is now built on Airflow 2.9.2
- Added FlowDB table `infrastructure.invalid_cell_info` for recording cell information that could not be included in `infrastructure.cell_info` (including cells with null or duplicate cell IDs). #6626
- The file name of the config file that FlowDB automatically generates at init can now be specified by setting the `AUTO_CONFIG_FILE_NAME` environment variable. By default this is `postgresql.configurator.conf`.
- FlowDB now triggers an ANALYZE on newly created cache tables to generate statistics rather than waiting for autovacuum
- FlowDB now produces JSON formatted logs by default. Set `FLOWDB_LOG_DEST=csvlog` for the old default behaviour.
- The logging destination of FlowDB can now be configured at init by setting the `FLOWDB_LOG_DEST` environment variable; valid options are `stderr`, `csvlog`, and `jsonlog`.
- The location inside the container of FlowDB's automatically generated config file has changed to `/flowdb_autoconf/$AUTO_CONFIG_FILE_NAME`.
- FlowDB now enables partitionwise aggregation planning by default
- FlowDB now uses a default fillfactor of 100 for cache table indexes
- `EXCLUDE` constraint on FlowDB `infrastructure.cell_info` table requires unique `mno_cell_id` across all simultaneously-valid cells per `cells_table_version`, regardless of `to_include`. #6626
- Queries that have multiple of the same subquery with different parameters no longer cause duplicate scopes in tokens. #6580
- FlowETL QA checks `count_imeis`, `count_imsis`, `max_msisdns_per_imei` and `max_msisdns_per_imsi` now only count non-null IMEIs/IMSIs. #6619
- FlowETL `get_qa_checks` no longer attempts to create duplicate tasks for QA checks defined in the DAG folder. #6494
- Removed `flowpyter-task` from the FlowETL Docker image. For a Docker image with `flowpyter-task` included, see [flowminder/flowbot](https://hub.docker.com/r/flowminder/flowbot).
- Test and synthetic data generators now perform QA checks on the generated data. #6467
- Added new `/qa` endpoint to FlowAPI and FlowClient, which supports getting the results of QA checks run by FlowETL #2704
- Added new `available_qa_checks` property to flowmachine `Connection` objects #2704
- Added new `get_qa_checks` method to flowmachine `Connection` objects #2704
- Test QA check IDs are now of the same format as those produced by FlowETL. #6472
- FlowAuth now runs migrations correctly on startup. #6480
- `MostFrequentLocation` now breaks ties based on the last used location, instead of by arbitrary Postgres sort order. #6268
- Users no longer have write access to the public schema in FlowDB following a change introduced in PostgreSQL 15
- FlowDB is now built on PostgreSQL 16, debian bullseye
  You may need to update your docker version to use newer releases of FlowDB. You will also need to create a fresh database and reimport data if you are upgrading from a previous FlowDB release.
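
The `MostFrequentLocation` tie-breaking rule noted above can be illustrated with a small pure-Python sketch. It assumes that "last used" means the location with the most recent event among the tied candidates; the function name and the `(timestamp, location)` input layout are hypothetical and do not reflect how Flowmachine expresses this in SQL.

```python
from collections import Counter


def most_frequent_location(visits):
    """Hypothetical helper. visits: iterable of (timestamp, location) pairs."""
    visits = list(visits)
    counts = Counter(loc for _, loc in visits)
    # Record the latest timestamp at which each location was used.
    last_seen = {}
    for ts, loc in visits:
        last_seen[loc] = max(ts, last_seen.get(loc, ts))
    # Highest count wins; ties are broken by the most recently used location.
    return max(counts, key=lambda loc: (counts[loc], last_seen[loc]))


tied = [(1, "A"), (2, "B"), (3, "A"), (4, "B")]
winner = most_frequent_location(tied)  # "B": same count as "A", but used later
```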
- FlowETL sensor `NRowsPresentSensor` which checks for a specified minimum number of rows.
- `ForeignStagingTableOperator` will now error if the underlying file cannot be read or the command returns an error. #5763
- Flowmachine now requires SQLAlchemy >= 2.0.0 #6066
- Upgraded Python dependencies
- Added new FlowDB tables `infrastructure.cell_info` and `infrastructure.cells_table_versions` to keep track of changes to the cell info over time (note: the new tables have not yet replaced `infrastructure.cells` as the source of cell information for FlowKit queries). #6184
- Updated flowpyter-task to 1.1.0
- Removed AutoFlow. #6394
- Added flowpyter-task to FlowETL container
- FlowETL now updates a new table `events.location_ids` each time a new day of CDR data is ingested, to record the first and last date that each location ID appears in the data. #5376
- New FlowETL QA check "count_locatable_events", which counts the number of added rows with location ID corresponding to a cell with a known location. #5289
- flowkit_jwt_generator is now published as a wheel via pypi
- `docker-compose` has been replaced with `docker compose` in the makefile; this might break builds on machines that haven't updated their docker in a while.
- SQLAlchemy version installed in the FlowMachine docker image is now compatible with the flowmachine library. #6052
- Quickstart script now supports arbitrary countries via `EXAMPLE_COUNTRY` env var. #5796
- FlowDB's maximum locks per transaction setting can now be controlled using the `MAX_LOCKS_PER_TRANSACTION` env var. #5157
- Increased FlowDB's default maximum locks per transaction to `365 * 5 * 4 * (1 + 4)`. #5157
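
For reference, the default above is written as a product; evaluating it gives the concrete number of locks. The variable name below is only for illustration.

```python
# Evaluate the expression given in the changelog entry.
default_max_locks = 365 * 5 * 4 * (1 + 4)
print(default_max_locks)  # prints 36500
```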
- Null values in first column of first row of ingested data no longer cause flowetl to skip ingestion #5090
- Fixed migrations being missing from the built FlowAuth docker images #5818
- Added Alembic support via `flask-migrate` to Flowauth #5799
- Added views `etl.ingested_state`, `etl.available_dates` and `etl.deduped_post_etl_queries` in FlowDB, for convenient extraction of relevant information from the ETL tables. #5641
- Added `MajorityLocationWithUnlocatable` query class and `majority_location` function. #5720
- Important; tokens issued by previous versions of Flowauth are not compatible with this version. Users will need to regenerate tokens using the updated Flowauth.
- Move from `groups` to `roles` in flowauth; see here for full details. #5613
- Changed `AIRFLOW__CORE__SQL_ALCHEMY_CONN` env var to `AIRFLOW__DATABASE__SQL_ALCHEMY_CONN`
- RoleScopePicker component redesigned and reimplemented.
- Docs now recommend creating a separate bind mount for airflow scheduler logs, and include this in the secrets quickstart. #3622
- `jwt` tokens now use `sub` instead of `identity` for `JWT_IDENTITY_CLAIM`.
- A `majority_location` query with `include_unlocatable=True` will now include rows for all subscribers in the `subscriber_location_weights` sub-query, including those for whom all weights are negative (previously subscribers with only negative weights were excluded).
- Fixed a potential deadlock when using a small connection pool and `store`-ing queries
- AutoFlow can now be run in a docker container with non-default user. #5574
- Passing an empty list of events tables when creating a query now raises `ValueError: Empty tables list.` instead of a `MissingDateError`. #436
- Flowmachine now looks at only the most recent state (per CDR type per CDR date) in `etl.etl_records` to determine available dates. #5641
- It is now possible to run API queries that include multiple different aggregation units (e.g. `joined_spatial_aggregate` with `displacement` metric). #4649
- Demo roles can now be used in `worked_examples`. #5735
- Removed the `include_unlocatable` parameter from `MajorityLocation` class (the `majority_location` function should be used instead if `include_unlocatable` is required). #5720
- Added `get_aggregation_unit` server action, for getting the aggregation unit associated with a query specification. #5141
- `nocturnal_events` now expects a `night_hours` parameter with nested sub-fields `start_hour` and `end_hour`, instead of two parameters `night_start_hour` and `night_end_hour`.
- Spatial units with a mapping table now only include cells that appear in the mapping table. #5360
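
For callers still holding `nocturnal_events` parameters in the old flat form, the change to a nested `night_hours` field might be handled with a small conversion step. The sketch below is hypothetical: only the parameter names come from the entry above, while the helper name and the plain-dict representation of a query spec are assumptions.

```python
def migrate_nocturnal_events_spec(spec: dict) -> dict:
    """Hypothetical helper: nest the flat night hour parameters under night_hours."""
    old_keys = {"night_start_hour", "night_end_hour"}
    if not old_keys <= spec.keys():
        # Nothing to migrate; return an unchanged copy.
        return dict(spec)
    new_spec = {k: v for k, v in spec.items() if k not in old_keys}
    new_spec["night_hours"] = {
        "start_hour": spec["night_start_hour"],
        "end_hour": spec["night_end_hour"],
    }
    return new_spec


old_spec = {"query_kind": "nocturnal_events", "night_start_hour": 20, "night_end_hour": 4}
new_spec = migrate_nocturnal_events_spec(old_spec)
```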
- Invalid sub-query specs nested within a `modal_location` spec now raise appropriate validation errors, instead of being masked by internal flowmachine server errors. #4816
- Action Needed Airflow updated to version 2.3.3; backup flowetl_db before applying update #4940
- Tables created under the cache schema in FlowDB will automatically be set to be owned by the `flowmachine` user. #4714
- `Query.explain` will now explain the query even where it is already stored. #1285
- `unstored_dependencies_graph` no longer blocks until dependencies are in a determinate state. #4949
- In and out flows no longer return location columns with to/from suffix.
- FlowDB now always creates a role named `flowmachine`.
- Flowmachine will set the state of a query being stored to cancelled if interrupted while the store is running.
- Flowmachine now supports sqlalchemy >=1.4 #5140
- Flowmachine now makes the built in `flowmachine` role owner of cache tables as a post-action when a query is `store`d. #4714
- TopupBalance now returns the weighted mode when requested instead of weighted median #1412
- Fixed in and out flow geojson for multicolumn location types #5132
- `quick_start.sh` should no longer raise a misleading error if `ss` is not installed. #3151
- `use_file_flux_sensor` removed entirely. #2812
- `Model`, `ModelResult` and `Louvain` have been removed. #5168
- Most frequent locations is now available via FlowAPI. #3165
- Total active periods is now available via FlowAPI.
- Made hour of day slicing available via FlowAPI. #3165
- Added visited on most days reference location query. #4267
- Added unique value from query list query. #4486
- Added mixin for exposing start_date and end_date internally as datetime objects #4497
- Added `CombineFirst` and `CoalescedLocation` queries. #4524
- Added `MajorityLocation` query. #4522
- Added `join_type` param to `Flows` class. #4539
- Added `PerSubscriberAggregate`
query. #4559 - Added FlowETL QA checks 'count_imeis', 'count_imsis', 'count_locatable_location_ids', 'count_null_imeis', 'count_null_imsis', 'count_null_location_ids', 'max_msisdns_per_imei', 'max_msisdns_per_imsi', 'count_added_rows_outgoing', 'count_null_counterparts', 'count_null_durations', 'count_onnet_msisdns_incoming', 'count_onnet_msisdns_outgoing', 'count_onnet_msisdns', 'max_duration' and 'median_duration'. #4552
- Added `FilteredReferenceLocation` query, which returns only rows where a subscriber visited a reference location the required number of times. #4584
- Added `LabelledSpatialAggregate` query and redaction, which sub-aggregates by subscriber labels. #4668
- Added `MobilityClassification` query, to classify subscribers by mobility type based on a sequence of locations. #4666
- Exposed `CoalescedLocation` via FlowAPI, in the specific case where the fallback location is a `FilteredReferenceLocation` query. #4585
- Added `LabelledFlows` query, which returns flows disaggregated by label #4679
- Exposed `LabelledSpatialAggregate` and `LabelledFlows` via FlowAPI, with a `MobilityClassification` query accepted as the 'labels' parameter. #4669
- Added `RedactedLabelledAggregate` and subclasses for redacting labelled data (see ADR 0011). #4671
- Harmonised FlowAPI parameter names for start and end dates. They are now all `start_date` and `end_date`
- Further improvements to token display in FlowAuth. #1124
- Increased the FlowDB quickstart container's timeout to 15 minutes. #782
- `Union` and `Query.union` now accept a variable number of queries to concatenate. #4565
- Autoflow's prefect version is now current. #2544
- FlowMachine server will now successfully remove cache for queries defined in an interactive flowmachine session during cleanup. #4008
- FlowETL flux check can be turned off by setting `use_flux_sensor=False` in `create_dag`. #3603
- The `use_file_flux_sensor` argument to `create_dag` is deprecated. To use the table-based flux check in a file-based DAG, set `use_flux_sensor='table'`.
- Improvements to token display in FlowAuth. #2812
- A list of additional paths to FlowETL QA checks can now be supplied to `create_dag` and `get_qa_checks`. #3484
- FlowETL docker container now includes the upgrade check script for Airflow 2.0.0.
- Additional FlowETL QA checks in the dags folder are now picked up. #3484
- Quickstart will no longer raise a warning about unset Autoflow related environment variables. #2118
- FlowETL QA checks with template sections conditional on the `cdr_type` argument now render correctly. #3479
- Fixed FlowClient ignoring custom SSL certificates #3344
- Fixed FlowETL not using the randomly generated secret key to secure sessions with the web interface if one is not explicitly provided using `AIRFLOW__WEBSERVER__SECRET_KEY`. #3244
- Reinstated tabs navigation in the docs #3238
- Removed `$` from code snippets in developer docs #3224
- FlowETL now randomly generates a secret key to secure sessions with the web interface if one is not explicitly provided using `AIRFLOW__WEBSERVER__SECRET_KEY`. #3244
- Docs displaying None where they shouldn't
- Previously run, or currently running queries can now be referenced as a subscriber subset via FlowAPI. #1009
- total_network_objects, location_introversion, and unique_subscriber_counts now also accept subscriber subsets.
- The validity window for FlowAuth 2factor codes can now be configured using the `TWO_FACTOR_VALID_WINDOW` env variable. #3203
- `get_cached_query_objects_ordered_by_score` is now a generator. #3116
- Flowclient now uses httpx instead of requests, for improved async performance and http2 support. #1789
- FlowAPI now correctly logs all query run, poll, and retrieval requests for matching with FlowMachine. #3071
- Links in the installation docs are now generated correctly. #3152
- When creating a file-based DAG using `create_dag`, you can now use the slower, table based method of checking whether the file is being written. #2857
- The issuer name can now be set for FlowAuth's 2factor authentication using the `FLOWAUTH_TWO_FACTOR_ISSUER` environment variable.
- FlowAPI's internal port can now be set using the `FLOWAPI_PORT` environment variable, but continues to default to `9090`. #2723
  With thanks to JIPS for supporting this work.
- FlowETL's default port can now be set using the `FLOWETL_PORT` environment variable, but continues to default to `8080`. #2724
  With thanks to JIPS for supporting this work.
- Test and synthetic DFS data now uses the same pool of subscribers as CDR data. #2713
  With thanks to JIPS for supporting this work.
- FlowDB's SQL synthetic data generator now uses the WorldPop project's 2016 population raster for the country chosen as the basis for generating data.
- Queries run through FlowAPI can now be run on only a subset of the available CDR types, by supplying an `event_types` parameter. #2631
- FlowETL now includes QA checks for the earliest and latest timestamps in the ingested data. #2627
- The FlowETL 'count_duplicates' QA check now correctly counts the number of duplicate rows. #2651
- FlowDB's SQL synthetic data generator can now generate events for any country, not just Nepal.
  To generate synthetic data for a different country, supply the `COUNTRY` environment variable when starting the container, and a valid GADM GID code for the region to simulate a disaster.
- FlowMachine's docker container now uses Python 3.8
- FlowAPI's docker container now uses Python 3.8
- FlowAuth's docker container now uses Python 3.8
- AutoFlow's docker container now uses Python 3.8
- FlowDB's SQL synthetic data generator now uses GADM 3.6 boundaries.
- FlowAuth and FlowAPI now exchange tokens with compressed claims. #2625
- FlowAuth will no longer fail to start if there are directories with names the same as the SSL certificate secrets.
- `JoinToLocation` is cacheable only if the joined query is also cacheable.
- `SubscriberLocations` are no longer cacheable using FlowMachine.
- Fixed cache shrinking failing when large numbers of tables have been written. #2462
- Fixed FlowAuth's MySQL support.
- Added missing bridge table arguments to several FlowClient methods.
- FlowAuth now supports MySQL as a database backend.
- FlowKit now allows the use of bridge tables to manually specify linkages between cells and geometries.
- FlowAuth no longer errors after a period of inactivity due to timed out database connections. #2382
- Added new FlowAPI aggregates: `unique_visitor_counts`, `active_at_reference_location_counts`, `unmoving_counts`, `unmoving_at_reference_location_counts`, `trips_od_matrix`, and `consecutive_trips_od_matrix`
- Added new Flows type query to FlowAPI `unique_locations`, which produces the paired regional connectivity COVID-19 indicator
- Added FlowClient function `unique_locations_spec`, which can be used on either side of a `flows` query
- Added FlowClient functions: `unique_visitor_counts`, `active_at_reference_location_counts`, `unmoving_counts`, `unmoving_at_reference_location_counts`, `trips_od_matrix`, and `consecutive_trips_od_matrix`. #2333
- FlowClient now has an asyncio API. Use `connect_async` instead of `connect` to create an `ASyncConnection`, and `await` methods on `APIQuery` objects. #2199
- Fixed FlowMachine server becoming deadlocked under load. #2390
- Added subscriber metrics: `ActiveAtReferenceLocation`, `Unmoving`, `UnmovingAtReferenceLocation` and `UniqueLocations`
- Added location metrics and their `Redacted*` equivalents:
  - `UniqueVisitorCounts`
  - `UnmovingAtReferenceLocationCounts` (COVID-19 equivalent)
  - `ActiveAtReferenceLocationCounts`
  - `UnmovingCount` (COVID-19 equivalent)
  - `TripsODMatrix` (COVID-19 equivalent)
  - `ConsecutiveTripsODMatrix` (COVID-19 equivalent)

See https://covid19.flowminder.org for more detail on how Flowminder is supporting the global COVID-19 response.
- FlowETL is now based on the official apache-airflow docker image. As a result, you should now bind mount your host dags directory to `/opt/airflow/dags`, and your logs directory to `/opt/airflow/logs`.
- FlowMachine server will now ignore values for the `FLOWMACHINE_SERVER_THREADPOOL_SIZE` environment variable which can't be cast to `int`. #2304
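
The behaviour described for `FLOWMACHINE_SERVER_THREADPOOL_SIZE` can be sketched as a tolerant parse with a fallback. This is a minimal illustration, not the server's code: the `threadpool_size` function and the default of 5 are assumptions, since the changelog does not state what value is used when the variable is ignored.

```python
import os

DEFAULT_THREADPOOL_SIZE = 5  # hypothetical default, not taken from the changelog


def threadpool_size(environ=os.environ, default=DEFAULT_THREADPOOL_SIZE):
    """Hypothetical helper: read the pool size, ignoring values that are not integers."""
    raw = environ.get("FLOWMACHINE_SERVER_THREADPOOL_SIZE")
    try:
        return int(raw)
    except (TypeError, ValueError):
        # Unset (None) or not castable to int: fall back to the default.
        return default


size_ok = threadpool_size({"FLOWMACHINE_SERVER_THREADPOOL_SIZE": "8"})
size_bad = threadpool_size({"FLOWMACHINE_SERVER_THREADPOOL_SIZE": "lots"})
size_unset = threadpool_size({})
```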
- `histogram_aggregate` added to FlowAPI and FlowClient. Allows the user to obtain a histogram over a per-subscriber metric. #1076
- FlowClient now displays a progress bar when waiting for a query to be ready, indicating how many parts of that query still need to be run.
- Added a flowclient `Query` class to represent a FlowKit query #1980.
- Added method `flowclient.Connection.update_token`, to replace the API token for an existing connection.
- The names of flowclient functions for generating query specifications have been renamed to `<previous_name>_spec` (e.g. `flowclient.modal_location` is now `flowclient.modal_location_spec`).
- `flowclient.get_status` now returns `"not_running"` (instead of raising `FileNotFoundError`) if a query is not running or completed.
- Flowclient functions `location_event_counts_spec`, `meaningful_locations_aggregate_spec`, `meaningful_locations_between_label_od_matrix_spec`, `meaningful_locations_between_dates_od_matrix_spec`, `flows_spec`, `unique_subscriber_counts_spec`, `location_introversion_spec`, `total_network_objects_spec`, `aggregate_network_objects_spec`, `spatial_aggregate_spec` and `joined_spatial_aggregate_spec` have moved to the `flowclient.aggregates` submodule.
- FlowAPI can now return results in CSV and GeoJSON format, FlowClient now supports getting GeoJSON formatted results. #2003
- FlowAPI now reports the proportion of subqueries cached for a query when polling. #1202
- FlowClient now logs info messages with the proportion of subqueries cached for a query when polling. #1202
- Fixed the display of deeply nested permissions for flows in FlowAuth. #2110
- Fixed tokens which used the FlowAuth demo data not being accepted by FlowAPI. #2108
- Flowmachine now uses an enum for interaction direction parameters (but will still accept them as strings). #357
- Removed unused aggregates, results and features schemas from FlowDB. #587
- Improved UI for API permissions in FlowAuth.
- The format of user claims expected has changed from a dictionary, to string based format. FlowAPI now expects the claims key of any token to contain a list of scope strings.
- Permissions for joined spatial aggregates can now be set at a finer level in FlowAuth, to allow administrators to grant access only to specific combinations of query types at different aggregation units.
- FlowAuth no longer requires administrators to manually configure API routes, and will extract them from a FlowAPI server's open api specification.
- FlowAuth now uses structlog for log messages.
- FlowAPI no longer mandates a top level `aggregation_unit` field in query specifications.
- FlowClient's `flows` and `modal_location` functions no longer require an aggregation unit.
- The poll type permission has been removed, and is implicitly granted by both read and get_result rights.
- FlowAuth no longer allows administrators to specify the name of a FlowAPI server, and will instead use the name specified in the server's open api specification.
- Queries which have been removed from Flowmachine's cache, or cancelled, can now be rerun. #1898
- FlowMachine can now use multiple FlowDB backends, redis instances or execution pools via the `flowmachine.connections` or `flowmachine.core.context.context` context managers. #391
- `flowmachine.core.connection.Connection` now has a `conn_id` attribute, which is unique per database host. #391
- `flowmachine.connect` no longer returns a `Connection` object. The connection should be accessed via `flowmachine.core.context.get_db()`. #391
- `connection`, `redis`, and `threadpool` are no longer available as attributes of `Query`, and should be accessed via `flowmachine.core.context.get_db()`, `flowmachine.core.context.get_redis()` and `flowmachine.core.context.get_executor()`. #391
- Removed `Query.connection`, `Query.redis`, and `Query.threadpool`. #391
- Added a worked example to demonstrate using joined spatial aggregate queries. #1938
- `Connection.available_dates` is now a property and returns results based on the `etl.etl_records` table. #1873
- Fixed the run action blocking the FlowMachine server in some scenarios. #1256
- Removed `tables` and `columns` methods from the `Connection` class in FlowMachine
- Removed the `inspector` attribute from the `Connection` class in FlowMachine
- FlowMachine now periodically prunes the cache to below the permitted cache size. #1307
  The frequency of this pruning is configurable using the `FLOWMACHINE_CACHE_PRUNING_FREQUENCY` environment variable to Flowmachine, and queries are excluded from being removed by the automatic shrinker based on the `cache_protected_period` config key within FlowDB.
- FlowDB now includes Paul Ramsey's OGR foreign data wrapper, for easy loading of GIS data. #1512
- FlowETL now allows all configuration options to be set using docker secrets. #1515
- Added a new component, AutoFlow, to automate running Jupyter notebooks when new data is added to FlowDB. #1570
- `FLOWETL_INTEGRATION_TESTS_SAVE_AIRFLOW_LOGS` environment variable added to allow copying the Airflow logs in FlowETL integration tests into the /mounts/logs directory for debugging. #1019
- Added new `IterativeMedianFilter` query to Flowmachine, which applies an iterative median filter to the output of another query. #1339
- FlowDB now includes the TDS foreign data wrapper. #1729
- Added contributing and support instructions. #1791
- New FlowETL module installable via pip to aid in ETL dag creation.
- FlowDB is now built on PostgreSQL 12 #1396 and PostGIS 3.
- FlowETL is now built on Airflow 1.10.6.
- FlowETL now defaults to disabling Airflow's REST API, and enables RBAC for the webui. #1516
- FlowETL now requires that the `FLOWETL_AIRFLOW_ADMIN_USERNAME` and `FLOWETL_AIRFLOW_ADMIN_PASSWORD` environment variables be set, which specify the default web ui account. #1516
- FlowAPI will no longer return a result for rows in spatial aggregate, joined spatial aggregate, flows, total events, meaningful locations aggregate, meaningful locations od, or unique subscriber count where the aggregate would contain fewer than 16 sims. #1026
- FlowETL now requires that `AIRFLOW__CORE__SQL_ALCHEMY_CONN` be provided as an environment variable or secret. #1702, #1703
- FlowAuth now records last used two-factor authentication codes in an expiring cache, which supports either a file-based or redis backend. #1173
- AutoFlow now uses Bundler to manage Ruby dependencies.
- The `end_date` parameter of `flowclient.modal_location_from_dates` now refers to the day after the final date included in the range, so is now consistent with other queries that have start/end date parameters. #819
- Date intervals in AutoFlow date stencils are now interpreted as half-open intervals (i.e. including start date, excluding end date), for consistency with date ranges elsewhere in FlowKit.
- `flowmachine` user now has read access to ETL metadata tables in FlowDB
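
The half-open date convention mentioned in the entries above (start date included, end date excluded) can be illustrated with plain Python dates. The `dates_in_range` helper is hypothetical and only demonstrates the convention; it is not part of FlowKit.

```python
from datetime import date, timedelta


def dates_in_range(start_date: date, end_date: date) -> list:
    """Hypothetical helper: list the days in the half-open interval [start_date, end_date)."""
    n_days = (end_date - start_date).days
    return [start_date + timedelta(days=i) for i in range(n_days)]


# An end_date of 2016-01-04 covers the 1st, 2nd and 3rd, but not the 4th.
days = dates_in_range(date(2016, 1, 1), date(2016, 1, 4))
```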
- Quickstart should no longer fail on systems which do not include the `netstat` tool. #1472
- Fixed an error that prevented FlowAuth admin users from resetting users' passwords using the FlowAuth UI. #1635
- The 'Cancel' button on the FlowAuth 'New User' form no longer submits the form. #1636
- FlowAuth backend now sends a meaningful 400 response when trying to create a user with an empty password. #1637
- Usernames of deleted users can now be re-used as usernames for new users. #1638
- RedactedJoinedSpatialAggregate now only redacts rows with too few subscribers. #1747
- FlowDB now uses a more conservative default setting for `tcp_keepalives_idle` of 10 minutes, to avoid connections being killed after 15 minutes when running in a docker swarm. #1771
- Aggregation units and api routes can now be added to servers. #1815
- Fixed several issues with FlowETL. #1529 #1499 #1498 #1497
- Removed pg_cron.
- Added new `DistanceSeries` query to Flowmachine, which produces per-subscriber time series of distance from a reference point. #1313
- Added new `ImputedDistanceSeries` query to Flowmachine, which produces contiguous per-subscriber time series of distance from a reference point by filling in gaps using the rolling median. #1337
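
The idea of filling gaps with a rolling median can be sketched in a few lines of Python. This is a loose illustration only: the trailing window, the window size of 3 and the use of `None` for missing values are all assumptions, and `ImputedDistanceSeries` itself may define its window differently.

```python
from statistics import median


def fill_gaps_with_rolling_median(series, window=3):
    """Hypothetical helper: replace None with the median of up to `window` preceding values."""
    filled = []
    for value in series:
        if value is None:
            # Use the most recent known (or already filled) values as the window.
            recent = [v for v in filled[-window:] if v is not None]
            value = median(recent) if recent else None
        filled.append(value)
    return filled


distances = [1.0, 3.0, None, 5.0]
imputed = fill_gaps_with_rolling_median(distances)  # gap filled with median(1.0, 3.0)
```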
- The FlowETL config file is now always validated, avoiding runtime errors if a config setting is wrong or missing. #1375
- FlowETL now only creates DAGs for CDR types which are present in the config, leading to a better user experience in the Airflow UI. #1376
- The `concurrency` settings in the FlowETL config are no longer ignored. #1378
- The FlowETL deployment example has been updated so that it no longer fails due to a missing foreign data wrapper for the available CDR dates. #1379
- Fixed error when editing a user in FlowAuth who did not have two factor enabled. #1374
- Fixed not being able to enable a newly added api route on existing servers in FlowAuth. #1373
- The `default_args` section in the FlowETL config file has been removed. #1377
- FlowAuth now makes version information available at `/version` and displays it in the web ui. #835
- FlowETL now comes with a deployment example (in `flowetl/deployment_example/`). #1126
- FlowETL now allows running supplementary post-ETL queries. #989
- Random sampling is now exposed via the API, for all non-aggregated query kinds. #1007
- New aggregate added to FlowMachine - `HistogramAggregation`, which constructs histograms over the results of other queries. #1075
- New `IntereventInterval` query class - returns stats over the gap between events as a time interval.
- Added submodule `flowmachine.core.dependency_graph`, which contains functions related to creating or using query dependency graphs (previously these were in `utils.py`).
- New config option `sql_find_available_dates` in FlowETL to provide SQL code to determine the available dates. #1295
- FlowDB is now based on PostgreSQL 11.5 and PostGIS 2.5.3
- When running queries through FlowAPI, the query's dependencies will also be cached by default. This behaviour can be switched off by setting `FLOWMACHINE_SERVER_DISABLE_DEPENDENCY_CACHING=true`. #1152
- `NewSubscribers` now takes a pair of `UniqueSubscribers` queries instead of the arguments to them
- Flowmachine's default random sampling method is now `random_ids` rather than the non-reproducible `system_rows`. #1263
- `IntereventPeriod` now returns stats over the gap between events in fractional time units, instead of time intervals. #1265
- Attempting to store a query that does not have a standard table name (e.g. `EventTableSubset` or unseeded random sample) will now raise an `UnstorableQueryError` instead of `ValueError`.
- In the FlowETL deployment example, the external ingestion database is now set up separately from the FlowKit components and connected to FlowDB via a docker overlay network. #1276
- The `md5` attribute of the `Query` class has been renamed to `query_id` #1288.
- `DistanceMatrix` no longer returns duplicate rows for the lon-lat spatial unit.
- Previously, `Displacement` defaulted to returning `NaN` for subscribers who have a location in the reference location but were not seen in the time period for the displacement query. These subscribers are no longer returned unless the `return_subscribers_not_seen` argument is set to `True`.
- `PopulationWeightedOpportunities` is now available under `flowmachine.features.location`, instead of `flowmachine.models`
- `PopulationWeightedOpportunities` no longer supports erroring with incomplete per-location departure rate vectors and will instead omit any locations not included from the results
- `PopulationWeightedOpportunities` no longer requires use of the `run()` method
- Quickstart will no longer fail if it has been run previously with a different FlowDB data size and not explicitly shut down. #900
- Flowmachine's `subscriber_locations_cluster` function has been removed - use `HartiganCluster` or `MeaningfulLocations` directly.
- FlowAPI no longer supports the non-reproducible random sampling method `system_rows`. #1263
- FlowAPI's 'joined_spatial_aggregate' endpoint now exposes event counts. #992
- FlowAPI's 'joined_spatial_aggregate' endpoint now exposes top-up amount. #967
- FlowAPI's 'joined_spatial_aggregate' endpoint now exposes nocturnal events. #1025
- FlowAPI's 'joined_spatial_aggregate' endpoint now exposes top-up balance. #968
- FlowAPI's 'joined_spatial_aggregate' endpoint now exposes displacement. #1010
- FlowAPI's 'joined_spatial_aggregate' endpoint now exposes pareto interactions. #1012
- FlowETL now supports ingesting from a postgres table in addition to CSV files. #1027
- `FLOWETL_RUNTIME_CONFIG` environment variable added to control which DAG definitions the FlowETL integration tests should use (valid values: "testing", "production").
- `FLOWETL_INTEGRATION_TESTS_DISABLE_PULLING_DOCKER_IMAGES` environment variable added to allow running the FlowETL integration tests against locally built docker images during development.
- FlowAPI's 'joined_spatial_aggregate' endpoint now exposes handset. #1011 and #1029
- `JoinedSpatialAggregate` now supports "distr" stats, which output the relative distribution of the passed metrics.
- Added `SubscriberHandsetCharacteristic` to FlowMachine
- FlowAuth now supports optional two-factor authentication #121
- The flowdb containers for test_data and synthetic_data were split into two separate containers and quick_start.sh downloads the docker-compose files to a new temporary directory on each run. #843
- Flowmachine now returns more informative error messages when query parameter validation fails. #1055
- `TESTING` environment variable was removed (previously used by the FlowETL integration tests).
- Removed `SubscriberPhoneType` from FlowMachine to avoid redundancy.
- `PRIVATE_JWT_SIGNING_KEY` environment variable/secret added to FlowAuth, which should be a PEM encoded RSA private key, optionally base64 encoded if supplied as an environment variable.
- `PUBLIC_JWT_SIGNING_KEY` environment variable/secret added to FlowAPI, which should be a PEM encoded RSA public key, optionally base64 encoded if supplied as an environment variable.
- The dev provisioning Ansible playbook now automatically generates an SSH key pair for the `flowkit` user. #892
- Added new classes to represent spatial units in FlowMachine.
- Added a `Geography` query class, to get geography data for a spatial unit.
- FlowAPI's 'joined_spatial_aggregate' endpoint now exposes unique location counts. #949
- FlowAPI's 'joined_spatial_aggregate' endpoint now exposes subscriber degree. #969
- Flowdb now contains an auxiliary table to record outcomes of queries that can be run as part of the regular ETL process #988
- The quick-start script now only pulls the docker images for the services that are actually started up. #898
- FlowAuth and FlowAPI are now linked using an RSA keypair, instead of per-server shared secrets. #89
- Location-related FlowMachine queries now take a `spatial_unit` parameter instead of `level`.
- The quick-start script now uses the environment variable `GIT_REVISION` to control the version to be deployed.
- Create token page permission and spatial aggregation checkboxes are now hidden by default. #834
- The flowetl mounted directories `archive, dump, ingest, quarantine` were replaced with a single `files` directory and files are no longer moved. #946
- FlowDB's postgresql has been updated to 11.4, which addresses several bugs and one major vulnerability.
- When creating a new token in FlowAuth, the expiry now always shows the year, seconds till expiry, and timezone. #260
- Distances in `Displacement` are now calculated with longitude and latitude the correct way around. #913
- The quick-start script now works correctly with branches. #902
- Fixed `location_event_counts` failing to work when specifying a subset of event types #1015
- FlowAPI will now show the correct version in the API spec, flowmachine and flowclient will show the correct versions in the worked examples. #818
- Removed `cell_mappings.py`, `get_columns_for_level` and `BadLevelError`.
- `JWT_SECRET_KEY` has been removed in favour of RSA keys.
- The FlowDB tables `infrastructure.countries` and `infrastructure.operators` have been removed. #958
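
The `PRIVATE_JWT_SIGNING_KEY` and `PUBLIC_JWT_SIGNING_KEY` values noted above may be supplied either as raw PEM text or as base64 encoded PEM. The following is a minimal sketch of normalising such a value before handing it to a JWT library. `load_pem_key` is a hypothetical helper written for illustration, and is not taken from FlowAuth or FlowAPI.

```python
import base64

PEM_PREFIX = "-----BEGIN"


def load_pem_key(raw: str) -> str:
    """Return PEM text from a value that is either PEM or base64 encoded PEM.

    Hypothetical helper: the real services may detect the encoding differently.
    """
    value = raw.strip()
    if value.startswith(PEM_PREFIX):
        # Already plain PEM, e.g. when read from a Docker secret file.
        return value
    # Otherwise assume the whole PEM document was base64 encoded.
    decoded = base64.b64decode(value).decode("utf-8").strip()
    if not decoded.startswith(PEM_PREFIX):
        raise ValueError("value is neither PEM nor base64 encoded PEM")
    return decoded
```

Checking for the PEM header first keeps the plain-text case free of any decoding step, so a key mounted as a secret passes through unchanged.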
- Buttons to copy token to clipboard and download token as file added to token list page. #704
- Two new worked examples: "Cell Towers Per Region" and "Unique Subscriber Counts". #633, #634
- The `FLOWDB_DEBUG` environment variable has been renamed to `FLOWDB_ENABLE_POSTGRES_DEBUG_MODE`.
- FlowAuth will now automatically set up the database when started without needing to trigger via the cli.
- FlowAuth now requires that at least one administrator account is created by providing env vars or secrets for:
  - `FLOWAUTH_ADMIN_PASSWORD`
  - `FLOWAUTH_ADMIN_USERNAME`
- The `FLOWDB_DEBUG` environment variable used to have no effect. This has been fixed. #811
- Previously, queries could be stuck in an executing state if writing their cache metadata failed, they will now correctly show as having errored. #833
- Fixed an issue where `Table` objects could be in an inconsistent cache state after resetting cache #832
- FlowAuth's docker container can now be used with a Postgres backing database. #825
- FlowAPI now starts up successfully when following the "Secrets Quickstart" instructions in the docs. #836
- The command to generate an SSL certificate in the "Secrets Quickstart" section in the docs has been fixed and made more robust #837
- FlowAuth will no longer try to initialise the database or create demo data multiple times when running under uwsgi with multiple workers #844
- Fixed multiple tokens not lining up on the FlowAuth "Tokens" page #849
- The `FLOWDB_SERVICES` environment variable has been removed from the toplevel Makefile, so that now `DOCKER_SERVICES` is the only environment variable that controls which services are spun up when running `make up`. #827
- FlowKit's worked examples are now Dockerized, and available as part of the quick setup script #614
- Skeleton for Airflow based ETL system added with basic ETL DAG specification and tests.
- The docs now contain information about required versions of installation prerequisites #703
- FlowAPI now requires the `FLOWAPI_IDENTIFIER` environment variable to be set, which contains the name used to identify this FlowAPI server when generating tokens in FlowAuth #727
- `flowmachine.utils.calculate_dependency_graph` now includes the `Query` objects in the `query_object` field of the graph's nodes dictionary #767
- Architectural Decision Records (ADR) have been added and are included in the auto-generated docs #780
- Added FlowDB environment variables `SHARED_BUFFERS_SIZE` and `EFFECTIVE_CACHE_SIZE`, to allow manually setting the Postgres configuration parameters `shared_buffers` and `effective_cache_size`.
- The function `print_dependency_tree()` now takes an optional argument `show_stored` to display whether dependent queries have been stored or not #804
- A new function `plot_dependency_graph()` has been added, which makes it convenient to plot and visualise a dependency graph in Jupyter notebooks (this requires IPython and pygraphviz to be installed) #786
- Parameter names in `flowmachine.connect()` have been renamed as follows to be consistent with the associated environment variables #728:
  - `db_port -> flowdb_port`
  - `db_user -> flowdb_user`
  - `db_pass -> flowdb_password`
  - `db_host -> flowdb_host`
  - `db_connection_pool_size -> flowdb_connection_pool_size`
  - `db_connection_pool_overflow -> flowdb_connection_pool_overflow`
- FlowAPI and FlowAuth now expect an audience key to be present in tokens #727
- Dependent queries are now only included once in the md5 calculation of a given query (in particular, it changes the query ids compared to previous FlowKit versions).
- An error is displayed in the add user form of FlowAuth if the username already exists. #690
- An error is displayed in the add group form of FlowAuth if the group name already exists. #709
- FlowAuth's add new server page now shows helper text for bad inputs. #749
- The class `SubscriberSubsetterBase` in FlowMachine no longer inherits from `Query` #740 (this changes the query ids compared to previous FlowKit versions).
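
Code that still passes the old `flowmachine.connect()` parameter names listed above needs its keyword arguments updated. The sketch below translates a dictionary of old-style keyword arguments to the new names. The mapping is transcribed from the rename list; `migrate_connect_kwargs` is a hypothetical helper and is not part of FlowMachine.

```python
# Old -> new keyword names for flowmachine.connect(), as listed in the changelog.
RENAMED_CONNECT_ARGS = {
    "db_port": "flowdb_port",
    "db_user": "flowdb_user",
    "db_pass": "flowdb_password",
    "db_host": "flowdb_host",
    "db_connection_pool_size": "flowdb_connection_pool_size",
    "db_connection_pool_overflow": "flowdb_connection_pool_overflow",
}


def migrate_connect_kwargs(kwargs: dict) -> dict:
    """Return a copy of kwargs with old parameter names replaced (hypothetical helper)."""
    migrated = {}
    for name, value in kwargs.items():
        new_name = RENAMED_CONNECT_ARGS.get(name, name)
        if new_name in migrated:
            # Both the old and the new spelling were supplied; refuse to guess.
            raise ValueError(f"duplicate value for {new_name!r}")
        migrated[new_name] = value
    return migrated
```

A caller could then write `connect(**migrate_connect_kwargs(old_kwargs))` while a codebase is being updated, and drop the helper once every call site uses the new names.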
- FlowClient docs rendered to website now show the options available for arguments that require a string from some set of possibilities #695.
- The Flowmachine loggers are now initialised only once when flowmachine is imported, with a call to `connect()` only changing the log level #691
- The FERNET_KEY environment variable for FlowAuth is now named FLOWAUTH_FERNET_KEY
- The quick-start script now correctly aborts if one of the FlowKit services doesn't fully start up #745
- The maps in the worked examples docs pages now appear in any browser
- Example invocations of `generate-jwt` are no longer uncopyable due to line wrapping #778
- API parameter `interval` for `location_event_counts` queries is now correctly passed to the underlying FlowMachine query object #807.
- Added a new module, `flowkit-jwt-generator`, which generates test JWT tokens for use with FlowAPI #564
- A new Ansible playbook was added in `deployment/provision-dev.yml`. In addition to the standard provisioning this installs pyenv, Python 3.7, pipenv and clones the FlowKit repository, which is useful for development purposes.
- Added a 'quick start' setup script for trying out a complete FlowKit system #688.
- FlowAPI's `available_dates` endpoint now always returns available dates for all event types and does not accept JSON
- Hints are now displayed in the add user form of FlowAuth if the form is not completed #679
- Error messages are now displayed when generating a new token in FlowAuth if the token's name is invalid #799
- The Ansible playbooks in `deployment/` now allow configuring the username and password for the FlowKit user account.
- Default compose file no longer includes build blocks, these have been moved to `docker-compose-build.yml`.
- FlowDB synthetic data container no longer silently fails to generate data if data generator is not set #654
- Fixed `TotalNetworkObjects` raising an error when run with a lat-long level #108
- Radius of gyration no longer incorrectly appears as a top level api query
- Added new flowclient API entrypoint, `aggregate_network_objects`, to access equivalent flowmachine query #601
- FlowAPI now exposes the API spec at the `spec/openapi.json` endpoint, and an interactive version of the spec at the `spec/redoc` endpoint
- Added Makefile target `make up-no_build`, to spin up all containers without building the images
- Added `resync_redis_with_cache` function to cache utils, to allow administrators to align redis with FlowDB #636
- Added new flowclient API entrypoint, `radius_of_gyration`, to access (with simplified parameters) equivalent flowmachine query `RadiusOfGyration` #602
- The `period` argument to `TotalNetworkObjects` in FlowMachine has been renamed `total_by`
- The `period` argument to `total_network_objects` in FlowClient has been renamed `total_by`
- The `by` argument to `AggregateNetworkObjects` in FlowMachine has been renamed to `aggregate_by`
- The `stop_date` argument to the `modal_location_from_dates` and `meaningful_locations_*` functions in FlowClient has been renamed `end_date` #470
- `get_result_by_query_id` now accepts a `poll_interval` argument, which allows polling frequency to be changed
- The `start` and `stop` arguments to `EventTableSubset` are now mandatory.
- `RadiusOfGyration` now returns a `value` column instead of an `rog` column
- `TotalNetworkObjects` and `AggregateNetworkObjects` now return a `value` column, rather than `statistic_name`
- All environment variables are now in a single `development_environment` file in the project root, development environment setup has been simplified
- Default FlowDB users for FlowMachine and FlowAPI have changed from "analyst" and "reporter" to "flowmachine" and "flowapi", respectively
- Docs and integration tests now use top level compose file
- The following environment variables have been renamed:
  - `FLOWMACHINE_SERVER` (FlowAPI) -> `FLOWMACHINE_HOST`
  - `FM_PASSWORD` (FlowDB), `FLOWDB_PASS` (FlowMachine) -> `FLOWMACHINE_FLOWDB_PASSWORD`
  - `API_PASSWORD` (FlowDB), `FLOWDB_PASS` (FlowAPI) -> `FLOWAPI_FLOWDB_PASSWORD`
  - `FM_USER` (FlowDB), `FLOWDB_USER` (FlowMachine) -> `FLOWMACHINE_FLOWDB_USER`
  - `API_USER` (FlowDB), `FLOWDB_USER` (FlowAPI) -> `FLOWAPI_FLOWDB_USER`
  - `LOG_LEVEL` (FlowMachine) -> `FLOWMACHINE_LOG_LEVEL`
  - `LOG_LEVEL` (FlowAPI) -> `FLOWAPI_LOG_LEVEL`
  - `DEBUG` (FlowDB) -> `FLOWDB_DEBUG`
  - `DEBUG` (FlowMachine) -> `FLOWMACHINE_SERVER_DEBUG_MODE`
- The following Docker secrets have been renamed:
  - `FLOWAPI_DB_USER` -> `FLOWAPI_FLOWDB_USER`
  - `FLOWAPI_DB_PASS` -> `FLOWAPI_FLOWDB_PASSWORD`
  - `FLOWMACHINE_DB_USER` -> `FLOWMACHINE_FLOWDB_USER`
  - `FLOWMACHINE_DB_PASS` -> `FLOWMACHINE_FLOWDB_PASSWORD`
  - `POSTGRES_PASSWORD_FILE` -> `POSTGRES_PASSWORD`
  - `REDIS_PASSWORD_FILE` -> `REDIS_PASSWORD`
- `status` enum in FlowDB renamed to `etl_status`
- `reset_cache` now requires a redis client argument
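
Several of the environment variable renames listed above depend on which service reads the variable: for example, `LOG_LEVEL` becomes `FLOWMACHINE_LOG_LEVEL` for FlowMachine but `FLOWAPI_LOG_LEVEL` for FlowAPI. The sketch below shows one way to apply the renames per service when updating an existing configuration. The mapping is transcribed from the list; `ENV_RENAMES` and `rename_env` are hypothetical names used only for illustration.

```python
# Old -> new environment variable names, keyed by the service that reads them.
ENV_RENAMES = {
    "flowapi": {
        "FLOWMACHINE_SERVER": "FLOWMACHINE_HOST",
        "FLOWDB_PASS": "FLOWAPI_FLOWDB_PASSWORD",
        "FLOWDB_USER": "FLOWAPI_FLOWDB_USER",
        "LOG_LEVEL": "FLOWAPI_LOG_LEVEL",
    },
    "flowmachine": {
        "FLOWDB_PASS": "FLOWMACHINE_FLOWDB_PASSWORD",
        "FLOWDB_USER": "FLOWMACHINE_FLOWDB_USER",
        "LOG_LEVEL": "FLOWMACHINE_LOG_LEVEL",
        "DEBUG": "FLOWMACHINE_SERVER_DEBUG_MODE",
    },
    "flowdb": {
        "FM_PASSWORD": "FLOWMACHINE_FLOWDB_PASSWORD",
        "API_PASSWORD": "FLOWAPI_FLOWDB_PASSWORD",
        "FM_USER": "FLOWMACHINE_FLOWDB_USER",
        "API_USER": "FLOWAPI_FLOWDB_USER",
        "DEBUG": "FLOWDB_DEBUG",
    },
}


def rename_env(service: str, env: dict) -> dict:
    """Return a copy of env with the given service's old variable names replaced.

    Hypothetical helper; variables that were not renamed are passed through.
    """
    renames = ENV_RENAMES[service]
    return {renames.get(name, name): value for name, value in env.items()}
```

Keying the mapping by service avoids the ambiguity of names such as `DEBUG` and `FLOWDB_USER`, which map to different targets depending on the container they were set for.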
- Fixed being unable to add new users or servers when running FlowAuth with a Postgres database #622
- Resetting the cache using `reset_cache` will now reset the state of queries in redis as well #650
- Fixed `mode` statistic for `AggregateNetworkObjects` #651
- Removed `docker-compose-dev.yml`, and docker-compose files in `docs/`, `flowdb/tests/` and `integration_tests/`.
- Removed `Dockerfile-dev` Dockerfiles
- Removed `ENV` defaults from the FlowMachine Dockerfile
- Removed `POSTGRES_DB` environment variable from FlowDB Dockerfile, database name is now hardcoded as `flowdb`
- Added new `spatial_aggregate` API endpoint and FlowClient function #599
- Added new flowclient API entrypoint, total_network_objects(), to access (with simplified parameters) equivalent flowmachine query #581
- Added new flowclient API entrypoint, location_introversion(), to access (with simplified parameters) equivalent flowmachine query #577
- Added new flowclient API entrypoint, unique_subscriber_counts(), to access (with simplified parameters) equivalent flowmachine query #562
- New schema `aggregates` and table `aggregates.aggregates` have been created for maintaining a record of the process and completion of scheduled aggregates.
- New `joined_spatial_aggregate` API endpoint and FlowClient function #600
- `daily_location` and `modal_location` query types are no longer accepted as top-level queries, and must be wrapped using `spatial_aggregate`
- `JoinedSpatialAggregate` no longer accepts positional arguments
- `JoinedSpatialAggregate` now supports "avg", "max", "min", "median", "mode", "stddev" and "variance" stats
- `total_network_objects` no longer returns results from `AggregateNetworkObjects` #603
- Fixed #514, which would cause the client to hang after submitting a query that couldn't be created
- Fixed #575, so that events at midnight are now considered to be happening on the following day
- Added `HandsetStats` to FlowMachine.
- Added new `ContactReferenceLocationStats` query class to FlowMachine.
- A new zmq message `get_available_dates` was added to the flowmachine server, along with the `/available_dates` endpoint in flowapi and the function `get_available_dates()` in flowclient. These make it possible to determine the dates that are available in the database for the supported event types.
- FlowMachine's debugging logs are now from a single logger (`flowmachine.debug`) and include the submodule in the submodule field instead of using it as the logger name
- FlowMachine's query run logger now uses the logger name `flowmachine.query_run_log`
- FlowAPI's access, run and debug loggers are now named `flowapi.access`, `flowapi.query` and `flowapi.debug`
- FlowAPI's access and run loggers, and FlowMachine's query run logger now log to stdout instead of stderr
- Passwords for Redis and FlowDB must now be explicitly provided to flowmachine via argument to `connect`, env var, or secret
- FlowMachine and FlowAPI no longer support logging to a file
- The flowmachine python library is now pip installable (`pip install flowmachine`)
- The flowmachine server now supports additional actions: `get_available_queries`, `get_query_schemas`, `ping`.
- Flowdb now contains a new `dfs` schema and associated tables to process mobile money transactions. In addition, `flowdb_testdata` contains sample data for DFS transactions.
- The docs now include three worked examples of CDR analysis using FlowKit.
- Flowmachine now supports calculating the total amount of various DFS metrics (transaction amount, commission, fee, discount) per aggregation unit during a given date range. These metrics are also exposed in FlowAPI via the query kind `dfs_metric_total_amount`.
- The JSON structure when setting queries running via flowapi or the flowmachine server has changed: query parameters are now "inlined" alongside the `query_kind` key, rather than nested using a separate `params` key. Example:
  - previously: `{"query_kind": "daily_location", "params": {"date": "2016-01-01", "aggregation_unit": "admin3", "method": "last"}}`
  - now: `{"query_kind": "daily_location", "date": "2016-01-01", "aggregation_unit": "admin3", "method": "last"}`
- The JSON structure of zmq reply messages from the flowmachine server was changed. Replies now have the form: `{"status": "[success|error]", "msg": "...", "payload": {...}}`.
- The flowmachine server action `get_sql` was renamed to `get_sql_for_query_result`.
- The parameter `daily_location_method` was renamed to `method`.
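
The move from nested `params` to inlined query parameters described above can be illustrated with a small conversion function. This is a minimal sketch for clients that still build query specifications in the old shape; `inline_query_params` is a hypothetical helper, not a function provided by flowapi or flowclient.

```python
def inline_query_params(old_spec: dict) -> dict:
    """Convert {"query_kind": ..., "params": {...}} to the inlined form.

    Hypothetical helper: keys other than "params" are kept as they are, and the
    contents of "params" are merged in next to them.
    """
    spec = {key: value for key, value in old_spec.items() if key != "params"}
    params = old_spec.get("params", {})
    clashes = set(spec) & set(params)
    if clashes:
        # A parameter named like a top-level key would be silently overwritten.
        raise ValueError(f"parameter names clash with top-level keys: {sorted(clashes)}")
    spec.update(params)
    return spec


if __name__ == "__main__":
    old = {
        "query_kind": "daily_location",
        "params": {"date": "2016-01-01", "aggregation_unit": "admin3", "method": "last"},
    }
    print(inline_query_params(old))
```

The input in the usage example is the "previously" form from the changelog entry, and the function returns the corresponding "now" form.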
- When running integration tests locally, normally pytest will automatically spin up servers for flowmachine and flowapi as part of the test setup. This can now be disabled by setting the environment variable `FLOWKIT_INTEGRATION_TESTS_DISABLE_AUTOSTART_SERVERS=TRUE`.
- The integration tests now use the environment variables `FLOWAPI_HOST`, `FLOWAPI_PORT` to determine how to connect to the flowapi server.
- A new data generator has been added to the synthetic data container which supports more data types, simple disaster simulation, and more plausible behaviours as well as increased performance
- FlowAPI now reports queued/running status for queries instead of just accepted
- The following environment variables have been renamed:
  - `DB_USER` -> `FLOWDB_USER`
  - `DB_HOST` -> `FLOWDB_HOST`
  - `DB_PASS` -> `FLOWDB_PASS`
  - `DB_PW` -> `FLOWDB_PASS`
  - `API_DB_USER` -> `FLOWAPI_DB_USER`
  - `API_DB_PASS` -> `FLOWAPI_DB_PASS`
  - `FM_DB_USER` -> `FLOWMACHINE_DB_USER`
  - `FM_DB_PASS` -> `FLOWMACHINE_DB_PASS`
- Added `numerator_direction` to `ProportionEventType` to allow for proportion of directed events.
- Server no longer loses track of queries under heavy load
- `TopUpBalances` no longer always uses entire topups table
- The environment variable `DB_NAME` has been removed.
- `MDSVolume` no longer allows specifying the table, and will always use the `mds` table.
- All FlowMachine logs are now in structured json form
- FlowAPI now uses structured logs for debugging messages
- Added `TopUpAmount`, `TopUpBalance` query classes to FlowMachine.
- Added `PerLocationEventStats`, `PerContactEventStats` to FlowMachine
- Removed `TotalSubscriberEvents` from FlowMachine as it is superseded by `EventCount`.
- Dockerised development setup, with support for live reload of `flowmachine` and `flowapi` after source code changes.
- Pre-commit hook for Python formatting with black.
- Added new `IntereventPeriod`, `ContactReciprocal`, `ProportionContactReciprocal`, `ProportionEventReciprocal`, `ProportionEventType` and `MDSVolume` query classes to FlowMachine.
- `CustomQuery` now requires column names to be specified
- Query classes are now required to declare the column names they return via the `column_names` property
- FlowAPI now reports whether a query is queued or running when polling
- FlowDB test data and synthetic data images are now available from their own Docker repos (Flowminder/flowdb-testdata, Flowminder/flowdb-synthetic-data)
- Changed query class name from `NocturnalCalls` to `NocturnalEvents`.
- FlowAPI is now an installable python module
- Query objects can no longer be recalculated to cache and must be explicitly removed first
- Arbitrary `Flow` maths
- `EdgeList` query type
- Removes query class `ProportionOutgoing` as it becomes redundant with the introduction of `ProportionEventType`.
- API route for retrieving geography data from FlowDB
- Aggregated meaningful locations are now available via FlowAPI
- Origin-destination matrices between meaningful locations are now available via FlowAPI
- Added new `MeaningfulLocations`, `MeaningfulLocationsAggregate` and `MeaningfulLocationsOD` query classes to FlowMachine
- Constructors for `HartiganCluster`, `LabelEventScore`, `EventScore` and `CallDays` now have different signatures
- Restructured and extended documentation; added high-level overview and more targeted information for different types of users
- Support for running FlowDB as an arbitrary user via docker's `--user` flag
- Support for setting the uid and gid of the postgres user when building FlowDB
- Fixed being unable to build if the port used by `git://` is not open
- Added utilities for managing and inspecting the query cache
- FlowDB now requires a password to be set for the flowdb superuser
- Support for password protected redis
- Changed the default redis image to bitnami's redis (to enable password protection)
- Added structured logging of access attempts, query running, and data access
- Added CHANGELOG.md
- Added support for Postgres JIT in FlowDB
- Added total location events metric to FlowAPI and FlowClient
- Added ETL bookkeeping schema to FlowDB
- Added changelog update to PR template
- Increased default shared memory size for FlowDB containers
- Fixed being unable to delete groups in FlowAuth
- Fixed `make up` not working with defaults
- Added Python 3.6 support for FlowClient