Squashed commit of the following:
commit b317169
Author: Håkon V. Treider <[email protected]>
Date:   Tue Oct 4 01:46:43 2022 +0200

    Dps Fetch Refactor #5: New dps fetch code (#1043)

    * New dps fetch code

    * typos and tweaks

    * Update cognite/client/_api/datapoint_tasks.py

    * forgot self arg...

    * Fix insertion of DatapointsArray

    * Make dps retrieve method kwargs only

    * fstring fix

    * Fix to avoid float dtype for count agg

commit 4e673cb
Author: Håkon V. Treider <[email protected]>
Date:   Tue Oct 4 01:43:13 2022 +0200

    Dps Fetch Refactor #9: Unit tests, added, fixed and improved. (#1047)

    * Unit tests, added, fixed and improved.

    * add test for split_into_n_parts and find_duplicates

commit 8099cb3
Author: Håkon V. Treider <[email protected]>
Date:   Tue Oct 4 01:42:49 2022 +0200

    Integration tests, added, fixed and improved (#1048)

commit a049ec4
Author: Håkon V. Treider <[email protected]>
Date:   Tue Oct 4 01:42:30 2022 +0200

    Dps Fetch Refactor #11: Readme, docs and changelog (#1049)

    * Readme, docs and changelog

    * Improve changelog

    * Changelog almost done

    * Mooooooare changelog details

    * Apply 12 spell check suggestions from code review

    * Reword the note about `include_aggregate_name`

    * Update CHANGELOG.md

commit e74699e
Author: Håkon V. Treider <[email protected]>
Date:   Tue Oct 4 00:21:33 2022 +0200

    Dps Fetch Refactor #12: Fix `ms_to_datetime` on windows (#1056)

    * Fix ms_to_datetime for windows

    * make single ms_to_datetime impl. work on all OSs. remove removed warnings from filterwarnings

commit 59aace9
Author: Håkon V. Treider <[email protected]>
Date:   Tue Oct 4 00:11:57 2022 +0200

    Dps Fetch Refactor #6: Add array-based data classes. Change camel case defaults (#1044)

    * Add array-based data classes. Change camel case defaults

    * No copy in df constructor from numpy arrays yields another 100 x

    * Update cognite/client/data_classes/datapoints.py

    * Update cognite/client/data_classes/datapoints.py

    * remove unnecessary list conv

    * Remove _strip_aggregate_names entirely

commit f0b0525
Author: Håkon V. Treider <[email protected]>
Date:   Fri Sep 30 14:04:31 2022 +0200

    Dps Fetch Refactor #7: Extend utils with new functionality (#1045)

    * Extend utils with new functionality

    * add hashable bound to type var

commit c281ac0
Author: Håkon V. Treider <[email protected]>
Date:   Fri Sep 30 14:04:00 2022 +0200

    Dps Fetch Refactor #3: Add ThreadPoolExecutor with PriorityQueue (#1041)

    * Add ThreadPoolExecutor with PriorityQueue

    * move mypy ignore flag to ini file

commit 920cd84
Author: Håkon V. Treider <[email protected]>
Date:   Fri Sep 30 13:12:17 2022 +0200

    Dps Fetch Refactor #2: Update dependencies and version. Run poetry lock (#1040)

    * Update dependencies and version. Run poetry lock

    * relax and tighten dep. reqs

    * Trim pandas version (no-op)

commit f0a69d8
Merge: 6be4b44 7abd1c2
Author: Håkon V. Treider <[email protected]>
Date:   Thu Sep 29 02:22:23 2022 +0200

    Merge branch 'master' into v5-release

commit 7abd1c2
Author: Jaime Silva <[email protected]>
Date:   Wed Sep 28 03:55:39 2022 -0500

    Rename alpha data models destination in transformations (#1050)

    * rename alpha data models to data models

    * version bump

    * unskip dms and jobs tests

    * show warning when using FDMs on transformations

commit aa3f52b
Author: tuanng-cognite <[email protected]>
Date:   Tue Sep 27 14:00:50 2022 +0200

    geospatial aggregation to support output (#1032)

    * geospatial aggregation to support output

    * format

    * rename

    * fix deprecation message

    * improve example

commit 6be4b44
Merge: 88c3877 98f5033
Author: Håkon V. Treider <[email protected]>
Date:   Tue Sep 27 03:06:22 2022 +0200

    Merge branch 'master' into v5-release

commit 88c3877
Author: Håkon V. Treider <[email protected]>
Date:   Tue Sep 27 03:04:20 2022 +0200

    Cleanup of CogClient mock duplicates. Update with 20 missing APIs... (#1046)

commit 9b30dba
Author: Håkon V. Treider <[email protected]>
Date:   Tue Sep 27 02:48:55 2022 +0200

    Move DatapointsAPI to time_series.data. Many minor fixups. Bump max_workers to 20. (#1042)

commit 98f5033
Author: Håkon V. Treider <[email protected]>
Date:   Mon Sep 26 19:44:30 2022 +0200

    Simplify github workflows. Minor project linting settings changes (#1039)

    * Simplify GitHub workflows. Minor project linting settings changes
haakonvt committed Oct 6, 2022
1 parent 0f5dda5 commit 101b31d
Showing 62 changed files with 5,760 additions and 2,459 deletions.
74 changes: 70 additions & 4 deletions CHANGELOG.md
@@ -10,18 +10,85 @@ Changes are grouped as follows
- `Added` for new features.
- `Changed` for changes in existing functionality.
- `Deprecated` for soon-to-be removed features.
- `Improved` for transparent changes, e.g. better performance.
- `Removed` for now removed features.
- `Fixed` for any bug fixes.
- `Security` in case of vulnerabilities.

## [5.0.0] - 2022-09-28
### Improved
- Greatly increased speed of datapoints fetching, especially when asking for...:
  - Large number of time series (~80+)
  - Very few time series (1-3)
  - Any query using a finite `limit`
  - Any query for `string` datapoints
- Peak memory consumption is 25-30 % lower when using the new `retrieve_arrays` method (with the same number of `max_workers`).
- Converting fetched datapoints to a pandas `DataFrame` via `to_pandas()` (or the time saved by using `retrieve_dataframe` directly) has gone from `O(N)` to `O(1)`: the speedup grows with the number of datapoints and is typically 4-5 orders of magnitude (!). This only applies to `DatapointsArray` and `DatapointsArrayList`, as returned by the `retrieve_arrays` method.
- Individual customization of queries is now available for all retrieve endpoints. Previously, only `aggregates` could be customized per item; now all parameters can be passed either at the top level or as individual settings. This is now aligned with the API.
- Documentation for the retrieve endpoints has been overhauled with lots of new usage patterns and (better!) examples - check it out!
- Vastly better test coverage for datapoints fetching logic. You can have increased trust in the results from the SDK!

### Added
- New optional dependency, `numpy`.
- A new datapoints fetching method, `retrieve_arrays`, that loads data directly into NumPy arrays for improved speed and lower memory usage.
  - These arrays are stored in the new resource types `DatapointsArray` and `DatapointsArrayList`, which offer more efficient memory usage and zero-overhead pandas conversion.
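
For illustration, here is a minimal usage sketch of the new method. The identifier, time range and credential setup are placeholders and assume credentials are configured via environment variables:

```python
from datetime import datetime

from cognite.client import CogniteClient

client = CogniteClient()  # assumes project/credentials are configured in the environment

# Fetch aggregate datapoints directly into a numpy-backed DatapointsArray:
dps = client.time_series.data.retrieve_arrays(
    external_id="my-ts-xid",      # placeholder identifier
    start=datetime(2022, 1, 1),
    end="now",
    aggregates=["average"],
    granularity="1h",
)
df = dps.to_pandas()  # near zero-overhead conversion to a pandas DataFrame
```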

### Changed
- The default value for `max_workers`, controlling the max number of concurrent threads, has been increased from 10 to 20.
- The main way to interact with the `DatapointsAPI` has been moved from `client.datapoints` to `client.time_series.data` to align and unify with the `SequenceAPI`. All example code has been updated to reflect this change. Note, however, that `client.datapoints` will keep working until the next major release, but will issue a `DeprecationWarning` until then.
- The utility function `datetime_to_ms` no longer issues a `FutureWarning` on missing timezone information. It will now interpret naive `datetime`s as local time, in line with Python's default interpretation.
- The utility function `ms_to_datetime` no longer issues a `FutureWarning` on returning a naive `datetime` in UTC. It will now return an aware `datetime` object in UTC.
- All data classes in the SDK that represent a Cognite resource type have a `to_pandas` method. Previously, these had various defaults for the `camel_case` parameter, but they have all been changed to `False`.
- The method `DatapointsAPI.insert_dataframe` has new default values for `dropna` (now `True`, still being applied on a per-column basis) and `external_id_headers` (now `True`, disincentivizing the use of internal IDs).
- The previous fetching logic awaited and collected all errors before raising (through the use of an "initiate-and-forget" thread pool). This is great for e.g. updates/inserts to make sure you are aware of all partial changes. However, when reading datapoints, a better option is to just fail fast (which it now does).
- `DatapointsAPI.[retrieve/retrieve_arrays/retrieve_dataframe]` no longer requires `start` (default: `0`) and `end` (default: `now`). This is now aligned with the API.
- All retrieve methods accept a list of full query dictionaries for `id` and `external_id`, giving full flexibility for individual settings such as `start`, `limit` and `granularity` (to name a few), which was previously only possible with the `DatapointsAPI.query` endpoint; a usage sketch follows after this list. This is now aligned with the API.
- Aggregates returned now include the time period(s) (given by `granularity` unit) that `start` and `end` are a part of (as opposed to only "fully in-between" points). This is now aligned with the API.
This is also a **bugfix**: Due to the SDK rounding differently than the API, you could supply `start` and `end` (with `start < end`) and still be given an error that `start is not before end`. This can no longer happen.
- Fetching raw datapoints using `return_outside_points=True` now returns both outside points (if they exist), regardless of `limit` setting. Previously the total number of points was capped at `limit`, thus typically only returning the first. Now up to `limit+2` datapoints are always returned. This is now aligned with the API.
- Asking for the same time series any number of times no longer raises an error (from the SDK), which is useful for instance when fetching disconnected time periods. This is now aligned with the API.
- ...this change also causes the `.get` method of `DatapointsList` and `DatapointsArrayList` to return a list of `Datapoints` or `DatapointsArray`, respectively, when duplicate identifiers were queried. (For data scientists and others used to `pandas`, this is similar to the slicing logic of `Series` and `DataFrame` with non-unique indices.)
There is also a subtle **bugfix** here: since the previous implementation allowed the same time series to be specified by both its `id` and `external_id`, using `.get` to retrieve it would always yield only the object fetched with the settings given for the `external_id`. It will now return a `list`, as explained above.
- The datapoints fetching algorithm has changed. It no longer relies on up-to-date and correct `count` aggregates to be fast (with a fallback on serial fetching when these were missing); instead, it recursively (and reactively) splits the time domain into smaller and smaller pieces, based on the density of datapoints discovered while fetching and the number of available threads. The new approach can also group more than one time series per API request (when beneficial) and short-circuit once a user-given limit has been reached (if given). This method is now used for *all* types of queries: numeric raw, string raw, and aggregate datapoints.
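
A sketch of the per-item customization described above; all identifiers and settings are made up for illustration:

```python
from cognite.client import CogniteClient

client = CogniteClient()  # credentials assumed to be configured in the environment

dps_lst = client.time_series.data.retrieve(
    id=[
        {"id": 123, "aggregates": ["average"], "granularity": "1d"},
        {"id": 456, "start": "2w-ago", "limit": 1000},  # raw datapoints for this item only
    ],
    external_id={"external_id": "my-ts-xid", "aggregates": ["count"], "granularity": "1h"},
    # Top-level settings act as defaults for items that do not override them:
    end="now",
    ignore_unknown_ids=True,
)
```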

#### Change: `retrieve_dataframe`
- Previously, fetching was restricted to either raw OR aggregate datapoints. This restriction has been lifted and the method now works exactly like the other retrieve methods (with a few extra options relevant only for pandas `DataFrame`s).
- It used to fetch time series given by `id` and `external_id` separately; this is no longer the case, which gives a significant additional speedup when both are supplied.
- The `complete` parameter has been removed and partially replaced by `uniform_index` (bool), which covers a subset of the previous features (with one modification: the uniform index now spans from the first given `start` to the last given `end`). Rationale: the old syntax was weird and unintuitive (a string using a comma to separate options).
- Interpolation, forward-filling and, more generally, imputation (previously also controlled via the `complete` parameter) have been removed entirely, as resampling logic *really* should be up to the user fetching the data, not the SDK.
- New parameter `column_names` (as already used in several existing `to_pandas` methods) decides whether to pick `id`s or `external_id`s as the dataframe column names. Previously, when both were supplied, the dataframe ended up with a mix.
Read more below in the removed section or check out the method's updated documentation.
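
A rough sketch of the reworked method; identifiers and the time range are placeholders:

```python
from cognite.client import CogniteClient

client = CogniteClient()  # credentials assumed to be configured in the environment

df = client.time_series.data.retrieve_dataframe(
    external_id=["ts-A", "ts-B"],   # placeholder identifiers
    start="30d-ago",
    end="now",
    aggregates=["average", "max"],
    granularity="1h",
    uniform_index=True,             # equally spaced index spanning the full requested window
    column_names="external_id",     # use external IDs rather than internal ids as column names
)
```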

### Fixed
- **Critical**: Fetching aggregate datapoints now works properly with the `limit` parameter. In the old implementation, `count` aggregates were first fetched to split the time domain efficiently - but this has little-to-no informational value when fetching *aggregates* with a granularity, as the datapoints distribution can take on "any shape or form". This often led to just a few returned batches of datapoints due to miscounting.
- Fetching datapoints using `limit=0` now returns zero datapoints, instead of "unlimited". This is now aligned with the API.
- Removing aggregate names from the columns of a Pandas `DataFrame` in the previous implementation used `Datapoints._strip_aggregate_name()`, but this had a bug: whenever raw datapoints were fetched, all characters after the last pipe character (`|`) in the tag name were removed completely. In the new version, the aggregate name is only added when asked for.
- The method `Datapoints.to_pandas` could return `dtype=object` for numeric time series when all aggregate datapoints were missing; this is not *that* unlikely, e.g., when using the `interpolation` aggregate on an `is_step=False` time series with an average datapoint spacing above one hour. In such cases, an object array containing only `None` was returned instead of a float array with `NaN`s. The correct dtype is now enforced by an explicit cast.
- Fixed a rare bug in `DatapointsAPI.query` when no time series was found (`ignore_unknown_ids=True`) and `.get` was used on the empty returned `DatapointsList` object, which would raise an exception because the identifiers-to-datapoints mapping was not defined.

### Fixed: Extended time domain
- `TimeSeries.[first/count]()` now work with the expanded time domain (the earliest allowed datapoint timestamp was moved from 1970 to 1900, see [4.2.1]).
- `TimeSeries.first()` now considers datapoints before 1970 and after "now".
- `TimeSeries.count()` now considers datapoints before 1970 and after "now" and will raise an error for string time series as `count` (or any other aggregate) is not defined.
- The utility function `ms_to_datetime` no longer raises `ValueError` for inputs from before 1970, but will still raise for input outside the minimum and maximum timestamps supported by the API.
**Note**: Support for `datetime`s before 1970 may be limited on Windows.
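
A small sketch of the new behaviour. The import path `cognite.client.utils` is an assumption; adjust it to wherever these helpers live in your installed version:

```python
from datetime import datetime, timezone

from cognite.client.utils import datetime_to_ms, ms_to_datetime  # import path assumed

# Naive datetimes are now interpreted as local time (no FutureWarning is issued):
ms = datetime_to_ms(datetime(2020, 1, 1, 12, 30))

# ms_to_datetime now returns an aware datetime in UTC and accepts pre-1970
# (negative) timestamps within the API's allowed range:
dt = ms_to_datetime(-631_152_000_000)  # 1950-01-01 (may be limited on Windows, see note above)
assert dt.tzinfo == timezone.utc
```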

### Removed
- All convenience methods related to plotting and the use of `matplotlib`. Rationale: no usage and low utility value; the SDK should not be a data science library.
- The entire method `DatapointsAPI.retrieve_dataframe_dict`. Rationale: Due to its slightly confusing syntax and return value, it basically saw no use "in the wild".

### Other
Evaluation of `protobuf` performance: in its current state, using `protobuf` results in a significant performance degradation compared to JSON. Additionally, it adds an extra dependency which, if installed as its pure-Python distribution, degrades performance dramatically.


## [4.8.0] - 2022-09-30
### Added
- Add operations for geospatial rasters

## [4.7.1] - 2022-09-29

### Fixed
- Fixed the `FunctionsAPI.create` method for Windows-users by removing
validation of `requirements.txt`.

## [4.7.0] - 2022-09-28
@@ -45,7 +112,6 @@
### Fixed
- Fixes the issue when updating transformations with new nonce credentials


## [4.5.1] - 2022-09-08
### Fixed
- Don't depend on typing_extensions module, since we don't have it as a dependency.
@@ -164,8 +230,8 @@ other OAuth flows.
- added support for nonce authentication on transformations

### Changed
- if no source or destination credentials are provided on transformation create, an attempt will be made to create a session with the CogniteClient credentials, if it succeeds, the acquired nonce will be used.
- if OIDC credentials are provided on transformation create/update, an attempt will be made to create a session with the given credentials. If it succeeds, the acquired nonce credentials will replace the given client credentials before sending the request.

## [3.3.0] - 2022-07-21
### Added
12 changes: 9 additions & 3 deletions README.md
@@ -13,8 +13,8 @@ Cognite Python SDK
[![mypy](http://www.mypy-lang.org/static/mypy_badge.svg)](http://mypy-lang.org)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black)

This is the Cognite Python SDK for developers and data scientists working with Cognite Data Fusion (CDF).
The package is tightly integrated with pandas, and helps you work easily and efficiently with data in Cognite Data
Fusion (CDF).

## Reference documentation
@@ -34,11 +34,12 @@ $ pip install cognite-sdk
### With optional dependencies
A number of optional dependencies may be specified in order to support a wider set of features.
The available extras (along with the libraries they include) are:
- numpy `[numpy]`
- pandas `[pandas]`
- geo `[geopandas, shapely]`
- sympy `[sympy]`
- functions `[pip]`
- all `[numpy, pandas, geopandas, shapely, sympy, pip]`

To include optional dependencies, specify them like this with pip:

Expand All @@ -51,6 +52,11 @@ or like this if you are using poetry:
$ poetry add cognite-sdk -E pandas -E geo
```

### Performance notes
If you regularly need to fetch large numbers of datapoints, consider installing with `numpy`
(or with `pandas`, which depends on `numpy`) for the best performance, and then use the `retrieve_arrays` endpoint.
This avoids building large pure-Python data structures, and instead reads data directly into `numpy.ndarray`s.

### Windows specific

On Windows, it is recommended to install `geopandas` and its dependencies using the `conda` package manager,
3 changes: 0 additions & 3 deletions cognite/client/_api/annotations.py
@@ -11,9 +11,6 @@
class AnnotationsAPI(APIClient):
_RESOURCE_PATH = "/annotations"

def __init__(self, *args: Any, **kwargs: Any) -> None:
super().__init__(*args, **kwargs)

@overload
def create(self, annotations: Annotation) -> Annotation:
...
87 changes: 87 additions & 0 deletions cognite/client/_api/datapoint_constants.py
@@ -0,0 +1,87 @@
from datetime import datetime
from typing import Dict, Iterable, List, Optional, TypedDict, Union

try:
import numpy as np
import numpy.typing as npt

NUMPY_IS_AVAILABLE = True
except ImportError:  # pragma: no cover
NUMPY_IS_AVAILABLE = False

if NUMPY_IS_AVAILABLE:
NumpyDatetime64NSArray = npt.NDArray[np.datetime64]
NumpyInt64Array = npt.NDArray[np.int64]
NumpyFloat64Array = npt.NDArray[np.float64]
NumpyObjArray = npt.NDArray[np.object_]

# Datapoints API-limits:
DPS_LIMIT_AGG = 10_000
DPS_LIMIT = 100_000
POST_DPS_OBJECTS_LIMIT = 10_000
FETCH_TS_LIMIT = 100
RETRIEVE_LATEST_LIMIT = 100


ALL_SORTED_DP_AGGS = sorted(
[
"average",
"max",
"min",
"count",
"sum",
"interpolation",
"step_interpolation",
"continuous_variance",
"discrete_variance",
"total_variation",
]
)


class CustomDatapointsQuery(TypedDict, total=False):
# No field required
start: Union[int, str, datetime, None]
end: Union[int, str, datetime, None]
aggregates: Optional[List[str]]
granularity: Optional[str]
limit: Optional[int]
include_outside_points: Optional[bool]
ignore_unknown_ids: Optional[bool]


class DatapointsQueryId(CustomDatapointsQuery):
id: int # required field


class DatapointsQueryExternalId(CustomDatapointsQuery):
external_id: str # required field


class CustomDatapoints(TypedDict, total=False):
# No field required
start: int
end: int
aggregates: Optional[List[str]]
granularity: Optional[str]
limit: int
include_outside_points: bool


class DatapointsPayload(CustomDatapoints):
items: List[CustomDatapoints]


DatapointsTypes = Union[int, float, str]


class DatapointsFromAPI(TypedDict):
id: int
externalId: Optional[str]
isString: bool
isStep: bool
datapoints: List[Dict[str, DatapointsTypes]]


DatapointsIdTypes = Union[int, DatapointsQueryId, Iterable[Union[int, DatapointsQueryId]]]
DatapointsExternalIdTypes = Union[str, DatapointsQueryExternalId, Iterable[Union[str, DatapointsQueryExternalId]]]
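
The `TypedDict`s above describe the per-item query dictionaries accepted by the retrieve endpoints. A minimal, hypothetical sketch of how they line up; importing from a private module is shown for illustration only:

```python
from cognite.client._api.datapoint_constants import (
    DatapointsQueryExternalId,
    DatapointsQueryId,
)

# Only `id`/`external_id` are required; every other key is optional (total=False):
query_by_id: DatapointsQueryId = {"id": 123, "aggregates": ["average"], "granularity": "1h"}
query_by_xid: DatapointsQueryExternalId = {"external_id": "my-ts-xid", "limit": 10_000}
```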
