Release 0.9.0

Major Features and Improvements

Performance improvements for vocabulary generation when using top_k.
Utility to deep-copy Beam PCollections was added to avoid unnecessary materialization.
Utilize deep_copy to avoid unnecessary materialization of pcollections when the input data is immutable. This feature is currently off by default and can be enabled by setting tft.Context.use_deep_copy_optimization=True.
Add bucketize_per_key which computes separate quantiles for each key and then bucketizes each value according to the quantiles computed for its key.
tft.scale_to_z_score is now implemented with a single pass over the data.
Export schema_utils package to convert from the tensorflow-metadata package to the (soon to be deprecated) tf_metadata subpackage of tensorflow-transform.
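To illustrate what bucketize_per_key computes, here is a minimal plain-Python sketch. It is not the library's implementation: it uses exact quantiles on small lists rather than an approximate algorithm over a Beam pipeline, and the helper names, rank rule and boundary convention are assumptions made for illustration.

```python
import bisect
from collections import defaultdict


def quantile_boundaries(values, num_buckets):
    """Return num_buckets - 1 boundaries picked from the sorted values."""
    ordered = sorted(values)
    boundaries = []
    for i in range(1, num_buckets):
        # Assumed rank rule: take the element at the i/num_buckets position.
        index = min(len(ordered) - 1, (i * len(ordered)) // num_buckets)
        boundaries.append(ordered[index])
    return boundaries


def bucketize_per_key(values, keys, num_buckets):
    """Bucketize each value using boundaries computed from its own key."""
    grouped = defaultdict(list)
    for value, key in zip(values, keys):
        grouped[key].append(value)
    boundaries = {k: quantile_boundaries(v, num_buckets) for k, v in grouped.items()}
    # bisect_right counts how many of the key's boundaries are <= value.
    return [bisect.bisect_right(boundaries[k], v) for v, k in zip(values, keys)]


values = [1, 2, 3, 4, 100, 200, 300, 400]
keys = ["a", "a", "a", "a", "b", "b", "b", "b"]
buckets = bucketize_per_key(values, keys, num_buckets=2)
print(buckets)  # [0, 0, 1, 1, 0, 0, 1, 1]
```

Although the values for key "b" are two orders of magnitude larger, both keys end up with the same bucket pattern because each key gets its own quantiles.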

Bug Fixes and Other Changes

Memory reduction during vocabulary generation.
Clarify documentation on return values from tft.compute_and_apply_vocabulary and tft.string_to_int.
tft.unit now explicitly creates Beam PCollections and validates the transformed dataset by writing and then reading it from disk.
tft.min, tft.size, tft.sum, tft.scale_to_z_score and tft.bucketize now support tf.SparseTensor.
Fix to tft.scale_to_z_score so it no longer attempts to divide by 0 when the variance is 0.
Fix bug where internal graph analysis didn't handle the case where an operation has control inputs that are operations (as opposed to tensors).
tft.sparse_tensor_to_dense_with_shape added which allows densifying a SparseTensor while specifying the resulting Tensor's shape.
Add load_transform_graph method to TFTransformOutput to load the transform graph without applying it. This has the effect of adding variables to the checkpoint when calling it from the training input_fn when using tf.Estimator.
'tft.vocabulary' and 'tft.compute_and_apply_vocabulary' now accept an optional weights argument. When weights is provided, weighted frequencies are used instead of frequencies based on counts.
'tft.quantiles' and 'tft.bucketize' now accept an optional weights argument. When weights is provided, weighted counts are used for quantiles instead of the counts themselves.
Updated examples to construct the schema using dataset_schema.from_feature_spec.
Updated the census example to allow the 'education-num' feature to be missing and fill in a default value when it is.
Depends on tensorflow-metadata>=0.9,<1.
Depends on apache-beam[gcp]>=2.6,<3.
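The effect of the optional weights argument on vocabulary ordering can be sketched in plain Python. This is an illustration only, not the tft.vocabulary implementation; the function below and its alphabetical tie-breaking are assumptions.

```python
from collections import defaultdict


def vocabulary(tokens, weights=None):
    """Order distinct tokens by descending (weighted) frequency."""
    if weights is None:
        # Without weights every occurrence counts as 1.
        weights = [1.0] * len(tokens)
    totals = defaultdict(float)
    for token, weight in zip(tokens, weights):
        totals[token] += weight
    # Assumed tie-break: alphabetical, to keep the output deterministic.
    return sorted(totals, key=lambda t: (-totals[t], t))


tokens = ["a", "a", "a", "b", "c"]
by_count = vocabulary(tokens)
by_weight = vocabulary(tokens, weights=[0.1, 0.1, 0.1, 5.0, 1.0])
print(by_count)   # ['a', 'b', 'c']
print(by_weight)  # ['b', 'c', 'a']
```

The token "a" is the most frequent by count, but its small weights push it to the end of the weighted ordering.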

Breaking changes

We now validate a Schema in its constructor to make sure that it can be converted to a feature spec. In particular only tf.int64, tf.string and tf.float32 types are allowed.
We now disallow default values for FixedColumnRepresentation.
It is no longer possible to set a default value in the Schema, and validation of shape parameters will occur earlier.
Removed Schema.as_batched_placeholders() method.
Removed all components of DatasetMetadata except the schema, and removed all related classes and code.
Removed the merge method for DatasetMetadata and related classes.
read_metadata can now only read from a single metadata directory and read_metadata and write_metadata no longer accept the versions parameter. They now only read/write the JSON format.
Requires pre-installed TensorFlow >=1.9,<2.

Deprecations

apply_function is no longer needed and is deprecated. apply_function(fn, *args) is now equivalent to fn(*args). tf.Transform is able to handle while loops and tables without the user wrapping the function call in apply_function.
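The stated equivalence can be shown with a stand-in. The apply_function below is a hypothetical plain-Python forwarder written for this illustration, not the library function, and scale is an invented example function.

```python
def apply_function(fn, *args):
    """Hypothetical stand-in that only forwards the call."""
    return fn(*args)


def scale(values, factor):
    """Example user function: multiply each value by a factor."""
    return [v * factor for v in values]


wrapped = apply_function(scale, [1, 2, 3], 10)
direct = scale([1, 2, 3], 10)
print(wrapped == direct)  # True
```

Under this equivalence, migrating means replacing each apply_function(fn, *args) call with fn(*args).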

Release 0.8.0

Major Features and Improvements

Add TFTransformOutput utility class that wraps the output of tf.Transform for use in training. This makes it easier to consume the output written by tf.Transform (see the updated examples for usage).
Increase efficiency of quantiles (and therefore bucketize).

Bug Fixes and Other Changes

Change tft.sum/tft.mean/tft.var to only support basic numeric types.
Widen the output type of tft.sum for some input types to avoid overflow and/or to preserve precision.
For int32 and int64 input types, change the output type of tft.mean/tft.var/tft.scale_to_z_score from float64 to float32.
Change the output type of tft.size to be always int64.
Context now accepts passthrough_keys which can be used when additional information should be attached to dataset instances in the pipeline which should not be part of the transformation graph, for example: instance keys.
In addition to using TFTransformOutput, the examples demonstrate new workflows where a vocabulary is computed, but not applied, in the preprocessing_fn.
Added dependency on the absl-py package.
TransformTestCase test cases can now be parameterized.
Add support for partitioned variables when loading a model.
Export the coders subpackage so that users can access it as tft.coders, e.g. tft.coders.ExampleProtoCoder.
Setting dtypes for numpy arrays in tft.coders.ExampleProtoCoder and tft.coders.CsvCoder.
tft.mean, tft.max and tft.var now support tf.SparseTensor.
Update examples to use "core" TensorFlow estimator API (tf.estimator).
Depends on protobuf>=3.6.0,<4.

Breaking changes

apply_saved_transform is removed. See note on partially_apply_saved_transform in the Deprecations section.
No longer set vocabulary_file in IntDomain when using tft.compute_and_apply_vocabulary or tft.apply_vocabulary.
Requires pre-installed TensorFlow >=1.8,<2.

Deprecations

The expected_asset_file_contents of TransformTestCase.assertAnalyzeAndTransformResults has been deprecated; use expected_vocab_file_contents instead.
transform_fn_io.TRANSFORMED_METADATA_DIR and transform_fn_io.TRANSFORM_FN_DIR should not be used; they are now aliases for TFTransformOutput.TRANSFORMED_METADATA_DIR and TFTransformOutput.TRANSFORM_FN_DIR respectively.
partially_apply_saved_transform is deprecated; users should use the transform_raw_features method of TFTransformOutput instead. The two differ in that partially_apply_saved_transform can also return both the input placeholders and the outputs, but users do not need this functionality because they will typically create the input placeholders themselves based on the feature spec.
Renamed tft.uniques to tft.vocabulary, tft.string_to_int to tft.compute_and_apply_vocabulary and tft.apply_vocab to tft.apply_vocabulary. The existing methods will remain for a few more minor releases but are now deprecated; please migrate to the new names.
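For codebases with many call sites, the renames above could be applied mechanically. The following is a rough sketch of such a helper; migrate_source is a hypothetical name, and the approach (a plain text substitution) assumes the module is imported as tft.

```python
import re

# Deprecated name -> replacement, copied from the renames listed above.
RENAMES = {
    "tft.uniques": "tft.vocabulary",
    "tft.string_to_int": "tft.compute_and_apply_vocabulary",
    "tft.apply_vocab": "tft.apply_vocabulary",
}

# The trailing \b keeps "tft.apply_vocab" from matching inside
# "tft.apply_vocabulary", so already-migrated code is left alone.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in RENAMES) + r")\b"
)


def migrate_source(source):
    """Replace deprecated tft names in a source string (hypothetical helper)."""
    return _PATTERN.sub(lambda match: RENAMES[match.group(0)], source)


old = "ids = tft.string_to_int(x); v = tft.uniques(x); w = tft.apply_vocabulary(x, f)"
new = migrate_source(old)
print(new)
```

A text substitution like this does not understand Python syntax, so its output should be reviewed before committing.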

Release 0.6.0

Major Features and Improvements

Bug Fixes and Other Changes

Depends on apache-beam[gcp]>=2.4,<3.
Trim min/max value in tft.bucketize where the computed number of bucket boundaries is more than requested. Updated documentation to clearly indicate that the number of buckets is computed using approximate algorithms, and that the computed number can be more or less than requested.
Change the namespace used for Beam metrics from tensorflow_transform to tfx.Transform.
Update Beam metrics to also log vocabulary sizes.
CsvCoder updated to support unicode.
Update examples to not use the coder argument for IO, and instead use a separate beam.Map to encode/decode data.

Breaking changes

Requires pre-installed TensorFlow >=1.6,<2.

Deprecations

Release 0.5.0

Major Features and Improvements

Batching of input instances is now done automatically and dynamically.
Added analyzers to compute covariance matrices (tft.covariance) and principal components for PCA (tft.pca).
CombinerSpec and combine_analyzer now accept multiple inputs/outputs.

Bug Fixes and Other Changes

Depends on apache-beam[gcp]>=2.3,<3.
Fixes a bug where TransformDataset would not return correct output if the output DatasetMetadata contained deferred values (such as vocabularies).
Added checks that the preprocessing function's outputs all have the same size in the batch dimension.
Added tft.apply_buckets which takes an input tensor and a list of bucket boundaries, and returns bucketized data.
tft.bucketize and tft.apply_buckets now set metadata for the output tensor, which means the resulting tf.Metadata for the output of these functions will contain min and max values based on the number of buckets, and also be set to categorical.
Testing helper function assertAnalyzeAndTransformResults can now also test the content of vocabulary files and other assets.
Reduces the number of beam stages needed for certain analyzers, which can be a performance bottleneck when transforming many features.
Performance improvements in tft.uniques.
Fix a bug in tft.bucketize where a bucket boundary could be the same as a min/max value and was getting dropped.
Allows scaling individual components of a tensor independently with tft.scale_by_min_max, tft.scale_to_0_1, and tft.scale_to_z_score.
Fix a bug where apply_saved_transform could only be applied in the global name scope.
Add a warning when frequency_threshold is <= 1. Such a value is a no-op and generally reflects mistaking frequency_threshold for a relative frequency when in fact it is an absolute frequency.
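The mapping performed by tft.apply_buckets, from an input value and a list of bucket boundaries to a bucket index, can be sketched in plain Python. This is an illustration rather than the library code, and the convention that a value equal to a boundary falls into the higher bucket is an assumption.

```python
import bisect


def apply_buckets(values, bucket_boundaries):
    """Map each value to the index of the bucket it falls into."""
    # Assumed convention: bucket i holds boundaries[i-1] <= x < boundaries[i].
    return [bisect.bisect_right(bucket_boundaries, v) for v in values]


indices = apply_buckets([-5, 0, 3, 10, 42], [0, 10])
print(indices)  # [0, 1, 1, 2, 2]
```

With two boundaries the output takes values from 0 to 2, which is why the output metadata can record min and max values derived from the number of buckets.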

Breaking changes

The interfaces of CombinerSpec and combine_analyzer have changed to allow for multiple inputs/outputs.
Requires pre-installed TensorFlow >=1.5,<2.

Deprecations

Release 0.4.0

Major Features and Improvements

Added a combine_analyzer() that supports a user-provided combiner conforming to beam.CombineFn. This allows users to implement custom combiners (e.g. median) to complement the analyzers (like min, max) that are prepackaged in TFT.
Quantiles Analyzer (tft.quantiles), with a corresponding tft.bucketize mapper.
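As a rough idea of what a custom median combiner might look like, the class below follows the four-method shape of beam.CombineFn (create_accumulator, add_input, merge_accumulators, extract_output) in plain Python, without importing Beam. The class name is hypothetical, the sketch keeps every element in memory, and how such a combiner is passed to combine_analyzer is not shown because it is not described here.

```python
class MedianCombiner:
    """Plain-Python combiner mirroring the beam.CombineFn method names."""

    def create_accumulator(self):
        # The accumulator is simply the list of elements seen so far.
        return []

    def add_input(self, accumulator, element):
        accumulator.append(element)
        return accumulator

    def merge_accumulators(self, accumulators):
        merged = []
        for accumulator in accumulators:
            merged.extend(accumulator)
        return merged

    def extract_output(self, accumulator):
        if not accumulator:
            return None  # No data: there is no median to report.
        ordered = sorted(accumulator)
        middle = len(ordered) // 2
        if len(ordered) % 2:
            return ordered[middle]
        return (ordered[middle - 1] + ordered[middle]) / 2


combiner = MedianCombiner()
# Simulate two workers that each accumulate part of the data.
left = combiner.create_accumulator()
for x in [5, 1, 9]:
    left = combiner.add_input(left, x)
right = combiner.create_accumulator()
for x in [3, 7]:
    right = combiner.add_input(right, x)
median = combiner.extract_output(combiner.merge_accumulators([left, right]))
print(median)  # 5
```

Keeping all elements makes the median exact but unbounded in memory; a production combiner would likely use an approximate summary instead.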

Bug Fixes and Other Changes

Depends on apache-beam[gcp]>=2.2,<3.
Fixes some KeyError issues that appeared in certain circumstances when calling AnalyzeAndTransformDataset (due to a now-fixed Apache Beam bug: https://issues.apache.org/jira/projects/BEAM/issues/BEAM-2966).
Allow all functions that accept and return tensors, to accept an optional name scope, in line with TensorFlow coding conventions.
Update examples to construct input functions by hand instead of using helper functions.
Change scale_by_min_max/scale_to_0_1 to return the average(min, max) of the range in case all values are identical.
Added export of serving model to examples.
Use "core" version of feature columns (tf.feature_column instead of tf.contrib) in examples.
A few bug fixes and improvements for coders regarding Python 3.
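The identical-values behavior of scale_by_min_max/scale_to_0_1 noted above can be sketched as follows. This is a plain-Python illustration, not the library code; the parameter names output_min and output_max are used here as assumptions.

```python
def scale_by_min_max(values, output_min=0.0, output_max=1.0):
    """Linearly rescale values into [output_min, output_max]."""
    lowest, highest = min(values), max(values)
    if lowest == highest:
        # All values identical: return the midpoint of the output range
        # instead of dividing by a zero-width input range.
        midpoint = (output_min + output_max) / 2
        return [midpoint for _ in values]
    span = output_max - output_min
    return [output_min + (v - lowest) / (highest - lowest) * span for v in values]


print(scale_by_min_max([2, 4, 6]))             # [0.0, 0.5, 1.0]
print(scale_by_min_max([7, 7, 7]))             # [0.5, 0.5, 0.5]
print(scale_by_min_max([7, 7, 7], -1.0, 1.0))  # [0.0, 0.0, 0.0]
```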

Breaking changes

Requires pre-installed TensorFlow >= 1.4.
No longer distributing a WHL file on PyPI. Only a source distribution is provided, which should however be compatible with all platforms (i.e., you are still able to pip install tensorflow-transform and use requirements.txt or setup.py files for environment setup).
Some functions now introduce a new name scope when they did not before so the names of tensors may change. This will only affect you if you directly lookup tensors by name in the graph produced by tf.Transform.
Various Analyzer Specs (_NumericCombineSpec, _UniquesSpec, _QuantilesSpec) are now private. Analyzers are accessible only via the top-level TFT functions (min, max, sum, size, mean, var, uniques, quantiles).

Deprecations

The serving_input_fns in tensorflow_transform/saved/input_fn_maker.py will be removed in a future version and should not be used in new code. See the examples directory for details on how to migrate your code to define your own serving functions.

Release 0.3.1

Major Features and Improvements

We now provide helper methods for creating serving_input_receiver_fn for use with tf.estimator. These mirror the existing functions targeting the legacy tf.contrib.learn.estimators-- i.e. for each *_serving_input_fn() in input_fn_maker there is now also a *_serving_input_receiver_fn().

Bug Fixes and Other Changes

Introduced tft.apply_vocab, which allows users to separately apply a single vocabulary (as generated by tft.uniques) to several different columns.
Provide a source distribution tar tensorflow-transform-X.Y.Z.tar.gz.

Breaking Changes

The default prefix for tft.string_to_int vocab_filename changed from vocab_string_to_int to vocab_string_to_int_uniques. To make your pipelines resilient to implementation details please set vocab_filename if you are using the generated vocab_filename on a downstream component.

Release 0.3.0

Major Features and Improvements

Added hash_strings mapper.
Write vocabularies as asset files instead of constants in the SavedModel.

Bug Fixes and Other Changes

'tft.tfidf' now adds 1 to idf values so that terms in every document in the corpus have a non-zero tfidf value.
Performance and memory usage improvement when running with Beam runners that use multi-threaded workers.
Performance optimizations in ExampleProtoCoder.
Depends on apache-beam[gcp]>=2.1.1,<3.
Depends on protobuf>=3.3,<4.
Depends on six>=1.9,<1.11.
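The reason for adding 1 to idf values in tft.tfidf can be seen with a small calculation. The formula below, log(N / df) plus an optional 1, is an assumption chosen for illustration; the library's exact formula (for example any smoothing) may differ.

```python
import math


def idf(num_documents, document_frequency, add_one=True):
    """Inverse document frequency, optionally shifted by 1."""
    base = math.log(num_documents / document_frequency)
    return base + 1.0 if add_one else base


# A term that appears in all 4 documents of a 4-document corpus.
term_frequency = 0.25
without_shift = term_frequency * idf(4, 4, add_one=False)
with_shift = term_frequency * idf(4, 4)
print(without_shift)  # 0.0
print(with_shift)     # 0.25
```

Without the shift, a term present in every document has an idf of log(1) = 0 and its tfidf value vanishes; with the shift it keeps a non-zero weight.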

Breaking Changes

Requires pre-installed TensorFlow >= 1.3.
Removed tft.map; use tft.apply_function instead (as needed).
Removed tft.tfidf_weights; use tft.tfidf instead.
beam_metadata_io.WriteMetadata now requires a second pipeline argument (see examples).
A Beam bug will now affect users who call AnalyzeAndTransformDataset in certain circumstances. Roughly speaking, if you call beam.Pipeline() at some point (as all our examples do) you will not experience this bug. The bug is characterized by an error similar to KeyError: (u'AnalyzeAndTransformDataset/AnalyzeDataset/ComputeTensorValues/Extract[Maximum:0]', None). This bug will be fixed in Beam 2.2.

Release 0.1.10

Major Features and Improvements

Add json-example serving input functions to TF.Transform.
Add variance analyzer to tf.transform.

Bug Fixes and Other Changes

Remove duplication in output of tft.tfidf.
Ensure ngrams output dense_shape is greater than or equal to 0.
Alters the behavior and interface of tensorflow_transform.mappers.ngrams.
Depends on apache-beam[gcp]>=2,<3.
Making TF Parallelism runner-dependent.
Fixes issue with csv serving input function.
Various performance and stability improvements.

Deprecations

tft.map will be removed in version 0.2.0; see the examples directory for instructions on how to use tft.apply_function instead (as needed).
tft.tfidf_weights will be removed in version 0.2.0; use tft.tfidf instead.

Release 0.1.9

Major Features and Improvements

Refactor internals to remove Column and Statistic classes

Bug Fixes and Other Changes

Remove collections from graph to avoid warnings.
Return float32 from tfidf_weights.
Update tensorflow_transform to use tf.saved_model APIs.
Add default values on example proto coder.
Various performance and stability improvements.
