This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-651] MXNet Model Backwards Compatibility Checker (#11626)
* Added MNIST-MLP-Module-API models to check model save and load_checkpoint methods

* Added LENET with Conv2D operator training file

* Added LENET with Conv2D operator inference file

* Added LanguageModelling with RNN training file

* Added LanguageModelling with RNN inference file

* Added hybridized LENET Gluon Model training file

* Added hybridized LENET gluon model inference file

* Added license headers

* Refactored the model and inference files and extracted out duplicate code in a common file

* Added runtime function for executing the MBCC files

* Added JenkinsFile for MBCC to be run as a nightly job

* Added boto3 install for s3 uploads

* Added README for MBCC

* Added license header

* Added more common functions from lm_rnn_gluon_train and inference files into common.py to clean up code

* Added scripts for training models on older versions of MXNet

* Added check for preventing inference script from crashing in case no trained models are found

* Fixed indentation issue

* Replaced Penn Tree Bank Dataset with Sherlock Holmes Dataset

* Fixed indentation issue

* Removed training in models and added smaller models. Now we are simply checking a forward pass in the model with dummy data.

* Updated README

* Fixed indentation error

* Fixed indentation error

* Removed code duplication in the training file

* Added comments for runtime_functions script for training files

* Merged S3 Buckets for storing data and models into one

* Automated the process to fetch MXNet versions from git tags

* Added defensive checks for the case where the data might not be found

* Fixed issue where we were performing inference on state model files

* Replaced print statements with logging ones

* Removed boto install statements and move them into ubuntu_python docker

* Separated training and uploading of models into separate files so that training runs in Docker and upload runs outside Docker

* Fixed pylint warnings

* Updated comments and README

* Removed the venv for training process

* Fixed indentation in the MBCC Jenkins file and also separated out training and inference into two separate stages

* Fixed indentation

* Fixed erroneous single quote

* Added --user flag to check for Jenkins error

* Removed unused methods

* Added force flag in the pip command to install mxnet

* Removed the force-re-install flag

* Changed exit 1 to exit 0

* Added quotes around the shell command

* Added pack_lib and unpack_lib for MXNet builds

* Changed PythonPath from relative to absolute

* Created dedicated bucket with correct permission

* Fix for python path in training

* Changed bucket name to CI bucket

* Added set -ex to the upload shell script

* Now raising an exception if no models are found in the S3 bucket

* Added regex to train models script

* Added check for performing inference only on models trained on same major versions

* Added set -ex flags to shell scripts

* Added multi-version regex checks in training

* Fixed typo in regex

* Now we will train models for all the minor versions for a given major version by traversing the tags

* Added check for validating current_version
piyushghai authored and marcoabreu committed Jul 31, 2018
1 parent 7ffb252 commit a56a569
Showing 10 changed files with 801 additions and 2 deletions.
4 changes: 2 additions & 2 deletions ci/docker/install/ubuntu_python.sh
@@ -29,5 +29,5 @@ wget -nv https://bootstrap.pypa.io/get-pip.py
 python3 get-pip.py
 python2 get-pip.py
 
-pip2 install nose cpplint==1.3.0 pylint==1.8.3 'numpy<1.15.0,>=1.8.2' nose-timer 'requests<2.19.0,>=2.18.4' h5py==2.8.0rc1 scipy==1.0.1
-pip3 install nose cpplint==1.3.0 pylint==1.8.3 'numpy<1.15.0,>=1.8.2' nose-timer 'requests<2.19.0,>=2.18.4' h5py==2.8.0rc1 scipy==1.0.1
+pip2 install nose cpplint==1.3.0 pylint==1.8.3 'numpy<1.15.0,>=1.8.2' nose-timer 'requests<2.19.0,>=2.18.4' h5py==2.8.0rc1 scipy==1.0.1 boto3
+pip3 install nose cpplint==1.3.0 pylint==1.8.3 'numpy<1.15.0,>=1.8.2' nose-timer 'requests<2.19.0,>=2.18.4' h5py==2.8.0rc1 scipy==1.0.1 boto3
14 changes: 14 additions & 0 deletions ci/docker/runtime_functions.sh
@@ -899,6 +899,20 @@ nightly_test_javascript() {
    make -C /work/mxnet/amalgamation libmxnet_predict.js MIN=1 EMCC=/work/deps/emscripten/emcc
}

#Tests Model backwards compatibility on MXNet
nightly_model_backwards_compat_test() {
    set -ex
    export PYTHONPATH=/work/mxnet/python/
    ./tests/nightly/model_backwards_compatibility_check/model_backward_compat_checker.sh
}

#Backfills S3 bucket with models trained on earlier versions of mxnet
nightly_model_backwards_compat_train() {
    set -ex
    export PYTHONPATH=./python/
    ./tests/nightly/model_backwards_compatibility_check/train_mxnet_legacy_models.sh
}

# Nightly 'MXNet: The Straight Dope' Single-GPU Tests
nightly_straight_dope_python2_single_gpu_tests() {
    set -ex
120 changes: 120 additions & 0 deletions tests/nightly/model_backwards_compatibility_check/JenkinsfileForMBCC
@@ -0,0 +1,120 @@
// -*- mode: groovy -*-
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.


//This is a Jenkinsfile for the model backwards compatibility checker. The format and some functions have been picked up from the top-level Jenkinsfile.

err = null
mx_lib = 'lib/libmxnet.so, lib/libmxnet.a, 3rdparty/dmlc-core/libdmlc.a, 3rdparty/tvm/nnvm/lib/libnnvm.a'

def init_git() {
  deleteDir()
  retry(5) {
    try {
      timeout(time: 15, unit: 'MINUTES') {
        checkout scm
        sh 'git submodule update --init --recursive'
        sh 'git clean -d -f'
      }
    } catch (exc) {
      deleteDir()
      // Pause before failing this attempt; 'error' throws, so anything after it would never run
      sleep 2
      error "Failed to fetch source codes with ${exc}"
    }
  }
}

// pack libraries for later use
def pack_lib(name, libs=mx_lib) {
  sh """
echo "Packing ${libs} into ${name}"
echo ${libs} | sed -e 's/,/ /g' | xargs md5sum
"""
  stash includes: libs, name: name
}

// unpack libraries saved before
def unpack_lib(name, libs=mx_lib) {
  unstash name
  sh """
echo "Unpacked ${libs} from ${name}"
echo ${libs} | sed -e 's/,/ /g' | xargs md5sum
"""
}

def docker_run(platform, function_name, use_nvidia, shared_mem = '500m') {
  def command = "ci/build.py --docker-registry ${env.DOCKER_CACHE_REGISTRY} %USE_NVIDIA% --platform %PLATFORM% --shm-size %SHARED_MEM% /work/runtime_functions.sh %FUNCTION_NAME%"
  command = command.replaceAll('%USE_NVIDIA%', use_nvidia ? '--nvidiadocker' : '')
  command = command.replaceAll('%PLATFORM%', platform)
  command = command.replaceAll('%FUNCTION_NAME%', function_name)
  command = command.replaceAll('%SHARED_MEM%', shared_mem)

  sh command
}

try {
  stage('MBCC Train'){
    node('restricted-mxnetlinux-cpu') {
      ws('workspace/modelBackwardsCompat') {
        init_git()
        // Train models on older versions
        docker_run('ubuntu_nightly_cpu', 'nightly_model_backwards_compat_train', false)
        // upload files to S3 here outside of the docker environment
        sh "./tests/nightly/model_backwards_compatibility_check/upload_models_to_s3.sh"
      }
    }
  }

  stage('MXNet Build'){
    node('restricted-mxnetlinux-cpu') {
      ws('workspace/build-cpu') {
        init_git()
        docker_run('ubuntu_cpu','build_ubuntu_cpu', false)
        pack_lib('cpu', mx_lib)
      }
    }
  }

  stage('MBCC Inference'){
    node('restricted-mxnetlinux-cpu') {
      ws('workspace/modelBackwardsCompat') {
        init_git()
        unpack_lib('cpu', mx_lib)
        // Perform inference on the latest version of MXNet
        docker_run('ubuntu_nightly_cpu', 'nightly_model_backwards_compat_test', false)
      }
    }
  }
} catch (caughtError) {
  node("restricted-mxnetlinux-cpu") {
    sh "echo caught ${caughtError}"
    err = caughtError
    currentBuild.result = "FAILURE"
  }
} finally {
  node("restricted-mxnetlinux-cpu") {
    // Only send email if model backwards compat test failed
    if (currentBuild.result == "FAILURE") {
      emailext body: 'Nightly tests for model backwards compatibility on MXNet branch : ${BRANCH_NAME} failed. Please view the build at ${BUILD_URL}', replyTo: '${EMAIL}', subject: '[MODEL BACKWARDS COMPATIBILITY TEST FAILED] build ${BUILD_NUMBER}', to: '${EMAIL}'
    }
    // Remember to rethrow so the build is marked as failing
    if (err) {
      throw err
    }
  }
}
25 changes: 25 additions & 0 deletions tests/nightly/model_backwards_compatibility_check/README.md
@@ -0,0 +1,25 @@
# Model Backwards Compatibility Tests

This folder contains the scripts required to run the nightly job that verifies the compatibility and inference results of models (trained on earlier versions of MXNet) when they are loaded on the latest release candidate. The tests flag a failure if:
- The models fail to load on the latest version of MXNet.
- The inference results are different.
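
The two checks above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code used by the nightly job: `check_model`, `load_model`, `run_inference` and the tolerance values are hypothetical names and numbers chosen for the example, and results are treated as flat lists of floats.

```python
import math


def check_model(load_model, run_inference, expected, rel_tol=1e-5, abs_tol=1e-8):
    """Return None if both checks pass, otherwise a short failure reason.

    load_model and run_inference are caller-supplied callables (hypothetical
    stand-ins for loading a saved model and running a forward pass on it).
    """
    # Check 1: the model must load on the current version.
    try:
        model = load_model()
    except Exception as exc:
        return "load failed: {}".format(exc)

    # Check 2: the fresh inference results must match the stored ones.
    actual = run_inference(model)
    if len(actual) != len(expected):
        return "inference results differ in length"
    for got, want in zip(actual, expected):
        if not math.isclose(got, want, rel_tol=rel_tol, abs_tol=abs_tol):
            return "inference results differ"
    return None
```

A caller would pass the stored inference output as `expected` and treat any non-`None` return value as a test failure.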


## JenkinsfileForMBCC
This is the configuration file for the Jenkins job.

## Details
- Currently, the APIs covered for model saving/loading are: do_checkpoint/load_checkpoint, save_params/load_params, save_parameters/load_parameters (added from v1.2.1 onwards), and export/gluon.SymbolBlock.imports.
- These APIs are exercised on models with architectures such as MLP, RNNs, LeNet, and LSTMs, covering the four save/load scenarios listed above.
- More operators/models will be added in the future to extend the operator coverage.
- Each model training file is suffixed with `_train.py`, and the trained models are hosted in AWS S3.
- For now, the trained models are backfilled into S3 for every MXNet release version starting from v1.1.0 via `train_mxnet_legacy_models.sh`.
- The `train_mxnet_legacy_models.sh` script checks out the previous two releases using the git tag command, then trains models on those MXNet versions and uploads them to S3.
- The S3 bucket's folder structure looks like this:
    * `1.1.0/<model-1-files>` `1.1.0/<model-2-files>`
    * `1.2.0/<model-1-files>` `1.2.0/<model-2-files>`
- `<model-1-files>` is itself a folder that contains the trained model's symbol definitions, the toy datasets it was trained on, the weights and parameters of the model, and other relevant files required to reload the model.
- Over time, the training script will accumulate a repository of models trained on several versions of MXNet (both major and minor releases).
- The inference part is checked via the script `model_backwards_compat_inference.sh`.
- The inference script scans the S3 bucket for MXNet version folders as described above and runs the inference code for each model folder found.
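
To make the folder layout concrete, here is a small sketch of how a flat list of S3 object keys could be grouped into the `<version>/<model>/<file>` structure described above, keeping only versions that share the current major version. It works on plain strings rather than calling S3, and `group_keys_by_version` and its parameters are hypothetical names used only for illustration.

```python
from collections import defaultdict


def group_keys_by_version(keys, current_version, skip_folders=()):
    """Map version folder -> model folder -> list of file names.

    Only keys of the form <version>/<model>/<file> are kept, and only for
    versions whose major number matches current_version.
    """
    current_major = current_version.split('.')[0]
    layout = defaultdict(lambda: defaultdict(list))
    for key in keys:
        parts = key.split('/')
        if len(parts) != 3:
            continue  # not of the form <version>/<model>/<file>
        version, model, file_name = parts
        if version in skip_folders or not file_name:
            continue  # skip non-version folders and bare folder markers
        if version.split('.')[0] != current_major:
            continue  # only compare models within the same major version
        layout[version][model].append(file_name)
    # Convert the nested defaultdicts into plain dicts for the caller.
    return {version: dict(models) for version, models in layout.items()}
```

For example, with keys under `1.1.0/`, `1.2.0/` and `0.12.1/` and a current version of `1.3.0`, only the `1.1.0` and `1.2.0` folders would be returned.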

214 changes: 214 additions & 0 deletions tests/nightly/model_backwards_compatibility_check/common.py
@@ -0,0 +1,214 @@
#!/usr/bin/env python

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.


import boto3
import mxnet as mx
import os
import numpy as np
import logging
from mxnet import gluon
import mxnet.ndarray as F
from mxnet.gluon import nn
import re
from mxnet.test_utils import assert_almost_equal

# Set fixed random seeds.
mx.random.seed(7)
np.random.seed(7)
logging.basicConfig(level=logging.INFO)

# get the current mxnet version we are running on
mxnet_version = mx.__version__
model_bucket_name = 'mxnet-ci-prod-backwards-compatibility-models'
data_folder = 'mxnet-model-backwards-compatibility-data'
backslash = '/'
s3 = boto3.resource('s3')
ctx = mx.cpu(0)


def get_model_path(model_name):
    return os.path.join(os.getcwd(), 'models', str(mxnet_version), model_name)


def get_module_api_model_definition():
    input = mx.symbol.Variable('data')
    input = mx.symbol.Flatten(data=input)

    fc1 = mx.symbol.FullyConnected(data=input, name='fc1', num_hidden=128)
    act1 = mx.sym.Activation(data=fc1, name='relu1', act_type="relu")
    fc2 = mx.symbol.FullyConnected(data=act1, name='fc2', num_hidden=2)
    op = mx.symbol.SoftmaxOutput(data=fc2, name='softmax')
    model = mx.mod.Module(symbol=op, context=ctx, data_names=['data'], label_names=['softmax_label'])
    return model


def save_inference_results(inference_results, model_name):
    assert (isinstance(inference_results, mx.ndarray.ndarray.NDArray))
    save_path = os.path.join(get_model_path(model_name), ''.join([model_name, '-inference']))

    mx.nd.save(save_path, {'inference': inference_results})


def load_inference_results(model_name):
    inf_dict = mx.nd.load(model_name+'-inference')
    return inf_dict['inference']


def save_data_and_labels(test_data, test_labels, model_name):
    assert (isinstance(test_data, mx.ndarray.ndarray.NDArray))
    assert (isinstance(test_labels, mx.ndarray.ndarray.NDArray))

    save_path = os.path.join(get_model_path(model_name), ''.join([model_name, '-data']))
    mx.nd.save(save_path, {'data': test_data, 'labels': test_labels})


def clean_model_files(files, model_name):
    files.append(model_name + '-inference')
    files.append(model_name + '-data')

    for file in files:
        if os.path.isfile(file):
            os.remove(file)


def download_model_files_from_s3(model_name, folder_name):
    model_files = list()
    bucket = s3.Bucket(model_bucket_name)
    prefix = folder_name + backslash + model_name
    model_files_meta = list(bucket.objects.filter(Prefix=prefix))
    if len(model_files_meta) == 0:
        logging.error('No trained models found under path : %s', prefix)
        return model_files
    for obj in model_files_meta:
        # Keys have the form <version>/<model>/<file>; keep only the file name
        file_name = obj.key.split('/')[2]
        model_files.append(file_name)
        # Download this file
        bucket.download_file(obj.key, file_name)

    return model_files


def get_top_level_folders_in_bucket(s3client, bucket_name):
    # This function returns the top level folders in the S3Bucket.
    # These folders help us to navigate to the trained model files stored for different MXNet versions.
    bucket = s3client.Bucket(bucket_name)
    result = bucket.meta.client.list_objects(Bucket=bucket.name, Delimiter=backslash)
    folder_list = list()
    if 'CommonPrefixes' not in result:
        logging.error('No trained models found in S3 bucket : %s for this file. '
                      'Please train the models and run inference again' % bucket_name)
        raise Exception("No trained models found in S3 bucket : %s for this file. "
                        "Please train the models and run inference again" % bucket_name)
    for obj in result['CommonPrefixes']:
        folder_name = obj['Prefix'].strip(backslash)
        # We only compare models from the same major versions. i.e. 1.x.x compared with latest 1.y.y etc
        if str(folder_name).split('.')[0] != str(mxnet_version).split('.')[0]:
            continue
        # The top level folders contain MXNet Version # for trained models. Skipping the data folder here
        if folder_name == data_folder:
            continue
        folder_list.append(folder_name)

    if len(folder_list) == 0:
        logging.error('No trained models found in S3 bucket : %s for this file. '
                      'Please train the models and run inference again' % bucket_name)
        raise Exception("No trained models found in S3 bucket : %s for this file. "
                        "Please train the models and run inference again" % bucket_name)
    return folder_list


def create_model_folder(model_name):
    path = get_model_path(model_name)
    if not os.path.exists(path):
        os.makedirs(path)


class Net(gluon.Block):
    def __init__(self, **kwargs):
        super(Net, self).__init__(**kwargs)
        with self.name_scope():
            # layers created in name_scope will inherit name space
            # from parent layer.
            self.conv1 = nn.Conv2D(20, kernel_size=(5, 5))
            self.pool1 = nn.MaxPool2D(pool_size=(2, 2), strides=(2, 2))
            self.conv2 = nn.Conv2D(50, kernel_size=(5, 5))
            self.pool2 = nn.MaxPool2D(pool_size=(2, 2), strides=(2, 2))
            self.fc1 = nn.Dense(500)
            self.fc2 = nn.Dense(2)

    def forward(self, x):
        x = self.pool1(F.tanh(self.conv1(x)))
        x = self.pool2(F.tanh(self.conv2(x)))
        # 0 means copy over size from corresponding dimension.
        # -1 means infer size from the rest of dimensions.
        x = x.reshape((0, -1))
        x = F.tanh(self.fc1(x))
        x = F.tanh(self.fc2(x))
        return x


class HybridNet(gluon.HybridBlock):
    def __init__(self, **kwargs):
        super(HybridNet, self).__init__(**kwargs)
        with self.name_scope():
            # layers created in name_scope will inherit name space
            # from parent layer.
            self.conv1 = nn.Conv2D(20, kernel_size=(5, 5))
            self.pool1 = nn.MaxPool2D(pool_size=(2, 2), strides=(2, 2))
            self.conv2 = nn.Conv2D(50, kernel_size=(5, 5))
            self.pool2 = nn.MaxPool2D(pool_size=(2, 2), strides=(2, 2))
            self.fc1 = nn.Dense(500)
            self.fc2 = nn.Dense(2)

    def hybrid_forward(self, F, x):
        x = self.pool1(F.tanh(self.conv1(x)))
        x = self.pool2(F.tanh(self.conv2(x)))
        # 0 means copy over size from corresponding dimension.
        # -1 means infer size from the rest of dimensions.
        x = x.reshape((0, -1))
        x = F.tanh(self.fc1(x))
        x = F.tanh(self.fc2(x))
        return x


class SimpleLSTMModel(gluon.Block):
    def __init__(self, **kwargs):
        super(SimpleLSTMModel, self).__init__(**kwargs)
        with self.name_scope():
            self.model = mx.gluon.nn.Sequential(prefix='')
            with self.model.name_scope():
                self.model.add(mx.gluon.nn.Embedding(30, 10))
                self.model.add(mx.gluon.rnn.LSTM(20))
                self.model.add(mx.gluon.nn.Dense(100))
                self.model.add(mx.gluon.nn.Dropout(0.5))
                self.model.add(mx.gluon.nn.Dense(2, flatten=True, activation='tanh'))

    def forward(self, x):
        return self.model(x)


def compare_versions(version1, version2):
    '''
    https://stackoverflow.com/questions/1714027/version-number-comparison-in-python
    '''
    def normalize(v):
        return [int(x) for x in re.sub(r'(\.0+)*$', '', v).split(".")]

    v1, v2 = normalize(version1), normalize(version2)
    # cmp() was removed in Python 3; this expression yields the same -1, 0 or 1
    return (v1 > v2) - (v1 < v2)
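
The same normalisation idea can also serve as a sort key, for instance to order version folders from oldest to newest. The sketch below is only an illustration and is not part of `common.py`: `version_key` is a hypothetical helper, the folder names are made-up examples, and it assumes purely numeric dotted versions.

```python
import re


def version_key(version):
    """Turn '1.2.0' into [1, 2] so that versions sort numerically."""
    # Strip trailing '.0' groups so that '1.2' and '1.2.0' compare equal.
    trimmed = re.sub(r'(\.0+)*$', '', version)
    return [int(part) for part in trimmed.split('.')]


# Example folder names (hypothetical); a plain string sort would misplace 1.10.0.
folders = ['1.10.0', '1.2.1', '1.2.0', '1.1.0']
ordered = sorted(folders, key=version_key)
```

Using a key function avoids the need for a `cmp`-style comparison altogether, which is why it is often preferred on Python 3.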