Add a xmap decorator into reader module for optimizing performance #2242

wanghaoshuang · 2017-05-23T16:29:08Z

Add flowers dataset reader for image classification model.
Add a xmap decorator into reader module for optimizing performance of image data reader.
Fix #2241

wangkuiyi · 2017-05-23T16:57:12Z

python/paddle/v2/dataset/flowers.py

+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+CIFAR dataset.


I think we already have a cifar.py.

Sorry, I forgot to delete this line. Actually, it is the flowers dataset, which has more classes.

wangkuiyi · 2017-05-23T17:10:12Z

python/paddle/v2/dataset/flowers.py

+SETID_MD5 = 'a5357ecc9cb78c4bef273ce3793fc85c'
+
+
+def extract_file(tarFile):


Let's see whether we can read the data without untarring the tarball. This matters because we will run these demos on Paddle Cloud, and distributed filesystems like CephFS handle a few big files much better than many small ones. This determines the efficiency of disk I/O.

A good example that doesn't extract all files is here: https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/dataset/imikolov.py#L56

Another good one is this: https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/dataset/cifar.py#L60

Please note that tarfile.extractall extracts all files in a tarball to disk (into the current working directory by default), whereas [tarfile.extractfile](https://docs.python.org/2/library/tarfile.html#tarfile.TarFile.extractfile) doesn't write anything to disk; it returns a file-like object for reading the member's content.

Got it. Thanks for the important suggestion. I will optimize my code.
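
For reference, the suggestion in this thread can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR; the helper name `read_member_bytes` and its arguments are hypothetical.

```python
import io
import tarfile


def read_member_bytes(tar_path, member_name):
    """Return the raw bytes of one archive member without unpacking the tarball.

    Hypothetical helper for illustration only.
    """
    with tarfile.open(tar_path, mode="r") as tar:
        # extractfile() returns a file-like object; nothing is written to disk.
        member = tar.extractfile(member_name)
        if member is None:
            # extractfile() returns None for members that are not regular files.
            raise ValueError("%s is not a regular file" % member_name)
        return member.read()
```

The returned bytes can then be decoded in memory (for example with an image decoder), so the filesystem only ever sees the single tarball.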

wangkuiyi · 2017-05-23T17:13:45Z

python/paddle/v2/dataset/flowers.py

+    '''
+    map image bytes data to type needed by model input layer
+    '''
+    img, label = sample


This module seems to read many images from the tarball. If so, it would be great to call TarFile.next(), which returns a TarInfo object that can then be passed to tarfile.extractfile. TarFile.next() walks the files in the tarball one by one, in archive order. This reduces the number of disk seeks, that is, the moves of the disk's magnetic head.
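
A rough sketch of that sequential access pattern, assuming a plain tarball of regular files. The generator name `iter_tar_files` is hypothetical and is not part of this PR.

```python
import tarfile


def iter_tar_files(tar_path):
    """Yield (name, raw_bytes) for each regular file, in archive order.

    Hypothetical helper for illustration only.
    """
    with tarfile.open(tar_path, mode="r") as tar:
        info = tar.next()  # TarInfo of the next member, or None at the end
        while info is not None:
            if info.isfile():
                yield info.name, tar.extractfile(info).read()
            info = tar.next()
```

Because members are visited in the order they are stored, the archive is read front to back instead of seeking to each file by name.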

wangkuiyi · 2017-05-23T17:15:28Z

python/paddle/v2/reader/decorator.py

+
+def xmap(mapper, reader, process_num, buffer_size):
+    """
+    Use multiprocess to map samples from reader by a mapper defined by user.


I vaguely remember that @helinwang had a function which uses multiprocess to accelerate loading. Could @helinwang please confirm?

Yes, it's here: https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/reader/decorator.py#L162

The buffered decorator uses a background thread to fetch the data. If you want a multi-threaded map to speed up reading, you can put a map decorator on top of the buffered decorator.

Hi @helinwang,

The buffered decorator is not a thread-safe data provider, so I can't put a multi-threaded map decorator on top of it directly.

To guarantee that the worker threads read data from the reader safely, the multi-threaded map decorator holds a queue within it, so there is no need to use the buffered decorator.
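
The queue-based design described above could look roughly like this. It is a simplified sketch under stated assumptions, not the decorator merged in this PR: the name `threaded_map_reader` is hypothetical, `reader` is assumed to be a callable that returns an iterator of samples, output order is not preserved, and errors raised by `mapper` are not handled.

```python
import threading

try:
    import queue  # Python 3
except ImportError:
    import Queue as queue  # Python 2


def threaded_map_reader(mapper, reader, thread_num, buffer_size):
    """Map samples from reader() with worker threads (hypothetical sketch)."""
    end = object()  # sentinel marking the end of the stream

    def decorated():
        in_q = queue.Queue(buffer_size)
        out_q = queue.Queue(buffer_size)

        def feed():
            # Only this thread touches the underlying reader, so the reader
            # itself does not need to be thread-safe.
            for sample in reader():
                in_q.put(sample)
            for _ in range(thread_num):
                in_q.put(end)  # one sentinel per worker

        def work():
            while True:
                sample = in_q.get()
                if sample is end:
                    out_q.put(end)
                    return
                out_q.put(mapper(sample))

        threads = [threading.Thread(target=feed)]
        threads += [threading.Thread(target=work) for _ in range(thread_num)]
        for t in threads:
            t.daemon = True
            t.start()

        finished = 0
        while finished < thread_num:
            item = out_q.get()
            if item is end:
                finished += 1
            else:
                yield item

    return decorated
```

The bounded queues give the same back-pressure that a separate buffering step would, which is why the buffered decorator is not needed underneath.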

helinwang · 2017-05-23T17:19:44Z

python/paddle/v2/reader/decorator.py

+
+def xmap(mapper, reader, process_num, buffer_size):
+    """
+    Use multiprocess to map samples from reader by a mapper defined by user.


I think it's better to change multiprocess to multithread.

Yes, you're right!
I used multiprocess to take advantage of multiple CPUs, but synchronization between processes costs too much time.
Experiments indicate that multithread is better than multiprocess in my application.

qingqing01 · 2017-05-24T02:39:46Z

python/paddle/v2/dataset/flowers.py

+    return paddle.reader.xmap(mapper, reader, cpu_count(), 1024 * 8)
+
+
+def create_batch(data_dir,


This is a common function used to make batched data for images. I think it can be moved to v2/image.py.

OK, I will rewrite this function to read images from the tar file directly.

qingqing01 · 2017-05-24T02:51:10Z

python/paddle/v2/dataset/flowers.py

+        data = []
+        labellist = []
+        for index in indexes[start:end]:
+            img_name = "%s/jpg/image_%05d.jpg" % (data_dir, index)


If this function is moved to v2/image.py, img_name should be changed to something more general.

Got it. Thx.

qingqing01 · 2017-05-24T02:55:15Z

python/paddle/v2/image.py

+
+    .. code-block:: python
+        with open('cat.jpg') as f:
+            im = load_image(f.read())


The example usage is not correct.

Sorry. It's my fault.

qingqing01 · 2017-06-05T02:24:06Z

python/paddle/v2/image.py

+except ImportError:
+    cv2 = None
+
+from cv2 import resize


Please remove this line and call cv2.resize explicitly below. That way, `import paddle.v2 as paddle` won't raise an error even when cv2 is not installed.

Got it. Thx.
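
The optional-import pattern being requested can be sketched as below. The function name `resize_image` is hypothetical and only illustrates the idea; it is not the function in image.py.

```python
try:
    import cv2
except ImportError:
    cv2 = None  # importing this module still succeeds without OpenCV


def resize_image(im, width, height):
    """Resize an image array; fails only when actually called without cv2.

    Hypothetical helper for illustration only.
    """
    if cv2 is None:
        raise ImportError(
            "resize_image requires cv2 (OpenCV), which is not installed")
    # Look up resize on the module at call time instead of importing the
    # name at module load time.
    return cv2.resize(im, (width, height))
```

With this layout, a missing cv2 only surfaces when an image function is called, not when the package is imported.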

qingqing01 · 2017-06-05T02:55:52Z

python/paddle/v2/reader/decorator.py

+    pass
+
+
+def xmap(mapper, reader, process_num, buffer_size):


How about renaming xmap to xmap_readers? The name is more descriptive.

OK, I have renamed it.

qingqing01 · 2017-06-06T02:26:11Z

LGTM.

wanghaoshuang requested a review from qingqing01 May 23, 2017 16:29

wangkuiyi reviewed May 23, 2017

helinwang reviewed May 23, 2017

qingqing01 reviewed May 24, 2017

wanghaoshuang force-pushed the flowers_reader branch 2 times, most recently from 369ee7b to 2800239 Compare June 2, 2017 02:52

qingqing01 reviewed Jun 5, 2017

[email protected] and others added 3 commits June 5, 2017 16:34

Add flowers dataset for image classification model

2799b0e

xmap: change multiprocess to multithread.

e62a4d7

images reader: read the data without untarring the tarball file. image.py: move batch function from reader to image.py

rename xmap to xmap_readers and remove 'from cv2 import resize' in image.py

990b7d7

wanghaoshuang force-pushed the flowers_reader branch from 1a8fffa to 990b7d7 Compare June 5, 2017 08:44

qingqing01 approved these changes Jun 6, 2017

wanghaoshuang merged commit 3d7a613 into PaddlePaddle:develop Jun 6, 2017

wanghaoshuang deleted the flowers_reader branch June 6, 2017 03:46

wanghaoshuang changed the title ~~Add flowers dataset for image classification model~~ Add a xmap decorator into reader module for optimizing performance Aug 10, 2017
