add recordio doc #248

Merged: 6 commits, Jul 26, 2017.
Changes from 5 commits.
12 changes: 9 additions & 3 deletions doc/tutorial_cn.md
@@ -76,9 +76,12 @@ cd ..
paddlecloud submit -jobname fit-a-line -cpu 1 -gpu 1 -parallelism 1 -entry "python train.py" fit_a_line/
```

As you can see, when submitting the job we specified the job name `-jobname fit-a-line`, the CPU resources `-cpu 1`,
the GPU resources `-gpu 1`, the parallelism `-parallelism 1` (the number of training nodes), the entry command `-entry "python train.py"`,
and the job program directory `fit_a_line/`.
As you can see, when submitting the job we specified the following parameters:
- `-jobname fit-a-line`: the job name
- `-cpu 1`: the CPU resources to use
- `-gpu 1`: the GPU resources to use
- `-parallelism 1`: the parallelism (the number of training nodes)
- `-entry "python train.py"`: the command that starts training
- `fit_a_line/`: the job program directory

***Note 1:*** To see the full description of the job submission parameters, run `paddlecloud submit -h`.

@@ -148,3 +151,6 @@ paddlecloud file get /pfs/dlnel/home/[email protected]/jobs/fit_a_line/output/pas
```bash
paddlecloud kill fit-a-line
```

---
For the full usage documentation, see the [usage guide (Chinese)](./usage_cn.md).
41 changes: 41 additions & 0 deletions doc/usage_cn.md
@@ -68,6 +68,47 @@ scp -r my_training_data_dir/ user@tunnel-server:/mnt/hdfs_mulan/idl/idl-dl/mydir

After a training job is submitted, each training node mounts HDFS under `/pfs/[datacenter_name]/home/[username]/`, so the training program can read the training data from this path and start training.
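
For example, a training script could locate its input files through this mount point. This is a minimal sketch; the datacenter name, username, and directory below are placeholders based on the examples in this document:

```python
import glob
import os

# Placeholder values: substitute your own datacenter, username, and data directory.
data_dir = "/pfs/dlnel/home/your_username/my_training_data_dir"

# Every trainer node sees the same files through the HDFS mount.
train_files = sorted(glob.glob(os.path.join(data_dir, "*")))
```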

### Preprocessing training data with [RecordIO](https://github.com/PaddlePaddle/recordio)
Users can preprocess the data into RecordIO format locally and then upload it to the cluster for training.
Collaborator: 用户可以 => 用户需要 ("users can" => "users need to")

Collaborator (Author): Done.

- Preprocess the data with the RecordIO library
```python
import paddle.v2.dataset as dataset
dataset.convert(output_path="./dataset",
                reader=dataset.uci_housing.train(),
                num_shards=10,
                name_prefix="uci_housing_train")
```
- `output_path`: the output path.
- `reader`: a user-defined [reader](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader); for an example implementation see [paddle.v2.dataset.uci_housing.train()](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/dataset/uci_housing.py#L74). A minimal custom reader sketch follows the file listing below.
- `num_shards`: the number of generated files.
- `name_prefix`: the prefix of the generated file names.

Collaborator (on the `reader` argument): Users may not understand what `reader` means, or they would have to go and read `dataset.uci_housing.train()`. Should we paste that example directly here?

Collaborator (Author): Done, with a link to `reader`.

After it runs successfully, files like the following are generated locally:
```bash
.
./dataset
./dataset/uci_housing_train-00000-of-00009
./dataset/uci_housing_train-00001-of-00009
./dataset/uci_housing_train-00002-of-00009
./dataset/uci_housing_train-00003-of-00009
...
```
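
For reference, a `reader` is simply a function that returns a generator over training samples, as described in the reader design document linked above. Below is a minimal sketch of a custom reader; the file name, parsing logic, and sample layout are hypothetical placeholders, and `paddle.v2.dataset.uci_housing.train()` follows the same pattern:

```python
def my_train_reader():
    # Hypothetical example: each line of a local text file holds the
    # feature values followed by the label, separated by spaces.
    def reader():
        with open("./my_training_data.txt") as f:
            for line in f:
                fields = [float(x) for x in line.split()]
                yield fields[:-1], fields[-1]
    return reader

# It could then be passed to the conversion step above, for example:
# dataset.convert(output_path="./dataset", reader=my_train_reader(),
#                 num_shards=10, name_prefix="my_data_train")
```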

- Write a reader that reads the RecordIO-format files
```python
import cPickle as pickle

def cluster_creator(filename):
    import recordio

    def reader():
        # filename is a glob matching the generated shard files,
        # e.g. "./dataset/uci_housing_train*"
        r = recordio.reader(filename)
        while True:
            d = r.read()
            if not d:
                break
            yield pickle.loads(d)
    return reader
```

Collaborator (on the `recordio.reader(...)` line): Without a master, the reader side has to dispatch the files itself. Let's keep this example using the dispatch approach, and update it once the master feature is stable.

Collaborator (Author): Done. Thanks.
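
To consume this reader in a training script, it can be wrapped like any other PaddlePaddle v2 reader. This is a hedged sketch: `trainer`, `feeding`, and the pass count are assumed to come from the existing fit_a_line training code, the batch and buffer sizes are illustrative, and (as the review above notes) a production setup would have each trainer read only its own subset of shard files:

```python
import paddle.v2 as paddle

# Shuffle and batch the RecordIO-backed reader; the glob must match the
# shard files generated in the preprocessing step above.
train_reader = paddle.batch(
    paddle.reader.shuffle(
        cluster_creator("./dataset/uci_housing_train*"), buf_size=500),
    batch_size=2)

# Hypothetical call into the existing training code:
# trainer.train(reader=train_reader, feeding=feeding, num_passes=30)
```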

### Uploading training data with paddlecloud

The paddlecloud command has built-in support for uploading data; currently this only works in environments whose storage system is CephFS. To upload, run: