add recordio doc #248
@@ -148,3 +148,6 @@ paddlecloud file get /pfs/dlnel/home/[email protected]/jobs/fit_a_line/output/pas
```bash
paddlecloud kill fit-a-line
```

---
For detailed usage instructions, see the [Chinese usage documentation](./usage_cn.md).
@@ -68,6 +68,47 @@ scp -r my_training_data_dir/ user@tunnel-server:/mnt/hdfs_mulan/idl/idl-dl/mydir
After a training job is submitted, each training node mounts HDFS at `/pfs/[datacenter_name]/home/[username]/`, so the training program can read its training data from that path and start training.
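
As a rough illustration of how a training program might locate data under this mount, here is a minimal sketch; the datacenter name, username, and directory below are borrowed from examples elsewhere in this document and are purely illustrative:

```python
import glob
import os

# Illustrative values only; substitute your own datacenter, username,
# and data directory (these are borrowed from other examples in this doc).
DATACENTER = "dlnel"
USERNAME = "[email protected]"
DATA_DIR = "my_training_data_dir"

# The HDFS mount makes uploaded data visible as ordinary local files.
data_path = os.path.join("/pfs", DATACENTER, "home", USERNAME, DATA_DIR)
for f in sorted(glob.glob(os.path.join(data_path, "*"))):
    print(f)  # each entry can be opened like a regular local file
```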
### Preprocessing training data with [RecordIO](https://github.com/PaddlePaddle/recordio)
Users can preprocess their data into the RecordIO format locally, then upload it to the cluster for training.
- Preprocess the data with the RecordIO library:
```python
import paddle.v2.dataset as dataset

dataset.convert(output_path="./dataset",
                reader=dataset.uci_housing.train(),
                num_shards=10,
                name_prefix="uci_housing_train")
```

> **Review comment:** Users may not understand this.
>
> **Reply:** Done with a link to reader.
- `output_path`: the output path for the converted files
- `reader`: the user-defined reader that provides the data (see the sketch just below this list)
- `num_shards`: the number of shard files to generate
- `name_prefix`: the prefix for the generated file names
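
Since the review comment above notes that `reader` may be unclear, here is a minimal sketch of a hand-written reader. The data it yields is made up purely for illustration; only the convention that a reader is a function returning a generator is taken from the example above:

```python
def my_reader():
    # A reader is a function that returns a generator; each yielded item
    # is one training sample. The (features, label) layout here is made up.
    for i in range(100):
        features = [float(i), float(i) * 2.0]
        label = float(i % 2)
        yield features, label

# It could then be converted the same way as the built-in dataset above
# (hypothetical prefix name):
# dataset.convert(output_path="./dataset", reader=my_reader,
#                 num_shards=10, name_prefix="my_train_data")
```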

After `dataset.convert` runs successfully, the following files are generated locally:
```bash
.
./dataset
./dataset/uci_housing_train-00000-of-00009
./dataset/uci_housing_train-00001-of-00009
./dataset/uci_housing_train-00002-of-00009
./dataset/uci_housing_train-00003-of-00009
...
```

- Write a reader to read files in the RecordIO format:
```python
import cPickle as pickle

def cluster_creator(filename):
    import recordio

    def reader():
        # Open the RecordIO shards matching `filename`,
        # e.g. "./dataset/uci_housing_train*".
        r = recordio.reader(filename)
        while True:
            d = r.read()
            if not d:
                break
            yield pickle.loads(d)

    return reader
```

> **Review comment:** Without a master, the reader side has to dispatch the files itself. Let's keep this example using the dispatch approach for now and update it once the master feature is stable.
>
> **Reply:** Done. Thanks.
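
As the review comment above notes, without a master each trainer has to pick its own subset of shard files on the reader side. The following is only a sketch of one way to do that dispatch; the trainer id and trainer count are assumed to be supplied by the caller, and how they are obtained in a real job is not specified here:

```python
import glob
import cPickle as pickle

def dispatched_creator(pattern, trainer_id, trainer_count):
    """Hypothetical reader creator: each trainer reads a disjoint subset
    of the shards matched by `pattern`, e.g. "./dataset/uci_housing_train*"."""
    import recordio

    def reader():
        # Sort for a deterministic order, then take every trainer_count-th
        # shard starting at trainer_id, so the trainers do not overlap.
        for shard in sorted(glob.glob(pattern))[trainer_id::trainer_count]:
            r = recordio.reader(shard)
            while True:
                d = r.read()
                if not d:
                    break
                yield pickle.loads(d)

    return reader
```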
### Uploading training data with paddlecloud

The paddlecloud command has built-in support for uploading data; at the moment this only works in environments whose storage system is CephFS. To upload, run:

> **Review comment:** Change "Users can" to "Users need to".
>
> **Reply:** Done.