Refine fluid_benchmark.py#11118
Conversation
8820ab8 to
3b7157f
Compare
3b7157f to
2e61c48
Compare
benchmark/fluid/models/resnet.py
Outdated
| "./flowers_1.recordio", "./flowers_1.recordio", | ||
| "./flowers_1.recordio", "./flowers_1.recordio", | ||
| "./flowers_1.recordio", "./flowers_1.recordio" | ||
| ] |
There was a problem hiding this comment.
file_list = ["./flowers_1.recordio"] * 8
benchmark/fluid/fluid_benchmark.py
Outdated
| train_losses.append(loss) | ||
| print("Pass: %d, Iter: %d, Loss: %f\n" % | ||
| (pass_id, iters, np.mean(train_losses))) | ||
| if args.use_recordio: |
There was a problem hiding this comment.
recordio现在主要是为了加速GPU上数据的读取,对于CPU可能作用不大,并且现在parallel executor还不支持CPU。
edfcdd6 to
5ee39eb
Compare
5ee39eb to
4ff3ed3
Compare
4ff3ed3 to
8bb6942
Compare
benchmark/fluid/fluid_benchmark.py
Outdated
| for batch_id, data in enumerate(train_reader()): | ||
| train_losses = [] | ||
| if args.use_recordio: | ||
| pass_id = 0 |
There was a problem hiding this comment.
成舵老师, use_recordio 场景只是想训练1个pass 么?
There was a problem hiding this comment.
用recordio也是可以训练很多个pass的，只是需要训练多少个pass是在这里指定的，这里的pass_id其实是没有什么用的。
benchmark/fluid/fluid_benchmark.py
Outdated
| examples_per_sec = num_samples / train_elapsed | ||
| print('Total examples: %d, total time: %.5f, %.5f examples/sec' % | ||
| (num_samples, train_elapsed, examples_per_sec)) | ||
| print("Pass: %d, Loss: %f" % (pass_id, np.mean(train_losses))) |
There was a problem hiding this comment.
@panyx0718 @guochaorong This code can work, but line 265 is unnecessary.
benchmark/fluid/fluid_benchmark.py
Outdated
| (num_samples, train_elapsed, examples_per_sec)) | ||
| if not args.no_test and batch_acc != None: | ||
| test_acc = test(startup_exe, infer_prog, test_reader, feeder, batch_acc) | ||
| print("Pass: %d, Test Accuracy: %f" % (pass_id, test_acc)) |
| fluid.layers.data( | ||
| name='label', shape=[1], dtype='int64'), | ||
| ], | ||
| place=fluid.CPUPlace()) |
|
|
||
|
|
||
| def get_model(args): | ||
| model = resnet_cifar10 |
There was a problem hiding this comment.
这一行去掉了, 倒数第3行 model 可能未定义。
| startup_exe = fluid.Executor(place) | ||
| startup_exe.run(startup_prog) | ||
| strategy = fluid.ExecutionStrategy() | ||
| strategy.num_threads = 1 |
There was a problem hiding this comment.
why does this default to 1?
There was a problem hiding this comment.
Because I saw that strategy.num_threads is set to 1 in train_parallel, and @Yancey1989 once said that the distributed program will hang when strategy.num_threads is greater than 1.
benchmark/fluid/fluid_benchmark.py
Outdated
| startup_exe.run(startup_prog) | ||
| strategy = fluid.ExecutionStrategy() | ||
| strategy.num_threads = 1 | ||
| strategy.allow_op_delay = False |
There was a problem hiding this comment.
should this default to false if it is not often used?
There was a problem hiding this comment.
You are right, this line is unnecessary in fact because the default value is false.
| train_losses = [] | ||
| for batch_id, data in enumerate(train_reader()): | ||
| train_losses = [] | ||
| if args.use_recordio: |
There was a problem hiding this comment.
Why is the original training loop not able to use recordio? This seems to hack use_recordio into this place.
There was a problem hiding this comment.
Because the number of passes has been set here, the `for pass_id in range(args.pass_num)` loop is meaningless for use_recordio.
benchmark/fluid/fluid_benchmark.py
Outdated
| examples_per_sec = num_samples / train_elapsed | ||
| print('Total examples: %d, total time: %.5f, %.5f examples/sec' % | ||
| (num_samples, train_elapsed, examples_per_sec)) | ||
| print("Pass: %d, Loss: %f" % (pass_id, np.mean(train_losses))) |
benchmark/fluid/fluid_benchmark.py
Outdated
|
|
||
| def main(): | ||
| args = parse_args() | ||
| cards = os.getenv("CUDA_VISIBLE_DEVICES") or "" |
| # only | ||
| nccl_id_var, num_trainers, trainer_id = ( | ||
| None, 1, int(os.getenv("PADDLE_TRAINER_ID", "-1"))) | ||
| None, 1, int(os.getenv("PADDLE_TRAINER_ID", "0"))) |
There was a problem hiding this comment.
Because trainer_id is the last parameter of ParallelExecutor and its type is size_t, it cannot be -1.
Paddle/paddle/fluid/pybind/pybind.cc
Lines 550 to 555 in d3e99ae
There was a problem hiding this comment.
I see. I made this an illegal default number so that PADDLE_TRAINER_ID must be set to a positive value for ParallelExecutor.
benchmark/fluid/models/resnet.py
Outdated
| input = fluid.layers.data(name='data', shape=dshape, dtype='float32') | ||
| label = fluid.layers.data(name='label', shape=[1], dtype='int64') | ||
| if args.use_recordio: | ||
| recordio_name = './cifar10_1.recordio' if args.data_set == 'cifar10' else './flowers_1.recordio' |
| num_samples = 0 | ||
| start_time = time.time() | ||
|
|
||
| if args.use_recordio: |
There was a problem hiding this comment.
pass_num doesn't work when use_recordio?
guochaorong
left a comment
There was a problem hiding this comment.
please fix the logical problems
typhoonzero
left a comment
There was a problem hiding this comment.
Thought this is the same work as #11121?
benchmark/fluid/fluid_benchmark.py
Outdated
| help='The model to run benchmark with.') | ||
| parser.add_argument( | ||
| '--batch_size', type=int, default=32, help='The minibatch size.') | ||
| parser.add_argument( |
There was a problem hiding this comment.
Just changing the meaning of --batch_size would be OK
benchmark/fluid/fluid_benchmark.py
Outdated
| num_samples += len(data) | ||
| iters += 1 | ||
| if batch_id % 1 == 0: | ||
| num_samples += args.batch_size # dev_cnt * args.batch_size? |
There was a problem hiding this comment.
The last batch size may be different
There was a problem hiding this comment.
You are right, but currently we cannot get the actual batch size when we use recordio. This is the problem.
The batch size is set here, if the last batch size of one pass is less than args.batch_size and the current pass is not last, recordio will read data for the next pass to make up the batch.
6e5e0ea to
041140a
Compare
2a7a1dd to
f7414a5
Compare
f7414a5 to
e131716
Compare
| startup_exe = fluid.Executor(place) | ||
| startup_exe.run(startup_prog) | ||
| strategy = fluid.ExecutionStrategy() | ||
| strategy.num_threads = 1 |
| # only | ||
| nccl_id_var, num_trainers, trainer_id = ( | ||
| None, 1, int(os.getenv("PADDLE_TRAINER_ID", "-1"))) | ||
| None, 1, int(os.getenv("PADDLE_TRAINER_ID", "0"))) |
There was a problem hiding this comment.
I see. I made this an illegal default number so that PADDLE_TRAINER_ID must be set to a positive value for ParallelExecutor.
| generate_recordio(dshape, data_set_iterator, recordio_name) | ||
|
|
||
| batch_size_per_gpu = args.batch_size / args.gpus | ||
| file_list = [recordio_name] * 8 |
There was a problem hiding this comment.
Would it be better to generate 8 sharded files?
In current way, it's easy to have duplicated data if shuffling buffer is not big enough?
There was a problem hiding this comment.
You are right. This script currently only takes care of the performance. If the length of file_list and thread_num are equal to the device count, the program will be faster.
| def main(): | ||
| args = parse_args() | ||
| gpus = os.getenv("CUDA_VISIBLE_DEVICES") or "" | ||
| args.gpus = len(gpus.split(",")) |
There was a problem hiding this comment.
It seems the user can use --gpus instead of CUDA_VISIBLE_DEVICES
| train_exe.bcast_params() | ||
|
|
||
| num_samples += args.batch_size | ||
| if iters % 1 == 0: |
There was a problem hiding this comment.
this line seems unnecessary
| num_samples, start_time = 0, time.time() | ||
|
|
||
| if args.use_recordio: | ||
| for iters in xrange(args.iterations): |
There was a problem hiding this comment.
it seems that adding a pass_num control would be better
No description provided.