add checkpoint util class and implement #10532
seiriosPlus merged 59 commits into PaddlePaddle:develop from
Conversation
| @@ -0,0 +1,214 @@
| /* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
I see there are load_op and load_combine_op and the corresponding save ops; on the Python side, you can also use fluid.io.save_persistables to save all persistable variables.
In order to make save_persistables equivalent to saving a checkpoint, make sure that the state variables are all "persistable", like step counters, learning rates, learning-rate moments, etc.
So can you reuse those ops instead of writing new ones?
load_op and save_op are designed for LoDTensor variables, but a checkpoint will save variables that are not only LoDTensor, and checkpoint saving has some particular arguments.
At present, the checkpoint load/save ops and the plain load/save ops have no clear-cut distinction.
I think it's better to reuse the current operators; maybe checking the variable type will be fine.
So what other variable types are saved in the checkpoint? "RAW" types and "feed"/"fetch" variables may not need to be saved.
| for (auto &var : sparse_vars) {
|   var->GetMutable<framework::SelectedRows>()->mutable_rows()->clear();
| }
This change may not be necessary.
I am sorry about it, I will revert it later.
| if checkpoint_dir and self.is_chief:
|     program.global_block().create_var(
|         name=SERIAL_VAR_NAME,
serial is a serial number, like 0, 1, 2, ..., 100. Each time Paddle needs to save a checkpoint, the serial number auto-increments.
If everything goes well, the biggest serial number will be used when loading the checkpoint.
Calling it a checkpoint ID may be better for understanding.
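The serial-number scheme described above can be sketched in plain Python. This is a hypothetical helper for illustration (the name `latest_serial` is not from the PR):

```python
import os

def latest_serial(checkpoint_dir):
    # Scan for sub-directories named "0", "1", ... and return the
    # largest serial number, or -1 when no checkpoint exists yet.
    serials = [
        int(name) for name in os.listdir(checkpoint_dir)
        if name.isdigit() and os.path.isdir(os.path.join(checkpoint_dir, name))
    ]
    return max(serials) if serials else -1
```

Saving then uses `latest_serial(dir) + 1` as the next directory name, and loading picks the largest serial that completed successfully.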
| save_vars = []
| for var in self.origin_program.list_vars():
|     if self._is_persistable(var):
can use fluid.io.save_persistables
| serial_number = self._get_lastest_checkpoint_dir(checkpoint_load_dir)
| s_prog.global_block().append_op(
How does the current parameter server know which parameter block to load?
| # is_chief (no.0 trainer) for checkpoint
| # the no.0 trainer will save all variables and its own reader offset to checkpoint
| # other trainers will save their own reader offsets to checkpoint
| self.is_chief = trainer_id == 0
I will fix it.
| except ValueError:
|     return -1
| success_path = os.path.join(checkpoint_dir, cur_dir, SUCCESS)
What is success_path used for?
We need a tag to indicate that the checkpoint content is valid. So I define a marker named _SUCCESS: when checkpoint_save_op has saved all needed variables successfully, it writes an empty file named _SUCCESS at the end.
Because of this, when the pserver/trainer needs to load a checkpoint, it checks for _SUCCESS first.
If you only need to know when the saving ends, just wait for the executor to return; raising any error that makes the saving fail is OK, I think.
If an exception happens while the executor is running, the executor does not return any information, so how do we know the saving succeeded?
You can catch that exception, I think. Please give it a try; this will make the code simpler.
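The _SUCCESS-marker protocol under discussion can be sketched as follows. This is a minimal illustration of the design from the comments above; the function names are hypothetical, not the PR's actual code:

```python
import os

SUCCESS_MARK_FILENAME = "_SUCCESS"

def write_success(checkpoint_subdir):
    # Touch an empty marker file only after every variable has been
    # written, so a partially saved checkpoint never carries the mark.
    open(os.path.join(checkpoint_subdir, SUCCESS_MARK_FILENAME), "w").close()

def is_complete(checkpoint_subdir):
    # Loaders trust a checkpoint directory only when the marker exists.
    return os.path.isfile(
        os.path.join(checkpoint_subdir, SUCCESS_MARK_FILENAME))
```

The alternative suggested by the reviewer, catching the executor's exception around the save call, detects a failed save in-process, but the marker file also protects loaders in a different process (or after a crash) from reading a half-written directory.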
python/paddle/fluid/io.py (Outdated)
|         save_secs=600,
|         main_program=None):
|     """
|     Save Variables to Checkpint Dir
Checkpint => Checkpoint
Dir => Directory
| def save_checkpoint(executor,
|                     dirname,
|                     keep_max=3,
keep_max => max_num_checkpoints
| def save_checkpoint(executor,
|                     dirname,
|                     keep_max=3,
|                     save_secs=600,
| serial = _get_lastest_checkpoint_dir(dirname) + 1
| cur_dir = os.path.join(dirname, str(serial))
| # save_persistables(executor, cur_dir, main_program)
No commented-out code, please.
| def save_checkpoint(executor,
|                     dirname,
Can we use the current working directory as the default?
| return get_parameter_value(var, executor)
| SUCCESS = "_SUCCESS"
SUCCESS = SUCCESS_MARK_FILENAME
| if not os.path.isdir(dirname):
|     os.makedirs(dirname)
| global BEGIN_SECS
Try not to use global, please.
| serial = _get_lastest_checkpoint_dir(dirname) + 1
| cur_dir = os.path.join(dirname, str(serial))
| # save_persistables(executor, cur_dir, main_program)
| save_vars(
Why call save_vars instead of save_persistables?
save_persistables cannot filter out gradient variables.
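A sketch of why a custom predicate is needed here: fluid names gradient variables with an "@GRAD" suffix, so a checkpoint filter can keep persistable state while dropping gradients. The helper below is illustrative only, not the PR's actual filter:

```python
def is_checkpoint_var(name, persistable):
    # Keep persistable state (parameters, step counters, LR moments)
    # but skip gradient variables, which fluid suffixes with "@GRAD".
    return persistable and not name.endswith("@GRAD")
```

Passing such a predicate to save_vars gives the filtering that save_persistables alone cannot express.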
| 'get_inference_program',
| 'save_vars', 'save_params', 'save_persistables', 'load_vars', 'load_params',
| 'load_persistables', 'save_inference_model', 'load_inference_model',
| 'get_inference_program', 'save_checkpoint', 'restore_checkpoint'
Maybe it's better to name it load_checkpoint or restore_from_checkpoint?
I will use load_checkpoint; restore_from_checkpoint is too long, I think.
| return True
| def _lru_delete(dirname, keep_max=3):
keep_max => max_num_checkpoints
| def _lru_delete(dirname, keep_max=3):
|     """
|     retain checkpoint nums with keep_max
keep_max => max_num_checkpoints
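The pruning strategy being reviewed (keep at most max_num_checkpoints serial directories, deleting the oldest first) could look like this sketch; despite the "LRU" name it is simply oldest-serial-first deletion. Names here follow the reviewer's suggested rename and are illustrative:

```python
import os
import shutil

def lru_delete(dirname, max_num_checkpoints=3):
    # Sort serial directories numerically and delete all but the
    # newest max_num_checkpoints of them.
    serials = sorted(
        int(name) for name in os.listdir(dirname)
        if name.isdigit() and os.path.isdir(os.path.join(dirname, name)))
    for serial in serials[:-max_num_checkpoints]:
        shutil.rmtree(os.path.join(dirname, str(serial)))
```

Sorting numerically (not lexicographically) matters once serials pass 9, since "10" sorts before "2" as a string.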
| def _write_success(dirname):
|     """
|     write _SUCCESS to checkpoint dir
_SUCCESS is the file name, not the var name.
| def _get_lastest_checkpoint_dir(checkpoint_dir):
|     """
|     get the biggest number in checkpoint_dir, which has _SUCCESS
_SUCCESS is the file name, not the var name
| Save Checkpoint will save persistable LodTensor variables from main_program in checkpoint directory,
| directory named by serial number from 0 to (n -1), save_checkpoint use LRU strategy
| to keep numbers of checkpoint directory, the numbers of checkpoint directory are max_num_checkpoints at most,
| The interval time between two save_checkpoint must great than or equal to save_interval_secs.
The interval between two saved checkpoints must be greater than save_interval_secs.
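One way to enforce that interval is to compare the previous checkpoint directory's modification time against the clock. This is a sketch under the assumption that each save creates or touches its serial directory; the helper name is hypothetical:

```python
import os
import time

def interval_exceeded(last_checkpoint_dir, save_interval_secs):
    # Allow the very first checkpoint unconditionally; afterwards only
    # save when the previous directory is at least the interval old.
    if not os.path.isdir(last_checkpoint_dir):
        return True
    age = time.time() - os.path.getmtime(last_checkpoint_dir)
    return age >= save_interval_secs
```

Using filesystem mtimes instead of an in-process timer also avoids the module-level global (BEGIN_SECS) the reviewer objected to above.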
| def load_checkpoint(executor, dirname=None, main_program=None):
|     """
|     Load checkpoint from directory by executor,
directory => one directory
| def load_checkpoint(executor, dirname=None, main_program=None):
|     """
|     Load checkpoint from directory by executor,
|     it will find lastest checkpoint file and load it auto.
latest => the most recent saved checkpoint
| os.path.join(dirname, str(serial)), save_interval_secs):
|     return
| serial = serial + 1
| return
| serial = serial + 1
| cur_dir = os.path.join(dirname, str(serial))
The checkpoint directories will be named "1", "2", etc., which may not make sense to users. I think it's better to use names like "checkpoint_1", "checkpoint_2", and the "_SUCCESS" file can store a timestamp of when the checkpoint was saved.
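The reviewer's suggestion could be sketched like this (a hypothetical helper, assuming a "checkpoint_&lt;serial&gt;" directory prefix and a Unix timestamp written into _SUCCESS):

```python
import os
import time

def mark_checkpoint_saved(dirname, serial):
    # Use a self-describing directory name and record when the save
    # finished inside the _SUCCESS marker itself.
    cur_dir = os.path.join(dirname, "checkpoint_%d" % serial)
    os.makedirs(cur_dir, exist_ok=True)
    with open(os.path.join(cur_dir, "_SUCCESS"), "w") as f:
        f.write(str(int(time.time())))
    return cur_dir
```

Storing the timestamp in the marker keeps the directory listing human-readable while still letting tools sort checkpoints by save time.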
Adds a new feature related to #10376.