
ENH: Parallel mode for monte-carlo simulations #619

Draft · wants to merge 61 commits into base: develop
Conversation

@brunosorban (Collaborator) commented Jun 9, 2024

This pull request adds the option to run simulations in parallel to the MonteCarlo class. The feature uses a context manager named MonteCarloManager to centralize all workers and shared objects, ensuring proper termination of the subprocesses.

A second feature is the possibility to export (close to) all simulation inputs and outputs to an .h5 file, which can be inspected with HDF View (or similar) software. Since this is a less conventional file format, a method to read it and a structure to post-process multiple simulations were also added under rocketpy/stochastic/post_processing. A cache handles the data manipulation and returns a 3D numpy array with all simulations, whose shape corresponds to (simulation_index, time_index, column). The column axis is reserved for vector data, so x, y and z, for example, may live under the same dataset. For instance, cache.read_inputs('motors/thrust_source') returns both time and thrust.
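To give a feel for the intended access pattern, here is a minimal sketch of reading such an .h5 file directly with h5py and stacking one quantity into a (simulation_index, time_index, column) array. The group layout assumed below (one group per simulation index holding inputs/motors/thrust_source) is illustrative; the actual cache under rocketpy/stochastic/post_processing wraps this logic:

```python
# Illustrative sketch: the HDF5 layout is assumed, not the exact structure
# written by this PR; the post-processing cache handles this internally.
import h5py
import numpy as np


def read_quantity(path, dataset="inputs/motors/thrust_source"):
    """Stack one dataset from every simulation group into an array shaped
    (simulation_index, time_index, column)."""
    with h5py.File(path, "r") as file:
        sim_keys = sorted(file.keys(), key=int)  # assumes one group per simulation index
        arrays = [np.asarray(file[key][dataset]) for key in sim_keys]

    n_time = max(array.shape[0] for array in arrays)  # pad to a common time length
    n_cols = arrays[0].shape[1]                       # e.g. columns: time, thrust
    stacked = np.full((len(arrays), n_time, n_cols), np.nan)
    for i, array in enumerate(arrays):
        stacked[i, : array.shape[0], :] = array
    return stacked


# data = read_quantity("monte_carlo.h5")
# data[:, :, 0] would hold time and data[:, :, 1] thrust for every simulation.
```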

Pull request type

  • Code changes (bugfix, features)

Checklist

  • Tests for the changes have been added (if needed)
  • Docs have been reviewed and added / updated
  • Lint (black rocketpy/ tests/) has passed locally
  • All tests (pytest tests -m slow --runslow) have passed locally
  • CHANGELOG.md has been updated (if relevant)

Current behavior

Currently, Monte Carlo simulations can only run serially, and all outputs are saved to a .txt file.

New behavior

Monte Carlo simulations may now be executed in parallel, and all outputs may be exported to either a .txt or an .h5 file, saving either selected key data or everything.
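For concreteness, a minimal usage sketch (the stochastic environment, rocket, and flight objects are assumed to have been created beforehand; the argument values mirror the example discussed later in the review):

```python
# Sketch of the new call pattern; stochastic_env, stochastic_rocket and
# stochastic_flight are assumed to be previously built Stochastic objects.
from rocketpy import MonteCarlo

test_dispersion = MonteCarlo(
    filename="monte_carlo_analysis",
    environment=stochastic_env,
    rocket=stochastic_rocket,
    flight=stochastic_flight,
)

# Serial run (previous behavior), writing .txt files as before.
test_dispersion.simulate(number_of_simulations=100, append=False)

# Parallel run (new behavior): producer processes run the simulations while a
# consumer process saves the results.
test_dispersion.simulate(
    number_of_simulations=100,
    append=False,
    parallel=True,
    n_workers=5,
)
```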

Breaking change

  • Yes
  • No

Additional information

None

@brunosorban requested a review from a team as a code owner on June 9, 2024 13:27
@brunosorban changed the title to "ENH: Parallel mode for monte-carlo simulations" on Jun 9, 2024
@brunosorban (Collaborator, Author) commented Jun 9, 2024

Benchmark of the results. A machine with 6 cores (12 threads) was used.

[Figure: workers_performance benchmark chart]

@phmbressan (Collaborator) left a comment:

Amazing feature! As the results show, the MonteCarlo class has great potential for parallelization.

The only blocking issue I see with this PR is the serialization code. It still does not support all of RocketPy's features and requires a lot of maintenance and updates on our end.

Do you see any other option for performing the serialization of inputs?

@Gui-FernandesBR (Member):

> Amazing feature! As the results show, the MonteCarlo class has great potential for parallelization.
>
> The only blocking issue I see with this PR is the serialization code. It still does not support all of RocketPy's features and requires a lot of maintenance and updates on our end.
>
> Do you see any other option for performing the serialization of inputs?

@phmbressan we should make all the classes JSON-serializable; it's an open issue at #522. In the meantime, maybe we could still use the _encoders module to serialize inputs.

I agree with you that implementing Flight class serialization within this PR may create maintenance issues for us. The simplest solution would be to delete the flightv1_serializer (and similar) functions.
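As a rough illustration of the direction #522 points to (not this PR's implementation, and not the actual _encoders API), a custom json.JSONEncoder could fall back to an object's public attributes and discretize callables:

```python
# Illustrative only: the class name, fallback rules, and discretization grid
# are assumptions, not RocketPy's actual _encoders module.
import json

import numpy as np


class IllustrativeEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        if callable(obj):
            # A callable (e.g. a lambda-backed Function) cannot be dumped
            # directly; export a coarse discretization instead.
            grid = np.linspace(0, 10, 101)
            return {"x": grid.tolist(), "y": [float(obj(x)) for x in grid]}
        if hasattr(obj, "__dict__"):
            return {k: v for k, v in vars(obj).items() if not k.startswith("_")}
        return super().default(obj)


# Usage: json.dumps(some_input_object, cls=IllustrativeEncoder)
```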

@MateusStano (Member) left a comment:

This is just a temporary review; I intend to finish it later.

The monte_carlo_class_usage notebook currently does not work with parallel mode. I did not have time to look into it, so I did not review the parallel part of the code.

Overall, there seem to be a lot of structural changes to the class that do not look necessary for parallelism and might make maintaining it difficult. These aren't big changes, so it should be easy to adjust them.

The parallel structure seems good, but it is not working currently.

I will get back to this as soon as other priorities are dealt with.

rocketpy/simulation/monte_carlo.py (outdated review thread, resolved)
Comment on lines 254 to 255
# Create data files for inputs, outputs and error logging
open_mode = "r+" if append else "w+"
Member comment:

Why use r+ or w+? This changes the way the file is written to, since the stream is positioned at the start of the file instead of the end. I believe this changes the behavior of the code in an undesired way.

Collaborator comment:

The stream is positioned at the beginning of the file solely for the purpose of the readlines() call (if it were at the end, it would always return 0). I guess this check of id_i and id_o might be optional? What is your suggestion here?

The issue is that the MonteCarlo class only knows whether it should overwrite or append to the files when the simulate method is called, since that is where the append parameter lives.
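For reference, a small self-contained sketch of the trade-off being discussed: "w+" truncates the file, while "r+" keeps its contents and positions the stream at the start, so an explicit seek to the end is needed before appending:

```python
# Generic illustration of the open-mode discussion, not the MonteCarlo code.
import os
from pathlib import Path

append = True
path = Path("example.inputs.txt")
path.touch(exist_ok=True)  # "r+" requires the file to exist already

open_mode = "r+" if append else "w+"  # "w+" truncates, "r+" preserves content
with open(path, open_mode) as file:
    existing_lines = len(file.readlines())  # needs the stream at the start
    file.seek(0, os.SEEK_END)               # move to the end before appending
    file.write(f"new simulation entry (after {existing_lines} existing lines)\n")
```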

rocketpy/simulation/monte_carlo.py (outdated review thread, resolved)
rocketpy/simulation/monte_carlo.py (outdated review thread, resolved)
Comment on lines 406 to 418
@staticmethod
def __sim_producer(
    sto_env,
    sto_rocket,
    sto_flight,
    sim_monitor,
    export_list,
    export_sample_time,
    export_queue,
    error_file,
    mutex,
    seed,
):
Member comment:

Why are all of these functions @staticmethod? I think it's just unnecessary.

Collaborator comment:

The pickler used for creating separate processes does not work on true methods, only on staticmethods or plain functions. Therefore, every method related to building the Process must be static.

@phmbressan (Collaborator) commented Aug 17, 2024:

Forget what I mentioned: the issue with the pickler is not with true methods, but with some nested functions that it cannot handle. After the refactor I made in the earlier commits, those functions were removed.

Thanks to your comment, I was motivated to take another look at it, and it worked. Good job on reviewing. I refactored to remove the clutter of staticmethods in the latest commit. The main thing to keep in mind when dealing with these processes is that each one of them will have a different self.
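A minimal, standalone illustration of the constraint being discussed (generic Python behavior, unrelated to RocketPy's classes): module-level functions and bound methods of importable classes pickle fine, while locally nested functions, like the spline interpolation closure seen in the traceback below, do not:

```python
# Demonstrates what the spawn pickler can and cannot handle.
import pickle


def module_level(x):
    return x * 2


class Worker:
    def method(self, x):
        return x * 2


def make_nested():
    def nested(x):  # defined inside another function -> not picklable
        return x * 2
    return nested


pickle.dumps(module_level)     # OK
pickle.dumps(Worker().method)  # OK: bound methods of importable classes pickle
try:
    pickle.dumps(make_nested())
except AttributeError as error:
    print(error)  # Can't pickle local object 'make_nested.<locals>.nested'
```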

Member comment:

It seems these new changes now lead to this error when running in parallel:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[20], line 1
----> 1 test_dispersion.simulate(
      2     number_of_simulations=100,
      3     append=False,
      4     parallel=True,
      5     n_workers=5,
      6 )

File c:\mateus\github\rocketpy\rocketpy\simulation\monte_carlo.py:235, in MonteCarlo.simulate(self, number_of_simulations, append, parallel, n_workers)
    233 # Run simulations
    234 if parallel:
--> 235     self.__run_in_parallel(n_workers)
    236 else:
    237     self.__run_in_serial()

File c:\mateus\github\rocketpy\rocketpy\simulation\monte_carlo.py:365, in MonteCarlo.__run_in_parallel(self, n_workers)
    362     processes.append(sim_producer)
    364 for sim_producer in processes:
--> 365     sim_producer.start()
    367 sim_consumer = mp.Process(
    368     target=self.__sim_consumer,
    369     args=(
   (...)
    373     ),
    374 )
    376 sim_consumer.start()

File c:\Users\mateu\.conda\envs\rpy\Lib\multiprocessing\process.py:121, in BaseProcess.start(self)
    118 assert not _current_process._config.get('daemon'), \
    119        'daemonic processes are not allowed to have children'
    120 _cleanup()
--> 121 self._popen = self._Popen(self)
    122 self._sentinel = self._popen.sentinel
    123 # Avoid a refcycle if the target function holds an indirect
    124 # reference to the process object (see bpo-30775)

File c:\Users\mateu\.conda\envs\rpy\Lib\multiprocessing\context.py:224, in Process._Popen(process_obj)
    222 @staticmethod
    223 def _Popen(process_obj):
--> 224     return _default_context.get_context().Process._Popen(process_obj)

File c:\Users\mateu\.conda\envs\rpy\Lib\multiprocessing\context.py:337, in SpawnProcess._Popen(process_obj)
    334 @staticmethod
    335 def _Popen(process_obj):
    336     from .popen_spawn_win32 import Popen
--> 337     return Popen(process_obj)

File c:\Users\mateu\.conda\envs\rpy\Lib\multiprocessing\popen_spawn_win32.py:94, in Popen.__init__(self, process_obj)
     92 try:
     93     reduction.dump(prep_data, to_child)
---> 94     reduction.dump(process_obj, to_child)
     95 finally:
     96     set_spawning_popen(None)

File c:\Users\mateu\.conda\envs\rpy\Lib\multiprocessing\reduction.py:60, in dump(obj, file, protocol)
     58 def dump(obj, file, protocol=None):
     59     '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60     ForkingPickler(file, protocol).dump(obj)

AttributeError: Can't pickle local object 'Function.__set_interpolation_func.<locals>.spline_interpolation'

Collaborator comment:

I have just noticed that; I am on it.

Comment on lines 91 to 98
self,
filename,
environment,
rocket,
flight,
export_list=None,
batch_path=None,
export_sample_time=0.1,
Member comment:

What is batch_path for? It does not seem to do anything new

Comment on lines 568 to 570
inputs_dict = MonteCarlo.prepare_export_data(
    inputs_dict, export_sample_time, remove_functions=True
)
Member comment:

Why are the functions removed here? Are they just ignored for the inputs?

Collaborator comment:

This is something I did not change from the first parallel implementation, since I was most concerned with the parallel architecture itself.

But now seems like a good moment to discuss it:

  • If this is set to False, the same discretization is applied to every lambda Function, which is not reasonable: it takes a huge amount of time to process and generates large input files (e.g. 100 MB).

  • If this is set to True, the Functions are ignored.

This is quite related to the encoders feature, so I don't think a robust solution is needed in this PR, but the False option is almost unusable right now.
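To make the False option concrete, this is roughly what discretizing a lambda-backed Function at export_sample_time amounts to (an illustrative helper, not the actual prepare_export_data):

```python
# Illustrative helper showing why discretizing every callable at a fixed
# sample time can blow up processing time and export size.
import numpy as np


def discretize(func, t_final, export_sample_time=0.1):
    """Sample a callable on a uniform grid so it can be written to disk."""
    time = np.arange(0, t_final + export_sample_time, export_sample_time)
    return np.column_stack((time, [func(t) for t in time]))


thrust = lambda t: 1500.0 * np.exp(-t / 3.0)  # stand-in for a lambda Function
samples = discretize(thrust, t_final=30.0)    # two columns: time, value
```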

@phmbressan (Collaborator):

> The monte_carlo_class_usage notebook currently does not work with parallel mode. I did not have time to look into it, so I did not review the parallel part of the code.

I know your review was just temporary, but could you be a bit more specific about the parallel side not working? It might be an OS-related issue that we should fix, of course, but here things were working fine.

@MateusStano (Member):

> I know your review was just temporary, but could you be a bit more specific about the parallel side not working? It might be an OS-related issue that we should fix, of course, but here things were working fine.

Open the monte_carlo_class_usage.ipynb notebook and run all cells.

The parameter parallel is set to True, so the simulation runs in parallel.

After the simulation is done, nothing is saved to the .inputs.txt or .outputs.txt files.

If you set parallel to False instead, the results are saved correctly.

Comment on lines 1219 to 1246
@staticmethod
def _reprint(msg, end="\n", flush=False):
    """
    Prints a message on the same line as the previous one and replaces the
    previous message with the new one, deleting the extra characters from
    the previous message.

    Parameters
    ----------
    msg : str
        Message to be printed.
    end : str, optional
        String appended after the message. Default is a new line.
    flush : bool, optional
        If True, the output is flushed. Default is False.

    Returns
    -------
    None
    """
    padding = ""

    if len(msg) < _SimMonitor._last_print_len:
        padding = " " * (_SimMonitor._last_print_len - len(msg))

    _SimMonitor._last_print_len = len(msg)

    print(msg + padding, end=end, flush=flush)
Member comment:

  1. Does this method need to be a static method?

  2. Does it also need to be part of _SimMonitor? It seems to only be used outside of it.

I am guessing this structure is because of the parallelism. If not, I suggest we change it.

@Gui-FernandesBR (Member) left a comment:

@phmbressan really good modifications to this PR. Great work.

Before merging, please run 1000 simulations so the example is better illustrated in the documentation.

rocketpy/simulation/monte_carlo.py (review thread, resolved)
rocketpy/simulation/monte_carlo.py (outdated review thread, resolved)
rocketpy/simulation/monte_carlo.py (outdated review thread, resolved)
rocketpy/simulation/monte_carlo.py (outdated review thread, resolved)
rocketpy/simulation/monte_carlo.py (outdated review thread, resolved)
rocketpy/simulation/monte_carlo.py (review thread, resolved)
@phmbressan (Collaborator) commented Aug 23, 2024

I have pushed a fix for the issue with file writing when running on Windows (more precisely, when using the spawn process start method). I have tested it on a Windows machine and it ran correctly, but I invite reviewers to also test other OS configurations.

Issues solved by this PR:

  • MonteCarlo simulations have a parallel mode;
  • Both the simulation execution and the data saving are executed in parallel (producer-consumer pattern; see the sketch after this list);
  • There are performance gains on large simulations;
  • The serial simulations can be executed in the same fashion, and the outputs of both modes are compatible.
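For readers unfamiliar with the pattern, this is a stripped-down, generic sketch of the producer-consumer layout (illustrative multiprocessing code, not the actual MonteCarlo implementation):

```python
# Generic producer-consumer skeleton; names and the fake "simulation" payload
# are illustrative and do not mirror the MonteCarlo internals.
import multiprocessing as mp
import random


def producer(queue, n_sims, seed):
    rng = random.Random(seed)
    for _ in range(n_sims):
        result = {"apogee": 3000 + rng.random() * 100}  # stand-in for a flight result
        queue.put(result)
    queue.put(None)  # sentinel: this producer is done


def consumer(queue, n_producers, path):
    finished = 0
    with open(path, "w") as file:
        while finished < n_producers:
            item = queue.get()
            if item is None:
                finished += 1
            else:
                file.write(f"{item}\n")  # single writer, so no file contention


if __name__ == "__main__":
    queue = mp.Queue()
    producers = [mp.Process(target=producer, args=(queue, 25, s)) for s in range(4)]
    writer = mp.Process(target=consumer, args=(queue, len(producers), "outputs.txt"))
    for process in producers:
        process.start()
    writer.start()
    for process in producers:
        process.join()
    writer.join()
```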

Points of Improvement:

  • Soft interrupts of parallel simulations (e.g. an exception or Ctrl-C) are only effective on Linux. Spawned processes (Windows) currently hard-stop.
  • On Windows, Jupyter notebooks will not show the status update prints (running the simulations in a terminal is fine). This seems to be an OS-level standard-output behavior that is not easily solved.

Some of these points could become repository issues; I am stating them here for proper PR documentation.

Future Considerations:

  • Python 3.14 onward will make spawn the default start method on all operating systems. We could keep RocketPy's start method as fork on Linux if this undermines performance too much (see the sketch below);
  • The Python GIL should be removed some years from now (PEP 703); this could bring performance benefits, since threads are generally faster to start.
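The start-method point could be addressed along these lines (a hedged sketch; whether RocketPy should pin the context, and where, is still an open design question):

```python
# Hedged sketch of pinning the multiprocessing start method per platform,
# anticipating Python 3.14 making "spawn" the default everywhere.
import multiprocessing as mp
import sys


def get_mp_context():
    """Prefer the faster fork start method where it is available."""
    if sys.platform.startswith("linux"):
        return mp.get_context("fork")
    return mp.get_context("spawn")  # Windows and macOS compatible


ctx = get_mp_context()
# Processes and queues would then be created from this context, e.g.:
# queue = ctx.Queue()
# worker = ctx.Process(target=some_function, args=(queue,))
```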

@Gui-FernandesBR (Member):

@phmbressan I like the way this PR was refactored. Many thanks for your effort.

Please fix the pylint errors and resolve all the open conversations in this PR so we can approve and merge it into develop!

Optionally, try to rebase the PR to get the latest commits from develop.

Comment on lines +292 to +296
if n_workers is None or n_workers > os.cpu_count():
    n_workers = os.cpu_count()

if n_workers < 2:
    raise ValueError("Number of workers must be at least 2 for parallel mode.")
Member comment:

We should print the number of workers being used with _SimMonitor.reprint here.
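Something along these lines would do (a self-contained sketch; the bare print stands in for the _SimMonitor reprint helper shown later in this review):

```python
# Sketch only: in the class, this message would go through the _SimMonitor
# reprint helper rather than a bare print.
import os

n_workers = None  # value received by simulate(); None means "use all cores"
if n_workers is None or n_workers > os.cpu_count():
    n_workers = os.cpu_count()

if n_workers < 2:
    raise ValueError("Number of workers must be at least 2 for parallel mode.")

print(f"Running Monte Carlo simulations with {n_workers} workers.", flush=True)
```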


sim_consumer.start()

for seed in seeds:
Member comment:

Very minor, but I think the consumer should start after the producers

Comment on lines +332 to +337
    )
    processes.append(sim_producer)

for sim_producer in processes:
    sim_producer.start()

Member comment:

There is an extra for loop here

Suggested change. From:

    )
    processes.append(sim_producer)

for sim_producer in processes:
    sim_producer.start()

To:

    )
    processes.append(sim_producer)
    sim_producer.start()

Comment on lines +385 to +391
while sim_monitor.keep_simulating():
    sim_idx = sim_monitor.increment() - 1

    self.environment._set_stochastic(seed)
    self.rocket._set_stochastic(seed)
    self.flight._set_stochastic(seed)

Member comment:

Does every single iteration need to be re-seeded?

If this were done once before the while loop, wouldn't that be enough?
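On the seeding question, one conventional pattern (a sketch of an alternative, not what this PR currently does) is to derive an independent child seed per worker once, before the loop, using numpy's SeedSequence:

```python
# Sketch: derive one independent, reproducible seed per worker so re-seeding
# inside the simulation loop is not needed for statistical independence.
import numpy as np


def spawn_worker_seeds(base_seed, n_workers):
    children = np.random.SeedSequence(base_seed).spawn(n_workers)
    return [int(child.generate_state(1)[0]) for child in children]


seeds = spawn_worker_seeds(base_seed=42, n_workers=5)
# Each worker would then seed its stochastic models once, before its while loop:
# self.environment._set_stochastic(seeds[worker_idx])
```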

@@ -253,114 +491,52 @@ def __run_single_simulation(self, input_file, output_file):
]
for item in d.items()
)
inputs_dict["idx"] = sim_idx
Member comment:

Suggested change. From:

inputs_dict["idx"] = sim_idx

To:

inputs_dict["index"] = sim_idx

For clarity in the files.

Comment on lines +496 to +499
outputs_dict = {
    export_item: getattr(monte_carlo_flight, export_item)
    for export_item in self.export_list
}
Member comment:

Suggested change. From:

outputs_dict = {
    export_item: getattr(monte_carlo_flight, export_item)
    for export_item in self.export_list
}

To:

outputs_dict = {
    export_item: getattr(monte_carlo_flight, export_item)
    for export_item in self.export_list
}
outputs_dict["index"] = sim_idx

It would be really useful to have the index on both inputs and outputs.

@Gui-FernandesBR (Member):

Converted to draft until you solve the remaining issues, especially the random number generation problem, @phmbressan.

Labels: Enhancement (new feature or request, including adjustments in current code), Monte Carlo (Monte Carlo and related contents)
Projects: Status: Next Version

4 participants