occasional multiprocessing errors

derekja
Posts: 3
Joined: Wed Nov 07, 2018 3:16 am
company / institution: university of victoria

occasional multiprocessing errors

Post by derekja »

Hi,

Thanks for Polymer. Very nice package. I am processing Copernicus imagery in a high-performance computing environment, so I am running Polymer in many dozens of containers at once. In most cases it works perfectly, but in about 5% of cases or fewer I get an error from within the multiprocessing stack. It usually surfaces as a bus error or a segmentation fault followed by a broken pipe. It seems to occur more frequently when memory is restricted: up to 15% of runs when I am using only a couple of cores with 32G of memory, down to about 5% of runs with 125G of memory and 32 cores. When I rerun a file that crashed, it generally completes just fine. This is with Polymer version 4.9 on Ubuntu Xenial with Python 3.5. Is this something that you have seen before? Info from faulthandler below.
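For anyone who wants to capture the same kind of dump: a minimal sketch, assuming the processing script can be edited, that turns on the standard-library faulthandler module at startup. Where exactly the call goes in a script is an assumption; it only needs to run before the crash.

```python
import faulthandler

# On a fatal signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL), write the
# Python traceback of every thread to stderr before the process dies.
faulthandler.enable(all_threads=True)
```

The same effect is available without touching the script by starting the interpreter with `python3 -X faulthandler` or by setting the environment variable `PYTHONFAULTHANDLER=1`.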

Thanks!

/spectral/derekja/pol4.9/polymer/level2_nc.py:112: MaskedArrayFutureWarning: setting an item on a masked array which has a shared mask will not copy the mask and also change the original mask array in the future.
Check the NumPy 1.11 release notes for more information.
data[np.isnan(data)] = fill_value
Fatal Python error: Segmentation fault

Thread 0x00002b69973b9700 (most recent call first):
File "/usr/lib/python3.5/multiprocessing/connection.py", line 379 in _recv
File "/usr/lib/python3.5/multiprocessing/connection.py", line 407 in _recv_bytes
File "/usr/lib/python3.5/multiprocessing/connection.py", line 250 in recv
File "/usr/lib/python3.5/multiprocessing/pool.py", line 429 in _handle_results
File "/usr/lib/python3.5/threading.py", line 862 in run
File "/usr/lib/python3.5/threading.py", line 914 in _bootstrap_inner
File "/usr/lib/python3.5/threading.py", line 882 in _bootstrap

Current thread 0x00002b69971b8700 (most recent call first):
File "/spectral/derekja/pol4.9/polymer/level1_olci.py", line 212 in read_band
File "/spectral/derekja/pol4.9/polymer/level1_olci.py", line 264 in read_block
File "/spectral/derekja/pol4.9/polymer/level1_olci.py", line 350 in blocks
File "/spectral/derekja/pol4.9/polymer/main.py", line 422 in blockiterator
File "/usr/lib/python3.5/multiprocessing/pool.py", line 305 in <genexpr>
File "/usr/lib/python3.5/multiprocessing/pool.py", line 380 in _handle_tasks
File "/usr/lib/python3.5/threading.py", line 862 in run
File "/usr/lib/python3.5/threading.py", line 914 in _bootstrap_inner
File "/usr/lib/python3.5/threading.py", line 882 in _bootstrap

Thread 0x00002b6996fb7700 (most recent call first):
File "/usr/lib/python3.5/multiprocessing/pool.py", line 367 in _handle_workers
File "/usr/lib/python3.5/threading.py", line 862 in run
File "/usr/lib/python3.5/threading.py", line 914 in _bootstrap_inner
File "/usr/lib/python3.5/threading.py", line 882 in _bootstrap

Thread 0x00002b698b343100 (most recent call first):
File "/spectral/derekja/pol4.9/polymer/level2_nc.py", line 115 in write_block
File "/spectral/derekja/pol4.9/polymer/level2.py", line 119 in write
File "/spectral/derekja/pol4.9/polymer/main.py", line 506 in run_atm_corr
File "./process2017.py", line 72 in <module>
Segmentation fault (core dumped)
Process ForkPoolWorker-50:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/pool.py", line 125, in worker
put((job, i, result))
File "/usr/lib/python3.5/multiprocessing/queues.py", line 355, in put
self._writer.send_bytes(obj)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/lib/python3.5/multiprocessing/connection.py", line 398, in _send_bytes
self._send(buf)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.5/multiprocessing/pool.py", line 130, in worker
put((job, i, (False, wrapped)))
File "/usr/lib/python3.5/multiprocessing/queues.py", line 355, in put
self._writer.send_bytes(obj)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/lib/python3.5/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header + buf)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Process ForkPoolWorker-51:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/pool.py", line 125, in worker
put((job, i, result))
File "/usr/lib/python3.5/multiprocessing/queues.py", line 355, in put
self._writer.send_bytes(obj)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/lib/python3.5/multiprocessing/connection.py", line 397, in _send_bytes
self._send(header)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.5/multiprocessing/pool.py", line 130, in worker
put((job, i, (False, wrapped)))
File "/usr/lib/python3.5/multiprocessing/queues.py", line 355, in put
self._writer.send_bytes(obj)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/lib/python3.5/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header + buf)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Process ForkPoolWorker-49:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/pool.py", line 125, in worker
put((job, i, result))
File "/usr/lib/python3.5/multiprocessing/queues.py", line 355, in put
self._writer.send_bytes(obj)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/lib/python3.5/multiprocessing/connection.py", line 397, in _send_bytes
self._send(header)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.5/multiprocessing/pool.py", line 130, in worker
put((job, i, (False, wrapped)))
File "/usr/lib/python3.5/multiprocessing/queues.py", line 355, in put
self._writer.send_bytes(obj)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/lib/python3.5/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header + buf)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
fsteinmetz
Site Admin
Posts: 306
Joined: Fri Sep 07, 2018 1:34 pm
company / institution: Hygeos
Location: Lille, France

Re: occasional multiprocessing errors

Post by fsteinmetz »

Dear Derek,

Did you use the "multiprocessing" option? If so, I suggest you try leaving the default (multiprocessing=0), where Polymer should not use the multiprocessing module at all, and see if the problem still appears.
I have indeed occasionally encountered segfaults when using the multiprocessing option *and* netcdf4 output (although these two features are not supposed to be related), but they were hard to reproduce and seemingly depended on particular library versions. I haven't encountered this crash for quite some time.
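To illustrate what such a switch means in general, here is a minimal sketch of a sequential-versus-pool toggle. It is not Polymer's actual code: `process_block` and `process_blocks` are hypothetical names, and only the convention that 0 means "no multiprocessing" is taken from the option described above.

```python
from multiprocessing import Pool


def process_block(block):
    """Hypothetical stand-in for the per-block work."""
    return block * 2


def process_blocks(blocks, multiprocessing=0):
    """Process blocks sequentially (0) or in a pool of that many workers (> 0)."""
    if multiprocessing == 0:
        # Sequential path: no pool and no worker processes are created.
        return [process_block(b) for b in blocks]
    with Pool(multiprocessing) as pool:
        # imap returns results in input order while the workers run in parallel.
        return list(pool.imap(process_block, blocks))
```

With the sequential path, a crash in native code is at least confined to a single process, which makes it easier to tell whether the pool is involved at all.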

François
derekja

Re: occasional multiprocessing errors

Post by derekja »

The error does not occur when multiprocessing is disabled, but disabling it costs more processing time than simply re-running the instances that fail.
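A minimal sketch of that re-run approach, assuming each image is processed by a separate command. `run_with_retry` is a hypothetical helper, not part of Polymer; it reruns the command only when the child process was killed by a signal such as SIGSEGV or SIGBUS.

```python
import subprocess
import sys


def run_with_retry(cmd, max_attempts=3):
    """Run cmd, rerunning it while it is killed by a signal.

    Returns (returncode, attempts_used).
    """
    returncode = None
    for attempt in range(1, max_attempts + 1):
        returncode = subprocess.run(cmd).returncode
        # On POSIX, a return code of -N means the child was killed by signal N.
        if returncode >= 0:
            return returncode, attempt
        print("attempt %d killed by signal %d" % (attempt, -returncode),
              file=sys.stderr)
    return returncode, max_attempts
```

It could be called as `run_with_retry([sys.executable, "process2017.py"])`. Ordinary non-zero exit codes are returned without a retry, since those usually point to a real error rather than an intermittent crash.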

Thank you, François, I will investigate different netcdf versions and will report back.
fsteinmetz

Re: occasional multiprocessing errors

Post by fsteinmetz »

OK, processing is indeed a lot faster with multiprocessing, but out of curiosity, do you think this option is still beneficial in an HPC environment?

For your information, I have attached a list of the Python packages I am currently using, with which I don't encounter this bug; you may use it to create an Anaconda environment.
derekja

Re: occasional multiprocessing errors

Post by derekja »

Great, I can confirm that upgrading netcdf from version 4.4.0 to version 4.6.1 solved the issue. Thanks.
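In case it helps others check their own environment, here is a minimal sketch, assuming the versions above refer to the netCDF C library and that the netCDF4 Python package is what links to it. The helper names are hypothetical, and 4.6.1 is simply the version that worked here; versions between 4.4.0 and 4.6.1 were not tested.

```python
import re


def parse_version(text):
    """Return the leading 'X.Y.Z' of a version string as a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("unrecognized version string: %r" % text)
    return tuple(int(part) for part in match.groups())


def is_known_good(lib_version, minimum="4.6.1"):
    """True if lib_version is at least the version reported to work in this thread."""
    return parse_version(lib_version) >= parse_version(minimum)


def installed_netcdf_lib_version():
    """Return the netCDF C library version seen by netCDF4, or None if not installed."""
    try:
        import netCDF4
    except ImportError:
        return None
    return netCDF4.__netcdf4libversion__
```

Calling `is_known_good(installed_netcdf_lib_version())` in each container, after checking that the version is not None, gives a quick sanity check before starting a large batch.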

Normally, I wouldn't care about multiprocessing and would just request single-core nodes. However, our job scheduler de-prioritizes jobs that run longer than 3 hours, which is often the case without multiprocessing. The geographers also sometimes submit a job with only a single image to process, and multiprocessing is of course preferable there.

Thanks for your help, François!
fsteinmetz

Re: occasional multiprocessing errors

Post by fsteinmetz »

Good news, thanks for your feedback :)