
WeeklyTelcon_20190205

Geoffrey Paulsen edited this page Mar 12, 2019 · 2 revisions

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Geoff Paulsen
  • Jeff Squyres
  • David Bernholdt
  • Geoffroy Vallee
  • Matias Cabral
  • Matthew Dosanjh
  • Ralph Castain
  • Todd Kordenbrock
  • Josh Hursey
  • George

not there today (I keep this for easy cut-n-paste for future notes)

  • Akshay Venkatesh
  • Brian Barrett
  • Edgar Gabriel
  • Howard Pritchard
  • Josh Hursey
  • Xin Zhao
  • Aravind Gopalakrishnan (Intel)
  • Joshua Ladd
  • Nathan Hjelm
  • Dan Topa (LANL)
  • Thomas Naughton
  • Akshay Venkatesh (nVidia)
  • Arm (UTK)
  • Peter Gottesman (Cisco)
  • mohan

Agenda/New Business

  • The HostGator web site (open-mpi.org) is coming up for renewal. We need to decide what we are going to do about it

    • Expires July 27th (summer); start on this in May.
    • Need to move domain names. (Who owns that?)
    • It'd be nice to move to AWS.
    • DNS should be owned by SPI. Still need to transfer that.
  • Next face to face is in San Jose, April 23 to April 25, at Cisco.

Minutes

Review v3.0.x Milestones v3.0.3

Review v3.1.x Milestones v3.1.0

Review v4.0.x Milestones v4.0.1

  • Schedule: Need a quick turn around for a v4.0.1
  • v4.0.0
  • Merged in PMIx update.
  • Adding OSHMEM API - bugfix. Need to rev .so versions correctly
  • Some fixes to one-sided datatype handling in the past week or two; not sure if these went in.
  • There have been other non-blocker fixes:
    • hwloc macros, libfabric, ompi-io issues fixed in master
  • https://github.com/open-mpi/ompi/issues/6278
    • The removed symbols and the nice message on master and v4.0.x do not give a compile-time error. What do we want?
      • Do we want a compile-time error? Or just a removed symbol and a linker error?
      • Could add a check for C11 and use a static assert for a nice message.
      • For older compilers, could just NOT declare the function.
        • But that doesn't work for v4.0.x, since the symbols will still be in the library: the compiler will only issue a warning about a missing prototype, and the code will compile and link successfully.
        • It was decided that this is okay, as long as the C11 static assert check is in mpi.h. Most users set 'no prototype' as an error.
    • Tests on v4.0.x started passing, but these are possibly false positives. We will look at how the IBM tests are passing with the #6278 issue on master and v4.0.x.
  • Should resolve https://github.com/open-mpi/ompi/issues/6198 before releasing
  • OOB TCP is ignoring virtual interfaces.
    • Will create an issue and solve it over email with code, rather than solving it on the phone.
    • Plan is to tell people to use an MCA parameter to specify loopback if there is no network parameter. In the future, continue working on the reachability framework; that is a 5.0 feature.
  • PR6306 - RegEx - they want to push this into v4.0.x. The problem is that any RegEx we come up with breaks in some special case. Worried about getting into a mode where, after fixing it for one case, another node-name convention comes along that breaks it. PRTE threw this framework out and just uses a PMIx parser. This PR would cause the PMIx parser to get out of sync, and we want the same answer out of both parsers. Need to open an issue on Open MPI to ensure we don't continue breaking patterns. Some ideas:
    • Don't try to do Reg-ex, and instead do compression.
    • Use a 3rd party existing reg-ex generator (generate a reg-ex from a list of hostnames)
    • PR6350 is the replacement PR. Ralph is updating it to fix a bug Giles found. After that, Ralph will commit PR6306 and then delete it with this PR. It will be a framework, but with disabled components.
    • Framework is opal_compress with bzip and gzip components.
      • zlib only - required for daemons.
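The C11 static assert idea discussed under issue 6278 above can be sketched roughly as follows. This is a minimal illustration and not the actual mpi.h code: demo_address, demo_get_address, DEMO_REMOVED, and DEMO_USE_REMOVED_SYMBOL are hypothetical names standing in for a removed function, its replacement, the helper macro, and a switch that lets the failing call be compiled on demand.

```c
#include <stdint.h>

/* Hypothetical replacement API that users should migrate to. */
static inline int demo_get_address(const void *location, intptr_t *address)
{
    *address = (intptr_t)location;
    return 0;
}

#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
/* C11 or later: any use of the removed name expands to a failing
 * static assertion, so the user gets a readable compile-time error. */
#define DEMO_REMOVED(func, newfunc) \
    _Static_assert(0, #func " was removed; use " #newfunc " instead")
#define demo_address(...) DEMO_REMOVED(demo_address, demo_get_address)
#else
/* Older compilers: demo_address is simply not declared. If the symbol
 * still exists in the library, a call only triggers a missing-prototype
 * warning (unless warnings are treated as errors) and still links. */
#endif

#ifdef DEMO_USE_REMOVED_SYMBOL
/* Compile with -DDEMO_USE_REMOVED_SYMBOL to see the error message. */
static void demo_caller(void)
{
    int x = 0;
    intptr_t a = 0;
    demo_address(&x, &a); /* fails to compile under C11 */
}
#endif
```

Built normally, this compiles cleanly and demo_get_address works as usual. Building with -DDEMO_USE_REMOVED_SYMBOL under a C11 compiler should stop at the demo_address call with the message from the static assertion.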

v5.0.0

  • Any Schedule for this yet? Summer of 2019
  • Discussion of schedule depends on scope discussion
    • If we want to separate ORTE out for that, it might delay things a bit.
    • Giles has a prototype of PRTE replacing ORTE
  • Want to open up release-manager elections.
    • Is anyone interested?
    • Need to decide by end of the month

Master

  • Fixed some big bugs dumping lots of cores on Cisco MTT last week.
  • Cisco is seeing a fair number of timeouts.
  • Cisco is getting a consistent SEGV in INT-Overflow.

PMIx

MTT

New topics

  • PMIX direct call / PRTE replacement for ORTE.
  • George
  • Howard has been changing the places in OMPI or OPAL that call the PMIx framework to use PMIx data structures directly in the code.
    • Doesn't look like Howard would step on Ralph's toes.
  • March 4th is next MPI Forum (then June)
  • We have a new open-mpi Slack channel for Open MPI developers.
    • Not for users, just developers.
    • Email Jeff if you're interested in being added.

face to face -

  • How do we get more participation and make MTT more meaningful?

Review Master Pull Requests

  • didn't discuss today.

Oldest PR

Oldest Issue


Status Updates:

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon
  4. Cisco, ORNL, UTK, NVIDIA

Back to 2018 WeeklyTelcon-2018
