Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Style guide #3

Open
milancurcic opened this issue Dec 14, 2019 · 71 comments
Open

Style guide #3

milancurcic opened this issue Dec 14, 2019 · 71 comments
Labels
meta Related to this repository

Comments

@milancurcic
Copy link
Member

Use this issue to discuss the code style for the stdlib.

The most widely supported elements of style will eventually be merged into the Style Guide for contributors.

@milancurcic milancurcic changed the title Style Guide Style guide Dec 14, 2019
This was referenced Dec 14, 2019
@certik
Copy link
Member

certik commented Dec 16, 2019

@ivan-pi wrote in #11 (comment):

For example for the function that checks whether a character is a letter some of the possible names are:

  • is_alpha
  • isalpha (C/C++)
  • isAlpha (D language, but in Fortran it's the same as the previous suggestion)

I have discussed this particular issue with several people many years ago and we eventually converged towards the following compromise:

https://www.fortran90.org/src/best-practices.html#naming-convention

So in the above case, isAlpha would not be used, but both is_alpha and isalpha would be allowed, and in this case isalpha would be preferred (only two syllables). If the name was is_char_alpha vs ischaralpha, then the underscore version would be preferred because it is three syllables. However, if the API had lots of methods like is_alpha, is_char_alpha, is_not_alpha, then for consistency it should be named is_* with underscore.

Let me know what you think.

P.S. Some examples from current Fortran. No underscore: matmul, maxloc, minloc, iachar, adjustl, adjustr, maxexponent, minexponent, rrspacing, ishftc. Underscore: set_exponent, dot_product, bit_size, cpu_time. The standard is not consistent (e.g., maxexponent vs set_exponent) but it seems to be very close to the rule of 2 syllables -> no underscore, otherwise use underscore (e.g. matmul vs dot_product).

@ivan-pi
Copy link
Member

ivan-pi commented Dec 17, 2019

isalpha contains three syllables.

In general I agree with your suggested naming conventions and also the ones in Fortran Best Practices. To stay close to the standard, I think it's best if we avoid CamelCase.

Specifically, for the ascii functions there are 11 functions starting with is, some are short (is_alpha), but some are long (is_hex_digit, is_octal_digit). I suppose it's easiest to just follow the C library and get rid of the underscores and shorten the names to 7-8 characters (isalpha, iszdigit - z is used in hexadecimal boz constants, isodigit - o is used in octal boz constants).

@certik
Copy link
Member

certik commented Dec 17, 2019 via email

@milancurcic
Copy link
Member Author

I'm on the same page here. I like all lowercase with underscores universally. For short variable and procedure names, dropping the underscore is OK I think. It's nice to have a universal rule to follow (always use underscores to separate words), but for some short names I think it's OK to make an exception.

I don't think the Standard being inconsistent (maxexponent and set_exponent) should be an excuse for us to be "consistently inconsistent". There's also a difference between names that are made of whole words (e.g. dot_product is nicer than dotproduct), vs. portmanteaus like matmul (instead of mat_mul which seems awkward).

In this specific case, even though iszdigit and isodigit are short and nice, there's a risk of them being less clear what they mean. To my eye, is_hex_digit and is_octal_digit were obvious. For iszdigit and isodigit, I had to read Ivan's explanation to understand. So I slightly prefer the underscore variants here, but I'm not hell-bent on it.

@zbeekman
Copy link
Member

Yes, I generally agree but for assertion-y stuff I think separating the verb with an underscore makes sense is_alpha. The legibility is much improved and client code can always consume it as a different name through use association.

use fortran_stdlib, only: is_alpha => isalpha

@zbeekman
Copy link
Member

Automation of pedantry

I think that we should also adopt solutions like Editorconfig, possibly findent cmake-format (if we use CMake) and clang-format (for any C code). Tools like these that integrate with IDEs and editors will help get everyone on the same page allowing them to focus on semantics and code rather than style and formatting. In addition, I've really been appreciating and enjoying pre-commit to codify conventions on a per-project basis, and help catch silly issues before developers even commit/push code.

@certik
Copy link
Member

certik commented Dec 17, 2019

Yes, I agree with @zbeekman we should try to automate as much as possible (formatting, release notes, etc.).

And rather use our time and effort to discuss the actual API, not how it is formatted.

I agree that is_alpha looks better in this case. I would caution against recommending use fortran_stdlib, only: is_alpha => isalpha, because that is a pain to have to do that --- rather, let's do our best to choose the name well, and recommend people to use it. That way all the codes will use the same name and it will be easy for people reading the code to understand.

@smwesten-usgs
Copy link

In this specific case, even though iszdigit and isodigit are short and nice, there's a risk of them being less clear what they mean. To my eye, is_hex_digit and is_octal_digit were obvious. For iszdigit and isodigit, I had to read Ivan's explanation to understand. So I slightly prefer the underscore variants here, but I'm not hell-bent on it.

Should I get a vote, I also prefer more verbose but hopefully clearer names, and prefer lowercase and underscores universally:
is_hex_digit > iszdigit
is_octal_digit > isodigit

Regarding names of string functions/subroutines, I'd definitely prefer more verbose and explicit over the short and cryptic. Thus:
as_integer > atoi
as_double > atof

@certik
Copy link
Member

certik commented Dec 17, 2019

@smwesten-usgs thanks for the feedback. Yes, I agree on the examples you gave that underscores look better.

Here are some examples where no underscores I think look better (from NumPy and Matlab):

  • linspace vs lin_space
  • meshgrid vs mesh_grid
  • argsort vs arg_sort

@longb
Copy link

longb commented Dec 17, 2019

Just speculation, but I suspect the difference between the underscore usage between maxexponent and set_exponent is that max is already the name of a different intrinsic function.

@zbeekman
Copy link
Member

I agree, and for the record, I wasn't suggesting we pick whatever and let people rename through use association; Good names from the get-go are important.

@zbeekman
Copy link
Member

Can we make some choices about indentation and tabs vs spaces? My preference is indent all indent-able things by two spaces; never use tabs. We can codify this in an editor-config file.

@certik
Copy link
Member

certik commented Dec 22, 2019

My preference is 4 spaces, no tabs.

@zbeekman
Copy link
Member

A quick note: I'm happy with 2 or 4 but with the 132 character line limit, you can start to run out of real estate pretty quickly with 4 spaces.

@certik
Copy link
Member

certik commented Dec 22, 2019 via email

@zbeekman
Copy link
Member

80 characters per line and 4 spaces?! That seems like there will be an awful lot of line continuations and leading blanks... findent defaults to 2 spaces. The standard caps line length at 132 chars. I think we should shorten the line length allowed XOR pick 4 spaces of indentation, but I'd be weary to use both at once.

@certik
Copy link
Member

certik commented Dec 22, 2019

Yes, I've been using 80 characters and 4 spaces in all my projects.

But whatever we decide is fine with me. Here are my priorities for stdlib:

  • The Fortran sources themselves: simple, well organized modules, super well designed API, building with "any" compiler on "any" platform, with no warnings, and excellent performance in Release mode.

  • A build system that works on all platforms (including Windows) and is "simple"

  • Good tests

  • Good CI infrastructure that supports the above

How we format the files are low on my priorities list, so whatever makes us work together works for me.

@ivan-pi
Copy link
Member

ivan-pi commented Dec 22, 2019

I generally use 4 spaces (same as Python and most C++ codes I've seen), but I also don't mind if we agree to use 2 and I agree with the points of @certik. Generally, I try to stick to an 80 character limit, but I am happy to break the rule if it makes the indentation nicer.

@zbeekman
Copy link
Member

OK, let's go with 4 then, and we could admonish users to be within 80 characters (or 100, or whatever) but enforce a hard limit of 132 during CI. I'm working on getting a style document updated with comments from here and an .editorconfig file as well as CI testing and enforcement.

@longb
Copy link

longb commented Dec 22, 2019

I generally use whatever the number of spaces created by emacs when I hit "tab". Note that the line length limit will probably get really long (10,000 characters) in the next standard, so debates over 80 versus 132 are a bit dated. Personally I prefer to be able to read a line of code in a terminal window without horizontal scrolling. But I think that most "style guides" include a lot of unnecessary "nanny" rules that don't matter that much. I think a rule like "a sequence of lines that form a block in Fortran should be indented compared to the code lines that delineate the block" is sufficient. Much more relevant (in my view) would be code style rules such as "do not use include lines", or "never use preprocessing directives", or "never use tools like configure, cmake, or spack" would be more useful and relevant to Fortran.

@zbeekman
Copy link
Member

@longb haha "never use tools like ... cmake"

Sometimes there are necessary evils.

I agree about reading things in the terminal. Thats the spirit of creating a limit.

Concerning emacs, it's easy to setup to respect .editorconfig.

Here is what I have in my .emacs (I use use-package)

(use-package editorconfig
  :ensure t
  :after ws-butler
  :init
  (setq editorconfig-trim-whitespaces-mode
         'ws-butler-mode)
  :config
  (editorconfig-mode 1)
  ;; Always use tabs for Makefiles
  (add-hook 'editorconfig-hack-properties-functions
	    '(lambda (props)
	       (when (derived-mode-p 'makefile-mode)
		 (puthash 'indent_style "tab" props)))))

zbeekman added a commit to zbeekman/stdlib that referenced this issue Dec 22, 2019
@zbeekman
Copy link
Member

I've started trying to capture this discussion in PR #42

@milancurcic
Copy link
Member Author

I prefer 2 spaces to indent, and try (but sometimes fail) to stay within 80-character line length. This is not for compatibility reasons, but for the same reason as Python recommendation -- I work in terminals of similar width and don't like the lines to wrap around.

I think consistency is more important than any specific style "rule". I also don't think we need style enforcement (just yet). A brief style guide with recommendations and common sense go a long way.

@longb
Copy link

longb commented Dec 23, 2019

Some style suggestions that are probably obvious, but seem to be violated in some user code:

  1. Do not use language features that have been deleted (PAUSE, Arithmetic IF, ...).
  2. Avoid using language features that are obsolescent (such as COMMON).
  3. Avoid vendor syntax extensions (such as REAL*8), especially when there are standard alternatives.
  4. Avoid vendor-specific, nonstandard intrinsic procedures (such as ETIME).

Violating these leads to portability problems.

@longb
Copy link

longb commented Jan 30, 2020 via email

@longb
Copy link

longb commented Jan 30, 2020 via email

@certik
Copy link
Member

certik commented Jan 30, 2020

Comparison with Python or Matlab does not seem relevant. In both cases, native code is very slow, and they have to resort to libraries for any reasonable performance.

They do not do it for performance reasons, but for convenience reasons. If you look at NumPy's implementation:

https://github.com/numpy/numpy/blob/d9b1e32cb8ef90d6b4a47853241db2a28146a57d/numpy/core/_methods.py#L134

this is not more performing than doing it by hand in NumPy. So performance is not the reason why NumPy has the API. Rather, I suspect, the reason is that users like to use such a function.

@milancurcic
Copy link
Member Author

@longb The comparison to Python and MATLAB is quite relevant because it informs our API design. We share much of the target audience with those and some other languages.

@leonfoks
Copy link

Welford's method takes care of two things. Possible numerical instability; it's a single pass algorithm. It's a choose your poison scenario as with most robust computation algorithms.

Regardless of the examples we use, and implementation we finally decide on for a stdlib that is generally applicable, I also think a lot of people will appreciate a centralized statistics based module. More and more people are using Python and Matlab over Fortran, following their direction at this point is absolutely the way to go, and saves us a lot of effort too!

Discussions about individual algorithms is key to developing the best choice in each case. If the choice of any one algorithm is a bottleneck in a user's code, hopefully they give us some feedback, but they are also free to roll their own version of that algorithm.

@gronki
Copy link

gronki commented Jan 31, 2020 via email

@gronki
Copy link

gronki commented Jan 31, 2020 via email

@certik
Copy link
Member

certik commented Jan 31, 2020 via email

@urbanjost
Copy link

Even simple routines should consider robustness not just speed; and in a perfect world avoid pitfalls users are not always considering, such as data conditioning. Are you more impressed by a compiler whose SUM() function returns the value of OUTLIER() on the first call or one that generates the last value calculated no matter where the OUTLIER() value is placed in the ARR() array?

program testit
   integer,parameter :: elements=10000000
   real    :: arr(elements)
   real    :: summary
   real    :: outlier=huge(summary)/10.0**30
   integer :: i
   arr=1.0
   arr(1)=outlier
   write(*,*)'biggest=',arr(1)

   summary=0.0
   do i=1,elements
      summary=summary+arr(i)
   enddo
   write(*,*)'loop sum biggest first=',summary
   write(*,*)'biggest first sum=',sum(arr)

   summary=0.0
   arr(1)=1.0
   arr(elements)=outlier
   do i=1,elements
      summary=summary+arr(i)
   enddo
   write(*,*)'loop sum biggest last=',summary
   write(*,*)'biggest last sum=',sum(arr)
end program testit

Do you get this:
biggest= 340282336.
loop sum biggest first= 340282336.
biggest first sum= 340282336.
loop sum biggest last= 350282336.
biggest last sum= 350282336.

or is "biggest first sum" 350282336. ?

This is an artificial case, but reflects real-world errors that often happen in real-world problems. So I can easily imagine a MEAN() function that is a lot more substantial than just replacing a single line;
and a MASK ability is very useful but can run into similar issues like having a mask that says X.ne.REAL_VALUE, for example. Does speed outweigh robustness in stdlib? I have to admit I have seen fewer intrinsic implementations that condition data lately. Is the cost of such checks so high we not want them for the "typical" datasets?

@certik
Copy link
Member

certik commented Jan 31, 2020

There can be different versions of sum, e.g., Julia has a regular sum and also sum_kbn for a much slower, but more accurate sum algorithm.

(All my production applications do not depend on the order of summation, but for some users that can be helpful, and in that case they can call such different method.)

Anyway, this is not related to the style guide. Please open separate issues for that. And even better, let's start discussing and submitting PRs for all this functionality, so that we can get towards our goal sooner.

@jvdp1
Copy link
Member

jvdp1 commented Jan 31, 2020

As you can see, the “simple” method is faster.

If a statistic or any kind of numerical algorithm is to make it in the
stdlib, I think it should be highly optimized or not implemented at all
(since, as Bill Long said, Fortran has enough intrinsic to not need a
function for average). If the algorithm is sub-optimal and not using the
best vectorization possible then nobody will use it and it will be dead.

In a general way, IMO the algorithms must be first robust and accurate (at least as mush as the intrinsic functions; see comment) for a vast majority of the cases available in the community (and not for a personal/specific case). What is the point to have a fast and very efficient algorithm that returns a wrong answer in some general cases? E.g., the "simple" method in comment is faster (while I think most of you could optimize the other one), but in many of my real cases this "faster" function may return a wrong answer. So, there is a trade-off between accuracy/robustness and efficiency, and this may vary depending on the real scenarios. We need to weight advantages and disavantages of both (e.g, I would be happy if the faster method is implemented but the spec should mention its limitations if its robustness is below the one of intrinsic functions).
Finally, I think stdlib should try to answer the need of the community that probably includes a large number of "average" Fortran users , and not the need of the current stdlib developers (who have most likely already something like stdlib for their own use). "Average" Fortran users would probably prefer (real case in my field):

integer(int8) :: x(:,:)
print*, mean(x, x < 3) !dp result

over

integer(int8) :: x(:,:)
print*, sum(real(x,dp), x < 3) / real(count(x < 3, kind = int64), dp)

If you are interested in the discussion about descriptive statistics and want to participate to it, please see #113 and #128 .
for discussion on a possible trade-off between effeciency and robustness/accuracy, see #134.

I am sorry to have opened a discussion about utility/usefulness/efficiency of specific procedures (mean, variance) (it was not my aim). I hope this issue will restart where it was, i.e., here

@ivan-pi
Copy link
Member

ivan-pi commented Feb 15, 2021

Should we start a Wiki page for the style guide? I believe we have already converged to some common practices in the current stdlib modules.

Edit: I forgot we already have the STYLE_GUIDE.md. I will see to create a pull request once I have something.

I've been using the Google C++ Style Guide lately, and found it very helpful to maintain clean code and refresh my memory on best practices. The table of contents is very practical to quickly locate the desired language feature.

A few similar (project-specific) Fortran style guides are:

Other inspiration:

@longb
Copy link

longb commented Feb 15, 2021 via email

awvwgk pushed a commit that referenced this issue Jun 3, 2021
Update of the specs for stdlib_sorting
awvwgk pushed a commit that referenced this issue Jun 12, 2021
Add general tester against intrinsic array slice
@wyphan
Copy link
Contributor

wyphan commented Jun 17, 2021

What's the current consensus on the space in an end block keyword, e.g. enddo or end do?

@certik
Copy link
Member

certik commented Jun 21, 2021

I think the consensus is to use end do, and simply use the amount of spaces that is already used in a given file for consistency.

milancurcic pushed a commit that referenced this issue Aug 22, 2021
renamed strings_format_string to string_format_string
@awvwgk awvwgk added the meta Related to this repository label Sep 18, 2021
@tueda
Copy link

tueda commented Dec 3, 2021

I have just found a few unnecessary semicolons at the end of statements in stdlib_ascii.fypp, for example,

is_octal_digit = (c >= '0') .and. (c <= '7');

which may be a style issue for those who use both Fortran and C-like languages and hopefully be avoided by the use of some linters in future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
meta Related to this repository
Projects
None yet
Development

No branches or pull requests