chess.pgn.read_headers inserts empty header entries related to newlines and empty movetext #1087
Investigating a bit further, there seems to be some issue related to newlines between games. For example:
testcase = """
[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Li Chao2"]
[Black "Gashimov,V"]
[Result "0-1"]
[WhiteElo "2596"]
[BlackElo "2758"]
[EventDate "2009.11.21"]
[HashCode "00000000"]
[TotalPlyCount "0"]
{ Both Chinese players were late to the board for game two and were
defaulted }
0-1
[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Naiditsch,A"]
[Black "Svidler,P"]
[Result "1/2-1/2"]
[WhiteElo "2689"]
[BlackElo "2754"]
[ECO "C97"]
[Opening "Ruy Lopez"]
[Variation "closed, Chigorin defence"]
[EventDate "2009.11.21"]
[HashCode "312a25cf"]
[TotalPlyCount "121"]
1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 O-O 8.
c3 d6 9. h3 Na5 10. Bc2 c5 11. d4 Qc7 12. d5 c4 13. Nbd2 Nb7 14. Nf1 Nc5
15. Kh1 Bd7 16. Ng3 Ne8 17. Nh2 Bh4 18. Rf1 Qd8 19. Nf3 Bxg3 20. fxg3 f5
21. Ng5 h6 22. Ne6 Bxe6 23. dxe6 fxe4 24. Rxf8+ Kxf8 25. Qh5 Qf6 26. Be3
Nd3 27. Qg4 Nc7 28. Qxe4 Qxe6 29. Qh7 Qg8 30. Bxd3 cxd3 31. Qxd3 Qe6 32.
Qh7 Qg8 33. Rf1+ Ke7 34. Qg6 Rf8 35. Rd1 Rf6 36. Qg4 g5 37. Kh2 Qxa2 38.
Qc8 Ne8 39. Bb6 Kf8 40. Bd8 Rf7 41. h4 gxh4 42. Bxh4 Qc4 43. Qxa6 Kg7 44.
Qa8 Qg4 45. Ra1 Qd7 46. Qb8 Qc6 47. b4 Kg6 48. Ra8 Ng7 49. Ra2 Qd5 50. Ra6
Rd7 51. Rb6 Nf5 52. Rxb5 Qf7 53. Rb6 Kh7 54. Qa8 Nxh4 55. gxh4 Qf4+ 56. Kh1
Qxh4+ 57. Kg1 Rg7 58. Qf3 Qe1+ 59. Kh2 Qh4+ 60. Kg1 Qe1+ 61. Kh2 1/2-1/2
"""
import io
import chess.pgn

f = io.StringIO(testcase)
games = []
while True:
    headers = chess.pgn.read_headers(f)
    games.append(headers)
    if headers is None:
        break

Leads to games being (note the empty Headers() entry):
[Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Li Chao2', Black='Gashimov,V', Result='0-1', WhiteElo='2596', BlackElo='2758', EventDate='2009.11.21', HashCode='00000000', TotalPlyCount='0'),
Headers(),
Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Naiditsch,A', Black='Svidler,P', Result='1/2-1/2', WhiteElo='2689', BlackElo='2754', ECO='C97', Opening='Ruy Lopez', Variation='closed, Chigorin defence', EventDate='2009.11.21', HashCode='312a25cf', TotalPlyCount='121'),
None]

While the file from the original issue:
testcase = """
[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Li Chao2"]
[Black "Gashimov,V"]
[Result "0-1"]
[WhiteElo "2596"]
[BlackElo "2758"]
[EventDate "2009.11.21"]
[HashCode "00000000"]
[TotalPlyCount "0"]
{ Both Chinese players were late to the board for game two and were
defaulted }
0-1
[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Naiditsch,A"]
[Black "Svidler,P"]
[Result "1/2-1/2"]
[WhiteElo "2689"]
[BlackElo "2754"]
[ECO "C97"]
[Opening "Ruy Lopez"]
[Variation "closed, Chigorin defence"]
[EventDate "2009.11.21"]
[HashCode "312a25cf"]
[TotalPlyCount "121"]
1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 O-O 8.
c3 d6 9. h3 Na5 10. Bc2 c5 11. d4 Qc7 12. d5 c4 13. Nbd2 Nb7 14. Nf1 Nc5
15. Kh1 Bd7 16. Ng3 Ne8 17. Nh2 Bh4 18. Rf1 Qd8 19. Nf3 Bxg3 20. fxg3 f5
21. Ng5 h6 22. Ne6 Bxe6 23. dxe6 fxe4 24. Rxf8+ Kxf8 25. Qh5 Qf6 26. Be3
Nd3 27. Qg4 Nc7 28. Qxe4 Qxe6 29. Qh7 Qg8 30. Bxd3 cxd3 31. Qxd3 Qe6 32.
Qh7 Qg8 33. Rf1+ Ke7 34. Qg6 Rf8 35. Rd1 Rf6 36. Qg4 g5 37. Kh2 Qxa2 38.
Qc8 Ne8 39. Bb6 Kf8 40. Bd8 Rf7 41. h4 gxh4 42. Bxh4 Qc4 43. Qxa6 Kg7 44.
Qa8 Qg4 45. Ra1 Qd7 46. Qb8 Qc6 47. b4 Kg6 48. Ra8 Ng7 49. Ra2 Qd5 50. Ra6
Rd7 51. Rb6 Nf5 52. Rxb5 Qf7 53. Rb6 Kh7 54. Qa8 Nxh4 55. gxh4 Qf4+ 56. Kh1
Qxh4+ 57. Kg1 Rg7 58. Qf3 Qe1+ 59. Kh2 Qh4+ 60. Kg1 Qe1+ 61. Kh2 1/2-1/2
"""
import io
import chess.pgn

f = io.StringIO(testcase)
games = []
while True:
    headers = chess.pgn.read_headers(f)
    games.append(headers)
    if headers is None:
        break

Leads to games being (again plenty of empties):
[Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Li Chao2', Black='Gashimov,V', Result='0-1', WhiteElo='2596', BlackElo='2758', EventDate='2009.11.21', HashCode='00000000', TotalPlyCount='0'),
Headers(),
Headers(),
Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Naiditsch,A', Black='Svidler,P', Result='1/2-1/2', WhiteElo='2689', BlackElo='2754', ECO='C97', Opening='Ruy Lopez', Variation='closed, Chigorin defence', EventDate='2009.11.21', HashCode='312a25cf', TotalPlyCount='121'),
Headers(),
None]

So my code in the original issue is slightly wrong: it checks whether headers is falsy:
if not headers:
    break
instead of comparing it to None:
if headers is None:
    break
However, this is probably still a bug in the library, since an empty line probably shouldn't count as an empty game. Additionally, it is somehow related to the movetext being empty, since if we provide movetext we get a different result:
testcase = """
[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Li Chao2"]
[Black "Gashimov,V"]
[Result "0-1"]
[WhiteElo "2596"]
[BlackElo "2758"]
[EventDate "2009.11.21"]
[HashCode "00000000"]
[TotalPlyCount "0"]
1. e4 e5 0-1
[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Naiditsch,A"]
[Black "Svidler,P"]
[Result "1/2-1/2"]
[WhiteElo "2689"]
[BlackElo "2754"]
[ECO "C97"]
[Opening "Ruy Lopez"]
[Variation "closed, Chigorin defence"]
[EventDate "2009.11.21"]
[HashCode "312a25cf"]
[TotalPlyCount "121"]
1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 O-O 8.
c3 d6 9. h3 Na5 10. Bc2 c5 11. d4 Qc7 12. d5 c4 13. Nbd2 Nb7 14. Nf1 Nc5
15. Kh1 Bd7 16. Ng3 Ne8 17. Nh2 Bh4 18. Rf1 Qd8 19. Nf3 Bxg3 20. fxg3 f5
21. Ng5 h6 22. Ne6 Bxe6 23. dxe6 fxe4 24. Rxf8+ Kxf8 25. Qh5 Qf6 26. Be3
Nd3 27. Qg4 Nc7 28. Qxe4 Qxe6 29. Qh7 Qg8 30. Bxd3 cxd3 31. Qxd3 Qe6 32.
Qh7 Qg8 33. Rf1+ Ke7 34. Qg6 Rf8 35. Rd1 Rf6 36. Qg4 g5 37. Kh2 Qxa2 38.
Qc8 Ne8 39. Bb6 Kf8 40. Bd8 Rf7 41. h4 gxh4 42. Bxh4 Qc4 43. Qxa6 Kg7 44.
Qa8 Qg4 45. Ra1 Qd7 46. Qb8 Qc6 47. b4 Kg6 48. Ra8 Ng7 49. Ra2 Qd5 50. Ra6
Rd7 51. Rb6 Nf5 52. Rxb5 Qf7 53. Rb6 Kh7 54. Qa8 Nxh4 55. gxh4 Qf4+ 56. Kh1
Qxh4+ 57. Kg1 Rg7 58. Qf3 Qe1+ 59. Kh2 Qh4+ 60. Kg1 Qe1+ 61. Kh2 1/2-1/2
"""
import io
import chess.pgn

f = io.StringIO(testcase)
games = []
while True:
    headers = chess.pgn.read_headers(f)
    games.append(headers)
    if headers is None:
        break

Leads to (note that now there is no empty Headers() entry between the two games):
[Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Li Chao2', Black='Gashimov,V', Result='0-1', WhiteElo='2596', BlackElo='2758', EventDate='2009.11.21', HashCode='00000000', TotalPlyCount='0'),
Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Naiditsch,A', Black='Svidler,P', Result='1/2-1/2', WhiteElo='2689', BlackElo='2754', ECO='C97', Opening='Ruy Lopez', Variation='closed, Chigorin defence', EventDate='2009.11.21', HashCode='312a25cf', TotalPlyCount='121'),
Headers(),
None]
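As a caller-side workaround for the behaviour shown above, the loop can treat None as the only end-of-input signal and skip entries that come back empty. This is a minimal sketch, not part of python-chess: `iter_nonempty_headers` and `fake_read_headers` are hypothetical names, and the reader is passed in as a parameter so that the example runs without the library. In real code one would presumably pass `chess.pgn.read_headers`. Note that skipping empty entries would also drop a genuine game that has no tags at all.

```python
import io
from typing import Callable, Iterator, Mapping, Optional, TextIO

# Any callable with the shape of chess.pgn.read_headers: it takes a file
# handle and returns a header mapping, or None at the end of the input.
HeaderReader = Callable[[TextIO], Optional[Mapping[str, str]]]


def iter_nonempty_headers(
    handle: TextIO, read_headers: HeaderReader
) -> Iterator[Mapping[str, str]]:
    """Yield header mappings until the reader returns None, skipping empty ones."""
    while True:
        headers = read_headers(handle)
        if headers is None:  # only None means "end of input"
            return
        if not headers:  # empty mapping: spurious entry, keep reading
            continue
        yield headers


# Stand-in reader (hypothetical) that replays a sequence shaped like the
# second output above: real headers mixed with empty entries, then None.
_canned = iter([{"Event": "World Cup"}, {}, {}, {"ECO": "C97"}, {}, None])


def fake_read_headers(handle: TextIO) -> Optional[Mapping[str, str]]:
    return next(_canned)


found = list(iter_nonempty_headers(io.StringIO(""), fake_read_headers))
print(found)
```

The key point is the order of the two checks: `headers is None` must come before the truthiness test, because an empty mapping and None are both falsy.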
Title changed from "chess.pgn.read_headers stops reading headers after game with empty movetext" to "chess.pgn.read_headers inserts empty header entries related to newlines and empty movetext".
I have also had this problem. If I put a blank line between the games, it works. So:
Example 1, BAD: only the first game is parsed (no blank line between the games):
Example 2, GOOD: both games are parsed (a blank line between the games):
I think that both examples should work.
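Going by the observation that a blank line between games is enough, one possible workaround is to normalise the text before handing it to the parser. The following is only a heuristic sketch: `separate_games` and `TAG_LINE` are hypothetical names, and it assumes that a line that looks like a tag pair (`[Name "value"]`) and directly follows a non-tag, non-blank line starts a new game.

```python
import re

# A line consisting of a single PGN tag pair, e.g. [Event "World Cup"].
TAG_LINE = re.compile(r'^\[[A-Za-z0-9_]+\s+".*"\]\s*$')


def separate_games(pgn_text: str) -> str:
    """Insert a blank line before each tag section that directly follows movetext."""
    out = []
    previous = ""
    for line in pgn_text.splitlines():
        # A tag section starts where a tag line follows a non-tag line.
        starts_tags = bool(TAG_LINE.match(line)) and not TAG_LINE.match(previous)
        if starts_tags and previous.strip():
            out.append("")  # separator that was missing in the input
        out.append(line)
        previous = line
    return "\n".join(out) + "\n"


glued = '[Event "A"]\n[Result "0-1"]\n0-1\n[Event "B"]\n1. e4 e5 1/2-1/2\n'
print(separate_games(glued))
```

Input that already has blank lines between games passes through unchanged, so the function can be applied unconditionally.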
This is tricky to deal with for
a decision has to be made:
Currently the parser always does the latter. This is not a bug, because the PGN is invalid anyway, but maybe some heuristics can be added to better deal with it. Robustly handling all of this would require changing the API, so that the parser can look ahead one line, without necessarily consuming it. Pushing this back to 2.x, for that reason.
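To illustrate what "look ahead one line without consuming it" could mean on a plain file handle, here is a small sketch using `tell()` and `seek()`. The helper `peek_line` is a hypothetical name, not python-chess API. It only works on seekable streams, which may be one reason a stateless function that accepts arbitrary handles cannot rely on it.

```python
import io
from typing import TextIO


def peek_line(handle: TextIO) -> str:
    """Return the next line without consuming it ('' at end of input).

    Assumes the handle is seekable; pipes and similar streams are not.
    """
    position = handle.tell()
    line = handle.readline()
    handle.seek(position)  # rewind so the next read sees the same line
    return line


handle = io.StringIO('0-1\n[Event "World Cup"]\n')
handle.readline()  # consume the result line of the previous game
# The upcoming line can be inspected before deciding how to proceed.
print(peek_line(handle).startswith("["))
print(handle.readline())
```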
Hey niklasf, here we actually have a result marker; what is missing is the movetext. I guess a minimal testcase would be:
I didn't try it though, since I don't have Python on this computer.
I've written a class that can assist with looking ahead in a PGN without necessarily consuming the line. There are two methods that can be used to address the lookahead difficulties.
Here's the code, with a usage example below. Let me know if this could be useful.
from typing import Iterable, Iterator, Optional


class PreviewIterator:
    def __init__(self, source: Iterable[str]) -> None:
        self.source = iter(source)
        # Holds at most one line that was pushed back by the caller.
        self.putback_line: Optional[str] = None

    def __iter__(self) -> Iterator[str]:
        return self

    def __next__(self) -> str:
        # Serve the pushed-back line first, if there is one.
        if self.putback_line is not None:
            line = self.putback_line
            self.putback_line = None
            return line
        else:
            return next(self.source)

    def putback(self, line: str) -> None:
        self.putback_line = line

    def lookahead(self) -> Optional[str]:
        # Read the next line, then push it back so it is not consumed.
        try:
            line = next(self)
        except StopIteration:
            return None
        self.putback(line)
        return line


lines = ["first", "second", "third repeat", "fourth"]
line_iterator = PreviewIterator(lines)
for line in line_iterator:
    print(line)
    if line.endswith("repeat"):
        # str.removesuffix requires Python 3.9+.
        line_iterator.putback(line.removesuffix("repeat"))
print("")
line_iterator_2 = PreviewIterator(lines)
for line in line_iterator_2:
    print(line)
    look_ahead = line_iterator_2.lookahead()
    if look_ahead and look_ahead.endswith("repeat"):
        print("+++")

Output:
Yes. I think for 2.x I'd like to replace the stateless
class PgnReader:
    def __init__(self, f: file): ...
    def read_game(self) -> Optional[Game]: ...
that can internally use a
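For illustration only, here is a rough sketch of how a stateful reader with one line of lookahead might behave. It is not the planned python-chess design: `SketchPgnReader`, its `read_headers` method, and the `TAG_RE` pattern are all hypothetical, and the sketch collects tag pairs into a plain dict while ignoring the movetext entirely. The reader remembers a tag line that it read while skipping movetext, so the next call starts with that line instead of losing it.

```python
import io
import re
from typing import Dict, Optional, TextIO

# Matches one tag pair per line, e.g. [Event "World Cup"].
TAG_RE = re.compile(r'^\[([A-Za-z0-9_]+)\s+"(.*)"\]\s*$')


class SketchPgnReader:
    def __init__(self, handle: TextIO) -> None:
        self._handle = handle
        self._pending: Optional[str] = None  # one line of lookahead

    def _next_line(self) -> str:
        if self._pending is not None:
            line, self._pending = self._pending, None
            return line
        return self._handle.readline()  # '' at end of input

    def read_headers(self) -> Optional[Dict[str, str]]:
        line = self._next_line()
        while line and not line.strip():  # skip blank lines before a game
            line = self._next_line()
        if not line:
            return None  # end of input, not an empty game
        headers: Dict[str, str] = {}
        while line:  # tag section
            match = TAG_RE.match(line)
            if match is None:
                break
            headers[match.group(1)] = match.group(2)
            line = self._next_line()
        while line:  # movetext: skip until the next tag line or end of input
            if TAG_RE.match(line):
                self._pending = line  # belongs to the next game, keep it
                break
            line = self._next_line()
        return headers


text = '[Event "A"]\n{ defaulted }\n0-1\n[Event "B"]\n\n1. e4 e5 1/2-1/2\n'
reader = SketchPgnReader(io.StringIO(text))
print(reader.read_headers())
print(reader.read_headers())
print(reader.read_headers())
```

Because the lookahead lives on the object rather than in the file handle, this approach does not need a seekable stream, and trailing blank lines at the end of the input yield None rather than an empty entry.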
I am trying to parse a largeish (7,000,000 games) PGN using read_headers. However, I only managed to scan 84,039 games before it stopped as if it had finished (no error message). I managed to narrow it down to this testcase:
Which prints: