dataset: code comprehension #805

Merged on Sep 30, 2024 (32 commits)
c8fa4fb
add code comprehension dataset
SiQube Sep 18, 2024
20b69ab
add code comprehension dataset tests
SiQube Sep 18, 2024
268966f
update docstring of utils downloads (#838)
SiQube Sep 27, 2024
14a03f8
update docstring of copco dataset definition (#821)
SiQube Sep 27, 2024
580eae7
update docstring of dataset definition (#820)
SiQube Sep 27, 2024
4532de2
update docstring of fakenews dataset definition (#824)
SiQube Sep 27, 2024
6251849
update docstring of gaze screen (#837)
SiQube Sep 27, 2024
5e0297b
update docstring of gaze experiment (#836)
SiQube Sep 27, 2024
be0501c
update docstring of events processing (#835)
SiQube Sep 27, 2024
d3eecee
update docstring of toy_dataset_eyelink dataset definition (#834)
SiQube Sep 27, 2024
fd8dad4
update docstring of toy_dataset dataset definition (#833)
SiQube Sep 27, 2024
1ad8db7
update docstring of sb_sat dataset definition (#832)
SiQube Sep 27, 2024
5903c93
update docstring of potec dataset definition (#831)
SiQube Sep 27, 2024
d9e2b2a
update docstring of judo1000 dataset definition (#830)
SiQube Sep 27, 2024
fdab5de
update docstring of hbn dataset definition (#829)
SiQube Sep 27, 2024
242aabe
update docstring of gazebasevr dataset definition (#828)
SiQube Sep 27, 2024
02f701f
update docstring of gazebase dataset definition (#827)
SiQube Sep 27, 2024
fa3025d
update docstring of gazes_on_faces dataset definition (#826)
SiQube Sep 27, 2024
6d22c5a
update docstring of gaze_graph dataset definition (#825)
SiQube Sep 28, 2024
d31b236
update docstring of emtec dataset definition (#823)
SiQube Sep 28, 2024
eba7020
update docstring of didec dataset definition (#822)
SiQube Sep 28, 2024
cc67adb
update transforms tests to account for adding single line (#815)
SiQube Sep 28, 2024
4792904
update binocular example, space raises converting problems for polars…
SiQube Sep 28, 2024
ba07738
missed ColumnNotFoundError (#816)
SiQube Sep 28, 2024
81b8518
upgrade to polars 1+ (#809)
SiQube Sep 28, 2024
08a1399
manually update pre-commit config (#818)
SiQube Sep 28, 2024
ee44006
build: update nbconvert requirement from <7.14,>=7.0.0 to >=7.16.4,<7…
dependabot[bot] Sep 28, 2024
eee4f76
update pydoclint in pre-commit config (#819)
SiQube Sep 29, 2024
dbad9c1
feat!: Custom patterns for parsing logged metadata in ASC files (#767)
saeub Sep 29, 2024
8500aa9
enable downloading precomputed events for copco dataset (#840)
SiQube Sep 30, 2024
504e398
update sampling rate due to mutually exclusive sampling rate None and…
SiQube Sep 30, 2024
cdaa8fb
Merge branch 'main' into dataset-code-comprehension
SiQube Sep 30, 2024
19 changes: 19 additions & 0 deletions docs/source/bibliography.bib
@@ -1,3 +1,22 @@
@article{CodeComprehension,
author = {Alakmeh, Tarek and Reich, David and J\"{a}ger, Lena and Fritz, Thomas},
title = {Predicting Code Comprehension: A Novel Approach to Align Human Gaze with Code using Deep Neural Networks},
year = {2024},
issue_date = {July 2024},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {1},
number = {FSE},
url = {https://doi.org/10.1145/3660795},
doi = {10.1145/3660795},
abstract = {The better the code quality and the less complex the code, the easier it is for software developers to comprehend and evolve it. Yet, how do we best detect quality concerns in the code? Existing measures to assess code quality, such as McCabe’s cyclomatic complexity, are decades old and neglect the human aspect. Research has shown that considering how a developer reads and experiences the code can be an indicator of its quality. In our research, we built on these insights and designed, trained, and evaluated the first deep neural network that aligns a developer’s eye gaze with the code tokens the developer looks at to predict code comprehension and perceived difficulty. To train and analyze our approach, we performed an experiment in which 27 participants worked on a range of 16 short code comprehension tasks while we collected fine-grained gaze data using an eye tracker. The results of our evaluation show that our deep neural sequence model that integrates both the human gaze and the stimulus code, can predict (a) code comprehension and (b) the perceived code difficulty significantly better than current state-of-the-art reference methods. We also show that aligning human gaze with code leads to better performance than models that rely solely on either code or human gaze. We discuss potential applications and propose future work to build better human-inclusive code evaluation systems.},
journal = {Proc. ACM Softw. Eng.},
month = {jul},
articleno = {88},
numpages = {23},
keywords = {code comprehension, code-fixation attention, eye-tracking, lab experiment, neural networks}
}

@inproceedings{CopCoL1Hollenstein,
title = "The Copenhagen Corpus of Eye Tracking Recordings from Natural Reading of {D}anish Texts",
author = {Hollenstein, Nora and
3 changes: 3 additions & 0 deletions src/pymovements/datasets/__init__.py
@@ -25,6 +25,7 @@
:toctree:
:template: class.rst

pymovements.datasets.CodeComprehension
pymovements.datasets.CopCo
pymovements.datasets.DIDEC
pymovements.datasets.EMTeC
@@ -47,6 +48,7 @@
pymovements.datasets.ToyDataset
pymovements.datasets.ToyDatasetEyeLink
"""
from pymovements.datasets.codecomprehension import CodeComprehension
from pymovements.datasets.copco import CopCo
from pymovements.datasets.didec import DIDEC
from pymovements.datasets.emtec import EMTeC
@@ -64,6 +66,7 @@


__all__ = [
'CodeComprehension',
'CopCo',
'DIDEC',
'EMTeC',
198 changes: 198 additions & 0 deletions src/pymovements/datasets/codecomprehension.py
@@ -0,0 +1,198 @@
# Copyright (c) 2022-2024 The pymovements Project Authors
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
"""Provides a definition for the CodeComprehension dataset."""
from __future__ import annotations

from dataclasses import dataclass
from dataclasses import field
from typing import Any

import polars as pl

from pymovements.dataset.dataset_definition import DatasetDefinition
from pymovements.dataset.dataset_library import register_dataset
from pymovements.gaze.experiment import Experiment


@dataclass
@register_dataset
class CodeComprehension(DatasetDefinition):
    """CodeComprehension dataset :cite:p:`CodeComprehension`.

    This dataset includes eye-tracking-while-code-reading data from participants in a single
    session. Eye movements are recorded at a sampling frequency of 1,000 Hz using an
    EyeLink 1000 eye tracker and are provided as pixel coordinates.

    The participant is instructed to read the code snippet and answer a code comprehension question.

    Attributes
    ----------
    name: str
        The name of the dataset.

    has_files: dict[str, bool]
        Indicate whether the dataset contains 'gaze', 'precomputed_events', and
        'precomputed_reading_measures'.

    mirrors: dict[str, tuple[str, ...]]
        A tuple of mirrors of the dataset. Each entry must be of type `str` and end with a '/'.

    resources: dict[str, tuple[dict[str, str], ...]]
        A tuple of dataset gaze_resources. Each list entry must be a dictionary with the following
        keys:
        - `resource`: The url suffix of the resource. This will be concatenated with the mirror.
        - `filename`: The filename under which the file is saved.
        - `md5`: The MD5 checksum of the respective file.

    extract: dict[str, bool]
        Decide whether to extract the data.

    experiment: Experiment
        The experiment definition.

    filename_format: dict[str, str]
        Regular expression which will be matched before trying to load the file. Named groups will
        appear in the `fileinfo` dataframe.

    filename_format_schema_overrides: dict[str, dict[str, type]]
        If named groups are present in the `filename_format`, this makes it possible to cast
        specific named groups to a particular datatype.

    trial_columns: list[str]
        The name of the trial columns in the input data frame. If the list is empty or None,
        the input data frame is assumed to contain only one trial. If the list is not empty,
        the input data frame is assumed to contain multiple trials and the transformation
        methods will be applied to each trial separately.

    time_column: str
        The name of the timestamp column in the input data frame. This column will be renamed to
        ``time``.

    time_unit: str
        The unit of the timestamps in the timestamp column in the input data frame. Supported
        units are 's' for seconds, 'ms' for milliseconds and 'step' for steps. If the unit is
        'step' the experiment definition must be specified. All timestamps will be converted to
        milliseconds.

    pixel_columns: list[str]
        The name of the pixel position columns in the input data frame. These columns will be
        nested into the column ``pixel``. If the list is empty or None, the nested ``pixel``
        column will not be created.

    column_map: dict[str, str]
        The keys are the columns to read, the values are the names to which they should be renamed.

    custom_read_kwargs: dict[str, dict[str, Any]]
        If specified, these keyword arguments will be passed to the file reading function.

    Examples
    --------
    Initialize your :py:class:`~pymovements.dataset.Dataset` object with the
    :py:class:`~pymovements.datasets.CodeComprehension` definition:

    >>> import pymovements as pm
    >>>
    >>> dataset = pm.Dataset("CodeComprehension", path='data/CodeComprehension')

    Download the dataset resources:

    >>> dataset.download()  # doctest: +SKIP

    Load the data into memory:

    >>> dataset.load()  # doctest: +SKIP
    """

    # pylint: disable=similarities
    # The PublicDatasetDefinition child classes potentially share code chunks for definitions.

    name: str = 'CodeComprehension'

    has_files: dict[str, bool] = field(
        default_factory=lambda: {
            'gaze': False,
            'precomputed_events': True,
            'precomputed_reading_measures': False,
        },
    )

    mirrors: dict[str, tuple[str, ...]] = field(
        default_factory=lambda: {
            'precomputed_events': ('https://zenodo.org/',),
        },
    )

    resources: dict[str, tuple[dict[str, str], ...]] = field(
        default_factory=lambda: {
            'precomputed_events': (
                {
                    'resource':
                    'records/11123101/files/Predicting%20Code%20Comprehension%20Package'
                    '.zip?download=1',
                    'filename': 'data.zip',
                    'md5': '3a3c6fb96550bc2c2ddcf5d458fb12a2',
                },
            ),
        },
    )

    extract: dict[str, bool] = field(default_factory=lambda: {'precomputed_events': True})

    experiment: Experiment = Experiment(
        screen_width_px=None,
        screen_height_px=None,
        screen_width_cm=None,
        screen_height_cm=None,
        distance_cm=None,
        origin=None,
        sampling_rate=2000,
    )

    filename_format: dict[str, str] = field(
        default_factory=lambda: {
            'precomputed_events': r'fix_report_P{subject_id:s}.txt',
        },
    )

    filename_format_schema_overrides: dict[str, dict[str, type]] = field(
        default_factory=lambda: {
            'precomputed_events': {'subject_id': pl.Utf8},
        },
    )

    trial_columns: list[str] = field(default_factory=lambda: [])

    time_column: str = ''

    time_unit: str = ''

    pixel_columns: list[str] = field(default_factory=lambda: [])

    column_map: dict[str, str] = field(default_factory=lambda: {})

    custom_read_kwargs: dict[str, dict[str, Any]] = field(
        default_factory=lambda: {
            'precomputed_events': {
                'separator': '\t',
                'null_values': '.',
                'quote_char': '"',
            },
        },
    )
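
Reviewer note on `filename_format`: the docstring says the pattern is matched against file names and that named groups end up in the `fileinfo` dataframe. The sketch below illustrates that idea with the standard library only. It is not the pymovements implementation; the helper name `curly_to_regex_sketch` and the file name `fix_report_P007.txt` are hypothetical and chosen for illustration.

```python
import re


def curly_to_regex_sketch(pattern: str) -> re.Pattern:
    """Hypothetical helper: turn '{name:s}' or '{name:d}' fields into named regex groups."""
    field_re = re.compile(r'\{(?P<name>\w+)(?::(?P<kind>[sd]))?\}')
    parts = []
    pos = 0
    for m in field_re.finditer(pattern):
        # Literal text between fields is escaped so that '.' only matches a dot.
        parts.append(re.escape(pattern[pos:m.start()]))
        # ':d' fields match digits only; ':s' (or untyped) fields match any text, lazily.
        group = r'\d+' if m.group('kind') == 'd' else r'.+?'
        parts.append(f"(?P<{m.group('name')}>{group})")
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile(''.join(parts))


# Pattern taken from the dataset definition; the file name is an assumed example.
regex = curly_to_regex_sketch(r'fix_report_P{subject_id:s}.txt')
match = regex.fullmatch('fix_report_P007.txt')
fileinfo = match.groupdict() if match else {}
print(fileinfo)
```

Under this sketch the captured `subject_id` stays a string, which is consistent with the `pl.Utf8` override in `filename_format_schema_overrides` (leading zeros would be lost if it were cast to an integer).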
1 change: 1 addition & 0 deletions tests/unit/datasets/datasets_test.py
@@ -31,6 +31,7 @@
('public_dataset', 'dataset_name'),
# XXX: add public dataset in alphabetical order
[
pytest.param(pm.datasets.CodeComprehension, 'CodeComprehension', id='CodeComprehension'),
pytest.param(pm.datasets.CopCo, 'CopCo', id='CopCo'),
pytest.param(pm.datasets.DIDEC, 'DIDEC', id='DIDEC'),
pytest.param(pm.datasets.EMTeC, 'EMTeC', id='EMTeC'),
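
Reviewer note on `mirrors` and `resources` in `codecomprehension.py`: the docstring states that the `resource` suffix is concatenated with the mirror and that `md5` is the checksum of the downloaded file. A minimal standard-library sketch of both steps follows. The helper `md5_matches` is hypothetical and is not part of pymovements; no download is performed here, and the empty payload is only a stand-in to show the comparison.

```python
import hashlib

# Values copied from the dataset definition in this PR.
mirror = 'https://zenodo.org/'
resource = {
    'resource':
    'records/11123101/files/Predicting%20Code%20Comprehension%20Package'
    '.zip?download=1',
    'filename': 'data.zip',
    'md5': '3a3c6fb96550bc2c2ddcf5d458fb12a2',
}

# The mirror must end with '/', so plain concatenation yields the full URL.
url = mirror + resource['resource']


def md5_matches(payload: bytes, expected_md5: str) -> bool:
    """Hypothetical check: compare the MD5 hex digest of the payload with the expected value."""
    return hashlib.md5(payload).hexdigest() == expected_md5


print(url)
# An empty stand-in payload does not match the checksum recorded for the real archive.
print(md5_matches(b'', resource['md5']))
```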