Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Getting it to work with multibyte names #70

Open
sinnbeck opened this issue Feb 9, 2024 · 11 comments
Open

Getting it to work with multibyte names #70

sinnbeck opened this issue Feb 9, 2024 · 11 comments

Comments

@sinnbeck
Copy link

sinnbeck commented Feb 9, 2024

Has anyone had any luck getting this package to work with multi byte names?
If I fix the regex to allow it I get errors like

error: `/run/user/1000/doc/e43f6ab4/test.sav' near offset 0x9f640: Unrecognized record type 1443443282.

The names include danish letters like æøå/ÆØÅ which are valid in SPSS variable names. Example from pspp
image

I have tried making my own version where I replaced all php methods to the mb_ version (like mb_strlen), but sadly it does not seem to work. :/

@sinnbeck
Copy link
Author

Ok it seems I managed to get it working. I just had to use strlen() instead of mb_strlen(). More specifically I changed this one to strlen() and it works now

https://github.com/tiamo/spss/blob/master/src/Sav/Record/Info/VariableAttributes.php#L47

Would this change be considered if I PR it?

@lestcape
Copy link
Collaborator

As you already note, the usage of mb_strlen seem to be wrong there. Info are in general stored as ASCII. See as how is used in other extended classes of Info:
https://github.com/search?q=repo%3Atiamo%2Fspss%20-%3EdataCount&type=code

So, yes it should be: \strlen. Will be good to have a test case for that, but not worried if you can not add it in your PR.

Have a good day and thanks, for point this out.

@sinnbeck
Copy link
Author

Thanks for the heads up. I am still a bit new to the whole SPSS thing,but I will take a look at the current tests and see if I can make something work.I notice other sub classes of Info are also using mb_strlen() (LongStringMissingValues and LongStringValueLabels) so I am curious if they also should be changed to strlen(). I havent come across any issues with them, but it might be worth testing out.

I will try making a PR tomorrow :)

@lestcape
Copy link
Collaborator

Yes, Info are in general stored as ASCII. We really don't have a detailed specification for when one is used and when not. Testing what happens in practice is the safest thing for now. I have not notice any problem in the Value labels as they are right now, also i used there some Latin characters without a problem. So, I hope it's fine the way it is.

The Value labels have several problems that need a fix, but it's for now unrelated with the size. The problems are related with used the same instances of the ValueLabel in more than one variable.

@lestcape
Copy link
Collaborator

Really I think that only mb_strlen is wrong, because that assume the encode as UTF-8 and this might not be the case. See: #63, but that not means that it should be \strlen in all cases. It should probably be mb_strlen($data, $encode) where $encode is the real file encode and not the default encode.

I will see if i have the time to work on something that fixed the problematic in a better way. Specially for what occurs here: https://github.com/tiamo/spss/blob/master/src/Sav/Record/ValueLabel.php#L99-L106

@lestcape
Copy link
Collaborator

I opened a pull request #72 and there i try to make more clear what is needed. The point is that:

All sizes should be calculate in bytes, but the internal data is not necessary encoded in ASCII. So, cut the string in bytes means that we can cut the last character in the wrong position, because for example: In UTF-8 one character can have more than one byte.

The idea is then to find the cut in bytes that guarantees that no character becomes invalid when the string is saved in the charset of the resulting file (by default UTF-8).

That need to be tested for a while and then if there are not problem, can be considered to be merged.

@sinnbeck
Copy link
Author

sinnbeck commented Feb 13, 2024

Ah now that is quite interesting. Both seem like great PR's. I will pull them and see if I can break them in any ways :)

Great work!

Edit: Which version of php are you using while developing? I notice that I cannot install composer dependencies with php 8. I can do a PR to update it to work with php ^8.0

@lestcape
Copy link
Collaborator

I don' t know what @tiamo want to do about the new php versions, it become to be tricky to support all with the same code and i don' t know what can be dropped. So, i don' t want to take any decision on that for now. I have PHP 8.2.7, but i use my own copy of the library. I have a wrapper class that use the libray internally and in that way the things are easy to me to be integrate any changes without depend of an specific version of the library and without to break the rest of the code. So, i really don' t care what version of the php the library support.

@sinnbeck
Copy link
Author

Yeah I have it running in a PHP 8.2 project as well and it works just fine. The issue is that when working on a PR I am forced to use an old version of PHP to install the dependencies. I think I will just make a temporary docker container with php 7.4 for running tests while making PR's :)

@lestcape
Copy link
Collaborator

lestcape commented Feb 13, 2024

Well, in reality there is only one dependency and in Debian it is installed as easy as executing the command: sudo apt install php-mbstring. The bcmath, is really not used. I don't know why it is there as a dependency, but also it can be installed by just one command: sudo apt install php-bcmath. If you want to run the tests, it is easy as install phpunit: sudo apt install phpunit, then, it can be run by: phpunit --bootstrap ./bootstrap.php ./ and a /vendor/autoload.php to execute the tests should be:

<?php
$rootPathSPSS = dirname(dirname(__FILE__));

require_once($rootPathSPSS."/tests/TestCase.php");

require_once($rootPathSPSS."/src/Utils.php");
require_once($rootPathSPSS."/src/Exception.php");
require_once($rootPathSPSS."/src/Buffer.php");
require_once($rootPathSPSS."/src/Sav/Variable.php");
require_once($rootPathSPSS."/src/Sav/RecordInterface.php");
require_once($rootPathSPSS."/src/Sav/Record.php");
require_once($rootPathSPSS."/src/Sav/Record/InfoCollection.php");
require_once($rootPathSPSS."/src/Sav/Record/Header.php");
require_once($rootPathSPSS."/src/Sav/Record/Variable.php");
require_once($rootPathSPSS."/src/Sav/Record/Data.php");
require_once($rootPathSPSS."/src/Sav/Record/ValueLabel.php");
require_once($rootPathSPSS."/src/Sav/Record/Document.php");
require_once($rootPathSPSS."/src/Sav/Record/Info.php");
require_once($rootPathSPSS."/src/Sav/Record/Info/MachineInteger.php");
require_once($rootPathSPSS."/src/Sav/Record/Info/MachineFloatingPoint.php");
require_once($rootPathSPSS."/src/Sav/Record/Info/VariableDisplayParam.php");
require_once($rootPathSPSS."/src/Sav/Record/Info/LongVariableNames.php");
require_once($rootPathSPSS."/src/Sav/Record/Info/VeryLongString.php");
require_once($rootPathSPSS."/src/Sav/Record/Info/ExtendedNumberOfCases.php");
require_once($rootPathSPSS."/src/Sav/Record/Info/Unknown.php");
require_once($rootPathSPSS."/src/Sav/Record/Info/DataFileAttributes.php");
require_once($rootPathSPSS."/src/Sav/Record/Info/VariableAttributes.php");
require_once($rootPathSPSS."/src/Sav/Record/Info/CharacterEncoding.php");
require_once($rootPathSPSS."/src/Sav/Record/Info/LongStringMissingValues.php");
require_once($rootPathSPSS."/src/Sav/Record/Info/LongStringValueLabels.php");

require_once($rootPathSPSS."/src/Sav/Reader.php");
require_once($rootPathSPSS."/src/Sav/Writer.php");
?>

@sinnbeck
Copy link
Author

Thanks for the suggestion. I just made a simple Dockerfile instead :)

FROM php:7.4-cli

RUN curl -sSL https://github.com/mlocati/docker-php-extension-installer/releases/latest/download/install-php-extensions -o - | sh -s \
      bcmath zip

COPY --from=composer:latest /usr/bin/composer /usr/local/bin/composer

RUN apt-get -y update && apt-get -y install git

WORKDIR 'app'
docker build -t spss-app .
docker run -it --rm -v ./:/app spss-app bash

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants