Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

May be a BOM issue? #184

Closed
harikt opened this issue Oct 1, 2016 · 10 comments
Closed

May be a BOM issue? #184

harikt opened this issue Oct 1, 2016 · 10 comments

Comments

@harikt
Copy link
Contributor

harikt commented Oct 1, 2016

Issue summary

Hi,

I have been playing with importing a csv file. It has came to my notice that there is some sort of issue with μ character. May be it is something else I am wrong.

System informations

Information Description
League\Csv version 8.1.1
PHP/HHVM version PHP 7.0.8-0ubuntu0.16.04.1 (cli) ( NTS )
OS Platform ubuntu0.16.04.1

Standalone code, or other way to reproduce the problem

You can download the csv file from http://www.data-explorer.com/content/data/standard-model-of-particle-physics-csv.zip

<?php
require __DIR__ . '/vendor/autoload.php';

use League\Csv\Reader;

$file = "/path/to/physics/file.csv";

$reader = Reader::createFromPath($file);
$reader->stripBom(true);

echo "<pre>";

//get the first row, usually the CSV header
$headers = $reader->fetchOne(0);
print_r($headers);

$reader->addFilter(function ($row, $index) {
    return $index > 0;
});

$rows = $reader->fetch();

foreach ($rows as $row) {
    $content = array_combine($headers, $row);
    $json = json_encode($content);

    if ($json === FALSE) {
        echo var_dump($content);
        exit;
    }
}

Expected result

I am not sure what would be the appropriate character for μ .

Actual result

array(21) {
  ["ID"]=>
  string(3) "9"
  ["Name"]=>
  string(9) "Muon"
  ["Symbol"]=>
  string(3) "��"
 ....
}

There is actually more on the same csv which have similar issues.

Thank you.

@harikt harikt changed the title May be a BOM issue with μ character ? May be a BOM issue? Oct 1, 2016
@nyamsprod
Copy link
Member

@harikt there's nothing wrong with the library. By default the library thinks you are using a UTF-8 document. In your case the document is not UTF-8. Here's what I did:

require 'vendor/autoload.php';

use League\Csv\Reader;

$file = __DIR__."/file.csv";

$reader = Reader::createFromPath($file);
$input_bom = $reader->getInputBOM();
//in your case $input_bom === Reader::BOM_UTF16_LE
//so I've just added this following line (I assume that the iconv extension is installed on your server)
$reader->appendStreamFilter('convert.iconv.UTF-16/UTF-8'); //converting to UTF-8
$reader->stripBOM(true);
echo "<pre>";
//get the first row, usually the CSV header
$headers = $reader->fetchOne(0);
$rows = $reader->setOffset(1)->fetchAssoc($headers); //Simpler than using the Reader::addFilter method
foreach ($rows as $row) {
    echo json_encode($row, JSON_PRETTY_PRINT), PHP_EOL;
}

@harikt
Copy link
Contributor Author

harikt commented Oct 1, 2016

@nyamsprod Thank you for your quick help.

A few things I noticed are

require 'vendor/autoload.php';

use League\Csv\Reader;

$file = __DIR__."/file.csv";

$reader = Reader::createFromPath($file);
$reader->appendStreamFilter('convert.iconv.UTF-16/UTF-8');
$reader->stripBOM(true);
$headers = $reader->fetchOne(0);

Adding $reader->appendStreamFilter('convert.iconv.UTF-16/UTF-8'); creates a warning as below.

PHP Notice:  Uninitialized string offset: 0 in /var/www/projects/vendor/league/csv/src/Modifier/QueryFilter.php on line 225

Notice: Uninitialized string offset: 0 in /var/www/projects/vendor/league/csv/src/Modifier/QueryFilter.php on line 22

Also closely looking the id keys got missing. See below.

{
    "": "17",

Thank you

@nyamsprod
Copy link
Member

@harikt strange I failed to reproduce this bug and I'm also using the 16.04 LTS

<?php

error_reporting(-1);
ini_set('display_errors', '1');

require 'vendor/autoload.php';

use League\Csv\Reader;

$file = __DIR__."/test.csv";

$reader = Reader::createFromPath($file);
$reader->appendStreamFilter('convert.iconv.UTF-16/UTF-8');
$reader->stripBOM(true);

echo "<pre>";
foreach ($reader->fetchAssoc(0) as $row) {
    echo json_encode($row, JSON_PRETTY_PRINT), PHP_EOL;
}

@harikt
Copy link
Contributor Author

harikt commented Oct 1, 2016

I hope you could re-produce this now. This is when you call $reader->getInputBOM(); .

<?php

error_reporting(-1);
ini_set('display_errors', '1');

require 'vendor/autoload.php';

use League\Csv\Reader;

$file = __DIR__."/test.csv";

$reader = Reader::createFromPath($file);
$input_bom = $reader->getInputBOM();
if ( $input_bom === Reader::BOM_UTF16_LE || $input_bom === Reader::BOM_UTF16_BE ) {
    echo "BOM UTF-16 " . PHP_EOL;
    $reader->appendStreamFilter('convert.iconv.UTF-16/UTF-8');
}
$reader->stripBOM(true);
foreach ($reader->fetchAssoc(0) as $row) {
    // echo json_encode($row, JSON_PRETTY_PRINT), PHP_EOL;
}

@nyamsprod
Copy link
Member

@harikt thanks I found the issue ... I'll release a patch to fix this when I have time 👍

@harikt
Copy link
Contributor Author

harikt commented Oct 1, 2016

Good to hear you was able to reproduce the same.

And thank you very much for the valuable support you have provided.

I wasn't able to find the reason to send a PR. Sorry about that :( .

@harikt
Copy link
Contributor Author

harikt commented Oct 1, 2016

Also in the case as I mentioned when calling $reader->getInputBOM() the ID was null on json_encode data.

@nyamsprod
Copy link
Member

Also in the case as I mentioned when calling $reader->getInputBOM() the ID was null on json_encode data.

I've looked into this issue and in fact again there's no issue. The thing is if you use stream filter you don't need to explicitly call the stripBOM method.
The conversion is done prior to generate the array row from the CSV document so the BOM sequence is already removed. (I didn't know this, since I've never really used the stream filter on BOM document).

So the code should be:

<?php

error_reporting(-1);
ini_set('display_errors', '1');

require 'vendor/autoload.php';

use League\Csv\Reader;

$file = __DIR__."/file.csv";

$reader = Reader::createFromPath($file);
$input_bom = $reader->getInputBOM();
if ($input_bom === Reader::BOM_UTF16_LE || $input_bom === Reader::BOM_UTF16_BE) {
    $reader->appendStreamFilter('convert.iconv.UTF-16/UTF-8');
}
foreach ($reader->fetchAssoc(0) as $row) {
    echo json_encode($row, JSON_PRETTY_PRINT), PHP_EOL;
}

The stripBOM method should only be used if stream filter can not be used to convert your CSV document to UTF-8.

I should document this into the library documentation.. or you could do it 👍

nyamsprod added a commit that referenced this issue Oct 3, 2016
@harikt
Copy link
Contributor Author

harikt commented Oct 3, 2016

Thank you @nyamsprod .

I will look into sending a PR regarding the same. Please give me some time.

Thank you.

@nyamsprod
Copy link
Member

version 8.1.2 is released with the fix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants