Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

passing containers of char * (e.g. std::vector<char *>) does not work #2245

Open
codypiersall opened this issue Jun 6, 2020 · 15 comments · May be fixed by #2303
Open

passing containers of char * (e.g. std::vector<char *>) does not work #2245

codypiersall opened this issue Jun 6, 2020 · 15 comments · May be fixed by #2303
Labels

Comments

@codypiersall
Copy link

Issue description

pybind11 does not seem to correctly pass containers of char * to Python. I'm using the example of std::array<const char *, N>, but std::vectors have the same behavior. The char * received by C++ seems to be pointing to a random memory location. The memory is still within the process, as there are no segfaults (that I have seen, anyway).

Reproducible example code

This demo class has two separate constructors. One takes the type std::array<const char *, 2>, and the other takes the types const char *, const char*. The std::array version does the wrong thing, and the char *, when printed, just prints out garbage. The char *, char * version does the right thing, and prints the passed in string as expected.

I'm using pybind v2.5.0. I went through the docs on strings but didn't notice anything. Just guessing, but I would guess the problem might be shallow copies of pointers somewhere, since this only happens with containers. But I really have no clue.

This problem does not surface whenever std::string is used, which is another reason I think it may be related to shallow copies of pointers (since std::string will copy its data when copied).

#include <vector>
#include <stdio.h>

#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
namespace py = pybind11;

class Demo {
public:
    Demo(std::array<const char *, 2> strs) {
        printf("Got %s %s\n", strs[0], strs[1]);
    }
    Demo(const char * str1, const char * str2) {
        printf("Got %s %s\n", str1, str2);
    }
};

PYBIND11_MODULE(demo, m) {
    // the real deal here
    py::class_<Demo>(m, "Demo")
        .def(py::init<std::array<const char *, 2>>())
        .def(py::init<const char *, const char *>())
    ;
}

After building, the following Python code demonstrates the issue:

import demo

# prints random garbage
demo.Demo(['string 1', 'string 2'])

# prints the right thing
demo.Demo('string1', 'string2')
@molpopgen
Copy link

You probably need to make the container type opaque. The manual describes how to do so

@codypiersall
Copy link
Author

Thanks for the pointer, @molpopgen. In my case I can modify the wrapped class (Demo in the example code) to just take std::string, but in general it may not be possible for people to do that.

I can think of three behaviors that would be preferable to the current behavior:

  1. Do The Right Thing™ by converting the passed in Python str to char *.
  2. Refuse to compile
  3. Throw a runtime exception whenever the method is called

As a reminder, the current behavior is that the bound code receives a pointer to seemingly random data. Silently corrupt data is never fun.

@molpopgen
Copy link

I gave you bad advice! Sorry. Pybind11 converts python strings to c++ strings for good reason, to avoid memory leaks. Passing in pointer as you are doing makes ownership unclear, and therefore up to you. You can write a custom init to handle these cases, taking care to avoid memory leaks.

Failing to compile wouldn't be correct because it makes assumptions about ownership. Auto conversion may not be correct because the char * may not be null-terminated. Etc.. It is up to you to avoid UB and leaks here.

@codypiersall
Copy link
Author

Hmmm, maybe I'm not stating the problem clearly enough. pybind11 does the expected thing (with some caveats) when a function's parameter is a char *. pybind11 will encode it as utf-8, according to the docs here: https://pybind11.readthedocs.io/en/stable/advanced/cast/strings.html#passing-python-strings-to-c. To quote the docs:

When a Python str is passed from Python to a C++ function that accepts std::string or char * as arguments, pybind11 will encode the Python string to UTF-8. All Python str can be encoded in UTF-8, so this operation does not fail.

I think it's a bug that it doesn't do that when the parameter is a container of char *; I like the bare char * pybind11 behavior, and think pybind11 should do the same thing with char * containers.

I assume that for bare char * pybind11 copies the Python data to create a temporary bare char * and frees the memory after the memory after the function call completes?

Sorry for the delay in response—I'm in the middle of buying and selling houses/renovation.

@molpopgen
Copy link

I see what you are saying. I just see a vector of char * to be a red flag in this case. If you define a c++ function taking such a vector, and want pybind11 to convert python data for you, then there may be no destructors for the pointers? I've not looked at the internal code to check, though.

@codypiersall
Copy link
Author

Why would pybind11 do something different for a bare char * as it does for a vector of them? Can't it just do the same thing, i.e. encode the Python strs, pass the encoded data, then free the data? pybind11 is already special casing char * to treat it as a string in some cases and it should be consistent.

Regardless, the current behavior is just plain wrong: a vector of char* ends up with the char * pointing to complete garbage.... have you run the demo code?

@molpopgen
Copy link

Why would pybind11 do something different for a bare char * as it does for a vector of them?

You'd have to look at the code, but it is likely working in a loop and so the temporary is going out of scope. The current behavior is wrong, and probably either shouldn't compile or should warn and say that it isn't doing what you want.

@molpopgen
Copy link

For a class that cannot change from vector of pointers to strings, etc., this is a workaround, although we're not correctly addressing ownership here:

#include <vector>
#include <stdio.h>

#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
namespace py = pybind11;

class Demo
{
  public:
    Demo(std::array<const char *, 2> strs)
    {
        printf("Got %s %s\n", strs[0], strs[1]);
    }
    Demo(const char *str1, const char *str2)
    {
        printf("Got %s %s\n", str1, str2);
    }
};

PYBIND11_MODULE(demo, m)
{
    // the real deal here
    py::class_<Demo>(m, "Demo")
        .def(py::init([](std::array<std::string, 2> a) {
            // NOTE: the behavior in here depends on what the class
            // is actually doing, as we're in memory leak territoy
            // in general.  This works for this specific example, though.
            return Demo(std::array<const char *, 2>{a[0].c_str(), a[1].c_str()});
        }))
        .def(py::init<const char *, const char *>());
}

@codypiersall
Copy link
Author

Thanks for the feedback, @molpopgen. That's a nice clean workaround.

I'm not trying to be dense, but I don't see where a memory leak could come from in the std::array<...> example: won't the array of std::string get destroyed when the __init__ call finishes? I see potential for other undefined behavior though: if the Demo class expects the caller to keep the array of char *s alive, and accesses them later, that's undefined behavior land... a sad place to be.

@molpopgen
Copy link

molpopgen commented Jun 19, 2020

it all depends on what the fate of that container is, and on how the pointers were originally create (.c_str, new, malloc from some C lib?). You can get UB like you said, or a memory leak if somewhere in the path to making the class we need deep copies, etc.. It really depends on the details. To me, a container of bare pointers is a red flag. That's all I'm saying.

@codypiersall
Copy link
Author

To me, a container of bare pointers is a red flag. That's all I'm saying.

Gotchya. I probably reached for it because of my relative familiarity with C instead of C++... I'm afraid my code may have a distinctly C-ish smell to it :-. Hopefully getting better every day though.

@bstaletic
Copy link
Collaborator

Pybind doesn't try to work with C style arrays by design. You need a bit of cleverness to make it work, but it's certainly possible.

Is there still an issue here or should we close this?

@codypiersall
Copy link
Author

codypiersall commented Jul 7, 2020

The issue is that a bare char * function parameter works as explained in https://pybind11.readthedocs.io/en/stable/advanced/cast/strings.html#passing-python-strings-to-c. pybind11 automatically interprets a char * as a UTF-8-encoded string, but does not do the same with containers of char *. In my opinion, either refusing to compile a container of char * or working the same as when passing a bare char * would be good. As is, whatever method takes a container of char * ends up just getting a pointer to random junk (probably use after free, but I haven't tried with valgrind or anything).

@YannickJadoul
Copy link
Collaborator

The issue here is that the type caster for lists (to std::vector, std::array, etc) recursively calls the type caster of its nested type:

for (auto it : s) {
value_conv conv;
if (!conv.load(it, convert))
return false;
value.push_back(cast_op<Value &&>(std::move(conv)));
}

In the case of char*, the type caster stores a string object and passes a pointer to its data (const_casting the c_str()'s const away as well). So the moment this type caster for the nested char* goes out of scope, the data disappears as well.

This is definitely a bug, but I'm not immediately sure what's the best way of fixing it.

@zh794390558
Copy link

class Algorithm
    {
    public:
        Algorithm(const char *text, int length)
            :m_text(text), m_pos(0),
            m_text_length(length),
            m_tmp_words_i(0),
            m_match_cache_i(0)
        {
            for (int i = 0; i < match_cache_size; ++i)
                m_match_cache[i].first = -1;
        }

    private:
        const char *m_text;


py::class_<rmmseg::Algorithm>(m, "Algorithm")
        //.def(py::init<const char *, int>(),  py::keep_alive<1, 2>())
        .def(py::init([](std::string str, int length){
            return rmmseg::Algorithm(str, str.size());
        }))
        .def("get_text", [](rmmseg::Algorithm &self){
            return self.get_text();
        }, py::return_value_policy::reference)
        .def("next_token", [](rmmseg::Algorithm &self){
            return self.next_token();
        });

m_text will changed by gc, how to fix this? Since below code works will.

class Algorithm
    {
    public:
        Algorithm(const std::string text, int length)
            :m_text(text.c_str()), m_text_(text), m_pos(0),
            m_text_length(length),
            m_tmp_words_i(0),
            m_match_cache_i(0)
        {
            for (int i = 0; i < match_cache_size; ++i)
                m_match_cache[i].first = -1;
        }

    private:
        const char *m_text;
        const std::string m_text_;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants