Returning a copy collection to avoid corrupted iterator #3913

Merged
merged 1 commit into from
Mar 4, 2024

Conversation

ntisseyre
Contributor

@ntisseyre ntisseyre commented Aug 10, 2023

In a highly concurrent environment, within a single JanusGraph transaction, where multiple thousands of vertices are being created, I have been observing an exception:

throwable.cause.class="java.lang.IllegalArgumentException",throwable.cause.msg="expected one element but was: <Artifact, Artifact>",throwable.cause.stack="
        at com.google.common.collect.Iterators.getOnlyElement(Iterators.java:315)
        at com.google.common.collect.Iterators.getOnlyElement(Iterators.java:327)
        at com.google.common.collect.Iterables.getOnlyElement(Iterables.java:268)
        at org.janusgraph.graphdb.vertices.AbstractVertex.getVertexLabelInternal(AbstractVertex.java:126)
        at org.janusgraph.graphdb.vertices.AbstractVertex.vertexLabel(AbstractVertex.java:135)
        at org.janusgraph.graphdb.vertices.AbstractVertex.label(AbstractVertex.java:121)
        at org.janusgraph.graphdb.types.system.ImplicitKey.computeProperty(ImplicitKey.java:83)
        at org.janusgraph.graphdb.query.vertex.BasicVertexCentricQueryBuilder.executeImplicitKeyQuery(BasicVertexCentricQueryBuilder.java:211)
        at org.janusgraph.graphdb.query.vertex.VertexCentricQueryBuilder.properties(VertexCentricQueryBuilder.java:99)
        at org.janusgraph.graphdb.util.ElementHelper.getValues(ElementHelper.java:41)
        at org.janusgraph.graphdb.query.condition.PredicateCondition.evaluate(PredicateCondition.java:68)
        at org.janusgraph.graphdb.query.condition.And.evaluate(And.java:55)
        at org.janusgraph.graphdb.query.graph.GraphCentricQuery.matches(GraphCentricQuery.java:153)
        at com.google.common.collect.Iterators$5.computeNext(Iterators.java:637)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
        at org.janusgraph.graphdb.query.QueryProcessor.getUnfoldedIterator(QueryProcessor.java:98)
        at org.janusgraph.graphdb.query.QueryProcessor.iterator(QueryProcessor.java:67)
        at org.janusgraph.graphdb.query.graph.GraphCentricQueryBuilder$1.iterator(GraphCentricQueryBuilder.java:204)
        at org.janusgraph.graphdb.query.graph.GraphCentricQueryBuilder$1.iterator(GraphCentricQueryBuilder.java:201)
        at org.janusgraph.graphdb.tinkerpop.optimize.JanusGraphStep.executeGraphCentricQuery(JanusGraphStep.java:160)
        at org.janusgraph.graphdb.tinkerpop.optimize.JanusGraphStep.lambda$new$1(JanusGraphStep.java:95)
        at java.lang.Iterable.forEach(Iterable.java:75)
        at org.janusgraph.graphdb.tinkerpop.optimize.JanusGraphStep.lambda$new$2(JanusGraphStep.java:95)
        at org.apache.tinkerpop.gremlin.process.traversal.step.map.GraphStep.processNextStart(GraphStep.java:157)
        at org.apache.tinkerpop.gremlin.process.traversal.step.util.AbstractStep.hasNext(AbstractStep.java:144)

After an investigation, I have found that the container of vertex relationships might be modified and iterated in parallel, causing a corrupted iterator.

To solve the issue, I made the function return a copy of the current iterator state.
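The failure mode described above can be sketched with plain JDK collections. This is a minimal illustration, not JanusGraph code: `SnapshotViewSketch`, `liveView`, and `snapshotView` are hypothetical names, and the sketch uses an `ArrayList`, which fails fast with a `ConcurrentModificationException`, whereas the container in this PR produced a silently corrupted iteration instead. The point is the same in both cases: a `synchronized` getter only guards the creation of the view, not the iteration that happens after the lock is released.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for a relation container; not JanusGraph's actual class.
public class SnapshotViewSketch {
    private final List<String> relations = new ArrayList<>();

    public synchronized void add(String relation) {
        relations.add(relation);
    }

    // Live view: the lock is released before the caller starts iterating,
    // so later add() calls can interfere with the iteration.
    public synchronized Iterable<String> liveView() {
        return relations;
    }

    // Snapshot: matching elements are copied while the lock is held,
    // so the caller iterates over a private list that nobody else mutates.
    public synchronized Iterable<String> snapshotView(Predicate<String> filter) {
        List<String> copy = new ArrayList<>();
        for (String relation : relations) {
            if (filter.test(relation)) {
                copy.add(relation);
            }
        }
        return copy;
    }

    // Returns true if iterating the live view breaks after a modification.
    public static boolean liveViewBreaks() {
        SnapshotViewSketch container = new SnapshotViewSketch();
        container.add("a");
        container.add("b");
        Iterator<String> it = container.liveView().iterator();
        it.next();
        container.add("c"); // modification while an iteration is in progress
        try {
            it.next();
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Returns how many elements the snapshot yields after a later add().
    public static int snapshotSize() {
        SnapshotViewSketch container = new SnapshotViewSketch();
        container.add("a");
        container.add("b");
        Iterable<String> view = container.snapshotView(r -> true);
        container.add("c"); // does not affect the snapshot taken earlier
        int count = 0;
        for (String ignored : view) {
            count++;
        }
        return count;
    }
}
```

The trade-off is an extra allocation per call, which is what the review discussion further down weighs against the correctness gain.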

@linux-foundation-easycla

linux-foundation-easycla bot commented Aug 10, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: ntisseyre / name: Natalia Tisseyre (0feb953)

@janusgraph-bot janusgraph-bot added the cla: no This PR is not compliant with the CLA label Aug 10, 2023
@janusgraph-bot

Please verify the committer name, email, and GitHub username association are all correct and match CLA records.

Member

@porunov porunov left a comment

Hey @ntisseyre ! Happy to see you here 😄

I added some comments regarding this PR, because I don't know the purpose of the changes. I assume they might be needed for later contributions.

Besides that, you will need to sign either the ICLA or the CCLA so that we can merge this work into JanusGraph.
You can check the contribution guide here: https://github.com/JanusGraph/janusgraph/blob/master/CONTRIBUTING.md#sign-the-cla

Also, each commit has to be signed by you. You can update your commit signature via git commit --amend -s. See instructions about signing commits here: https://github.com/JanusGraph/janusgraph/blob/master/CONTRIBUTING.md#commit-changes-and-sign-the-developer-certificate-of-origin

Please, let me know if you need any help there and I can jump in.

Comment on lines 47 to 48
  @Override
  public synchronized Collection<InternalRelation> getAll() {
-     return super.getAll();
+     return new ArrayList<>(super.getAll());
  }
Member

I see that the getAll() method is used in a single place (to pass added relations into the commit logic).
I think concurrent access to this collection could happen only when we are in the process of committing the transaction while another thread is adding new relations to the same transaction.
I can't think of cases where adding new relations while committing the transaction makes sense. I could be missing part of the picture here, and this might be needed for future optimizations or features, but I just can't think of a case where wrapping this collection in a mutable collection is needed. Right now super.getAll() returns an unmodifiable collection, which can't be mutated by the caller.
Perhaps you could give an example of why this change is needed.

Member

The only usage of getAll() I found is when we commit the transaction. In that case, I believe, only a single thread will be committing the transaction. I'm not sure that committing the same transaction OR mutating elements while another thread is committing the transaction is a safe operation.
I do understand that the class is called ConcurrentAddedRelations and that callers should be able to assume all methods of this class return thread-safe collections, but the copy seems unnecessary in this case.
What I was thinking is that we could rename this interface method from getAll() to getAllUnsafe(), which would make it clear to the caller that the result of this method is not thread-safe.

However, I'm OK if you want to leave it as is, because this method is called only during the transaction commit operation. Thus, it shouldn't be expensive.
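The distinction behind this comment, between a collection the caller cannot mutate and a collection that is safe from concurrent changes, can be shown with the JDK alone. The sketch below assumes the unmodifiable collection behaves like `Collections.unmodifiableCollection`, i.e. a read-only view over the live backing list; whether JanusGraph's implementation uses exactly that wrapper is an assumption. `UnmodifiableVsCopySketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration; not JanusGraph's actual classes.
public class UnmodifiableVsCopySketch {

    // Returns {size of unmodifiable view, size of copy} after the backing
    // list receives one more element.
    public static int[] sizesAfterLateAdd() {
        List<String> backing = new ArrayList<>(Arrays.asList("r1", "r2"));
        Collection<String> view = Collections.unmodifiableCollection(backing);
        Collection<String> copy = new ArrayList<>(backing);

        backing.add("r3"); // e.g. another thread adding a relation

        // The view reflects the new element; the copy does not.
        return new int[] {view.size(), copy.size()};
    }

    // Returns true if the unmodifiable view rejects writes from the caller.
    public static boolean viewRejectsWrites() {
        Collection<String> view =
            Collections.unmodifiableCollection(new ArrayList<>(Arrays.asList("r1")));
        try {
            view.add("x");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }
}
```

An unmodifiable view therefore protects the container from its callers but not the callers from the container, which is why a name such as getAllUnsafe() documents the contract without paying for a copy.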

Contributor Author

I have renamed the method to getAllUnsafe and removed the copy

Comment on lines 42 to 43
  @Override
  public synchronized Iterable<InternalRelation> getView(final Predicate<InternalRelation> filter) {
-     return super.getView(filter);
+     return copyView(super.getView(filter));
  }
Member

I see this method is called from several places, but I can't figure out if there is a bug or not.
Would you be able to explain why this change is needed? Ideally it would be great to have a test that reproduces the bug, but in some cases it's complicated to write a meaningful test for a bug. In such a case, a simple description of the problem could be enough.

Contributor Author

I have added the description of the problem to the PR description.

Member

We discussed this part with @ntisseyre previously.
The current concern is that this method might be called too often, which may increase GC pressure because many copied arrays are created in memory.

Below are my observations.

Observation 1.
As far as I can see right now, this method is called for each added vertex property. I.e. the snippet below will call this method 3 times:

Vertex vertex = tx.addVertex();
vertex.property("foo1", "bar"); // first `getView` call
vertex.property("foo2", "bar2"); // second `getView` call
vertex.property("foo3", "bar3"); // third `getView` call

The addedRelations container (ConcurrentAddedRelations) is used in two separate places: one addedRelations for the whole transaction and a second addedRelations collection for each vertex. At first I was afraid that adding a property might trigger a getView call on the transaction's addedRelations, which would copy a lot of objects in memory (i.e. all relations of the whole transaction). However, upon checking I see that this is not the case. When we add a property, the relation is added to both the transaction's addedRelations collection and the StandardVertex's addedRelations collection, but the getView call happens only on the StandardVertex's addedRelations and not on the whole transaction's.
Essentially, it's not a big concern, because only the properties of a single vertex are copied, not all properties of all vertices.

Observation 2.
It also seems that when we execute queries we call the getNew method of StandardJanusGraphTx, which essentially calls getView on the transaction's addedRelations (i.e. the whole transaction). I assume that if you execute many reads while also adding many relations in the same transaction, it might be a problem, because this implementation allocates many auxiliary arrays in memory to hold all the references.
That said, looking at the getNew implementation, it seems we already copy those elements into a Set (in one case; in the other case a simple reference is passed). And looking at getUnfoldedIterator(), which is used to process the query, it seems that the elements returned from getNew are copied again into other collections.
As such, the time and space complexity of this flow doesn't actually change. The number of operations changes, but the amortized complexity stays the same. I.e. if we have N elements in that collection, the number of operations increases from N*2 to N*3, which I believe isn't a big problem. If we want to optimize this, it would be better to reduce N rather than the constant.
Overall, I don't think this will cause many problems. The only affected usage pattern I see is adding a huge number of properties in a single transaction while also querying data in parallel within the same transaction. I assume this isn't the most popular usage pattern.

Conclusion: I think it's OK to use copyView here.
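The constant-factor argument in Observation 2 can be made concrete with a small counting sketch. It assumes, as a reading of the comment rather than a measurement of JanusGraph, that the query path already performs two full copies of the N elements and that copyView adds a third. `CopyCountSketch` and `copiedReferences` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the copy counts; not JanusGraph code.
public class CopyCountSketch {

    // Copies a list of n elements `stages` times in sequence and returns
    // the total number of element references that were copied.
    public static long copiedReferences(int n, int stages) {
        List<Integer> current = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            current.add(i);
        }
        long copied = 0;
        for (int s = 0; s < stages; s++) {
            current = new ArrayList<>(current); // one full defensive copy
            copied += current.size();
        }
        return copied;
    }
}
```

With n = 1000, two stages copy 2000 references and three stages copy 3000. The work grows linearly with n in both cases, so shrinking the collection that is copied pays off more than removing one stage.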

Contributor Author

thanks!

@ntisseyre
Contributor Author

Thank you! I'm working on my CLA

@janusgraph-bot janusgraph-bot added cla: external Externally-managed CLA and removed cla: no This PR is not compliant with the CLA labels Feb 25, 2024
Member

@porunov porunov left a comment

Thank you @ntisseyre for the contribution!
LGTM, however, see the comments I left above #3913 (comment)

I would prefer simply renaming the getAll method as shown below instead of copying that collection into a new ArrayList. However, I'm OK if you want to leave it as is.

@Override
public synchronized Collection<InternalRelation> getAllUnsafe() {
    return super.getAllUnsafe();
}

@porunov
Member

porunov commented Feb 26, 2024

@JanusGraph/committers I will be merging this PR in a week following lazy consensus unless anyone else jumps in for the review.

@porunov
Member

porunov commented Mar 4, 2024

Merging by following lazy consensus. Backporting to 1.0.

@porunov porunov merged commit c9e0e27 into JanusGraph:master Mar 4, 2024
108 checks passed
@janusgraph-automations

💚 All backports created successfully

Status Branch Result
v1.0

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation and see the Github Action logs for details

Labels
backport/v1.0 cla: external Externally-managed CLA
4 participants