Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rebase 1.10.0 #19137

Merged
merged 40 commits into from
Apr 9, 2018
Merged

rebase 1.10.0 #19137

merged 40 commits into from
Apr 9, 2018

Conversation

liggitt
Copy link
Contributor

@liggitt liggitt commented Mar 28, 2018

based on:

  • builds
    • openshift binaries
    • hyperkube
    • cross build
  • starts
    • openshift start
    • openshift start --master-config=... --node-config=...
    • openshift start master --config=...
    • openshift start node --config=...
    • openshift start master api --config=...
    • openshift start master controllers --config=...
    • oc cluster up
  • verify dependency levels
    • runc
    • cadvisor
    • etcd, grpc, bboltdb
    • spf13, cobra
    • docker/distribution, docker/docker, fsouza
  • known upstream picks
    • 61962 - avoid data races in unit tests
    • 61808 - fix printing
    • 61949 - tolerate openapi discovery errors
    • 61985 - print kind when printing multiple types
    • 62074 - narrow interface consumed by scale client
  • tests
    • verify
    • unit
    • cmd
    • integration
    • extended_conformance_gce
    • extended_conformance_install
    • extended_builds
    • extended_image_registry
    • e2e
    • extended_clusterup
    • extended_image_ecosystem
    • extended_networking
  • finalize

follow-ups:

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 28, 2018
@openshift-ci-robot openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 28, 2018
@openshift-bot openshift-bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Mar 31, 2018
@liggitt
Copy link
Contributor Author

liggitt commented Apr 3, 2018

@sjenning seeing a kubelet startup failure in our e2e setup that seems caused by https://github.com/kubernetes/kubernetes/pull/59769/files#diff-bf28da68f62a8df6e99e447c4351122dR1331

comparing origin.log from a good e2e run from master (kube 1.9.1) and a bad e2e run on this PR (kube 1.10.0), both appear to have hit the failed to get device for dir error

in 1.9. it was logged as an error repeatedly:

E0403 02:12:09.714804 8384 container_manager_linux.go:584] [ContainerManager]: Fail to get rootfs information failed to get device for dir "/tmp/openshift/test-end-to-end/cluster-up/openshift.local.volumes": could not find device with major: 0, minor: 38 in cached partitions map

in 1.10, this error is now fatal:

F0403 03:06:53.509202 8818 kubelet.go:1354] Failed to start ContainerManager failed to get rootfs info: failed to get device for dir "/tmp/openshift/test-end-to-end/cluster-up/openshift.local.volumes": could not find device with major: 0, minor: 38 in cached partitions map

questions:

  1. is it correct for this error to be fatal?
  2. if so, how can we correct our e2e env to avoid this error? is it related to e2e running a containerized kubelet?

@@ -169,6 +168,22 @@ func ClientMapperFromConfig(config *rest.Config) resource.ClientMapperFunc {
})
}

// setKubernetesDefaults sets default values on the provided client config for accessing the
// Kubernetes API or returns an error if any of the defaults are impossible or invalid.
func setKubernetesDefaults(config *rest.Config) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You think this is safer than exposing the function upstream?

Copy link
Contributor Author

@liggitt liggitt Apr 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I expect further shenanigans in legacyscheme usage upstream

@@ -7,7 +7,7 @@ import (
"github.com/golang/glog"

controllerapp "k8s.io/kubernetes/cmd/kube-controller-manager/app"
_ "k8s.io/kubernetes/plugin/pkg/scheduler/algorithmprovider"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this is tragic and so symptomatic of the scheduler problems.

newFunc := func(protocol api.Protocol, ip net.IP, port int) (userspace.ProxySocket, error) {
return newUnidlerSocket(protocol, ip, port, signaler)
}
return userspace.NewCustomProxier(loadBalancer, listenIP, iptables, exec, pr, syncPeriod, minSyncPeriod, udpIdleTimeout, newFunc)
return userspace.NewCustomProxier(loadBalancer, listenIP, iptables, exec, pr, syncPeriod, minSyncPeriod, udpIdleTimeout, nodePortAddresses, newFunc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mfojtik
Copy link
Contributor

mfojtik commented Apr 3, 2018

@liggitt https://github.com/liggitt/origin/commit/be4d28cd7f303d13c0dc95fb530f386739c5b306

(i was able to spawn 1.10 cluster via cluster up with this, guess it will make some tests pass)

@@ -420,6 +421,7 @@ func (c *DeploymentController) makeDeployerPod(deployment *v1.ReplicationControl
RestartPolicy: v1.RestartPolicyNever,
ServiceAccountName: c.serviceAccount,
TerminationGracePeriodSeconds: &gracePeriod,
ShareProcessNamespace: &shareProcessNamespace,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a thing? Is the default bad somehow?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the deployer pod is single-container Pod... From what I saw in kube, this is still an alpha feature that is disabled by default anyway ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes the default explicit, had to set something or explicitly ignore the field in the unit test, which I didn't want to do

@@ -109,12 +109,9 @@ function os::build::version::kubernetes_vars() {
# Try to match the "git describe" output to a regex to try to extract
# the "major" and "minor" versions and whether this is the exact tagged
# version or whether the tree is between two tagged versions.
if [[ "${KUBE_GIT_VERSION}" =~ ^v([0-9]+)\.([0-9]+)(\.[0-9]+)*([-].*)?$ ]]; then
if [[ "${KUBE_GIT_VERSION}" =~ ^v([0-9]+)\.([0-9]+)\. ]]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because it didn't actually work against the output of git describe, and all we care about is the major/minor bits

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give the output of git describe? IIRC this will break stuff in OSE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give the output of git describe?

v1.10.0-46-g9070269

IIRC this will break stuff in OSE

nope, this is setting kube minor/major version, which are unset today. this isn't touching anything openshift-related

hack/test-cmd.sh Outdated
@@ -19,7 +19,7 @@ function find_tests() {
local full_test_list=()
local selected_tests=()

full_test_list=( $(find "${OS_ROOT}/test/cmd" -name '*.sh') )
full_test_list=( $(find "${OS_ROOT}/test/cmd" -name '*.sh' | sort) )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer unsorted. Can you randomize instead please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? I like knowing how much time I have left in my tests

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We found a lot of bugs once we stopped sorting in the past, people were not cleaning up cluster resources in test X and not questioning the state of the universe in test X+N, then writing test X+N to only work when run after text X...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

randomizing without being reproducible leads to flakes that magically pass on a subsequent retry

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that an issue today? You're arguing to change the status quo :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that an issue today?

yeah, hit it while running this suite.

You're arguing to change the status quo :)

will push a revert with the next batch of changes. I don't think the current random+unreproducible state is helpful, but I don't care enough to argue.

@@ -129,7 +129,7 @@ func NewCmdDebug(fullName string, f *clientcmd.Factory, in io.Reader, out, errou
}

cmd := &cobra.Command{
Use: "debug RESOURCE/NAME [ENV1=VAL1 ...] [-c CONTAINER] [options] [-- COMMAND]",
Use: "debug RESOURCE/NAME [ENV1=VAL1 ...] [-c CONTAINER] [flags] [-- COMMAND]",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that I object, but why do I care about this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because cobra changed to auto append [flags] if you don't include it in your usage, which screws up usage for things like rsh

@@ -117,7 +117,7 @@ echo "certs: ok"
os::test::junit::declare_suite_end

os::test::junit::declare_suite_start "cmd/admin/groups"
os::cmd::expect_success_and_text 'oc adm groups new shortoutputgroup -o name' 'groups/shortoutputgroup'
os::cmd::expect_success_and_text 'oc adm groups new shortoutputgroup -o name' 'group/shortoutputgroup'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, that's neat. The command needs updating. Printer looks off

@@ -92,7 +92,7 @@ echo "pods: ok"
os::test::junit::declare_suite_end

os::test::junit::declare_suite_start "cmd/basicresources/label"
os::cmd::expect_success_and_text 'oc create -f examples/hello-openshift/hello-pod.json -o name' 'pod/hello-openshift'
os::cmd::expect_success_and_text 'oc create -f examples/hello-openshift/hello-pod.json -o name' 'pod.*/hello-openshift'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not obvious to me. What are the characters between pod and slash?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed


# Test that infos printer supports all outputFormat options
os::cmd::expect_success_and_text 'oc new-app node -o yaml | oc set env -f - MYVAR=value' 'deploymentconfig "node" updated'
os::cmd::expect_success 'oc new-app node -o yaml | oc set env -f - MYVAR=value -o custom-colums="NAME:.metadata.name"'
os::cmd::expect_success_and_text 'oc new-app node -o yaml | oc set env -f - MYVAR=value -o yaml' 'apiVersion: v1'
os::cmd::expect_success_and_text 'oc new-app node -o yaml | oc set env -f - MYVAR=value -o json' '"apiVersion": "v1"'
os::cmd::expect_success_and_text 'oc new-app node -o yaml | oc set env -f - MYVAR=value -o wide' 'node'
os::cmd::expect_success_and_text 'oc new-app node -o yaml | oc set env -f - MYVAR=value -o name' 'deploymentconfigs/node'
os::cmd::expect_success_and_text 'oc new-app node -o yaml | oc set env -f - MYVAR=value -o name' 'deploymentconfig/node'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@juanvallejo @soltysh another command that needs updating.

@@ -84,7 +84,7 @@ os::cmd::expect_success 'oc adm policy add-scc-to-user privileged -z ipfailover'
os::cmd::expect_success_and_text 'oc adm ipfailover --virtual-ips="1.2.3.4" --dry-run' 'Creating IP failover'
os::cmd::expect_success_and_text 'oc adm ipfailover --virtual-ips="1.2.3.4" --dry-run' 'Success \(dry run\)'
os::cmd::expect_success_and_text 'oc adm ipfailover --virtual-ips="1.2.3.4" --dry-run -o yaml' 'name: ipfailover'
os::cmd::expect_success_and_text 'oc adm ipfailover --virtual-ips="1.2.3.4" --dry-run -o name' 'deploymentconfig/ipfailover'
os::cmd::expect_success_and_text 'oc adm ipfailover --virtual-ips="1.2.3.4" --dry-run -o name' 'deploymentconfig.*/ipfailover'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

specificity please

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -19,7 +19,7 @@ os::cmd::expect_success_and_text 'oc status --suggest' 'dc/simple-deployment has
os::cmd::expect_failure_and_text 'oc set probe dc/simple-deployment --liveness --get-url=http://google.com:80 --local' 'You must provide one or more resources by argument or filename'
# test --dry-run flag with -o formats
os::cmd::expect_success_and_text 'oc set probe dc/simple-deployment --liveness --get-url=http://google.com:80 --dry-run' 'simple-deployment'
os::cmd::expect_success_and_text 'oc set probe dc/simple-deployment --liveness --get-url=http://google.com:80 --dry-run -o name' 'deploymentconfigs/simple-deployment'
os::cmd::expect_success_and_text 'oc set probe dc/simple-deployment --liveness --get-url=http://google.com:80 --dry-run -o name' 'deploymentconfig/simple-deployment'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@juanvallejo @soltysh another command to update

@@ -117,7 +117,7 @@ echo "certs: ok"
os::test::junit::declare_suite_end

os::test::junit::declare_suite_start "cmd/admin/groups"
os::cmd::expect_success_and_text 'oc adm groups new shortoutputgroup -o name' 'groups/shortoutputgroup'
os::cmd::expect_success_and_text 'oc adm groups new shortoutputgroup -o name' 'group/shortoutputgroup'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@juanvallejo @soltysh another command to update

@deads2k
Copy link
Contributor

deads2k commented Apr 3, 2018

interesting: webhook auth resolver change lgtm
interesting: policy updates (adds daemonset permissions) origin bits lgtm
interesting: node config changes fail on swap is probably going to cause a problem. I don't think I'd bother messing with the rest though.
interesting: fix upstream unit test selection lgtm
interesting: controller manager wiring lgtm - also, go me! - pull out separately with the other easier wiring and we'll merge early
interesting: scheduler wiring lgtm - also, go me! :)
interesting: node wiring lgtm - also, go me! :) @eparis told you so :)
interesting: stop marking optional flags required - you sure these are optional? pull out separately and merge early?
interesting: inject print handlers lgtm
interesting: test/cmd fixes lgetm - followups needed
interesting: e2e fixes lgtm, but I'd squash the test-cmd one
interesting: integration test fixes lgtm, but just that?
interesting: make runtime-config kube-compatible ugly, but ok
interesting: fix export to use internal objects lgtm, looks like one to backport

And I'm going to take a break. Someone is going to help fix the apiserver wiring, right.... right...?

scaler, _ := kubectl.ScalerFor(kapi.Kind("ReplicationController"), client)

// TODO: implement for RC?
var scalesGetter scaleclient.ScalesGetter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is now panicking the deployer pods...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's in a FIXME commit in a WIP PR… not quite ready for review :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mfojtik fixed

@liggitt liggitt mentioned this pull request Apr 3, 2018
@liggitt
Copy link
Contributor Author

liggitt commented Apr 3, 2018

interesting: node config changes fail on swap is probably going to cause a problem. I don't think I'd bother messing with the rest though.

yeah, openshift/kubernetes@3ca675e effectively changed the default back to false. will do the same for the rebase and follow up with long term plan

interesting: fix export to use internal objects lgtm, looks like one to backport

no backport needed, realized this was fixing up an incorrect choice I made when picking replacements for the factory-provided decoder that went away.

picked things that can go into master now into #19200

@openshift openshift deleted a comment from mfojtik Apr 3, 2018
@openshift openshift deleted a comment from mfojtik Apr 3, 2018
@sjenning
Copy link
Contributor

sjenning commented Apr 4, 2018

re @liggitt

is it correct for this error to be fatal?

I don't think so

if so, how can we correct our e2e env to avoid this error? is it related to e2e running a containerized kubelet?

Upstream used to tolerate this error with retries. There is no indication on the PR that changed it why this was done.

fyi @derekwaynecarr

@derekwaynecarr
Copy link
Member

@liggitt @sjenning - the issue described from the kubelet is tied to changes from LocalStorageCapacityIsolation feature, which unfortunately even when disabled still has kubelet do actions to figure out rootfs configuration. i am trying to figure out if we can change cAdvisor to handle /tmpfs as a --root-dir, will report back.

@liggitt
Copy link
Contributor Author

liggitt commented Apr 8, 2018

/retest

1 similar comment
@liggitt
Copy link
Contributor Author

liggitt commented Apr 8, 2018

/retest

@liggitt
Copy link
Contributor Author

liggitt commented Apr 8, 2018

new commits that need reviewing (in service of fixing the "no kind Namespace found in v1" error the cluster-up tests were hitting when /oapi was unavailable)

UPSTREAM: 62196: Remove need for server connections for dry-run create
UPSTREAM: 62199: Make priority rest mapper handle partial discovery
UPSTREAM: 62234: Handle partial group and resource responses consistently

@liggitt
Copy link
Contributor Author

liggitt commented Apr 8, 2018

seeing flakes in gcp (1-2 different failures in successive runs)

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/19137/test_pull_request_origin_extended_conformance_gce/18737/

  • The HAProxy router should expose prometheus metrics for a route

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/19137/test_pull_request_origin_extended_conformance_gce/18751/

  • The HAProxy router converges when multiple routers are writing conflicting status
  • prune builds based on settings in the buildconfig successful, failed, canceled, and errored builds should be pruned based on settings in the build config

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/19137/test_pull_request_origin_extended_conformance_gce/18728/

  • deploymentconfigs keep the deployer pod invariant valid [Conformance] should deal with cancellation after deployer pod succeeded

@liggitt
Copy link
Contributor Author

liggitt commented Apr 8, 2018

/retest

@mfojtik
Copy link
Contributor

mfojtik commented Apr 9, 2018

@liggitt i believe that flakes are pre-existing

@mfojtik
Copy link
Contributor

mfojtik commented Apr 9, 2018

@liggitt @deads2k green \o/ I vote for merging this and deal with the follow ups

@soltysh
Copy link
Contributor

soltysh commented Apr 9, 2018

👍 from me to merging as is

@mfojtik
Copy link
Contributor

mfojtik commented Apr 9, 2018

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 9, 2018
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, mfojtik

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mfojtik
Copy link
Contributor

mfojtik commented Apr 9, 2018

@Kargakis the tide does not seem to handle merging of this PR, should we merge this by hand or is there some special label that enable merging for this one ?

@0xmichalis
Copy link
Contributor

By hand for now.

@mfojtik
Copy link
Contributor

mfojtik commented Apr 9, 2018

Green merging this after talking to @Kargakis
The tests are green, the queue is disabled, so there is no conflict. Fingers crossed.

@mfojtik mfojtik merged commit d67b8ce into openshift:master Apr 9, 2018
@mfojtik
Copy link
Contributor

mfojtik commented Apr 9, 2018

@liggitt awesome job!

@sttts
Copy link
Contributor

sttts commented Apr 9, 2018

awesome job!

Indeed!

@liggitt
Copy link
Contributor Author

liggitt commented Apr 9, 2018

Go ahead and pause the origin->kubernetes publishing until we record the origin sha and adjust the bot to publish to the origin-3.10-kubernetes-1.10.0 branch

@mfojtik
Copy link
Contributor

mfojtik commented Apr 9, 2018

@liggitt paused

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. queue/critical-fix size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.