Error when running TensorFlow Serving with Kubernetes - python

I have been following the steps on how to use TensorFlow Serving with Kubernetes from this link: https://www.tensorflow.org/tfx/serving/serving_kubernetes
However, when I run the following command:
docker run --rm tensorflow/serving:nightly-devel python tensorflow_serving/example/resnet_client_grpc.py --server=192.168.0.156:32505
I get this error:
status = StatusCode.FAILED_PRECONDITION
details = "Attempting to use uninitialized value resnet_model/dense/bias
[[{{node resnet_model/dense/bias/read}}]]"
debug_error_string = "{"created":"#1555068239.875845800",
"description":"Error received from peer",
"file":"src/core/lib/surface/call.cc",
"file_line":1039,
"grpc_message":"Attempting to use uninitialized value
resnet_model/dense/bias\n\t [[{{node
resnet_model/dense/bias/read}}]]","grpc_status":9}"
I'm using Windows 10 btw.
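For reference, here is a minimal sketch of what that gRPC client does, assuming the tutorial's model name "resnet", input tensor "image_bytes", and a local test image (it is not the official resnet_client_grpc.py). The FAILED_PRECONDITION above is raised by the Predict call, which usually points at the exported SavedModel on the server rather than at the client:

import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

SERVER = '192.168.0.156:32505'   # NodePort address from the question
IMAGE_PATH = 'cat.jpg'           # hypothetical local test image

with open(IMAGE_PATH, 'rb') as f:
    image_bytes = f.read()

channel = grpc.insecure_channel(SERVER)
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = 'resnet'                     # assumed model name
request.model_spec.signature_name = 'serving_default'  # assumed signature
request.inputs['image_bytes'].CopyFrom(
    tf.make_tensor_proto(image_bytes, shape=[1]))

# The "Attempting to use uninitialized value" error in the question is
# returned by the server in response to this call.
result = stub.Predict(request, 10.0)  # 10-second timeout
print(result)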

Related

google ml engine scale-tier not running in remote distributed training

When running a TensorFlow job in remote distributed mode, i.e. after specifying "--scale-tier STANDARD_1", the batch job fails to run. At the moment I can ONLY run with a simple SINGLE NODE setup ("--scale-tier BASIC"). The command I submit is:
gcloud ml-engine jobs submit training census_20171005_162623 --stream-logs --scale-tier STANDARD_1 --staging-bucket gs://dextest --runtime-version 1.2 --job-dir gs://dextest/census_20171005_162623 --module-name trainer.task --package-path trainer/ --region us-central1 -- --train-files gs://cloudml-public/census/data/adult.data.csv --eval-files gs://cloudml-public/census/data/adult.test.csv --train-steps 1000 --eval-steps 100
The error I am getting is
The replica worker 1 exited with a non-zero status of 1. Termination reason: Error.
From the log:
Retrying after gsutil exception Command '['gsutil', '-q', 'cp',
u'gs://dextest/census_20171005_161531/2211a814b974edbc3defee855046dd8e801393b7ff8154b084b081322167fe90/trainer-0.0.0.tar.gz',
u'trainer-0.0.0.tar.gz']' returned non-zero exit status 1.
The Master did SUCCESSFULLY initialize and copied the package "trainer-0.0.0.tar.gz".
However, the issue happens when the replicas copy the package for the run.
It seems that the ML workflow FAILED to handle the cleanup in the replicas: the replicas tried to clean the job directory "gs://dextest/census_20171005_162623" again before running.
So the error happens after the master copies the package, and the replicas FAIL to pick up the package for running.
I CommandException: No URLs matched:
gs://dextest/census_20171005_161531/2211a814b974edbc3defee855046dd8e801393b7ff8154b084b081322167fe90/trainer-0.0.0.tar.gz
E Retrying after gsutil exception Command '['gsutil', '-q', 'cp',
u'gs://dextest/census_20171005_161531/2211a814b974edbc3defee855046dd8e801393b7ff8154b084b081322167fe90/trainer-0.0.0.tar.gz',
u'trainer-0.0.0.tar.gz']' returned non-zero exit status 1.
It is evident from your error message ("The replica worker 1 exited with a non-zero status of 1. Termination reason: Error.") that you have some programming error (a syntax error, an undefined variable, etc.).
Check the return code table:
Return code | Meaning               | Cloud ML Engine response
0           | Successful completion | Shuts down and releases job resources.
1 - 128     | Unrecoverable error   | Ends the job and logs the error.
Find the bug, fix it, and then try again.
I recommend running your task locally (if your configuration supports it, e.g. with gcloud ml-engine local train) before you submit it to the cloud. If you find any bug, you can fix it easily on your local machine.

Inspecting a stopped docker container on python using inspect_container

I'm writing tests in Python.
I want to make a method that outputs the status of a container (running/exited):
import docker

class Container:
    def __init__(self, name, image, *, command=[], links={}):
        self._docker = docker.DockerClient(base_url='unix://var/run/docker.sock')

    def get_status(self):
        inspection = self._docker.api.inspect_container(self.id)
        return inspection['State']['Status']
This method (get_status) works when the container is running,
but fails when the container is stopped, with this error message:
E docker.errors.NotFound: 404 Client Error: Not Found ("No such container: 2457e5a283e5cb4add4fdb36pb465437b21bb21f768be405fe40615e25442d6e
"docker inspect" cli command works on the instance when it is stopped, but I need to do it through python
any ideas?
You are using an older version of docker-py. Do the following:
pip uninstall docker-py
pip install docker
Then run this code:
import docker
client = docker.client.DockerClient()
container = client.containers.get("2457e5a283e5cb4add4fdb36pb465437b21bb21f768be405fe40615e25442d6e")
Now it should work
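With the newer SDK you can also read the state directly, including for stopped containers. A minimal sketch of get_status, reusing the container ID from the question:

import docker

client = docker.from_env()

def get_status(name_or_id):
    # containers.get() also finds stopped containers, so no more 404.
    container = client.containers.get(name_or_id)
    container.reload()       # refresh the cached state before reading it
    return container.status  # e.g. "running", "exited", "created"

print(get_status("2457e5a283e5cb4add4fdb36pb465437b21bb21f768be405fe40615e25442d6e"))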

Why does Kubernetes apiserver present a bad certificate to the etcd server?

Running Kubernetes on CoreOS on an AWS EC2 instance, I am unable to execute apiserver via a hyperkube Docker container successfully. The problem is that the etcd server refuses connections due to a bad certificate.
What happens is this:
$ docker run -v /etc/ssl/etcd:/etc/ssl/etcd:ro gcr.io/google_containers/hyperkube:v1.1.2 /hyperkube apiserver --bind-address=0.0.0.0 --insecure-bind-address=127.0.0.1 --etcd-servers=https://172.31.29.111:2379 --allow-privileged=true --service-cluster-ip-range=10.3.0.0/24 --secure-port=443 --advertise-address=172.31.29.111 --admission-control=NamespaceLifecycle,NamespaceExists,LimitRanger,SecurityContextDeny,ServiceAccount,ResourceQuota --tls-cert-file=/etc/ssl/etcd/master1-master-client.pem --tls-private-key-file=/etc/ssl/etcd/master1-master-client-key.pem --client-ca-file=/etc/ssl/etcd/ca.pem --kubelet-certificate-authority=/etc/ssl/etcd/ca.pem --kubelet-client-certificate=/etc/ssl/etcd/master1-master-client.pem --kubelet-client-key=/etc/ssl/etcd/master1-master-client-key.pem --kubelet-https=true
I0227 17:07:34.117098 1 plugins.go:71] No cloud provider specified.
I0227 17:07:34.549806 1 master.go:368] Node port range unspecified. Defaulting to 30000-32767.
[restful] 2016/02/27 17:07:34 log.go:30: [restful/swagger] listing is available at https://172.31.29.111:443/swaggerapi/
[restful] 2016/02/27 17:07:34 log.go:30: [restful/swagger] https://172.31.29.111:443/swaggerui/ is mapped to folder /swagger-ui/
E0227 17:07:34.659701 1 cacher.go:149] unexpected ListAndWatch error: pkg/storage/cacher.go:115: Failed to list *api.Pod: 501: All the given peers are not reachable (failed to propose on members [https://172.31.29.111:2379] twice [last error: Get https://172.31.29.111:2379/v2/keys/registry/pods?quorum=false&recursive=true&sorted=true: remote error: bad certificate]) [0]
The certificate should be good though. If I execute an interactive shell within that Docker image, I can get the etcd URL via curl without any issues. So, what is going wrong in this case and how do I fix it?
I found I could solve this by using --etcd-config instead of --etcd-servers:
docker run -p 443:443 -v /etc/kubernetes:/etc/kubernetes:ro -v /etc/ssl/etcd:/etc/ssl/etcd:ro gcr.io/google_containers/hyperkube:v1.1.2 /hyperkube apiserver --bind-address=0.0.0.0 --insecure-bind-address=127.0.0.1 --etcd-config=/etc/kubernetes/etcd.client.conf --allow-privileged=true --service-cluster-ip-range=10.3.0.0/24 --secure-port=443 --advertise-address=172.31.29.111 --admission-control=NamespaceLifecycle,NamespaceExists,LimitRanger,SecurityContextDeny,ServiceAccount,ResourceQuota --kubelet-certificate-authority=/etc/ssl/etcd/ca.pem --kubelet-client-certificate=/etc/ssl/etcd/master1-master-client.pem --kubelet-client-key=/etc/ssl/etcd/master1-master-client-key.pem --client-ca-file=/etc/ssl/etcd/ca.pem --tls-cert-file=/etc/ssl/etcd/master1-master-client.pem --tls-private-key-file=/etc/ssl/etcd/master1-master-client-key.pem
etcd.client.conf:
{
  "cluster": {
    "machines": [ "https://172.31.29.111:2379" ]
  },
  "config": {
    "certFile": "/etc/ssl/etcd/master1-master-client.pem",
    "keyFile": "/etc/ssl/etcd/master1-master-client-key.pem"
  }
}

“bosh deploy” fails, reporting “Fetching package blob: Getting blob from inner blobstore: SHA1 mismatch.”

I am deploying Cloud Foundry on VirtualBox on my MacBook Pro, which has 8 GB of memory. By default, VirtualBox starts a "bosh-lite_default_xxxxxxx" VM with more than 6 GB of base memory and 4 CPU processors, and this setting makes my Mac hang while executing "bosh deploy".
I know the recommended practice is to use a machine with >= 16 GB of memory for a Cloud Foundry deployment, but this is the only machine I have.
So I changed the bosh-lite Vagrantfile as shown below, reducing the base memory to 3 GB and using only 2 CPU processors:
config.vm.provider :virtualbox do |v, override|
  config.vm.synced_folder ".", "/vagrant", mount_options: ["dmode=777"] # ensure any VM user can create files in subfolders - eg, /vagrant/tmp
  override.vm.box_version = '9000.91.0' # ci:replace
  # To use a different IP address for the bosh-lite director, uncomment this line:
  # override.vm.network :private_network, ip: '192.168.59.4', id: :local
  v.memory = 3144
  v.cpus = 2
end
After saving the Vagrantfile and running "vagrant reload", the "bosh-lite_default_xxxxxx" VM still starts successfully.
When I then run "bosh deploy", the machine no longer hangs, but the deploy fails and reports:
Started updating job api_z1 > api_z1/0 (68c148e3-2c89-4f75-86c7-ed0945cd1158). Failed: Action Failed get_task: Task 98721dd2-35f1-49ea-6429-15ec2373a9d2 result: Applying: Applying job cloud_controller_ng: Applying package buildpack_php for job cloud_controller_ng: Fetching package blob: Getting blob from inner blobstore: SHA1 mismatch. Expected 7ef00b2e07b20b07fbf50d133bdaca6ac5164ee4, got b79493b83a7241f685ca027d667898f452e7d592 for blob /var/vcap/data/tmp/bosh-blobstore-externalBlobstore-Get861882783 (00:01:11)
Error 450001: Action Failed get_task: Task 98721dd2-35f1-49ea-6429-15ec2373a9d2 result: Applying: Applying job cloud_controller_ng: Applying package buildpack_php for job cloud_controller_ng: Fetching package blob: Getting blob from inner blobstore: SHA1 mismatch. Expected 7ef00b2e07b20b07fbf50d133bdaca6ac5164ee4, got b79493b83a7241f685ca027d667898f452e7d592 for blob /var/vcap/data/tmp/bosh-blobstore-externalBlobstore-Get861882783
Task 90 error
I googled "Fetching package blob: Getting blob from inner blobstore: SHA1 mismatch" and found someone saying that increasing the microbosh volume from 4 GB to 8 GB resolved this error (see https://groups.google.com/a/cloudfoundry.org/forum/#!topic/bosh-users/Rgx_HFCHenA), but I don't know how to change the microbosh volume.
Can anyone shed some light on this error, and on how to change the microbosh volume?
Thanks much!
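One way to narrow this down is to recompute the SHA1 of the blob the agent downloaded. A minimal sketch, assuming the temporary file from the error message still exists inside the bosh-lite VM:

import hashlib

# Path taken from the error message; it is a temporary file and may be gone.
BLOB_PATH = '/var/vcap/data/tmp/bosh-blobstore-externalBlobstore-Get861882783'

sha1 = hashlib.sha1()
with open(BLOB_PATH, 'rb') as f:
    for chunk in iter(lambda: f.read(1 << 16), b''):
        sha1.update(chunk)

# The deploy expected 7ef00b2e07b20b07fbf50d133bdaca6ac5164ee4.
print(sha1.hexdigest())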

error while running cap deploy:cold

I get the following error after running cap deploy:cold to deploy my Rails app on Amazon EC2:
failed: "/bin/bash -l -c 'cd /mnt/Best-production/releases && tar xzf /tmp/20130305142552.tar.gz && rm /tmp/20130305142552.tar.gz'" on db01.best.com,app01.best.com,web01.best.com
I also get the following error when I just decide to proceed with creating web_tools (ALIAS=tools ROLES=web_tools cap rubber:create):
Rubber[ERROR]: Unable to read rubber configuration from ./config/rubber/rubber.yml
/usr/lib/ruby/1.9.1/psych.rb:203:in `parse': (<unknown>): did not find expected key while parsing a block mapping at line 2 column 1 (Psych::SyntaxError)
Please help, I am stuck here.
I followed the RailsCast at http://railscasts.com/episodes?utf8=%E2%9C%93&search=ec2 to the letter.
Question 2: do I need an operating system instance in EC2 to successfully run Rubber?
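The Psych::SyntaxError above is a plain YAML parse error at line 2, column 1 of config/rubber/rubber.yml, so any YAML parser will flag the same structural problem (typically bad indentation, a stray tab, or an unquoted value containing ": "). A hypothetical quick check with Python's PyYAML:

import yaml

# Path taken from the Rubber error message.
try:
    with open('config/rubber/rubber.yml') as f:
        yaml.safe_load(f)
    print('rubber.yml parses cleanly')
except yaml.YAMLError as err:
    print(err)  # reports the same line/column as Psych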
