Troubleshooting
We all love AppScale, but like all software, it has problems once in a while. This post outlines what to do when you run into a problem with AppScale, how to debug it, and how to fix it. Of course, you can always ask us for help on IRC (#appscale on freenode.net). Let's start with some common problems we've seen people run into and how to get past them, and then look at what to do when the going gets tough.
AppScale runs many processes, each of which takes up memory. If there is not enough memory, the OOM killer will come along and start killing processes, and AppScale will start acting very weird. If AppScale is not working correctly, make sure that you didn't run out of memory. Check '/var/log/kern.log' and '/var/log/syslog' on your AppScale nodes for entries like these:
$ tail /var/log/kern.log
Feb 14 00:10:54 appscale-image0 kernel: [203916.804124] Out of memory: Kill process 28026 (python) score 182 or sacrifice child
Feb 14 00:10:54 appscale-image0 kernel: [203916.804320] Killed process 28036 (python) total-vm:810672kB, anon-rss:550012kB, file-rss:0kB
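When kern.log is long, it can help to pull out only the OOM killer entries. The following is a minimal Python sketch, assuming the kernel log format shown above. The function name find_oom_kills is our own for illustration and is not part of AppScale.

```python
import re

# Matches kernel lines such as:
#   "... Killed process 28036 (python) total-vm:810672kB, anon-rss:550012kB, ..."
# The anon-rss part is optional because not every kernel prints it.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
    r"(?:.*?anon-rss:(?P<rss_kb>\d+)kB)?"
)


def find_oom_kills(lines):
    """Return (pid, process name, anon-rss in kB or None) for each OOM kill."""
    kills = []
    for line in lines:
        match = KILLED_RE.search(line)
        if match:
            rss = match.group("rss_kb")
            kills.append((
                int(match.group("pid")),
                match.group("name"),
                int(rss) if rss else None,
            ))
    return kills
```

On a node you could pass it an open file, for example find_oom_kills(open("/var/log/kern.log", errors="replace")). Reading that file may require root.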
Also check that your nodes have not run out of disk space:
root@appscale-image0:/# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/xvda1 8256952 7784864 52660 100% /
udev 3806468 12 3806456 1% /dev
tmpfs 1525896 212 1525684 1% /run
none 5120 0 5120 0% /run/lock
none 3814732 160 3814572 1% /run/shm
/dev/xvdb 30956028 176196 29207352 1% /mnt
You have some options here to clear up disk space:
- Free up ZooKeeper usage (answered in the FAQ)
- Run the AppScale groomer to do disk garbage collection
- Remove logs or set up log rotation in /var/log/appscale/
- Run Cassandra's nodetool repair/cleanup
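To keep an eye on disk usage without reading df output by hand, here is a minimal Python sketch built on the standard library's shutil.disk_usage. The function names and the 90% default threshold are our own choices for illustration, not AppScale settings. Use% is computed the way df does it, as used / (used + available), so the root filesystem in the df output above comes out at just over 99%, which df rounds up to 100%.

```python
import shutil


def usage_percent(used, available):
    """Use% as df computes it: used / (used + space available to non-root)."""
    denominator = used + available
    return 100.0 * used / denominator if denominator else 0.0


def nearly_full(mount_points, threshold=90.0):
    """Return (mount point, percent used) for mounts at or above threshold."""
    full = []
    for mount in mount_points:
        usage = shutil.disk_usage(mount)  # named tuple: total, used, free
        percent = usage_percent(usage.used, usage.free)
        if percent >= threshold:
            full.append((mount, round(percent, 1)))
    return full
```

For the node shown above, nearly_full(["/", "/mnt"]) would be a reasonable call, since those are the two mount points backed by real disks.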
AppScale uses monit to monitor all the processes on a node. If monit itself gets killed off, it can no longer restart downed processes. Is monit running, or did it get killed off for some reason?
$ ps aux | grep monit
root 25043 0.1 0.0 103532 2796 ? Sl Feb12 3:32 /usr/local/bin/monit
root 28906 0.0 0.0 9396 900 pts/0 S+ 14:57 0:00 grep --color=auto monit
If you run "appscale status" and get "[Errno 111] Connection refused", that generally means the AppController is no longer running. This could be a bug in the AppController, so check the logs. Most commonly, though, it's because the AppController was killed off by the OOM killer.
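You can confirm this by hand. Below is a minimal Python sketch that tries to open a TCP connection to the AppController, assuming it listens on port 17443 as described later in this post. The function name controller_reachable is hypothetical and not part of the AppScale Tools. It only tests whether the port accepts connections and does not speak the AppController's protocol.

```python
import errno
import socket


def controller_reachable(host, port=17443, timeout=5.0):
    """Return (True, None) if a TCP connection succeeds, else (False, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except ConnectionRefusedError:
        # Nothing is listening on the port: the AppController is likely down.
        return False, "connection refused (errno %d)" % errno.ECONNREFUSED
    except OSError as error:
        # Timeouts, unreachable hosts, DNS failures and so on.
        return False, str(error)
```

Call it with the IP of the node you want to check. A "connection refused" reason matches the symptom above, while a timeout points at a network or firewall problem instead.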
To bring the processes back up, just restart monit.
$ service monit start
Starting monit daemon with http interface at [*:2812]
If monit is already running, run the following to see the running processes:
$ monit summary
Monit makes sure that apps don't go over a certain memory limit. You can set this limit in your AppScalefile like so:
max_memory: 600
The number is in megabytes (MB). To see whether monit is indeed restarting your app over and over again, check the monit log at /var/log/appscale/monit.log:
[EST Feb 12 22:40:07] error : 'app___memhungry-20004' total mem amount of 882676kB matches resource limit [total mem amount>512001kB]
[EST Feb 12 22:40:07] info : 'app___memhungry-20004' trying to restart
[EST Feb 12 22:40:07] info : 'app___memhungry-20004' stop: /usr/bin/python
[EST Feb 12 22:40:08] info : 'app___memhungry-20004' start: /bin/bash
Don't set max_memory too high; otherwise you may run out of memory on the nodes hosting your applications.
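To tell a one-off restart from a restart loop, you can count restart attempts per service. The following is a minimal Python sketch, assuming the monit log format shown above. The function name count_restarts is our own for illustration.

```python
import re
from collections import Counter

# Matches monit log lines such as:
#   "[EST Feb 12 22:40:07] info : 'app___memhungry-20004' trying to restart"
RESTART_RE = re.compile(r"'(?P<service>[^']+)' trying to restart")


def count_restarts(lines):
    """Count 'trying to restart' events per monit service name."""
    counts = Counter()
    for line in lines:
        match = RESTART_RE.search(line)
        if match:
            counts[match.group("service")] += 1
    return counts
```

For example, count_restarts(open("/var/log/appscale/monit.log", errors="replace")).most_common(5) lists the five services monit has tried to restart most often.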
Cassandra and ZooKeeper are critical services for data storage. You can see their logs in /var/log/cassandra and /var/log/zookeeper. Monit monitors these services and restarts them if they go down. Sometimes you will run into bugs in Cassandra or ZooKeeper that need manual intervention (monit keeps trying to start them, but they fail on restart repeatedly).
If you suspect or observe slow response times from the Datastore, one or more of the database nodes might be running compactions. To see more details, run:
/root/appscale/AppDB/cassandra/cassandra/bin/nodetool compactionstats
You can also run a stress test on a particular database node to determine latency:
cd /root/appscale/AppDB/cassandra/; python stress.py
Also make sure to test the ZooKeeper node for disk IO latency by running:
echo stat | nc 127.0.0.1 2181
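If nc is not installed, the same check can be sketched in Python. This is a minimal example that sends ZooKeeper's "stat" four-letter command over a plain TCP socket and extracts the "Latency min/avg/max" line from the reply. The function names are our own, and the sketch assumes that your ZooKeeper version answers "stat" and prints that latency line.

```python
import re
import socket


def zookeeper_stat(host="127.0.0.1", port=2181, timeout=5.0):
    """Send the 'stat' four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"stat")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_latency(stat_output):
    """Return (min, avg, max) latency in ms from 'stat' output, or None."""
    match = re.search(
        r"Latency min/avg/max: ([\d.]+)/([\d.]+)/([\d.]+)", stat_output
    )
    return tuple(float(value) for value in match.groups()) if match else None
```

On a ZooKeeper node, parse_latency(zookeeper_stat()) returns the three latency figures. A high average or maximum is a hint that the node is struggling.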
If you have problems logging in or using the Users API, try going to your datastore nodes and killing the user/apps SOAP server (monit will restart it). It has been known to get stuck silently.
ps aux | grep soap_server.py | grep -v grep | awk '{print $2}' | xargs kill -9
If you ran "appscale up" to start AppScale and it didn't start, it could have failed for any of the following reasons:
- (VirtualBox) AppScale hung at "Please wait for AppScale to start your machines."
- (EC2) You're using Spot Instances but AppScale is hung at "Waiting for machines to become available."
- (Eucalyptus) AppScale hung at "Waiting for machines to become available."
Let's look at each of these individually.
When running AppScale on VirtualBox, we've seen problems when VirtualBox 4.1.X is used. Specifically, the AppScale Tools will start up the AppController on port 17443 and then hang at "Please wait for AppScale to start your machines." In this case, the AppScale Tools are waiting for port 17443 to open on the VM, but they can't actually reach the VM, even though it has that port open. Upgrading to VirtualBox 4.2 or newer should fix the problem.
Another common problem is that the wrong IP is given in your AppScalefile. AppScale will complain that the node could not be found in the nodelist. Please check your IPs and try again from a clean state (run 'appscale clean').
If you're using Spot Instances (you've set "use_spot_instances : True" in your AppScalefile), there is a possibility that Amazon won't have any spare machines available at the price and instance type you requested. Typically it takes us about 5 minutes to get a Spot Instance, so if it takes you substantially longer than that (say, 10 minutes), then you can log into the AWS Dashboard, click on EC2, and then click on Spot Instances. There, you can see why your machines aren't available. You can cancel your Spot Instance Request and try again with a higher price or a different instance type, depending on the message the dashboard reports.
When running on Eucalyptus, if there are no virtual machines available, AppScale won't be able to start up. For example, if you tell AppScale to run over 8 machines, and you only have 6 available, then that won't work! In this case, you'll see a message from the tools saying "Spawning 7 virtual machines" (since we spawn one machine and delegate the responsibility of starting up the other 7 to it), and the tools will eventually crash, since the AppController won't be able to get the remaining 7 machines. In this case, the solution is simple - make sure you have enough virtual machines available before you start AppScale! In Eucalyptus, an administrator can find out how many virtual machines are free by running "euca-describe-availability-zones verbose".
If, for some reason, running "appscale down" isn't able to terminate your AppScale deployment, you can bring your VMs back to a pristine state by running:
appscale clean
This script forcefully kills all of the AppScale-related processes.
So you've run into a problem we don't normally see. How do you find out what's going on? For this case, we have a special command you can run. On the machine where you've installed the AppScale Tools, run "appscale logs ~/Desktop/baz" and this will copy all of the logs from each machine in your AppScale deployment to ~/Desktop/baz (of course, change that path if you want your logs copied somewhere else). If this doesn't work for some reason, you can always use "scp" to copy over the contents of the "/var/log/appscale" directory on each machine.
Logs you will find interesting include:
- controller-17443.log: The most interesting log! This log belongs to the AppController, our provisioning daemon. Since it sets up every other service in AppScale, this log can throw exceptions if Cassandra couldn't be started, if the autoscaling algorithm ran into problems, and so on. This is the first place you want to look in if you're having problems with AppScale. You'll find one of these on each machine in an AppScale deployment, since this service runs on all machines.
- app___app_id-*.log: These logs correspond to Google App Engine apps that AppScale is hosting. You'll want to check these out if you're running into problems with your App Engine apps, like if you want to include special libraries that App Engine doesn't normally support or are debugging your application at high load. You'll find one of these for each App Server process that runs on each machine running the "App Engine" role (see which machines are running this service by running "appscale status").
- datastore_server-400*.log: The logs from the implementation of the AppScale datastore.
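Once the logs are collected, a quick first pass is to search them for obvious signs of trouble. The following is a minimal Python sketch. The function name scan_logs and the pattern of "suspicious" words are assumptions made for illustration, and the directory layout that "appscale logs" produces is not assumed: the sketch simply searches the whole tree for files with a matching name.

```python
import re
from pathlib import Path

# Assumed markers of trouble; adjust to what your logs actually contain.
PROBLEM_RE = re.compile(r"exception|traceback|error", re.IGNORECASE)


def scan_logs(log_dir, pattern="controller-17443.log"):
    """Yield (path, line number, line) for suspicious lines in matching logs."""
    for path in sorted(Path(log_dir).expanduser().rglob(pattern)):
        with path.open(errors="replace") as log:
            for number, line in enumerate(log, start=1):
                if PROBLEM_RE.search(line):
                    yield path, number, line.rstrip("\n")
```

For example, iterating over scan_logs("~/Desktop/baz") walks every AppController log that was copied over. Passing pattern="app___*.log" or pattern="datastore_server-400*.log" does the same for the other logs listed above.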