[00:01:32] If you put jankness ontop of jankness I think you need a 'janky' office song [00:03:18] Damianz: To the tune of http://www.youtube.com/watch?v=qObzgUfCl28 [00:03:40] Oh yes! [00:03:55] Janky, Janky, Janky, Janky... [00:05:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:19:05] !log Running refreshFileHeaders.php on all wikis [00:19:13] Logged the message, Master [00:19:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.244 seconds [00:20:44] !log removed some mariadb packages from brewster: precise-wikimedia libmysqlclient18, mariadb-common, mysql-common, all in precise-wikimedia. They were causing dependency problems that broke the install of (for example) generic::mysql::server [00:20:52] Logged the message, Master [00:33:16] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [00:33:16] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [00:33:17] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [00:33:17] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [00:33:17] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [00:33:17] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [00:33:17] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [00:33:18] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [00:37:02] !log kaldari synchronized php-1.21wmf6/extensions/Echo 'Update Echo extension' [00:37:10] Logged the message, Master [00:47:22] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [00:53:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:07:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.436 seconds [01:13:10] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 227 seconds [01:14:40] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 201 seconds [01:14:58] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 214 seconds [01:16:28] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 7 seconds [01:20:35] !log kaldari synchronized php-1.21wmf6/extensions/Echo 'Update Echo extension' [01:20:44] Logged the message, Master [01:42:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:15] !log kaldari synchronized php-1.21wmf6/extensions/Echo 'Update Echo extension' [01:48:23] Logged the message, Master [01:52:35] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [01:52:35] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [01:53:20] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [01:53:56] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [01:55:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.169 seconds [02:08:38] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [02:11:20] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 194 seconds [02:11:47] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 204 seconds [02:12:14] PROBLEM - MySQL Replication Heartbeat on db26 is CRITICAL: CRIT replication delay 192 seconds [02:12:50] PROBLEM - MySQL Slave Delay on db26 is CRITICAL: CRIT replication delay 203 seconds [02:27:34] !log LocalisationUpdate completed (1.21wmf6) at Thu Dec 20 02:27:33 UTC 2012 [02:27:43] Logged the message, Master [02:28:35] RECOVERY - MySQL Slave Delay on db26 is OK: OK replication delay 2 seconds [02:29:02] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 11 seconds [02:29:38] RECOVERY - MySQL Replication Heartbeat on db26 is OK: OK replication delay 6 seconds [02:30:15] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [02:30:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:35:24] New patchset: SPQRobin; "abusefilter-log-detail right from sysop to autoconfirmed" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32681 [02:38:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.576 seconds [02:44:29] PROBLEM - MySQL Slave Delay on db26 is CRITICAL: CRIT replication delay 181 seconds [02:44:57] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 184 seconds [02:45:23] PROBLEM - MySQL Replication Heartbeat on db26 is CRITICAL: CRIT replication delay 206 seconds [02:45:59] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 192 seconds [02:55:26] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 187 seconds [02:56:02] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 193 seconds [03:05:47] PROBLEM - MySQL Replication Heartbeat on db26 is CRITICAL: CRIT replication delay 183 seconds [03:06:50] PROBLEM - MySQL Slave Delay on db26 is CRITICAL: CRIT replication delay 206 seconds [03:08:20] RECOVERY - MySQL Slave Delay on db26 is OK: OK replication delay 0 seconds [03:08:38] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [03:08:56] RECOVERY - MySQL Replication Heartbeat on db26 is OK: OK replication delay 0 seconds [03:09:41] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [03:21:32] RECOVERY - Puppet freshness on erzurumi is OK: puppet ran at Thu Dec 20 03:21:23 UTC 2012 [04:00:32] PROBLEM - MySQL Slave Delay on db26 is CRITICAL: CRIT replication delay 206 seconds [04:00:50] PROBLEM - MySQL Replication Heartbeat on db26 is CRITICAL: CRIT replication delay 215 seconds [04:02:20] RECOVERY - MySQL Replication Heartbeat on db26 is OK: OK replication delay 0 seconds [04:03:05] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [04:03:50] RECOVERY - MySQL Slave Delay on db26 is OK: OK replication delay 0 seconds [04:08:02] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [04:10:35] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [06:15:50] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours [06:18:50] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [06:26:47] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [06:36:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:38:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.894 seconds [07:04:44] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [07:12:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:24:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.353 seconds [07:59:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:11:49] hellooo [08:13:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.623 seconds [08:27:43] hi [08:29:05] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39191 [08:29:51] !log going to deploy Mediawiki config changes related to the multiple datacenter support. [08:30:02] Logged the message, Master [08:31:09] !log hashar synchronized multiversion/MWRealm.php [08:31:17] Logged the message, Master [08:31:27] !log hashar synchronized multiversion/MWRealm.sh [08:31:34] Logged the message, Master [08:31:45] !log hashar synchronized tests/multiversion/MWRealmTest.php [08:31:53] Logged the message, Master [08:47:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:51:07] New patchset: Hashar; "use $wmfRealm in switches instead of legacy $cluster" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39057 [08:51:07] New patchset: Hashar; "Allow per-realm and per-datacenter configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32167 [08:51:07] New patchset: Hashar; "rename some files to use '-labs' instead of '-wmflabs'" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39059 [08:51:08] New patchset: Hashar; "ext file now use realm for consistency" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39058 [08:51:08] New patchset: Hashar; "file included by CS.php now uses -labs instead of -wmflabs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39060 [08:51:47] New review: Hashar; "rebased on top of merged change https://gerrit.wikimedia.org/r/#/c/39191/ which introduces the MWRea..." [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/32167 [09:02:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.999 seconds [09:13:38] New patchset: Hashar; "start requiring MWRealm.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39543 [09:16:57] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39543 [09:19:46] !log hashar synchronized refresh-dblist 'now requires multiversion/MWRealm.php {{gerrit|39543}}' [09:19:55] Logged the message, Master [09:20:23] !log hashar synchronized multiversion/backupWikiversions 'now requires multiversion/MWRealm.php {{gerrit|39543}}' [09:20:32] Logged the message, Master [09:20:48] !log hashar synchronized multiversion/activeMWVersions 'now requires multiversion/MWRealm.php {{gerrit|39543}}' [09:20:57] Logged the message, Master [09:21:16] !log hashar synchronized multiversion/MWMultiVersion.php 'now requires multiversion/MWRealm.php {{gerrit|39543}}' [09:21:24] Logged the message, Master [09:25:15] !g 39057 [09:25:15] https://gerrit.wikimedia.org/r/#q,39057,n,z [09:25:39] New patchset: Hashar; "use $wmfRealm in switches instead of legacy $cluster" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39057 [09:26:29] hashar: why not just run sync-dir multiversion ? [09:26:52] Reedy: newbiness ? :-D [09:27:15] lol [09:27:23] I am doing lot of configuration changes this morning [09:27:37] related to the new wmfRealm / wmfDatacenter settings [09:27:45] testing them on labs first then test.wp.org then deploy [09:30:52] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39057 [09:35:02] !log hashar synchronized live-1.5/robots.php 'use $wmfRealm in switches instead of legacy $cluster {{gerrit|39057}}' [09:35:11] Logged the message, Master [09:35:13] Reedy: also hume has an invalid host key for its IPv6 address. Are you aware of that issue ? [09:35:20] Yeah [09:37:03] wich me luck [09:37:09] I hate deploying nowadays [09:37:25] at worst, just go onto hume and run sync-common afterwards [09:37:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:37:28] I am too afraid about making a mistake and serving blank pages [09:38:10] !log hashar synchronized wmf-config 'use $wmfRealm in switches instead of legacy $cluster {{gerrit|39057}}' [09:38:18] Logged the message, Master [09:38:55] !log ran sync-common on hume [09:39:03] Logged the message, Master [09:42:15] Reedy: so hmm [09:42:18] Reedy: the $cluster is phased out! [09:44:06] New patchset: Hashar; "ext file now use realm for consistency" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39058 [09:45:18] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39058 [09:46:18] Reedy: do we have a way to sync a file deletion? [09:47:34] sync-dir will do it [09:47:45] doing at the level above of course [09:47:53] sync-file missingfile.php doesn't work IIRC [09:48:30] sync-dir!! [09:49:42] !log hashar synchronized wmf-config/ 'ext file now uses realm for consistency {{gerrit|39058}}. That makes CheckUser to load only in production. Checked on test.wp.o' [09:49:50] Logged the message, Master [09:53:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.079 seconds [09:58:24] paravoid: regarding bug 43188 can you post a full command of avconv or convert that hangs, would want to collect some sample files and try to reproduce the issue in a vm to test alternatives [09:59:57] j^: hello :) I have updated puppet conf on deployment-video05 earlier this morning. Seems to still be running [10:00:47] hashar: the bug is regarding imagescalers [10:01:00] video transcoding should be still ok [10:02:19] j^: it was a convert process, not avconv [10:02:31] I mentioned avconv because of the issue we were seeing together before [10:02:36] with multiple threads [10:02:53] so, while we fixed that with -threads 1 [10:03:06] the fact that the processes didn't die but kept stacking up in the servers is a problem [10:03:28] we should contain them better imho [10:03:59] we can probably put a timeout wrapper on them (as opposed to cpu timeout) [10:04:00] mediawiki should kill them when they timeout, not rely on cpu ulimits [10:04:22] yes, that's what I said on the bug report. [10:08:10] will look a bit more into cgroups and see if that could be an option [10:08:54] otherwise some timeout wrapper should be possible to put into the shell script that is currently used [10:09:01] so one approach is cgroups [10:09:20] (and/or lxc) [10:09:33] issue is see is that we would need a group per job [10:09:41] since limits are per group [10:09:42] the other is more robust handling of runaway processes by mediawiki [10:09:44] not per process [10:09:58] ideally I think we should do both [10:10:10] cgroups also help with security containment, like your apparmor work [10:10:35] rihgt now there is a double limit that makes debugging a bit complicated: jobs-loop limits memory for all jobs and each job limits memory [10:10:52] replacing the memory limit in jobs-loop.sh with cgroups might help [10:11:59] nod [10:12:26] you might want to look at lxc-execute too [10:12:37] lxc is basically wrappers around cgroups [10:12:44] and namespaces [10:13:09] anyway, I think someone just needs to rethink this through [10:13:22] I think what we have now obviously works but is fragile [10:13:48] I was debugging an issue this Sun/Mon when someone basically DoSed the cluster accidentally [10:14:06] an animated gif made convert hang and there were tons of them until apache threads were exhausted [10:14:32] during the course of debugging I ran a "ps" and found various hanged convert processes back from November [10:26:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:34:02] paravoid: do you remember what file caused the issues with convert? [10:34:43] j^: see pm [10:34:44] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [10:34:45] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [10:34:45] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [10:34:45] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [10:34:45] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [10:34:45] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [10:34:45] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [10:34:46] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [10:35:37] Change merged: Mark Bergsma; [operations/software] (master) - https://gerrit.wikimedia.org/r/37223 [10:42:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.040 seconds [10:42:53] New patchset: Mark Bergsma; "Handle many more error conditions" [operations/software] (master) - https://gerrit.wikimedia.org/r/37231 [10:43:03] Change merged: Mark Bergsma; [operations/software] (master) - https://gerrit.wikimedia.org/r/37231 [10:43:58] New patchset: Mark Bergsma; "Randomize the order of containers" [operations/software] (master) - https://gerrit.wikimedia.org/r/37405 [10:44:13] Change merged: Mark Bergsma; [operations/software] (master) - https://gerrit.wikimedia.org/r/37405 [10:44:56] New patchset: Mark Bergsma; "Use connection pooling for every Swift operation" [operations/software] (master) - https://gerrit.wikimedia.org/r/37406 [10:45:10] Change merged: Mark Bergsma; [operations/software] (master) - https://gerrit.wikimedia.org/r/37406 [10:45:38] New patchset: Mark Bergsma; "Don't unnecessarily rerequest dst containers" [operations/software] (master) - https://gerrit.wikimedia.org/r/37407 [10:45:46] Change merged: Mark Bergsma; [operations/software] (master) - https://gerrit.wikimedia.org/r/37407 [10:46:09] New patchset: Mark Bergsma; "Get rid of the useless HEAD request on every object creation" [operations/software] (master) - https://gerrit.wikimedia.org/r/37408 [10:46:16] Change merged: Mark Bergsma; [operations/software] (master) - https://gerrit.wikimedia.org/r/37408 [10:46:34] New patchset: Mark Bergsma; "Stop threads from falling asleep" [operations/software] (master) - https://gerrit.wikimedia.org/r/39219 [10:46:41] Change merged: Mark Bergsma; [operations/software] (master) - https://gerrit.wikimedia.org/r/39219 [10:48:41] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [10:56:56] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 193 seconds [10:57:05] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 197 seconds [10:57:14] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [10:59:02] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [10:59:29] PROBLEM - MySQL Replication Heartbeat on db26 is CRITICAL: CRIT replication delay 183 seconds [11:00:50] PROBLEM - MySQL Slave Delay on db26 is CRITICAL: CRIT replication delay 227 seconds [11:02:02] PROBLEM - MySQL Replication Heartbeat on db1041 is CRITICAL: CRIT replication delay 183 seconds [11:02:02] PROBLEM - MySQL Slave Delay on db1024 is CRITICAL: CRIT replication delay 183 seconds [11:02:29] PROBLEM - MySQL Replication Heartbeat on db1024 is CRITICAL: CRIT replication delay 190 seconds [11:02:38] PROBLEM - MySQL Slave Delay on db1041 is CRITICAL: CRIT replication delay 193 seconds [11:02:47] PROBLEM - MySQL Replication Heartbeat on db1028 is CRITICAL: CRIT replication delay 196 seconds [11:15:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:27:14] RECOVERY - MySQL Slave Delay on db1024 is OK: OK replication delay 0 seconds [11:27:15] RECOVERY - MySQL Replication Heartbeat on db1041 is OK: OK replication delay 0 seconds [11:27:41] RECOVERY - MySQL Replication Heartbeat on db1024 is OK: OK replication delay 0 seconds [11:28:08] RECOVERY - MySQL Replication Heartbeat on db1028 is OK: OK replication delay 0 seconds [11:28:08] RECOVERY - MySQL Slave Delay on db1041 is OK: OK replication delay 0 seconds [11:30:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.051 seconds [11:30:30] New patchset: Hashar; "pep8 configuration file" [operations/software] (master) - https://gerrit.wikimedia.org/r/39556 [11:37:50] New review: Hashar; "recheck" [operations/software] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/39556 [11:38:44] mark: got python pep8 to run on 'operations/software.git' ;D [11:39:08] loooot of failures https://integration.mediawiki.org/ci/job/operations-software-pep8/1/violations/? , Jenkins will not verify-1 though [11:39:24] yeah... we're not really following pep8 [11:41:00] mark: we can ignore any checks. Did so to ignore tabs which you are using for indentation. ex: (unmerged) https://gerrit.wikimedia.org/r/#/c/39556/1/.pep8,unified [11:41:26] cool [11:41:34] feel free to merge that change :-] [11:41:50] simply adds a /.pep8 file in the repo which is harmless [11:42:57] i set +2 but it's not merging [11:44:56] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [11:45:05] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [11:46:13] i think chad changed some access permissions and I'm not really sure what his intentions were, so I won't muck around further than I already have [11:48:32] RECOVERY - MySQL Slave Delay on db26 is OK: OK replication delay 0 seconds [11:48:32] RECOVERY - MySQL Replication Heartbeat on db26 is OK: OK replication delay 0 seconds [11:49:51] mark: will follow up with ^demon. Thanks! [11:54:49] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [11:54:50] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [12:00:22] New patchset: Faidon; "Repurpose eqiad ms-fe/ms-be for Ceph" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39560 [12:01:10] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39560 [12:03:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:10:07] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [12:17:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.482 seconds [12:19:43] RECOVERY - Puppet freshness on ms-be1002 is OK: puppet ran at Thu Dec 20 12:19:31 UTC 2012 [12:19:43] RECOVERY - Puppet freshness on ms-be1001 is OK: puppet ran at Thu Dec 20 12:19:31 UTC 2012 [12:23:50] for some reason /etc/ssh/ssh_known_hosts is readable only by root :/ Is that intended ? [12:24:03] when running sync scripts I get asked for fingerprints for all hosts [12:27:13] RECOVERY - Puppet freshness on ms-be1003 is OK: puppet ran at Thu Dec 20 12:27:00 UTC 2012 [12:28:35] paravoid: do you remember if the hanging process had OMP_NUM_THREADS=1 defined? [12:28:48] it did [12:28:52] without that i see memory usage spiking based on the number of cpus [12:29:15] with it the values are 130Mb so not sure why it was hanging at all [12:30:04] it was hanging when spawned by MW, produced an error when I started it by hand (with ulimits) [12:30:09] no idea why it was different though [12:30:46] OMP_NUM_THREADS='1' MAGICK_TMPDIR='/a/magick-tmp' /usr/bin/convert -background white '/tmp/localcopy_5446288d61ab-1.gif' -coalesce -thumbnail '120x120!' -depth 8 -rotate -0 -fuzz 5% -layers optimizeTransparency '/tmp/test.gif' [12:30:55] with ulimit -t 50 -v 409600 -f 102400 [12:31:16] -v is the important part here [12:34:32] ah ok somehow my version was doing input.gif[0] right now [12:34:46] hm? [12:35:39] that just takes the first frame [12:35:57] without it it uses > 400mb [12:36:12] ah [12:38:32] paravoid: hi :-D Running sync I am getting asked to validate the hosts fingerprints. I have noticed /etc/ssh/ssh_known_hosts to be readable only by root, is that expected? [12:39:04] RECOVERY - NTP on ms-be1002 is OK: NTP OK: Offset -0.02027249336 secs [12:39:50] RECOVERY - NTP on ms-be1001 is OK: NTP OK: Offset -0.02079582214 secs [12:46:34] RECOVERY - NTP on ms-be1003 is OK: NTP OK: Offset -0.02089953423 secs [12:49:13] New patchset: Hashar; "ensure /etc/ssh/ssh_known_hosts is world readeable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39563 [12:49:24] I think that ^^^^ should cover it [12:50:46] New patchset: Hashar; "ensure /etc/ssh/ssh_known_hosts is world readeable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39563 [12:51:41] mark: paravoid: could you look at https://gerrit.wikimedia.org/r/39563 that makes /etc/ssh/ssh_known_hosts world readable and would let us sync MediaWiki conf again :) [12:53:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:06:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.958 seconds [13:11:38] lame ass puppet devs [13:13:50] I am not sure why nobody complained before [13:14:08] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39563 [13:21:34] hashar: I know I could view it before (not being any specific date in the past) [13:22:18] reedy: maybe it got deleted and recreated? [13:22:26] Possibly [13:22:32] Stuff happens ;) [13:22:37] waiting for puppet now :-D [13:23:01] !log reconfiguring all Jenkins jobs [13:23:09] Logged the message, Master [13:27:36] mark: ^demon fixed the perm issue on 'operations/software.git' . You should be able to [Submit] https://gerrit.wikimedia.org/r/#/c/39556/ now :-) [13:27:58] Change merged: Mark Bergsma; [operations/software] (master) - https://gerrit.wikimedia.org/r/39556 [13:28:02] \O/ [13:41:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:36] !log hashar synchronized php-1.21wmf6/extensions/WikimediaMaintenance 'Bring in {{gerrit|32169}} : per-realm and per-datacenter configuration' [13:45:45] Logged the message, Master [13:53:23] !log Jenkins: deleting the old MediaWiki-* jobs which used the Gerrit Trigger plugin. [13:53:31] Logged the message, Master [13:57:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.032 seconds [14:01:34] New patchset: Dereckson; "(bug 43297) New setting: $wmgWebFontsEnabledByDefault" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39572 [14:11:36] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [14:20:00] PROBLEM - MySQL Slave Delay on db26 is CRITICAL: CRIT replication delay 186 seconds [14:21:21] PROBLEM - MySQL Replication Heartbeat on db26 is CRITICAL: CRIT replication delay 217 seconds [14:24:58] RECOVERY - MySQL Replication Heartbeat on db26 is OK: OK replication delay 0 seconds [14:25:25] RECOVERY - MySQL Slave Delay on db26 is OK: OK replication delay 0 seconds [14:29:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:34:06] New patchset: MaxSem; "Enable GeoData jobs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39576 [14:42:20] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39576 [14:45:11] New patchset: Dereckson; "(bug 39381) Enable WebFonts on Javanese projects" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39578 [14:45:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [14:46:41] New review: Dereckson; "Please wait I0fbae921 is merged before submit this configuration change." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/39578 [14:49:29] Warning: the RSA host key for 'hume' differs from the key for the IP address '2620:0:860:2:21d:9ff:fe33:f235' [14:49:29] Offending key for IP in /etc/ssh/ssh_known_hosts:5731 [14:49:29] Matching host key in /etc/ssh/ssh_known_hosts:603 [14:50:21] <^demon> We pinged about that yesterday several times, nobody fixed it :( [14:50:31] <^demon> (And it's like the 3rd time this week it's happened...) [14:50:55] !log maxsem synchronized wmf-config 'https://gerrit.wikimedia.org/r/#/c/39576/' [14:51:04] Logged the message, Master [14:55:07] * MaxSem waits for job queue to explode in his face, but it won't explode [15:00:16] maxsem@hume:/home/wikipedia/common/wmf-config$ ssh fluorine [15:00:16] ssh: Could not resolve hostname fluorine: Name or service not known [15:00:21] eh?^^^ [15:00:31] oh, on hume [15:02:34] New patchset: Demon; "Initial commit of plugins we're deploying" [operations/gerrit/plugins] (master) - https://gerrit.wikimedia.org/r/39580 [15:03:13] New review: Demon; "This is a work in progress. We're actually waiting for https://gerrit-review.googlesource.com/#/c/40..." [operations/gerrit/plugins] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/39580 [15:04:38] New patchset: Demon; "Go ahead and ensure the plugins directory for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38798 [15:17:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:23:28] New patchset: Demon; "Install plugins on Gerrit hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38798 [15:34:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.784 seconds [15:56:02] PROBLEM - MySQL Replication Heartbeat on db26 is CRITICAL: CRIT replication delay 202 seconds [15:56:11] PROBLEM - MySQL Slave Delay on db26 is CRITICAL: CRIT replication delay 208 seconds [16:00:41] RECOVERY - MySQL Replication Heartbeat on db26 is OK: OK replication delay 0 seconds [16:00:59] RECOVERY - MySQL Slave Delay on db26 is OK: OK replication delay 0 seconds [16:08:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:16:44] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours [16:19:35] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [16:24:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds [16:27:41] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [16:36:25] ping ^demon, can you have a look at https://gerrit.wikimedia.org/r/#/c/39590/ [16:37:27] <^demon> I don't have the ability to deploy it or anything to gallium. [16:37:45] only hasher has? [16:41:20] <^demon> drdee: I don't know how he's got it setup. It's probably puppetized, but I don't know if merging that change will deploy it automagically or if something needs doing on gallium. [16:43:12] ok [16:56:01] New patchset: MaxSem; "Disable GeoData jobs for now" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39595 [16:56:19] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39595 [16:56:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:58:19] !log maxsem synchronized wmf-config/InitialiseSettings.php 'https://gerrit.wikimedia.org/r/39595' [16:58:27] Logged the message, Master [17:05:38] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [17:09:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.466 seconds [17:15:06] !log Cleaned up job queue from stuck solrUpdate jobs [17:15:16] Logged the message, Master [17:19:50] PROBLEM - SSH on solr1001 is CRITICAL: Connection refused [17:27:47] RECOVERY - SSH on solr1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [17:33:02] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39495 [17:39:47] PROBLEM - NTP on solr1001 is CRITICAL: NTP CRITICAL: Offset unknown [17:41:02] maxsem: solr1001 and solr1003 are ready...still waiting on dell for new main board on solr1002 and network on tampa's servers [17:41:40] cmjohnson1, wonderful! [17:42:32] sbernardin: please update ticket when you have fixed the cables [17:43:11] cmjohnson1: will do [17:45:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:54:02] RECOVERY - NTP on solr1001 is OK: NTP OK: Offset -0.003308773041 secs [17:55:32] !log authdns update adding pappas mgmt to zone file [17:55:40] Logged the message, Master [18:00:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.067 seconds [18:32:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:37:42] hey guys, ganglia seems funky [18:37:44] not loading js and css [18:38:13] looks like jquery stuff is 404ing [18:38:42] blame LeslieCarr ;) [18:38:52] I think a few files went AWOL in the upgrade [18:42:06] ergh, no css dir [18:42:10] and no jquery min file [18:42:42] heya, apergos, since you are on person-to-bug duty :) [18:42:53] who should I poke about ganglia? lesliecarr is not around for the day [18:45:03] notpeter maybe can help? [18:48:02] Might be a good person to try, he was working on it too [18:48:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds [18:53:29] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [18:59:56] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.002 second response time on port 11000 [19:06:18] how many herps could a herp derp herp [19:06:21] ottomata: ^ [19:06:22] I mean [19:06:23] what's up? [19:10:24] ganglia's funky? [19:10:27] maybe it needs a shower [19:10:34] I think it has a mighty neckbeard.... [19:11:23] Where's mutante gone? [19:11:32] he's gone to his desk [19:11:58] Tell him bugzilla is a tarp [19:13:45] Reedy: haven't you yet learned that the best way to communicate with mutante is via rt? [19:13:48] he's all over that shit [19:14:05] heh [19:20:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:21:36] ^demon: hume fingerprint should hopefully maybe be fixed [19:21:43] I don't know why it's so persnickety [19:21:54] but I fucking purged it from the puppet exported resources [19:22:02] and removed from the fingerprints file [19:22:04] <^demon> You just like causing me grief ;-) [19:22:17] I don't know where else the bad one should be at this point [19:22:19] also that [19:23:07] <^demon> Seems to work without the warnings now, thanks. [19:23:15] <^demon> (Although git has magically disappeared...) [19:24:48] ^demon: that's a feature [19:25:05] but srsly, it might all be fucked again if we hit a phare of the moon again [19:25:18] if so I will continue to try to make the warning go away [19:25:21] so let me know :) [19:28:14] <^demon> Is removing git from hume really a feature? [19:29:47] notpeter jaja [19:29:54] ganglia's got too many gangles [19:30:07] 404s on its favorite jquery and css files [19:30:26] phase of the moon? we got intergalactic alignment of the moon the sun and the core of the milky way like once in 27k years or something right now:) [19:31:05] oh cool! [19:31:07] right now? [19:31:29]